The prepare_commit_mutex is pointed out as the main hindrance to accomplishing group commits (which it indeed is), and it has been proposed to remove it with the motivation that FLUSH TABLES WITH READ LOCK can be used to get a good binlog position instead. That is a solution, but not really a good one, as Kristian points out in the last post.
The prepare_commit_mutex
is used to ensure that the
order of transactions in the binary log is the same as the order of
transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work, so
removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of
transactions.
To understand how to solve the problem, it is necessary to take a closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.
The transaction data is stored in a per-thread transaction cache and the transaction size is the size of the data in the transaction cache. In addition, each transaction will have a transaction binlog position (or just transaction position) where the transaction data is written in the binary log.
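As a rough sketch (the structure and field names here are purely illustrative, not the server's actual data structures), the per-thread state described above can be pictured like this:

#include <stddef.h>
#include <sys/types.h>

/* Illustrative sketch of the per-thread transaction state; the real
   server keeps the transaction data in its own cache structures. */
struct Trans_state {
  char   *trans_cache; /* per-thread transaction cache holding the events */
  size_t  trans_size;  /* transaction size: bytes currently in the cache */
  off_t   trans_pos;   /* transaction position: where the data will be
                          written in the binary log */
};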
The procedure can be outlined in the following steps:
1. Prepare InnoDB [ha_prepare]:
   a. Write prepare record to log buffer
   b. fsync() log file to disk (this can currently do group commit)
   c. Take prepare_commit_mutex
2. Log transaction to binary log [TC_LOG_BINLOG::log_xid]:
   a. Lock binary log
   b. Write transaction data to binary log
   c. Sync binary log based on sync_binlog. This forces the binlog to always fsync() (no group commit) due to prepare_commit_mutex
   d. Unlock binary log
3. Commit InnoDB:
   a. Release prepare_commit_mutex
   b. Write commit record to log buffer
   c. Sync log buffer to disk (this can currently do group commit)
   d. InnoDB locks are released
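To make the interleaving concrete, here is a minimal sketch of this commit path in C-like pseudocode. The helper functions (innodb_write_prepare_record() and friends) are made up for illustration; only the ordering and locking mirror the steps above.

/* Sketch of the current XA commit path. All helper functions are
   hypothetical; the ordering and locking mirror the steps above. */
void commit_transaction(THD *thd)
{
  /* 1. Prepare InnoDB (ha_prepare) */
  innodb_write_prepare_record(thd);   /* prepare record into the log buffer */
  innodb_flush_log();                 /* fsync(): can be group committed */
  pthread_mutex_lock(&prepare_commit_mutex);

  /* 2. Log transaction to the binary log (TC_LOG_BINLOG::log_xid) */
  pthread_mutex_lock(&binlog_lock);
  binlog_write_trans_cache(thd);      /* write the transaction data */
  binlog_sync();                      /* governed by sync_binlog; no group
                                         commit while the mutex is held */
  pthread_mutex_unlock(&binlog_lock);

  /* 3. Commit InnoDB */
  pthread_mutex_unlock(&prepare_commit_mutex);
  innodb_write_commit_record(thd);    /* commit record into the log buffer */
  innodb_flush_log();                 /* fsync(): can be group committed */
  innodb_release_locks(thd);          /* row and table locks released last */
}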
Two problems with this procedure stand out:
- The InnoDB row-level and table-level locks are released very late in the sequence, which affects concurrency. Ideally, we need to release the locks very early, preferably as soon as we have prepared InnoDB.
- It is not possible to perform a group commit of the binary log in step 2, and if the prepare_commit_mutex is simply removed, it is possible for another thread to overtake a transaction so that the prepare and the write to the binary log are done in different orders.

To solve this, Mark suggests using a queue or a ticket system to ensure that transactions are committed in the same order, but we actually already have such a system that we can use to assign tickets: namely the binary log.
The idea is to allocate space in the binary log for the transaction to be written. This gives us a sequence number that we can use to order the transactions.
In the worklog on binary log group commits you will find the complete description as well as the status of the evolving work.
In this post, I will outline an approach that Harrison and I have discussed, which we think will solve the problems mentioned above. Here I describe the procedure during normal operations; in the next post I will discuss recovery, and in the third post (but likely not the last on the subject), I will discuss some optimizations that can be done.
I want to emphasize that the fact that we have a worklog does not imply any guarantees or promises about what, when, or even whether any patches will be pushed to any release of MySQL.
In Worklog #4007, an approach for writing the binary log is suggested where space is allocated for the transaction in the binary log before actually starting to write it. In addition to avoiding unnecessary locking of the binary log, it also allows us to use the binary log to order the transactions in place. We will use this idea of reserving space in the binary log to implement the binary log group commit.
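As a sketch of what the reservation itself might look like (the variable and function names are mine, not taken from the worklog), reserving space amounts to advancing a shared position under the binary log lock:

#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

static pthread_mutex_t binlog_lock = PTHREAD_MUTEX_INITIALIZER;
static off_t Next_Available;       /* first free position in the binary log */

/* Reserve 'size' bytes in the binary log and return the position where
   the transaction data shall be written. The returned position also
   serves as the "ticket" that orders the transactions. */
static off_t reserve_binlog_space(size_t size)
{
  pthread_mutex_lock(&binlog_lock);
  off_t trans_pos = Next_Available;
  Next_Available += size;
  pthread_mutex_unlock(&binlog_lock);
  return trans_pos;
}

In the procedures below, the reservation (steps 1, 2, and 4) is done under the same lock that also covers the InnoDB prepare, but the principle is the same.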
By re-structuring the procedure above slightly, we can ensure that the transactions are written in the same order in both the InnoDB transaction log and the binary log.
There are two ways to re-structure the code: one simple, and one more complicated that can potentially yield better performance. To simplify the presentation, it is assumed that pre-allocation is handled elsewhere, for example using Worklog #4925. In a real implementation, pre-allocation can be handled either when a new binlog file is created or when transaction data is being written to the binary log.
The sequential write approach

In this approach, each transaction is written to the binary log only after all preceding transactions have been written, so the transaction data is appended to the log strictly in order, and the binary log is then flushed to disk with fsync(), which can be group committed.
To describe the algorithm, three shared variables are introduced to keep track of the status of replication:

- Next_Available: keeps track of the position where a new transaction can be written.
- Last_Committed: keeps track of the last committed transaction, meaning that all transactions preceding this position are actually on disk. This variable is not necessary in a real implementation, but it is kept here to simplify the presentation of the algorithm.
- Last_Complete: keeps track of the last complete transaction. All transactions preceding this point are written to the binary log, but not necessarily flushed to disk yet.

The variables always satisfy Last_Committed ≤ Last_Complete ≤ Next_Available.
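In code, the shared state and the invariant could be expressed roughly like this (again a sketch; the real implementation would keep these inside the binary log object and protect them with its lock):

#include <assert.h>
#include <sys/types.h>

/* Shared positions in the binary log, protected by the binary log lock. */
static off_t Last_Committed;   /* everything before this is fsync()ed to disk */
static off_t Last_Complete;    /* everything before this is written to the log */
static off_t Next_Available;   /* position where the next transaction goes */

/* The invariant that always holds between the three positions. */
static void check_binlog_positions(void)
{
  assert(Last_Committed <= Last_Complete);
  assert(Last_Complete  <= Next_Available);
}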
The procedure can be described in the following steps:
1. Lock the binary log.
2. Save the value of Next_Available in a local variable Trans_Pos and increase Next_Available with the size of the transaction.
3. Prepare InnoDB:
   a. Write prepare record to log buffer (but do not fsync() the buffer here)
   b. Release row locks
4. Unlock the binary log.
5. Post-prepare InnoDB:
   a. fsync() the log file to disk, which can now be done using group commit since no mutex is held.
6. Log transaction to binary log:
   a. Wait until Last_Complete = Trans_Pos. (This can be implemented using a condition variable and a mutex.)
   b. Write transaction data to the binary log using pwrite. At this point it is not really necessary to use pwrite since the transaction data is simply appended, but it will be needed in the second algorithm, so we introduce it here.
   c. Update Last_Complete to Trans_Pos + transaction size.
   d. Broadcast the new position to all waiting threads to wake them up.
   e. Call fsync() to persist the binary log on disk. This can now be group committed.
7. Commit InnoDB:
   a. Write commit record to log buffer
   b. Sync log buffer to disk, which currently can be group committed.
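Steps 6a through 6d could look roughly as follows. This is only a sketch: binlog_turn (the condition variable used for the ordering wait), binlog_fd, trans_cache, and trans_size are names I have made up, and error handling is omitted.

/* Steps 6a-6d of the sequential write approach (sketch only). The write
   is done while holding the lock for simplicity; in this approach the
   writes are serialized anyway. */
pthread_mutex_lock(&binlog_lock);

/* 6a. Wait until all preceding transactions have been written. */
while (Last_Complete != Trans_Pos)
  pthread_cond_wait(&binlog_turn, &binlog_lock);

/* 6b. Write the transaction cache at the reserved position. */
pwrite(binlog_fd, trans_cache, trans_size, Trans_Pos);

/* 6c. Mark our transaction as complete. */
Last_Complete = Trans_Pos + trans_size;

/* 6d. Wake up the threads waiting for their turn (and for step 6e). */
pthread_cond_broadcast(&binlog_turn);

pthread_mutex_unlock(&binlog_lock);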
In step 6e, each thread waits for a short time to see whether some other thread performs the fsync() for it. If the wait times out, the thread performs the fsync() itself, after which it broadcasts the fact that data has been flushed to disk to the other waiting threads so that they can skip the flush. Typically, the code looks something along these lines (we ignore checking error codes here to simplify the description):
/* binlog_fd is the file descriptor of the binary log (hypothetical name). */
pthread_mutex_lock(&binlog_lock);
/* Wait while there is transaction data written to the binary log that is
   not yet flushed to disk (this includes our own transaction). */
while (Last_Committed < Last_Complete) {
  struct timespec timeout;
  clock_gettime(CLOCK_REALTIME, &timeout);
  timeout.tv_nsec += 1000000;                /* 1 msec */
  if (timeout.tv_nsec >= 1000000000L) {
    timeout.tv_sec  += 1;
    timeout.tv_nsec -= 1000000000L;
  }
  int error = pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    fsync(binlog_fd);                        /* flush the binary log to disk */
    Last_Committed = Last_Complete;          /* everything written is now on disk */
    pthread_cond_broadcast(&binlog_flush);   /* let the other waiters skip the flush */
  }
}
pthread_mutex_unlock(&binlog_lock);

There are a few observations regarding this approach:
- Step 6a requires a condition variable and a mutex when waiting for Last_Complete to reach Trans_Pos. Since there is just a single condition variable, it is necessary to broadcast a wakeup to all waiting threads, each of which will evaluate the condition only to find that a single thread should continue while the other threads go back to sleep. This means that the condition will be checked O(N²) times to commit N transactions. This is a waste of resources, especially if there are a lot of threads waiting, and if we can avoid it, we can gain performance.
- Since the thread already has a position in the binary log where it can write, it could just as well start writing instead of waiting. It will not interfere with any other thread, regardless of whether locks are kept or not.
A parallel write approach

In the parallel write approach, each thread writes its transaction data directly to its reserved place in the binary log using pwrite, since the space for the transaction data has already been allocated when preparing the engines. Figure 2 illustrates how the binary log is filled in (grey areas) by multiple threads at the same time. As in the sequential write approach, we still have the Last_Complete, Last_Committed, and Next_Available variables.

Each thread does not have to wait for other threads before writing, but it does have to wait for the preceding transactions to complete before it can commit. This is necessary since we require the order of commits in the InnoDB log and the binary log to be the same. In reality, this does not pose a problem, since the I/O is buffered and the writes therefore go to in-memory file buffers.
The algorithm looks quite similar to the sequential write approach, but notice that in step 6 the transaction data is simply written to the binary log using pwrite, without first waiting for the preceding transactions.
1. Lock the binary log.
2. Save the value of Next_Available in a local variable Trans_Pos and increase Next_Available with the size of the transaction.
3. Prepare InnoDB:
   a. Write prepare record to log buffer (but do not fsync() the buffer here)
   b. Release row locks
4. Unlock the binary log.
5. Post-prepare InnoDB:
   a. fsync() the log file to disk, which can now be done using group commit since no mutex is held.
6. Log transaction to binary log:
   a. Write transaction data to the binary log using pwrite. There is no need to keep a lock to protect the binary log here since all threads write to different positions.
   b. Wait until Last_Complete = Trans_Pos.
   c. Update Last_Complete to Trans_Pos + transaction size.
   d. Broadcast the new position to all waiting threads to wake them up.
   e. Call fsync() to persist the binary log on disk. This can now be group committed.
7. Commit InnoDB:
   a. Write commit record to the log buffer
   b. Sync the log file to disk
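For comparison, here is the corresponding sketch of step 6 in the parallel write approach: the pwrite is now done first, outside the lock, and only the bookkeeping is done under the lock (same made-up names as in the earlier sketch, error handling omitted).

/* Step 6a: write at the reserved position without any lock; every
   thread writes to a different part of the file. */
pwrite(binlog_fd, trans_cache, trans_size, Trans_Pos);

pthread_mutex_lock(&binlog_lock);

/* Step 6b: wait until all preceding transactions are complete. */
while (Last_Complete != Trans_Pos)
  pthread_cond_wait(&binlog_turn, &binlog_lock);

/* Step 6c: mark our transaction as complete. */
Last_Complete = Trans_Pos + trans_size;

/* Step 6d: wake up the threads waiting for their turn. */
pthread_cond_broadcast(&binlog_turn);

pthread_mutex_unlock(&binlog_lock);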
A few observations regarding this approach:

- When a transaction is committed, it is guaranteed that Trans_Pos ≥ Last_Committed for all threads (recall that Trans_Pos is a thread-local variable).
- Writes are done in parallel, but waiting for the condition in step 6b still requires a broadcast that wakes up all waiting threads, while only one of them will be allowed to proceed. This means that we still have the O(N²) behavior of the sequential algorithm. For the parallel algorithm, however, it is possible to improve performance significantly, which we will demonstrate in the third part, where we discuss optimizations to the algorithms.
- Recovery in the sequential algorithm is comparatively simple since there are no partially written transactions. If a crash occurs in the situation described in Figure 2, however, it is necessary to devise a method for recovering correctly. This we will discuss in the second part of these posts.