prepare_commit_mutex
as the main hindrance to accomplish group commits—which it indeed is—and proposes to remove it with the motivation that FLUSH TABLES WITH READ LOCK
can be used to get a good binlog position instead. That is a solution—but not really a good solution—as Kristian points out in the last post.
The prepare_commit_mutex
is used to ensure that the
order of transactions in the binary log is the same as the order of
transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work, so
removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of
transactions.
To understand how to solve the problem, it is necessary to take a closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.
The transaction data is stored in a per-thread transaction cache and the transaction size is the size of the data in the transaction cache. In addition, each transaction will have a transaction binlog position (or just transaction position) where the transaction data is written in the binary log.
The procedure can be outlined in the following steps:
- Prepare InnoDB [
ha_prepare
]: - Write prepare record to log buffer
fsync()
log file to disk (this can currently do group commit)- Take
prepare_commit_mutex
- Log transaction to binary log [
TC_LOG_BINLOG::log_xid
]: - Lock binary log
- Write transaction data to binary log
- Sync binary log based on
sync_binlog
. This forces the binlog to alwaysfsync()
(no group commit) due toprepare_commit_mutex
- Unlock binary log
- Commit InnoDB:
- Release
prepare_commit_mutex
- Write commit record to log buffer
- Sync log buffer to disk (this can currently do group commit)
- InnoDB locks are released
- The InnoDB row level and table level locks are released very late in the sequence, which affects concurrency. Ideally, we need to release the locks very early, preferably as soon as we have prepared InnoDB.
- It is not possible to perform a group commit in step 2
prepare_commit_mutex
is removed it
is possible for another thread to overtake a transaction so that the
prepare and writing to the binary log is done in different order.To solve this, Mark suggests using a queue or a ticket system to ensure that transactions are committed in the same order, but we actually already have such a system that we can use to assign tickets: namely the binary log.
The idea is to allocate space in the binary log for the transaction to be written. This gives us a sequence number that we can use to order the transactions.
In the worklog on binary log group commits you will find the complete description as well as the status of the evolving work.
In this post, I will outline an approach that Harrison and I have discussed, which we think will solve the problems mentioned above. In this post, I will outline the procedure during normal operations, in the next post I will discuss recovery, and in the third post (but likely not the last on the subject), I will discuss some optimizations that can be done.
I want to emphasize that the fact that we have a worklog does not involve any guarantees or promises of what, when, or even if any patches will be pushed to any release of MySQL.
In Worklog #4007 an approach for writing the binary log is suggested where space is allocated for the transaction in the binary log before actually starting to write it. In addition to avoiding unnecessary locking of the binary log, it also allow us to use the binary log to order the transactions in-place. We will use this idea of reserving space in the binary log to implement the binary log group commit.
By re-structuring the procedure above slightly, we can ensure that the transactions are written in the same order in both the InnoDB transaction log and the binary log.
There are two ways to re-structure the code: one simple and one more complicated that potentially can render better performance. To simplify the presentation, it is assumed that pre-allocation is handled elsewhere, for example using Worklog #4925. In a real implementation, pre-allocation can either be handled when a new binlog file is created, or when transaction data is being written to the binary log.
The sequential write approach
fsync()
.
To describe the algorithm, three shared variables are introduced to
keep track of the status of replication:
Next_Available
- This variable keeps track of where a new transaction can be written
Last_Committed
- This variable keeps track of the last committed transaction, meaning that all transactions preceding this position is actually on disc. This variable is not necessary in the real implementation, but it is kept here to simplify the presentation of the algorithm.
Last_Complete
- This variable keeps track of the last complete transaction. All transactions preceding this point is actually written to the binary log, but are not necessarily flushed to disc yet.
Last_Committed
≤
Last_Complete
≤ Next_Available
.
The procedure can be described in the following steps:
- Lock the binary log
- Save value of
Next_Available
in a variable Trans_Pos and increaseNext_Available
with the size of the transaction. - Prepare InnoDB:
- Write prepare record to log buffer (but do not
fsync()
buffer here) - Release row locks
- Unlock binary log
- Post prepare InnoDB:
fsync()
log file to disk, which can now be done using group commit since no mutex is held.- Log transaction to binary log:
- Wait until
Last_Complete
=Trans_Pos
. (This can be implemented using a condition variable and a mutex.) - Write transaction data to binary log using
pwrite
. At this point, it is not really necessary to usepwrite
since the transaction data is simply appended, but it will be used in the second algorithm, so we introduce it here. - Update
Last_Complete
toTrans_Pos
+ transaction size. - Broadcast the the new position to all waiting threads to wake them up.
- Call
fsync()
to persist binary log on disk. This can now be group committed. - Commit InnoDB:
- Write commit record to log buffer
- Sync log buffer to disk, which currently can be group committed.
fsync()
, after
which it broadcasts the fact that data has been flushed to disc to
other waiting threads so that they can skip this. Typically, the code
looks something along these lines (we ignore checking error codes here
to simplify the description):
pthread_mutex_lock(&binlog_lock); while (Last_Complete ≥ Last_Committed) { struct timespec timeout; gettimeofday(&timeout, NULL); timeout.tv_usec += 1000; /* 1 msec */ int error= pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout); if (error == ETIMEDOUT) { fsync(&binlog_file); Last_Committed = Last_Complete; pthread_cond_broadcast(&binlog_flush); } } pthread_mutex_unlock(&binlog_lock);There are a few observations regarding this approach:
- Step 6a requires a condition variable and a mutex when waiting
for
Last_Complete
to reachTrans_Pos
. Since there is just a single condition variable, it is necessary to broadcast a wakeup to all waiting threads, which each will evaluate the condition just to find a single thread that should continue, while the other threads go to sleep again.This means that the condition will be checked O(N2) times to commit N transactions. This is a waste of resources, especially if there is a lot of threads waiting, and if we can avoid this, we can gain performance.
- Since the thread has a good position in the binary log where it could write, it could just as well start writing instead of waiting. It will not interfere with any other threads, regardless if locks are kept or not.
A parallel write approach
pwrite
since the space for the
transaction data has already been allocated when preparing the
engines. Figure 2 illustrates how the binary log is filled in (grey
areas) by multiple threads at the same time. Similar to the
sequential write approach, we still have the
Last_Complete
, Last_Committed
, and
Next_Available
variables.Each thread does not have to wait for other threads before writing, but it does have to wait for the other threads to commit. This is necessary since we required the order of commits in the InnoDB log and the binary log to be the same. In reality, this does not pose a problem since the I/O is buffered, hence the writes are done to in-memory file buffers.
The algorithms look quite similar to the sequential write approach,
but notice that in step 6, the transaction data is simply written to
the binary log using pwrite
.
- Lock the binary log
- Save value of
Next_Available
in a local variableTrans_Pos
and increaseNext_Available
with the size of the transaction. - Prepare InnoDB:
- Write prepare record to log buffer (but do not
fsync()
buffer here) - Release row locks
- Unlock binary log
- Post prepare InnoDB:
fsync()
log file to disk, which can now be done using group commit since no mutex is held.- Log transaction to binary log:
- Write transaction data to binary log using
pwrite
. There is no need to keep a lock to protect the binary log here since all threads will write to different positions. - Wait until
Last_Complete
=Trans_Pos
. - Update
Last_Complete
toTrans_Pos
+ transaction size. - Broadcast the the new position to all waiting threads to wake them up.
- Call
fsync()
to persist binary log on disk. This can now be group committed. - Commit InnoDB:
- Write commit record to log
- Sync log file to disk
- When a transaction is committed, it is guaranteed that
Trans_Pos
≥Last_Committed
for all threads (recall thatTrans_Pos
is a thread-local variable). - Writes are done in parallel, but when waiting for the condition in step 6b still requires a broadcast to wake up all waiting threads, while only one will be allowed to proceed. This means that we still have the O(N2) complexity of the sequential algorithm. However, for the parallel algorithm it is possible to improve the performance significantly, which we will demonstrate in the third part where we will discuss optimizations to the algorithms.
- Recovery in the sequential algorithm is comparably simple since there are no partially written transactions. If you consider that a crash can occur in the situation described in Figure 2, it is necessary to device a method for correctly recovering. This we will discuss in the second part of these posts.
8 comments:
Looks like the usage of binlog for replication just causes a lot of problems and performance issues.
Basically you have 1 database server but yet are forced to deal with distributed transaction, which will always be slow.
What about just getting rid of the binlog altogether? InnoDB already has its own log. I don't see any point in having the binlog at all.
Unfortunately, InnoDB does not store the DDL in it's log, nor changes done in other engines.
Although there is some cost to writing the binary log, it is possible to reduce the impact of writing to the binary log. Since little time has been invested in improving the performance of writing the binary log, there is potential of improving performance here.
Note that the biggest (and mostly mentioned) performance issue is not the binary log (on the master), it is applying the changes on the slave.
Hi Mats, great article!
I have a couple of questions:
1. In your description you release InnoDB row locks in InnoDB prepare. I am
wondering if this is safe (and if so, why it is safe)?
One problem is that changes done in a transaction are only made visible to
other transactions in InnoDB commit. So what if another transaction does
UPDATE ... SET a=a+1 on the same row, how will it be able to set the correct
value if it does not wait until the first transaction commits and makes its
changes visible?
Or are changes made visible to other transactions in InnoDB prepare when the
row locks are released? This seems to create another possible problem in case
the transaction is rolled back instead of committed (which can still happen
after InnoDB prepare succeeds if binlog commit (or another engine commit)
fails). In that case, other transactions may get to see changes that were
never committed?
(But I am really interested in understanding how to safely release row locks
early).
2. InnoDB commit order is determined by the order in which InnoDB commit is
called, not by the order in which InnoDB prepare is called (I'm pretty sure
...). So it seems to me your algorithm does not guarantee same commit order in
binlog and InnoDB ... ? You get InnoDB to prepare in the same order as binlog,
but not commit.
So it seems we can still get the problem that for example the binlog has
transaction A followed by B (and A was prepared before B), but InnoDB has
committed transaction B first (and not yet committed transaction A). And at
this point InnoDB hotbackup snapshots the system, resulting in a state, with B
committed and A not, which does not correspond to any binlog position?
Or am I missing something?
Hi Kristian,
For question 1, when an engine does a prepare, it assures that the transaction commit can and will succeed. So after all engines have done the prepare, then the commit should always work and hence, in theory, the row level locks can be released.
Really, the lock releasing should wait until after all engines involved have prepared (not just InnoDB if others are involved).
If the commit is failing for some reason, then the two phase commit protocol is being violated and things go wrong for several reasons (such as transactions may be committed in some resources but not others).
Practically speaking commit can fail for several reasons, such as disks dying, disk full, etc... The only problem that would occur from these set of failures is if the commits could fail selectively, such as transaction A updates a row, transaction B uses the new value due to the early lock release, but A failed to commit and B works. The above failures possibilities are general failures and will effect all transactions equally and hence aren't a problem.
2. In our new solution, it would be necessary to have the binary logs on hand (at least the active one) as well for using any snapshot technology. They wouldn't necessarily be needed to be snapshotted at the exact same time, just copied after the innodb/log files.
A transaction is committed when it is written to the binary log. On crash recovery, the prepared transactions get committed and hence InnoDB could still output the latest coordinates (with a little code to determine latest as each commits).
In theory, it may be possible to do the recovery without the binary logs, but it would require InnoDB finishing commits based on the order of the transactions in the prepare phase, which I'm not 100% sure on it and am not entirely sure how safe it would be.
Step 6a doesn't require O(N*N) wakeups. A simple implementation of step 6a does. A better implementation can use an array of condvars and broadcast on condvars[next_position % sizeof(condvars)]
For more on early release of row locks see http://bazaar.launchpad.net/~mysqlatfacebook/mysqlatfacebook/5.1/revision/3545
Doesn't the parallel write algorithm still have an O(N*N) step in 6d as broadcast is done there. And the optimization for the sequential write approach can also be used here.
Hi Mark,
Yes, it is possible to reduce the number of wakeups by introducing an array of condition variables like you propose.
The broadcast does indeed wake up all threads, but normally all of the threads are able to proceed, so each broadcast typically wakes up O(N) threads, giving a complexity of O(N^2) / O(N) = O(N).
Post a Comment