Friday, April 30, 2010

Binary Log Group Commit - An Implementation Proposal

It is with interest that I read Kristian's three blog posts on binary log group commit. In these posts, he identifies InnoDB's prepare_commit_mutex as the main obstacle to accomplishing group commit, which it indeed is, and proposes to remove it, with the motivation that FLUSH TABLES WITH READ LOCK can then be used to get a good binlog position instead. That is a solution, but not really a good one, as Kristian himself points out in the last post.

The prepare_commit_mutex is used to ensure that the order of transactions in the binary log is the same as the order of transactions in the InnoDB log. As Kristian notes in his third article, keeping the same order in the two logs is critical for getting a true on-line backup to work, so simply removing the mutex is not an option: it is still necessary to ensure that the InnoDB transaction log and the binary log have the same order of transactions.

To understand how to solve the problem, it is necessary to take a closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.

The transaction data is stored in a per-thread transaction cache and the transaction size is the size of the data in the transaction cache. In addition, each transaction will have a transaction binlog position (or just transaction position) where the transaction data is written in the binary log.
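
In other words, each session carries state roughly like the following (a hypothetical sketch; these are not the actual server structures):

#include <stdint.h>

/* Per-session transaction state implied by the description above. */
struct Transaction {
  char    *cache;       /* the per-thread transaction cache */
  uint64_t cache_size;  /* the transaction size: bytes of data in the cache */
  uint64_t binlog_pos;  /* the transaction position, i.e., where the
                           transaction data is written in the binary log */
};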

The procedure can be outlined in the following steps (a code sketch follows the steps):

  1. Prepare InnoDB [ha_prepare]:
    1. Write prepare record to log buffer
    2. fsync() log file to disk (this can currently do group commit)
    3. Take prepare_commit_mutex
  2. Log transaction to binary log [TC_LOG_BINLOG::log_xid]:
    1. Lock binary log
    2. Write transaction data to binary log
    3. Sync binary log based on sync_binlog. Since the prepare_commit_mutex is held, only one transaction at a time can reach this point, which forces the binlog to always fsync() a single transaction (no group commit)
    4. Unlock binary log
  3. Commit InnoDB:
    1. Release prepare_commit_mutex
    2. Write commit record to log buffer
    3. Sync log buffer to disk (this can currently do group commit)
    4. InnoDB locks are released
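
To make it concrete why the fsync() in step 2c can never be shared between transactions, here is a minimal C sketch of this commit path. All names here are simplified placeholders for the steps above, not the actual server functions:

#include <pthread.h>
#include <unistd.h>

typedef struct THD THD;   /* opaque session handle; a placeholder type */

static pthread_mutex_t prepare_commit_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t binlog_lock = PTHREAD_MUTEX_INITIALIZER;
static int binlog_fd;     /* binary log file descriptor */

/* Placeholders for the engine calls described in the steps above. */
void innodb_prepare(THD *thd);   /* steps 1a-1b: write and fsync() prepare record */
void binlog_write(THD *thd);     /* step 2b: append the transaction cache */
void innodb_commit(THD *thd);    /* steps 3b-3d: write and fsync() commit record */

void xa_commit(THD *thd)
{
  innodb_prepare(thd);
  /* Step 1c: the mutex is held over the binary log write and its fsync(),
     so only one transaction at a time can pass through step 2. */
  pthread_mutex_lock(&prepare_commit_mutex);

  pthread_mutex_lock(&binlog_lock);             /* step 2a */
  binlog_write(thd);                            /* step 2b */
  fsync(binlog_fd);                             /* step 2c: one fsync() per transaction */
  pthread_mutex_unlock(&binlog_lock);           /* step 2d */

  pthread_mutex_unlock(&prepare_commit_mutex);  /* step 3a */
  innodb_commit(thd);                           /* steps 3b-3d */
}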
There are mainly two problems with this approach:
  • The InnoDB row level and table level locks are released very late in the sequence, which affects concurrency. Ideally, we need to release the locks very early, preferably as soon as we have prepared InnoDB.
  • It is not possible to perform a group commit in step 2
As you can see here, the prepare of the storage engines (in this case just InnoDB) is done before the binary log mutex is taken, which means that if the prepare_commit_mutex is removed, one thread can overtake another so that the prepares and the writes to the binary log are done in different orders.

To solve this, Mark suggests using a queue or a ticket system to ensure that transactions are committed in the same order, but we actually already have such a system that we can use to assign tickets: namely the binary log.

The idea is to allocate space in the binary log for the transaction to be written. This gives us a sequence number that we can use to order the transactions.

In the worklog on binary log group commits you will find the complete description as well as the status of the evolving work.

In this post, I will outline an approach that Harrison and I have discussed, which we think will solve the problems mentioned above. In this post, I will outline the procedure during normal operations, in the next post I will discuss recovery, and in the third post (but likely not the last on the subject), I will discuss some optimizations that can be done.

I want to emphasize that the fact that we have a worklog does not involve any guarantees or promises of what, when, or even if any patches will be pushed to any release of MySQL.

In Worklog #4007, an approach for writing the binary log is suggested where space is allocated for the transaction in the binary log before the write actually starts. In addition to avoiding unnecessary locking of the binary log, this also allows us to use the binary log to order the transactions in place. We will use this idea of reserving space in the binary log to implement the binary log group commit.
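
As a rough illustration of the reservation idea (the names below are mine, not taken from the worklog), reserving space amounts to advancing a shared offset inside a short critical section:

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t binlog_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t Next_Available;   /* offset where the next transaction goes */

/* Reserve trans_size bytes in the binary log. The returned offset tells
   the session where to write its transaction data and, at the same time,
   totally orders the transactions. */
uint64_t binlog_reserve(uint64_t trans_size)
{
  pthread_mutex_lock(&binlog_lock);
  uint64_t trans_pos = Next_Available;
  Next_Available += trans_size;
  pthread_mutex_unlock(&binlog_lock);
  return trans_pos;
}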

By re-structuring the procedure above slightly, we can ensure that the transactions are written in the same order in both the InnoDB transaction log and the binary log.

There are two ways to re-structure the code: one simple, and one more complicated that can potentially yield better performance. To simplify the presentation, it is assumed that pre-allocation is handled elsewhere, for example using Worklog #4925. In a real implementation, pre-allocation can be handled either when a new binlog file is created or when transaction data is being written to the binary log.

The sequential write approach

Figure 1. Sequential binary log group commit
In the sequential write approach, the transactions are still written to the binary log in order, and the code is just re-ordered to avoid holding mutexes while calling fsync(). To describe the algorithm, three shared variables are introduced to keep track of the status of replication:
Next_Available
This variable keeps track of where a new transaction can be written
Last_Committed
This variable keeps track of the last committed transaction, meaning that all transactions preceding this position are actually on disk. This variable is not necessary in the real implementation, but it is kept here to simplify the presentation of the algorithm.
Last_Complete
This variable keeps track of the last complete transaction. All transactions preceding this point are actually written to the binary log, but are not necessarily flushed to disk yet.
You can see an illustration of how the variables are used with the binary log in Figure 1, where you can also see three threads, each waiting to write a transaction. All three variables are initially set to the beginning of the binary log, and it is always true that Last_Committed ≤ Last_Complete ≤ Next_Available. The procedure can be described in the following steps:
  1. Lock the binary log
  2. Save the value of Next_Available in a variable Trans_Pos and increase Next_Available by the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Wait until Last_Complete = Trans_Pos. (This can be implemented using a condition variable and a mutex; see the sketch after the steps.)
    2. Write transaction data to binary log using pwrite. At this point, it is not really necessary to use pwrite since the transaction data is simply appended, but it will be used in the second algorithm, so we introduce it here.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log buffer
    2. Sync log buffer to disk, which currently can be group committed.
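
Before turning to the flush itself, here is a sketch of how steps 6a through 6d could be implemented, assuming that the shared variables are protected by binlog_lock and that a condition variable binlog_order is broadcast whenever Last_Complete advances (the helper names are assumptions of this sketch):

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

extern pthread_mutex_t binlog_lock;    /* protects the shared variables */
extern pthread_cond_t  binlog_order;   /* broadcast when Last_Complete moves */
extern uint64_t        Last_Complete;

void binlog_log_trans(int binlog_fd, const void *data,
                      uint64_t trans_pos, uint64_t trans_size)
{
  /* Step 6a: wait until it is our turn to write. */
  pthread_mutex_lock(&binlog_lock);
  while (Last_Complete != trans_pos)
    pthread_cond_wait(&binlog_order, &binlog_lock);
  pthread_mutex_unlock(&binlog_lock);

  /* Step 6b: write the transaction data at its reserved position. */
  pwrite(binlog_fd, data, trans_size, trans_pos);

  /* Steps 6c-6d: advance the watermark and wake up the waiting threads. */
  pthread_mutex_lock(&binlog_lock);
  Last_Complete = trans_pos + trans_size;
  pthread_cond_broadcast(&binlog_order);
  pthread_mutex_unlock(&binlog_lock);

  /* Step 6e: the group-committed fsync(), shown below. */
}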
To implement group commit, it is sufficient to wait on a condition variable for a specified interval. Once the interval has passed, the thread can call fsync() itself, after which it broadcasts the fact that the data has been flushed to disk so that the other waiting threads can skip the flush. Typically, the code looks something along these lines (we ignore checking error codes here to simplify the description):
pthread_mutex_lock(&binlog_lock);
while (Last_Complete > Last_Committed) {
  /* Compute an absolute timeout 1 msec from now; pthread_cond_timedwait()
     expects a struct timespec, so clock_gettime() is used here. */
  struct timespec timeout;
  clock_gettime(CLOCK_REALTIME, &timeout);
  timeout.tv_nsec += 1000000;              /* 1 msec */
  if (timeout.tv_nsec >= 1000000000) {     /* carry over into seconds */
    timeout.tv_sec++;
    timeout.tv_nsec -= 1000000000;
  }
  int error= pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    /* Nobody flushed within the interval: flush the binary log ourselves
       and tell the other waiting threads that their data is on disk. */
    fsync(binlog_fd);                      /* binlog_fd: the binlog file descriptor */
    Last_Committed = Last_Complete;
    pthread_cond_broadcast(&binlog_flush);
  }
}
pthread_mutex_unlock(&binlog_lock);
There are a few observations regarding this approach:
  • Step 6a requires a condition variable and a mutex when waiting for Last_Complete to reach Trans_Pos. Since there is just a single condition variable, it is necessary to broadcast a wakeup to all waiting threads, each of which will evaluate the condition only to find that a single thread should continue, while the other threads go back to sleep.

    This means that the condition will be checked O(N^2) times to commit N transactions. This is a waste of resources, especially if there are many threads waiting, so if we can avoid this, we can gain performance.

  • Since the thread has a good position in the binary log where it could write, it could just as well start writing instead of waiting. It will not interfere with any other threads, regardless of whether locks are held or not.
These observations lead us to the second approach, that of writing transaction data to the binary log in parallel.

A parallel write approach

Figure 2. Parallel binary log group commit
In this approach, each session is allowed to write to the binary log at the same time using pwrite since the space for the transaction data has already been allocated when preparing the engines. Figure 2 illustrates how the binary log is filled in (grey areas) by multiple threads at the same time. Similar to the sequential write approach, we still have the Last_Complete, Last_Committed, and Next_Available variables.

Each thread does not have to wait for other threads before writing, but it does have to wait for the other threads before committing. This is necessary since we require the order of commits in the InnoDB log and the binary log to be the same. In reality, the parallel writes do not pose a problem since the I/O is buffered, hence the writes go to in-memory file buffers.

The algorithm looks quite similar to the sequential write approach, but notice that in step 6, the transaction data is simply written to the binary log using pwrite (see the sketch after the steps).

  1. Lock the binary log
  2. Save value of Next_Available in a local variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Write transaction data to binary log using pwrite. There is no need to keep a lock to protect the binary log here since all threads will write to different positions.
    2. Wait until Last_Complete = Trans_Pos.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log buffer
    2. Sync log buffer to disk, which currently can be group committed
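
Under the same assumptions as the sketch for the sequential approach, step 6 of the parallel approach could look as follows; the only difference is that the pwrite() now happens before the wait:

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

extern pthread_mutex_t binlog_lock;    /* protects the shared variables */
extern pthread_cond_t  binlog_order;   /* broadcast when Last_Complete moves */
extern uint64_t        Last_Complete;

void binlog_log_trans_parallel(int binlog_fd, const void *data,
                               uint64_t trans_pos, uint64_t trans_size)
{
  /* Step 6a: no lock is needed for the write itself, since every thread
     writes to its own reserved range of the binary log. */
  pwrite(binlog_fd, data, trans_size, trans_pos);

  /* Steps 6b-6d: wait for all preceding transactions to complete, then
     publish our own completion and wake up the waiting threads. */
  pthread_mutex_lock(&binlog_lock);
  while (Last_Complete != trans_pos)
    pthread_cond_wait(&binlog_order, &binlog_lock);
  Last_Complete = trans_pos + trans_size;
  pthread_cond_broadcast(&binlog_order);
  pthread_mutex_unlock(&binlog_lock);

  /* Step 6e: the group-committed fsync(), as in the sequential approach. */
}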
This new algorithm has some advantages, but there are a few things to note:
  • When a transaction is committed, it is guaranteed that Trans_Pos ≤ Last_Committed for all threads (recall that Trans_Pos is a thread-local variable).
  • Writes are done in parallel, but waiting for the condition in step 6b still requires a broadcast to wake up all waiting threads, while only one will be allowed to proceed. This means that we still have the O(N^2) complexity of the sequential algorithm. However, for the parallel algorithm it is possible to improve the performance significantly, which we will demonstrate in the third part, where we will discuss optimizations to the algorithms.
  • Recovery in the sequential algorithm is comparatively simple since there are no partially written transactions. If you consider that a crash can occur in the situation described in Figure 2, it is necessary to devise a method for recovering correctly. This we will discuss in the second part of these posts.

8 comments:

Anonymous said...

Looks like the usage of binlog for replication just causes a lot of problems and performance issues.

Basically you have 1 database server but yet are forced to deal with distributed transactions, which will always be slow.

What about just getting rid of the binlog altogether? InnoDB already has its own log. I don't see any point in having the binlog at all.

Mats Kindahl said...

Unfortunately, InnoDB does not store the DDL in its log, nor changes done in other engines.

Although there is some cost to writing the binary log, it is possible to reduce the impact of writing to the binary log. Since little time has been invested in improving the performance of writing the binary log, there is potential for improving performance here.

Note that the biggest (and mostly mentioned) performance issue is not the binary log (on the master), it is applying the changes on the slave.

Kristian Nielsen said...

Hi Mats, great article!

I have a couple of questions:

1. In your description you release InnoDB row locks in InnoDB prepare. I am
wondering if this is safe (and if so, why it is safe)?

One problem is that changes done in a transaction are only made visible to
other transactions in InnoDB commit. So what if another transaction does
UPDATE ... SET a=a+1 on the same row, how will it be able to set the correct
value if it does not wait until the first transaction commits and makes its
changes visible?

Or are changes made visible to other transactions in InnoDB prepare when the
row locks are released? This seems to create another possible problem in case
the transaction is rolled back instead of committed (which can still happen
after InnoDB prepare succeeds if binlog commit (or another engine commit)
fails). In that case, other transactions may get to see changes that were
never committed?

(But I am really interested in understanding how to safely release row locks
early).

2. InnoDB commit order is determined by the order in which InnoDB commit is
called, not by the order in which InnoDB prepare is called (I'm pretty sure
...). So it seems to me your algorithm does not guarantee same commit order in
binlog and InnoDB ... ? You get InnoDB to prepare in the same order as binlog,
but not commit.

So it seems we can still get the problem that for example the binlog has
transaction A followed by B (and A was prepared before B), but InnoDB has
committed transaction B first (and not yet committed transaction A). And at
this point InnoDB hotbackup snapshots the system, resulting in a state, with B
committed and A not, which does not correspond to any binlog position?

Or am I missing something?

Harrison said...

Hi Kristian,

For question 1, when an engine does a prepare, it assures that the transaction commit can and will succeed. So after all engines have done the prepare, then the commit should always work and hence, in theory, the row level locks can be released.

Really, the lock releasing should wait until after all engines involved have prepared (not just InnoDB if others are involved).

If the commit is failing for some reason, then the two phase commit protocol is being violated and things go wrong for several reasons (such as transactions may be committed in some resources but not others).


Practically speaking, commit can fail for several reasons, such as disks dying, disk full, etc. The only problem that would occur from this set of failures is if the commits could fail selectively: for example, transaction A updates a row, transaction B uses the new value due to the early lock release, but A fails to commit while B succeeds. The above failure possibilities are general failures that affect all transactions equally, and hence aren't a problem.



2. In our new solution, it would be necessary to have the binary logs on hand (at least the active one) as well for using any snapshot technology. They wouldn't necessarily need to be snapshotted at the exact same time, just copied after the InnoDB data/log files.

A transaction is committed when it is written to the binary log. On crash recovery, the prepared transactions get committed and hence InnoDB could still output the latest coordinates (with a little code to determine latest as each commits).


In theory, it may be possible to do the recovery without the binary logs, but it would require InnoDB finishing commits based on the order of the transactions in the prepare phase; I'm not 100% sure on that and am not entirely sure how safe it would be.

Mark Callaghan said...

Step 6a doesn't require O(N*N) wakeups. A simple implementation of step 6a does. A better implementation can use an array of condvars and broadcast on condvars[next_position % sizeof(condvars)]

Mark Callaghan said...

For more on early release of row locks see http://bazaar.launchpad.net/~mysqlatfacebook/mysqlatfacebook/5.1/revision/3545

Mark Callaghan said...

Doesn't the parallel write algorithm still have an O(N*N) step in 6d, as broadcast is done there? And the optimization for the sequential write approach can also be used here.

Mats Kindahl said...

Hi Mark,

Yes, it is possible to reduce the number of wakeups by introducing an array of condition variables like you propose.

The broadcast does indeed wake up all threads, but normally all of the threads are able to proceed, so each broadcast typically wakes up O(N) threads, giving a complexity of O(N^2) / O(N) = O(N).