Friday, April 30, 2010

Binary Log Group Commit - An Implementation Proposal

It is with interest that I read Kristian's three blogs on the binary log group commit. In the article, he mentions InnoDB's prepare_commit_mutex as the main hindrance to accomplish group commits—which it indeed is—and proposes to remove it with the motivation that FLUSH TABLES WITH READ LOCK can be used to get a good binlog position instead. That is a solution—but not really a good solution—as Kristian points out in the last post.

The prepare_commit_mutex is used to ensure that the order of transactions in the binary log is the same as the order of transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work, so removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of transactions.

To understand how to solve the problem, it is necessary to take a closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.

The transaction data is stored in a per-thread transaction cache and the transaction size is the size of the data in the transaction cache. In addition, each transaction will have a transaction binlog position (or just transaction position) where the transaction data is written in the binary log.

The procedure can be outlined in the following steps:

  1. Prepare InnoDB [ha_prepare]:
    1. Write prepare record to log buffer
    2. fsync() log file to disk (this can currently do group commit)
    3. Take prepare_commit_mutex
  2. Log transaction to binary log [TC_LOG_BINLOG::log_xid]:
    1. Lock binary log
    2. Write transaction data to binary log
    3. Sync binary log based on sync_binlog. This forces the binlog to always fsync() (no group commit) due to prepare_commit_mutex
    4. Unlock binary log
  3. Commit InnoDB:
    1. Release prepare_commit_mutex
    2. Write commit record to log buffer
    3. Sync log buffer to disk (this can currently do group commit)
    4. InnoDB locks are released
There are mainly two problems with this approach:
  • The InnoDB row level and table level locks are released very late in the sequence, which affects concurrency. Ideally, we need to release the locks very early, preferably as soon as we have prepared InnoDB.
  • It is not possible to perform a group commit in step 2
As you can see here, the prepare of the storage engines (in this case just InnoDB) is done before the binary log mutex is taken, and that means that if the prepare_commit_mutex is removed it is possible for another thread to overtake a transaction so that the prepare and writing to the binary log is done in different order.

To solve this, Mark suggests using a queue or a ticket system to ensure that transactions are committed in the same order, but we actually already have such a system that we can use to assign tickets: namely the binary log.

The idea is to allocate space in the binary log for the transaction to be written. This gives us a sequence number that we can use to order the transactions.

In the worklog on binary log group commits you will find the complete description as well as the status of the evolving work.

In this post, I will outline an approach that Harrison and I have discussed, which we think will solve the problems mentioned above. In this post, I will outline the procedure during normal operations, in the next post I will discuss recovery, and in the third post (but likely not the last on the subject), I will discuss some optimizations that can be done.

I want to emphasize that the fact that we have a worklog does not involve any guarantees or promises of what, when, or even if any patches will be pushed to any release of MySQL.

In Worklog #4007 an approach for writing the binary log is suggested where space is allocated for the transaction in the binary log before actually starting to write it. In addition to avoiding unnecessary locking of the binary log, it also allow us to use the binary log to order the transactions in-place. We will use this idea of reserving space in the binary log to implement the binary log group commit.

By re-structuring the procedure above slightly, we can ensure that the transactions are written in the same order in both the InnoDB transaction log and the binary log.

There are two ways to re-structure the code: one simple and one more complicated that potentially can render better performance. To simplify the presentation, it is assumed that pre-allocation is handled elsewhere, for example using Worklog #4925. In a real implementation, pre-allocation can either be handled when a new binlog file is created, or when transaction data is being written to the binary log.

The sequential write approach

Figure 1. Sequential binary log group commit
In the sequential write approach, the transactions are still written to the binary log in order and the code is just re-ordered to avoid keeping mutexes when calling fsync(). To describe the algorithm, three shared variables are introduced to keep track of the status of replication:
Next_Available
This variable keeps track of where a new transaction can be written
Last_Committed
This variable keeps track of the last committed transaction, meaning that all transactions preceding this position is actually on disc. This variable is not necessary in the real implementation, but it is kept here to simplify the presentation of the algorithm.
Last_Complete
This variable keeps track of the last complete transaction. All transactions preceding this point is actually written to the binary log, but are not necessarily flushed to disc yet.
You can see an illustration of how the variables are used with the binary log in Figure 1 where you can also see three threads each waiting to write a transaction. Both variables are initially is set to the beginning of the binary log and it is always true that Last_CommittedLast_CompleteNext_Available . The procedure can be described in the following steps:
  1. Lock the binary log
  2. Save value of Next_Available in a variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Wait until Last_Complete = Trans_Pos. (This can be implemented using a condition variable and a mutex.)
    2. Write transaction data to binary log using pwrite. At this point, it is not really necessary to use pwrite since the transaction data is simply appended, but it will be used in the second algorithm, so we introduce it here.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log buffer
    2. Sync log buffer to disk, which currently can be group committed.
To implement group commit, it is sufficient to have a condition variable and wait for that for a specified interval. Once the interval has passed, the transaction data can call fsync(), after which it broadcasts the fact that data has been flushed to disc to other waiting threads so that they can skip this. Typically, the code looks something along these lines (we ignore checking error codes here to simplify the description):
pthread_mutex_lock(&binlog_lock);
while (Last_Complete ≥ Last_Committed) {
  struct timespec timeout;
  gettimeofday(&timeout, NULL);
  timeout.tv_usec += 1000;    /* 1 msec */
  int error= pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    fsync(&binlog_file);
    Last_Committed = Last_Complete;
    pthread_cond_broadcast(&binlog_flush);
  }
}
pthread_mutex_unlock(&binlog_lock);
There are a few observations regarding this approach:
  • Step 6a requires a condition variable and a mutex when waiting for Last_Complete to reach Trans_Pos. Since there is just a single condition variable, it is necessary to broadcast a wakeup to all waiting threads, which each will evaluate the condition just to find a single thread that should continue, while the other threads go to sleep again.

    This means that the condition will be checked O(N2) times to commit N transactions. This is a waste of resources, especially if there is a lot of threads waiting, and if we can avoid this, we can gain performance.

  • Since the thread has a good position in the binary log where it could write, it could just as well start writing instead of waiting. It will not interfere with any other threads, regardless if locks are kept or not.
These observations lead us to the second approach, that of writing transaction data to the binary log in parallel.

A parallel write approach

Figure 2. Parallel binary log group commit
In this approach, each session is allowed to write to the binary log at the same time using pwrite since the space for the transaction data has already been allocated when preparing the engines. Figure 2 illustrates how the binary log is filled in (grey areas) by multiple threads at the same time. Similar to the sequential write approach, we still have the Last_Complete, Last_Committed, and Next_Available variables.

Each thread does not have to wait for other threads before writing, but it does have to wait for the other threads to commit. This is necessary since we required the order of commits in the InnoDB log and the binary log to be the same. In reality, this does not pose a problem since the I/O is buffered, hence the writes are done to in-memory file buffers.

The algorithms look quite similar to the sequential write approach, but notice that in step 6, the transaction data is simply written to the binary log using pwrite.

  1. Lock the binary log
  2. Save value of Next_Available in a local variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Write transaction data to binary log using pwrite. There is no need to keep a lock to protect the binary log here since all threads will write to different positions.
    2. Wait until Last_Complete = Trans_Pos.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log
    2. Sync log file to disk
This new algorithm has some advantages, but there are a few things to note:
  • When a transaction is committed, it is guaranteed that Trans_PosLast_Committed for all threads (recall that Trans_Pos is a thread-local variable).
  • Writes are done in parallel, but when waiting for the condition in step 6b still requires a broadcast to wake up all waiting threads, while only one will be allowed to proceed. This means that we still have the O(N2) complexity of the sequential algorithm. However, for the parallel algorithm it is possible to improve the performance significantly, which we will demonstrate in the third part where we will discuss optimizations to the algorithms.
  • Recovery in the sequential algorithm is comparably simple since there are no partially written transactions. If you consider that a crash can occur in the situation described in Figure 2, it is necessary to device a method for correctly recovering. This we will discuss in the second part of these posts.

Tuesday, April 13, 2010

MySQL Conference Replication tutorial: Article and Demo Software

The MySQL Conference and Expo started with me and Lars Thalmann doing the replication tutorial. Unfortunately, we cannot at this time distribute the slides (please watch the replication tutorial page at the conference site), but there is a replication tutorial package for easy setup of server to play around with—including some sample scripts—and a paper that both explains how the package can be used as well as giving some example setups.