Binlog Group Commit Experiments

It was a while ago since I talked about binary log group commit. I had to spend time on a few other things.

Since then, Kristian has released a version of binary log group commit that seems to work well. However, for a few reasons that will be outlined below, we decided to do experiments ourselves using the approach that I have described earlier. A very early version of what we will start doing benchmarks on are available at the MySQL labs. We have not done any any benchmarking on this approach before OSCON, so we we'll have to get back on that.

All of this started with Facebook pointing out a problem in how the group commit interacts with the binary log and proposed a way to handle the binary log group commit by demonstrating a patch to solve the problem.

What's in the patch

The patch involves implementing logic for handling binary log group commit and parallel writing of the binary log, including a minor change to the handler protocol by adding a persist callback. The extension of the handler interface is strictly speaking not necessary for the implementation, but it is natural to extend the interface in this manner and I belive that it can be used by storage engines to execute more efficiently). In addition to the new logic, three new options were added and one option was created as an alias of an old option.

binlog-sync-period=N: This is just a rename of the old sync-period option, which tell that fsync should be called for the binary log every N events. For many of the old options, it is not clear what they are configuring, so we are adding the binlog- prefix to options that affect the binary log. The old option is kept as an alias for this option.
binlog-sync-interval=msec: No transaction commit will wait for more than msec milliseconds before calling fsync on the binary log. If set to zero, it is disabled. You can set both this option and the binlog-sync-period option.
binlog-trx-committed={COMPLETE,DURABLE}: A transaction is considered committed when it is either in durable store or when it is completed. If set to DURABLE either binlog-sync-interval or binlog-sync-period has to be non-zero. If they are both zero, transactions will not be flushed to disk and hence they will never be considered durable.
master-trx-read=={COMPLETE,DURABLE}: A transaction is read from the binary log when it is completed or when it is durable. If set to DURABLE either binlog-sync-interval or binlog-sync-period has to be non-zero or an error will be generated. If it was possible for both zero, no transactions will ever be read from the binary log and hence never sent out.

The patch also contain code to eliminate the prepare_commit_mutex as well as moving release of row locks inside InnoDB (not completely applied yet, I will get it there as soon as possible) to the prepare phase. The focus on these changes is that we should maintain consistency, so we have not done any aggressive changes like moving the release of the write locks to the prepare phase: that could possibly lead to inconsistencies.

Figure 1. Binary log with transaction in different stages

The main changes are about how a transaction is committed. The details are explained in the previous articles, but for understanding the rest of this blog post, I'll briefly recapitulate how a transaction is committed in this solution. Each transaction pass through three states: prepared, completed (committed to memory), and durable (committed to disk), as seen in Figure 1. The transaction is pushed through these states using the following procedure:

The transaction is first prepared, which is now split into two steps:
1. In the reserve step, a slot is assigned for the transaction in the binary log and the storage engine is asked check if this transaction can be committed. At this point, the storage engine can abort the transaction if it is unable to fulfill the commit, but if it approves of the commit, the only thing that can abort the transaction after this point is a server crash. This check is currently done using the prepare call. This step is executed with a lock, but is intended to be short.
2. In the persist step, the persist function is called, which asks the storage engine to persist any data that it need to persist to guarantee that the transaction is fully prepared. After this step is complete, the transaction is fully prepared in the storage engine and in the event of a crash, it will be able to commit the transaction on recovery, if asked to do so. This step is executed without a lock and a storage engine that intend to handle group commit should defer any expensive operations to this step.
To record the decision, the transaction is written to the reserved slot in the binary log. Since the write is done to a dedicated place in the binary log reserved to this transaction, it is not necessary to hold any locks, which means that several threads can write the transaction to the binary log at the same time.
The commit phase is in turn split into two steps:
1. In the completion step, the thread waits for all preceeding transactions to be fully written to the binary log, after which the transaction is completed, which means that it is logically committed but not necessarily in durable storage.
2. In the durability, step, the thread waits for the transaction (and all preceeding transactions) to be written to disk. If this does not occur within the given time period, it will itself call fsync for the binary log. This will make all completed transactions durable.

After this procedure is complete, the transaction is fully committed and the thread can proceed with executing the next statement.

The different approaches

So, providing this patch begs the questions: why a third version of binary log group commit? There are three approaches: Facebook's patch (#1), Kristian's patch (#2), and my patch (#3). Before going over the rationale leading to a third version, it is necessary to understand how the Facebook patch and Krisian's patch work on a very high level. If you look at Figure 1, you see a principal diagram showing how the patches work. Both of them maintain a queue of threads with transactions to be written and will ensure that they are written in the correct order to the binary log.

The Facebook patch ensures that the transactions are written in the correct order by signalling each thread waiting in the queue in the correct order, after which the thread will take a lock on the binary log, append the transaction, and release the lock. When the decision to commit the outstanding transactions are made, fsync() is called. It has turned out that this lock-write-unlock loop can just be executed at a certain speed, which means that as the number threads waiting to write transactions increase, the system choke and is not able to keep up.

Kristian solves this by designating the first thread in the queue as the leader, and have it write the transactions for all threads in the queue instead of just having each thread do it individually and then broadcast to the other threads, who just return from the commit. This improves performance significantly as can be seen from the figures in the measurements that Mark did. Note, however, that a lock of the binary log is still kept while writing the transactions.

The approach we are experimenting with goes about this in another way and instead of queueing the data to be written, a place is immediately allocated in the binary log after which the thread proceed to write the data. This means that several threads can at the same time write in parallel to the binary log without needing to keep any locks. There is a need for a lock when allocating space in the binary log, but that is very short. Since the threads can finish writing in different order, it is necessary to keep logic around for deciding when a transaction is committed and when it's not. For details, you can look at the worklog (which is not entirely up to date, but I'll fix that). In this sense, the binary log itself is the queue (there is a queue in the implementation, but this is just for bookkeeping). The important differences leading us to a want to have a look at this third version are:

Approaches #1 and #2 keep a lock while writing the binary log while #3 doesn't.
Approaches #1 and #2 keep the transactions on the side (in the queue) and write them to the binary log when they are being committed. Approach #3 writes the transactions directly to the binary log, possibly before they are committed.

Figure 1. Sources of performance problems

Efficiently using Multiple Cores

Efficiently using a multi-threaded systems, especially one with multiple cores, is very hard. It requires knowledge of hardware issues, operating systems considerations, algorithms, and some luck. I will not cover all the issues revolving around designing a system for multi-core use, but I will focus on three of the parts that we are considering in this case. We split the sources of performance degradations when committing a transaction into three separate parts: CPU and memory issues, software lock contention, and I/O.

The CPU and memory issues has to do with how caches are handled on the CPU level, which can affect performance quite a lot. There some things that can be done, such as avoiding false sharing, handling data alignment, and checking the cache access patterns, but in general, it is hard to add as an afterthought and require quite a lot of work to get right. We are not considering this and view it as static.
The I/O can be reduced using either SSDs or use RAID solutions (which does not reduce latency, but improves the throughput and therefore reduce the I/O needed for each transaction). Also, reducing the number of accesses to disk using group commits will improve the situation significantly, which is what we're doing here.
To reduce the software lock contention there is only one solution: reduce the time each lock is kept. This can be as simple as moving the lock aquire and release, using atomic primitives instead of locks, but can also require re-designing algorithms to be able to run without locks.

So, assuming that we reduce the I/O portion of committing a transaction—and only I/O portion—as you can see in Figure 1, the software lock time start to become the problem and we need to start to work on reducing that. To do this, there are not many options except the approach described above. And if we take this approach to reduce lock contention, there's just a few additions to get the group commit as well.

Given this, it is rational to explore if this solution can solve the group commit problem as good as the other solutions and improve the scalability of the server at the same time.

Scaling out

One of the most central uses for replication is to achieve high-availability by duplicating masters and replicate between them to keep both up to date. For this reason, it is important to get the changes over to the other master as fast as possible. In this case, whether the data is durable on the original master or not is of a smaller concern since once the transaction has left the node, a crash will not cause the transaction to disappear since it has already been distributed. This means that for implementing multi-masters, we want replication to send transactions as soon as possible—and maybe even before that—since we can achive high-availablility by propagating the information as widely as possible.

On the other hand, transactions sent from the master to the slave might need to be durable on the master since otherwise the slave might be moving into an alternative future—a future where this transaction was committed—if the transactions sent to the slave are lost because of a crash. In this case, it is necessary for the master to not send out the transaction before it is in durable store. Having a master that is able to send out both completed transactions and durable transactions at the same time, all based on the requirements of the slave that connects, is a great feature and allow the implementation of both an efficient multi-master solution as well as slaves that does not diverge from the master even in the event of crashes. Currently, a master cannot both deliver transactions that are completed and transactions that are durable at the same time. With the patch presented in this article, it is possible to implement this, but in alternative #1 and #2 described above, all the transactions are kept "on the side" and not written to the binary log until they are being committed. This means that it is harder to support this scenario with the two other alternatives.

Concluding remarks

To sum up the discussion above: we are interested in exploring this approach since we think that it provides shorter lock time, hence scales better to multi-core machines, and in addition provide better scale-out capabilities, since it will be possible that the slaves can decide if they want to receive durable or completed transactions. Thanks to all in the community for the great work and discussions on binlog group commit. The next steps will be to benchmark this solution to see how it flies and it would be great to also get some feedback on this approach. As always, we are interested in getting a good and efficent solution that also can be maintained end evolved easily.

5 comments:

Anonymous said...: Mats, thanks for the post, interesting to read more about your approach.

How do you ensure that commit order in binlog and engine is the same, so
innobackup (or xtrabackup) will be consistent with the binlog?; 7:17 PM
Mark Callaghan said...: I like your implementation and the work done by Kristian. Both are definite improvements over what we have in the FB patch. I rewrote the FB version because it was broken in regards to the order it preserved and while doing performance testing I first had a parameter to control how long a transaction might wait for group commit.

And then I added fancy algorithms for dynamically adjusting that based on the observed latency of fsync and the recent commit rate. The hard problem to solve is avoiding waiting when there is little concurrency in the system. Otherwise a workload with one connection loading a lot of data will get very slow as each commit waits for the group commit timeout?

How do you avoid this problem?

In the yet to be published version in the FB patch I replaced the timeout with a simple rule. When a transaction is about to call log_xid() it determines the number of sessions that entered ha_commit_trans after it did. If there are more than X, than it will wait a short amount of time.

Kristian has a different solution that also avoids the group commit timeout.; 4:39 PM
Mats Kindahl said...: Kristian,

The transaction is effectively committed once it is written to the binary log (and all preceding transactions are also written) since a recovery at this point will find and commit the transaction. Therefore it is not necessary to maintain the commit order, but for {inno,xtra}backup to work correctly it is sufficient to get the file and position of the last committed transaction, which is readily available in last_complete or last_durable.; 1:39 PM
Mats Kindahl said...: Mark, thank you for your comments.

There is currently not any logic like that in the implementation. Adding a rule to check the length of the "in-progress" queue is easy, and if it is empty, performing an fsync would allow transactions to commit fast when there is little concurrency in the system.

We will perform benchmarks next to ensure that the patch performs as expected, and it is important that the response time is short both for high-concurrency and low-concurrency workloads.; 1:45 PM
Anonymous said...: Mats, xtrabackup/innobackup does not look at the binlog at all, they only copy
innodb transaction logs and tablespaces directly from disk. This is for the
non-blocking operation, there are scripts on top which do FLUSH TABLES WITH
READ LOCK, but then the backup can block the server arbitrarily long depending
on worklog.

Suppose we are committing transactions A,B,C,D,E.

Without controlling commit order, it can happen that when we take the
non-blocking {inno,xtra}backup we have:

- Binlog contains transactions A,B,C,D (in this order).

- InnoDB backup has transactions B,D committed, A,C,E prepared.

How are you going to restore the backup? There is no point in the master
binlog corresponding to only B and D committed. You need to know somehow to XA
commit A and C and XA rollback E (ie. effectively run the XA recovery at
restore time).

My question is, how do you handle this? The MariaDB group commit avoids this
by ensuring that InnoDB commits in the order A,B,C,D,E like the binlog, so
backup snatshots can never see (B,D). If you do not do this, do you then
extend innoback to also copy the binlogs and run XA recovery?; 2:33 PM

MySQL Musings

Wednesday, July 27, 2011