Tuesday, June 05, 2012

Binary Log Group Commit in MySQL 5.6

With the release of MySQL 5.6 binary log group commit is included, which is a feature focused on improving performance of a server when the binary log is enabled. In short, binary log group commit improve performance by grouping several writes to the binary log instead of writing them one by one, but let me digress a little on how transactions are logged to the binary log before going into the details. Before going into details about the problem and the implementation, let look at what you do to turn it on.

Nothing.

Well... we actually have a few options to tweak it, but nothing required to turn it on. It even works for existing engines since we did not have to extend the handlerton interface to implement the binary log group commit. However, InnoDB has some optimizations to take advantage of the binary log group commit implementation.

binlog_order_commits={0|1}
This is a global variable that can be set without stopping the server.

If this is off (0), transactions may be committed in parallel. In some circumstances, this might offer some performance boost. For the measurements we did, there were no significant improvement in throughput, but we decided to keep the option anyway since there are special cases were it can offer improvements.

binlog_max_flush_queue_time=microseconds
This variable controls when to stop skimming the flush queue (more about that below) and move on as soon as possible. Note that this is not a timeout on how often the binary log should be written to disk since grabbing the queue and writing it to disk takes time.

Transactions galore...

As the server executes transactions the server will collect the changes done by the transaction in a per-connection transaction cache. If statement-based replication is used, the statements will be written to the transaction cache, and if row-based replication is used, the actual rows changed will be written to the transaction cache. Once the transaction commits, the transaction cache is written to the binary log as one single block. This allow each session to execute independently of each others and only need to take a lock on the binary log when writing the transaction data to it. Since transactions are isolated from each others it is enough to serialize the transactions on commits. (Of course, this is in an ideal world. Transactions can see other transactions changes if you set a different transaction isolation level. You would never do that unless you knew exactly what you're doing... right?)
Figure 1. Two-Phase Commit Protocol (2PC)
In order to keep the storage engine and the binary log in sync, the server employs a two-phase commit protocol (or just 2PC) that you can see in Figure 1. As you can see, there is a call to write() and one call to fsync() in the diagram: I'll get to that is just a moment, so stay tuned.

The entire point of using a two-phase commit protocol is to be able to guarantee that the transaction is either both in the engine and the binary log (or in neither) even in the event that the server crashes after the prepare, and subsequently recovers. That is, it should not be possible that the transaction is in the engine but not in the binary log, or vice verse. Two-phase commit solves this by requiring that once a transaction is prepared in the engine, it can be either fully committed or fully rolled back even if the server crashes and recover. So, on recovery, the storage engine will then provide the server with all the transactions that are prepared but not yet committed, and the server will then commit the transaction if it can be found in the binary log, and roll it back otherwise.

This is, however, only possible if the transaction can be guaranteed to be persistent in the binary log before committing the transaction in the engine. Since disks are slow and memory fast, the operating system tries to improve performance by keeping part of the file in memory instead of writing directly to disk. Once enough changes have been written, or the memory is needed for something else, the changes are written to disk. This is good for the operating system (and also for anybody using the computer), but causes a problem for the database server since if the server crashes, it is possible that the transaction is committed in the storage engine, but there is no trace of it in the binary log.

For recovery to work properly, it is therefore necessary to ensure that the file is really on disk, which is why there is a call to fsync() in Figure 1, which makes the in-memory part of the file to be written to disk.

The Infamous prepare_commit_mutex

Figure 2. Missing un-committed transactions
When the server recovers, it has access to the binary log and can therefore decide what to commit and what to rollback, but what if there is no binary log?

In the general case, a recovery can just roll back all prepared transactions and start again. After all, the transactions that were just prepared but not committed are safe to roll back. They just move the database to the state it had just before starting those transactions. Any clients being connected has not got an indication that the transaction is committed, so they will realize that the transactions have to be re-executed.

There is another case where recovery is being used in this way and that is when using on-line backup methods such as InnoDB Hot Backup (which is used in the MySQL Enterprise Backup). These tools take a copy of the database files and InnoDB transaction logs directly—which is an easy way to take a backup—but it means that the transaction logs contain transactions that have just been prepared. On recovery, they roll back all the transactions and have a database in a consistent state.

Since these on-line backup methods are often used to bootstrap new slaves, the binary log position of the last committed transaction is written in the header of the InnoDB redo log. On recovery, the recovery program print the binary log position of the last committed transaction and you can use this information with the CHANGE MASTER command to start replicating from the correct position. For this to work correctly, it is necessary that all the transactions are committed in the same order as they are written to the binary log. If they are not, there can be "holes" where some transactions are written to the binary log, but not yet committed, which cause the slave to miss transactions that were not committed. The problematic case that can arise is what you see in Figure 3 below.

Figure 3. Committing in parallel
You can see an example of this in Figure 2, where replication will start from the last committed position, but there is a transaction that were just prepared and hence was rolled back when the backup was restored on the slave.

To solve this, InnoDB added a mutex called the prepare_commit_mutex that was taken when preparing a transaction and released when committing the transaction. This is a simple solution to the problem, but causes some problems that we will get to in a minute. Basically, the prepare_commit_mutex solve the problem by forcing the call sequence to be as in Figure 4.

Figure 4. Sequencing transactions

Steady as she goes... NOT!

Since disk writes are slow, writing every transaction to disk will affect performance quite a lot... well, actually very much...

To try to handle that, there is a server option sync_binlog that can be set to how often the binary log should be written to disk. If it is set to 0, the operating system will decide on when the file pages should be written to disk, if it is set to 1, then fsync() will be called after every transaction being written to the binary log. In general, if you set sync_binlog to N, you can at most lose N-1 transactions, so in practice there are just two useful settings: sync_binlog=0 means that you accept that some transactions can be lost and handle it some other way, and sync_binlog=1 means that you do not accept to lose any transactions at all. You could of course set it to some other value to get something in between, but in reality you can either handle transaction loss or not.

To improve performance, the common case is to bundle many writes with each sync: this is what the operating system does, and that is what we should be able to do. However, if you look at Figure 4 you see that there is no way to place a fsync() call in that sequence so that several transactions are written to disk at the same time. Why? Because at any point in that sequence there is at most one prepared and written transaction. However, if you go back to Figure 3, you can see that it would be possible to place an fsync() as shown and write several transactions to disk at the same time. If it was possible, then all transactions written to the binary log before the fsync() call would be written to disk at once. But this means that it is necessary to order the commits in the same order as the writes without using the prepare_commit_mutex.

So, how does all this work then...

The binary log group commit implementation used split the commit procedure into several stages as you can see in Figure 5. The stages are entirely internal to the binary log commit procedure and does not affect anything else. In theory, it would be possible to have another replication implementation with another policy for ordering commits. Since the commit procedure is separated into stages, there can be several threads processing transactions at the same time, which also improves throughput.
Figure 5. Commit Procedure Stages
For each stage, there is an input queue where sessions queue up for processing. If a thread registers in an empty queue, it is considered the stage leader otherwise, the session is a follower. Stage leaders will bring all the threads in the queue through the stage and then register the leader and all followers for the next stage. Followers will move off to the side and wait for a leader to signal that the entire commit is done. Since it is possible that a leader registers to a non-empty queue, a leader can decide to become a follower and go off waiting as well, but a follower can never become a leader.

When a leader enters a stage, it will grab the entire queue in one go and process it in order according to the stage. After the queue is grabbed, other sessions can register for the stage while the leader processes the old queue.

In the flush stage, all the threads that registered will have their caches written to the binary log. Since all the transactions are written to an internal I/O cache, the last part of the stage is writing the memory cache to disk (which means it is written to the file pages, also in memory).

In the sync stage, the binary log is synced to disk according to the settings of sync_binlog. If sync_binlog=1 all sessions that were flushed in the flush stage will be synced to disk each time.

In the commit stage, the sessions will that registered for the stage will be committed in the engine in the order they registered, all work is here done by the stage leader. Since order is preserved in each stage of the commit procedure, the writes and the commits will be made in the same order.

After the commit stage has finished executing, all threads that were in the queue for the commit stage will be marked as done and will be signaled that they can continue. Each session will then return from the commit procedure and continue executing.

Thanks to the fact that the leader registers for the next queue and is ready to become a follower, the stage that is slowest will accumulate the most work. This is typically the sync stage, at least for normal hard disks. However, it is critical to fill the flush stage with as many transactions as possible, so the flush stage is treated a little special.

In the flush stage, the leader will skim the the sessions one by one from the flush queue (the input queue for the flush stage). As long as the last session was not remove the from the queue, or the first session was unqueued more than binlog_max_flush_queue_time microseconds ago, this process will continue. There are two different conditions that can stop the process:

  • If the queue is empty, the leader immediately advanced to the next stage and registers all sessions processed to the sync stage queue.
  • If the timeout was reached, the entire queue is grabbed and the sessions transaction caches are flushed (as before). The leader then advance to the sync stage.
Figure 6. Comparing 5.6.5 and 5.6 June labs release

Performance, performance, performance...

I'm sure you all wonder what the improvements are, so without further delay, let's have a look at the results of some benchmarks we have done on the labs tree. There has been several improvements that is not related to the binary log, so I will just focus on the results involving the binary log. In Figure 6 you see a benchmark comparing the 5.6.5 release with the 5.6 June labs release using the binary log. These benchmarks were executed on an 2.00 GHz 48-core Intel® Xeon® 7540 with 512 GiB memory and using SSD disks.

As you can see, the throughput has increased tremendously, with increases ranging from a little less than 2 and approaching 4 times that of 5.6.5. To a large extent, the improvements are in the server itself, but what is interesting is that with binary log group commit, the server is able to keep the pace. Even with sync_binlog=0 on 5.6.5 and sync_binlog=1 on the 5.6 labs release, the 5.6 labs release outperforms 5.6.5 by a big margin.

Another interesting aspect is that even with sync_binlog=1 the server performs nearly as well when using sync_binlog=0. On higher number of connections (roughly more than 50), the difference in throughput is varying between 0% [sic] and with a more typical throughput between 5% and 10%. However, there is a drop of roughly 20% in the lower range. This looks very strange, especially in the light that the performance is almost equal in the higher range, so what is causing that drop and is there anything that can be done about it?

Figure 7. Benchmark of Binary Log Group Commit
The answer comes from some internal benchmarks done while developing the feature. For these tests we were using Sysbench on a 64-core Intel® Xeon® X7560 running at 2.27GHz with 126 GB memory and a HDD.

In the benchmarks that you can see in Figure 7 the enhanced version of 5.6 with and without the binary log group commit is compared. The enhanced version of 5.6 include some optimizations to improve performance that are not available in the latest 5.6 DMR, but most are available in the labs tree. However, these are benchmarks done while developing, so it is not really possible to compare them with Figure 6 above, but it will help understand why we have a 20% drop at the lower number of connections.

The bottom line in Figure 7 is the enhanced 5.6 branch without binary log group commit and using sync_binlog=1, which does not scale very well. (This is nothing new, and is why implementing binary log group commit is a good idea.) Note that even at sync_binlog=0 the staged commit architecture scale better than the old implementation. If you look at the other lines in the figure, you can see that even when the enhanced 5.6 server is running with sync_binlog=0, the binary log group commit implementation outperforms the enhanced 5.6 branch at roughly 105 simultaneous threads with sync_binlog=1.

Also note that the difference between sync_binlog=1 and sync_binlog=0 is diminishing as the number of simultaneous connections is increased, to vanish completely at roughly 160 simultaneous connections. We haven't made a deep analysis of this, but while using the performance schema to analyze the performance of each individual stage, we noted that the sync time was completely dominating the performance (no surprise there, just giving the background), and that all available transactions "piled up" in the sync stage queue. Since each connection can at most have one ongoing transaction, it means that at 32 connections, there can never be more than 32 transactions in the queue. As a matter of fact, one can expect that over a long run, roughly half of the connections are in the queue and half of the connections are inside the sync stage (this was also confirmed in the measurements mentioned above), so at lower number of connections it is just not possible to fill the queue enough to utilize the system efficiently.

The conclusion is that reducing the sync time would probably make the difference between sync_binlog=0 and sync_binlog=1 smaller even on low number of connections. We didn't do any benchmarks using disks with battery-backed caches (which should reduce the sync time significantly, if not entirely eliminates it), but it would be really interesting to see the effect of that on performance.

Summary and closing remarks

  • The binary logging code has been simplified and optimized, leading to improved performance even when using sync_binlog=0.
  • The prepare_commit_mutex is removed from the code and instead the server orders transactions correctly.
  • Transactions can be written and committed as groups without losing any transactions, giving around 3 times improvement in performance on both sync_binlog=1 and sync_binlog=0.
  • The difference between sync_binlog=0 and sync_binlog=1 is small and reduces as the load increases on the system.
  • Existing storage engines benefit from binary log group commit since there are no changes to the handlerton interface.
Binary log group commit is one of a range of important new enhancements to replication in MySQL 5.6, including global transaction IDs (GTIDs), multi-threaded slave, crash safe slave and binary log, replication event checksums, and some more. You can learn more about all of these from our DevZone article:
dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html

You can also try out binary log group commit today by downloading the latest MySQL 5.6 build that is available on labs.mysql.com

Related links

  • It all started with this post where Mark points out the problem and show some results of their implementation.
  • The next year, Kristian Nielsen implemented binary log group commit for MariaDB and has a lot of good posts on the technical challenges in implementing it. This implementation is using an atomic queue and does flushing and syncing of the transactions as a batch, after which the sessions are signalled in order and commit their transactions.

9 comments:

Mark Callaghan said...

This sound excellent but I don't understand how the leader decides to wait (or not wait) during the flush stage. Can you provide more detail on that? I know it is more complex than always waiting for the timeout to expire as that would make singled-threaded workloads slow.

Do you use the number of threads currently doing InnoDB prepare to help decide whether the flush stage leader should wait?

Mats Kindahl said...

Hi Mark,

I assume that you're referring to the max queue time in the flush stage. The flush stage leader doesn't wait at any point, but if new transactions keep queuing up in the queue it is necessary to stop skimming of transactions from the flush queue to prevent a high latency of those transactions that have been flushed and are waiting for the leader to move over to the sync stage. This is what the max flush queue time is used for: to prevent the leader from continue to skim of transactions.

Dathan Pattishall said...

Fantastic! I love learning new things. Great job!

Robert Hodges said...

Excellent work and long awaited by many of us. I'm looking forward to using sync_binlog=1 for all masters all the time.

I was a little confused in the paragraph under "Transactions Galore" where you discussed isolation. Does group commit introduce any new serialization issues that were not present in 5.5 and previous releases?

Also, your write-up implies there are no changes to binlog format. Is that the case?

Mats Kindahl said...

Dathan, Robert, Thanks!

Robert,

No,binary log group commit does not introduce any new serialization issues, there are just the old ones.

I'm not really talking about the binlog format in the post, but there are no changes to the binary log format done for implementing binary log group commit, but there are changes to the binary log format to implement other features, such as the multi-threaded slave.

I assume that you were actually asking about the handlerton interface. There are some new informational functions in the server that InnoDB uses to get information about the transactions, but no changes to the handlerton interface.

knielsen said...

> Existing storage engines benefit from binary log group commit since there
> are no changes to the handlerton interface.

Mats, you are not telling the full story ;-) Did you intend to keep it a
secret, hoping to beat the MariaDB group commit? :)

You extend the handler interface in two ways. You add
thd_requested_durability() so that InnoDB will not fsync() at commit();
without this, storage engines will have no group commit in their commit()
method. And 5.6 also adds handlerton->flush_logs() which is necessary for
XA recovery to work. This is basically equivalent to the commit_ordered() and
commit_checkpoint_request() methods that MariaDB adds, it is good.

I am really disappointed that you completely fail to mention my group commit
work in MariaDB. You basically ditched your original design and instead
re-implemented the MariaDB group commit algorithm. I think you even did a
Ph.D. at University, so you should know proper decency in regards to proper
attribution to prior work. Shame on you!

Mats Kindahl said...

I have no idea if I "beat" MariaDB, but one of the goals with the
implementation was to avoid changes to the handlerton interface so
that existing engines should not have to be changed to work.

Regarding the changes to the handlerton interface that you point to:
the thd_requested_durability is not part of the handlerton interface,
so storage engines do not have to be changed to take advantage of
binary log group commit. According to the Bazaar logs,
flush_logs handlerton member was added by "Brian" (not sure if it is
Brian A or Brian M) in 2006, so that is not really anything new in
5.6. Whether the storage engines themselves support group commit or
not is a completely different issue.

The design I discussed in earlier posts were partially implemented,
but shelved since there were implementation issues making it more
difficult to work well. Basically, the IO_CACHE does not support
pwrite very well, affecting performance, so decided on a traditional implementation to ensure that it got into 5.6.

MariaDB also have a binary log group implementation and I think
that is good since it offers users some alternatives on what to
choose.

I haven't kept up to date with the latest changes in MariaDB, but
AFAICT, the only similarities between the MariaDB implementation and
this implementation is that both use queues (a single queue in MariaDB), mutexes, and condition
variables. Apart from that, the two implementations are very
different. For example, the MariaDB implementation does not have any
stages, and commits are handled by the threads themselves instead of
batched.

Mats Kindahl said...

Kristian,

I added references to your blogs on binary log group commit (and to your blog) as well as to Mark's original post that started it all.

vishnurao said...

hi

i had one doubt...

is the 'binlog_cache' parameter related to the 'transaction cache' you mention ?

are both of them the same ?

http://dev.mysql.com/doc/refman/5.1/en/replication-options-binary-log.html#sysvar_binlog_cache_size