Wednesday, August 18, 2010

Binary Log Group Commit - Recovery

It was a while since I wrote the previous article, but the merging of Oracle and Sun here resulted in quite a lot of time having to be spent on attending various events and courses for legal reason (one of the reasons I prefer working for smaller companies) and together with a summer vacation spent on looking over the house, there were little time for anything else. This is the second post of three, and in the last one I will cover some optimizations that improves performance significantly.

In the previous article, an approach was outlined to handle the binary log group commit. The basic idea is to use the binary log as a ticketing system by reserving space in it for the transactions that are going to be written. This will provide an order on the transactions as well as allowing writing the transactions in parallel to the binary log, thereby boosting performance. As noted in the previous post, a crash while writing transactions to the binary log requires recovery. To understand what needs to be changed, it is necessary to understand how the structure of the binary log as well as how recovery after a crash works currently together with the implementation of 2-phase commit that MySQL uses.

Figure 1. Binlog file structure

A quick intro to the structure of the binary log

Figure 1 gives the rough structure of the binary log with a set of binlog files and an binlog index file. The binlog index file just list the binlog files that makes up the binary log, while each binlog file have the real contents of the binary log that you can see when executing a SHOW BINLOG EVENTS.

Each binlog file consists of a sequence of binlog events, where the most important events from our perspective is the Format description event. In addition, each binlog file is also normally terminated by a Rotate event that refers to the next binlog file in the sequence.

The Format description event is used to describe the contents of the binlog file and therefore contain a a lot of information about the binlog file. In this case we are interested in a special flag called LOG_EVENT_BINLOG_IN_USE_F, which is used to tell if the binlog is actively being written by the server. When the server opens a new binlog file, this flag is set to indicate that the file is in use, and when the binary log is rotated and a new binlog file created, this flag is cleared when closing the old binlog file.

In the event of a crash, the flag will therefore be set and the server can see that the file was not closed properly and start with performing recovery.

Recovery and the binary log

When recovering, the server has to find all transactions that were partially executed and decide if they are going to be rolled back or committed properly. The deciding point when a transaction will be committed instead of rolled back is when the transaction has been written to the binary log. To do this, the server has to find all transactions that were written to the binary log and tell all storage engines to commit these transactions.

The recovery procedure is executed when the binary log is opened—which the server does calling TC_LOG_BINLOG::open during startup. When the binary log is opened, recovery is done if the last open binlog file was not closed properly. An outline of the procedure executed is:

  1. Open the binlog index file and go through it to find the last binlog file mentioned there [TC_LOG_BINLOG::open]
  2. Open this binlog file and check if the LOG_EVENT_BINLOG_IN_USE_F flag is set
  3. If the flag was clear, then the server stopped properly and no recovery is necessary. Otherwise, the server did not stop properly and recovery starts by calling.
  4. The last binlog file is now open, so the entire binlog file is scanned and the XID of each each Xid event is recorded. These XIDs denote the transactions that were properly written to the binary log—that is, the transactions that shall be committed [TC_LOG_BINLOG::recover].
  5. Each storage engine is handed the list of XIDs of transactions to commit through the handlerton::recover interface function [ha_recover].
  6. The storage engine will then commit each transaction in the list and roll back all the others.
Figure 2. Parallel binary log group commit

So, what's the problem?

The procedure above works fine, so what are the problems we have to solve to implement the procedure described in the previous article? If you look in Figure 2, you have a hint to what is the problem.

Now, assume that thread 1, 2, and 3 in Figure 2 is writing transactions to disk (starting at positions Trans_Pos1, Trans_Pos2, and Trans_Pos3 respectively) and that a preceding thread (a thread that got a binlog position before Last_Complete) decides that it is time to call fsync to group commit the state this far. The binlog file will then be written in this state—where some transactions are partially written—and Last_Committed will be set to the value of Last_Complete, leading to the situation depicted in Figure 2.

As you can see in the figure, thread 2 has already finished writing data to the binary log and is therefore written to durable storage. Since thread 1—which precedes thread 2 in the binary log—has not completed yet, thread 2 has not yet committed and is still waiting for all the preceding transactions to complete. If a crash occurs in this situation, it is necessary to somehow find the XID of all transactions that have committed—excluding the transaction that thread 2 has completed—and commit them to the storage engine when recovering.

A proposal for a new recovery algorithm

In the original algorithm, the scan of the binlog file stopped when the file ended, but since there can be partially written events in the binlog file after the "real" end of the file (the binlog file ends logically at Last_Committed/Last_Complete), so we have to find some other way to detect the logical end of the file.

To handle this, it is necessary to somehow mark events that are not yet committed so that the recovery algorithm can find the correct position where the binlog file ends. The same problem occurs if one wants to persist the end of the binlog file preallocating the binlog file. There are basically three ways to handle this:

  • Write the end of the binlog file in the binlog file header (that is, the Format description log event).
  • Mark each event by zeroing out a field that cannot be zero—for example, the length, the event type, or event position—before writing the event to the binary log. Then write this field with the correct value after the entire event has been written.
  • Checksum the events and find the end of the worklog by scanning for the first event with an incorrect checksum.
Write the length in the binlog file header
Finding the length of the binlog in this case is easy: just inspect the header and find the length of the binlog file there. In this case, it is necessary to update the length after the event has been written since there may be an fsync call at any time between starting to write the event data and finishing writing the event. Normally, this means updating two block of the file for each event written, which can be a problem since it requires at least the block containing the header and all the blocks that was written since the last group commit to be written when calling fsync. If a large number of events is written between each fsync, this might not impose a large penalty, but if sync-binlog=1 it might become quite expensive. Some experiments done by Yoshinori showed a drop from 15k events/sec to 10k events/sec, which means that we lose one third in performance.

Digression. The measurements that Yoshinori did consisted of one pwrite to write the event, one pwrite to write the length to the header and then a call to fsync. It is, in other word, most similar to using sync_binlog=1. In reality, however, this will not be the case since a user that is using the binary log group commit will have several events written between each call to fsync. Since these writes will be to memory (the file pages are in memory), performance will not drop as much. To evaluate the behavior for a group commit situation better, writing 10 events at a time was compared as well (pretending to be sync_binlog=10). Straight append (using write) gave at that point 110k events/sec and write to the header before calling fsync gave 80k events/sec. This means a performance reduction of 27%, which is an improvement but still a very large overhead.

Use a marker field
The second alternative is to use one of the fields as a marker field. By setting one of the fields that cannot be zero to zero, it is possible to detect that the event is incorrect and stop at the event before that. Good candidates as fields is the length—which cannot be zero for any event and is four bytes—and the event type, which is one byte and where zero denotes an unknown event and never occurs naturally in a binlog file. The technique would be to first blank out the type field of the event, write the event to the binlog file, and then use pwrite to fill in the correct type code after the entire event is written. If an fsync occurs before the event type is written, the event will be marked as unknown and if a crash occurs before the event is completely written (and written to disk), it will be possible to scan the binlog file to find the first event that is marked as unknown. In order for this technique to work, it is necessary to zero the unused part of the binlog file before starting to write anything there (or at least zero out the event type). Otherwise, crash recovery will not be able to correctly detect where the last completely written event is located.

Compared to the previous approach, this does not require writing to locations far apart (except in rare circumstances when the event spans two pages). It also has the advantage of not requiring any change of the binlog format. This technique is likely to be quite efficient. (Note that most of the writes will be to memory, so there will not be any extraneous "seeks" over the disk to zero out parts of the file.)

Checksum on each event
The third alternative is to rely on an event checksum to detect events that are incompletely written. This approach is by far the most efficient of the approaches since the event checksum is naturally written last. It also has the advantage of not requiring the unused parts of the binlog file to be zeroed since it is unlikely that the checksum will be correct for the event unless the event has been fully written. This also makes it a very good candidate for detecting the end of the binlog file when preallocating the binlog file. The disadvantage is, of course, that it requires checksums to be enabled and implemented.
With this in mind, the best approach seems to be to checksum each event and use that to detect the end of the binary log. If necessary, the second approach can be implemented when the binlog is not checksummed.

The next article will wrap up the description by pointing out some efficiency issues and how to solve them to get an efficient implementation.

2 comments:

knielsen said...

Thanks, Mats, for an interesting follow-up!

The main problem with relying on the checksum is of course that you leave some
probability (however small) that a partial event by chance happens to have a
valid checksum.

With respect to the "marker field", how about putting the field at the very
end of the event? Then there is no need to write zero first and overwrite with
non-zero last, I believe this is a common technique. This could be combined
with checksums (which are desirable anyway, after all), byt putting a checksum
that cannot be zero at the end of each event.

There is still the need to zero out the unused part of the binary log before
writing. However, pre-allocationg the binlog file is a needed improvement
anyway. The current approach, where every event write increases the file
length, presents the file system with the exact same problem that the binlog
has, but with fewer options for solving it. This means that an fdatasync() of
writes that extends a file is inherently more expensive than an fdatasync() of
writes inside a file. This is also seen in real benchmarks of --sync-binlog.

Mats Kindahl said...

Yes, there is a slight probability of having a valid checksum, but for a CRC-32 this is 1/(2^32). If you have a server that crashes once every second, this means that you will have one incorrect computation of the end of the binlog every 136 years(!), so I'm not sure one should be overly concerned about that.

It might be possible to place the marker field last, but I'm a little concerned how that affects "old" slaves (before the change). It is quite clear that it cannot handle recovery in the binlog anyway, but if we do not put "garbage" last in the event, it might be possible to develop a utility that truncates the binlog for correct recovery with old servers. In contrast to the common header and the post-header, there is no length byte for the body of the event, so it is not possible to add "footers" to events.

I'm working with Yoshinori to combine the problem of making the length of the binlog file persistent when the binlog files are pre-allocated, so whatever solution we pick will work for both that and the group commit implementation.