Tuesday, April 12, 2011

Crash-safe Replication

A common request is to make replication crash-safe, in the sense that the replication progress information is always in sync with what has actually been applied to the database, even in the event of a crash. Although transactions are not lost if the server crashes, it could require some tweaking to bring the slaves up again.

In the latest MySQL 5.6 milestone development release, the replication team has implemented crash-safety for the slave by adding the ability to commit the replication information together with the transaction (see Figure 1). This means that the replication information will always be consistent with what has been applied to the database, even in the event of a server crash. Also, some fixes were made on the master to ensure that it recovers correctly.

If you're familiar with replication, you know that the replication information is stored in two files: master.info and relay-log.info. Updates of these files are arranged so that they happen after the transaction has been applied. This means that if you have a crash between the transaction commit and the update of the files, the replication progress information will be wrong. In other words, a transaction cannot be lost this way, but there is a risk that a transaction could be applied a second time. The usual way to avoid this is to have a primary key on all your tables. In that case, a repeated update of a table causes the slave to stop, and you have to use SQL_SLAVE_SKIP_COUNTER to skip the transaction and get the slave up and running again. This is better than losing a transaction, but it is nevertheless a nuisance. Removing the primary key to prevent the slave from stopping only solves the problem partially: the transaction is then applied twice, which both places a burden on the application to handle duplicate entries and requires the tables to be cleaned regularly. Both of these approaches require either manual intervention or scripting support. This does not affect reliability, but everything is much easier to handle if the replication information is committed in the same transaction as the data being updated.
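For reference, the usual recovery sequence after the slave stops on such a duplicate-key error looks something like the sketch below (assuming the duplicated transaction is a single event group; with a multi-statement event group the counter skips the whole group):

```sql
-- On the stopped slave: skip the event group that was already applied
-- before the crash, then resume replication.
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
```

After this, SHOW SLAVE STATUS should report the SQL thread running again.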

Crash-safe masters

Two problems related to crash-safe replication have been fixed in the master, both of which could cause some annoyance when the master recovered.
  • If the master crashed while a binary log was being rotated, it was possible that some orphan binlog files ended up in the binary log index file. This was fixed in 5.1 but is also a piece of the puzzle of crash-safe replication.
  • Writing to the binary log is not an atomic operation, so if a crash occurred while writing to the binary log, there was a possibility of a partial event at the end of the binary log.

    Now, the master recovers from this by truncating the binary log to the last known good position, removing the partially written transaction and rolling back the outstanding transactions in the storage engines.

Figure 1. Moving position information update into transaction

Crash-safe slaves

Several different solutions for implementing crash-safety—or transactional replication, as it is sometimes called—have been proposed, with Google's TransactionalReplication patch being the best known. That patch stores the replication positions in the InnoDB transaction log, but the MySQL replication team decided instead to implement crash-safety by moving the replication progress information into system tables. This is a more flexible solution and has several advantages over storing the positions in the InnoDB transaction log:
  • If the replication information and the data are stored in the same storage engine, both the data and the replication position can be updated in a single transaction, which means that the pair is crash-safe.
  • If the replication information and the data are stored in different storage engines that both support XA, they can still be committed as a single transaction.
  • The replication information is flushed to disk together with the transaction data. Hence, writing the replication information directly to the InnoDB redo log would offer no speed advantage, while keeping it in tables does not prevent the user from reading the replication progress information easily.
  • The tables can be read from a normal session using SQL commands, which also means that the information can be incorporated into such things as stored procedures and stored functions.
Table 1. slave_master_info
Field                     Line in file    Slave status column
Master_id
Number_of_lines           1
Master_log_name           2               Master_Log_File
Master_log_pos            3               Read_Master_Log_Pos
Host                      4               Master_Host
User_name                 5               Master_User
User_password             6
Port                      7               Master_Port
Connect_retry             8               Connect_Retry
Enabled_ssl               9               Master_SSL_Allowed
Ssl_ca                    10              Master_SSL_CA_File
Ssl_capath                11              Master_SSL_CA_Path
Ssl_cert                  12              Master_SSL_Cert
Ssl_cipher                13              Master_SSL_Cipher
Ssl_key                   14              Master_SSL_Key
Ssl_verify_server_cert    15              Master_SSL_Verify_Server_Cert
Heartbeat                 16
Bind                      17              Master_Bind
Ignored_server_ids        18              Replicate_Ignore_Server_Ids
Uuid                      19              Master_UUID
Retry_count               20              Master_Retry_Count
In addition to giving us crash-safe slaves, the last of these advantages should not be taken lightly. Being able to handle replication from pure SQL puts some of the key features in the hands of application developers.
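For example, a monitoring script or stored routine could read the applied position directly from a normal session (a sketch; the table and column names are the ones listed in Table 2 below):

```sql
-- Current applied position, readable from any ordinary session:
SELECT Master_log_name, Master_log_pos
  FROM mysql.slave_relay_log_info;
```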

As previously mentioned, the replication information is stored in two files:

master.info
This file contains information about the connection to the master—such as hostname, user, and password—but also information about how much of the binary log has been transferred to the slave.
relay-log.info
This file contains information about the current state of replication, that is, how much of the relay log has been applied.
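The connection metadata kept in master.info originates from the CHANGE MASTER TO command; a typical sketch, with placeholder host and credentials, would be:

```sql
-- Host, user, password, and port below are placeholders:
CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_PORT = 3306;
```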

Options to select replication information repository

In order to make the solution flexible, we introduced a general API for adding replication information repositories. This means that we can support multiple types of repositories for replication information, but currently only two are supported: the old system using the files master.info and relay-log.info, and the new system using the tables slave_master_info and slave_relay_log_info. To select the type of repository to use, two new options were added. These options are also available as server variables.
master_info_repository
The type of repository to use for the master info data seen in Table 1.
relay_log_info_repository
The type of repository to use for the relay log info seen in Table 2.
Both of these variables can be set to either FILE or TABLE. If a variable is set to TABLE, the new table-based system is used; if it is set to FILE, the old file-based system is used. The default is FILE, so make sure to set the value if you want to use the table-based system.
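For example, a server could be configured to use the table-based repositories with a my.cnf fragment like this (a minimal sketch using the two options introduced above):

```
[mysqld]
master_info_repository    = TABLE
relay_log_info_repository = TABLE
```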

Table 2. slave_relay_log_info
Field              Line in file    Slave status column
Master_id
Number_of_lines    1
Relay_log_name     2               Relay_Log_File
Relay_log_pos      3               Relay_Log_Pos
Master_log_name    4               Relay_Master_Log_File
Master_log_pos     5               Exec_Master_Log_Pos
Sql_delay          6               SQL_Delay
In Table 1 and Table 2 you can see the column names used for the tables, as well as the line number in the corresponding file and the column name in the output of SHOW SLAVE STATUS. When using tables, the column names are used for storing the data; when using a file, the column names only serve to identify the correct row to update, and the value is stored at the line number given in the table.

The format of the tables has been extended with one field that is not present in the files: the Master_id field. We added it to make it possible to extend the server to track multiple masters. Note that we currently have no definite plans to add multi-source support, but as good engineers we do not want these tables to be a hindrance to adding it.

Selecting replication repository engine

In contrast with most of the system tables in the server, the replication repositories can be configured to use any storage engine you prefer. The advantage of this is that you can use the same engine for the replication repositories as for the data you're managing. If you do that, both the data and the replication information will be committed as a single transaction.

The new tables are created at installation by the mysql_install_db script, as usual, and the default engine for these tables is the same as for all system tables: MyISAM. As you know, MyISAM is not very transactional, so it is necessary to switch these tables to InnoDB if you really want crash-safety. To change the engine for these tables, just use a normal ALTER TABLE.

slave> ALTER TABLE mysql.slave_master_info ENGINE = InnoDB;
slave> ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;
Note that this works for these tables because they were designed to allow any storage engine, but it does not mean that you can change the storage engine for other system tables and expect it to work.
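To double-check the result, the engines of the repository tables can be inspected through INFORMATION_SCHEMA (a standard query, nothing specific to this feature):

```sql
-- Verify that the repository tables now use a transactional engine:
SELECT TABLE_NAME, ENGINE
  FROM INFORMATION_SCHEMA.TABLES
 WHERE TABLE_SCHEMA = 'mysql'
   AND TABLE_NAME IN ('slave_master_info', 'slave_relay_log_info');
```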

Event processing

This implementation of crash-safe slaves works naturally with both statement-based and row-based replication, and nothing special needs to be done in the normal cases. However, these tables interact with the normal processing in slightly different ways.

To understand how transactions are processed by the SQL thread, let us consider the following example transaction:

START TRANSACTION;
INSERT INTO articles(user, title, body)
      VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
COMMIT;
This transaction will be written to the binary log, sent over to the slave, and written to the relay log in the usual way. Once it is read from the relay log for execution, it will be executed as if an update statement were added to the end of the transaction, just before the commit:
START TRANSACTION;
INSERT INTO articles(user, title, body)
      VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
UPDATE mysql.slave_relay_log_info
   SET Master_log_pos = @@Exec_Master_Log_Pos,
       Master_log_name = @@Relay_Master_Log_File,
       Relay_log_name = @@Relay_Log_File,
       Relay_log_pos = @@Relay_Log_Pos;
COMMIT;
In this example, there are a number of pseudo-server variables (that is, they don't exist for real) that have the same names as the corresponding fields in the result set from SHOW SLAVE STATUS. As you can see, the update of the position information is now inside the transaction and will be committed with it, so if both articles and mysql.slave_relay_log_info are in the same transactional engine, they will be committed as a unit.

This works well for the SQL thread, but what about the I/O thread? There are no transactions executed here at all, so when is the information in this table committed?

Since a commit to the table is expensive—in the same way that syncing a file to disk is expensive when files are used as the replication information repository—the slave_master_info table is not updated for each processed event. Depending on the value of sync_master_info, there are a few alternatives.

If sync_master_info = 0
In this case, the slave_master_info table is only updated when the slave starts or stops (for any reason, including errors), when the relay log is rotated, or when you execute a CHANGE MASTER command.
If sync_master_info > 0
The slave_master_info table is updated every sync_master_info events.
This means that while the slave is running, you cannot really see how much data has been read by the slave without stopping it. If it is important to watch the slave's progress in reading events from the master, you have to set sync_master_info to some non-zero value, but you should be aware that there is a cost associated with doing this.
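For example, to commit the I/O thread position once per 100 events rather than never while running (the right value is a workload-dependent trade-off):

```sql
-- Commit the master info once every 100 events; the position read while
-- the slave is running may therefore lag up to 100 events behind.
SET GLOBAL sync_master_info = 100;
```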

This does not usually pose a problem, since the occasions when you need to read the master replication information while replication is running are few and far between. It is much more common to read it when the slave has stopped for some reason: to figure out where the error is or to perform a master fail-over.

Closing remarks

We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try it out for yourself, you can download the MySQL 5.6 Milestone Development Release, where all this is implemented, from the MySQL Developer Zone (dev.mysql.com). If you want more details, the section Slave Status Logs in the MySQL 5.6 reference manual will provide you with all the information. This is one of the features that was presented by Lars Thalmann on April 11, 2011 (yesterday) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and on April 12, 2011 (today) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.

13 comments:

Kristian Nielsen said...

It is very good to see work from MySQL@Oracle on making replication
crash-safe!

Unfortunately, I see a big problem with this approach as relates to possible
future implementations of parallel replication, and I was wondering what your
thoughts and plans are for this (hopefully you already thought of this and I
am just missing the possible solution).

The problem is the UPDATE statement at the end of each transaction. This means
every transaction needs to lock the same row! This makes it impossible to run
two transactions in parallel on the slave.

I know the current project for parallel replication at MySQL is based on
running one transaction at a time, but split in multiple threads. However I do
not believe this is sufficient for all future needs for scaling parallel
replication.

More importantly, to make replication really crash-safe when using
--log-slave-updates, it is necessary to run with --sync_binlog=1 and
--innodb-flush-log-at-trx-commit=1. To do this with acceptable performance we
must have group commit, as implemented in the Facebook patch and MariaDB. But
group commit is again impossible with this feature; as all transactions need
to lock the same row before the prepare step, only one transaction can commit
at a time -> no group commit.

Have you checked MWL#188
(http://askmonty.org/worklog/Server-RawIdeaBin/?tid=188)? It describes another
way to handle the crash safety, without introducing this lock contention
problem. Basically, it recovers the information from the binlog on the slave,
which is transactional due to the 2-phase commit.

Well, making the information available from SQL from system tables is a good
idea; but it should be done without imposing locking of a global row for every
commit. I think it could be done by updating the position with an INSERT
rather than an update. Each transaction at the end inserts a new row, rather
than update the last one. Add a sequence number that is increased at every
commit. Then to read the current position from SQL, just SELECT ... ORDER BY
sequence_number DESC LIMIT 1.

(A background thread can batch delete old entries from time to time).

This way, you avoid transactions contending one another on the single row,
which I think is really important for future development.
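A sketch of this INSERT-based scheme, with invented table and column names, could look like:

```sql
-- Hypothetical position table; each committing transaction appends a row
-- instead of updating a single shared one, avoiding the row-lock hotspot.
CREATE TABLE slave_positions (
    seq_no          BIGINT PRIMARY KEY,  -- increases with every commit
    master_log_name VARCHAR(512),
    master_log_pos  BIGINT
) ENGINE = InnoDB;

-- The current position is the row with the highest sequence number:
SELECT master_log_name, master_log_pos
  FROM slave_positions
 ORDER BY seq_no DESC LIMIT 1;
```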

Robert Hodges said...

Hi Mats and Kristian. There is another solution to the global lock problem, namely to have a catalog table with a separate row for each apply thread and to access the rows only by their primary key. This avoids lock problems and allows parallel apply to proceed. Tungsten uses this approach to make parallel replication crash-safe.

The extra catalog table does have another problem Kristian did not mention, namely that you end up with a lot of extra writes on the table that tracks recovery position. Tungsten mitigates this problem with block commit--each apply thread applies multiple transactions on the slave at once whenever possible, which allows us to amortize the commit table write across several transactions.

These are two of the considerations that drove us to adopt to the serialized shard design described in my blog. (http://scale-out-blog.blogspot.com/2011/03/parallel-replication-using-shards-is.html)

Kristian Nielsen said...

Robert,

The main problem with your suggestion is that it assumes a static, fixed assignment of events to apply threads. This is true for Tungsten replication, but not necessarily in the general case of parallel replication (see for example http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184 and http://askmonty.org/worklog/Server-RawIdeaBin/?tid=186 ).

On the other hand, the idea of replicating distinct databases individually (which I believe Tungsten uses? And I think I read MySQL is doing something similar) is an important one, so in my suggestion it would make sense to include the event group database as the second component for the primary key; this way it is possible to obtain current position for given database/apply thread, as well as globally, depending on the replication method.

I still think we should avoid in a general user-visible facility to include by-design global locks.

BTW, I am surprised you think the writes to the extra catalog table should incur significant overhead? It is just a single tiny table that should certainly fit in RAM, and one insert per commit. Remember, when we are talking crash-safe replication, we need to run with innodb_flush_log_at_trx_commit=1 and sync_binlog=1, and incur the overhead of fsync() on commits. It seems to me an in-memory insert will hardly be noticeable compared to this ...

Anonymous said...

``I know the current project for parallel replication at MySQL is based on
running one transaction at a time, but split in multiple threads.''

Kristian, actually it's not like that.
Consider another article
http://d2-systems.blogspot.com/2011/04/mysql-56x-feature-preview-multi.html
describing MTS implementation.
MTS runs multiple master transactions at a time (provided they fit the partitioning requirement), and that is supposed to make use of InnoDB group commit.

cheers,

Andrei Elkin

Robert Hodges said...

@Kristian On the write overhead--there's definitely some cost to tracking your position. Imagine you have short auto-commit writes with no block commit. Updating a table to track position adds an extra write for each user update. This is a worst case that can double the data going into the log, hence implies additional disk I/O.

Disclaimer: I don't have benchmark numbers for this specific effect, just for block commit in general, which is a big win at least in our case. The actual impact could be quite variable depending on workload, fsync intervals, and hardware configuration.

Incidentally I read with some interest your designs for parallelizing based on group commit. It will be interesting to see how the different approaches end up playing out in practice. Performance results in this area are sometimes a bit non-intuitive.

Mats Kindahl said...

Hi Kristian and Robert! It is quite interesting to read your comments. I cannot fully comment on what you've written, it's just too much, but I can add one comment regarding one issue that Kristian mentions.

Kristian, there is no difference between writing to a file and to a table w.r.t. the locked row and group commits. In both cases it is necessary to handle the group commit/batch commit, so this implementation does not change anything. There are several ways of handling that (I will elaborate on them at some other time).

Unfortunately, recovering the information from the binary log does not help since this is the *slave* information that is not in sync. The information could be written to the binary log on the slave (is that what you mean?), but that would mean that the binary log would have to be enabled for all slaves, and that is not always the case.

Robert Hodges said...

@Mats, you and Kristian should come to Giuseppe's Open Database Camp in Sardinia next month and we can reflect on these matters together. It should be fun. :)

Mats Kindahl said...

Robert, I would love to come to Sardinia, but I'm not sure I will be able to. :)

Kristian Nielsen said...

Mats, I am sorry you cannot comment on this. MySQL is one of the most important programs in the Free Software community, and it would really help if things like this were designed and discussed publicly, not in private mailing lists, IRC channels, etc, etc.

I just spent a year removing the InnoDB prepare_commit_mutex, which lost us group commit for >5 years. So I worry if you introduce a new user-visible feature that by design re-introduces the prepare_commit_mutex (this time as a table lock). So whatever solution you end up with, public discussions or no, please at least make sure you understand the problem.

Mats Kindahl said...

Kristian, I prefer to actually go over the design myself, including examining the alternative options, before I publish it. This gives me a chance to have good answers to the questions that undoubtedly arise. However, I do understand that YMMV.

You can find the current status of binary log group commit in WL#5223, which includes the current tentative design. Note that it is still work in progress.

Luís Soares said...

Hello Kristian, many thanks for your feedback! Please, allow me to correct your premises:

1. In the current feature preview available on labs, the slave does not do INTRA-transaction but rather INTER-transaction parallelization on apply.

2. Again, in the same feature preview available on labs, the slave apply procedure does not have a global lock, because different worker threads update their own row on a different table: mysql.slave_workers_info. As such binlog group commit is not at risk here...

I invite you to read a worklog and a blog entry on this matter:

- http://forge.mysql.com/worklog/task.php?id=5569
- http://d2-systems.blogspot.com/2011/04/mysql-56x-feature-preview-multi.html

If you find bugs (which is likely, as the current implementation is yet a feature preview), or have more feedback, let us know about it. Again, thanks for taking time to look into this!

Hello Robert, you have posted some good feedback as well, thanks! On applying multiple transactions at once to avoid the additional position row update, I don't think that the update is the expensive part of it. Indeed you're removing those from the execution path, but more importantly, you're removing several expensive COMMIT operations. The latter, IMHO, is what makes the difference (and not so much the former).

Mark Callaghan said...

Hooray. Thanks for making a lot of progress on replication.

Block commit as described by Robert Hodges might not be that different from running the slave with innodb_flush_log_at_trx_commit=2. In both cases you avoid fsync on every commit.

Rith said...

Hi Mats,

In the 5.6 GA version, neither of the tables has Master_id.