Comments on MySQL Musings: Crash-safe Replication

Anonymous (2015-11-18):
Is it necessary to set sync_binlog=1 on a master to make it a crash-safe master?
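For context, a minimal my.cnf sketch of the durability settings usually associated with a crash-safe master; the binary log base name is just an illustration, and whether both settings are worth their cost is exactly the trade-off debated further down in this thread:

    [mysqld]
    log-bin = master-bin
    # fsync the binary log on every commit
    sync_binlog = 1
    # fsync the InnoDB redo log on every commit
    innodb_flush_log_at_trx_commit = 1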
Rith (2013-04-18):
Hi Mats,

In the 5.6 GA version, neither of the tables has a Master_id column.
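As a point of reference, a sketch of how the table-based repositories are enabled and how their actual columns can be listed in 5.6; these variables are normally set in my.cnf, and changing them at runtime requires the slave threads to be stopped:

    -- enable the crash-safe, table-based repositories
    SET GLOBAL master_info_repository = 'TABLE';
    SET GLOBAL relay_log_info_repository = 'TABLE';

    -- list the columns the GA tables actually have
    SHOW COLUMNS FROM mysql.slave_master_info;
    SHOW COLUMNS FROM mysql.slave_relay_log_info;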
Mark Callaghan (2012-04-14):
Hooray. Thanks for making a lot of progress on replication.

Block commit as described by Robert Hodges might not be that different from running the slave with innodb_flush_log_at_trx_commit=2. In both cases you avoid fsync on every commit.
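A small sketch of the relaxed slave setting Mark refers to, assuming the slave can be rebuilt or re-synced from the master if the slave itself crashes:

    [mysqld]
    # write the redo log to the OS at every commit, but fsync it only about once per second
    innodb_flush_log_at_trx_commit = 2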
Luís Soares (2011-04-22):
Hello Kristian, many thanks for your feedback! Please allow me to correct your premises:

1. In the current feature preview available on labs, the slave does not do INTRA-transaction but rather INTER-transaction parallelization on apply.

2. Again, in the same feature preview available on labs, the slave apply procedure does not have a global lock, because different worker threads update their own rows in a different table: mysql.slave_workers_info. As such, binlog group commit is not at risk here...

I invite you to read a worklog and a blog entry on this matter:

- http://forge.mysql.com/worklog/task.php?id=5569
- http://d2-systems.blogspot.com/2011/04/mysql-56x-feature-preview-multi.html

If you find bugs (which is likely, as the current implementation is still a feature preview), or have more feedback, let us know about it. Again, thanks for taking the time to look into this!

Hello Robert, you have posted some good feedback as well, thanks! On applying multiple transactions at once to avoid the additional position row update: I don't think the update is the expensive part of it. Indeed you are removing those from the execution path, but more importantly, you are removing several expensive COMMIT operations. The latter, IMHO, is what makes the difference (and not so much the former).

Mats Kindahl (2011-04-20):
Kristian, I prefer to actually go over the design myself, including examining the alternative options, before I publish it. This gives me a chance to have good answers to the questions that undoubtedly arise. However, I do understand that YMMV.

You can find the current status of the binary log group commit work as WL#5223 (http://forge.mysql.com/worklog/task.php?id=5223), which includes the current tentative design. Note that it is still work in progress.

Kristian Nielsen (2011-04-19):
Mats, I am sorry you cannot comment on this. MySQL is one of the most important programs in the Free Software community, and it would really help if things like this were designed and discussed publicly, not in private mailing lists, IRC channels, etc.

I just spent a year removing the InnoDB prepare_commit_mutex, which lost us group commit for >5 years. So I worry if you introduce a new user-visible feature that by design re-introduces the prepare_commit_mutex (this time as a table lock). So whatever solution you end up with, public discussions or not, please at least make sure you understand the problem.

Mats Kindahl (2011-04-19):
Robert, I would love to come to Sardinia, but I'm not sure I will be able to. :)

Robert Hodges (2011-04-19):
@Mats, you and Kristian should come to Giuseppe's Open Database Camp in Sardinia next month and we can reflect on these matters together. It should be fun. :)

Mats Kindahl (2011-04-19):
Hi Kristian and Robert! It is quite interesting to read your comments. I cannot fully comment on what you've written, it's just too much, but I can add one comment regarding an issue that Kristian mentions.

Kristian, there is no difference between writing to a file and writing to a table w.r.t. the locked row and group commits. In both cases it is necessary to handle the group commit/batch commit, so this implementation does not change anything. There are several ways of handling that (I will elaborate on them at some other time).

Unfortunately, recovering the information from the binary log does not help, since it is the *slave* information that is not in sync. The information could be written to the binary log on the slave (is that what you mean?), but that would mean that the binary log would have to be enabled on all slaves, and that is not always the case.

Robert Hodges (2011-04-18):
@Kristian On the write overhead: there's definitely some cost to tracking your position. Imagine you have short auto-commit writes with no block commit. Updating a table to track position adds an extra write for each user update. This is a worst case that can double the data going into the log, hence implies additional disk I/O.

Disclaimer: I don't have benchmark numbers for this specific effect, just for block commit in general, which is a big win at least in our case. The actual impact could be quite variable depending on workload, fsync intervals, and hardware configuration.

Incidentally, I read with some interest your designs for parallelizing based on group commit. It will be interesting to see how the different approaches end up playing out in practice. Performance results in this area are sometimes a bit non-intuitive.
Anonymous (2011-04-18):
"I know the current project for parallel replication at MySQL is based on running one transaction at a time, but split in multiple threads."

Kristian, actually it's not like that. Consider another article, http://d2-systems.blogspot.com/2011/04/mysql-56x-feature-preview-multi.html, describing the MTS implementation. MTS runs multiple master transactions at a time (provided they fit the partitioning requirement), and that is supposed to make use of InnoDB group commit.

cheers,

Andrei Elkin

Kristian Nielsen (2011-04-18):
Robert,

The main problem with your suggestion is that it assumes a static, fixed assignment of events to apply threads. This is true for Tungsten replication, but not necessarily in the general case of parallel replication (see for example http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184 and http://askmonty.org/worklog/Server-RawIdeaBin/?tid=186 ).

On the other hand, the idea of replicating distinct databases individually (which I believe Tungsten uses? And I think I read MySQL is doing something similar) is an important one, so in my suggestion it would make sense to include the event group's database as the second component of the primary key; this way it is possible to obtain the current position for a given database/apply thread, as well as globally, depending on the replication method.

I still think a general user-visible facility should avoid including by-design global locks.

BTW, I am surprised you think the writes to the extra catalog table should incur significant overhead? It is just a single tiny table that should certainly fit in RAM, and one insert per commit. Remember, when we are talking crash-safe replication, we need to run with innodb_flush_log_at_trx_commit=1 and sync_binlog=1, and incur the overhead of fsync() on commits. It seems to me an in-memory insert will hardly be noticeable compared to this...
Robert Hodges (2011-04-17):
Hi Mats and Kristian. There is another solution to the global lock problem, namely to have a catalog table with a separate row for each apply thread and to access the rows only by their primary key. This avoids lock problems and allows parallel apply to proceed. Tungsten uses this approach to make parallel replication crash-safe.

The extra catalog table does have another problem Kristian did not mention, namely that you end up with a lot of extra writes on the table that tracks recovery position. Tungsten mitigates this problem with block commit: each apply thread applies multiple transactions on the slave at once whenever possible, which allows us to amortize the commit table write across several transactions.

These are two of the considerations that drove us to adopt the serialized shard design described in my blog (http://scale-out-blog.blogspot.com/2011/03/parallel-replication-using-shards-is.html).
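A rough SQL sketch of the two ideas Robert describes, using made-up table and column names rather than Tungsten's actual schema: each apply thread owns one row keyed by its id, and with block commit the position row is updated once per batch of applied transactions, inside the same slave-side transaction:

    -- one position row per apply thread, accessed only via the primary key
    CREATE TABLE apply_position (
      task_id     INT UNSIGNED NOT NULL PRIMARY KEY,
      source_log  VARCHAR(255) NOT NULL,
      source_pos  BIGINT UNSIGNED NOT NULL
    ) ENGINE = InnoDB;

    -- block commit: apply several replicated transactions, then record the
    -- new position once and commit everything together (values are placeholders)
    START TRANSACTION;
    -- ... apply the changes of transactions T1 .. Tn for this thread ...
    UPDATE apply_position
       SET source_log = 'master-bin.000042', source_pos = 1234567
     WHERE task_id = 3;
    COMMIT;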
Kristian Nielsen (2011-04-14):
It is very good to see work from MySQL@Oracle on making replication crash-safe!

Unfortunately, I see a big problem with this approach as it relates to possible future implementations of parallel replication, and I was wondering what your thoughts and plans are for this (hopefully you already thought of this and I am just missing the possible solution).

The problem is the UPDATE statement at the end of each transaction. This means every transaction needs to lock the same row! This makes it impossible to run two transactions in parallel on the slave.

I know the current project for parallel replication at MySQL is based on running one transaction at a time, but split in multiple threads. However, I do not believe this is sufficient for all future needs for scaling parallel replication.

More importantly, to make replication really crash-safe when using --log-slave-updates, it is necessary to run with --sync_binlog=1 and --innodb-flush-log-at-trx-commit=1. To do this with acceptable performance we must have group commit, as implemented in the Facebook patch and MariaDB. But group commit is again impossible with this feature; as all transactions need to lock the same row before the prepare step, only one transaction can commit at a time -> no group commit.

Have you checked MWL#188 (http://askmonty.org/worklog/Server-RawIdeaBin/?tid=188)? It describes another way to handle the crash safety, without introducing this lock contention problem. Basically, it recovers the information from the binlog on the slave, which is transactional due to the 2-phase commit.

Well, making the information available from SQL via system tables is a good idea; but it should be done without imposing locking of a global row for every commit. I think it could be done by recording the position with an INSERT rather than an UPDATE. Each transaction inserts a new row at the end, rather than updating the last one. Add a sequence number that is increased at every commit. Then to read the current position from SQL, just SELECT ... ORDER BY sequence_number DESC LIMIT 1.

(A background thread can batch-delete old entries from time to time.)

This way, you avoid transactions contending with one another on the single row, which I think is really important for future development.
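A minimal sketch of the insert-based scheme Kristian outlines; the table, column names, and values are made up for illustration:

    CREATE TABLE slave_position_log (
      seq_no      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      master_log  VARCHAR(255) NOT NULL,
      master_pos  BIGINT UNSIGNED NOT NULL
    ) ENGINE = InnoDB;

    -- run as the last statement of each applied transaction; an insert does
    -- not contend on a shared row the way an UPDATE of a single row would
    INSERT INTO slave_position_log (master_log, master_pos)
    VALUES ('master-bin.000042', 1234567);

    -- read the current position
    SELECT master_log, master_pos
      FROM slave_position_log
     ORDER BY seq_no DESC
     LIMIT 1;

    -- batch-delete old entries from a background thread (the cut-off is a placeholder)
    DELETE FROM slave_position_log WHERE seq_no < 100000;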