Friday, September 15, 2006

Replication and the disappearing statements

After reading Todd Huss blog about the gotcha when using statement-based replication, where statements can "disappear" (that is, not be applied to the slave database), I believe that I can shed some light on the reason for this behavior.

Before that, some background.

Traditionally, MySQL has been using what is called statement-based replication. Statement-based replication replicates the changes to the slave by sending the actual statement that was executed on the master over to the slave, and the slave subsequently executes that statement. Of course, only statements that change something will be sent to the slave.

Sometimes, you don't want to send all changes to the slaves. So therefore it is possible to prevent the master from sending changes to some databases using the --binlog-do-db and --binlog-ignore-db switches, which will allow you to filter out statements that updates certain databases (this is not the whole story, more about that filtering later).

This works well for most queries, such as:

INSERT INTO products   SET name='Gizmo2000', price='$2000'
But suppose that we have two databases db1 and db2 and we decide to not replicate changes to db1 but will replicate changes to db2. Now, consider the following statement:
SET =, =;
The statement updates both db1 and db2, so shall we replicate it or shall we not? Since we need to handle even this situation in a consistent manner, the current database is used to decide if the statement shall be replicated or not. (I didn't actually write the code, since it pre-dates me beginning at MySQL, but after being immersed in the code for almost two years, I'm pretty sure this is the reason.) This works well for most users, since one usually work with one database only, and set the current database to that before actually starting doing changes. However, for some special cases, like the one mentioned by Todd, it starts to look strange.

Recently (that is, in 5.1), MySQL released something called row-based replication, where the master sends the actual rows that were inserted/deleted/updated to the slave, and the slave then subsequently insert/delete/update those rows from the database. For each row, the database and table that the row belongs to is known, so if you are using row-based replication (option --binlog-format=row to the server, or use the SET GLOBAL BINLOG_FORMAT=ROW), the filtering will be done on the actual table being changed even if the statement updates several different tables in different databases.

I'll summarize with some general advice when using statement-based replication:

  • Don't qualify your table names with a database name. If you do, you might have trouble with the replication, so this is something to look for.
  • If you are going to make changes to tables in a database, always USE the database to set the current database correctly.
  • Don't use multi-table updates (or other statements that manipulate several tables) unless the tables are all in the same database.
  • Even if you are using multi-table updates on tables in the same database you might have problems. So watch out for any statement that manipulates several tables and make sure that they work by testing them.

1 comment:

roobaron said...

Hey Mats, Thanks for the great articles! The explanation in the flaws of statement-based replication vs row-based replication were good.