Monday, September 26, 2011

Python Interface to MySQL

There has been a lot of discussion lately about various non-SQL languages that provide access to databases without having to resort to using SQL. I wondered how difficult it would be to implement such an interface, so as an experiment, I implemented a simple interface in Python that is similar to the document-oriented interfaces available elsewhere. The interface generates SQL queries to query the database, but does not require any knowledge of SQL to use. The syntax is inspired by JQuery, but since JQuery works with documents, the semantics are slightly different.

A simple example would look like this:

from native_db import *
server = Server(host='127.0.0.1')
server.test.t1.insert({'more': 3, 'magic': 'just a test', 'count': 0})
server.test.t1.insert({'more': 3, 'magic': 'just another test', 'count': 0})
server.test.t1.insert({'more': 4, 'magic': 'quadrant', 'count': 0})
server.test.t1.insert({'more': 5, 'magic': 'even more magic', 'count': 0})
for row in server.test.t1.find({'more': 3}):
  print "The magic is:", row['magic']
server.test.t1.update({'more': 3}, {'count': 'count+1'})
for row in server.test.t1.find({'more': 3}, ['magic', 'count']):
  print "The magic is:", row['magic'], "and the count is", row['count']
server.test.t1.delete({'more': 5})
The first line defines a server to communicate with, which is simply done by creating a Server object with the necessary parameters. The constructor accepts the normal parameters for Connector/Python (which is what I'm using internally), but the user defaults to whatever getpass.getuser() returns, and the host defaults to 127.0.0.1, even though I've provided it here.

After that, the necessary methods are overridden so that server.database.table will refer to the table named table in the database named database on the given server. One possibility would be to just skip the database and go directly to the table (using some default database name), but since this is just an experiment, I did this instead. After that, there are various methods defined to support searching, inserting, deleting, and updating.

Since this is intended to be a simple interface, autocommit is on. Each of the functions generates a single SQL statement, so they will be executed atomically if you're using InnoDB.

table.insert(row)
This function will insert the contents of the dictionary into the table, using the keys of the dictionary as column names. If the table does not exist, it will be created with a "best effort" guess of what types to use for the columns.
table.delete(condition)
This function will remove all rows in the table that match the supplied dictionary. Currently, only equality matching is supported, but see below for how it could be extended.
table.find(condition, fields="*")
This will search the table and return an iterable over the rows that match condition. If fields is supplied (as a list of field names), only those fields are returned.
table.update(condition, update)
This will search for rows matching condition and update each matching row according to the update dictionary. The values of the dictionary are used on the right side of the assignments of the UPDATE statement, so expressions can be given here as strings. A sketch of how such a statement could be generated is shown below.
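
To give an idea of what the calls above translate into, here is a minimal sketch of how a find method on such a table object could generate its SELECT statement. This is not the actual implementation from native_db.py, just an illustration; the Table skeleton is made up, and it only handles equality conditions.

class Table(object):
  """Hypothetical skeleton holding a connection plus database and table names."""
  def __init__(self, connection, database, name):
    self.connection = connection
    self.database = database
    self.name = name

  def find(self, condition, fields='*'):
    # Build "SELECT ... FROM db.tbl WHERE ..." and return an iterable of dicts.
    if fields != '*':
      fields = ", ".join(fields)
    clauses, params = [], []
    for column, value in condition.items():
      clauses.append("{0} = %s".format(column))   # equality matches only
      params.append(value)
    stmt = "SELECT {0} FROM {1}.{2} WHERE {3}".format(
      fields, self.database, self.name, " AND ".join(clauses))
    cursor = self.connection.cursor()
    cursor.execute(stmt, params)
    names = [col[0] for col in cursor.description]
    return (dict(zip(names, row)) for row in cursor)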

That's all folks!

The code is available at http://mats.kindahl.net/python/native_db.py if you're interested in trying it out. The code is very basic, and there's potential for a lot of extensions. If there's interest, I could probably create a repository somewhere.

Note that this is not a replacement for an ORM library. The intention is not to allow storing arbitrary objects in the database: the intention is to be able to query the database using a Python interface without resorting to using SQL.

I'm just playing around and testing some things out, and I'm not really sure if there is any interest in anything like this, so what do you think? Personally, I have no problems with using SQL, but since I'm working with MySQL on a daily basis, I'm strongly biased on the subject. For simple jobs, this is probably easier to work with than a "real" SQL interface, but it cannot handle queries as complex as SQL can (at least not without extensions).

There are a number of open issues for the implementation (this is just a small list of obvious ones):

Only equality searching supported
Searching can only be done with equality matches, but it is trivial to extend to support more complex comparisons. To allow more complex conditions, the condition supplied to find, delete, and update can actually be a string, in which case it is used "raw".

Conditions could be extended to support something like {'more': '>3'}, or a more object-oriented approach would be to support something similar to {'more': operator.gt(3)}.
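
As an illustration of the object-oriented approach, a condition value could be a small operator object that knows how to render itself as a SQL comparison. This is only a sketch of the idea, not part of the current code; the gt class and the render_condition helper are made up for the example.

class gt(object):
  """Hypothetical operator object rendering as "column > value"."""
  def __init__(self, value):
    self.value = value
  def to_sql(self, column):
    return "{0} > %s".format(column), [self.value]

def render_condition(condition):
  # Plain values keep the current equality behavior, while operator
  # objects render themselves. Returns a WHERE clause and its parameters.
  clauses, params = [], []
  for column, value in condition.items():
    if hasattr(value, 'to_sql'):
      clause, args = value.to_sql(column)
    else:
      clause, args = "{0} = %s".format(column), [value]
    clauses.append(clause)
    params.extend(args)
  return " AND ".join(clauses), params

# Usage: server.test.t1.find({'more': gt(3)})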

No support for indexes
There's no support for indexes yet, but that can easily be added. The complication is what kind of indexes should be generated.

For example, right now rows are identified by their content, but what if we want rows to be handled as a set? Imagine the following (not supported) query where we insert some rows and then immediately search among them:

server.test.t1.insert(content with some more=3).find({'more': eq(3)})
In this case, we have to fetch the row identifiers for the inserted rows to be able to manipulate exactly those rows and none other. Not sure how to do this right now, but auto-inventing a row-identifier would mean that tables lacking it cannot be handled naturally.

Creating and dropping tables
Tables are created automatically if they do not exist. A simple heuristic is used to figure out the table definition, but this has obvious flaws if later inserts have more fields than the first one.
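
For reference, the "best effort" guess could be as simple as mapping the Python type of each value to an SQL column type. The mapping below is only a sketch of such a heuristic, not the exact one used in native_db.py.

def guess_sql_type(value):
  # Hypothetical mapping from Python values to SQL column types.
  if isinstance(value, bool):
    return "TINYINT(1)"
  if isinstance(value, (int, long)):
    return "BIGINT"
  if isinstance(value, float):
    return "DOUBLE"
  return "VARCHAR(255)"    # fall back to a string column

def create_table_statement(database, table, row):
  columns = ", ".join("{0} {1}".format(name, guess_sql_type(value))
                      for name, value in row.items())
  return "CREATE TABLE IF NOT EXISTS {0}.{1} ({2})".format(
    database, table, columns)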

To support extending the table, one would have to generate an ALTER TABLE statement to "fix" the table.

There is no support for dropping tables... or databases.

Wednesday, July 27, 2011

Binlog Group Commit Experiments

It has been a while since I talked about binary log group commit. I had to spend time on a few other things.

Since then, Kristian has released a version of binary log group commit that seems to work well. However, for a few reasons that will be outlined below, we decided to do experiments ourselves using the approach that I have described earlier. A very early version of what we will start doing benchmarks on is available at the MySQL labs. We have not done any benchmarking on this approach before OSCON, so we'll have to get back to that.

All of this started with Facebook pointing out a problem in how the group commit interacts with the binary log and proposed a way to handle the binary log group commit by demonstrating a patch to solve the problem.

What's in the patch

The patch involves implementing logic for handling binary log group commit and parallel writing of the binary log, including a minor change to the handler protocol by adding a persist callback. The extension of the handler interface is strictly speaking not necessary for the implementation, but it is natural to extend the interface in this manner and I believe that it can be used by storage engines to execute more efficiently. In addition to the new logic, three new options were added and one option was created as an alias of an old option.
binlog-sync-period=N
This is just a rename of the old sync-period option, which tells the server to call fsync for the binary log every N events. For many of the old options, it is not clear what they are configuring, so we are adding the binlog- prefix to options that affect the binary log. The old option is kept as an alias for this option.
binlog-sync-interval=msec
No transaction commit will wait for more than msec milliseconds before calling fsync on the binary log. If set to zero, it is disabled. You can set both this option and the binlog-sync-period option.
binlog-trx-committed={COMPLETE,DURABLE}
A transaction is considered committed when it is either in durable store or when it is completed. If set to DURABLE, either binlog-sync-interval or binlog-sync-period has to be non-zero. If they are both zero, transactions will not be flushed to disk and hence will never be considered durable.
master-trx-read={COMPLETE,DURABLE}
A transaction is read from the binary log when it is completed or when it is durable. If set to DURABLE, either binlog-sync-interval or binlog-sync-period has to be non-zero or an error will be generated. If both were zero, no transactions would ever be read from the binary log and hence never sent out.
The patch also contains code to eliminate the prepare_commit_mutex as well as moving the release of row locks inside InnoDB to the prepare phase (not completely applied yet, I will get it there as soon as possible). The focus of these changes is that we should maintain consistency, so we have not done any aggressive changes like moving the release of the write locks to the prepare phase: that could possibly lead to inconsistencies.

Figure 1. Binary log with transaction in different stages
The main changes are about how a transaction is committed. The details are explained in the previous articles, but for understanding the rest of this blog post, I'll briefly recapitulate how a transaction is committed in this solution. Each transaction passes through three states: prepared, completed (committed to memory), and durable (committed to disk), as seen in Figure 1. The transaction is pushed through these states using the following procedure:
  1. The transaction is first prepared, which is now split into two steps:
    1. In the reserve step, a slot is assigned for the transaction in the binary log and the storage engine is asked to check if this transaction can be committed. At this point, the storage engine can abort the transaction if it is unable to fulfill the commit, but if it approves of the commit, the only thing that can abort the transaction after this point is a server crash. This check is currently done using the prepare call. This step is executed with a lock, but is intended to be short.
    2. In the persist step, the persist function is called, which asks the storage engine to persist any data that it needs to persist to guarantee that the transaction is fully prepared. After this step is complete, the transaction is fully prepared in the storage engine and in the event of a crash, it will be able to commit the transaction on recovery, if asked to do so. This step is executed without a lock, and a storage engine that intends to handle group commit should defer any expensive operations to this step.
  2. To record the decision, the transaction is written to the reserved slot in the binary log. Since the write is done to a dedicated place in the binary log reserved to this transaction, it is not necessary to hold any locks, which means that several threads can write the transaction to the binary log at the same time.
  3. The commit phase is in turn split into two steps:
    1. In the completion step, the thread waits for all preceding transactions to be fully written to the binary log, after which the transaction is completed, which means that it is logically committed but not necessarily in durable storage.
    2. In the durability step, the thread waits for the transaction (and all preceding transactions) to be written to disk. If this does not occur within the given time period, it will itself call fsync for the binary log. This will make all completed transactions durable.
After this procedure is complete, the transaction is fully committed and the thread can proceed with executing the next statement.
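
To make the flow above concrete, here is a schematic sketch in Python of the steps a transaction goes through. This is not the server code (which is C++ and driven by the handler interface); names such as reserve_slot, wait_for_preceding, and wait_until_durable are made up for the illustration.

def commit_transaction(trx, binlog, engine):
  # Reserve step: short, done under a lock. Allocate a slot in the binary
  # log and ask the engine whether the transaction can be committed.
  with binlog.lock:
    slot = binlog.reserve_slot(trx.size())
    engine.prepare(trx)          # the engine may still abort here

  # Persist step: no lock held. The engine makes the prepare durable.
  engine.persist(trx)

  # Write the transaction into its reserved slot; several threads can do
  # this in parallel since each one writes to its own part of the file.
  binlog.write_at(slot, trx.data())

  # Completion step: wait until all preceding slots are written; the
  # transaction is then logically committed.
  binlog.wait_for_preceding(slot)

  # Durability step: wait for an fsync covering this slot, or call fsync
  # ourselves if the configured interval expires.
  if not binlog.wait_until_durable(slot, timeout=binlog.sync_interval):
    binlog.fsync()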

The different approaches

So, providing this patch begs the question: why a third version of binary log group commit? There are three approaches: Facebook's patch (#1), Kristian's patch (#2), and my patch (#3). Before going over the rationale leading to a third version, it is necessary to understand how the Facebook patch and Kristian's patch work on a very high level. If you look at Figure 1, you see a principal diagram showing how the patches work. Both of them maintain a queue of threads with transactions to be written and will ensure that they are written in the correct order to the binary log.

The Facebook patch ensures that the transactions are written in the correct order by signalling each thread waiting in the queue in the correct order, after which the thread will take a lock on the binary log, append the transaction, and release the lock. When the decision to commit the outstanding transactions is made, fsync() is called. It has turned out that this lock-write-unlock loop can only be executed at a certain speed, which means that as the number of threads waiting to write transactions increases, the system chokes and is not able to keep up.

Kristian solves this by designating the first thread in the queue as the leader and having it write the transactions for all threads in the queue instead of having each thread do it individually; the leader then broadcasts to the other threads, which just return from the commit. This improves performance significantly, as can be seen from the figures in the measurements that Mark did. Note, however, that a lock on the binary log is still kept while writing the transactions.

The approach we are experimenting with goes about this in another way: instead of queueing the data to be written, a place is immediately allocated in the binary log, after which the thread proceeds to write the data. This means that several threads can write to the binary log in parallel without needing to keep any locks. There is a need for a lock when allocating space in the binary log, but that is very short. Since the threads can finish writing in different order, it is necessary to keep logic around for deciding when a transaction is committed and when it's not. For details, you can look at the worklog (which is not entirely up to date, but I'll fix that). In this sense, the binary log itself is the queue (there is a queue in the implementation, but this is just for bookkeeping). The important differences leading us to want to have a look at this third version are:

  • Approaches #1 and #2 keep a lock while writing the binary log while #3 doesn't.
  • Approaches #1 and #2 keep the transactions on the side (in the queue) and write them to the binary log when they are being committed. Approach #3 writes the transactions directly to the binary log, possibly before they are committed.
Figure 1. Sources of performance problems

Efficiently using Multiple Cores

Efficiently using a multi-threaded system, especially one with multiple cores, is very hard. It requires knowledge of hardware issues, operating system considerations, algorithms, and some luck. I will not cover all the issues revolving around designing a system for multi-core use, but I will focus on three of the parts that we are considering in this case. We split the sources of performance degradation when committing a transaction into three separate parts: CPU and memory issues, software lock contention, and I/O.
  • The CPU and memory issues have to do with how caches are handled on the CPU level, which can affect performance quite a lot. There are some things that can be done, such as avoiding false sharing, handling data alignment, and checking the cache access patterns, but in general, this is hard to add as an afterthought and requires quite a lot of work to get right. We are not considering this and view it as static.
  • The I/O can be reduced using either SSDs or RAID solutions (which do not reduce latency, but improve the throughput and therefore reduce the I/O needed for each transaction). Also, reducing the number of accesses to disk using group commits will improve the situation significantly, which is what we're doing here.
  • To reduce the software lock contention there is only one solution: reduce the time each lock is kept. This can be as simple as moving the lock acquire and release, or using atomic primitives instead of locks, but it can also require re-designing algorithms to be able to run without locks.
So, assuming that we reduce the I/O portion of committing a transaction—and only the I/O portion—as you can see in Figure 1, the software lock time starts to become the problem and we need to start working on reducing that. To do this, there are not many options except the approach described above. And if we take this approach to reduce lock contention, only a few additions are needed to get group commit as well.

Given this, it is rational to explore whether this solution can solve the group commit problem as well as the other solutions and improve the scalability of the server at the same time.

Scaling out

One of the most central uses for replication is to achieve high availability by duplicating masters and replicating between them to keep both up to date. For this reason, it is important to get the changes over to the other master as fast as possible. In this case, whether the data is durable on the original master or not is of less concern, since once the transaction has left the node, a crash will not cause the transaction to disappear: it has already been distributed. This means that for implementing multi-masters, we want replication to send transactions as soon as possible—and maybe even before that—since we can achieve high availability by propagating the information as widely as possible.

On the other hand, transactions sent from the master to the slave might need to be durable on the master, since otherwise the slave might be moving into an alternative future—a future where this transaction was committed—if the transactions sent to the slave are lost because of a crash. In this case, it is necessary for the master to not send out the transaction before it is in durable store. Having a master that is able to send out both completed transactions and durable transactions at the same time, all based on the requirements of the slave that connects, is a great feature and allows the implementation of both an efficient multi-master solution and slaves that do not diverge from the master even in the event of crashes. Currently, a master cannot deliver both completed transactions and durable transactions at the same time. With the patch presented in this article, it is possible to implement this, but in alternatives #1 and #2 described above, all the transactions are kept "on the side" and not written to the binary log until they are being committed. This means that it is harder to support this scenario with the two other alternatives.

Concluding remarks

To sum up the discussion above: we are interested in exploring this approach since we think that it provides shorter lock time, hence scales better to multi-core machines, and in addition provides better scale-out capabilities, since slaves will be able to decide whether they want to receive durable or completed transactions. Thanks to all in the community for the great work and discussions on binlog group commit. The next steps will be to benchmark this solution to see how it flies, and it would be great to also get some feedback on this approach. As always, we are interested in getting a good and efficient solution that also can be maintained and evolved easily.

Wednesday, April 13, 2011

Round-Robin Multi-Source in Pure SQL

With the addition of the new tables to implement crash-safe replication we also get access to replication information through the SQL interface. This might not seem like a big advantage, but it should not be taken lightly. To demonstrate the power of using this approach, I will show how to implement the multi-source round-robin replication described elsewhere (including in our book). However, compared to the other implementations—where the implementation requires a client to parse the output of SHOW SLAVE STATUS—the twist is that the implementation is entirely done in the server, using pure SQL.

If you're familiar with replication, you know that a slave can only replicate from a single master. The trick used to replicate from multiple masters—this is usually called multi-source—is to switch between masters in a time-share fashion as illustrated in Figure 1. The scheme used to pick the master to replicate from can vary, but it is common to use a round-robin schedule.

The steps necessary to switch master are:

  1. Stop reading events from the master and empty the relay log. To stop reading events from the master, it is necessary to ensure that there are no outstanding events in the relay log before switching to another master. If this is not done, some events will not be applied and will have to be re-fetched from the master.
    1. Stop the I/O thread.
    2. Wait for the events in the relay log to be applied.
    3. Stop the SQL thread.
  2. Save away the replication information.
  3. Fetch the saved information about the next master to replicate from.
  4. Change master using the new information.
  5. Start the slave threads.
Simple, right? So, let's make an implementation! What pieces do we need?
  • To handle the periodic switching, we use an SQL event for executing the above procedure.
  • We need a table to store the state of each master. The table should contain all the necessary information for configuring the master, including the binlog position.
  • We need to be able to store what master we're currently replicating from.

Saving state information

Figure 1. Tables for storing information about masters
CREATE TABLE my_masters (
    idx INT AUTO_INCREMENT PRIMARY KEY,
    host VARCHAR(50), port INT DEFAULT 3306,
    user VARCHAR(50), passwd VARCHAR(50),
    log_file VARCHAR(50), log_pos LONG,
    UNIQUE INDEX (host,port,user)
) ENGINE=InnoDB;

CREATE TABLE current_master (
    idx INT
) ENGINE=InnoDB;
We need two tables: a table my_masters to record information about the available masters and a table current_master that keeps information about the current master. The my_masters table will contain information on how to connect to the masters as well as the last seen position. We assume that the user and password information is stored in the table and won't save away that information when switching master. To store the current master being replicated from, we cannot use a user-defined variable—because each invocation of an event spawns a new session—so we store this information in a table.
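
As a usage illustration, the tables could be seeded from any client before creating the event. The sketch below uses Connector/Python; the host names, the credentials, and the assumption that the tables live in the test database are all made up for the example.

import mysql.connector

conn = mysql.connector.connect(host='slave-host', user='admin', password='secret')
cursor = conn.cursor()
# Register two masters; log_file and log_pos are left NULL so that
# replication starts from the beginning of each master's binary log.
for master in ('master-1.example.com', 'master-2.example.com'):
  cursor.execute(
    "INSERT INTO test.my_masters(host, port, user, passwd) VALUES (%s, %s, %s, %s)",
    (master, 3306, 'repl_user', 'repl_pass'))
# Point current_master at the first entry so the event has somewhere to start.
cursor.execute("INSERT INTO test.current_master(idx) VALUES (1)")
conn.commit()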

Switching masters

To be able to execute a CHANGE MASTER statement with the information we need, it would be perfect to use a prepared statement, but unfortunately, the CHANGE MASTER statement is one of those statements that cannot be used inside a prepared statement, so we have to build the statement dynamically. To make it easier, we create a change_master procedure that does the job of building, preparing, executing, and deallocating a prepared statement. We also allow the file name and position passed in to be NULL, in which case we start replication without these parameters, essentially starting from the beginning of the master's binary log.
delimiter $$
CREATE PROCEDURE change_master(
    host VARCHAR(50), port INT,
    user VARCHAR(50), passwd VARCHAR(50),
    name VARCHAR(50), pos LONG)
BEGIN
  SET @cmd = CONCAT('CHANGE MASTER TO ',
                    CONCAT_WS(', ',
                    CONCAT('MASTER_HOST = "', host, '"'),
                    CONCAT('MASTER_PORT = ', port),
                    CONCAT('MASTER_USER = "', user, '"'),
                    CONCAT('MASTER_PASSWORD = "', passwd, '"')));

  IF name IS NOT NULL AND pos IS NOT NULL THEN
    SET @cmd = CONCAT(@cmd,
                      CONCAT_WS(', ', '',
                                CONCAT('MASTER_LOG_FILE = "', name, '"'),
                                CONCAT('MASTER_LOG_POS = ', pos)));
  END IF;
  PREPARE change_master FROM @cmd;
  EXECUTE change_master;
  DEALLOCATE PREPARE change_master;
END $$
delimiter ;
The last step is to create the event that switches master for us. As a specific feature, we implement the event handling so that we can add and remove rows from the my_masters table and the event will just pick the next one in order. To solve this, we use queries to pick the next master in order based on the index of the last used master, and then an additional query to handle the case of a wrap-around with a missing master at index 1.

To allow the table to be changed while the events are executing, we place all the updates of our tables into a transaction. That way, any updates done to the table while the event is executing will not affect the logic for picking the next master.

There is some extra logic added to handle the case that there are "holes" in the index numbers: it is possible that there is no master with index 1 and it is possible that the next master does not have the next index in sequence. This would also allow the server ID of the master to be used, but in the current implementation, we use a simple index instead.

delimiter $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 10 SECOND DO
BEGIN
   DECLARE l_host VARCHAR(50);
   DECLARE l_port INT UNSIGNED;
   DECLARE l_user TEXT;
   DECLARE l_pass TEXT;
   DECLARE l_file VARCHAR(50);
   DECLARE l_pos BIGINT;
   DECLARE l_next_idx INT DEFAULT 1;
   SET SQL_LOG_BIN = 0;
Don't write any of this to the binary log. Since this is an event, it will automatically be reset at the end of the execution and not affect anything else.
   STOP SLAVE IO_THREAD;
   SELECT master_log_name, master_log_pos
     INTO l_file, l_pos
     FROM mysql.slave_master_info;
   SELECT MASTER_POS_WAIT(l_file, l_pos);
   STOP SLAVE;
Stop the slave I/O thread and empty the relay log before switching master
   START TRANSACTION;
   UPDATE my_masters AS m,
          mysql.slave_relay_log_info AS rli
      SET m.log_pos = rli.master_log_pos,
          m.log_file = rli.master_log_name
    WHERE idx = (SELECT idx FROM current_master);
Save the position of the current master
   SELECT idx INTO l_next_idx FROM my_masters
    WHERE idx > (SELECT idx FROM current_master)
    ORDER BY idx LIMIT 1;
Find the next master in turn. To handle that masters have been removed, we will pick the next one index-wise. Wrap-around is handled by using the default of 1 above.
    SELECT idx INTO l_next_idx FROM my_masters
     WHERE idx >= l_next_idx
     ORDER BY idx LIMIT 1;
If we did a wrap-around, it might be the case that master with index 1 does not exist (the default for l_next_idx), so then we have to scan and find the first index that exists which is equal to or greater than l_next_idx.
    UPDATE current_master SET idx = l_next_idx;

    SELECT host, port, user, passwd, log_pos, log_file
      INTO l_host, l_port, l_user, l_pass, l_pos, l_file
      FROM my_masters
      WHERE idx = l_next_idx;

    CALL change_master(l_host, l_port, l_user,
                       l_pass, l_file, l_pos);
    COMMIT;
    START SLAVE;
END $$
delimiter ;
Extract the information about the new master from our masters table my_masters and change to use that master.
That's all! Now you can go off and play with it and send me comments.

You can download the MySQL 5.6 Milestone Development Release from the MySQL Developer Zone (dev.mysql.com), which contains the new replication tables, and you can find information in the previous post on how to set up the server to use the new tables.

Tuesday, April 12, 2011

Crash-safe Replication

A common request is to have replication crash-safe in the sense that the replication progress information always is in sync with what has actually been applied to the database, even in the event of a crash. Although transactions are not lost if the server crashes, it could require some tweaking to bring the slaves up again.

In the latest MySQL 5.6 milestone development release, the replication team has implemented crash-safety for the slave by adding the ability to commit the replication information together with the transaction (see Figure 1). This means that replication information will always be consistent with what has been applied to the database, even in the event of a server crash. Also, some fixes were done on the master to ensure that it recovers correctly.

If you're familiar with replication, you know that the replication information is stored in two files: master.info and relay-log.info. The update of these files is arranged so that they are updated after the transaction has been applied. This means that if you have a crash between the transaction commit and the update of the files, the replication progress information will be wrong. In other words, a transaction cannot be lost this way, but there is a risk that a transaction could be applied yet another time. The usual way to avoid this is to have a primary key on all your tables. In that case, a repeated update of the table would cause the slave to stop, and you would have to use SQL_SLAVE_SKIP_COUNTER to skip the transaction and get the slave up and running again. This is better than losing a transaction, but it is nevertheless a nuisance. Removing the primary key to prevent the slave from stopping will only solve the problem partially: it means that the transaction would be applied twice, which would both place a burden on the application to handle dual entries and also require the tables to be cleaned regularly. Both of these approaches require either manual intervention or scripting support to handle. This does not affect reliability, but it is so much easier to handle if the replication information is committed in the same transaction as the data being updated.

Crash-safe masters

Two problems related to crash-safe replication have been fixed in the master, both of which could cause some annoyance when the master recovered.
  • If the master crashed when a binary log was rotated, it was possible that some orphan binlog files ended up in the binary log index file. This was fixed in 5.1 but is also a piece in the puzzle of having crash-safe replication.
  • Writing to the binary log is not an atomic operation, and if a crash occurred while writing to the binary log, there was a possibility of a partial event at the end of the binary log.

    Now, the master recovers from this by truncating the binary log to the last known good position, removing the partially written transaction and rolling back the outstanding transactions in the storage engines.

Figure 1. Moving position information update into transaction

Crash-safe slaves

Several different solutions for implementing crash-safety—or transactional replication, as it is sometimes called—have been proposed, with Google's TransactionalReplication patch being the best known. This solution stores the replication positions in the InnoDB transaction log, but the MySQL replication team decided to instead implement crash-safety by moving the replication progress information into system tables. This is a more flexible solution and has several advantages compared to storing the positions in the InnoDB transaction log:
  • If the replication information and data is stored in the same storage engine, it will allow both the data and the replication position to be updated as a single transaction, which means that it is crash-safe.
  • If the replication information and data is stored in different storage engines, but both support XA, they can still be committed as a single transaction.
  • The replication information is flushed to disk together with the transaction data. Hence writing the replication information directly to the InnoDB redo log does not offer a speed advantage, while the table-based solution does not prevent the user from reading the replication progress information easily.
  • The tables can be read from a normal session using SQL commands, which also means that it can be incorporated into such things as stored procedures and stored functions.
Table 1. slave_master_info
Field                    Line in file   Slave status column
Master_id
Number_of_lines          1
Master_log_name          2              Master_Log_File
Master_log_pos           3              Read_Master_Log_Pos
Host                     3              Master_Host
User_name                4              Master_User
User_password            5
Port                     6              Master_Port
Connect_retry            7              Connect_Retry
Enabled_ssl              8              Master_SSL_Allowed
Ssl_ca                   9              Master_SSL_CA_File
Ssl_capath               10             Master_SSL_CA_Path
Ssl_cert                 11             Master_SSL_Cert
Ssl_cipher               12             Master_SSL_Cipher
Ssl_key                  13             Master_SSL_Key
Ssl_verify_server_cert   14             Master_SSL_Verify_Server_Cert
Heartbeat                15
Bind                     16             Master_Bind
Ignored_server_ids       17             Replicate_Ignore_Server_Ids
Uuid                     18             Master_UUID
Retry_count              19             Master_Retry_Count
In addition to giving us crash-safe slaves, the last of these advantages should not be taken lightly. Being able to handle replication from pure SQL puts some of the key features in the hands of application developers.
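
For example, with the table-based repositories a monitoring script can read the applied position with a plain query. The small sketch below uses Connector/Python; the connection parameters are placeholders, and the column names are the ones listed in Table 2.

import mysql.connector

conn = mysql.connector.connect(host='slave-host', user='monitor', password='secret')
cursor = conn.cursor()
cursor.execute(
  "SELECT Master_log_name, Master_log_pos FROM mysql.slave_relay_log_info")
for log_name, log_pos in cursor:
  print "Slave has applied everything up to %s:%s" % (log_name, log_pos)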

As previously mentioned, the replication information is stored in two files:

master.info
This file contains information about the connection to the master—such as hostname, user, and password—but also information about how much of the binary log has been transferred to the slave.
relay-log.info
This file contains information about the current state of replication, that is, how much of the relay log has been applied.

Options to select replication information repository

In order to make the solution flexible, we introduced a general API for adding replication information repositories. This means that we can support multiple types of repositories for replication information, but currently, only the old system using the files master.info and relay-log.info and the new system using the tables slave_master_info and slave_relay_log_info are supported. In order to select what type of repository to use, two new options were added. These options are also available as server variables.
master_info_repository
The type of repository to use for the master info data seen in Table 1.
relay_log_info_repository
The type of repository to use for the relay log info seen in Table 2.
Both of the variables can be set to either FILE or TABLE. If the variable is set to TABLE the new table-based system will be used and if it is set to FILE, the old file-based system will be used. The default is FILE, so make sure to set the value if you want to use the table-based system.
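
As an illustration, both repositories could be switched to tables from a client session, provided the slave threads are stopped; the variables can of course also be put in the configuration file instead. The sketch below uses Connector/Python with placeholder connection parameters and assumes the variables are dynamic in the build you are using.

import mysql.connector

conn = mysql.connector.connect(host='slave-host', user='admin', password='secret')
cursor = conn.cursor()
cursor.execute("STOP SLAVE")
cursor.execute("SET GLOBAL master_info_repository = 'TABLE'")
cursor.execute("SET GLOBAL relay_log_info_repository = 'TABLE'")
cursor.execute("START SLAVE")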

Table 2. slave_relay_log_info
Field             Line in file   Slave status column
Master_id
Number_of_lines   1
Relay_log_name    2              Relay_Log_File
Relay_log_pos     3              Relay_Log_Pos
Master_log_name   4              Relay_Master_Log_File
Master_log_pos    5              Exec_Master_Log_Pos
Sql_delay         6              SQL_Delay
If you look in Table 1 and Table 2 you can see the column names used for the tables as well as the line number in the corresponding file and the column name in the output of SHOW SLAVE STATUS. Since we are using tables, the column names are used for storing the data in the table, but when using a file, the column names are only used to identify the correct row to update and the value is inserted at the line number given in the table.

The format of the tables has been extended with an additional field that is not present in the files: the Master_id field. The reason we added this is to make it possible to extend the server to track multiple masters. Note that we currently have no definite plans to add multi-source support, but as good engineers we do not want these tables to be a hindrance to adding multi-source.

Selecting replication repository engine

In contrast with most of the system tables in the server, the replication repositories can be configured to use any storage engine you prefer. The advantage of this is that you can select the same engine for the replication repositories as the data you're managing. If you do that, both the data and the replication information will be committed as a single transaction.

The new tables are created at installation using the mysql_install_db script, as usual, and the default engine for these tables is the same as for all system tables: MyISAM. As you know, MyISAM is not very transactional, so it is necessary to switch these tables to InnoDB if you really want crash-safety. To change the engine for these tables you can just use a normal ALTER TABLE.

slave> ALTER TABLE mysql.slave_master_info ENGINE = InnoDB;
slave> ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;
Note that this works for these tables because they were designed to allow any storage engine to be used for them, but it does not mean that you can change the storage engine for other system tables and expect it to work.

Event processing

This implementation of crash-safe slaves works naturally with both statement-based and row-based replication and there is nothing special that needs to be done in the normal cases. However, these tables interleave with the normal processing in slightly different ways.

To understand how transactions are processed by the SQL thread, let us consider the following example transaction:

START TRANSACTION;
INSERT INTO articles(user, title, body)
      VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
COMMIT;
This transaction will be written to the binary log and then sent over to the slave and written to the relay log in the usual way. Once it is read from the relay log for execution, it will be executed as if an update statement were added to the end of the transaction, before the commit:
START TRANSACTION;
INSERT INTO articles(user, title, body)
      VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
UPDATE mysql.slave_relay_log_info
   SET Master_log_pos = @@Exec_Master_Log_Pos,
       Master_log_name = @@Relay_Master_Log_File,
       Relay_log_name = @@Relay_Log_File,
       Relay_log_pos = @@Relay_Log_Pos;
COMMIT;
In this example, there are a number of pseudo-server variables (that is, they don't exist for real) that have the same name as the corresponding field in the result set from SHOW SLAVE STATUS. As you can see, the update of the position information is now inside the transaction and will be committed with the transaction, so if both articles and mysql.slave_relay_log_info are in the same transactional engine, they will be committed as a unit.

This works well for the SQL thread, but what about the I/O thread? There are no transactions executed here at all, so when is the information in this table committed?

Since a commit to the table is expensive—in the same way as syncing a file to disk is expensive when using files as the replication information repository—the slave_master_info table is not updated with each processed event. Depending on the value of sync_master_info there are a few alternatives.

If sync_master_info = 0
In this case, the slave_master_info table is just updated when the slave starts or stops (for any reason, including errors), if the relay log is rotated, or if you execute a CHANGE MASTER command.
If sync_master_info > 0
Then the slave_master_info table will be updated every sync_master_info events.
This means that while the slave is running, you cannot really see how much data has been read by the slave without stopping it. If it is important to see how the slave progresses in reading events from the master, then you have to set sync_master_info to some non-zero value, but you should be aware that there is a cost associated with doing this.

This does not usually pose a problem since the times you need to read the master replication information on a running replication setup are few and far between. It is much more common to read it when the slave has stopped for some reason: to figure out where the error is or to perform a master fail-over.

Closing remarks

We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try this out for yourselves, you can download the MySQL 5.6 Milestone Development Release where all this is implemented from the MySQL Developer Zone (dev.mysql.com). If you want to find out more details, the section Slave Status Logs in the MySQL 5.6 reference manual will provide you with all the information. This is one of the features that were presented by Lars Thalmann April 11, 2011 (yesterday) at 2:30pm, at the "MySQL Replication" talk at Collaborate 11 and April 12, 2011 (today) 10:50am "MySQL Replication Update" at the O'Reilly MySQL Conference & Expo.

Monday, April 11, 2011

Replication Event Checksum

MySQL replication is fast, easy to use, and reliable, but once it breaks, it can be very hard to figure out what the problem is. One of the concerns often raised is that events are corrupted, either through failing hardware, network failure, or software bugs. Even though it is possible to handle errors during transfer over the network using an SSL connection, errors here are rarely the problem. A (relatively) more common problem is that the events are corrupted either due to a software bug or a hardware error.

To be able to better handle corrupted events, the replication team has added replication event checksums to the MySQL 5.6 Milestone Development Release. The replication event checksums are added to each event as it is written to the binary log and are used to check that nothing happened to the event on the way to the slave. Since the checksums are added to all events in the binary log on the master and transferred both over the network and written to the relay log on the slave, it is possible to track corrupted events whether they are caused by hardware problems, network failures, or software bugs.

Figure 1. Master and Slave with Threads
The checksum used is a CRC-32 checksum, more precisely ISO-3309, which is the one supplied with zlib. This is an efficient checksum algorithm, but there is of course a penalty since the checksum needs to be generated. At this time, we don't have any measurements on the performance impact.
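
As a point of reference, this is the same CRC-32 that most zlib bindings expose. The short Python snippet below only illustrates what checksumming an event buffer amounts to; the event bytes are made up and this is not how the server computes it internally.

import zlib

event = b'\x13\x00\x00\x00' + b'made-up binlog event payload'
checksum = zlib.crc32(event) & 0xffffffff   # mask to get an unsigned 32-bit value
print "CRC32: 0x%08x" % checksum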

If you look at Figure 1 you can see an illustration of how events propagate through the replication system. In the figure, the points where a checksum could be generated or checked are marked with numbers. In the diagram, you can see the threads that handle the processing of events, and an outgoing arrow from a thread can generate a checksum while an arrow going into a thread can validate a checksum. Note, however, that for pragmatic reasons not all validations or generations can be done.

To enable validation or generation three new options were introduced:

binlog_checksum
This option is used to control checksum generation. Currently, it can accept two different values: NONE and CRC32, with NONE being the default (for backward compatibility).

Setting binlog_checksum to NONE means that no checksum is generated, while setting it to CRC32 means that an ISO-3309 CRC-32 checksum is added to each binary log event.

This means that a checksum will be generated by the session thread and written to the binary log, that is, at point 1 in Figure 1.

master_verify_checksum
This option can be set to either 0 or 1 (with the default being 0) and indicates that the master should verify any events read from the binary log on the master, corresponding to point 2 in Figure 1. In addition to being read from the binary log by the dump thread, events are also read when a SHOW BINLOG EVENTS is issued on the master, and a check is done at this time as well.

Setting this flag can be useful to verify that the event actually written to the binary log is uncorrupted, but it is typically not needed in a replication setting since the slave should verify the event on reception.

slave_sql_verify_checksum
Similar to master_verify_checksum, this option can be set to either 0 or 1 (but defaults to 1) and indicates that the SQL thread should verify the checksum when reading events from the relay log on the slave. Note that this means that the I/O thread writes a checksum to the event written to the relay log, regardless of whether it received an event with a checksum or not.

This means that this option will enable verification at point 5 in Figure 1 and also enable generation of a checksum at point 4 in the figure.

If you paid attention, you probably noticed that there is no checking for point 3 in the figure. This is not necessary since the checksum is verified when the event is written to the relay log at point 4, and the I/O thread just does a straight copy of the event (potentially adding a checksum, as noted above).

So, how does it look when we encounter a checksum error? Let's try it out and see what happens. We start by generating a simple binary log with checksums turned on and see what we get.

master> CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, name CHAR(50));
Query OK, 0 rows affected (0.04 sec)

master> INSERT INTO t1(name) VALUES ('Mats'),('Luis');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

master> SHOW BINLOG EVENTS FROM 261;
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name          | Pos | Event_type | Server_id | End_log_pos | Info                                                      |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query      |         1 |         333 | BEGIN                                                     |
| master-bin.000001 | 333 | Intvar     |         1 |         365 | INSERT_ID=1                                               |
| master-bin.000001 | 365 | Query      |         1 |         477 | use `test`; INSERT INTO t1(name) VALUES ('Mats'),('Luis') |
| master-bin.000001 | 477 | Query      |         1 |         550 | COMMIT                                                    |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)
Here, everything looks as before, so no sign of a checksum here, but let's edit the binlog file directly and change the 's' in 'Mats' to a 'z' and see what happens. First with MASTER_VERIFY_CHECKSUM set to 0, and then with it set to 1.
master> SHOW BINLOG EVENTS FROM 261;
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name          | Pos | Event_type | Server_id | End_log_pos | Info                                                      |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query      |         1 |         333 | BEGIN                                                     |
| master-bin.000001 | 333 | Intvar     |         1 |         365 | INSERT_ID=1                                               |
| master-bin.000001 | 365 | Query      |         1 |         477 | use `test`; INSERT INTO t1(name) VALUES ('Matz'),('Luis') |
| master-bin.000001 | 477 | Query      |         1 |         550 | COMMIT                                                    |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)

master> SET GLOBAL MASTER_VERIFY_CHECKSUM=1;
Query OK, 0 rows affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 261;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error
Now, the error message generated is not crystal clear, but there was an I/O error when reading the binary log: the checksum verification failed. You can see this because I could show the contents of the binary log with MASTER_VERIFY_CHECKSUM set to 0, but not when it was set to 1. Since the checksum is checked when reading events from the binary log, we get a checksum failure when using SHOW BINLOG EVENTS.

So, if we undo the corruption and verify that the binary log is correct by issuing SHOW BINLOG EVENTS again, we can try to send it over to the slave and see what happens. The steps to do this (in case you want to try it yourself) are:

  1. Start the I/O thread and let it create the relay log using START SLAVE IO_THREAD.
  2. Stop the slave using STOP SLAVE (this is necessary since the slave buffers part of the relay log).
  3. Manually edit the relay log to corrupt one event (I replaced the 's' with a 'z'); a small sketch of doing this programmatically is shown after the list.
  4. Start the slave using START SLAVE.
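
If you prefer to flip the byte programmatically rather than with a hex editor, something like the snippet below will do. The relay log path and the offset are made up; you have to locate the byte yourself, and the file must not be touched while the slave is running.

def flip_byte(path, offset, old, new):
  # Overwrite a single byte in place, checking that we hit the expected one.
  with open(path, 'r+b') as log:
    log.seek(offset)
    assert log.read(1) == old
    log.seek(offset)
    log.write(new)

flip_byte('/var/lib/mysql/slave-relay-bin.000002', 439, b's', b'z')
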
The result when doing this is an error, as you can see below. Removing the corruption and starting the slave again will apply the events as expected.
slave> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                        .
                        .
                        .
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 550
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 419
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                        .
                        .
                        .
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1594
               Last_SQL_Error: Relay log read failure: Could not parse
                               relay log event entry. The possible
                               reasons are: the master's binary log is
                               corrupted...
                        .
                        .
                        .
     Last_SQL_Error_Timestamp: 110406 09:41:40
1 row in set (0.00 sec)
Now, this is all very nice, but if you have a corruption, you also want to find out where the corruption is—and preferably without having to start the server. To handle this, the mysqlbinlog program was extended to print the CRC checksum (if there is one) and also to verify it if you give it the verify-binlog-checksum option.
$ client/mysqlbinlog --verify-binlog-checksum master-bin.000001
        .
        .
        .
# at 261
#110406  8:35:28 server id 1  end_log_pos 333 CRC32 0xed927ef2  Query   thread_id=1...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406  8:35:28 server id 1  end_log_pos 365 CRC32 0x01ed254d  Intvar
SET INSERT_ID=1/*!*/;
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely...
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
As you can see, an error is emitted for the offending event, and you can also see the CRC checksum value (which is 32 bits) in the output above; it corresponds to the position where the slave stopped for my corrupted binary log.

This is just the beginning: there are many things that can be done using checksums, and many new things that are now possible to implement. If you think that this is a useful feature, please let us know, and if you think that it needs to be enhanced, changed, or extended, we would also like to hear from you.

Closing remarks

We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try this out for yourselves then you can download the MySQL 5.6 Milestone Development Release where all this is implemented from the MySQL Developer Zone (dev.mysql.com).

If you want to find out the details, the reference documentation for the replication checksums can be found together with the options mentioned above.

This is one of the features that are presented by Lars Thalmann today (April 11, 2011) at 2:30pm, at the "MySQL Replication" talk at Collaborate 11 and tomorrow (April 12, 2011) 10:50am "MySQL Replication Update" at the O'Reilly MySQL Conference & Expo.

Tuesday, February 08, 2011

Slave Type Conversions

[Note: I'm testing out googlecl to post this article.] Replication is typically used to replicate from a master to one or more slaves using the same definition of tables on the master and slave, but in some cases you want to replicate to tables with a different definition on the slave, for example:
  • Adding a timestamp column on the slave to see when the row was last updated.
  • Eliminating some columns on the slave because you don't need them and they take up space that you can use for better purposes.
  • Temporarily handling an on-line upgrade of a dual-master or circular replication setup.
Of these alternatives, the last one is critical to any deployment that wants to stay available. If this case can be handled, most other changes can also be handled, so let's focus on that.

Figure 1. Table with an extra column on slave
Master:

CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),
    PRIMARY KEY (id))

Slave:

CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),
    ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id))
When using statement-based replication, the plain statements are replicated—this can at times be an advantage, but not always, as you will soon see. The most obvious case is when you have more or fewer columns on the master than you have on the slave. To illustrate the problem, let us start with the table definitions in Figure 1. Here a timestamp column was added to the slave to see when the row was last changed. When using statement-based replication, we can properly replicate between these tables provided we always give column names in the statements on the master, for example:

master> INSERT INTO employee(name, email) VALUES ('Mats', 'mats@example.com');
master> DELETE FROM employee WHERE email = 'mats@example.com';
master> UPDATE employee SET name = 'Matz' WHERE email = 'mats@example.com';
In all these cases, the statements execute perfectly well with both table definitions since the "missing" column has a default value and each statement gives exactly the names of the columns to update. The DELETE and UPDATE statements naturally refer only to the columns on the master, but for INSERT it is necessary to add the column names even if the tuple matches the definition on the master, since the definition could be different on the slave.

Having to give the column names all the time is fragile and if the user—or the application—makes a mistake and types the following statement, replication on the slave will stop with an error:

master> INSERT INTO employee VALUES (DEFAULT, 'Mats', 'mats@example.com');
In contrast to statement-based replication, row-based replication will do the right thing and throw away extra columns sent by the master or add default values to extra columns on the slave—if the column has a default value—provided that the columns are added or removed last in the table.

This works fine for the example above since the extra timestamp column is last in the table. The effect is to keep track of when the row was last updated on the slave, which could be used to see if the row is current.

Depending on what you want to accomplish, there could be better techniques for this, described in our book. The problem is that the timestamp might not have enough precision in a high-load situation.
So, row-based replication in MySQL 5.1 contains support for using more or fewer columns on the slave as compared to the master, but there was one case that was not supported: replicating between different column types. This is very important for basic upgrade scenarios where you, for example, change the size of some column during an upgrade.

Figure 2. Different types on master and slave
Master:

CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name CHAR(64),
    email CHAR(64),
    PRIMARY KEY (id))

Slave:

CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),
    PRIMARY KEY (id))
For example, consider the table definitions in Figure 2. In this case, the intention is to save space on the slave by storing the strings in a VARCHAR field instead of a CHAR field—recall that VARCHAR fields are variable-length strings while CHAR fields occupy a fixed space in the row. (We don't care too much about the reasons for using CHAR on the master, we just use this example to illustrate the problem.)

When using statement-based replication, this works well since the actual statement is replicated. However, when using row-based replication we have the additional requirement (in 5.1) that the column types have to have identical base types. Unfortunately, CHAR and VARCHAR do not have the same base type, so replication will stop with an error when you try to execute the INSERT, which is not very helpful.

Fortunately, the replication team has extended row-based replication with a new feature in MySQL 5.5: converting between types when replicating from a master to a slave with a different table definition. With this feature, stricter type checking and better error messages are also implemented.

The conversion checks the declared types on the master and slave and decides before executing the transaction if the conversion is allowed. This means that it does not investigate the actual values replicated: only the types of the columns on the master and the slave. In addition to giving better performance (each value does not have to be checked), this check is done so that you can be sure that any value replicated between the tables will work, not just the values that you happened to have in your test suite.

When dealing with conversions, we are only considering conversions within the groups below.

Integer types
TINYINT, SMALLINT, MEDIUMINT, INT, BIGINT
Decimal types
DECIMAL, FLOAT, DOUBLE, NUMERIC
String types
CHAR(N), VARCHAR(N), TEXT even for different values of N on master and slave.
Binary types
BINARY(N), VARBINARY(N), BLOB even for different values of N on master and slave.
Bit types
Conversion between BIT(N) for different values of N on master and slave.
Since the string and binary types only differ in the character set they use—and replication is not aware of character sets yet—replication between string and binary types will be possible simply because the character set is not known. Don't rely on this though; as soon as Bug#47673 is fixed, string and binary types will be separated into distinct groups and replication will stop if the character sets don't allow conversion.

Within each group, we also have two types of conversions: non-lossy conversions and lossy conversions. With a non-lossy conversion you are guaranteed that no information is lost, but with lossy conversions it is possible that you lose some information. A typical example of a non-lossy conversion is converting from a CHAR(32) field to a CHAR(64) field—since the target field is wider than the source field, there is no risk that any part of the string is lost. Converting in the other direction, however, is a lossy conversion since a string with more than 32 characters cannot fit into a CHAR(32) field. A more unusual example is conversion between FLOAT and DECIMAL(N,M), which is always considered lossy, regardless of the direction in which the conversion is done, since it cannot be guaranteed that all floating-point numbers can be converted to decimal numbers without losing precision, and vice versa.

What conversions are allowed is controlled with a new server variable SLAVE_TYPE_CONVERSIONS, which is of the type SET('ALL_LOSSY','ALL_NON_LOSSY'), that is, it is a set of allowed conversions. The default for this variable is the empty set, meaning that no conversions are allowed at all.

If the ALL_NON_LOSSY constant is in the set, all conversions (within each group) that do not lose any information are allowed. For example, replicating from CHAR(32) to TINYTEXT is allowed since the conversion goes to a wider field (even if it is a different type).

If the ALL_LOSSY constant is in the set, all conversions (again, within the same group) that could potentially lose information are allowed. For example, conversion to a narrower field on the slave, such as CHAR(32) to CHAR(16), is allowed. Note that non-lossy conversions are not automatically allowed when ALL_LOSSY is set.
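
For example, to allow only safe conversions on a slave you could set the variable like this. The sketch uses Connector/Python with placeholder connection parameters and assumes the variable can be changed dynamically in your version; it can also be put in the configuration file instead.

import mysql.connector

conn = mysql.connector.connect(host='slave-host', user='admin', password='secret')
cursor = conn.cursor()
cursor.execute("SET GLOBAL SLAVE_TYPE_CONVERSIONS = 'ALL_NON_LOSSY'")
# To additionally allow conversions that may truncate or lose data:
# cursor.execute("SET GLOBAL SLAVE_TYPE_CONVERSIONS = 'ALL_LOSSY,ALL_NON_LOSSY'")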

The prefix ALL is used since we were considering the possibility of allowing conversions within certain groups only; for example, to allow only lossy conversions for strings and non-lossy conversions for integers, we could set SLAVE_TYPE_CONVERSIONS to 'STRING_LOSSY,INTEGER_NON_LOSSY'. This is, however, pure speculation at this time.
If you are interested in the details of how slave type conversions work, you can find more information in the MySQL Reference Manual in Replication with Differing Tables on Master and Slave.