Friday, April 30, 2010

Binary Log Group Commit - An Implementation Proposal

It is with interest that I read Kristian's three blogs on binary log group commit. In them, he identifies InnoDB's prepare_commit_mutex as the main hindrance to accomplishing group commit—which it indeed is—and proposes to remove it, with the motivation that FLUSH TABLES WITH READ LOCK can be used to get a good binlog position instead. That is a solution—but not really a good one—as Kristian himself points out in the last post.

The prepare_commit_mutex is used to ensure that the order of transactions in the binary log is the same as the order of transactions in the InnoDB log. Keeping the same order in the two logs is critical for getting a true on-line backup to work, so removing the mutex is not really an option—as Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log contain the transactions in the same order.

To understand how to solve the problem, it is necessary to take a closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.

The transaction data is stored in a per-thread transaction cache and the transaction size is the size of the data in the transaction cache. In addition, each transaction will have a transaction binlog position (or just transaction position) where the transaction data is written in the binary log.

The procedure can be outlined in the following steps:

  1. Prepare InnoDB [ha_prepare]:
    1. Write prepare record to log buffer
    2. fsync() log file to disk (this can currently do group commit)
    3. Take prepare_commit_mutex
  2. Log transaction to binary log [TC_LOG_BINLOG::log_xid]:
    1. Lock binary log
    2. Write transaction data to binary log
    3. Sync binary log based on sync_binlog. This forces the binlog to always fsync() (no group commit) due to prepare_commit_mutex
    4. Unlock binary log
  3. Commit InnoDB:
    1. Release prepare_commit_mutex
    2. Write commit record to log buffer
    3. Sync log buffer to disk (this can currently do group commit)
    4. InnoDB locks are released
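The interplay of the two locks in the steps above can be sketched in a toy Python model (for illustration only, not the actual server code; the commit_transaction function and the shared log list are my own constructs, with appends standing in for log writes):

```python
import threading

prepare_commit_mutex = threading.Lock()
binlog_lock = threading.Lock()
log = []  # records the interleaving of events in the two logs

def commit_transaction(xid):
    # 1. Prepare InnoDB: write and fsync() the prepare record, then
    #    take prepare_commit_mutex so no later transaction can enter
    #    the binary log before this one.
    prepare_commit_mutex.acquire()
    log.append(('innodb-prepare', xid))
    # 2. Log the transaction to the binary log while holding the
    #    binlog lock; sync_binlog forces an fsync() here, with no
    #    group commit possible due to prepare_commit_mutex.
    with binlog_lock:
        log.append(('binlog', xid))
    # 3. Commit InnoDB: release prepare_commit_mutex, then write
    #    and sync the commit record.
    prepare_commit_mutex.release()
    log.append(('innodb-commit', xid))
```

Running several of these concurrently, the 'binlog' entries always appear in the same order as the 'innodb-prepare' entries. That is exactly the ordering guarantee the mutex buys, at the cost of fully serializing step 2.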
There are two main problems with this approach:
  • The InnoDB row-level and table-level locks are released very late in the sequence, which hurts concurrency. Ideally, the locks should be released as early as possible, preferably as soon as InnoDB has been prepared.
  • It is not possible to perform a group commit in step 2.
As you can see here, the prepare of the storage engines (in this case just InnoDB) is done before the binary log mutex is taken, which means that if the prepare_commit_mutex is removed, one thread can overtake another so that the prepare and the writing to the binary log are done in different orders.

To solve this, Mark suggests using a queue or a ticket system to ensure that transactions are committed in the same order, but we actually already have such a system that we can use to assign tickets: namely the binary log.

The idea is to allocate space in the binary log for the transaction to be written. This gives us a sequence number that we can use to order the transactions.
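As a minimal sketch of this idea (the BinlogAllocator name is mine, and the locking corresponds to the "lock the binary log" step described later), reserving space amounts to advancing a shared counter, and the reserved position doubles as the transaction's ticket:

```python
import threading

class BinlogAllocator:
    """Hand out non-overlapping intervals of the binary log.

    The reserved start position doubles as a ticket number that
    orders the transactions.
    """
    def __init__(self):
        self.lock = threading.Lock()
        self.next_available = 0

    def reserve(self, size):
        # Atomically claim [pos, pos + size) for one transaction.
        with self.lock:
            pos = self.next_available
            self.next_available += size
            return pos

alloc = BinlogAllocator()
first = alloc.reserve(100)   # first transaction is written at offset 0
second = alloc.reserve(50)   # second transaction follows at offset 100
```

Since the counter only ever moves forward, the reserved positions form a total order on the transactions, which is all a ticket system needs to provide.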

In the worklog on binary log group commits you will find the complete description as well as the status of the evolving work.

In this post, I will outline an approach that Harrison and I have discussed, which we think will solve the problems mentioned above. In this post, I will outline the procedure during normal operations, in the next post I will discuss recovery, and in the third post (but likely not the last on the subject), I will discuss some optimizations that can be done.

I want to emphasize that the fact that we have a worklog does not involve any guarantees or promises of what, when, or even if any patches will be pushed to any release of MySQL.

In Worklog #4007, an approach for writing the binary log is suggested where space is allocated for the transaction in the binary log before actually starting to write it. In addition to avoiding unnecessary locking of the binary log, this also allows us to use the binary log to order the transactions in place. We will use this idea of reserving space in the binary log to implement binary log group commit.

By re-structuring the procedure above slightly, we can ensure that the transactions are written in the same order in both the InnoDB transaction log and the binary log.

There are two ways to re-structure the code: one simple, and one more complicated that can potentially yield better performance. To simplify the presentation, it is assumed that pre-allocation is handled elsewhere, for example using Worklog #4925. In a real implementation, pre-allocation can be handled either when a new binlog file is created or when transaction data is being written to the binary log.

The sequential write approach

Figure 1. Sequential binary log group commit
In the sequential write approach, the transactions are still written to the binary log in order and the code is just re-ordered to avoid keeping mutexes when calling fsync(). To describe the algorithm, three shared variables are introduced to keep track of the status of replication:
Next_Available
This variable keeps track of where a new transaction can be written
Last_Committed
This variable keeps track of the last committed transaction, meaning that all transactions preceding this position are actually on disk. This variable is not necessary in the real implementation, but it is kept here to simplify the presentation of the algorithm.
Last_Complete
This variable keeps track of the last complete transaction. All transactions preceding this point are actually written to the binary log, but are not necessarily flushed to disk yet.
You can see an illustration of how the variables are used with the binary log in Figure 1, where you can also see three threads, each waiting to write a transaction. All three variables are initially set to the beginning of the binary log, and it is always true that Last_Committed ≤ Last_Complete ≤ Next_Available. The procedure can be described in the following steps:
  1. Lock the binary log
  2. Save value of Next_Available in a variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Wait until Last_Complete = Trans_Pos. (This can be implemented using a condition variable and a mutex.)
    2. Write transaction data to binary log using pwrite. At this point, it is not really necessary to use pwrite since the transaction data is simply appended, but it will be used in the second algorithm, so we introduce it here.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log buffer
    2. Sync log buffer to disk, which currently can be group committed.
To implement group commit, it is sufficient to have a condition variable to wait on for a specified interval. If the interval passes without the data having been flushed by some other thread, the thread calls fsync() itself, after which it broadcasts the fact that the data is now on disk so that the other waiting threads can skip the flush. Typically, the code looks something along these lines (error checking is omitted to simplify the description):
pthread_mutex_lock(&binlog_lock);
/* trans_end_pos is Trans_Pos plus the transaction size */
while (Last_Committed < trans_end_pos) {
  struct timespec timeout;
  struct timeval now;
  gettimeofday(&now, NULL);
  timeout.tv_sec= now.tv_sec;
  timeout.tv_nsec= (now.tv_usec + 1000) * 1000;    /* wait at most 1 msec */
  if (timeout.tv_nsec >= 1000000000) {
    timeout.tv_sec+= 1;
    timeout.tv_nsec-= 1000000000;
  }
  int error= pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    fsync(binlog_fd);                 /* fsync() takes a file descriptor */
    Last_Committed= Last_Complete;
    pthread_cond_broadcast(&binlog_flush);
  }
}
pthread_mutex_unlock(&binlog_lock);
There are a few observations regarding this approach:
  • Step 6a requires a condition variable and a mutex when waiting for Last_Complete to reach Trans_Pos. Since there is just a single condition variable, it is necessary to broadcast a wakeup to all waiting threads, each of which will evaluate the condition only for a single thread to continue while the others go back to sleep.

    This means that the condition will be checked O(N²) times to commit N transactions. This is a waste of resources, especially if there are a lot of threads waiting, so if we can avoid it, we can gain performance.

  • Since the thread already has a position in the binary log where it can write, it could just as well start writing instead of waiting. It will not interfere with any other threads, regardless of whether locks are kept or not.
These observations lead us to the second approach, that of writing transaction data to the binary log in parallel.

A parallel write approach

Figure 2. Parallel binary log group commit
In this approach, each session is allowed to write to the binary log at the same time using pwrite since the space for the transaction data has already been allocated when preparing the engines. Figure 2 illustrates how the binary log is filled in (grey areas) by multiple threads at the same time. Similar to the sequential write approach, we still have the Last_Complete, Last_Committed, and Next_Available variables.

Each thread does not have to wait for other threads before writing, but it does have to wait for them before committing. This is necessary since we require the order of commits in the InnoDB log and the binary log to be the same. In practice, this does not pose a problem, since the I/O is buffered and the writes therefore go to in-memory file buffers.

The algorithm looks quite similar to the sequential write approach; notice, however, that in step 6 the transaction data is simply written to the binary log using pwrite.

  1. Lock the binary log
  2. Save value of Next_Available in a local variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Write transaction data to binary log using pwrite. There is no need to keep a lock to protect the binary log here since all threads will write to different positions.
    2. Wait until Last_Complete = Trans_Pos.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log
    2. Sync log file to disk
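The non-interfering writes of step 6a can be illustrated with os.pwrite on a scratch file (a toy sketch, not server code): since every thread writes only within its own reserved interval, the writes need no lock and can proceed in parallel.

```python
import os
import tempfile
import threading

fd, path = tempfile.mkstemp()

# Reserved (position, data) pairs, as if allocated in step 2;
# the intervals are contiguous and never overlap.
transactions = [(0, b"aaaa"), (4, b"bbbbbb"), (10, b"cc")]

# Each thread writes at its own offset with pwrite; no lock is
# needed since no two threads touch the same bytes.
threads = [threading.Thread(target=os.pwrite, args=(fd, data, pos))
           for pos, data in transactions]
for t in threads:
    t.start()
for t in threads:
    t.join()

content = os.pread(fd, 12, 0)   # b"aaaabbbbbbcc"
os.close(fd)
os.remove(path)
```

Regardless of the order in which the threads are scheduled, the file ends up with every transaction in its reserved slot, which is the property the parallel write approach relies on.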
This new algorithm has some advantages, but there are a few things to note:
  • When a transaction has committed, it is guaranteed that Trans_Pos ≤ Last_Committed for all threads (recall that Trans_Pos is a thread-local variable).
  • Writes are done in parallel, but the wait for the condition in step 6b still requires a broadcast that wakes up all waiting threads, while only one will be allowed to proceed. This means that we still have the O(N²) complexity of the sequential algorithm. For the parallel algorithm, however, it is possible to improve performance significantly, which we will demonstrate in the third part, where we discuss optimizations to the algorithms.
  • Recovery in the sequential algorithm is comparably simple since there are no partially written transactions. Considering that a crash can occur in the situation described in Figure 2, it is necessary to devise a method for recovering correctly. This we will discuss in the second part of these posts.

Tuesday, April 13, 2010

MySQL Conference Replication tutorial: Article and Demo Software

The MySQL Conference and Expo started with me and Lars Thalmann doing the replication tutorial. Unfortunately, we cannot distribute the slides at this time (please watch the replication tutorial page at the conference site), but there is a replication tutorial package for easy setup of servers to play around with—including some sample scripts—and a paper that explains how the package can be used as well as gives some example setups.

Friday, March 05, 2010

Going to the O'Reilly MySQL Conference & Expo

As I've been doing the last couple of years, I will be going to the O'Reilly MySQL Conference & Expo. In addition to the tutorial and the replication sessions that I will be holding together with Lars, I will be giving a session about the binary log together with Chuck from the Backup team, which the Replication team normally works very closely with.

This year, O'Reilly also has a Friend of the Speaker discount of 25% that you can use when you register using the code mys10fsp.

The sessions that we are going to hold are listed below. Note that I am using Microformats, which will allow you to easily extract and add the events to your calendar using, for example, the Operator plugin for Firefox.

See you there!

Mysteries of the Binary Log
April 14th, 2010 10:50am - 11:50am Room: Ballroom F
New Replication Features
April 13th, 2010 2:00pm - 3:00pm Room: Ballroom A
Replication Tricks & Tips
April 14th, 2010 2:00pm - 3:00pm Room: Ballroom B
The Replication Tutorial
April 12th, 2010 8:30am - 12:00pm Room: Ballroom E

Wednesday, February 03, 2010

MySQL Replicant: Architecture

MySQL Replicant Library
Class Design
In the previous post I described the first steps of a Python library for controlling the replication of large installations. The intention of the library is to provide a uniform interface to such installations, which will allow procedures for handling various situations to be written in a uniform language.

For the library to be useful, it is necessary to support installations that use different operating systems for the machines, as well as different versions of the servers. Specifically, it is necessary to allow some aspects of the system to vary.

  • Depending on the operating system, or even just how the server is installed on the machine, the procedures for bringing the server down and up will differ.

  • Configurations are managed in different ways depending on the deployment, and there are various other tools to manage configurations of large systems.

    As part of the management of the topology, it is necessary to change the configuration files, but this should play well with other tools.

    In either case, any specific method for configuration handling should neither be required nor enforced.

  • In the example in the previous article, the technique for cloning a server was demonstrated. In this case the naive method of copying the database files was used. For the general case, however, some backup method will be used, but it depends on the requirements of the deployment. In other words, it is necessary to parameterize the backup method as well.
  • Each server in the system has a specific role to fulfill. Some servers are final slaves whose only purpose is to answer queries, at least one server is a master, and some servers are relay servers.

To allow the system to be parameterized on these aspects, a set of abstract classes is introduced. In the figure you can see a UML diagram describing the high-level architecture of the Replicant library.

In the figure, there are four abstract classes:
Machine
The responsibility of this class is to handle all issues that are specific to the remote operating system, for example, fetching files or issuing commands to start and stop the server.
Config
The responsibility of this class is to maintain the configuration of a server. To do this, it may need to parse configuration files to be able to extract the specific section containing the definition.
BackupMethod
The responsibility of this class is to provide the primitives to create a backup and restore a backup. In both cases, the class supports taking a backup and potentially placing the backup image at a different machine, and restoring it.
Role
The responsibility of this class is to provide all the information necessary to configure a server in a role. Since a role does not only entail pure configuration information, but can also involve keeping certain tables and other database objects available, it is modeled as a separate class.
The central Server class relies on a Machine instance and a Config instance to implement the interface to the machine and to the configuration, respectively.
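A skeleton of the four abstract classes might look as follows (the Config and BackupMethod method names are taken from this article; the Machine and Role method names are my assumptions for illustration):

```python
from abc import ABC, abstractmethod

class Machine(ABC):
    # Handles everything specific to the remote operating system.
    # Method names here are hypothetical, not the library's API.
    @abstractmethod
    def start_server(self, server): ...
    @abstractmethod
    def stop_server(self, server): ...

class Config(ABC):
    # Method names from the configuration interface described below.
    @abstractmethod
    def get(self, option): ...
    @abstractmethod
    def set(self, option, value=None): ...
    @abstractmethod
    def remove(self, option): ...

class BackupMethod(ABC):
    # Method names from the BackupMethod interface described below.
    @abstractmethod
    def backup_to(self, server, url): ...
    @abstractmethod
    def restore_from(self, server, url): ...

class Role(ABC):
    # Hypothetical method name; configures a server for its role.
    @abstractmethod
    def imbue(self, server): ...
```

Concrete subclasses (a Linux machine, a tar-based backup, a master role) then fill in these methods, and the Server class is parameterized by whichever instances the deployment needs.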

Configuration Management

The configuration of the server is made part of the Replicant library since manipulating the server configuration is usually necessary when changing roles of servers.

Depending on the deployment, configuration managers such as cfengine or puppet may be used to administer the configuration of all servers, while in other deployments the configuration files are hand-edited (which only works for small deployments, since it would be a pain to administer large ones this way).

Long-term, there should be support for some safety measures when working with server configurations, so implementing an interface for handling server configurations in a safe transaction-like manner—or maybe this should be called a RCU-style manner—seems like a good idea. To support that, the following methods to fetch and replace configurations are introduced.

Server.fetch_config()
Returns a Config instance of the configuration for the server.
Server.replace_config(config)
Replace the configuration of the server with the modified configuration instance config.

This will allow an implementation to keep version numbers around to avoid conflicts, but is not required by the interface.
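A sketch of how version numbers could support such RCU-style replacement (the VersionedConfigStore class is purely illustrative and not part of the library):

```python
class VersionedConfigStore:
    """Illustrative store: replace_config succeeds only if the
    configuration has not changed since it was fetched."""

    def __init__(self):
        self.version = 0
        self.options = {}

    def fetch_config(self):
        # Hand out a copy stamped with the current version.
        return {'version': self.version, 'options': dict(self.options)}

    def replace_config(self, config):
        # Reject the replacement if someone replaced the
        # configuration after this copy was fetched.
        if config['version'] != self.version:
            raise RuntimeError("configuration changed concurrently")
        self.options = dict(config['options'])
        self.version += 1
```

The caller fetches, edits the copy, and replaces it; a concurrent editor holding a stale copy gets an error instead of silently clobbering the newer configuration.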

Each Config instance can then be manipulated by using the following methods:

Config.get(option)
Get the value of option as a string.
Config.set(option[, value])
Set the value of option to value. If no value is supplied, None is used, which denotes that the option is set but not given a specific string value.
Config.remove(option)
Remove the option from the configuration instance entirely.
So, for example, the log-bin option can be set in the following manner:
config = server.fetch_config()
config.set('log-bin', 'master-bin')
server.replace_config(config)

Machines

A MySQL server can run on many different machines and in many setups. A server can run on Linux, Solaris, or Windows, and even in those cases, there can be multiple servers on a single machine.

For a Linux machine with a single server, one usually uses the script /etc/init.d/mysql to start and stop the server—at least on my Ubuntu—but if multiple servers are used on a single machine, then mysqld_multi should be used instead.

For Windows and Solaris, the procedures for starting and stopping servers are entirely different. Windows starts and stops servers using net start MySQL and net stop MySQL, while Solaris uses the svcadm(1M) command.

To parameterize the system over the various ways it can be installed, the concept of a Machine is introduced (I actually had problems figuring out a name for this, but this was suggested to me and seems to be good enough).

The responsibility of the Machine class is to provide an interface to access the installed server together with installation information such as the location of configuration files.
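A hypothetical Machine implementation for a Linux host running a single server through /etc/init.d/mysql could look like this (the class and method names are illustrative assumptions, not the library's actual interface; it builds only on the Server.ssh primitive):

```python
class InitdLinuxMachine:
    """Hypothetical Machine for a Linux host with a single server
    managed through /etc/init.d/mysql."""

    defaults_file = "/etc/mysql/my.cnf"   # where the configuration lives

    def start_server(self, server):
        server.ssh(["/etc/init.d/mysql", "start"])

    def stop_server(self, server):
        server.ssh(["/etc/init.d/mysql", "stop"])
```

A Windows or Solaris variant would issue net start/net stop or svcadm commands instead, and a mysqld_multi variant would pass the server's number; the Server class itself stays unchanged.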

BackupMethod

One of the more important techniques when managing a set of servers is the ability to clone a slave or a master to create new slaves. Cloning involves taking a backup of a server and then restoring the backup image on the new slave. Since the techniques for taking backups vary a lot and different techniques will be used in different situations, parameterizing over the various backup methods is sensible.

BackupMethod.backup_to(server, url)
This method will take a backup of server and store it at the location indicated by url.
BackupMethod.restore_from(server, url)
This method will restore the backup image indicated by url into server.
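As an illustration, the naive tar-archive technique from the cloning example could be packaged as a backup method like this (the PhysicalTarBackup name and the hard-coded data directory are assumptions; it builds only on the Server primitives ssh, start, and stop):

```python
class PhysicalTarBackup:
    """Hypothetical BackupMethod that archives the data directory
    with tar, as in the naive cloning example."""

    def __init__(self, datadir="/usr/var/mysql"):
        self.datadir = datadir

    def backup_to(self, server, url):
        # Archive the server's data directory into url.
        server.ssh(["tar", "Pzcf", url, self.datadir])

    def restore_from(self, server, url):
        # The server must be down while its files are replaced.
        server.stop()
        server.ssh(["tar", "Pzxf", url, self.datadir])
        server.start()
```

A deployment that instead uses, say, a snapshot-capable file system or an online backup tool would provide its own subclass with the same two methods, and the cloning procedure would not need to change.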

Role

In a deployment, each server is configured to play a specific role. It can either be acting as a master, a slave, or even a relay. To represent a role, a separate Role class is introduced. Once a role is created, a server can be imbued with it.

  • Not every server has an assigned role.
  • Each server can have just a single role.
  • Each role can be assigned to multiple servers.

Since a role may encompass much more than just setting some configuration parameters, this more flexible approach was chosen. When imbuing a server with a role, a piece of Python code is executed to configure the server correctly.
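For illustration, a master role might be imbued like this (the MasterRole class and its option choices are assumptions, building only on the fetch_config/replace_config and Config.set primitives described above):

```python
class MasterRole:
    """Hypothetical role: imbuing a server runs Python code that
    sets the options a master needs (option choices illustrative)."""

    def imbue(self, server):
        config = server.fetch_config()
        config.set('log-bin', 'master-bin')
        server.replace_config(config)
```

Because imbuing runs arbitrary code, a role can go beyond configuration options, for example creating tables or accounts the role requires.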

The use of roles is actually just one of many possible choices, and with this approach there are two different ways that roles can be used. I am slightly undecided between the two and would like to hear comments on which one to use.

  1. Roles are just applied to the initial deployment and do not play any part after the system has been deployed. Roles are imbued into a server initially, and the configuration of the server can then be changed by procedures that manipulate the deployment.
  2. Roles exist for the entire lifetime of the deployment, and when a server changes roles, the Role instance will also change. Every server is assigned a role in the system, which is represented using a subclass of the Role class.

The first is by far the easiest to implement, which is why I chose it at this time. Since the roles are just containers for configuration options and other items that need to be added, they are easy to write. Since this is what is currently used in the library, it is also what you see in the class design above.

The second approach seems better, but it has a number of consequences:

  • Every server has to have a role class associated with it; even an "initial" role is required.
  • If the role changes, another role class will be associated with it. This forces the role class to not only be able to imbue a server in a role, but to also unimbue the server from that role.
  • It is not possible to change the configuration of a server directly; it has to be done by defining a role and then changing the server to that role. Unimbuing the server from a role becomes very hard if the configuration of the server is changed outside the control of the role.

Friday, December 18, 2009

MySQL Replicant: a library for controlling replication deployments

Keeping a MySQL installation up and running can be quite tricky at times, especially when having many servers to manage and monitor. In the replication tutorials at the annual MySQL Users' Conference, we demonstrate how to set up replication appropriately and also how to handle various issues that can arise. Many of these procedures are routine: bring down the server, edit the configuration file, bring the server up again, start a mysql client and add a user, etc.

It has always annoyed me that these procedures are perfect candidates for automation, but that we do not have the necessary interfaces to manipulate an entire installation of MySQL servers.

If there were an interface with a relatively small set of primitives—re-directing servers, bringing servers down, add a line to the configuration file, etc.—it would be possible to create pre-canned procedures that can just be executed.

To that end, I started writing a library that would provide an interface like this. Although I am more familiar with Perl, I picked Python for this project, since it seems to be widely used by many database administrators (it's just a feeling I have, I have no figures to support it). Just to have a cool name for the library, we call it MySQL Replicant, and it is (of course) available at Launchpad.

So what do we want to achieve with a library like this? Well... the goal is to provide a generic interface to complete installations and thereby make administration of large installations easy.

By providing such an interface, it will allow description of procedures in an executable format, namely as Python scripts.

In addition to making it easy to implement common tasks for experienced database administrators, it also promotes sharing by providing a way to write complete scripts for solving common problems. Having a pool of such scripts makes it easier for newcomers to get up and running.

The basic idea is that you create a model of the installation on a computer and then manipulate the model. When doing these manipulations, the appropriate commands—either as SQL commands to a running server or shell commands to the host where the server is running—will then be sent to the servers in the installation to configure them correctly.

So, to take small example, how does the code for re-directing a bunch of servers to a master look?

import mysqlrep, my_servers
for slave in my_servers.slaves:
   mysqlrep.change_master(slave, my_servers.master)
In this case, the installation is defined in a separate file and is imported as a Python module. Right now, the interface for specifying a topology is quite rough, but this is going to change.
from mysqlrep import Server, User, Linux

servers = [Server(server_id=1, host="server1.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux()),
           Server(server_id=2, host="server2.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux()),
           Server(server_id=3, host="server3.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux()),
           Server(server_id=4, host="server4.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux())]
master = servers[0]
slaves = servers[1:]
Here, the Server class represents a server, and to be able to do its job, it needs one MySQL account on the server and one shell account on the host machine. Right now, it is also necessary to specify the server ID, but the plan is to require just the host, port, socket, SQL account name, and SSH account information; the remaining information can then be fetched from the configuration file of the server. Each server has a small set of primitives on top of which everything else is built:
Server.sql(SQL command)
Execute the SQL command and return a result set.
Server.ssh(command list)
Execute the command given by the command list and return an iterator over the result output.
Server.start()
Start the server.
Server.stop()
Stop the server.
There is a small set of commands defined on top of these primitives that can be used. Here is a list of just a few of them, but there are some more in the library at Launchpad.
change_master(slave, master, position=None)
Change the master of slave to be master and start replicating from position.
fetch_master_pos(server)
Fetch the master position of server, which is the position where the last executed statement ends in the binary log.
fetch_slave_pos(server)
Fetch the slave position of server, which is the position where the last executed event ends.
flush_and_lock_database(server)
Flush all tables on server and lock the database for read.
unlock_database(server)
Unlock a previously locked database.
Using these primitives, it is easy to clone a master by executing the code below. For this example, I use the quite naive method of backing up a database: creating an archive of the database files and copying them to the new slave.
from mysqlrep import (flush_and_lock_database, fetch_master_pos,
                      unlock_database, change_master)
from subprocess import call

flush_and_lock_database(master)
position = fetch_master_pos(master)
master.ssh(["tar", "Pzcf", backup_name, "/usr/var/mysql"])
unlock_database(master)
call(["scp", master.host + ":" + backup_name, slave.host + ":."])
slave.stop()
slave.ssh(["tar", "Pzxf", backup_name, "/usr/var/mysql"])
slave.start()
change_master(slave, master, position)
What do you think? Would this be a valuable project to pursue? Here are some links related to this post:

Thursday, December 17, 2009

Using mysqld_multi on Karmic

I wanted to set up several servers on my machine using the Ubuntu distribution and control them using mysqld_multi: the typical way to manage several servers on one machine. However, I also wanted to use MySQL 5.1 and not 5.0, which is the default on Jaunty (Ubuntu 9.04). About a month ago, I upgraded to Karmic Koala, and one of the reasons was that MySQL 5.1 is used by default. Even though I could install the latest revision all the time, I usually want to use the real distributions for my private projects, for a number of reasons.

I actually tried to upgrade to MySQL 5.1 on Ubuntu 9.04, but I discovered that all kinds of applications had dependencies on MySQL 5.0, so I avoided upgrading at that time.

Anyway, the procedure for installing multiple servers on the same machine is this:

  1. Shut down the running server.

    This is, strictly speaking, not necessary unless you are going to edit the options for the running server, but I do this as a precaution.

  2. Edit your my.cnf configuration file and add sections for mysqld_multi and the new servers.

    I wanted to add four servers to play with, not counting the one that is already installed and running, so I added sections mysqld1 to mysqld4, as well as a section for mysqld_multi.

  3. Create server directories and database files using mysql_install_db

    The new servers need to be bootstrapped so that they have all the necessary databases and tables set up.

  4. Optionally: install an init.d script that uses mysqld_multi.

    This is currently not very well supported in Debian (there is actually a comment in the script saying that it is not supported), so I skipped this step. If you feel adventurous, you can always copy /usr/share/mysql/mysqld_multi.server to /etc/init.d/mysql.server as they suggest in the file, but I will not do it, nor recommend it (because I haven't tried it).

  5. Start the installed server(s).

    Well, not much to say here.

So, on my way, I edited the /etc/mysql/my.cnf and added the sections necessary. (You can see a diff of that below.)

The important options to add are server-id, so that each server gets a unique server ID (I'm going to replicate between them); port and socket, so that you can connect to each of them both from the local machine and from another machine; and pid-file, to give each server a unique PID file name (this is important, since the default will not work at all).

The next step is to install the data directories for the servers, which should be trivial:

$ sudo mysql_install_db --user=mysql --datadir=/var/lib/mysqlfoo --basedir=/usr
Installing MySQL system tables...
091120  9:40:23 [Warning] Can't create test file /var/lib/mysqlfoo/romeo.lower-test
091120  9:40:23 [Warning] Can't create test file /var/lib/mysqlfoo/romeo.lower-test
ERROR: 1005  Can't create table 'db' (errno: 13)
091120  9:40:23 [ERROR] Aborting

091120  9:40:23 [Warning] Forcing shutdown of 2 plugins
091120  9:40:23 [Note] /usr/sbin/mysqld: Shutdown complete


Installation of system tables failed!  Examine the logs in
/var/lib/mysqlfoo for more information.
    .
    .
    .
OK, the warning is just a warning, but it seems I got the permissions on the directory wrong. Checking the write permissions: no problems. Hmmm... checking that I can create the directories and files manually as the mysql user: no problems(!)

What on earth is going on?

After some digging around, I found bug #201799 which quite clearly explains that what I thought was a permission problem is actually AppArmor doing its job.

So updating the AppArmor configuration file /etc/apparmor.d/usr.sbin.mysqld with this solved the problem and I could get on with installing the servers.

diff --git a/apparmor.d/usr.sbin.mysqld b/apparmor.d/usr.sbin.mysqld
index f9f1a37..7a94861 100644
--- a/apparmor.d/usr.sbin.mysqld
+++ b/apparmor.d/usr.sbin.mysqld
@@ -21,10 +25,20 @@
   /etc/mysql/my.cnf r,
   /usr/sbin/mysqld mr,
   /usr/share/mysql/** r,
   /var/log/mysql.log rw,
   /var/log/mysql.err rw,
+  /var/log/mysql[1-9].log rw,
+  /var/log/mysql[1-9].err rw,
   /var/lib/mysql/ r,
   /var/lib/mysql/** rwk,
+  /var/lib/mysql[1-9]/ r,
+  /var/lib/mysql[1-9]/** rwk,
   /var/log/mysql/ r,
   /var/log/mysql/* rw,
+  /var/log/mysql[1-9]/ r,
+  /var/log/mysql[1-9]/* rw,
   /var/run/mysqld/mysqld.pid w,
   /var/run/mysqld/mysqld.sock w,
+  /var/run/mysqld/mysqld[1-9].pid w,
+  /var/run/mysqld/mysqld[1-9].sock w,
 }
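Note that AppArmor has to reload the edited profile before the change takes effect; one way to do that is to replace the loaded profile with apparmor_parser:

```shell
# Replace the in-kernel profile with the edited one.
sudo apparmor_parser -r /etc/apparmor.d/usr.sbin.mysqld
```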

Changes to /etc/mysql/my.cnf

Here is a unified diff of the changes I made to /etc/mysql/my.cnf to add some more servers.
$ git diff mysql/my.cnf
--- a/mysql/my.cnf
+++ b/mysql/my.cnf
@@ -111,7 +111,46 @@ max_binlog_size         = 100M
 # ssl-cert=/etc/mysql/server-cert.pem
 # ssl-key=/etc/mysql/server-key.pem
 
+[mysqld_multi]
+mysqld         = /usr/bin/mysqld_safe
+mysqladmin     = /usr/bin/mysqladmin
+user           = root
 
+[mysqld1]
+server-id      = 1
+pid-file = /var/run/mysqld/mysqld1.pid
+socket  = /var/run/mysqld/mysqld1.sock
+port  = 3307
+datadir = /var/lib/mysql1
+log-bin        = /var/lib/mysql1/mysqld1-bin.log
+log-bin-index  = /var/lib/mysql1/mysqld1-bin.index
+
+[mysqld2]
+server-id      = 2
+pid-file = /var/run/mysqld/mysqld2.pid
+socket  = /var/run/mysqld/mysqld2.sock
+port  = 3308
+datadir = /var/lib/mysql2
+log-bin        = /var/lib/mysql2/mysqld2-bin.log
+log-bin-index  = /var/lib/mysql2/mysqld2-bin.index
+
+[mysqld3]
+server-id      = 3
+pid-file = /var/run/mysqld/mysqld3.pid
+socket  = /var/run/mysqld/mysqld3.sock
+port  = 3309
+datadir = /var/lib/mysql3
+log-bin        = /var/lib/mysql3/mysqld3-bin.log
+log-bin-index  = /var/lib/mysql3/mysqld3-bin.index
+
+[mysqld4]
+server-id      = 4
+pid-file = /var/run/mysqld/mysqld4.pid
+socket  = /var/run/mysqld/mysqld4.sock
+port  = 3310
+datadir = /var/lib/mysql4
+log-bin        = /var/lib/mysql4/mysqld4-bin.log
+log-bin-index  = /var/lib/mysql4/mysqld4-bin.index
 
 [mysqldump]
 quick

Tuesday, November 03, 2009

Bisection testing using Quilt

Having produced a nice little series of 124 patches (yes, really), I recently had to find out what patch introduced a problem for distcheck to pass. Since distcheck takes quite some time to execute, I want to make as few runs as possible.

In Git, there is the bisect command that can be used to perform bisection testing of a series of patches, but quilt does not have anything like that, so to simplify my job, I needed to implement that for quilt.

I started by defining a shell function that did the actual test, and returned the result.

do_test () {
    echo -n "running distcheck..."
    make -j6 distcheck >/dev/null 2>&1
}
After that, I added code to set default values for some variables and to process the options to the script. The script supports two options, --lower and --upper, each taking a patch number: --lower gives the last patch known to pass the test, and --upper gives a patch known to fail it. I could have accepted patch names here, but numbers were good enough for my purposes.

Note that I am using Bash, since it has support for arrays.

series=(`quilt series`)                  # Array of the patch names
lower=0                                  # Lowest item in tested range
upper=$((${#series[@]} - 1))             # Upper limit of range

while true; do
    case "$1" in
        -l|--lower)
            lower="$2"          # the option's value is the next argument
            shift 2
            ;;
        -u|--upper)
            upper="$2"
            shift 2
            ;;
        *)
            break
            ;;
    esac
done

Then we prepare for the loop by computing the midpoint of the range and moving to the middle of the patches in the range.
middle=$(($lower + ($upper - $lower) / 2))
quilt pop -a >/dev/null
quilt push $middle >/dev/null
The main loop will keep pushing or popping depending on whether the current patch fails the test or succeeds. The invariant for the loop is that $middle holds the number of the current patch to be tested (the patch that quilt top would report), and we keep looping until $lower == $upper. Just to ensure that the right patch is tested, we test the invariant in the loop.

  • If the test succeeds, we know that the first failing patch is somewhere above this patch, between it and the last known failing patch. So, we compute the next midpoint to be between this patch and the last known failing patch, store it in middle, and push patches to reach that patch.
  • If the test fails, we know that the first failing patch is somewhere between the last known successful patch and this patch. So, we compute the next midpoint to be between the last successful patch and this patch, store it in middle, and pop patches to reach that patch.
while test $lower -lt $upper
do
    top=`quilt top`
    echo -n "$top..."

    if test "$top" != "${series[$(($middle-1))]}"; then
        echo "invariant failed ($top != ${series[$(($middle-1))]})!" 1>&2
        exit 2
    fi

    if do_test $lower $upper; then
        lower=$(($middle + 1))
        middle=$(($lower + ($upper - $lower) / 2))
        cnt=$(($middle - $lower + 1))
        echo -n "succeeded..."
        if test $cnt -gt 0; then
            echo -n "pushing $cnt patches..."
            quilt push $cnt >/dev/null
            echo "done"
        fi
    else
        upper=$middle
        middle=$(($lower + ($upper - $lower) / 2))
        cnt=$(($upper - $middle))
        echo -n "failed..."
        if test $cnt -gt 0; then
            echo -n "popping $cnt patches..."
            quilt pop $cnt >/dev/null
            echo "done"
        fi
    fi
done
Next task: extend quilt to support the bisect command.
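Meanwhile, the halving arithmetic above can be sanity-checked without quilt or distcheck. The sketch below runs the same kind of bisection loop against a fake 124-patch series, where FIRST_BAD (an invented value, just for this demonstration) marks the patch that introduced the problem; a real run would push/pop patches and run the test suite instead of comparing numbers.

```shell
# Simulated bisection over a fake series of 124 patches.
# do_test succeeds while fewer than FIRST_BAD patches are applied,
# i.e. patch number FIRST_BAD is the one that introduced the failure.
FIRST_BAD=87

do_test () {            # exit status 0 = test passes with $1 patches applied
    test "$1" -lt "$FIRST_BAD"
}

lower=0                 # highest patch count known to pass
upper=124               # lowest patch count known to fail
while test $(($lower + 1)) -lt $upper; do
    middle=$(($lower + ($upper - $lower) / 2))
    if do_test $middle; then
        lower=$middle
    else
        upper=$middle
    fi
done
echo "first failing patch: $upper"
```

Each iteration halves the range while keeping the invariant that $lower passes and $upper fails, so the loop ends with $upper pointing at the first failing patch.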