MySQL Musings: Various musings on mainly the development and technical side of MySQL, by Mats Kindahl.
<h1>MySQL Central: It's that time of the year</h1>
<p><em>2014-09-16</em></p>
<a href="http://www.oracle.com/openworld/mysql/index.html"><img class="figure-right" src="http://www.oracleimg.com/us/dm/h2fy11/oow-imspeaking-125x125-2225039.jpg"></a>
It's that time of the year again: yes, Oracle Open World is coming up, and with that I'll be travelling to San Francisco. New for this year is that we are part of the main Open World event and therefore have our own <em><a href="http://www.oracle.com/openworld/mysql/index.html" >MySQL Central</a></em>. There you will have the opportunity to meet many of the engineers behind MySQL, discuss technical problems you have, and learn how we look at the future of the MySQL ecosystem.<p>
This year, <a href="http://vnwrites.blogspot.se">Narayanan Venkateswaran</a> and I will be presenting two sessions:
<dl>
<div class="vevent">
<dt><strong><em><span class="summary">Elastic Scalability in MySQL Fabric with OpenStack</span></em> (<span class="dtstart" title="2014-10-02T13:15-08">Thursday, Oct 2, 1:15 PM</span>-<span class="dtend" title="2014-10-02T14:00-08">2:00 PM</span> in <span class="location">Moscone South, 252</span>)</strong></dt>
<dd><p class="description">In this session you will see how Fabric can use the new provisioning support to fetch servers from an <a href="https://www.openstack.org/">OpenStack</a> instance. The presentation will cover how to use the provisioning support to fetch servers from OpenStack Nova, OpenStack Trove, and also from Amazon AWS. You will also learn about the provisioning interface and how you can use it to create your own hardware registry support.</p>
</dd></div>
<div class="vevent">
<dt><strong><em><span class="summary">MySQL Fabric: High Availability at Different Levels</span></em> (<span class="dtstart" title="2014-10-01T14:00-08">Wednesday, Oct 1, 2:00 PM</span>-<span class="dtend" title="2014-10-01T14:45-08">2:45 PM</span> in <span class="location">Moscone South - 250</span>)</strong></dt>
<dd><p class="description">MySQL Fabric is a distributed system that requires coordination among different components to provide high availability: connectors, servers, and MySQL Fabric nodes must be orchestrated to create a solution resilient in the face of failures. Ensuring that each component alone is fault-tolerant does not guarantee that applications will continue working in the event of a failure. In this session, you will learn how all components in MySQL Fabric were designed to provide a high-availability solution and how they cooperate to achieve this goal. The presentation shows you how to create your own infrastructure to monitor MySQL servers and manage manual switchover or automatic failover operations together with MySQL Fabric.</p>
</dd></div>
</dl>
As in previous years, it's going to be fun to meet all the people in the MySQL community, both new faces and old, so I'm looking forward to seeing you all there.
<h1>MySQL Fabric: Musings on Release 1.4.3</h1>
<p><em>2014-05-27</em></p>
As you might have noticed in the <a href="http://www.oracle.com/us/corporate/press/2208808">press release</a>, we just released MySQL Utilities 1.4.3, containing MySQL Fabric, as a General Availability (GA) release. This concludes the first chapter of the MySQL Fabric story.<p>
It all started with the idea that it should be as easy to set up and manage a distributed deployment of MySQL servers as it is to manage the MySQL servers themselves. We also noted that the features of most interest were sharding and high-availability. Since we recognized that every user has different needs and wants to customize the solution, we set out to create a <em>framework</em> that would support sharding and high-availability, but also other solutions.<p>
With the release of 1.4.3, we have a range of features that are now available to the community, and all under an open source license and wrapped in an easy-to-use package:
<ul>
<li>High-availability support using built-in slave promotion in a master-slave configuration.</li>
<li>A framework with an execution machinery, monitoring, and interfaces to support management of large server farms.</li>
<li>Sharding using hash and range sharding. Range sharding is currently limited to integers, but hash sharding supports anything that looks like a string.</li>
<li>Shard management support to move and split shards.</li>
<li>Support for failure detectors, both built-in and custom ones.</li>
<li>Connectors with built-in load balancing and fail-over in the event of a master failure.</li>
</ul>
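As a rough illustration of the two sharding schemes above, range sharding compares an integer key against shard boundaries, while hash sharding maps any string-like key to a shard. This is only a sketch of the idea, not MySQL Fabric's actual implementation:

```python
import hashlib

def range_shard(key, boundaries):
    """Pick a shard for an integer key, given the sorted lower bound
    of each shard; e.g. [0, 1000, 2000] defines three shards."""
    shard = 0
    for i, lower in enumerate(boundaries):
        if key >= lower:
            shard = i
    return shard

def hash_shard(key, num_shards):
    """Pick a shard for any string-like key by hashing it. MD5 is used
    here only to get a stable, well-spread integer, not for security."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Integer keys route by range; string keys route by hash.
assert range_shard(1500, [0, 1000, 2000]) == 1
assert 0 <= hash_shard("user-4711", 4) < 4
```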
<h2>Beyond MySQL Fabric 1.4.3</h2>
As the MySQL Fabric story develops, we have a number of challenges ahead.<p>
<strong>Loss-less Fail-over. </strong>MySQL 5.7 has extended the support for semi-synchronous replication so that a transaction that has not been replicated to a slave server will not be committed. With this support, we can have truly loss-less fail-over: you cannot lose a transaction if a single server fails.<p>
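For reference, this behaviour in MySQL 5.7 is controlled by semi-synchronous replication settings along these lines on the master. This is a sketch; the semi-sync plugin must be installed for these variables to exist, and you should check the manual for your exact version:

```ini
# my.cnf fragment for the master; requires the rpl_semi_sync_master
# plugin to be installed.
rpl_semi_sync_master_enabled = 1
# AFTER_SYNC: do not commit until a slave has acknowledged receipt
rpl_semi_sync_master_wait_point = AFTER_SYNC
# fall back to asynchronous replication after 10 seconds without an ack
rpl_semi_sync_master_timeout = 10000
```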
<strong>More Fabric-aware connectors. </strong>We currently have support for Connector/J, Connector/PHP, and Connector/Python, but one common request is a Fabric-aware C API. This would serve both applications developed in C/C++ and connectors based on the MySQL C API, such as the Perl and Ruby connectors.<p>
<strong>Multi-Node Fabric Instance. </strong>Many have pointed out that the Fabric node is a single point of failure. It is indeed a single node, but if the Fabric node goes down, the system does not stop working. Since the connectors cache the data, they can "run on the cache" for the time it takes to bring the Fabric node up again. Procedures being executed will stop, but once the Fabric node is on-line again, execution will resume from where it left off. To ensure that the meta-data (the information about the servers in the farm) is not lost in the event of a machine failure, <a href="http://www.mysql.com/products/cluster">MySQL Cluster</a> can be used as the storage engine, which will ensure that your meta-data is safe.<p>
There are, however, a few advantages in having support for multiple Fabric nodes:
<ul>
<li>The most obvious advantage is that <strong>execution can fail-over to another node</strong> and there will be no interruption in the execution of procedures. If the fail-over is built-in, you avoid the need for external clusterware to manage several Fabric nodes.</li>
<li>If you have several Fabric nodes available to deliver data, you <strong>improve responsiveness to bursts in meta-data requests</strong>. This can happen if you have a large bunch of connectors brought on-line at the same time.</li>
<li>If you have multiple data centers, having a local version of the data to serve the applications deployed in the same center <strong>improve locality of data</strong> and avoid an unnecessary round-trip over WAN to fetch some meta-data.</li>
<li>With several nodes to execute management procedures, you can <strong>improve scaling</strong> by executing several management procedures in parallel. This would require some mechanism to ensure that procedures do not step on each other.</li>
</ul>
<strong>Location Awareness. </strong>In deployments spread over several data centers, the location of all the components suddenly becomes important. There is no reason for a connector to be directed to a remote server when a local one suffices, but that requires some sort of location awareness in the model, allowing the location of servers (or other components) to be given.<p>
Extending the model by adding data centers is not enough, though. The location of components <em>within</em> a data center might also be important. For example, if a connector is located in a particular rack in the data center, going to a different rack to fetch data might be undesirable. For this reason, the location awareness needs to be hierarchical and support several levels, e.g., continent, city, data center, hall, rack, etc.<p>
<strong>Multi-Shard Queries. </strong>Sharding can improve performance significantly since it splits the data horizontally across several machines, and each query therefore goes directly to the right shard of the data. In some cases, however, you also need to send queries to multiple shards. There are a few reasons for this:
<ul>
<li>You do not have the shard key available, so you want to search all servers for some object of interest. This of course affects performance, but in some cases there are few alternatives. Consider, for example, searching for a person given a name and an address when the database is sharded on the SSN.</li>
<li>You want to generate a report of several items in the database, for example, find all customers above 50 that have more than 2 cars.</li>
<li>You want a summary of some statistic over the database, for example, generate a histogram over the age of all your customers.</li>
</ul>
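In all three cases, a multi-shard query is essentially a scatter-gather operation: send the same query to every shard and merge the partial results. A minimal sketch of the pattern, with the per-shard execution stubbed out (the names here are invented, not a real Fabric API):

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, run_query, query):
    """Run the same query on every shard in parallel and concatenate
    the per-shard row lists into one result, in shard order."""
    with ThreadPoolExecutor(max_workers=max(len(shards), 1)) as pool:
        results = pool.map(lambda s: run_query(s, query), shards)
    rows = []
    for partial in results:
        rows.extend(partial)
    return rows

# Stand-in for real per-shard execution: each "shard" is just a dict.
fake_shards = [{"rows": [1, 2]}, {"rows": [3]}, {"rows": []}]
merged = scatter_gather(fake_shards, lambda s, q: s["rows"], "SELECT ...")
assert merged == [1, 2, 3]
```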
<strong>Session Consistency Guarantees. </strong>As <a href="http://alfranio-distributed.blogspot.pt/2014/05/mysql-fabric-server-properties-scaling.html" >Alfranio points out</a>, when you use multiple servers in your farm and transactions are sent to different servers at different times, it may well happen that you write one transaction to the master of a group and then try to read the data back from the same group. If the write has not yet reached the server you read from, you might get an incorrect result. In some cases this is fine, but in other cases you want certain guarantees on your session: for example, that anything you write is visible to reads in later transactions, or that successive reads always see progressively newer data (called "read monotonicity"), or other guarantees on the result sets you get back from the distributed database. This might require connectors to wait for transactions to reach slaves before reading, but it should be transparent to the application.<p>
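One way a connector could provide read-your-writes transparently is to remember an identifier for the session's last write (for example a GTID) and wait for the slave to catch up before reading. The sketch below is purely illustrative: the class and its interface are invented, and only the mention of MySQL's GTID wait function (<code>WAIT_FOR_EXECUTED_GTID_SET</code>, available in MySQL 5.7) refers to a real feature:

```python
class Session:
    """Tracks the last write's GTID so reads can wait for it on a slave.

    wait_for_gtid is injected; against a real slave it could run
    SELECT WAIT_FOR_EXECUTED_GTID_SET(gtid, timeout).
    """
    def __init__(self, wait_for_gtid):
        self._wait = wait_for_gtid
        self._last_gtid = None

    def record_write(self, gtid):
        self._last_gtid = gtid

    def consistent_read(self, slave, run_query, query):
        if self._last_gtid is not None:
            self._wait(slave, self._last_gtid)  # block until slave caught up
        return run_query(slave, query)

# Toy demonstration: the wait hook just records that it was called.
calls = []
session = Session(lambda slave, gtid: calls.append((slave, gtid)))
session.record_write("uuid:42")
session.consistent_read("slave-1", lambda s, q: "row", "SELECT ...")
assert calls == [("slave-1", "uuid:42")]
```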
This is just a small set of the possibilities for the future, so it is really going to be interesting to see how the MySQL Fabric story develops.
<ul>
<li>You can download MySQL Utilities (which includes MySQL Fabric) from <a href="http://dev.mysql.com/downloads/tools/utilities">http://dev.mysql.com/downloads/tools/utilities</a></li>
<li>You can read MySQL Utilities documentation at <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html">http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html</a></li>
<li>You can report bugs or request features on <a href="http://bugs.mysql.com">http://bugs.mysql.com</a></li>
<li>MySQL Forum <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a></li>
</ul>
<h1>MySQL Fabric: Tales and Tails from Percona Live</h1>
<p><em>2014-04-29</em></p>
Going to Percona Live and presenting MySQL Fabric gave me the opportunity to meet a lot of people and get a lot of good feedback. I talked to developers from many different companies, and their input will affect the priorities we make, so to everyone I spoke to: a big "Thank you!" for the interesting discussions we had. Your feedback is very valuable.
It was very interesting to read the comments on <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/fabric.html">MySQL Fabric</a> on the <a href="http://www.mysqlperformanceblog.com/2014/04/25/managing-farms-of-mysql-servers-with-mysql-fabric/">MySQL Performance Blog</a>. The article discusses the <a href="http://dev.mysql.com/downloads/tools/utilities/">current version of MySQL Fabric</a> distributed with <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/">MySQL Utilities</a> and gives some brief points on features of <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/fabric.html">MySQL Fabric</a>. I think it could be good to give some context to the points they raise, both to elaborate on what they mean in practice and to give some background on how we were thinking around these points.
<h1>The Art of Framing the Fabric</h1>
It was a deliberate decision to make MySQL Fabric extensible, so it is not surprising that it has the feel of a framework. By making MySQL Fabric extensible, we allow the community and users to explore ideas or add user-specific support.<p>
In the MySQL Team at Oracle we are strong believers in the open source model and are working hard to keep it that way. There are many reasons why we believe in this model, but <em>one</em> of them is that we do not believe that one size fits all. For any user, there are always minor variations or tweaks required by that user's specific needs, so the ability to adapt the solution is very important. Without MySQL being open source, this would not be possible. As you can see from <a href="http://webscalesql.org">WebScaleSQL</a>, this is not just a theoretical exercise; this is how companies really use MySQL.<p>
From the start, we therefore focused on building a framework and created the sharding and high-availability as plugins; granted, they are very important plugins, but they are nevertheless plugins. This took a little more effort, and a little more thinking, but by doing it this way we can ensure that the system is truly extensible for everybody.
<h1>Hey! I've got a server in my farm!</h1>
As noted, many of the issues related to high-availability and sharding require server-side support to be really solid. This is also something we recognized quite early; the alternative would be to place the logic in the connectors or the Fabric node. We recognized that the right place to solve this is in the server, not in the connector layer, since the latter puts a lot of complexity in the wrong place. Even if it were possible to handle everything in the connector, there is still a chance that something goes wrong if the constraints are not enforced in the server. This could be because of bugs, mistakes in the administration of the server, or any number of other reasons, so to build a solid solution, constraints on the data should be enforced by the servers and not by the connectors or a proxy.<p>
One example given is that there is no way to check that a row ends up in the right shard, which is very true. A generic solution would be to add <code>CHECK</code> constraints on the server, but unfortunately, that is a very big change to the server code-base. Adding triggers to the tables on the server is probably a good short-term solution, but that requires managing and deploying extra code on all servers, which in turn is an additional burden on managing the servers, something we would like to avoid (the more "special" things you have to do with the servers, the higher the risk of something going wrong).
<h1>On the proximity of things...</h1>
One of the central components of MySQL Fabric is the <em>high-availability group</em> (or just <em>group</em>, when it is clear from the context), discussed in <a href="http://mysqlmusings.blogspot.se/2013/10/mysql-fabric-high-availability-groups.html">an earlier post</a>. The central idea is that each group manages the same piece of data, and MySQL Fabric is designed to handle and coordinate multiple groups into a federation of databases. Being able to manage multiple groups is critical to creating a sharded system. One thing quite often raised is that it should be possible for a server to belong to multiple groups, but I think this comes from a misunderstanding of what a group represents. It is not a "replica set", which gives information about the topology, that is, how replication is set up, nor does it say anything about how the group is deployed. It is perfectly OK to have members of the group in different data centers (for geographical redundancy), and it is perfectly OK to have replication <em>between</em> groups to support, for example, functional partitioning. If a server belonged to two different groups, it would mean that it manages two different sets of data at the same time.<p>
The fact that group members can be located in different data centers raises another important aspect, something often mentioned at Percona Live: managing the <em>proximity</em> of components in the system. There is some support for this in Hadoop, where you have rack-awareness, but we need a slightly more flexible model. Imagine that you have a group set up with two servers in different data centers, with scale-out slaves attached locally and connectors deployed in both data centers. When reading data you do <em>not</em> want to go to the other data center to execute the transaction; it should always be done locally. So, is a simple grouping of the components sufficient? No, because there can be multiple levels of proximity, for example, data centers, continents, and even rooms or racks within a data center. You can also have different facets that you want to model, such as latency, throughput, or other properties that are interesting for particular uses. For that reason, whatever proximity model we adopt needs to support a hierarchy and a flexible cost model where you can express different aspects. Given that this problem was raised several times at Percona Live and also by others, it is likely to be something we need to prioritize.<p>
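To make the idea of hierarchical proximity concrete, here is a small sketch (invented names, not part of MySQL Fabric): each component's location is modeled as a path of levels, and a connector prefers the server whose location shares the longest prefix with its own.

```python
def common_prefix_len(a, b):
    """Number of leading location levels two paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def closest(connector_loc, servers):
    """Pick the server whose location shares the longest prefix with
    the connector's location (ties broken arbitrarily)."""
    return max(servers, key=lambda s: common_prefix_len(connector_loc, s["loc"]))

connector = ("eu", "stockholm", "dc1", "rack7")
servers = [
    {"name": "local",  "loc": ("eu", "stockholm", "dc1", "rack2")},
    {"name": "remote", "loc": ("us", "sf", "dc3", "rack1")},
]
# Same continent, city, and data center beats a match on nothing.
assert closest(connector, servers)["name"] == "local"
```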
<h1>The crux of the problem</h1>
As most of you have already noted, there is a single Fabric node running that everybody talks to. Isn't this a single point of failure? It is indeed, but there is more to the story than just this. A single point of failure is a problem because if it goes down, so does the system... but in this case, the system doesn't really go down; it keeps working most of the time.<p>
The Fabric node does a lot of things: it keeps track of the status of all the components of the farm, executes procedures to handle fail-over, and delivers information about the farm on request. However, the connectors are the ones that route transactions to the correct place, and to avoid having to ask the Fabric node for information each time, the connectors maintain caches. This means that in the event of a Fabric node failure, connectors might not even notice that it is gone unless they have to re-fill their caches, and if you restart the Fabric node, it will be able to serve the information again.<p>
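The "run on the cache" behaviour can be pictured as a lookup that falls back to the last known answer when the Fabric node is unreachable. A simplified sketch with an invented interface; the real connectors differ in detail:

```python
class FabricCache:
    """Caches Fabric lookups; if a refresh fails because the Fabric
    node is down, keep serving the last known answer."""
    def __init__(self, fetch):
        self._fetch = fetch  # e.g. an RPC call to the Fabric node
        self._cache = {}

    def lookup(self, key):
        try:
            self._cache[key] = self._fetch(key)
        except ConnectionError:
            # Fabric node unreachable: serve stale data if we have any.
            if key not in self._cache:
                raise
        return self._cache[key]

def fabric_down(key):
    raise ConnectionError("fabric node down")

cache = FabricCache(lambda key: {"master": "server-1"})
assert cache.lookup("shard-1") == {"master": "server-1"}
cache._fetch = fabric_down  # simulate the Fabric node going down
assert cache.lookup("shard-1") == {"master": "server-1"}  # stale but usable
```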
Another thing that stops when the Fabric node goes down is fail-over: no new fail-overs can be done, and ongoing procedures are stopped in their tracks, which could potentially leave the farm in an unknown state. However, the state of execution of any ongoing procedure is stored in the backing store, so when you bring up the Fabric node again, it will restore the procedures from the backing store and continue executing them. This feature alone does not help against a complete loss of the machine where the Fabric node and the backing store reside, but since MySQL Fabric does not rely on specific storage engine features (any transactional engine will do), using MySQL Cluster as the storage engine makes it possible to ensure safe-keeping of the state.<p>
There are still good reasons to support multi-node Fabric instances:
<ul>
<li>If one Fabric node goes down, it should automatically fail over to another and continue execution. This will prevent any downtime in handling executions.</li>
<li>Detecting failure and bringing up a secondary Fabric node can become very complicated in the case of network partitions, since it requires handling split-brain scenarios reliably. It is better to have this built into MySQL Fabric, since that makes deployment and management significantly simpler.</li>
<li>Management of a farm does not put any significant pressure on the database back-end, but having a single Fabric node can be a bottleneck. In this case, it would be good to be able to execute multiple independent procedures on different Fabric nodes and coordinate the updates.</li>
<li>If a lot of connectors are required to fill their caches at the same time, we have a risk of a thundering herd. Having a set of Fabric nodes for read scale-out can then be beneficial.</li>
<li>If a group is deployed in two very remote data centers, it is desirable to have a local Fabric node for read-only purposes instead of having to go to the other data center.</li>
</ul>
<h1>More Fabric-aware Connectors</h1>
Currently we support connectors for Python, Java, and PHP, but one point that pops up quite often (both at Percona Live and elsewhere) is the lack of a Fabric-aware C connector. The MySQL C API is the basis for implementing both the Perl Database Interface MySQL driver <code>DBD::mysql</code> and the Ruby connector, and a Fabric-aware C connector is also desirable in itself for applications using the C or C++ connectors. All I can say at this point is that we are aware of the situation and know that it is something desired and important.
<h1>Interesting links</h1>
<ul>
<li><a href="https://blogs.oracle.com/mysqltesting/entry/mysql_fabric_setup_using_ndb">MySQL Fabric Setup using ndb Cluster</a></li>
</ul>
<h1>MySQL Fabric 1.4.2 Released</h1>
<p><em>2014-04-01</em></p>
As you saw in the <a href="http://www.oracle.com/us/corporate/press/2180737">press release</a>, MySQL Fabric 1.4.2 is now released! If you're interested in learning more about MySQL Fabric, there is a session <span class="vevent"><time class="dtstart" datetime="2014-04-03T11:10-08:00">April 3, 2014 11:10</time>&ndash;<time class="dtend" datetime="2014-04-03T12:00-08:00">12pm</time> titled <a class="url summary" href="https://www.percona.com/live/mysql-conference-2014/sessions/sharding-and-scale-out-using-mysql-fabric">Sharding and Scale-out using MySQL Fabric</a> in <span class="location">Ballroom G</span></span>.
MySQL Fabric is a relatively new project in the MySQL ecosystem, and it focuses on building a framework for working with large deployments of MySQL Servers. The architecture of MySQL Fabric allows extensions to be added, and the first two extensions we added were support for high-availability using <i>High-Availability groups</i> (<i>HA groups</i>) and sharding to manage very large databases. The first version of sharding has hash and range sharding implemented, as well as procedures for moving and splitting shards.<br />
A critical part of working with a collection of servers is the ability to route transactions to the correct servers, and for efficiency reasons we decided quite early to put this routing logic into the connectors. This avoids an extra network hop and hence improves performance by reducing latency, but it does require that the connectors contain routing logic, caches, and support for fetching data from MySQL Fabric. Putting the routing logic into the connector also makes it easy to extend the API with new support that applications may require.<br />
MySQL Fabric 1.4.2 is distributed as part of MySQL Utilities 1.4.2. To avoid confusion, we have changed the version numbering to match the version of MySQL Utilities it is distributed in.
<br />
<ul>
<li>You can download MySQL Utilities 1.4.2 from <a href="http://dev.mysql.com/downloads/tools/utilities">http://dev.mysql.com/downloads/tools/utilities</a></li>
<li>You can read MySQL Utilities 1.4.2 documentation at <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html">http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html</a></li>
</ul>
We have only done a few public releases so far (plus a few internal ones), but a brief history of our releases is:
<br />
<ul>
<li><b>MySQL Fabric 1.4.0</b>
<ul>
<li>First public release</li>
<li>High-Availability groups for modeling farms</li>
<li>Event-driven Executor for execution of management procedures</li>
<li>Simple failure detector with fail-over procedures</li>
<li>Hash and Range sharding allowing management of large databases</li>
<li>Shard move and shard split to support management of a sharded database</li>
<li>Connector interfaces to support federated database systems</li>
<li>Fabric-aware Connector/Python (labs)</li>
<li>Fabric-aware Connector/J (labs)</li>
<li>Fabric-aware Connector/PHP (labs)</li>
</ul>
</li>
<li><b>MySQL Fabric 1.4.1</b>
<ul>
<li>More solid scale-out support in connectors and MySQL Fabric</li>
<li>Improvements to the Executor to avoid stalling reads</li>
<li>Connector/Python 1.2.0 containing:
<ul>
<li>Range and Hash sharding</li>
<li>Load-balancing support</li>
</ul>
</li>
<li>Labs release of Connector/J with Fabric support</li>
</ul>
</li>
<li><b>MySQL Fabric 1.4.2</b>
<ul>
<li>Credentials in MySQL Fabric</li>
<li>External failure reporting interfaces supporting external failure detectors</li>
<li>Support for unreliable failure detectors in MySQL Fabric</li>
<li>Credentials support in Connector/Python</li>
<li>Connector/Python 1.2.1 containing:
<ul>
<li>Failure reporting</li>
<li>Credentials support</li>
</ul>
</li>
<li>Connector/J 5.1.30 containing Fabric support</li>
</ul>
</li>
</ul>
<br />
<h2>
Do you want to participate?</h2>
There is a lot you can do if you want to help improve MySQL Fabric.
<br />
<ul>
<li>If you find bugs or want specific feature, please report a bug at <a href="http://bugs.mysql.com/">http://bugs.mysql.com</a></li>
<li>MySQL Forum <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a></li>
</ul>
<h2>
Blogs about MySQL Fabric</h2>
<ul>
<li><a href="http://alfranio-distributed.blogspot.com/">Alfranio Correia blogs on High-Availability and MySQL Fabric</a>
</li>
<li><a href="http://vnwrites.blogspot.com/">Narayanan Venkateswaran blogs on sharding and MySQL Fabric</a>
</li>
<li><a href="http://geert.vanderkelen.org/">Geert Vanderkelen blogs on Connector/Python and MySQL Fabric</a>
</li>
<li><a href="http://schlueters.de/blog/">Johannes Schlüter blogs on Connector/PHP and MySQL Fabric</a>
</li>
<li><a href="http://blog.ulf-wendel.de/">Ulf Wendel blogs about Connector/PHP and MySQL Fabric</a></li>
</ul>
<h1>MySQL Fabric: High Availability Groups</h1>
<p><em>2013-10-21</em></p>
As you might have noticed, we have released a framework for managing farms (or grids, as <a href="http://swanhart.livejournal.com/">Justin</a> suggested) of MySQL servers called <i>MySQL Fabric</i>. MySQL Fabric is focused on being easy to use and extensible, and two extensions are currently part of the framework: one to manage high-availability and one to implement sharding.<p>
<table class="figure-right"><caption>High-Availability Group</caption> <tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA-y1zugc3JTn08COy_c3jTA8H2fyqmNrNDeTWq4SQJviiMtpCj9iCBbNpWgjpyiooq-N5IrQ_BkXKa_xQau_wlaORZrE42iZSrni2MIzp8gshaMXwqfhBdS2J0-cGVJx-kdWK/s1600/ha_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
<h2>
High-Availability Groups</h2>
One of the central concepts used to construct a farm is the <i>high-availability group</i> (or just <i>group</i> when there is no risk of confusion), introduced by the high-availability extension. As mentioned in the <a href="http://mysqlmusings.blogspot.se/2013/09/brief-introduction-to-mysql-fabric.html">previous post</a>, the group concept does not really represent anything new but is rather a formalization of how we think about and work with the structure of the farm. The key to supporting high-availability is <i>redundancy</i> in the system: if one component fails, another should be ready to pick up the job of the failing component. Hardening the systems (by using hardware less prone to fail or hardware with built-in redundancy) can reduce the chance of a component failing, but not completely eliminate it. Even a hardened system is susceptible to failure in a power outage or an earthquake. With this in mind, we introduced the group concept for managing pieces of data in our farm: each group consists of several machines that are responsible for managing the same piece of data.<p>
The concept of a group is an abstraction to model the basic idea that we're after, but it does not say anything about how it is implemented. This is intentional: it should be concrete enough to support all the operations we need, but abstract enough not to restrict how it is implemented. This is important because connectors (or any other "outside" observer) that work with groups should not have to be updated whenever new implementations are added. For example, it should not make a difference to a connector whether the group is implemented using a traditional master-slave setup, a MySQL Cluster, or replicated storage such as DRBD.<p>
<h2>
Server properties in groups</h2>
There are a few key properties that we assume for groups:
<ul>
<li>A server belongs to (at most) one group.</li>
<li>At any time, each server in the group has a designated <i>status</i>.</li>
<li>At any time, each server has a <i>mode</i> indicating whether it accepts reads, writes, both, or neither.</li>
<li>Each server also has a <i>weight</i>, which is the relative power of the server and is used to balance the load.</li>
</ul>
Note that these properties might change over time, depending on events that happen. For handling load-balancing and high-availability, the properties <i>Status</i>, <i>Mode</i>, and <i>Weight</i> were introduced. The mode and weight properties are used by a connector when deciding where to send a transaction, while the status property is used by Fabric to keep track of the state of the server. Let's take a closer look at the properties.<p>
<table class="figure-right"><caption>Figure 1. Server Status</caption> <tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFHcVGO8eLAQtAekHx7zfOVoBJ-XZbzVmG-b_0459JVfc_8-ogIjyFLrNz3ZKfvD_pO-zjlKERlSOog4F7rxvI5ngq1r_uDwREk6vNVDEkZVv5My_aMRREHQqI9xZOdYkDpi28/s320/server-states.png" height="396" type="image/png" width="492"></object></td></tr>
</tbody></table>
<b>Server Status (or Role).</b> The status of the server provides information about what the server is currently doing in the group. It is Fabric's view of the server and changes as time passes and Fabric notices changes. A <i>primary</i> server accepts both a write and a read load, and sending high-priority read transactions there means that they get current data. A <i>secondary</i> server can handle reads but, in a master-slave configuration, should not accept writes since that would lead to a <i>split-brain</i> situation. Secondary servers are waiting to pick up the job of the primary if it fails. <i>Spare</i> servers accept neither reads nor writes, but are up and running and can therefore change status in the group to replace other servers in the event of failures. In addition, spare servers can be used to handle reads.<p>
In Figure 1 you can see an example of how servers could change status, but note that at this time, we do not track all states. For example, we are considering how to handle the provisioning of new servers in a flexible and extensible way, but more about that in a separate post.<p>
<b>Server Mode.</b> The mode of the server indicates whether it can be read or written and tells the connector how it should send queries. For now, there are only three modes: <i>Offline</i>, <i>Read-only</i>, and <i>Read-Write</i>. <i>Offline</i> servers cannot be read from or written to, and usually do not accept connections. <i>Read-only</i> servers can only be read from; write transactions should not be sent to them. <i>Read-Write</i> servers are usually primaries of the group. They can accept writes and will propagate them correctly to other servers in the group.<p>
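Since the candidate-selection example later in this post tests modes with a bitwise AND, one way to picture the three modes is as bit flags over a read bit and a write bit. The constant names and values below are invented for illustration and are not the actual constants used by the connectors:

```python
# Hypothetical bit-flag encoding of the server modes: one bit for reads
# and one for writes. Names and values are illustrative only.
MODE_OFFLINE    = 0b00   # accepts neither reads nor writes
MODE_READ_ONLY  = 0b01   # read bit set
MODE_READ_WRITE = 0b11   # both read and write bits set

def accepts(server_mode, wanted):
    """True if the server mode covers every access the transaction wants."""
    return (server_mode & wanted) == wanted
```

With this encoding, a read-only transaction matches both read-only and read-write servers, while a read-write transaction matches only read-write servers.<p>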
<b>Server Weight.</b> The weight of a server is used to balance the load between servers and represents the relative power of the server. When balancing the load, the connector figures out which servers are eligible for accepting a transaction and then picks one of them in such a way that the distribution over time is proportional to the weights of the servers.
<div class="sidebar figure-right" width="400pt">
<h2>
What's up with transactions?</h2>
So what is it with transactions that makes it necessary to declare the properties of the transaction <i>before</i> it actually starts? The reason is quite simple: the transaction is fed to the server one statement at a time. This means that it is not always possible to know whether it is a read-only or read-write transaction until the entire transaction has been seen.<p>
An example is this transaction (yes, it is a little contrived, but it is just an example):
<pre class="brush: sql;">START TRANSACTION;
SELECT salary INTO @salary FROM salaries
  WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
UPDATE titles SET to_date = CURRENT_DATE()
  WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
UPDATE salaries SET to_date = CURRENT_DATE()
  WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
INSERT INTO titles VALUES
  (20101, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'));
INSERT INTO salaries VALUES
  (20101, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'));
COMMIT;</pre>
From the first two statements alone it is not possible to tell whether this is a read-only or a read-write transaction; that only becomes clear when the first <code>UPDATE</code> is seen.</div>
<h2>
Transaction properties</h2>
As mentioned <a href="http://mysqlmusings.blogspot.se/2013/09/brief-introduction-to-mysql-fabric.html">before</a>, one of the goals is to support sharding in the presence of transactions, and to make that work correctly it is necessary to declare up-front what the transaction will contain. Not everything, but the key elements of the transaction: what tables it will access, what sharding key is used, and whether it is a read-only or read-write transaction. The first two properties are only necessary if you are working with a sharded system, so we skip those for now; the last one, however, is important for handling load-balancing in the connector.<p>
When executing transactions using a Fabric-aware connector, you provide the information about the transaction using <i>transaction properties</i>. There are several properties available, but we will focus on the ones related to group handling: <code>group</code> and <code>type</code>. The <code>group</code> property is used to provide the name of the group you want to connect to (you can have several), and the <code>type</code> property is used to tell whether this is a read-only or read-write transaction. In the future, we might add more properties, such as <i>priority</i> to indicate that this is an urgent transaction and a prompt reply is needed. For example, the following code uses a Fabric-aware connector to promote an employee.<p>
<pre class="brush: python;">from mysql.connector.fabric import (
    TYPE_READWRITE,
)

def promote_employee(conn, emp_no):
    stmts = [
        ("SELECT salary INTO @salary FROM salaries"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("UPDATE titles SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("UPDATE salaries SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("INSERT INTO titles VALUES"
         " (%s, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'))"),
        ("INSERT INTO salaries VALUES"
         " (%s, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'))"),
    ]

    # Use the group for the ACME company
    conn.set_property('group', 'ACME')
    conn.set_property('type', TYPE_READWRITE)
    conn.start_transaction()
    cur = conn.cursor()
    for stmt in stmts:
        print "Executing:", stmt % (emp_no,)
        cur.execute(stmt, (emp_no,))
    conn.commit()</pre>
The two <code>set_property</code> calls show how the properties of the transaction are set. In this case, we declare the group that we will access (for example, a fictional company "ACME") and also the type of the transaction. After that, a transaction is started and executed as normal. The Fabric-aware connector will pick the right server to send the transaction to, and you will get the result back in the normal fashion.<p>
<div class="note">
Note that the <code>type</code> property is not yet implemented in Connector/Python; some work remains to make it support load-balancing fully.</div>
<h2>
Picking a server</h2>
But are these server and transaction properties sufficient for a connector to decide what to do with a transaction? Let's take a look and see how the server can be selected.<p>
A server can be chosen by first selecting a set of candidates and then picking one of the candidates based on the weight of the server. Picking the candidates is done by matching the transaction properties against the server properties to find all servers that are eligible for accepting the transaction. When a list of candidates is available, you can, for example, pick one at random based on the weights of the servers. The example Python code below illustrates how this could be done. The first function <span class="symbol">find_candidates</span> computes the set of candidates from the set of all servers <span class="symbol">SERVERS</span>, while the second function <span class="symbol">pick_server</span> picks one of the servers at random based on the weight of the server.<p>
<pre class="brush: python;">from random import random

def find_candidates(props):
    candidates = []
    for srv in SERVERS:
        if props.group == srv.group and (props.mode & srv.mode):
            candidates.append(srv)
    return candidates

def pick_server(servers):
    random_weight = random() * sum(srv.weight for srv in servers)
    sum_weight = 0.0
    for srv in servers:
        sum_weight += srv.weight
        if sum_weight > random_weight:
            return srv
    return servers[-1]  # Last server in list

# Example code for picking a server based on transaction properties
pick_server(find_candidates(trans.props))</pre>
<h2>
Implementation of groups</h2>
The reason we introduced the group concept in this manner is to be able to vary the implementation of a group, so the question is: does it work? To see, it is instructive to consider some sample implementations of high-availability groups and check that they can be described in this manner, so let's do that. Note that the only version currently implemented is the <i>primary-secondary</i> approach; the other ones are just food for thought (at this point).<p>
<table class="figure-right"><tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNoJ-tHxZSQ2HsrruEbo8vhpVzPxFtoX5LzleJdYm3HEmdBDNP_q2j50QpLpdpSrSZohLwLuBR4VzKpW90O3c5flY5fJ8uoZMd2spRmZaucnIcZtU54ccBFb3TDLRoOCKBfOo6/s1600/ms_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
The <b>primary-secondary</b> approach (also known as <i>primary-backup</i> or <i>master-slave</i>) is the traditional way to set up MySQL servers for high availability. The idea is that there is a single <i>primary</i> managing the data and one or more <i>secondaries</i> that replicate the data from the primary and are ready to become primary in the event that the primary dies. In addition, there are a number of pure read slaves that are used to scale out reads.<p>
In this approach, the primary would be in read-write mode, while the secondaries can be either in read-only mode or offline: they cannot accept writes, since that might cause a split-brain situation. Not loading the secondaries with read-only transactions can make it easier for them to stay up to date with the primary, but this depends on the general load on the system. Any scale-out slaves added would, of course, be pure read-only servers, and they cannot be promoted to primary because they do not have the binary log enabled. However, if the primary fails, they still need to fail over to the new primary.<p>
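The mode assignment described above can be summarized as a small mapping from status to mode. This is only an illustrative sketch of the policy for a primary-secondary group, not an actual Fabric data structure:

```python
# Illustrative status-to-mode policy for a primary-secondary group.
# Whether secondaries are read-only or offline is a deployment choice.
PRIMARY_SECONDARY_POLICY = {
    'primary':   'read-write',
    'secondary': 'read-only',   # or 'offline' to keep secondaries unloaded
    'scale-out': 'read-only',
    'spare':     'offline',
}

def mode_for(status):
    """Look up the mode a server should be in, given its status."""
    return PRIMARY_SECONDARY_POLICY[status]
```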
If (when?) the primary fails, MySQL Fabric will detect the failure and start executing a procedure to promote one of the secondaries to be primary in place of the one that failed. MySQL Fabric has to do this because the servers do not know how to handle the fail-over themselves, and in addition it is necessary to inform the connectors about the new topology. In this procedure, the scale-out servers have to be moved over to the new primary as well.<p>
<table class="figure-right"><tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8cYlpmcK5jk99T-8hVHO0RbflSQdbcInXyMbh31YQfuAeuQplUpXvAhSYK84fiFEhYCxnXQfXCnCB3hTicnqk-1-xQdksRPmuaMezDQEfLOc9GpafXYdPgPBZn8RmJ0zX5hxx/s1600/drbd_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
Another popular solution for high availability is <b>shared storage</b> (for example, using shared network disks) or <b>replicated storage</b> (for example, using <a href="http://dev.mysql.com/doc/refman/5.0/en/ha-drbd.html">DRBD to replicate the block device</a>). In this case, one of the servers will be online while the other is on standby. For both DRBD and shared storage, it is necessary that the standby is completely offline, and in the case of DRBD the server should not even be running on the standby machine. In addition to the primary and the secondary, you could have read slaves attached to the primary.<p>
In this setup, the primary would then be a read-write server, while the standby server would be in offline mode. Scale-out servers would be in read-only mode, in this case attached to the primary.
<br clear="right" />
<table class="figure-right"><tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_m4rPDII0O5fidchDgfBs_Ypu-4IokmEDwUnoRgdSnhsiFO_I8OOxK6PLg8ILHIZ5joHNH24osC3z9Um6t2ryHvY9fDCBrq9kjQjFoZlvy2d4TkOjIc8Kth0suMgST9FW0DXM/s1600/mysql_cluster_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
Another approach is to use <a href="http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster.html"><b>MySQL Cluster</b></a> as a group. The cluster consists of several <i>data nodes</i> and employs a shared-nothing architecture to ensure high availability. In this case, all the servers will be both read and write servers, and all might be primaries. In the event that an NDB data node fails, the other nodes are always ready to pick up the job, so a MySQL Cluster group is self-managing. (There is an excellent overview of MySQL Cluster at <a href="http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-overview.html">http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-overview.html</a>.)<p>
The first two solutions above employ different fail-over procedures that are executed by the Fabric node when it notices the failure. In contrast, MySQL Cluster is self-governing and does not require any fail-over handling in the Fabric node.
<br clear="right" />
<h2>
Summary and considerations for the future</h2>
For the examples above, the properties we have outlined are definitely sufficient, but there might be other cases where more information is needed.<p>
One property missing in the current implementation is a way to select a server based on its proximity to the connector. For example, it could be possible to put the primaries and secondaries of a group in different data centers to ensure that the group can survive a catastrophic failure. This, however, raises two issues:<p>
<ol>
<li>There will be a set of read servers in each data center that should be connected to the primary or secondary in the same data center.</li>
<li>When the connector picks one of the candidates, it should prefer to use those in the same data center.</li>
</ol>
Both these issues mean that we need some measure of the distance between servers and connectors, so that a new scale-out server is not added in such a way that the same data is shipped several times between data centers, and so that a connector does not connect to a server in a different data center. Adding a complete matrix with distances between each and every server would not really work well, so it is likely that some other way to model the proximity is needed.<p>
Another case that might require additional information is when the Fabric node fails or is temporarily unavailable (for instance, while restarting). Such an event should not block the entire system, and since the connectors have the information cached it would be possible to "run on the cache" for a brief period. The key issue here is that nothing can be updated, so each connector needs to have a fallback plan in the event that a server fails. For example, if a master fails, the fail-over will not be executed, but it would still be possible to read information from the slaves.
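As a thought experiment, the simplest proximity model is to tag each server with a location label and have the connector prefer candidates in its own location, falling back to the full candidate list when no local server qualifies. The <code>location</code> attribute below is invented for illustration and is not part of the current server properties:

```python
from collections import namedtuple

# Minimal stand-in for a server record; 'location' is a hypothetical
# property naming the data center the server runs in.
Server = namedtuple('Server', 'host location weight')

def prefer_local(candidates, my_location):
    """Keep only candidates in the connector's own data center,
    falling back to the full list if none are local."""
    local = [srv for srv in candidates if srv.location == my_location]
    return local or candidates
```

A connector in data center "dc1" would then run the usual weighted pick over the local subset first, only crossing data centers when it has no other choice.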
<h2>
Related Links</h2>
<ul>
<li>MySQL Forum <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a></li>
<li>Presentation from MySQL Connect: <a href="http://www.slideshare.net/mkindahl/mysql-sharding-tools-and-best-practices-for-horizontal-scaling">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</a></li>
<li><a href="http://alfranio-distributed.blogspot.com/2013/09/tips-to-build-fault-tolerant-database.html">Tips to Build a Fault-Tolerant Database Application</a></li>
<li><a href="http://alfranio-distributed.blogspot.com/2013/09/writing-fault-tolerant-database.html">Writing a Fault-tolerant Application using MySQL Fabric</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-introduction.html">Fabric Sharding &mdash; Horizontal Scaling of MySQL</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-migration.html">MySQL Fabric Sharding - Migrating From an Unsharded to a Sharded Setup</a></li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com5tag:blogger.com,1999:blog-23496029.post-32141861927126662912013-10-08T22:00:00.000+02:002013-10-10T16:57:49.784+02:00MySQL Connect presentations on MySQL Fabric available on SlideShareGoing to MySQL Connect was truly a blast. We got a lot of good questions and feedback in the sessions and there was a lot of interest in both MySQL Fabric and the MySQL Applier for Hadoop.<br />
<br />
A big thank you to all who attended the talks; I got a lot of good questions and comments that will help us build good solutions.<br />
<br />
The talks are available on SlideShare: <br />
<ul>
<li><a href="http://www.slideshare.net/mkindahl/mysql-sharding-tools-and-best-practices-for-horizontal-scaling">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</a> </li>
<li><a href="http://www.slideshare.net/mkindahl/my-sql-connect2013hadoopapplier">MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS</a></li>
<li><a href="http://schlueters.de/blog/archives/175-Sharding-PHP-with-MySQL-Fabric.html">Sharding PHP with MySQL Fabric</a></li>
<li><a href="http://www.slideshare.net/alfranio1/mysql-connect-fabrichacon2300">MySQL High Availability: Managing Farms of Distributed Servers</a> </li>
</ul>
Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com2tag:blogger.com,1999:blog-23496029.post-75110564655754350872013-09-21T18:35:00.000+02:002013-09-21T18:57:49.633+02:00A Brief Introduction to MySQL FabricAs you saw in the keynote, we are introducing an integrated framework for managing <i>farms</i> of MySQL servers with support for both high-availability and sharding. It should be noted that this is a very early alpha and that, at this point, it is not ready for production use.<br />
<br />
MySQL Fabric is an integrated system for managing a collection of MySQL servers and is the framework on which high-availability and sharding are built. MySQL Fabric is open source and is intended to be extensible, easy to use, and able to support procedure execution even in the presence of failures, an execution model we call <i>resilient execution</i>.<br />
<br />
To ensure high-availability, it is necessary to have redundancy in the system. For database systems, the redundancy traditionally takes the form of having a primary server acting as a master and using replication to keep secondaries available to take over in case the primary fails. This means that the "server" that the application connects to is in reality a collection of servers, not a single server. In a similar manner, if the application is using a sharded database, it is in reality working with a collection of servers, not a single server. In this case, we refer to a collection of servers as a <i>farm</i>.<br />
<br />
Now, just having a collection of servers does not really help us that much: it is necessary to have some structure imposed on the farm as well as an API to work with the farm, and this is where MySQL Fabric comes in.<br />
<br />
Before going over the concepts, have a look at the farm below. In the figure, there is an application that wants to connect to the farm, and there is a set of database servers at the bottom organized into groups called <i>high-availability groups</i>. To manage the structure of the farm, there is a <i>MySQL Fabric Node</i> available that, among other things, keeps track of the meta-data and handles procedure execution. Both the application and an operator can connect to the Fabric node to get information about the farm being managed.<br />
<br />
<table class="figure-inflow"><tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKWcYvw4bbgpiFow5dC5pw0jLDVPHPO2lR6_3UHeidTejjiIXzjxRIJzLWGamQNoDdT1OgQ3Dxbn6dvY3104mHV1Zx61wyakVNTWidvB6Igc5G38i0UtW-2mKKLPjmA1Sz20LK/s1600/mysql_fabric_farm.png" type="image/png" width="856" height="531"></object></td></tr>
</table><br />
The <i>MySQL Fabric nodes</i> are responsible for keeping track of the structure of the farm as well as all information about the servers' status, and they are also where any management procedures are executed. If a server fails, the Fabric node will handle the procedure of promoting one of the slaves to be the new master, but it also contains the logic for handling shard moving and shard splitting.<br />
<br />
<table class="figure-right"><caption>High-Availability Group</caption>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg84VVtciZqoFivRGTbyqCn09SzoUWVm9svy8EXSXUTcuhhuJLcIX3mNJCCS35mAoee9UfxPvZjsU_gnsZ_I1w-2-up0tjRaOaC7VLCuHU1XBWKqqlYoeqvYeeaL311b1dU6a9g/s1600/ha_group.png" type="image/png" width="175" height="182"></object></td></tr>
</table><br />
<h2>High-availability Groups</h2><br />
The central concept to handling high-availability in Fabric are the <i>high-availability groups</i>. These are collections of servers that shall work together to deliver a database service to the application that connects. The groups are introduced to give structure to the farm and allow you to describe how the servers are organized to support the redundancy necessary to ensure high-availability.<br />
<br />
Inside the group, all the servers manage the same data and hence have the same schema. In addition, each server has a distinct role and no server can belong to more than one group. The group concept does not really represent anything new: it is just a way to structure the farm in such a manner that managing it is easy to understand and the roles of the servers are clear. To manage redundancy, a traditional <i>Master-Slave Setup</i> is used, a topology that we are all familiar with (it is often called the <i>Primary-Secondary Approach</i>, hence the names that follow). Each group has a <i>primary</i> that is the master for all the data. Any queries that update data are sent there, and the changes are propagated to the other servers in the group. Redundancy is achieved by keeping one or more <i>secondaries</i> in the group that receive changes from the primary and are ready to take over the role as primary should the primary disappear. To handle scale-out, the group also contains <i>scale-out servers</i>, which are servers that receive changes from the primary but are not eligible for promotion to primary. In the group, there are also <i>spares</i>, which are servers that are available for use but have not been assigned any active role yet.<br />
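To make the role structure concrete, a sanity check over a group could verify that every server has a known role and that there is exactly one primary. The function below is a sketch with invented names, not part of MySQL Fabric:

```python
# Hypothetical sketch: check the role invariants of a high-availability
# group, given a mapping from server name to role.
VALID_ROLES = {'primary', 'secondary', 'scale-out', 'spare'}

def check_group(roles):
    """True if every role is known and the group has exactly one primary."""
    if not all(role in VALID_ROLES for role in roles.values()):
        return False
    return sum(1 for role in roles.values() if role == 'primary') == 1
```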
<br />
<h2>Sharding</h2><br />
In addition to high-availability support, MySQL Fabric also offers support for <i>sharding</i>, which is a technique for handling very large databases and/or very high write loads. The database is split into a large number of <i>shards</i>, where each shard contains a fragment of the data in the database. Each shard is stored on a separate server (or a separate <i>set</i> of servers if you want to ensure high availability) and transactions are directed to the right shard based on the sharding key. Splitting the database in this way allows you to manage a larger database by spreading it over more servers, but it also scales the write traffic because writes can execute independently on different shards.<br />
<br />
When using sharding, MySQL Fabric separates tables into <i>sharded tables</i> and <i>global tables</i>. MySQL Fabric allows you to shard just some of the tables in the database and keep the other tables available on all shards; the latter are the global tables. Since databases usually have multiple tables with foreign key relationships between them, it is critical to be able to shard several tables the same way (but possibly on different columns), which is something that MySQL Fabric supports. Using this support, you specify all the tables to be sharded and which column should be used as the sharding key in each, and MySQL Fabric will shard all the tables and distribute the rows over the shards.<br />
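To see how a sharding key directs a transaction to a shard, consider a minimal sketch of a range-based shard map. The data and function below are invented for illustration; the real mapping is maintained by the Fabric node and consulted by the connectors:

```python
import bisect

# Illustrative range-based shard map: each lower bound marks where a
# shard's key range begins; the shard covers keys up to the next bound.
LOWER_BOUNDS = [0, 10000, 20000]
GROUPS = ['shard-1', 'shard-2', 'shard-3']

def group_for_key(key):
    """Return the group holding the shard whose range contains the key:
    the entry with the largest lower bound not exceeding the key."""
    idx = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    return GROUPS[idx]
```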
<br />
If you want to know more about how Fabric support sharding, or about sharding in general, you should come to the session <span class="vevent"> <i><span class="summary">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</span></i>, <span class="dtstart" title="2013-09-21T16:00-08">September 21, 4:00pm</span>-<span class="dtend" title="2013-09-21T17:00-08">5:00pm</span> in <span class="location">Imperial Ballroom B</span></span>.<br />
<br />
<br />
<h2>Connecting to a MySQL Farm</h2><br />
To provide better control when working with a MySQL farm we have extended the connector API in such a manner that it hides the complexities of handling fail-over in the event of a server failure as well as dispatching transactions to shards correctly. There is currently support for Fabric-aware versions of Connector/J, Connector/PHP, Connector/Python as well as some rudimentary support for Hibernate and Doctrine. If you are interested in how the extensions to the interface look and how you can use them to scale your application, you should come to <span class="vevent"> <i><span class="summary">Scaling PHP Applications</span></i>, <span class="dtstart" title="2013-09-22T10:00-08">September 22, 10:00am</span>-<span class="dtend" title="2013-09-22T11:00-08">11:00am</span> in <span class="location">Union Square Room 3/4</span></span>.<br />
<br />
<br />
<h2>More information about MySQL Fabric</h2><br />
There are several blogs being published on high-availability and sharding from the developers working on the Fabric system or the connectors.<br />
<br />
<ul><li><a href="http://alfranio-distributed.blogspot.com/2013/09/tips-to-build-fault-tolerant-database.html">Tips to Build a Fault-Tolerant Database Application</a></li>
<li><a href="http://alfranio-distributed.blogspot.com/2013/09/writing-fault-tolerant-database.html">Writing a Fault-tolerant Application using MySQL Fabric</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-introduction.html">Fabric Sharding - Horizontal Scaling of MySQL</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-migration.html">MySQL Fabric Sharding - Migrating From an Unsharded to a Sharded Setup</a></li>
</ul><br />
If you are interested in discussing and asking questions about MySQL Fabric, or sharding and high-availability in general, the forum on <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a> is an excellent place for discussions. Also, if you are at MySQL Connect, going to <span class="vevent"><i><span class="summary">MySQL Sharding, Replication, and HA</span></i> (<span class="dtstart" title="2013-09-21T17:30-08:00">September 21, 5:30</span>-<span class="dtend" title="2013-09-21T18:30-08:00">6:30pm</span> in <span class="location">Imperial Ballroom B</span>)</span> is an excellent opportunity to learn more about the project, meet us developers and team leads, and provide feedback to us. The BOF will cover several areas, some overlapping, but the discussion is likely to cover MySQL Fabric and MySQL replication.<br />
<br />
Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com7tag:blogger.com,1999:blog-23496029.post-21156200936829349242013-08-29T20:10:00.001+02:002013-08-29T20:12:03.579+02:00Going to MySQL Connect 2013<a href="http://www.oracle.com/go/?&Src=7328809&Act=414&pcode=WWMK11054264MPP004"><img class="figure-right" src="http://www.oracleimg.com/us/dm/h2fy11/183037-mysql-tk-imspeaking-125x125-1951645.gif"></a><br />
<br />
MySQL Connect 2013 is coming up with several interesting new sessions. Some sessions that I am participating in got accepted for the conference, so if you are going there, you might find the following sessions interesting. For your convenience, the sessions have <a rel="profile" href="http://microformats.org/profile/hcalendar">hCalendar</a> markup, so it should be easier to add them to your calendar.<br />
<br />
<dl><div class="vevent">
<dt><strong><em><span class="summary">MySQL Sharding, Replication, and HA</span></em> (<span class="dtstart" title="2013-09-21T17:30-08:00">September 21, 5:30</span>-<span class="dtend" title="2013-09-21T18:30-08:00">6:30pm</span> in <span class="location">Imperial Ballroom B</span>)</strong></dt>
<dd class="description"> <p>This session is an opportunity for you to meet the MySQL engineering team and discuss the latest tools and best practices for sharding MySQL across distributed server farms while maintaining high availability.</p><p>Come here and meet Lars, Luis, Johannes, and me, and bring up any questions or comments you have regarding sharding, high-availability, and replication. We might have some informal presentations available to discuss sharding and high-availability issues.</p></dd></div><div class="vevent">
<dt><strong><em><span class="summary">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</span></em> (<span class="dtstart" title="2013-09-21T16:00-08">September 21, 4:00pm</span>-<span class="dtend" title="2013-09-21T17:00-08">5:00pm</span> in <span class="location">Imperial Ballroom B</span>)</strong></dt>
<dd><p class="description">In this session, Alfranio and I will discuss how to create and manage a sharded MySQL database system. The session covers tools, techniques, and best practices for handling various aspects of sharding such as: <ul><li>Hybrid approaches mixing sharding and global tables.</li>
<li>Sharding in the presence of transactions and how that affects applications.</li>
<li>Turning an unsharded database into a sharded database<br />
system.</li>
<li>Handling schema changes to a sharded database.</li>
</ul></p></dd></div><div class="vevent">
<dt><strong><em><span class="summary">MySQL and Hadoop: Big Data Integration—Unlocking New Insights</span></em> (<span class="dtstart" title="2013-09-22T11:30-08">September 22,<br />
11:30am</span>-<span class="dtend" title="2013-09-22T12:30-08">12:30pm</span> in <span class="location">Taylor</span>)</strong></dt>
<dd><p class="description">Hadoop enables organizations to gain deeper insight into their customers, partners, and processes. As the world's most popular open source database, MySQL is a key component of many big data platforms. This session discusses technologies that enable integration between MySQL and Hadoop, exploring the lifecycle of big data, from acquisition via SQL and NoSQL APIs through to delivering operational insight and tools such as Sqoop and the Binlog API with Hadoop Applier enabling both batch and real-time integration.</p></dd></div><div class="vevent">
<dt><strong><em><span class="summary">MySQL High Availability: Managing Farms of Distributed Servers</span></em> (<span class="dtstart" title="2013-09-22T17:30-08">September 22, 5:30pm</span>-<span class="dtend" title="2013-09-22T18:30-08">6:30pm</span> in <span class="location">Imperial Ballroom B</span>)</strong></dt>
<dd> <p class="description">In this session, Alfranio and I will discuss tools, best practices, and frameworks for delivering high availability using MySQL. For example, handling such issues as ensuring server redundancy, handling failure detection and executing recovery, re-directing clients in the event of failure.</p></dd> </div><div class="vevent">
<dt><strong><em><span class="summary">Scaling PHP Applications</span></em> (<span class="dtstart" title="2013-09-22T10:00-08">September 22, 10:00am</span>-<span class="dtend" title="2013-09-22T11:00-08">11:00am</span> in <span class="location">Union Square Room 3/4</span>)</strong></dt>
<dd> <p class="description">When building a PHP application, it is necessary to consider both scaling and high-availability issues. In this session, Johannes and I will discuss how to ensure that your PHP application scales well and can work in a high-availability environment.</p></dd></div></dl>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com1tag:blogger.com,1999:blog-23496029.post-23080553972056588002012-10-11T16:25:00.001+02:002013-10-08T22:31:50.693+02:00Round Robin Replication using GTIDIn a <a href="http://mysqlmusings.blogspot.se/2011/04/round-robin-multi-source-in-pure-sql.html">previous post</a> I showed how to implement multi-source round-robin replication in pure SQL using the tables that are needed for crash-safe replication. I also outlined a revised version of this approach in the <a href="http://www.slideshare.net/mkindahl/replication-tips-tricks">Replication Tips & Tricks presentation</a> I gave at <a href="http://www.oracle.com/mysqlconnect">MySQL Connect</a>. This was, however, before the GTID (Global Transaction ID) implementation was done. Now that they are introduced, multi-source replication is even easier since you no longer have to keep track of the positions.<br />
<br />
<table class="figure-right"><caption><b>Figure 1. </b>Tables for storing information about masters</caption> <tbody>
<tr><td><pre class="brush: sql;">CREATE TABLE my_masters (
  idx INT AUTO_INCREMENT,
  host CHAR(50) NOT NULL,
  port INT NOT NULL DEFAULT 3306,
  PRIMARY KEY (idx),
  UNIQUE INDEX (host, port)
);

CREATE TABLE current_master (
  idx INT
);

CREATE TABLE replication_user (
  name CHAR(40),
  passwd CHAR(40)
);</pre></td></tr>
</tbody></table><br />
One caveat is that this only works if you are replicating from servers that have GTID enabled, so if you are replicating from a pre-5.6 server, you can use the original implementation. I have added a re-factored version of that code at the end of this post, where you will also find some utility procedures that I use in the version described here.<br />
<br />
For the version that uses GTID, we still keep the two tables from the original implementation, but we remove the file and position columns, giving the definitions seen in Figure 1. Also, the user and password are stored in a separate table and are assumed to be identical for all machines.<br />
<br />
To fetch the new master, I created a <code>fetch_next_master</code> procedure that fetches the next master in turn and then advances <code>current_master</code> to it. The second select in the code below handles the wrap-around case for a table with masters defined as in Table 1: if there is no master with a higher <code>idx</code>, rotation starts over from the first master.<br />
<br />
<br />
<pre class="brush: sql;">delimiter $$
CREATE PROCEDURE fetch_next_master(
    OUT p_host CHAR(50), OUT p_port INT UNSIGNED)
BEGIN
  DECLARE l_next_idx INT DEFAULT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx > (SELECT idx FROM current_master)
   ORDER BY idx LIMIT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx >= l_next_idx
   ORDER BY idx LIMIT 1;

  UPDATE current_master SET idx = l_next_idx;

  SELECT host, port INTO p_host, p_port
    FROM my_masters WHERE idx = l_next_idx;
END $$
delimiter ;</pre><br />
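The selection logic of the two SELECT statements can be illustrated with a small Python sketch (a toy model for illustration, not part of the post's SQL):

```python
# Toy model of fetch_next_master: my_masters is a list of (idx, host, port)
# rows sorted by idx, current_idx is the value stored in current_master.
def fetch_next_master(my_masters, current_idx):
    # First SELECT: smallest idx strictly greater than the current one.
    next_rows = [row for row in my_masters if row[0] > current_idx]
    # Second SELECT: wrap around to the first master when none is found.
    return next_rows[0] if next_rows else my_masters[0]

masters = [(1, "master-1", 3306), (4, "master-2", 3306), (9, "master-3", 3306)]
fetch_next_master(masters, 4)  # → (9, 'master-3', 3306)
fetch_next_master(masters, 9)  # → (1, 'master-1', 3306), i.e. wrap-around
```

Note that the logic works even when the <code>idx</code> values have gaps, which is why the SQL version uses two selects instead of simple arithmetic on the index.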
Since we no longer need to save the position, the code for the <code>multi_source</code> event is significantly simpler. All that is necessary is to change master to the next master in turn: the server automatically remembers which transactions are missing and will start replicating from the correct position.<br />
<br />
<pre class="brush: sql;">delimiter $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 1 MINUTE DO
BEGIN
  DECLARE l_host CHAR(50);
  DECLARE l_port INT UNSIGNED;
  DECLARE l_user CHAR(40);
  DECLARE l_passwd CHAR(40);

  SET SQL_LOG_BIN = 0;
  CALL stop_slave_gracefully();
  START TRANSACTION;
  CALL fetch_next_master(l_host, l_port);
  SELECT name, passwd INTO l_user, l_passwd FROM replication_user;
  -- No file or position is needed with GTID: pass NULL for both.
  CALL change_master(l_host, l_port, l_user, l_passwd, NULL, NULL);
  COMMIT;
  START SLAVE;
END $$
delimiter ;</pre><br />
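The reason no position is needed can be pictured as simple set arithmetic over GTIDs: each server knows the set of transactions it has executed, and the master sends exactly the ones the slave is missing. A sketch for illustration only (the server's internal GTID-set representation is more compact than this):

```python
def missing_transactions(master_gtids, slave_gtids):
    # The master sends every transaction the slave has not yet executed.
    return sorted(master_gtids - slave_gtids)

master_gtids = {"uuid-1:1", "uuid-1:2", "uuid-1:3", "uuid-2:1"}
slave_gtids = {"uuid-1:1", "uuid-1:2"}
missing_transactions(master_gtids, slave_gtids)  # → ['uuid-1:3', 'uuid-2:1']
```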
<h2>Full code for original implementation</h2><br />
Here is the code for replicating from pre-5.6 to 5.6 using the replication tables added for implementing crash-safe slaves. <br />
<br />
Compared to the version described in the earlier post, I have added a few utility procedures, such as a procedure to stop the slave gracefully. The procedure first stops the I/O thread, and then lets the SQL thread empty the relay log before it too is stopped. This is mainly to avoid having to re-transfer a lot of events from the master. Compared to the <a href="http://mysqlmusings.blogspot.se/2011/04/round-robin-multi-source-in-pure-sql.html">version provided in the previous post</a>, I have also factored out some separate procedures; you can see the re-factored version at the end of the post.<br />
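The graceful-stop logic can be sketched like this in Python; the <code>conn</code> object and its <code>execute</code>/<code>slave_status</code> methods are hypothetical stand-ins for illustration, not a real connector API:

```python
def stop_slave_gracefully(conn):
    # Stop the I/O thread first so no new events enter the relay log.
    conn.execute("STOP SLAVE IO_THREAD")
    # Let the SQL thread drain the relay log before stopping it.
    while True:
        status = conn.slave_status()  # hypothetical: one row of SHOW SLAVE STATUS
        drained = (status["Relay_Master_Log_File"] == status["Master_Log_File"]
                   and status["Exec_Master_Log_Pos"] == status["Read_Master_Log_Pos"])
        if drained:
            break
    conn.execute("STOP SLAVE SQL_THREAD")
```

A production version would of course sleep between polls and give up after a timeout; the sketch only shows the order of operations.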
<br />
<br />
<pre class="brush: sql;">delimiter $$
CREATE PROCEDURE change_master(
    p_host CHAR(40), p_port INT,
    p_user CHAR(40), p_passwd CHAR(40),
    p_file CHAR(40), p_pos BIGINT)
BEGIN
  SET @cmd = CONCAT('CHANGE MASTER TO ',
                    CONCAT('MASTER_HOST = "', p_host, '", '),
                    CONCAT('MASTER_PORT = ', p_port, ', '),
                    CONCAT('MASTER_USER = "', p_user, '", '),
                    CONCAT('MASTER_PASSWORD = "', p_passwd, '"'));
  IF p_file IS NOT NULL AND p_pos IS NOT NULL THEN
    SET @cmd = CONCAT(@cmd,
                      CONCAT(', MASTER_LOG_FILE = "', p_file, '"'),
                      CONCAT(', MASTER_LOG_POS = ', p_pos));
  END IF;
  PREPARE change_master FROM @cmd;
  EXECUTE change_master;
  DEALLOCATE PREPARE change_master;
END $$
delimiter ;
delimiter $$
CREATE PROCEDURE save_position()
BEGIN
  UPDATE my_masters AS m,
         mysql.slave_relay_log_info AS rli
     SET m.log_pos = rli.master_log_pos,
         m.log_file = rli.master_log_name
   WHERE m.idx = (SELECT idx FROM current_master);
END $$
delimiter ;
delimiter $$
CREATE PROCEDURE fetch_next_master(
    OUT p_host CHAR(40), OUT p_port INT UNSIGNED,
    OUT p_file CHAR(40), OUT p_pos BIGINT)
BEGIN
  DECLARE l_next_idx INT DEFAULT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx > (SELECT idx FROM current_master)
   ORDER BY idx LIMIT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx >= l_next_idx
   ORDER BY idx LIMIT 1;

  UPDATE current_master SET idx = l_next_idx;

  SELECT host, port, log_pos, log_file
    INTO p_host, p_port, p_pos, p_file
    FROM my_masters
   WHERE idx = l_next_idx;
END $$
delimiter ;
delimiter $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 10 SECOND DO
BEGIN
  DECLARE l_host CHAR(40);
  DECLARE l_port INT UNSIGNED;
  DECLARE l_user CHAR(40);
  DECLARE l_passwd CHAR(40);
  DECLARE l_file CHAR(40);
  DECLARE l_pos BIGINT;

  SET SQL_LOG_BIN = 0;
  STOP SLAVE;
  START TRANSACTION;
  CALL save_position();
  CALL fetch_next_master(l_host, l_port, l_file, l_pos);
  SELECT name, passwd INTO l_user, l_passwd FROM replication_user;
  CALL change_master(l_host, l_port, l_user, l_passwd, l_file, l_pos);
  COMMIT;
  START SLAVE;
END $$
delimiter ;</pre><br />
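The interplay of <code>save_position</code> and <code>fetch_next_master</code> in the original implementation can be modeled in a few lines of Python (a toy model; in the real code the position comes from <code>mysql.slave_relay_log_info</code>):

```python
# Toy model of pre-GTID rotation: each master row remembers the binlog
# file and position where we stopped reading last time.
masters = {
    1: {"host": "master-1", "file": "binlog.000001", "pos": 0},
    2: {"host": "master-2", "file": "binlog.000007", "pos": 0},
}
current = {"idx": 1, "file": "binlog.000001", "pos": 0}

def save_position(read_file, read_pos):
    # Mirrors the UPDATE against mysql.slave_relay_log_info.
    masters[current["idx"]]["file"] = read_file
    masters[current["idx"]]["pos"] = read_pos

def rotate():
    # Pick the next idx, wrapping around, and restore its saved position.
    idxs = sorted(masters)
    nxt = min([i for i in idxs if i > current["idx"]] or idxs)
    current.update(idx=nxt, file=masters[nxt]["file"], pos=masters[nxt]["pos"])

save_position("binlog.000002", 1500)   # we read master 1 up to here
rotate()                               # now on master 2
rotate()                               # back on master 1, at the saved position
# current == {"idx": 1, "file": "binlog.000002", "pos": 1500}
```

This is exactly the bookkeeping that GTID-based replication makes unnecessary.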
<br />
<br />
<br />
Mats Kindahl

<h2>Binary Log Group Commit in MySQL 5.6 (2012-06-05)</h2><!-- Background --> With the release of MySQL 5.6, binary log group commit is included: a feature focused on improving the performance of a server when the binary log is enabled. In short, binary log group commit improves performance by grouping several writes to the binary log instead of writing them one by one; I will digress a little below on how transactions are logged to the binary log before going into the details. First, though, let us look at what you have to do to turn it on.<p>Nothing.<p>Well... there are actually a few options to tweak it, but nothing is required to turn it on. It even works for existing engines, since we did not have to extend the handlerton interface to implement binary log group commit. However, InnoDB has some optimizations that take advantage of the binary log group commit implementation.<br />
<dl><dt><code>binlog_order_commits={0|1}</code> <dd>This is a global variable that can be set without stopping the server.<p>If this is off (0), transactions may be committed in parallel. In some circumstances, this might offer some performance boost. In the measurements we did, there was no significant improvement in throughput, but we decided to keep the option anyway, since there are special cases where it can offer improvements.<br />
<dt><code>binlog_max_flush_queue_time=<var>microseconds</var></code> <dd>This variable controls when to stop skimming the flush queue (more about that below) and move on as soon as possible. Note that this is <em>not</em> a timeout on how often the binary log should be written to disk since grabbing the queue and writing it to disk takes time. </dl><h2>Transactions galore...</h2>As the server executes transactions the server will collect the changes done by the transaction in a per-connection <em
class="def">transaction cache</em>. If statement-based replication is used, the statements will be written to the transaction cache, and if row-based replication is used, the actual rows changed will be written to the transaction cache. Once the transaction commits, the transaction cache is written to the binary log as one single block. This allows each session to execute independently of the others: a lock on the binary log is needed only when writing the transaction data to it. Since transactions are isolated from each other, it is enough to serialize the transactions on commits. (Of course, this is in an ideal world. Transactions can see other transactions' changes if you set a different transaction isolation level. You would never do that unless you knew exactly what you're doing... right?) <table class="figure-inflow"><caption>Figure 1. Two-Phase Commit Protocol (2PC)</caption>
<tr><td> <object title="Two-Phase Commit Protocol (2PC)"
data="http://www.kindahl.net/images/2pc_msc.svg" type="image/svg+xml" width="600" height="302"> </object> </td></tr>
</table>In order to keep the storage engine and the binary log in sync, the server employs a <em class="def">two-phase commit protocol</em> (or just <em>2PC</em>) that you can see in Figure 1. As you can see, there is one call to <code>write()</code> and one call to <code>fsync()</code> in the diagram: I'll get to that in just a moment, so stay tuned.<p>The entire point of using a two-phase commit protocol is to be able to guarantee that the transaction is either in both the engine and the binary log (or in neither), even in the event that the server crashes after the prepare and subsequently recovers. That is, it should not be possible for the transaction to be in the engine but not in the binary log, or vice versa. Two-phase commit solves this by requiring that once a transaction is prepared in the engine, it can be either fully committed or fully rolled back even if the server crashes and recovers. So, on recovery, the storage engine will provide the server with all the transactions that are prepared but not yet committed, and the server will then commit a transaction if it can be found in the binary log, and roll it back otherwise.<p>This is, however, only possible if the transaction can be guaranteed to be persistent in the binary log before being committed in the engine. Since disks are slow and memory is fast, the operating system tries to improve performance by keeping part of the file in memory instead of writing directly to disk. Once enough changes have been written, or the memory is needed for something else, the changes are written to disk.
This is good for the operating system (and also for anybody using the computer), but causes a problem for the database server: if the server crashes, it is possible that a transaction is committed in the storage engine while there is no trace of it in the binary log.<p>For recovery to work properly, it is therefore necessary to ensure that the file is really on disk, which is why there is a call to <code>fsync()</code> in Figure 1: it forces the in-memory part of the file to be written to disk.<p><h2>The Infamous <code>prepare_commit_mutex</code></h2><table class="figure-right"><caption>Figure 2. Missing un-committed transactions</caption>
<tr><td> <object title="Missing un-committed transactions"
data="http://www.kindahl.net/images/binlog-uncommitted.svg" type="image/svg+xml"
width="325.40039" height="308.55518"> </object> </td></tr>
</table>When the server recovers, it has access to the binary log and can therefore decide what to commit and what to roll back. But what if there is no binary log?<p>In the general case, a recovery can just roll back all prepared transactions and start again. After all, the transactions that were just prepared but not committed are safe to roll back: they just move the database back to the state it had before those transactions started. No connected client has received an indication that the transaction is committed, so they will realize that the transactions have to be re-executed.<p>There is another case where recovery is used in this way, and that is when using on-line backup methods such as InnoDB Hot Backup (which is used in MySQL Enterprise Backup). These tools take a copy of the database files and InnoDB transaction logs directly—which is an easy way to take a backup—but it means that the transaction logs contain transactions that have just been prepared. On recovery, they roll back all those transactions and arrive at a database in a consistent state.<p>Since these on-line backup methods are often used to bootstrap new slaves, the binary log position of the last committed transaction is written in the header of the InnoDB redo log. On recovery, the recovery program prints the binary log position of the last committed transaction and you can use this information with the <code>CHANGE MASTER</code> command to start replicating from the correct position. For this to work correctly, it is necessary that all transactions are committed in the same order as they are written to the binary log. If they are not, there can be "holes" where some transactions are written to the binary log but not yet committed, which would cause the slave to miss transactions that were not committed. The problematic case that can arise is what you see in Figure 3 below. <table class="figure-inflow"><caption>Figure 3. Committing in parallel</caption>
<tr><td> <object title="Commit Procedure Stages"
data="http://www.kindahl.net/images/ibbackup_snapshot.svg" type="image/svg+xml"
width="600" height="376"> </object> </td></tr>
</table>You can see an example of this in Figure 2, where replication will start from the last committed position, but there is a transaction that was just prepared and hence was rolled back when the backup was restored on the slave.<p>To solve this, InnoDB added a mutex called the <code>prepare_commit_mutex</code> that was taken when preparing a transaction and released when committing it. This is a simple solution to the problem, but causes some problems of its own that we will get to in a minute. Basically, the <code>prepare_commit_mutex</code> solves the problem by forcing the call sequence to be as in Figure 4. <table class="figure-inflow"><caption>Figure 4. Sequencing transactions</caption>
<tr><td> <object title="Sequencing transactions"
data="http://www.kindahl.net/images/ibbackup_prepare_commit_mutex.svg" type="image/svg+xml"
width="600" height="291"> </object> </td></tr>
</table><!-- Problem --> <h2>Steady as she goes... NOT!</h2>Since disk writes are slow, writing every transaction to disk will affect performance quite a lot... well, actually very much...<p>To handle that, there is a server option <code>sync_binlog</code> that controls how often the binary log should be synced to disk. If it is set to 0, the operating system will decide when the file pages should be written to disk; if it is set to 1, then <code>fsync()</code> will be called after every transaction is written to the binary log. In general, if you set <code>sync_binlog</code> to <em class="math">N</em>, you can lose at most <em class="math">N-1</em> transactions, so in practice there are just two useful settings: <code>sync_binlog=0</code> means that you accept that some transactions can be lost and handle it some other way, and <code>sync_binlog=1</code> means that you do not accept losing any transactions at all. You could of course set it to some other value to get something in between, but in reality you can either handle transaction loss or not.<p>To improve performance, the common approach is to bundle many writes with each sync: this is what the operating system does, and it is what we should be able to do. However, if you look at Figure 4 you see that there is no way to place an <code>fsync()</code> call in that sequence so that several transactions are written to disk at the same time. Why? Because at any point in that sequence there is at most one prepared and written transaction. However, if you go back to Figure 3, you can see that it would be possible to place an <code>fsync()</code> as shown and write several transactions to disk at the same time: all transactions written to the binary log before the <code>fsync()</code> call would be written to disk at once. But this requires ordering the commits in the same order as the writes without using the <code>prepare_commit_mutex</code>. <!-- Implementation --> <h2>So, how does all this work then...</h2>The binary log group commit implementation splits the commit procedure into several stages, as you can see in Figure 5. The stages are entirely internal to the binary log commit procedure and do not affect anything else. In theory, it would be possible to have another replication implementation with another policy for ordering commits. Since the commit procedure is separated into stages, several threads can be processing transactions at the same time, which also improves throughput. <table class="figure-inflow"><caption>Figure 5.
Commit Procedure Stages</caption>
<tr><td> <object title="Commit Procedure Stages"
data="http://www.kindahl.net/images/stages.svg" type="image/svg+xml"
width="560.28125" height="126"> </object> </td></tr>
</table>For each stage, there is an input queue where sessions queue up for processing. If a session registers in an empty queue, it is considered the <em class="def">stage leader</em>; otherwise, the session is a <em
class="def">follower</em>. Stage leaders will bring all the sessions in the queue through the stage and then register the leader and all followers for the next stage. Followers will move off to the side and wait for a leader to signal that the entire commit is done. Since it is possible that a leader registers in a non-empty queue, a leader can decide to become a follower and go off waiting as well, but a follower can never become a leader.<p>When a leader enters a stage, it will grab the entire queue in one go and process it in order according to the stage. After the queue is grabbed, other sessions can register for the stage while the leader processes the old queue.<p>In the <strong>flush stage</strong>, all the sessions that registered will have their caches written to the binary log. Since all the transactions are written to an internal I/O cache, the last part of the stage is writing the memory cache to disk (which means it is written to the file pages, also in memory).<p>In the <strong>sync stage</strong>, the binary log is synced to disk according to the setting of <code>sync_binlog</code>. If <code>sync_binlog=1</code>, all sessions that were flushed in the flush stage will be synced to disk each time.<p>In the <strong>commit stage</strong>, the sessions that registered for the stage will be committed in the engine in the order they registered; all work here is done by the stage leader. Since order is preserved in each stage of the commit procedure, the writes and the commits will be made in the same order.<p>After the commit stage has finished executing, all sessions that were in the queue for the commit stage will be marked as done and will be signaled that they can continue. Each session will then return from the commit procedure and continue executing.<p>Thanks to the fact that the leader registers for the next queue and is ready to become a follower, the stage that is slowest will accumulate the most work.
This is typically the sync stage, at least for normal hard disks. However, it is critical to fill the flush stage with as many transactions as possible, so the flush stage is treated a little specially.<p>In the flush stage, the leader will skim the sessions one by one from the flush queue (the input queue for the flush stage). This continues as long as there are sessions left in the queue and the first session was unqueued less than <code>binlog_max_flush_queue_time</code> microseconds ago. There are two conditions that can stop the process: <ul><li>If the queue is empty, the leader immediately advances to the next stage and registers all processed sessions in the sync stage queue.</li>
<li>If the timeout was reached, the entire queue is grabbed and the sessions' transaction caches are flushed (as before). The leader then advances to the sync stage.</li>
</ul><!-- Benchmarks --> <table class="figure-right"><caption>Figure 6. Comparing 5.6.5 and 5.6 June labs release</caption>
<tr><td> <object title="Comparing 5.6.5 and 5.6 June labs release"
data="http://www.kindahl.net/images/with-binlog.svg" type="image/svg+xml"
width="650" height="400"> </object> </td></tr>
</table><h2>Performance, performance, performance...</h2>I'm sure you all wonder what the improvements are, so without further delay, let's have a look at the results of some benchmarks we have done on the labs tree. There have been several improvements that are not related to the binary log, so I will just focus on the results involving the binary log. In Figure 6 you see a benchmark comparing the 5.6.5 release with the 5.6 June labs release using the binary log. These benchmarks were executed on a 2.00 GHz 48-core Intel® Xeon® 7540 with 512 GiB memory, using SSD disks.<p>As you can see, the throughput has increased tremendously, with increases ranging from a little less than 2 times to approaching 4 times that of 5.6.5. To a large extent, the improvements are in the server itself, but what is interesting is that with binary log group commit, the server is able to keep the pace. Even with <code>sync_binlog=0</code> on 5.6.5 and <code>sync_binlog=1</code> on the 5.6 labs release, the 5.6 labs release outperforms 5.6.5 by a big margin.<p>Another interesting aspect is that with <code>sync_binlog=1</code> the server performs nearly as well as when using <code>sync_binlog=0</code>. At higher numbers of connections (roughly more than 50), the difference in throughput varies from around 0%, with a more typical difference between 5% and 10%. However, there is a drop of roughly 20% in the lower range. This looks very strange, especially in light of the fact that the performance is almost equal in the higher range, so what is causing that drop and is there anything that can be done about it? <table class="figure-right"><caption>Figure 7. Benchmark of Binary Log Group Commit</caption>
<tr><td> <object title="Benchmarks of Binary Log Group Commit"
data="http://www.kindahl.net/images/detailed.svg" type="image/svg+xml"
width="650" height="400"> </object> </td></tr>
</table>The answer comes from some internal benchmarks done while developing the feature. For these tests we were using Sysbench on a 64-core Intel® Xeon® X7560 running at 2.27 GHz with 126 GB memory and an HDD.<p>In the benchmarks that you can see in Figure 7, the enhanced version of 5.6 is compared with and without binary log group commit. The enhanced version of 5.6 includes some optimizations to improve performance that are not available in the latest 5.6 DMR, but most are available in the labs tree. However, these are benchmarks done while developing, so it is not really possible to compare them with Figure 6 above, but they will help explain why we have a 20% drop at lower numbers of connections.<p>The bottom line in Figure 7 is the enhanced 5.6 branch without binary log group commit and using <code>sync_binlog=1</code>, which does not scale very well. (This is nothing new, and is why implementing binary log group commit is a good idea.) Note that even at <code>sync_binlog=0</code> the staged commit architecture scales better than the old implementation. If you look at the other lines in the figure, you can see that the binary log group commit implementation running with <code>sync_binlog=1</code> outperforms the enhanced 5.6 branch running with <code>sync_binlog=0</code> above roughly 105 simultaneous threads.<p>Also note that the difference between <code>sync_binlog=1</code> and <code>sync_binlog=0</code> diminishes as the number of simultaneous connections increases, to vanish completely at roughly 160 simultaneous connections. We haven't made a deep analysis of this, but while using the performance schema to analyze the performance of each individual stage, we noted that the sync time completely dominated the performance (no surprise there, just giving the background), and that all available transactions "piled up" in the sync stage queue.
Since each connection can have at most one ongoing transaction, it means that at 32 connections there can never be more than 32 transactions in the queue. As a matter of fact, one can expect that over a long run roughly half of the connections are in the queue and half of the connections are inside the sync stage (this was also confirmed in the measurements mentioned above), so at lower numbers of connections it is just not possible to fill the queue enough to utilize the system efficiently.<p>The conclusion is that reducing the sync time would probably make the difference between <code>sync_binlog=0</code> and <code>sync_binlog=1</code> smaller even at low numbers of connections. We didn't do any benchmarks using disks with battery-backed caches (which should reduce the sync time significantly, if not eliminate it entirely), but it would be really interesting to see the effect of that on performance. <!-- Summary and Links --> <h2>Summary and closing remarks</h2><ul><li>The binary logging code has been simplified and optimized, leading to improved performance even when using <code>sync_binlog=0</code>.</li>
<li>The <code>prepare_commit_mutex</code> is removed from the code and instead the server orders transactions correctly.</li>
<li>Transactions can be written and committed as groups without losing any transactions, giving around 3 times better performance with both <code>sync_binlog=1</code> and <code>sync_binlog=0</code>.</li>
<li>The difference between <code>sync_binlog=0</code> and <code>sync_binlog=1</code> is small and diminishes as the load on the system increases.</li>
<li>Existing storage engines benefit from binary log group commit since there are no changes to the handlerton interface.</li>
</ul>Binary log group commit is one of a range of important new enhancements to replication in MySQL 5.6, including global transaction IDs (GTIDs), multi-threaded slave, crash safe slave and binary log, replication event checksums, and some more. You can learn more about all of these from our DevZone article: <blockquote><a href="http://dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html">dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html</a></blockquote><p>You can also try out binary log group commit today by downloading the latest MySQL 5.6 build that is available on <a
href="http://labs.mysql.com">labs.mysql.com</a><br />
<br />
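As a summary of the mechanics, the three stages can be modeled with a deliberately simplified Python sketch: the point is that one fsync() call covers a whole batch of transactions, while both write order and commit order are preserved. This is a toy model for illustration, not server code:

```python
fsync_calls = 0   # how many times the "disk" was synced
binlog = []       # transactions in binary log write order
committed = []    # transactions in engine commit order

def commit_batch(batch):
    """Run one batch of sessions through the flush, sync, and commit stages."""
    global fsync_calls
    # Flush stage: the leader writes every session's cache to the binary log.
    binlog.extend(batch)
    # Sync stage: a single fsync() covers the whole batch (sync_binlog=1).
    fsync_calls += 1
    # Commit stage: commit in the engine in the same order as written.
    committed.extend(batch)

for batch in [["trx1", "trx2", "trx3"], ["trx4", "trx5"]]:
    commit_batch(batch)
# Five transactions are made durable with only two fsync() calls, and the
# commit order matches the binary log order.
```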
<h3>Related links</h3><ul><li>It all started with <a href="http://www.facebook.com/note.php?note_id=438641125932">this post</a> where Mark points out the problem and shows some results of their implementation.<br />
<li>The next year, <a href="http://kristiannielsen.livejournal.com/">Kristian Nielsen</a> implemented binary log group commit for MariaDB and has a lot of <a href="http://kristiannielsen.livejournal.com/12254.html">good</a> <a href="http://kristiannielsen.livejournal.com/12408.html">posts</a> on the <a href="http://kristiannielsen.livejournal.com/12553.html">technical challenges</a> in implementing it. This implementation is using an atomic queue and does flushing and syncing of the transactions as a batch, after which the sessions are signalled in order and commit their transactions.<br />
</ul>Mats Kindahl

<h2>Pythonic Database API: Now with Launchpad (2012-02-20)</h2>In a <a href="http://mysqlmusings.blogspot.com/2011/09/python-interface-to-mysql.html">previous post</a>, I demonstrated a simple Python database API with a syntax similar to jQuery. The goal was to provide a simple API that would allow Python programmers to use a database without having to resort to SQL, nor having to use any of the good, but quite heavy, ORM implementations that exist. The code was just an experimental implementation, and I was considering putting it up on Launchpad.<br />
I did some basic cleaning of the code, turned it into a Python package, and <a href="https://launchpad.net/mysql-python-api">pushed it to Launchpad</a>. I also added some minor changes, such as introducing a <code class="symbol">define</code> function to define new tables instead of automatically creating one when an insert was executed. Automatically constructing a table from values seems neat, but in reality it is quite difficult to ensure that it has the right types for the application. Here is a small code example demonstrating how to use the <code class="symbol">define</code> function together with some other operations. <br />
<pre class="brush: python;">import mysql.api.simple as api

server = api.Server(host="example.com")
server.test_api.tbl.define(
    { 'name': 'more', 'type': int },
    { 'name': 'magic', 'type': str },
)
items = [
    {'more': 3, 'magic': 'just a test'},
    {'more': 3, 'magic': 'just another test'},
    {'more': 4, 'magic': 'quadrant'},
    {'more': 5, 'magic': 'even more magic'},
]
for item in items:
    server.test_api.tbl.insert(item)
</pre>The table is defined by providing a dictionary for each column that you want in the table. The two most important fields in the dictionary are <var>name</var> and <var>type</var>. The <var>name</var> field is used to supply a name for the column, and the <var>type</var> field is used to provide a type for the column. The type is denoted using a basic Python type constructor, which then maps internally to a SQL type. So, for example, <code class="symbol keyword">int</code> maps to the SQL <code>INT</code> type, and <code class="symbol keyword">bool</code> maps to the SQL type <code>BIT(1)</code>. The choice to use Python types is simply because it is more natural for a Python programmer to define the tables from the data that the programmer wants to store in the database. In this case, I would be less concerned with how the types are mapped, just assuming that they are mapped in a way that works. It is currently not possible to register your own mappings, but that is easy to add. So, why provide the type object and not just a string with the type name? The idea I had here is that since Python has introspection (it is a dynamic language, after all), it would be possible to add code that reads the provided type objects and does things with them, such as figuring out what fields there are in the type. It's not that I plan to implement this, but even though this is intended to be a simple database interface, there is no reason to tie one's hands from the start, so this simple approach will provide some flexibility if needed in the future. <br />
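Internally, such a mapping can be as simple as a dictionary keyed on the Python type constructors. The sketch below is illustrative only: the int and bool mappings follow the post, while the remaining entries are made-up placeholders, not the package's actual choices:

```python
# Hypothetical Python-to-SQL type mapping: int -> INT and bool -> BIT(1)
# are stated in the post; str and float are illustrative guesses.
TYPE_MAP = {
    int: "INT",
    bool: "BIT(1)",
    str: "VARCHAR(255)",
    float: "DOUBLE",
}

def column_definition(field):
    # field is a dictionary like {'name': 'more', 'type': int}
    return "%s %s" % (field["name"], TYPE_MAP[field["type"]])

column_definition({"name": "more", "type": int})  # → 'more INT'
```

Registering user-defined mappings would then just be a matter of adding entries to the dictionary.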
<h2>Links</h2>Some additional links that you might find useful:<br />
<dl><dt><a href="http://launchpad.net/myconnpy">Connector/Python</a> </dt>
<dd>You need to have Connector/Python installed to be able to use this package. </dd>
<dt><a href="http://sequelizejs.com/">Sequelize</a> </dt>
<dd>This is a JavaScript library that provides a similar interface to a database. It claims to be an ORM layer, but is not really. It is more similar to what I have written above. </dd>
<dt><a href="http://rpbouman.blogspot.com/">Roland's</a> <a href="http://code.google.com/p/mql-to-sql/">MQL to SQL</a> and <a href="http://www.slideshare.net/rpbouman/mql-tosql-a-jsonbased-rdbms-query-language">Presentation on SlideShare</a> are also some inspiration for alternatives. </dt>
</dl><br />
Mats Kindahl

<h2>MySQL: Python, Meta-Programming, and Interceptors (2012-01-23)</h2>I recently found Todd's <a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-lifecycle-interceptors/" >posts</a> <a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-statement-interceptors/" >on</a> <a href="http://mysqlblog.fivefarmers.com/2011/11/21/connectorj-extension-points-%E2%80%93-exception-interceptors/">interceptors</a>, which allow callbacks (called <em>interceptors</em>) to be registered with the connector so that you can intercept a statement execution, a commit, or any of the many extension points supported by Connector/Java. This feature allows you to implement a number of things without having to change the application code, such as load-balancing policies, profiling of queries or transactions, or debugging of an application.<p>Since Python is a dynamic language, it is easy to add interceptors to <em>any</em> method in Connector/Python, without having to extend the connector with specific code. This is possible in dynamic languages such as Python, Perl, and JavaScript, and even in some lesser known languages such as Lua and Self. In this post, I will describe how, and also give an introduction to some of the (in my view) more powerful features of Python.<p>In order to create an interceptor, you need to be able to do these things: <ul><li>Catch an existing method in a class and replace it with a new one.</li>
<li>Call the original function, if necessary.</li>
<li>For extra points: catch an existing method in an <em>object</em> and replace it with a new one.</li>
</ul>In this post, you will see how all three of these problems are solved in Python. You will see how to use <em>decorators</em> to define methods in existing classes and objects, and closures to call the original version of the methods. With this approach, it is not necessary to change the implementation: in fact, you can use this code to replace <em>any</em> method in <em>any</em> class, not only in Connector/Python.<p><table class="figure-right"><caption>Table 1. Attributes for methods</caption>
<tr> <th></TH> <th colspan="2">Method Instance</TH> </TR>
<tr> <th>Name</TH> <th>Unbound</TH> <th>Bound</TH></TR>
<tr> <td><code class="symbol">__name__</code></TD> <td align="center" colspan="2">Name of Method</TD> </TR>
<tr> <td><code class="symbol">im_func</code></TD> <td align="center" colspan="2">"Inner" function of the method</TD> </TR>
<tr> <td><code class="symbol">im_self</code></TD> <td><code class="constant">None</code></TD> <td align="center">Class instance for the method</TD> </TR>
<tr> <td><code class="symbol">im_class</code></TD> <td align="center" colspan="2">Class that the method belongs to</TD> </TR>
</table>In addition to being able to replace methods in the class, we would also like to be able to replace methods in instances of a class ("objects" in the traditional sense). This is useful to create specialized objects, for example for tracking particular cases where a method is used.<p>In order to understand how the replacement works, you should understand that in Python (and the dynamic languages mentioned above), all objects can have attributes, including classes, functions, and a bunch of other esoteric constructions. Each type of object has a set of pre-defined attributes with well-defined meaning. For classes (and class instances), methods are stored as attributes of the class (or class instance) and can therefore be replaced with other methods that you build dynamically. However, it requires some tinkering to take an existing "normal" function definition and "imbue" it with whatever "tincture" makes it behave as a method of the class or class instance.<p>Depending on where the method comes from, it can be either <em>unbound</em> or <em>bound</em>. Unbound methods are roughly equivalent to member function pointers in C++: they reference a function, but not the instance. In contrast, bound methods have an instance tied to them, so when you call them, they already know what instance they belong to and will use it. Methods have a set of attributes, of which the four in Table 1 interest us. If a method is fetched from a class (to be precise, from a class object), it will be unbound and <code>im_self</code> will be <code>None</code>. If the method is fetched from a class <em>instance</em>, it will be bound and <code class="symbol">im_self</code> will be set to the instance it belongs to. These attributes are all the "tincture" you need to make your own instance methods. The code for doing the replacement described above is simply:<br />
<pre class="brush: python;">import functools, types
def replace_method(orig, func):
    functools.update_wrapper(func, orig.im_func)
    new = types.MethodType(func, orig.im_self, orig.im_class)
    obj = orig.im_self or orig.im_class
    setattr(obj, orig.__name__, new)
</pre>The function uses two standard modules to make the job simpler, but the steps are: <ol><li>Copy the meta-information from the original method function to the new function using <code>update_wrapper</code>. This copies the name, module information, and documentation from the original method function to make it look like the original method.</li>
<li>Create a new method instance from the method information of the original method using the constructor <code>MethodType</code>, but replace the "inner" function with the new function.</li>
<li>Install the new instance method in the class or instance by replacing the attribute denoting the original method with the new method. Depending on whether the function is given a bound or unbound instance, either the method in the class or in the instance is replaced.</li>
</ol>Using this function you can now replace a method in a class like this:<br />
<pre class="brush: python;">from mysql.connector import MySQLCursor
def my_execute(self, operation, params=None):
    ...

replace_method(MySQLCursor.execute, my_execute)
</pre>This is already pretty useful, but note that you can also replace only a specific instance as well by using <code>replace_method(cursor.execute, my_execute)</code>. It was not necessary to change anything inside Connector/Python to intercept a method there, so you can actually apply this to any method in any of the classes in Connector/Python that you already have available. In order to make it even easier to use you'll see how to define a <em>decorator</em> that will install the function in the correct place at the same time as it is defined. The code for defining a decorator and an example usage is: <pre class="brush: python;">import functools, types
from mysql.connector import MySQLCursor
def intercept(orig):
    def wrap(func):
        functools.update_wrapper(func, orig.im_func)
        meth = types.MethodType(func, orig.im_self, orig.im_class)
        obj = orig.im_self or orig.im_class
        setattr(obj, orig.__name__, meth)
        return func
    return wrap
# Define a function using the decorator
@intercept(MySQLCursor.execute)
def my_execute(self, operation, params=None):
    ...
</pre>The <code>@intercept</code> line before the definition of <code>my_execute</code> is where the new decorator is used. The syntax is a shorthand for applying a function to the function being defined. It behaves as if the following code had been executed: <pre class="brush: python;">def _temporary(self, operation, params=None):
    ...
my_execute = intercept(MySQLCursor.execute)(_temporary)
</pre>As you can see here, whatever is given after the <code>@</code> is used as a function and called with the function-being-defined as argument. This explains why the <code>wrap</code> function is returned from the decorator (it will be called with a reference to the function that is being defined), and also why the original function is returned from the <code>wrap</code> function (the result will be assigned to the function name).<p>Using a statement interceptor, you can catch the execution of statements and do some special magic on them. In our case, let's define an interceptor to catch the execution of a statement and log the result using the standard <a href="http://docs.python.org/library/logging.html"><code>logging</code></a> module. If you read the <code>wrap</code> function carefully, you probably noted that it uses a <dfn>closure</dfn> to access the value of <var>orig</var> when the decorator was <em>called</em>, not the value it happens to have when the <code>wrap</code> function is executed. This feature is very useful since a closure can also be used to get access to the original <code>execute</code> function and call it from within the new function. So, to intercept an execute call and log information about the statement using the <a href="http://docs.python.org/library/logging.html"><code>logging</code></a> module, you could use code like this: <pre class="brush: python;">from mysql.connector import MySQLCursor
import logging

original_execute = MySQLCursor.execute

@intercept(MySQLCursor.execute)
def my_execute(self, operation, params=None):
    if params is not None:
        stmt = operation % self._process_params(params)
    else:
        stmt = operation
    result = original_execute(self, operation, params)
    logging.debug("Executed '%s', rowcount: %d", stmt, self.rowcount)
    logging.debug("Columns: %s", ', '.join(c[0] for c in self.description))
    return result
</pre>Now with this, you could implement your own caching layer to, for example, do a memcached lookup before sending the statement to the server for execution. I leave this as an exercise to the reader, or maybe I'll show you in a later post. :) Implementing a lifecycle interceptor is similar, except that you replace, for example, the commit or rollback calls. However, implementing an exception interceptor is not obvious. Catching the exception is straightforward and can be done using the <code>intercept</code> decorator: <pre class="brush: python;">original_init = ProgrammingError.__init__
@intercept(ProgrammingError.__init__)
def catch_error(self, msg, errno):
    logging.debug("This statement didn't work: '%s', errno: %d", msg, errno)
    original_init(self, msg, errno=errno)
</pre>However, in order to do something more interesting, such as asking for some additional information from the database, it is necessary to either get hold of the cursor that was used to execute the query, or at least the connection. It is possible to dig through the interpreter stack, or try to override one of the internal methods that Connector/Python uses, but since that is very dependent on the implementation, I will not present that in this post. It would be good if the cursor were passed down to the exception constructor, but this requires some changes to the connector code.<p>Even though I have been programming in dynamic languages for decades (literally), it always amazes me how easy it is to accomplish things in these languages. If you are interested in playing around with this code, you can always fetch Connector/Python on Launchpad and try out the examples above. Some links and other assorted references related to this post are: <ul><li>Connector/Python is found at <a href="https://launchpad.net/myconnpy" >launchpad.net/myconnpy</a></li>
<li>Geert has a number of excellent posts on Connector/Python under <a href="http://geert.vanderkelen.org/" >geert.vanderkelen.org</a>. Also, <a href="http://geert.vanderkelen.org/post/817" >as you might already know</a>, he is now working on developing Connector/Python and he's always interested in comments and suggestions. :)<br />
<li>Todd's Blog <a href="http://mysqlblog.fivefarmers.com/" >mysqlblog.fivefarmers.com</a> is always interesting to read, and the articles on interceptors that I read are: <ul><li><a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-lifecycle-interceptors/" >Lifecycle Interceptors</a><br />
<li><a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-statement-interceptors/" >Statement Interceptors</a><br />
<li><a href="http://mysqlblog.fivefarmers.com/2011/11/21/connectorj-extension-points-%E2%80%93-exception-interceptors/">Exception Interceptors</a> </ul></ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com3tag:blogger.com,1999:blog-23496029.post-12977034267215179092011-09-26T10:27:00.001+02:002011-09-26T10:27:28.008+02:00Python Interface to MySQLThere has been a lot of discussion lately about various non-SQL languages that provide access to databases without having to resort to using SQL. I wondered how difficult it would be to implement such an interface, so as an experiment, I implemented a simple interface in Python that is similar to the document-oriented interfaces available elsewhere. The interface generates SQL queries to query the database, but does not require any knowledge of SQL to use. The syntax is inspired by <a href="http://jquery.org">JQuery</a>, but since JQuery works with documents, the semantics is slightly different.<p>
A simple example would look like this:
<pre class="code lineno">
from native_db import *
server = Server(host='127.0.0.1')
server.test.t1.insert({'more': 3, 'magic': 'just a test', 'count': 0})
server.test.t1.insert({'more': 3, 'magic': 'just another test', 'count': 0})
server.test.t1.insert({'more': 4, 'magic': 'quadrant', 'count': 0})
server.test.t1.insert({'more': 5, 'magic': 'even more magic', 'count': 0})
for row in server.test.t1.find({'more': 3}):
    print "The magic is:", row['magic']
server.test.t1.update({'more': 3}, {'count': 'count+1'})
for row in server.test.t1.find({'more': 3}, ['magic', 'count']):
    print "The magic is:", row['magic'], "and the count is", row['count']
server.test.t1.delete({'more': 5})
</pre>
The first line defines a server to communicate with, which is simply done by creating a <code>Server</code> object with the necessary parameters. The constructor accepts the normal parameters for Connector/Python (which is what I'm using internally), but the user defaults to whatever <code>getpass.getuser()</code> returns, and the host defaults to 127.0.0.1, even though I've provided it here.<p>
After that, the necessary methods are overridden so that <code><em>server</em>.<em>database</em>.<em>table</em></code> will refer to the table with name <em>table</em> in the database with name <em>database</em> on the given server. One possibility would be to just skip the database and go directly to the table (using some default database name), but since this is just an experiment, I did this instead. In addition, there are various methods defined to support searching, inserting, deleting, and updating.<p>
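The attribute chaining just described can be sketched with <code>__getattr__</code>. This is a minimal illustration of the technique with class names I made up, not the actual native_db code:

```python
class Table(object):
    def __init__(self, database, name):
        self.database, self.name = database, name

    @property
    def qualified_name(self):
        # Fully qualified name, suitable for use in generated SQL
        return "%s.%s" % (self.database.name, self.name)

class Database(object):
    def __init__(self, server, name):
        self.server, self.name = server, name

    def __getattr__(self, name):
        # Any unknown attribute is treated as a table name
        return Table(self, name)

class Server(object):
    def __getattr__(self, name):
        # Any unknown attribute is treated as a database name
        return Database(self, name)
```

With these three classes, <code>Server().test.t1</code> resolves to a <code>Table</code> object whose qualified name is <code>test.t1</code>, which is all the interface needs to build statements.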
Since this is intended to be a simple interface, autocommit is on. Each of the functions generates a single SQL statement, so they will be executed atomically if you're using InnoDB.<p>
<dl>
<dt><em>table</em>.insert(<em>row</em>)
<dd>This function will insert the contents of the dictionary into the table, using the keys of the dictionary as column names. If the table does not exist, it will be created with a "best effort" guess of what types to use for the columns.
<dt><em>table</em>.delete(<em>condition</em>)
<dd>This function will remove all rows in the table that match the supplied dictionary. Currently, only equality mapping is supported, but see below for how it could be extended.
<dt><em>table</em>.find(<em>condition</em>, <em>fields</em>="*")
<dd>This will search the table and return an iterable to the rows that match <em>condition</em>. If <em>fields</em> is supplied (as a list of field names), only those fields are returned.
<dt><em>table</em>.update(<em>condition</em>, <em>update</em>)
<dd>This will search for rows matching <em>condition</em> and update each matching row according to the <em>update</em> dictionary. The values of the dictionary are used on the right side of the assignments of the <code>UPDATE</code> statement, so expressions can be given here as strings.
</dl>
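To make the mapping concrete, here is a rough sketch (my own illustration, not the actual native_db code) of how <code>find</code> could turn a condition dictionary into a <code>SELECT</code> statement with connector-style placeholders:

```python
def find_sql(table, condition, fields="*"):
    # Build the projection: either "*" or a comma-separated field list
    if fields != "*":
        fields = ", ".join(fields)
    # Equality-only conditions, as in the post; values are passed to the
    # connector as parameters rather than spliced into the string.
    where = " AND ".join(
        "%s = %%(%s)s" % (key, key) for key in sorted(condition)
    )
    return "SELECT %s FROM %s WHERE %s" % (fields, table, where)
```

For example, <code>find_sql("test.t1", {'more': 3}, ['magic', 'count'])</code> produces a statement the connector can execute with the condition dictionary as its parameter set.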
<h2>That's all folks!</h2>
The code is available at <a href="http://mats.kindahl.net/python/native_db.py">http://mats.kindahl.net/python/native_db.py</a> if you're interested in trying it out. The code is very basic, and there's potential for a lot of extensions. If there's interest, I could probably create a repository somewhere.<p>
Note that this is not a replacement for an ORM library. The intention is not to allow storing arbitrary objects in the database: the intention is to be able to query the database using a Python interface without resorting to using SQL.<p>
I'm just playing around and testing some things out, and I'm not really sure if there is any interest in anything like this, so what do you think? Personally, I have no problems with using SQL, but since I'm working with MySQL on a daily basis, I'm strongly biased on the subject. For simple jobs, this is probably easier to work with than a "real" SQL interface, but it cannot handle as complex queries as SQL can (at least not without extensions).<p>
There are a number of open issues for the implementation (this is just a small list of obvious ones):
<dl>
<dt><strong>Only equality searching supported</strong>
<dd>Searching can only be done with equality matches, but it is trivial to extend to support more complex comparisons. To allow more complex conditions, the condition supplied to <code>find</code>, <code>delete</code>, and <code>update</code> can actually be a string, in which case it is used "raw".<p>
Conditions could be extended to support something like <code>{'more': '>3'}</code>, or a more object-oriented approach would be to support something similar to <code>{'more': operator.gt(3)}</code>.<p>
<dt><strong>No support for indexes</strong>
<dd>There's no support for indexes yet, but that can easily be added. The complication is <em>what</em> kind of indexes should be generated.<p>
For example, right now rows are identified by their content, but what if we want identical rows to be handled as a set? Imagine the following (not supported) query where we insert some rows and then operate on exactly those rows:
<blockquote><code>server.test.t1.insert(<em>content with some more=3</em>).find({'more': eq(3)})</code></blockquote>
In this case, we have to fetch the row identifiers for the inserted rows to be able to manipulate <em>exactly</em> those rows and no others. I am not sure how to do this right now, but auto-inventing a row identifier would mean that tables lacking one cannot be handled naturally.<p>
<dt><strong>Creating and dropping tables</strong>
<dd>The support for creation of tables is to create tables automatically if they do not exist. A simple heuristic is used to figure out the table definition, but this has obvious flaws if later inserts have more fields than the first one.<p>
To support extending the table, one would have to generate an <code>ALTER TABLE</code> statement to "fix" the table.<p>
There is no support for dropping tables... or databases.
</dl>
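The comparison extension mentioned above could, for instance, map tuple values to SQL operators. This is a sketch of one possible design, not anything implemented in the published code:

```python
# Allowed comparison operators and their SQL spellings
_OPS = {'=': '=', '>': '>', '<': '<', '>=': '>=', '<=': '<=', '!=': '<>'}

def condition_sql(condition):
    """Render a condition dictionary as a SQL boolean expression.

    Plain values mean equality; an (op, value) tuple selects another
    comparison, e.g. {'more': ('>', 3)} renders as "more > 3".
    """
    parts = []
    for field in sorted(condition):
        value = condition[field]
        op = '='
        if isinstance(value, tuple):
            op, value = value
        parts.append("%s %s %r" % (field, _OPS[op], value))
    return " AND ".join(parts)
```

The same idea would also accommodate the raw-string escape hatch: if the whole condition is a string, use it verbatim; otherwise render it as above.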
Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com0tag:blogger.com,1999:blog-23496029.post-5670392311903820002011-07-27T15:12:00.002+02:002011-07-27T15:59:30.805+02:00Binlog Group Commit Experiments<h1>Binlog Group Commit Experiments</h1>
It has been a while since I <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">talked</a> <a href="http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html">about</a> binary log group commit. I had to spend time on a <a href="http://mysqlmusings.blogspot.com/2011/02/slave-type-conversions.html">few</a> <a href="http://mysqlmusings.blogspot.com/2011/04/replication-event-checksum.html">other</a> <a href="http://mysqlmusings.blogspot.com/2011/04/round-robin-multi-source-in-pure-sql.html">things</a>.<p>
Since then, Kristian has <a href="https://lists.launchpad.net/maria-developers/msg03693.html">released a version of binary log group commit</a> that seems to <a href="http://www.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933">work well</a>.
However, for a few reasons that will be outlined below, we decided to do experiments ourselves using the approach that I have described earlier. A <em>very</em> early version of what we will start doing benchmarks on is available at the <a href="">MySQL labs</a>. We have not done any benchmarking on this approach before OSCON, so we'll have to get back on that.<p>
All of this started with <a href="http://www.facebook.com/note.php?note_id=438641125932">Facebook pointing out a problem</a> in how the group commit interacts with the binary log and proposing a way to handle binary log group commit by demonstrating a patch to solve the problem.
<h2>What's in the patch</h2>
The patch involves implementing logic for handling binary log group commit and parallel writing of the binary log, including a minor change to the handler protocol by adding a <code>persist</code> callback. The extension of the handler interface is strictly speaking not necessary for the implementation, but it is natural to extend the interface in this manner and I believe that it can be used by storage engines to execute more efficiently.
In addition to the new logic, three new options were added and one option was created as an alias of an old option.
<dl>
<dt><code>binlog-sync-period=</code><var>N</var>
<dd>This is just a rename of the old <code>sync-period</code> option, which tells the server that <code>fsync</code> should be called for the binary log every <var>N</var> events. For many of the old options, it is not clear what they are configuring, so we are adding the <code>binlog-</code> prefix to options that affect the binary log. The old option is kept as an alias for this option.
<dt><code>binlog-sync-interval=</code><var>msec</var>
<dd>No transaction commit will wait for more than <var>msec</var> milliseconds before calling <code>fsync</code> on the binary log. If set to zero, it is disabled. You can set both this option and the <code>binlog-sync-period</code> option.
<dt><code>binlog-trx-committed={COMPLETE,DURABLE}</code>
<dd>A transaction is considered committed when it is either in durable store or when it is completed. If set to <code>DURABLE</code> either <code>binlog-sync-interval</code> or <code>binlog-sync-period</code> has to be non-zero. If they are both zero, transactions will not be flushed to disk and hence they will never be considered durable.
<dt><code>master-trx-read={COMPLETE,DURABLE}</code>
<dd>A transaction is read from the binary log when it is completed or when it is durable. If set to <code>DURABLE</code>, either <code>binlog-sync-interval</code> or <code>binlog-sync-period</code> has to be non-zero or an error will be generated. If both were zero, no transactions would ever be read from the binary log and hence never sent out.
</dl>
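For illustration, a configuration using these options might look like the fragment below. This is a sketch against the experimental patch only: the option names exist just in that patch, and the values are arbitrary examples.

```ini
[mysqld]
# fsync the binary log at least every 1000 events ...
binlog-sync-period=1000
# ... and never let a commit wait more than 50 ms for durability
binlog-sync-interval=50
# consider a transaction committed once it is completed in memory
binlog-trx-committed=COMPLETE
# but only send durable transactions to connecting slaves
master-trx-read=DURABLE
```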
The patch also contains code to eliminate the <code>prepare_commit_mutex</code> as well as moving the release of row locks inside InnoDB (not completely applied yet, I will get it there as soon as possible) to the prepare phase. The focus of these changes is that we should maintain consistency, so we have not made any aggressive changes like moving the release of the write locks to the prepare phase: that could possibly lead to inconsistencies.<p>
<table class="figure-right">
<caption>Figure 1. Binary log with transaction in different stages</caption>
<tr>
<td>
<object data="http://images.kindahl.net/trans-stages.svg" width="100" height="300" type="image/svg+xml">
</object>
</td>
</tr>
</table>
The main changes are about how a transaction is committed. The details are explained in the previous articles, but for understanding the rest of this blog post, I'll briefly recapitulate how a transaction is committed in this solution. Each transaction passes through three states: <em>prepared</em>, <em>completed</em> (committed to memory), and <em>durable</em> (committed to disk), as seen in Figure 1. The transaction is pushed through these states using the following procedure:
<ol>
<li>The transaction is first <strong>prepared</strong>, which is now split into two steps:
<ol>
<li>In the <strong>reserve</strong> step, a slot is assigned for the transaction in the binary log and the storage engine is asked to check if this transaction can be committed. At this point, the storage engine can abort the transaction if it is unable to fulfill the commit, but if it approves of the commit, the only thing that can abort the transaction after this point is a server crash. This check is currently done using the <code>prepare</code> call. This step is executed with a lock, but is intended to be short.</li>
<li>In the <strong>persist</strong> step, the <code>persist</code> function is called, which asks the storage engine to persist any data that it needs to persist to guarantee that the transaction is fully prepared. After this step is complete, the transaction is fully prepared in the storage engine and in the event of a crash, it will be able to commit the transaction on recovery, if asked to do so. This step is executed without a lock and a storage engine that intends to handle group commit should defer any expensive operations to this step.</li>
</ol></li>
<li>To record the decision, the transaction is written to the reserved slot in the binary log. Since the write is done to a dedicated place in the binary log reserved for this transaction, it is not necessary to hold any locks, which means that several threads can write their transactions to the binary log at the same time.</li>
<li>The <strong>commit</strong> phase is in turn split into two steps:
<ol>
<li>In the <strong>completion</strong> step, the thread waits for all preceding transactions to be fully written to the binary log, after which the transaction is <em>completed</em>, which means that it is logically committed but not necessarily in durable storage.</li>
<li>In the <strong>durability</strong> step, the thread waits for the transaction (and all preceding transactions) to be written to disk. If this does not occur within the given time period, it will itself call <code>fsync</code> for the binary log. This will make all completed transactions durable.</li>
</ol></li>
</ol>
After this procedure is complete, the transaction is fully committed and the thread can proceed with executing the next statement.
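The slot-based logic above can be illustrated with a toy model in Python (my own sketch of the idea, not the actual server code, which is in C++): transactions reserve a slot under a short lock, write to their own slot without a lock, and become completed once every preceding slot has been written.

```python
import threading

class BinlogSketch(object):
    """Toy model of slot-based binary log group commit."""

    def __init__(self):
        self.cond = threading.Condition()
        self.slots = []          # None marks a reserved but unwritten slot
        self.completed_upto = 0  # every slot below this index is written

    def reserve(self):
        # Short critical section: just allocate a slot for the transaction
        with self.cond:
            self.slots.append(None)
            return len(self.slots) - 1

    def write(self, slot, data):
        # Each thread writes to its own slot, so no lock is needed here
        self.slots[slot] = data
        with self.cond:
            # Advance the completion point past contiguously written slots
            while (self.completed_upto < len(self.slots)
                   and self.slots[self.completed_upto] is not None):
                self.completed_upto += 1
            self.cond.notify_all()

    def wait_completed(self, slot):
        # Block until this slot and all preceding slots are written
        with self.cond:
            while self.completed_upto <= slot:
                self.cond.wait()
```

The group effect falls out of the bookkeeping: if slot 1 finishes writing before slot 0, neither transaction is completed, and as soon as slot 0 is written both become completed at once.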
<h2>The different approaches</h2>
So, providing this patch raises the question: why a third version of binary log group commit?
There are three approaches: Facebook's patch (#1), Kristian's patch (#2), and my patch (#3). Before going over the rationale leading to a third version, it is necessary to understand how the Facebook patch and Kristian's patch work at a very high level. If you look at Figure 1, you see a principal diagram showing how the patches work. Both of them maintain a queue of threads with transactions to be written and will ensure that they are written in the correct order to the binary log.<p>
The Facebook patch ensures that the transactions are written in the correct order by signalling each thread waiting in the queue in the correct order, after which the thread will take a lock on the binary log, append the transaction, and release the lock. When the decision to commit the outstanding transactions is made, <code>fsync()</code> is called. It has turned out that this lock-write-unlock loop can only be executed at a certain speed, which means that as the number of threads waiting to write transactions increases, the system chokes and is not able to keep up.<p>
Kristian solves this by designating the first thread in the queue as the leader, and having it write the transactions for <em>all</em> threads in the queue instead of just having each thread do it individually, and then broadcasting to the other threads, which just return from the commit. This improves performance significantly as can be seen from the figures in the <a href="http://www.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933">measurements that Mark did</a>. Note, however, that a lock of the binary log is still kept while writing the transactions.<p>
The approach we are experimenting with goes about this in another way: instead of queueing the data to be written, a place is immediately allocated in the binary log, after which the thread proceeds to write the data. This means that several threads can write to the binary log in parallel without needing to hold any locks. There is a need for a lock when allocating space in the binary log, but that is very short. Since the threads can finish writing in a different order, it is necessary to keep logic around for deciding when a transaction is committed and when it's not. For details, you can look at the <a href="http://forge.mysql.com/worklog/task.php?id=5223">worklog</a> (which is not entirely up to date, but I'll fix that). In this sense, the binary log itself is the queue (there is a queue in the implementation, but this is just for bookkeeping).
The important differences leading us to want to have a look at this third version are:
<ul>
<li>Approaches #1 and #2 keep a lock while writing the binary log while #3 doesn't.</li>
<li>Approaches #1 and #2 keep the transactions on the side (in the queue) and write them to the binary log when they are being committed. Approach #3 writes the transactions directly to the binary log, possibly before they are committed.</li>
</ul>
<table class="figure-right">
<caption>Figure 2. Sources of performance problems</caption>
<tr>
<td>
<object data="http://images.kindahl.net/locking-issues.svg" width="320" height="450" type="image/svg+xml">
</object>
</td>
</tr>
</table>
<h2>Efficiently using Multiple Cores</h2>
Efficiently using a multi-threaded system, especially one with multiple cores, is very hard. It requires knowledge of hardware issues, operating system considerations, algorithms, and some luck. I will not cover all the issues revolving around designing a system for multi-core use, but I will focus on three of the parts that we are considering in this case. We split the sources of performance degradation when committing a transaction into three separate parts: CPU and memory issues, software lock contention, and I/O.
<ul>
<li>The CPU and memory issues have to do with how caches are handled on the CPU level, which can affect performance quite a lot. There are some things that can be done, such as avoiding false sharing, handling data alignment, and checking the cache access patterns, but in general, this is hard to add as an afterthought and requires quite a lot of work to get right. We are not considering this and view it as static.</li>
<li>The I/O can be reduced using either SSDs or RAID solutions (which do not reduce latency, but improve the throughput and therefore reduce the I/O needed for each transaction). Also, reducing the number of accesses to disk using group commits will improve the situation significantly, which is what we're doing here.</li>
<li>To reduce the software lock contention there is only one solution: reduce the time each lock is held. This can be as simple as moving the lock acquire and release, or using atomic primitives instead of locks, but it can also require re-designing algorithms to run without locks.</li>
</ul>
So, assuming that we reduce the I/O portion of committing a transaction—and <em>only</em> the I/O portion—as you can see in the figure on the right, the software lock time starts to become the problem and we need to start working on reducing that. To do this, there are not many options except the approach described above. And if we take this approach to reduce lock contention, there are just a few additions needed to get the group commit as well.<p>
Given this, it is rational to explore whether this solution can solve the group commit problem as well as the other solutions do and improve the scalability of the server at the same time.
<h2>Scaling out</h2>
One of the most central uses for replication is to achieve high availability by duplicating masters and replicating between them to keep both up to date. For this reason, it is important to get the changes over to the other master as fast as possible. In this case, whether the data is durable on the original master or not is of less concern: once the transaction has left the node, a crash will not cause the transaction to disappear since it has already been distributed. This means that for implementing multi-masters, we want replication to send transactions as soon as possible—and maybe even before that—since we can achieve high availability by propagating the information as widely as possible.<p>
On the other hand, transactions sent from the master to the slave <em>might</em> need to be durable on the master since otherwise the slave might be moving into an alternative future—a future where this transaction was committed—if the transactions sent to the slave are lost because of a crash. In this case, it is necessary for the master to not send out the transaction before it is in durable store.
Having a master that is able to send out both completed transactions and durable transactions at the same time, all based on the requirements of the slave that connects, is a great feature and allows the implementation of both an efficient multi-master solution as well as slaves that do not diverge from the master even in the event of crashes. Currently, a master cannot deliver both transactions that are <em>completed</em> and transactions that are <em>durable</em> at the same time. With the patch presented in this article, it is possible to implement this, but in alternatives #1 and #2 described above, all the transactions are kept "on the side" and not written to the binary log until they are being committed. This means that it is harder to support this scenario with the two other alternatives.
<h2>Concluding remarks</h2>
To sum up the discussion above: we are interested in exploring this approach since we think that it provides shorter lock time, hence scales better on multi-core machines, and in addition provides better scale-out capabilities, since the slaves can decide whether they want to receive durable or completed transactions.
Thanks to all in the community for the great work and discussions on binlog group commit. The next steps will be to benchmark this solution to see how it flies, and it would be great to also get some feedback on this approach. As always, we are interested in getting a good and efficient solution that can also be maintained and evolved easily.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com5tag:blogger.com,1999:blog-23496029.post-69039838180732345932011-04-13T16:23:00.004+02:002011-04-18T08:53:27.930+02:00Round-Robin Multi-Source in Pure SQLWith the <a href="crash-safe-replication.html">addition of the new tables to implement crash-safe replication</a> we also get access to replication information through the SQL interface. This might not seem like a big advantage, but it should not be taken lightly. To demonstrate the power of this approach, I will show how to implement the multi-source round-robin replication described in <a href="http://dom.as/2008/05/14/trainwreck-external-mysql-replication-agent/">other places</a> (including <a href="http://oreilly.com/catalog/9780596807290">our book</a>). However, compared to the other implementations—where a client has to parse the output of <code>SHOW SLAVE STATUS</code>—the twist is that the implementation is done entirely in the server, using pure SQL.<p>
If you're familiar with replication, you know that a slave can only replicate from a single master. The trick used to replicate from multiple masters—this is usually called <em>multi-source</em>—is to switch between masters in a time-share fashion as illustrated in Figure 1. The scheme used to pick the next master can vary, but it is common to use a round-robin schedule.<p>
The steps necessary to switch master are:
<ol>
<li>Stop reading events from the master and empty the relay log. Before switching to another master, it is necessary to ensure that there are no outstanding events in the relay log. If this is not done, some events will not be applied and will have to be re-fetched from the master.<ol>
<li>Stop the I/O thread.</li>
<li>Wait for the events in the relay log to be applied.</li>
<li>Stop the SQL thread.</li></ol></li>
<li>Save away the replication information.</li>
<li>Fetch the saved information about the next master to replicate from.</li>
<li>Change master using the new information.</li>
<li>Start the slave threads.</li>
</ol>
Simple, right? So let's make an implementation! What pieces do we need?
<ul>
<li>To handle the periodic switching, we use an SQL event for executing the above procedure.</li>
<li>We need a table to store the state of each master. The table should contain all the necessary information for configuring the master, including the binlog position.</li>
<li>We need to be able to store what master we're currently replicating from.</li>
</ul>
<h4>Saving state information</h4>
<table class="figure-right">
<caption><strong>Figure 1. </strong>Tables for storing information about masters</caption>
<tr><td><pre class="code">CREATE TABLE my_masters (
idx INT AUTO_INCREMENT PRIMARY KEY,
host VARCHAR(50), port INT DEFAULT 3306,
user VARCHAR(50), passwd VARCHAR(50),
  log_file VARCHAR(50), log_pos BIGINT,
UNIQUE INDEX (host,port,user)
) ENGINE=InnoDB;
CREATE TABLE current_master (
idx INT
) ENGINE=InnoDB;
</pre></td></tr></table>
We need two tables: a table <code>my_masters</code> to record information about the available masters and a table <code>current_master</code> that keeps track of the current master. The <code>my_masters</code> table will contain information on how to connect to the masters as well as the last seen position. We assume that the user and password information is stored in the table and won't save that information away when switching master. To store the current master being replicated from, we cannot use a user-defined variable—because each invocation of an event spawns a new session—so we store this information in a table.<p>
<h4>Switching masters</h4>
To be able to execute a <code>CHANGE MASTER</code> statement with the information we need, it would be perfect to use a prepared statement, but unfortunately <code>CHANGE MASTER</code> is one of those statements that cannot be used inside a prepared statement, so we have to build the statement dynamically. To make it easier, we create a <code>change_master</code> procedure that does the job of building, preparing, executing, and deallocating a prepared statement. We also allow the file name and position passed in to be NULL, in which case we start replication without these parameters, essentially starting from the beginning of the master's binary log.
<pre class="code">
delimiter $$
CREATE PROCEDURE change_master(
host VARCHAR(50), port INT,
user VARCHAR(50), passwd VARCHAR(50),
    name VARCHAR(50), pos BIGINT)
BEGIN
SET @cmd = CONCAT('CHANGE MASTER TO ',
CONCAT_WS(', ',
CONCAT('MASTER_HOST = "', host, '"'),
CONCAT('MASTER_PORT = ', port),
CONCAT('MASTER_USER = "', user, '"'),
CONCAT('MASTER_PASSWORD = "', passwd, '"')));
IF name IS NOT NULL AND pos IS NOT NULL THEN
SET @cmd = CONCAT(@cmd,
CONCAT_WS(', ', '',
CONCAT('MASTER_LOG_FILE = "', name, '"'),
CONCAT('MASTER_LOG_POS = ', pos)));
END IF;
PREPARE change_master FROM @cmd;
EXECUTE change_master;
DEALLOCATE PREPARE change_master;
END $$
delimiter ;
</pre>
The last step is to create the event that switches master for us. As a specific feature, we implement the event handling so that rows can be added to and removed from the <code>my_masters</code> table while the event just picks the next master in order. To solve this, we use a query that picks the next master based on the index of the last used master, and an additional query to handle the case of a wrap-around with a missing master at index 1.<p>
To allow the table to be changed while the event is executing, we place all the updates of our tables inside a transaction. That way, any updates done to the table while the event is executing will not affect the logic for picking the next master.<p>
There is some extra logic added to handle the case that there are "holes" in the index numbers: it is possible that there is no master with index 1, and it is possible that the next master does not have the next index in sequence. This would also allow the server ID of the master to be used as the index, but in the current implementation we use a simple index instead.
<table>
<tr><td><pre class="code">delimiter $$
CREATE EVENT multi_source
ON SCHEDULE EVERY 10 SECOND DO
BEGIN
DECLARE l_host VARCHAR(50);
DECLARE l_port INT UNSIGNED;
DECLARE l_user TEXT;
DECLARE l_pass TEXT;
DECLARE l_file VARCHAR(50);
DECLARE l_pos BIGINT;
  DECLARE l_next_idx INT DEFAULT 1;
</pre></td><td>
</td></tr>
<tr><td valign="top"><pre class="code"> SET SQL_LOG_BIN = 0;</pre></td><td valign="top"><em>Don't write any of this to the binary log. Since this is an event, it will automatically be reset at the end of the execution and not affect anything else.</em></td></tr>
<tr><td valign="top"><pre class="code"> STOP SLAVE IO_THREAD;
SELECT master_log_name, master_log_pos
INTO l_file, l_pos
FROM mysql.slave_master_info;
SELECT MASTER_POS_WAIT(l_file, l_pos);
STOP SLAVE;
</pre></td><td valign="top">
<em>Stop the slave I/O thread and empty the relay log before switching master</em></td></tr>
<tr><td><pre class="code"> START TRANSACTION;
</pre></td><td>
</td></tr>
<tr><td valign="top"><pre class="code"> UPDATE my_masters AS m,
mysql.slave_relay_log_info AS rli
SET m.log_pos = rli.master_log_pos,
m.log_file = rli.master_log_name
WHERE idx = (SELECT idx FROM current_master);
</pre></td><td valign="top"><em>Save the position of the current master</em></td></tr>
<tr><td valign="top"><pre class="code"> SELECT idx INTO l_next_idx FROM my_masters
WHERE idx > (SELECT idx FROM current_master)
ORDER BY idx LIMIT 1;
</pre></td><td valign="top"><em>Find the next master in turn. To handle that masters have been removed, we will pick the next one index-wise. Wrap-around is handled by using the default of 1 above.</em></td></tr>
<tr><td valign="top"><pre class="code"> SELECT idx INTO l_next_idx FROM my_masters
WHERE idx >= l_next_idx
ORDER BY idx LIMIT 1;
</pre></td><td valign="top"><em>If we did a wrap-around, it might be the case that master with index 1 does not exist (the default for l_next_idx), so then we have to scan and find the first index that exists which is equal to or greater than l_next_idx.</em></td></tr>
<tr><td valign="top"><pre class="code"> UPDATE current_master SET idx = l_next_idx;
SELECT host, port, user, passwd, log_pos, log_file
INTO l_host, l_port, l_user, l_pass, l_pos, l_file
FROM my_masters
WHERE idx = l_next_idx;
CALL change_master(l_host, l_port, l_user,
l_pass, l_file, l_pos);
COMMIT;
START SLAVE;
END $$
delimiter ;
</pre></td><td valign="top"><em>Extract the information about the new master from our masters table <code>my_masters</code> and change to use that master.</em></td></tr>
</table>
That's all! Now go off and play with it and send me comments.<p>
You can download the MySQL 5.6 Milestone Development Release from the <a href="http://dev.mysql.com">MySQL Developer Zone (<code>dev.mysql.com</code>)</a>, which contains the new replication tables, and you can find information in the <a href="crash-safe-replication.html">previous post</a> on how to set up the server to use the new tables.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com1tag:blogger.com,1999:blog-23496029.post-17082304104696194382011-04-12T17:09:00.004+02:002011-04-12T17:19:45.114+02:00Crash-safe ReplicationA common request is to have replication crash-safe in the sense that the replication progress information is always in sync with what has actually been applied to the database, even in the event of a crash. Although transactions are not lost if the server crashes, it could require some tweaking to bring the slaves up again.<p>
In the latest MySQL 5.6 milestone development release, the replication team has implemented crash-safety for the slave by adding the ability to commit the replication information together with the transaction (see Figure 1). This means that the replication information will always be consistent with what has been applied to the database, even in the event of a server crash. Also, some fixes were done on the master to ensure that it recovers correctly.<p>
If you're familiar with replication, you know that the replication information is stored in two files: <code>master.info</code> and <code>relay-log.info</code>.
These files are updated after the transaction has been applied. This means that if you have a crash between the transaction commit and the update of the files, the replication progress information will be wrong.
In other words, a transaction cannot be lost this way, but there <em>is</em> a risk that a transaction could be applied a second time.
The usual way to avoid this is to have a primary key on all your tables. In that case, a repeated update of the table would cause the slave to stop, and you would have to use <code>SQL_SLAVE_SKIP_COUNTER</code> to skip the transaction and get the slave up and running again. This is better than losing a transaction, but it is nevertheless a nuisance.
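As a sketch, this manual recovery typically looks like the following on the slave (the number of event groups to skip depends on what was actually duplicated, so treat the value here as a placeholder):

```sql
-- Sketch: skip one duplicated event group after a crash,
-- then resume replication.
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
```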
Removing the primary key to prevent the slave from stopping will only solve the problem partially: it means that the transaction would be applied twice, which would both place a burden on the application to handle duplicate entries and also require the tables to be cleaned regularly.
Both of these approaches require either manual intervention or scripting support to handle. This does not affect reliability, but it is so much easier to handle if the replication information is committed in the same transaction as the data being updated.<p>
<h4>Crash-safe masters</h4>
Two problems related to crash-safe replication have been fixed in the master, both of which could cause some annoyance when the master recovered.
<ul>
<li>If the master crashed while a binary log was being rotated, it was possible that some orphan binlog files ended up in the binary log index file. This was fixed in 5.1 but is also a piece in the puzzle of having crash-safe replication.</li>
<li>Writing to the binary log is not an atomic operation, and if a crash occurred while writing to the binary log, there was a possibility of a partial event at the end of the binary log.<p>
Now, the master recovers from this by truncating the binary log to the last known good position, removing the partially written transaction and rolling back the outstanding transactions in the storage engines.</li>
</ul>
<table class="figure-right">
<caption>Figure 1. Moving position information update into transaction</caption>
<tr><td>
<object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiz2bv1DkWpAL2z0_NvVkDJoICfSVp-Bz9CyX2W8_9H2jCdtm52NzGv_C_TQ6hMlCroAlyARxXMJ5StBbX1z7oyn-U3ACEXCsKsXz9HnrduPSMA5PxL7MCtaEYC9tbvYpvNcjkV//" type="image/png" width="240" height="210">
</object>
</td><td>
<object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDCEcdRVenx3qo9K_iS9iQwPBNx7KP-L9L2oiyTXLTBydBtEwpnQubHTF6CxbtdSKgWDQDcIQFADC9EC8uity0NNED-F181t2ugZQcGBscTHqXjJKRJBHKwCFxJFUyZgONwV7I//" type="image/png" width="240" height="210">
</object>
</td></tr>
</table>
<h4>Crash-safe slaves</h4>
Several different solutions for implementing crash-safety—or <em>transactional replication</em>, as it is sometimes known—have been proposed, with Google's <a href="http://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication">TransactionalReplication</a> patch being the best known. This solution stores the replication positions in the InnoDB transaction log, but the MySQL replication team decided to instead implement crash-safety by moving the replication progress information into system tables. This is a more flexible solution and has several advantages compared to storing the positions in the InnoDB transaction log:
<ul>
<li>If the replication information and data is stored in the same storage engine, it will allow both the data and the replication position to be updated as a single transaction, which means that it is crash-safe.</li>
<li>If the replication information and data is stored in different storage engines, but both support XA, they can still be committed as a single transaction.</li>
<li>The replication information is flushed to disk together with the transaction data. Hence, writing the replication information directly to the InnoDB redo log does not offer a speed advantage, while storing it in tables does not prevent the user from reading the replication progress information easily.</li>
<li>The tables can be read from a normal session using SQL commands, which also means that it can be incorporated into such things as stored procedures and stored functions.</li>
</ul>
<TABLE class="simple-table figure-right">
<caption><strong>Table 1. </strong><code>slave_master_info</code></caption>
<TR><TH>Field</TH><TH>Line in file</TH><TH>Slave status column</TH></TR>
<TR>
<TD>Master_id</TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD>Number_of_lines</TD>
<TD align="right">1</TD>
<TD></TD>
</TR>
<TR>
<TD>Master_log_name</TD>
<TD align="right">2</TD>
<TD><code>Master_Log_File</code></TD>
</TR>
<TR>
<TD>Master_log_pos</TD>
<TD align="right">3</TD>
<TD><code>Read_Master_Log_Pos</code></TD>
</TR>
<TR>
<TD>Host</TD>
<TD align="right">3</TD>
<TD><code>Master_Host</code></TD>
</TR>
<TR>
<TD>User_name</TD>
<TD align="right">4</TD>
<TD><code>Master_User</code></TD>
</TR>
<TR>
<TD>User_password</TD>
<TD align="right">5</TD>
<TD></TD>
</TR>
<TR>
<TD>Port</TD>
<TD align="right">6</TD>
<TD><code>Master_Port</code></TD>
</TR>
<TR>
<TD>Connect_retry</TD>
<TD align="right">7</TD>
<TD><code>Connect_Retry</code></TD>
</TR>
<TR>
<TD>Enabled_ssl</TD>
<TD align="right">8</TD>
<TD><code>Master_SSL_Allowed</code></TD>
</TR>
<TR>
<TD>Ssl_ca</TD>
<TD align="right">9</TD>
<TD><code>Master_SSL_CA_File</code></TD>
</TR>
<TR>
<TD>Ssl_capath</TD>
<TD align="right">10</TD>
<TD><code>Master_SSL_CA_Path</code></TD>
</TR>
<TR>
<TD>Ssl_cert</TD>
<TD align="right">11</TD>
<TD><code>Master_SSL_Cert</code></TD>
</TR>
<TR>
<TD>Ssl_cipher</TD>
<TD align="right">12</TD>
<TD><code>Master_SSL_Cipher</code></TD>
</TR>
<TR>
<TD>Ssl_key</TD>
<TD align="right">13</TD>
<TD><code>Master_SSL_Key</code></TD>
</TR>
<TR>
<TD>Ssl_verify_server_cert</TD>
<TD align="right">14</TD>
<TD><code>Master_SSL_Verify_Server_Cert</code></TD>
</TR>
<TR>
<TD>Heartbeat</TD>
<TD align="right">15</TD>
<TD></TD>
</TR>
<TR>
<TD>Bind</TD>
<TD align="right">16</TD>
<TD><code>Master_Bind</code></TD>
</TR>
<TR>
<TD>Ignored_server_ids</TD>
<TD align="right">17</TD>
<TD><code>Replicate_Ignore_Server_Ids</code></TD>
</TR>
<TR>
<TD>Uuid</TD>
<TD align="right">18</TD>
<TD><code>Master_UUID</code></TD>
</TR>
<TR>
<TD>Retry_count</TD>
<TD align="right">19</TD>
<TD><code>Master_Retry_Count</code></TD>
</TR>
</TABLE>
In addition to giving us crash-safe slaves, the last of these advantages should not be taken lightly. Being able to handle replication from pure SQL puts some of the key features in the hands of application developers.<p>
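For example, with table-based repositories the applied position can be read from any session with a plain query, something like this sketch (column names as in Table 2):

```sql
-- How far has the SQL thread applied? Just ask with SQL.
SELECT Master_log_name, Master_log_pos
  FROM mysql.slave_relay_log_info;
```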
As previously mentioned, the replication information is stored in two files:
<dl>
<dt><code>master.info</code>
<dd>This file contains information about the connection to the master—such as hostname, user, and password—but also information about how much of the binary log has been transferred to the slave.
<dt><code>relay-log.info</code>
<dd>This file contains information about the current state of replication, that is, how much of the relay log has been applied.
</dl>
<h4>Options to select replication information repository</h4>
In order to make the solution flexible, we introduced a general API for adding replication information repositories. This means that we can support multiple types of repositories for replication information, but currently only the old system using the files <code>master.info</code> and <code>relay-log.info</code> and the new system using the tables <code>slave_master_info</code> and <code>slave_relay_log_info</code> are supported. To select what type of repository to use, two new options were added. These options are also available as server variables.
<dl>
<dt><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_master-info-repository"><var>master_info_repository</var></a>
<dd>The type of repository to use for the master info data seen in Table 1.
<dt><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary.html#option_mysqld_relay-log-info-repository"><var>relay_log_info_repository</var></a>
<dd>The type of repository to use for the relay log info seen in Table 2.
</dl>
Both of the variables can be set to either <code>FILE</code> or <code>TABLE</code>. If the variable is set to <code>TABLE</code> the new table-based system will be used and if it is set to <code>FILE</code>, the old file-based system will be used. The default is <code>FILE</code>, so make sure to set the value if you want to use the table-based system.<p>
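As a minimal configuration sketch, the two options can be placed in the server's configuration file so that the table-based system is used from startup:

```
# my.cnf fragment: enable table-based replication repositories
[mysqld]
master_info_repository    = TABLE
relay_log_info_repository = TABLE
```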
<TABLE class="simple-table figure-right" id="slave_relay_log_info">
<caption><strong>Table 2. </strong><code>slave_relay_log_info</code></caption>
<TR><TH>Field</TH><TH>Line in file</TH><TH>Slave status column</TH></TR>
<TR><TD>Master_id</TD><TD></TD><TD></TD></TR>
<TR><TD>Number_of_lines</TD><TD>1</TD><TD></TD></TR>
<TR><TD>Relay_log_name</TD><TD>2</TD><TD><code>Relay_Log_File</code></TD></TR>
<TR><TD>Relay_log_pos</TD><TD>3</TD><TD><code>Relay_Log_Pos</code></TD></TR>
<TR><TD>Master_log_name</TD><TD>4</TD><TD><code>Relay_Master_Log_File</code></TD></TR>
<TR><TD>Master_log_pos</TD><TD>5</TD><TD><code>Exec_Master_Log_Pos</code></TD></TR>
<TR><TD>Sql_delay</TD><TD>6</TD><TD><code>SQL_Delay</code></TD></TR>
</TABLE>
If you look at Table 1 and Table 2, you can see the column names used for the tables as well as the line number in the corresponding file and the column name in the output of <code>SHOW SLAVE STATUS</code>. When tables are used, the column names identify where each value is stored; when a file is used, the value is instead stored at the line number given in the table.<p>
The format of the tables has been extended with an additional field that is not present in the files: the <code>Master_id</code> field. The reason we added this is to make it possible to extend the server to track multiple masters. Note that we currently have no definite plans to add multi-source support, but as good engineers we do not want these tables to be a hindrance to adding it.<p>
<h4>Selecting replication repository engine</h4>
In contrast to most of the system tables in the server, the replication repositories can be configured to use any storage engine you prefer. The advantage of this is that you can select the same engine for the replication repositories as for the data you're managing. If you do that, both the data and the replication information will be committed as a single transaction.<p>
The new tables are created at installation by the <code>mysql_install_db</code> script, as usual, and the default engine for these tables is the same as for all system tables: MyISAM. As you know, MyISAM is not very transactional, so it is necessary to switch to InnoDB if you really want crash-safety. To change the engine for these tables you can just use a normal <samp>ALTER TABLE</samp>.
<pre class="code">
slave> ALTER TABLE mysql.slave_master_info ENGINE = InnoDB;
slave> ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;
</pre>
Note that this works for these tables because they were designed to allow any storage engine to be used for them, but it does not mean that you can change the storage engine for other system tables and expect it to work.
<h4>Event processing</h4>
This implementation of crash-safe slaves works naturally with both statement-based and row-based replication, and there is nothing special that needs to be done in the normal cases. However, these tables interleave with the normal processing in slightly different ways.<p>
To understand how transactions are processed by the SQL thread, let us consider the following example transaction:
<pre class="code">
START TRANSACTION;
INSERT INTO articles(user, title, body)
VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
COMMIT;
</pre>
This transaction will be written to the binary log, sent over to the slave, and written to the relay log in the usual way. Once it is read from the relay log for execution, it will be executed as if an update statement were added to the end of the transaction, just before the commit:
<pre class="code">
START TRANSACTION;
INSERT INTO articles(user, title, body)
VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
<span style="color: red">UPDATE mysql.slave_relay_log_info
SET Master_log_pos = <var>@@Exec_Master_Log_Pos</var>,
Master_log_name = <var>@@Relay_Master_Log_File</var>,
Relay_log_name = <var>@@Relay_Log_File</var>,
Relay_log_pos = <var>@@Relay_Log_Pos</var></span>
COMMIT;
</pre>
In this example, there are a number of pseudo-server variables (that is, they don't exist for real) that have the same names as the corresponding fields in the result set from <code>SHOW SLAVE STATUS</code>. As you can see, the update of the position information is now inside the transaction and will be committed with it, so if both <code>articles</code> and <code>mysql.slave_relay_log_info</code> are in <em>the same transactional engine</em>, they will be committed as a unit.<p>
This works well for the SQL thread, but what about the I/O thread? There are no transactions executed here at all, so when is the information in this table committed?<p>
Since a commit to the table is expensive—in the same way as syncing a file to disk is expensive when using files as the replication information repository—the <code>slave_master_info</code> table is not updated for each processed event. Depending on the value of <var>sync_master_info</var>, there are a few alternatives.
<dl>
<dt>If <var>sync_master_info</var> = 0
<dd>In this case, the <code>slave_master_info</code> table is only updated when the slave starts or stops (for any reason, including errors), when the relay log is rotated, or when you execute a <code>CHANGE MASTER</code> command.
<dt>If <var>sync_master_info</var> > 0
<dd>Then the <code>slave_master_info</code> table will be updated every <var>sync_master_info</var> events.
</dl>
This means that while the slave is running, you cannot really see how much data has been read by the slave without stopping it. If it is important to see how the slave progresses in reading events from the master, you have to set <var>sync_master_info</var> to some non-zero value, but you should be aware that there is a cost associated with doing this.<p>
This does not usually pose a problem, since the times you need to read the master replication information while replication is running are few and far between. It is much more common to read it when the slave has stopped for some reason: to figure out where the error is or to perform a master fail-over.
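If you do need to monitor the I/O thread while the slave is running, a sketch of the trade-off looks like this (the value 100 is just an illustration):

```sql
-- Commit slave_master_info every 100 events instead of only at
-- start/stop, rotation, and CHANGE MASTER (costs extra commits).
SET GLOBAL sync_master_info = 100;
SELECT Master_log_name, Master_log_pos FROM mysql.slave_master_info;
```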
<h4>Closing remarks</h4>
We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try this out for yourselves then you can download the MySQL 5.6 Milestone Development Release where all this is implemented from the <a href="http://dev.mysql.com">MySQL Developer Zone (<code>dev.mysql.com</code>)</a>.
If you want to find out more details, the section <a href="http://dev.mysql.com/doc/refman/5.6/en/slave-logs-status.html">Slave Status Logs</a> in the MySQL 5.6 reference manual will provide you with all the information.
This is one of the features presented by Lars Thalmann on April 11, 2011 (yesterday) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and on April 12, 2011 (today) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com14tag:blogger.com,1999:blog-23496029.post-81038332488864940992011-04-11T14:49:00.002+02:002011-04-11T14:55:16.283+02:00Replication Event ChecksumMySQL replication is fast, easy to use, and reliable, but once it
breaks, it can be very hard to figure out what the problem is. One of the concerns often raised is that events become corrupted, either through failing hardware, network failure, or software bugs. Even though it is possible to handle errors during transfer over the network using an SSL connection, errors there are rarely the problem. A more common problem (relatively speaking) is that the events are corrupted due to a software bug or hardware error.<p>
To be able to better handle corrupted events, the replication team has added <em class="def">replication event checksums</em> to the MySQL 5.6 Milestone Development Release.
The replication event checksums are added to each event as it is
written to the binary log and are used to check that nothing happened to the event on the way to the slave. Since the checksums are added to all events in the binary log on the master and are both transferred over the network and written to the relay log on the slave, it is possible to catch corrupted events whether they are caused by hardware problems,
network failures, or software bugs.<p>
<table class="figure-right">
<caption>Figure 1. Master and Slave with Threads</caption>
<tr><td>
<object title="Master and Slave with Threads"
data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpooSKNs9URFGqVW9U-ysTzzWvNTH1ct1mPTWTarbkrUr3aVmpRDdseanigvddADAEnSXC-ruxKXO1jAnFVaXk4s20bW1PhLaDxt3Fdz-MKK_5knQtF-xaWznqzDRuH91tEK66//" type="image/png" width="490" height="230">
</object>
</td></tr>
</table>
The checksum used is a CRC-32 checksum, more precisely ISO-3309, which
is the one supplied with <a href="http://zlib.net/">zlib</a>. This is
an efficient checksum algorithm, but there is of course a penalty
since the checksum needs to be generated. At this time, we don't have
any measurements on the performance impact.<p>
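The checksum function itself is easy to experiment with outside the server. The following Python sketch uses zlib's CRC-32 (the same ISO-3309 polynomial) on a byte string standing in for an event body—the real binlog event layout is not modeled here—and shows that a single changed byte is detected:

```python
import zlib

def checksum_event(event_bytes: bytes) -> int:
    """ISO-3309 CRC-32 of an event body, as supplied by zlib."""
    return zlib.crc32(event_bytes) & 0xFFFFFFFF

# A stand-in for a binlog event body (illustration only).
event = b"INSERT INTO t1(name) VALUES ('Mats'),('Luis')"
crc = checksum_event(event)

# Corrupt a single byte, as a failing disk or network might.
corrupted = event.replace(b"Mats", b"Matz")

# The mismatch is how the receiving end detects the corruption.
assert checksum_event(corrupted) != crc
```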
If you look at Figure 1 you can see an illustration of how events
propagate through the replication system. In the figure, the points
where a checksum <em>could</em> be generated or checked are marked
with numbers. In the diagram, you can see the threads that handle the
processing of events, and an outgoing arrow from a thread can generate
a checksum while an arrow going into a thread can validate a
checksum. Note, however, that for pragmatic reasons not all
validations or generations can be done.<p>
To enable validation or generation, three new options were introduced:
<dl>
<dt><code>binlog_checksum</code>
<dd>This option is used to control checksum generation. Currently,
it can accept two different values: <code>NONE</code> and
<code>CRC32</code>, with <code>NONE</code> being default (for
backward compatibility).<p> Setting <var>binlog_checksum</var>
to <code>NONE</code> means that no checksum is generated, while
setting it to <code>CRC32</code> means that an ISO-3309 CRC-32
checksum is added to each binary log event.<p>
This means that a checksum will be generated by the session thread
and written to the binary log, that is, at point 1 in
Figure 1.
<dt><code>master_verify_checksum</code>
<dd>This option can be set to either 0 or 1 (with default being 0)
and indicates that the master should verify any events read from
the binary log on the master, corresponding to point 2 in
Figure 1. In addition to being read from the binary log by
the dump thread events are also read when a <kbd>SHOW BINLOG
EVENTS</kbd> is issued at the master and a check is done at this
time as well.<p>
Setting this flag can be useful to verify that the event really
written to the binary log is uncorrupted, but it is typically not
needed in a replication setting since the slave should verify the
event on reception.
<dt><code>slave_sql_verify_checksum</code>
<dd>Similar to <code>master_verify_checksum</code>, this option can
be set to either 0 or 1 (but defaults to 1) and indicates that the
<em>SQL thread</em> should verify the checksum when reading it
from the relay log on the slave.
Note that this means that the I/O thread writes a checksum to the
event written to the relay log, regardless of whether it received
an event with a checksum or not.<p>
This means that this option will enable verification at point 5 in
Figure 1 and also enable generation of a checksum at point
4 in the figure.
</dl>
If you paid attention, you probably noticed that there is no checking
for point 3 in the figure. This is not necessary since the checksum is
verified when the event is written to the relay log at point 4, and
the I/O thread just does a straight copy of the event (potentially
adding a checksum, as noted above).<p>
So, how does it look when we encounter a checksum error? Let's try it
out and see what happens.
We start by generating a simple binary log with checksums turned on
and see what we get.
<pre class="code wide">
master> <kbd>CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, name CHAR(50));</kbd>
Query OK, 0 rows affected (0.04 sec)
master> <kbd>INSERT INTO t1(name) VALUES ('Mats'),('Luis');</kbd>
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0
master> <kbd>SHOW BINLOG EVENTS FROM 261;</kbd>
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query | 1 | 333 | BEGIN |
| master-bin.000001 | 333 | Intvar | 1 | 365 | INSERT_ID=1 |
| master-bin.000001 | 365 | Query | 1 | 477 | use `test`; INSERT INTO t1(name) VALUES ('Mats'),('Luis') |
| master-bin.000001 | 477 | Query | 1 | 550 | COMMIT |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)
</pre>
Here, everything looks as before, so no sign of a checksum here, but
let's edit the binlog file directly and change the 's' in 'Mats' to a
'z' and see what happens. First with
<code>MASTER_VERIFY_CHECKSUM</code> set to 0, and then with it set to
1.
<pre class="code wide">
master> <kbd>SHOW BINLOG EVENTS FROM 261;</kbd>
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query | 1 | 333 | BEGIN |
| master-bin.000001 | 333 | Intvar | 1 | 365 | INSERT_ID=1 |
| master-bin.000001 | 365 | Query | 1 | 477 | use `test`; INSERT INTO t1(name) VALUES ('Matz'),('Luis') |
| master-bin.000001 | 477 | Query | 1 | 550 | COMMIT |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)
master> <kbd>SET GLOBAL MASTER_VERIFY_CHECKSUM=1;</kbd>
Query OK, 0 rows affected (0.00 sec)
master> <kbd>SHOW BINLOG EVENTS FROM 261;</kbd>
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error
</pre>
Now, the error message generated is not crystal clear, but there
was an I/O error when reading the binary log: the checksum
verification failed. You can see this because I could show the contents
of the binary log with <code>MASTER_VERIFY_CHECKSUM</code> set to 0,
but not with it set to 1. Since the checksum is checked when reading
events from the binary log, we get a checksum failure when using
<kbd>SHOW BINLOG EVENTS</kbd>.<p>
So, if we undo the corruption and verify that the binary log is
correct by issuing a <kbd>SHOW BINLOG EVENTS</kbd> again, we can
send it over to the slave and see what happens. The steps to do this
(in case you want to try it yourself) are:
<ol>
<li>Start the I/O thread and let it create the relay log using
<kbd>START SLAVE IO_THREAD</kbd>.</li>
<li>Stop the slave using <kbd>STOP SLAVE</kbd> (this is necessary
since the slave buffers part of the relay log).</li>
<li>Manually edit the relay log to corrupt one event (I replaced the
's' with a 'z').</li>
<li>Start the slave using <kbd>START SLAVE</kbd>.</li>
</ol>
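Step 3—corrupting a single byte—can be done with any hex editor, but a short script works just as well. This helper is my own, not part of MySQL, and the file name in the usage comment is only an example:
<pre class="code">
def corrupt_byte(path: str, needle: bytes, replacement: bytes) -> None:
    """Replace the first occurrence of `needle` with `replacement`
    (which must have the same length) in the file, in place."""
    assert len(needle) == len(replacement)
    with open(path, 'r+b') as f:
        data = f.read()
        pos = data.find(needle)
        if pos &lt; 0:
            raise ValueError('pattern not found')
        f.seek(pos)
        f.write(replacement)

# Example: corrupt the row data in a stopped slave's relay log.
# corrupt_byte('slave-relay-bin.000002', b'Mats', b'Matz')
</pre>
Remember to stop the slave first, as noted above, since part of the relay log is buffered in memory.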
The result when doing this is an error, as you can see below. Removing
the corruption and starting the slave again will apply the events as
expected.
<pre class="code wide">
slave> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
.
.
.
Master_Log_File: master-bin.000001
Read_Master_Log_Pos: 550
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 419
Relay_Master_Log_File: master-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
.
.
.
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1594
Last_SQL_Error: Relay log read failure: Could not parse
relay log event entry. The possible
reasons are: the master's binary log is
corrupted...
.
.
.
Last_SQL_Error_Timestamp: 110406 09:41:40
1 row in set (0.00 sec)
</pre>
Now, this is all very nice, but if you have a corruption, you also
want to find out where the corruption is—preferably
without having to start the server. To handle this, the
<code>mysqlbinlog</code> program was extended to print the CRC
checksum (if there is one) and also to verify it if you give it the
<var>verify-binlog-checksum</var> option.
<pre class="code wide">
$ <kbd>client/mysqlbinlog --verify-binlog-checksum master-bin.000001</kbd>
.
.
.
<em style="color: red"># at 261</em>
#110406 8:35:28 server id 1 end_log_pos 333 <em style="color: red">CRC32 0xed927ef2</em> Query thread_id=1...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406 8:35:28 server id 1 end_log_pos 365 CRC32 0x01ed254d Intvar
SET INSERT_ID=1/*!*/;
<em style="color: red">ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely...</em>
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
</pre>
As you can see, an error is emitted for the offending event, and you
can also see the CRC checksum value (which is 32 bits) in the output
above; the position of the failing event corresponds to the position
where the slave stopped for my corrupted binary log.<p>
This is just the beginning: there are many things that can be done
using checksums, and many new things that are now possible to
implement. If you think that this is a useful feature, please let us
know, and if you think that it needs to be enhanced, changed, or
extended, we would also like to hear from you.
<h2>Closing remarks</h2>
We would be very interested in hearing any comments you have on this
feature and how it is implemented. If you want to try this out for
yourselves then you can download the MySQL 5.6 Milestone Development
Release where all this is implemented from the <a
href="http://dev.mysql.com">MySQL Developer Zone
(<code>dev.mysql.com</code>)</a>.<p>
If you want to find out the details, the reference documentation for
the replication checksum can be found together with the options
mentioned above:
<ul>
<li><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_binlog-checksum">Manual for <var>binlog-checksum</var> option</a></li>
<li><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_master-verify-checksum">Manual for <var>master-verify-checksum</var> option</a></li>
<li><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_slave-sql-verify-checksum">Manual for <var>slave-sql-verify-checksum</var> option</a></li>
</ul>
This is one of the features presented by Lars Thalmann today (April 11, 2011) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and tomorrow (April 12, 2011) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.
<h2>Slave Type Conversions (2011-02-08)</h2>
[Note: I'm testing to use <a href="http://code.google.com/p/googlecl/" >googlecl</a> to post this article.]
Replication is typically used to replicate from a master to one or
more slaves using the same definition of tables on the master and
slave, but in some cases you want to replicate to tables with a
different definition on the slave, for example:
<ul>
<li>Adding a timestamp column on the slave to see when the row was
last updated.</li>
<li>Eliminating some columns on the slave because you don't need
them and they take up space that you can use for better
purposes.</li>
<li>Temporarily handling an on-line upgrade of a dual-master or
circular replication setup.</li>
</ul>
Of these alternatives, the last one is critical to any deployment that
wants to stay available. If this case can be handled, most other
changes can also be handled, so let's focus on that.<p>
<table class="figure-right">
<caption>Figure 1. Table with an extra column on slave</caption>
<TR><TH>Master</TH><TH>Slave</TH></TR>
<TR>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name VARCHAR(64),
email VARCHAR(64),
PRIMARY KEY (id))
</pre></TD>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name VARCHAR(64),
email VARCHAR(64),
<strong>ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP</strong>
PRIMARY KEY (id))
</pre></TD>
</TR>
</table>
When using statement-based replication, the plain statements are
replicated—this can at times be an advantage, but not
always, as you will soon see. The most obvious case is when you have
more or fewer columns on the master than you have on the slave.
To illustrate the problem, let us start with the table definitions in
Figure 1. Here a timestamp column was added to the slave to see
when the row was last changed. When using statement-based replication,
we can properly replicate between these tables provided we always give
column names to the statement on the master, for example:<p>
<pre class="code">
master> <strong>INSERT INTO employee(name, email) VALUES ('Mats', 'mats@example.com');</strong>
master> <strong>DELETE FROM employee WHERE email = 'mats@example.com';</strong>
master> <strong>UPDATE employee SET name = 'Matz' WHERE email = 'mats@example.com';</strong>
</pre>
In all these cases, the statements execute perfectly well with both
table definitions since the "missing" column has a default value and
each statement gives exactly the names of the columns to update.
The <code>DELETE</code> and <code>UPDATE</code> statements naturally
refer only to the column on the master, but for <code>INSERT</code> it
is necessary to add the column names even if the tuple matches the
definition on the master since it could be different on the slave.<p>
Having to give the column names all the time is fragile, and if the
user—or the application—makes a mistake and types the
following statement, replication on the slave will stop with an
error:<p>
<pre class="code">
master> <strong>INSERT INTO employee VALUES (DEFAULT, 'Mats', 'mats@example.com');</strong>
</pre>
In contrast to statement-based replication, row-based replication will
do the right thing and throw away extra columns sent by the master or
add default values to extra columns on the slave—if the column
has a default value—<em>provided that the columns are added or
removed last in the table.</em><p>
This works fine for the example above since the extra timestamp column
is last in the table. The effect is to keep track of when the row was
last updated on the slave, which could be used to see if the row is
current.
<div class="note">Depending on what you want to accomplish, there
could be better techniques for this, described in <a
href="http://mysqlhighavailability.com/">our book</a>. The problem is
that the timestamp might not have enough precision in a high-load
situation.</div>
So, row-based replication in MySQL 5.1 contains support for using more
or fewer columns on the slave as compared to the master, but there
was one case that was not supported: replicating between different
column types. This is very important for basic upgrade scenarios
where you, for example, change the size of some column during an
upgrade.<p>
<table class="figure-right">
<caption>Figure 2. Different types on master and slave</caption>
<TR><TH>Master</TH><TH>Slave</TH></TR>
<TR>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name CHAR(64),
email CHAR(64),
PRIMARY KEY (id))
</pre></TD>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name VARCHAR(64),
email VARCHAR(64),
PRIMARY KEY (id))
</pre></TD>
</TR>
</table>
For example, consider the table definition in Figure 2.
In this case, the intention is to save space on the slave by storing
the strings in a <code>VARCHAR</code> field instead of a
<code>CHAR</code> field—recall that <code>VARCHAR</code> fields
are variable length strings while <code>CHAR</code> fields occupy a
fixed space in the row. (We don't care too much about the reasons for
using <code>CHAR</code> on the master, we just use this example to
illustrate the problem.)<p>
When using statement-based replication, this works well since the actual
statement is replicated. However, when using row-based replication we
have the additional requirement (in 5.1) that the column types
<em>have to have identical base types</em>. Unfortunately,
<code>CHAR</code> and <code>VARCHAR</code> do not have the same base
type, so replication will stop with an error when you try to execute
the <code>INSERT</code>, which is not very helpful.<p>
Fortunately, the replication team has extended row-based replication
with a new feature in MySQL 5.5: converting between types when
replicating from a master to a slave with a different table
definition. With this feature, stricter type checking and better
error messages are also implemented.<p>
The conversion checks the <em>declared types</em> on the master and
slave and decides before executing the transaction if the conversion
is allowed. This means that it does not investigate the actual
<em>values</em> replicated: only the types of the column on the master
and the slave. In addition to giving better performance compared to
checking each value, this check guarantees that
<em>any</em> value replicated between the tables will work, not just
the values that you happened to have in your test suite.<p>
When dealing with conversions, we are only considering conversions
<em>within</em> the groups below.
<dl>
<dt><strong>Integer types</strong>
<dd><code>TINYINT</code>, <code>SMALLINT</code>,
<code>MEDIUMINT</code>, <code>INT</code>, <code>BIGINT</code>
<dt><strong>Decimal types</strong>
<dd><code>DECIMAL</code>, <code>FLOAT</code>, <code>DOUBLE</code>,
<code>NUMERIC</code>
<dt><strong>String types</strong>
<dd><code>CHAR(<em>N</em>)</code>, <code>VARCHAR(<em>N</em>)</code>,
<code>TEXT</code> even for different values of <em>N</em> on master
and slave.
<dt><strong>Binary types</strong>
<dd><code>BINARY(<em>N</em>)</code>,
<code>VARBINARY(<em>N</em>)</code>, <code>BLOB</code> even for
different values for <em>N</em> on master and slave.
<dt><strong>Bit types</strong>
<dd>Conversion between <code>BIT(<em>N</em>)</code> for different
values of <em>N</em> on master and slave.
</dl>
Since the string and binary types only differ in the character set
they use—<em>and replication is not aware of character sets
yet</em>—replication between string and binary types will be
possible simply because the character set is not known. Don't rely on
this though; as soon as <a
href="http://bugs.mysql.com/bug.php?id=47673" >Bug#47673</a> is fixed,
string and binary types will be separated into distinct groups and
replication will stop if the character sets don't allow conversion.<p>
Within each group, we also have two types of conversions:
<em>non-lossy conversions</em> and <em>lossy conversions</em>. With a
non-lossy conversion you are guaranteed that no information is lost,
but with lossy conversions it is possible that you lose some
information. A typical example of a non-lossy conversion is converting
from a <code>CHAR(32)</code> field to a <code>CHAR(64)</code>
field—since the target field is wider than the source field,
there is no risk that any part of the string is lost. Converting in
the other direction, however, is a lossy conversion since a string
with more than 32 characters cannot fit into a <code>CHAR(32)</code>
field. A more unusual example is conversion between <code>FLOAT</code> and
<code>DECIMAL(N,M)</code>, which is <em>always</em> considered lossy,
regardless of the direction in which the conversion is done, since it
cannot be guaranteed that all floating-point numbers can be converted
to decimal numbers without losing precision, and vice versa.<p>
What conversions are allowed is controlled with a new
server variable <code>SLAVE_TYPE_CONVERSIONS</code>, which is of the
type <code>SET('ALL_LOSSY','ALL_NON_LOSSY')</code>, that is, it is a
<em>set</em> of allowed conversions. The default for this variable is
the empty set, meaning that no conversions are allowed at all.<p>
If the <code>ALL_NON_LOSSY</code> constant is in the set, all
conversions (within each group) that do not lose any information are
allowed. For example, replicating from <code>CHAR(32)</code> to
<code>TINYTEXT</code> is allowed since the conversion goes to a wider
field (even if it is a different type).<p>
If the <code>ALL_LOSSY</code> constant is in the set, all conversions
(again, within the same group) that could potentially lose information
are allowed. For example, conversion to a narrower field on the slave,
such as <code>CHAR(32)</code> to <code>CHAR(16)</code> is
allowed. <em>Note that non-lossy conversions are not automatically
allowed when <code>ALL_LOSSY</code> is set.</em><p>
<div class="note">The prefix <code>ALL</code> is used since we were considering the possibility of allowing conversions within certain groups only; for example, to allow only lossy conversions for strings and only non-lossy conversions for integers, we could set <code>SLAVE_TYPE_CONVERSIONS</code> to <code>'STRING_LOSSY,INTEGER_NON_LOSSY'</code>. This is, however, pure speculation at this time.</div>
If you are interested in the details of how slave type conversions work, you can find more information in the MySQL Reference Manual in <a href="http://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html" >Replication with Differing Tables on Master and Slave</a>.
<h2>Have you seen my replication files? (2010-09-22)</h2>
I recently started looking into how to get information about the relay log files and binary log files using an SQL interface. Being able to do that can be quite handy when working with replication in various ways. In my particular case, I wanted the paths to the relay log index file and the binary log index file so that I could read the binary log files as well as the relay log files directly.<p>
You are probably familiar with the <code>--relay-log-index</code> and <code>--relay-log</code> options that can be set to specify where the index file and the relay log files are created. These options can be given either an absolute or a relative path: if the value starts with a <code>/</code>, it is considered an absolute path (drive letters are allowed on Windows, though); otherwise, the path is relative to the data directory (which is specified through the <code>--datadir</code> option). The values supplied to these options are available from SQL as the system variables <var>relay_log_index</var> and <var>relay_log</var> respectively.
The recommendation is to always set <code>--relay-log</code> and <code>--relay-log-index</code>, since the default values for these options contain the hostname. The problem with this is that if the database files are moved to a machine with a different hostname, the server will not be able to pick up the relay log files correctly and will assume that they do not exist.
The logic for finding the location of the relay log files can be quite daunting; to find the location of the relay log index file:
<ol>
<li>If <var>relay_log_index</var> is set, this is the location of the relay log index file.</li>
<li>If <var>relay_log_index</var> is not set, then the value supplied to the <var>relay_log</var> option is used to figure out the name of the relay log index file.</li>
<li>If neither <var>relay_log_index</var> nor <var>relay_log</var> is set, then the name of the relay log index file is taken by stripping the directory and extension from the <var>pid_file</var> variable (set using the <code>--pid-file</code> option), if supplied, and adding <code>-relay-bin.index</code> to the end of the string.
<ul>
<li>The <var>pid_file</var> variable has a default value which consists of <code><var>datadir</var>/<var>hostname</var>.pid</code>, which would give the relay log index file a name of <code><var>datadir</var>/<var>hostname</var>-relay-bin.index</code>.</li>
</ul></li>
<li>If the path is a relative path—that is, the path does not start with a directory separator—then the value of <var>datadir</var> is prepended to the relay log index file name.</li>
</ol>
Keeping track of all these details is not something I want to spend my time on, so I wrote a stored function for computing the name of the relay log index file which I simply called <code>relay_log_index_file</code>:
<pre class="code">
CREATE FUNCTION relay_log_index_file () RETURNS VARCHAR(256)
DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE rli_name VARCHAR(256);
IF @@relay_log_index IS NOT NULL THEN
SET rli_name = @@relay_log_index;
ELSEIF @@relay_log IS NOT NULL THEN
-- The index file name is the relay log base name plus '.index'
SET rli_name = CONCAT(@@relay_log, '.index');
ELSE
BEGIN
DECLARE l_pid_file VARCHAR(256);
DECLARE l_pid_base VARCHAR(256);
SET l_pid_file = SUBSTRING_INDEX(@@pid_file, '/', -1);
SET l_pid_base = SUBSTRING_INDEX(l_pid_file, '.', 1);
SET rli_name = CONCAT(l_pid_base, '-relay-bin.index');
END;
END IF;
IF rli_name NOT LIKE '/%' THEN
RETURN CONCAT(@@datadir, rli_name);
END IF;
RETURN rli_name;
END
</pre>
This is a quite complicated way of figuring out the location of the relay log files and hardly something that I consider very useful. It would be much better if the <var>relay_log_index</var> variable gave the complete path to the file, regardless of what was given to the <var>--relay-log-index</var> option (or even if the option was given at all).<p>
Being able to fetch the relay log index file is quite convenient, but being able to fetch the binary log index file would be even more convenient. Unfortunately, there is no such variable. The <var>--log-bin</var> option can be used to supply a base name to use for the binary log, but the <var>log_bin</var> variable can only be ON or OFF, which in my book is not very smart. To fix this, I created <a href="http://forge.mysql.com/worklog/task.php?id=5465">WL#5465</a>, which introduces three new variables—<var>log_bin_basename</var>, <var>relay_log_basename</var>, and <var>log_bin_index</var>—and changes behaviour of <var>relay_log_index</var>.
<dl>
<dt><var>log_bin_basename</var>
<dd>This is a global read-only variable that contains the base file name used for the binary log files, that is, the path to the files but omitting the extension.
<ul>
<li>If a full path was given to <var>--log-bin</var>, this will be stored in <var>log_bin_basename</var>.</li>
<li>If a relative path was given to <var>--log-bin</var>, the contents of <var>datadir</var> will be used as the directory and prepended to the value of <var>--log-bin</var>.</li>
<li>Otherwise, the value of <var>datadir</var> will be used as the directory of the file and the base name is created by taking the basename of <var>pid_file</var> (the name without extension) and adding '<code>-bin</code>'.</li>
</ul>
<dt><var>log_bin_index</var>
<dd>This is a global read-only variable containing the full name to the binary log index file. If no value is given, the value of <var>log_bin_basename</var> is used and the extension '<code>.index</code>' is added.
<dt><var>relay_log_basename</var>
<dd>This is a global read-only variable containing the base file name used for the relay log file, that is, the full path to the relay logs but not including the extension. The value of this variable is created in the same way as for <var>log_bin_basename</var> with the only difference that the '<code>-relay-bin</code>' suffix is used instead of '<code>-bin</code>'.
<dt><var>relay_log_index</var>
<dd>This is a global read-only variable containing the full name of the relay log index file. If no value is given, the value of <var>relay_log_basename</var> is used and the extension '<code>.index</code>' is added.
</dl>
With these new variables, fetching the full path of the binary log index file is as easy as doing a '<code>SELECT @@log_bin_index</code>'.
<div class="note">If you're interested in whether the patch for this worklog will be in any particular server version or if it is pushed at all, you have to check the status of the worklog. Even though I have described the architecture and implemented a patch, there is no way to know where it ends up or even if it is pushed at all.</div>
<div class="digression">
<h1>An alternative: let the application do the job</h1>
Creating a stored function for computing the relay log index file name might be overkill in many situations. If the value is needed from several different connections, it makes sense to create it as a stored function so it can be used by different applications. It can, however, just as well be placed in the application code, which would then compute the location of the relay log index file using a single query to the server.<p>
The information you need is the data directory from <var>datadir</var>, the pid file name from <var>pid_file</var> (in the event that the relay log or the relay log index option does not have a value), and the <var>relay_log</var> and <var>relay_log_index</var> values.<p>
For example, the following Python code could be used to compute the data directory, the base use for creating relay log files, and the name of the index file using a single query to the database server:
<pre class="code">
import os.path
def get_relay_log_info(connection):
cursor = connection.cursor()
cursor.execute("SELECT @@datadir, @@pid_file, @@relay_log, @@relay_log_index")
datadir, pid_file, relay_log, relay_log_index = cursor.fetchone()
def _add_datadir(filename):
if os.path.isabs(filename):
return filename
else:
return os.path.join(datadir, filename)
pidfile_base = os.path.basename(os.path.splitext(pid_file)[0])
base_name = _add_datadir(relay_log or pidfile_base + '-relay-bin')
index_file = _add_datadir(relay_log_index or base_name + '.index')
return { 'datadir': datadir, 'base': base_name, 'index': index_file }
</pre>
</div>
<h2>Binary Log Group Commit - Recovery (2010-08-18)</h2>
It was a while since I wrote the <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">previous article</a>, but the merger of Oracle and Sun resulted in quite a lot of time being spent attending various events and courses for legal reasons (one of the reasons I prefer working for smaller companies), and together with a summer vacation spent looking over the house, there was little time for anything else. This is the second post of three, and in the last one I will cover some optimizations that improve performance significantly.<p>
In the <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">previous article</a>, an approach was outlined to handle binary log group commit. The basic idea is to use the binary log as a ticketing system by reserving space in it for the transactions that are going to be written. This provides an order on the transactions as well as allowing them to be written to the binary log in parallel, thereby boosting performance.
As noted in the previous post, a crash while writing transactions to the binary log requires recovery. To understand what needs to be changed, it is necessary to understand the structure of the binary log as well as how recovery after a crash currently works together with the implementation of <a href="http://dev.mysql.com/doc/refman/5.1/en/xa.html">2-phase commit that MySQL uses</a>.<p>
<table class="figure-right">
<caption>Figure 1. Binlog file structure</caption>
<tr><td>
<object title="Binary log structure"
data="http://www.kindahl.net/images/binlog-v4-structure.svg" type="image/svg+xml" width="400" height="196">
<object width="400" height="196" type="image/png" data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaHbanbq7d1Ws8qCtmx_kgyl40A9aF_EhFOx8siYFAJlFFZdkQi3KTfRCVVuVnAxtwUh4sBR8AVhd6-mRxDggGMFUfQ3QGfb6lxrUDRfWMYOf5D8XirkyKbXUIkp4iX-yL0nfW/s320/binlog-v4-structure.png" id="BLOGGER_PHOTO_ID_5506824342835241650">
</object>
</object>
</td></tr>
</table>
<h3>A quick intro to the structure of the binary log</h3>
Figure 1 gives the rough structure of the binary log, with a set of <em>binlog files</em> and a <em>binlog index file</em>. The binlog index file just lists the binlog files that make up the binary log, while each binlog file contains the real contents of the binary log that you can see when executing a <code>SHOW BINLOG EVENTS</code>.<p>
Each binlog file consists of a sequence of <em>binlog events</em>, of which the most important event from our perspective is the <em>Format description event</em>. In addition, each binlog file is normally terminated by a <em>Rotate event</em> that refers to the next binlog file in the sequence.<p>
The Format description event is used to describe the contents of the binlog file and therefore contains a lot of information about it. In this case we are interested in a special flag called <code>LOG_EVENT_BINLOG_IN_USE_F</code>, which is used to tell whether the binlog file is actively being written by the server. When the server opens a new binlog file, this flag is set to indicate that the file is in use, and when the binary log is rotated and a new binlog file created, the flag is cleared when closing the old binlog file.<p>
In the event of a crash, the flag will therefore be set, and the server can see that the file was not closed properly and start performing recovery.
<h3>Recovery and the binary log</h3>
When recovering, the server has to find all transactions that were partially executed and decide if they are going to be rolled back or committed properly. The deciding point when a transaction will be committed instead of rolled back is when the transaction has been written to the binary log. To do this, the server has to find all transactions that were written to the binary log and tell all storage engines to commit these transactions.<p>
The recovery procedure is executed when the binary log is opened—which the server does by calling <code>TC_LOG_BINLOG::open</code> during startup. When the binary log is opened, recovery is done if the last open binlog file was not closed properly. An outline of the procedure executed is:
<ol>
<li>Open the binlog index file and go through it to find the last binlog file mentioned there [<code>TC_LOG_BINLOG::open</code>]</li>
<li>Open this binlog file and check if the <code>LOG_EVENT_BINLOG_IN_USE_F</code> flag is set</li>
<li>If the flag was clear, then the server stopped properly and no recovery is necessary. Otherwise, the server did not stop properly and recovery starts.</li>
<li>The last binlog file is now open, so the entire binlog file is scanned and the XID of each Xid event is recorded. These XIDs denote the transactions that were properly written to the binary log—that is, the transactions that shall be committed [<code>TC_LOG_BINLOG::recover</code>].</li>
<li>Each storage engine is handed the list of XIDs of transactions to commit through the <code>handlerton::recover</code> interface function [<code>ha_recover</code>].</li>
<li>The storage engine will then commit each transaction in the list and roll back all the others.</li>
</ol>
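The XID-collection step of the procedure above can be sketched as follows. This is a simplified model of my own—events are represented as Python dictionaries rather than the binary event format—of how the scan in <code>TC_LOG_BINLOG::recover</code> partitions prepared transactions into those to commit and those to roll back:
<pre class="code">
def recover(binlog_events, prepared_xids):
    """Scan the last binlog file for Xid events; transactions whose
    XID appears there were fully written to the binary log and must
    be committed, while all other prepared transactions are rolled
    back by the storage engines."""
    committed = {e['xid'] for e in binlog_events if e['type'] == 'Xid'}
    to_commit = prepared_xids &amp; committed
    to_rollback = prepared_xids - committed
    return to_commit, to_rollback

events = [{'type': 'Query', 'info': 'BEGIN'},
          {'type': 'Xid', 'xid': 17}]
commit, rollback = recover(events, prepared_xids={17, 18})
assert commit == {17} and rollback == {18}
</pre>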
<table class="figure-right">
<caption>Figure 2. Parallel binary log group commit</caption>
<tr><td>
<object data="http://www.kindahl.net/images/binlog-crash-state.svg" width="300" height="300" type="image/svg+xml">
<object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu-UyBpHLaRINZQNu3SFbD-w5UTs_6GQQ-uAB8fqlu7RiP5uNcnhrThtYwhtoLixfGLSX2Vow0m_Lo0CmIfkjToC-tbVJeS2n3_uAo8ETXtqQT6LauRr3bGdulSf-SJ7qMAPrV/s320/binlog-crash-state.png" width="300" height="300" type="image/png" id="BLOGGER_PHOTO_ID_5506825179747519362">
</object>
</object>
</td></tr>
</table>
<h3>So, what's the problem?</h3>
The procedure above works fine, so what are the problems we have to solve to implement the procedure described in the <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">previous article</a>? Figure 2 gives a hint of what the problem is.<p>
Now, assume that threads 1, 2, and 3 in Figure 2 are writing transactions to disk (starting at positions <span class="math">Trans_Pos<sub>1</sub></span>, <span class="math">Trans_Pos<sub>2</sub></span>, and <span class="math">Trans_Pos<sub>3</sub></span> respectively) and that a preceding thread (a thread that got a binlog position before <code>Last_Complete</code>) decides that it is time to call <code>fsync</code> to group commit the state this far. The binlog file will then be written in this state—where some transactions are partially written—and <code>Last_Committed</code> will be set to the value of <code>Last_Complete</code>, leading to the situation depicted in Figure 2.<p>
As you can see in the figure, thread 2 has already finished writing data to the binary log, so its transaction data is on durable storage. Since thread 1—which precedes thread 2 in the binary log—has not completed yet, thread 2 has not yet committed and is still waiting for all the preceding transactions to complete. If a crash occurs in this situation, it is necessary to somehow find the XIDs of all transactions that have committed—excluding the transaction that thread 2 has completed—and commit them to the storage engines when recovering.<p>
<h3>A proposal for a new recovery algorithm</h3>
In the original algorithm, the scan of the binlog file stopped when the file ended. Since there can be partially written events in the binlog file after the "real" end of the file (the binlog file ends logically at <code>Last_Committed</code>/<code>Last_Complete</code>), we have to find some other way to detect the logical end of the file.<p>
To handle this, it is necessary to somehow mark events that are not yet committed so that the recovery algorithm can find the correct position where the binlog file ends. The same problem occurs if one wants to persist the end of the binlog file when <a href="http://forge.mysql.com/worklog/task.php?id=4925">preallocating the binlog file</a>. There are basically three ways to handle this:
<ul>
<li>Write the end of the binlog file in the binlog file header (that is, the Format description log event). </li>
<li>Mark each event by zeroing out a field that cannot be zero—for example, the length, the event type, or event position—before writing the event to the binary log. Then write this field with the correct value after the entire event has been written.</li>
<li><a href="http://forge.mysql.com/worklog/task.php?id=2540">Checksum the events</a> and find the end of the binlog by scanning for the first event with an incorrect checksum.</li>
</ul>
<dl>
<dt><strong>Write the length in the binlog file header</strong></dt>
<dd>Finding the length of the binlog in this case is easy: just inspect the header and find the length of the binlog file there. In this case, it is necessary to update the length after the event has been written since there may be an <code>fsync</code> call at any time between starting to write the event data and finishing writing the event. Normally, this means updating two blocks of the file for each event written, which can be a problem since it requires at least the block containing the header and all the blocks that were written since the last group commit to be written when calling <code>fsync</code>. If a large number of events are written between each <code>fsync</code>, this might not impose a large penalty, but with <code>sync_binlog=1</code> it might become quite expensive. Some experiments done by <a href="http://yoshinorimatsunobu.blogspot.com/">Yoshinori</a> showed a drop from 15k events/sec to 10k events/sec, which means that we lose one third in performance.<p>
<strong>Digression.</strong> The measurements that Yoshinori did consisted of one <code>pwrite</code> to write the event, one <code>pwrite</code> to write the length to the header, and then a call to <code>fsync</code>. It is, in other words, most similar to using <code>sync_binlog=1</code>. In reality, however, this will not be the case, since a user that is using binary log group commit will have several events written between each call to <code>fsync</code>. Since these writes will be to memory (the file pages are in memory), performance will not drop as much. To evaluate the behavior in a group commit situation better, writing 10 events at a time was compared as well (pretending to be <code>sync_binlog=10</code>). Straight append (using <code>write</code>) gave at that point 110k events/sec and writing to the header before calling <code>fsync</code> gave 80k events/sec. This means a performance reduction of 27%, which is an improvement but still a very large overhead.
</dd>
<dt><strong>Use a marker field</strong></dt>
<dd>The second alternative is to use one of the fields as a marker field. By setting one of the fields that cannot be zero to zero, it is possible to detect that the event is incorrect and stop at the event before that. Good candidate fields are the length—which cannot be zero for any event and is four bytes—and the event type, which is one byte and where zero denotes an unknown event, something that never occurs naturally in a binlog file. The technique would be to first blank out the type field of the event, write the event to the binlog file, and then use <code>pwrite</code> to fill in the correct type code after the entire event is written. If an <code>fsync</code> occurs before the event type is written, the event will be marked as unknown, and if a crash occurs before the event is completely written (and written to disk), it will be possible to scan the binlog file to find the first event that is marked as unknown. In order for this technique to work, it is necessary to zero the unused part of the binlog file before starting to write anything there (or at least zero out the event type). Otherwise, crash recovery will not be able to correctly detect where the last completely written event is located.<p>
Compared to the previous approach, this does not require writing to locations far apart (except in rare circumstances when the event spans two pages). It also has the advantage of not requiring any change of the binlog format. This technique is likely to be quite efficient. (Note that most of the writes will be to memory, so there will not be any extraneous "seeks" over the disk to zero out parts of the file.)</dd>
<dt><strong>Checksum on each event</strong></dt>
<dd>The third alternative is to rely on an <a href="http://forge.mysql.com/worklog/task.php?id=2540">event checksum</a> to detect events that are incompletely written. This approach is by far the most efficient of the approaches since the event checksum is naturally written last. It also has the advantage of not requiring the unused parts of the binlog file to be zeroed since it is unlikely that the checksum will be correct for the event unless the event has been fully written. This also makes it a very good candidate for detecting the end of the binlog file when preallocating the binlog file. The disadvantage is, of course, that it requires checksums to be enabled and implemented.</dd>
</dl>
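As a sketch of the marker-field technique just described, the following Python fragment blanks the type byte, writes the event, and then patches the real type in with <code>pwrite</code>. The one-byte-type-plus-body event layout is invented for the illustration; a real event has a full header.

```python
import os, tempfile

UNKNOWN_EVENT = 0  # a type code of zero never occurs naturally

def write_event(fd, pos, event_type, body):
    """Write an event with its type byte blanked, then patch the real
    type in afterwards (toy layout: one type byte followed by the body)."""
    os.pwrite(fd, bytes([UNKNOWN_EVENT]) + body, pos)
    # ... a crash or fsync() here leaves a detectably incomplete event ...
    os.pwrite(fd, bytes([event_type]), pos)
    return pos + 1 + len(body)

def find_logical_end(fd, body_len):
    """Scan fixed-size events until the first one still marked unknown."""
    pos = 0
    while True:
        marker = os.pread(fd, 1, pos)
        if len(marker) < 1 or marker[0] == UNKNOWN_EVENT:
            return pos
        pos += 1 + body_len

tmp = tempfile.TemporaryFile()
fd = tmp.fileno()
os.pwrite(fd, b"\0" * 64, 0)   # the unused tail must be pre-zeroed
pos = write_event(fd, 0, 2, b"abcd")
pos = write_event(fd, pos, 2, b"efgh")
```

Note how the pre-zeroing requirement mentioned above shows up in the sketch: without it, stale bytes after the last event could be mistaken for a valid type code.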
With this in mind, the best approach seems to be to checksum each event and use that to detect the end of the binary log. If necessary, the second approach can be implemented when the binlog is not checksummed.<p>
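A sketch of the checksum-based detection could look as follows. The length-body-CRC32 event layout is made up for this example (the actual checksummed event format is defined by the worklog), but the scanning principle is the same: stop at the first event that is truncated or whose checksum does not match.

```python
import struct, zlib

def encode_event(body):
    """Toy event layout: 4-byte length, body, then CRC32 of the rest."""
    data = struct.pack("<I", len(body)) + body
    return data + struct.pack("<I", zlib.crc32(data))

def find_logical_end(log):
    """Return the offset of the first missing or corrupt event, which
    is the logical end of the binlog after a crash."""
    pos = 0
    while pos + 8 <= len(log):
        (length,) = struct.unpack_from("<I", log, pos)
        end = pos + 4 + length + 4
        if end > len(log):
            break                      # event truncated by the crash
        data = log[pos:pos + 4 + length]
        (crc,) = struct.unpack_from("<I", log, pos + 4 + length)
        if zlib.crc32(data) != crc:
            break                      # partially written event
        pos = end
    return pos

# Two complete events followed by one that was cut short by a crash.
log = encode_event(b"trx1") + encode_event(b"trx2")
log += encode_event(b"trx3")[:-2]
```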
The next article will wrap up the description by pointing out some efficiency issues and how to solve them to get an efficient implementation.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com3tag:blogger.com,1999:blog-23496029.post-74989429084610725332010-04-30T06:23:00.008+02:002010-08-18T21:02:33.749+02:00Binary Log Group Commit - An Implementation ProposalIt is with interest that I read <a href="http://kristiannielsen.livejournal.com/12254.html">Kristian's</a>
<a href="http://kristiannielsen.livejournal.com/12408.html">three</a>
<a href="http://kristiannielsen.livejournal.com/12553.html">blogs</a>
on the binary log group commit. In these articles, he mentions InnoDB's
<code>prepare_commit_mutex</code> as the main hindrance to accomplish group commits—which it indeed is—and proposes to remove it with the motivation that <code>FLUSH TABLES WITH READ LOCK</code> can be used to get a good binlog position instead. That is a solution—but not really a good solution—as Kristian points out in the last post.<p>
The <code>prepare_commit_mutex</code> is used to ensure that the
order of transactions in the binary log is the same as the order of
transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work, so
removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of
transactions.<p>
To understand how to solve the problem, it is necessary to take a
closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.<p>
The transaction data is stored in a per-thread <em>transaction
cache</em> and the <em>transaction size</em> is the size of the data
in the transaction cache.
In addition, each transaction will have a <em>transaction binlog
position</em> (or just <em>transaction position</em>) where the
transaction data is written in the binary log.<p>
The procedure can be outlined in the following steps:<p>
<ol>
<li>Prepare InnoDB [<code>ha_prepare</code>]:</li>
<ol>
<li>Write prepare record to log buffer</li>
<li><code>fsync()</code> log file to disk (this can currently do
group commit)</li>
<li>Take <code>prepare_commit_mutex</code></li>
</ol>
<li>Log transaction to binary log [<code>TC_LOG_BINLOG::log_xid</code>]:</li>
<ol>
<li>Lock binary log</li>
<li>Write transaction data to binary log</li>
<li>Sync binary log based on <code>sync_binlog</code>. This forces
the binlog to always <code>fsync()</code> (no group commit) due to
<code>prepare_commit_mutex</code></li>
<li>Unlock binary log</li>
</ol>
<li>Commit InnoDB:</li>
<ol>
<li>Release <code>prepare_commit_mutex</code></li>
<li>Write commit record to log buffer</li>
<li>Sync log buffer to disk (this can currently do group commit)</li>
<li>InnoDB locks are released</li>
</ol>
</ol>
There are mainly two problems with this approach:
<ul>
<li>The InnoDB row level and table level locks are released very
late in the sequence, which affects concurrency. Ideally, we need
to release the locks very early, preferably as soon as we have
prepared InnoDB.</li>
<li>It is not possible to perform a group commit in step 2</li>
</ul>
As you can see here, the prepare of the storage engines (in this
case just InnoDB) is done before the binary log mutex is taken, and
that means that if the <code>prepare_commit_mutex</code> is removed it
is possible for another thread to overtake a transaction so that the
prepare and the write to the binary log are done in a different order.<p>
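The ordering problem can be made concrete with a toy simulation (pure illustration, not server code): two transactions prepare in one order but reach the binary log in the other.

```python
# Without a mutex spanning prepare and binlog write, another thread can
# slip in between the two steps. We simulate one such interleaving.
innodb_log = []
binary_log = []

def prepare(trx):
    innodb_log.append(trx)

def write_binlog(trx):
    binary_log.append(trx)

prepare("A")        # thread A prepares first ...
prepare("B")        # ... but thread B overtakes it ...
write_binlog("B")   # ... and reaches the binary log first,
write_binlog("A")   # so the two logs now disagree on the order.
```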
To solve this, Mark suggests using a queue or a ticket system to
ensure that transactions are committed in the same order, but we
actually already have such a system that we can use to assign tickets:
namely the binary log.<p>
The idea is to allocate space in the binary log for the transaction to
be written. This gives us a sequence number that we can use to order
the transactions.<p>
In the <a
href="http://forge.mysql.com/worklog/task.php?id=5223">worklog on
binary log group commits</a> you will find the complete description as
well as the status of the evolving work.<p>
In this post, I will outline an approach that <a
href="http://harrison-fisk.blogspot.com">Harrison</a> and I have
discussed, which we think will solve the problems mentioned above. In
this post, I will outline the procedure during normal operations, in
the <a href="http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html">next post</a> I will discuss recovery, and in the third post (but
likely not the last on the subject), I will discuss some optimizations
that can be done.<p>
I want to emphasize that the fact that we have a worklog does not
involve any guarantees or promises of what, when, or even if any
patches will be pushed to any release of MySQL.<p>
In <a
href="http://forge.mysql.com/worklog/task.php?id=4007">Worklog #4007</a> an
approach for writing the binary log is suggested where space is
allocated for the transaction in the binary log before actually
starting to write it. In addition to avoiding unnecessary locking of
the binary log, it also allows us to use the binary log to order the
transactions in-place. We will use this idea of reserving space in the
binary log to implement the binary log group commit.<p>
By re-structuring the procedure above slightly, we can ensure that
the transactions are written in the same order in both the InnoDB
transaction log and the binary log.<p>
There are two ways to re-structure the code: one simple and one more
complicated that potentially can render better performance. To
simplify the presentation, it is assumed that pre-allocation is
handled elsewhere, for example using <a
href="http://forge.mysql.com/worklog/task.php?id=4925" >Worklog
#4925</a>. In a real implementation, pre-allocation can either be
handled when a new binlog file is created, or when transaction data is
being written to the binary log.
<h2>The sequential write approach</h2>
<table class="figure-right">
<caption>Figure 1. Sequential binary log group commit</caption>
<tr>
<td>
<img style="cursor:pointer; cursor:hand;width: 235px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9jaxd6siz3tZpNCJ1Me9M6gyORNcPRSsAZsphI1b33XnPF7aze0ijuoNlLlQ2Me_d-ImI4AArdijQPaHrMbMKyr6Mxq3b0m4JXcl4z5l_-52Os3opqGYjRtpOx_f11P8TTg1O/s320/binlog-write-simple.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5465798204996562242" />
</td>
</tr>
</table>
In the sequential write approach, the transactions are still written
to the binary log in order and the code is just re-ordered to avoid
keeping mutexes when calling <code>fsync()</code>.
To describe the algorithm, three shared variables are introduced to
keep track of the status of replication:
<dl>
<dt><code>Next_Available</code>
<dd>This variable keeps track of where a new transaction can be written
<dt><code>Last_Committed</code>
<dd>This variable keeps track of the last committed transaction,
meaning that all transactions preceding this position are actually
on disc. This variable is not necessary in the real implementation,
but it is kept here to simplify the presentation of the algorithm.
<dt><code>Last_Complete</code>
<dd>This variable keeps track of the last complete transaction. All
transactions preceding this point are actually written to the binary
log, but are not necessarily flushed to disc yet.
</dl>
You can see an illustration of how the variables are used with the
binary log in Figure 1 where you can also see three threads each
waiting to write a transaction. All three variables are initially set
to the beginning of the binary log, and it is always true that <span class="math">
<code>Last_Committed</code> ≤
<code>Last_Complete</code> ≤ <code>Next_Available</code>
</span>.
The procedure can be described in the following steps:
<ol>
<li>Lock the binary log</li>
<li>Save value of <code>Next_Available</code> in a variable
<span class="math">Trans_Pos</span> and increase
<code>Next_Available</code> with the size of the transaction.</li>
<li>Prepare InnoDB:</li>
<ol>
<li>Write prepare record to log buffer (but do not
<code>fsync()</code> buffer here)</li>
<li>Release row locks</li>
</ol>
<li>Unlock binary log</li>
<li>Post prepare InnoDB:</li>
<ol>
<li><code>fsync()</code> log file to disk, which can now be done
using group commit since no mutex is held.</li>
</ol>
<li>Log transaction to binary log:</li>
<ol>
<li>Wait until <code>Last_Complete</code> =
<code>Trans_Pos</code>. (This can be implemented using a
condition variable and a mutex.)</li>
<li>Write transaction data to binary log using
<code>pwrite</code>. At this point, it is not really necessary to
use <code>pwrite</code> since the transaction data is simply
appended, but it will be used in the second algorithm, so we
introduce it here.</li>
<li>Update <code>Last_Complete</code> to
<code>Trans_Pos</code> + transaction size.<br/></li>
<li>Broadcast the new position to all waiting threads to wake
them up.</li>
<li>Call <code>fsync()</code> to persist binary log on disk. This
can now be group committed.</li>
</ol>
<li>Commit InnoDB:</li>
<ol>
<li>Write commit record to log buffer</li>
<li>Sync log buffer to disk, which currently can be group
committed.</li>
</ol>
</ol>
To implement group commit, it is sufficient to have a condition
variable and wait for that for a specified interval. Once the interval
has passed, the transaction data can call <code>fsync()</code>, after
which it broadcasts the fact that data has been flushed to disc to
other waiting threads so that they can skip this. Typically, the code
looks something along these lines (we ignore checking error codes here
to simplify the description):
<pre class="code">
pthread_mutex_lock(&binlog_lock);
while (Last_Committed < Last_Complete) {
  struct timeval now;
  struct timespec timeout;
  gettimeofday(&now, NULL);
  timeout.tv_sec  = now.tv_sec + (now.tv_usec + 1000) / 1000000;
  timeout.tv_nsec = ((now.tv_usec + 1000) % 1000000) * 1000; /* 1 msec */
  int error = pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    fsync(binlog_fd);
    Last_Committed = Last_Complete;
    pthread_cond_broadcast(&binlog_flush);
  }
}
pthread_mutex_unlock(&binlog_lock);
</pre>
There are a few observations regarding this approach:
<ul>
<li>Step 6a requires a condition variable and a mutex when waiting
for <code>Last_Complete</code> to reach
<code>Trans_Pos</code>. Since there is just a single condition
variable, it is necessary to broadcast a wakeup to <em>all</em>
waiting threads, which each will evaluate the condition just to find
a single thread that should continue, while the other threads go to
sleep again.<p>
This means that the condition will be checked <span
class="math">O(N<sup>2</sup>)</span> times to commit <span
class="math">N</span> transactions.
This is a waste of resources, especially if there are a lot of
threads waiting, and if we can avoid this, we can gain
performance.</li>
<li>Since the thread has a good position in the binary log where it
could write, it could just as well start writing instead of
waiting. It will not interfere with any other threads, regardless of
whether locks are kept or not.</li>
</ul>
These observations lead us to the second approach, that of writing
transaction data to the binary log in parallel.
<h2>A parallel write approach</h2>
<table class="figure-right">
<caption>Figure 2. Parallel binary log group commit</caption>
<tr>
<td>
<img style="cursor:pointer; cursor:hand;width: 235px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOmLGHx-hkW1opDRl1p_ZXDLEylycRwucbtz7bJttwsK48hZ2hPvUOKDptRNh01UveAs1hX5E52EJ93dkVGfFg37iPJY3aFwE9co0_1Ay27xfIHnzD483hYqPcGUw6tJCsZW7w/s320/binlog-write-parallel.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5465798641659622354" />
</td>
</tr>
</table>
In this approach, each session is allowed to write to the binary log
at the same time using <code>pwrite</code> since the space for the
transaction data has already been allocated when preparing the
engines. Figure 2 illustrates how the binary log is filled in (grey
areas) by multiple threads at the same time. Similar to the
sequential write approach, we still have the
<code>Last_Complete</code>, <code>Last_Committed</code>, and
<code>Next_Available</code> variables.<p>
Each thread does not have to wait for other threads before writing,
but it <em>does</em> have to wait for the other threads to
<em>commit</em>.
This is necessary since we required the order of commits in the InnoDB
log and the binary log to be the same. In reality, this does not pose
a problem since the I/O is buffered, hence the writes are done to
in-memory file buffers.<p>
The algorithms look quite similar to the sequential write approach,
but notice that in step 6, the transaction data is simply written to
the binary log using <code>pwrite</code>.
<ol>
<li>Lock the binary log</li>
<li>Save value of <code>Next_Available</code> in a local variable
<code>Trans_Pos</code> and increase
<code>Next_Available</code> with the size of the transaction.</li>
<li>Prepare InnoDB:</li>
<ol>
<li>Write prepare record to log buffer (but do not
<code>fsync()</code> buffer here)</li>
<li>Release row locks</li>
</ol>
<li>Unlock binary log</li>
<li>Post prepare InnoDB:</li>
<ol>
<li><code>fsync()</code> log file to disk, which can now be done
using group commit since no mutex is held.</li>
</ol>
<li>Log transaction to binary log:</li>
<ol>
<li>Write transaction data to binary log using
<code>pwrite</code>. There is no need to keep a lock to protect
the binary log here since all threads will write to different
positions.</li>
<li>Wait until <code>Last_Complete</code> =
<code>Trans_Pos</code>.</li>
<li>Update <code>Last_Complete</code> to <code>Trans_Pos</code> +
transaction size.<br/></li>
<li>Broadcast the new position to all waiting threads to wake
them up.</li>
<li>Call <code>fsync()</code> to persist binary log on disk. This
can now be group committed.</li>
</ol>
<li>Commit InnoDB:</li>
<ol>
<li>Write commit record to log</li>
<li>Sync log file to disk</li>
</ol>
</ol>
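The algorithm above can be modeled with a toy implementation, written here in Python for brevity (the class, its names, and the in-memory buffer are illustrative only; a real implementation writes to the binlog file with <code>pwrite</code>). Space is reserved under the lock, the data is written lock-free into the reserved slot, and <code>Last_Complete</code> is advanced strictly in reservation order.

```python
import threading

class Binlog:
    def __init__(self, size):
        self.buf = bytearray(size)      # stand-in for the binlog file
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.next_available = 0
        self.last_complete = 0

    def reserve(self, nbytes):
        """Steps 1, 2, and 4: allocate space under the binlog lock."""
        with self.lock:
            trans_pos = self.next_available
            self.next_available += nbytes
            return trans_pos

    def log_transaction(self, trans_pos, data):
        # Step 6a: write without holding any lock (the "pwrite").
        self.buf[trans_pos:trans_pos + len(data)] = data
        with self.cond:
            # Steps 6b-6d: wait for our turn, advance, and wake everyone.
            while self.last_complete != trans_pos:
                self.cond.wait()
            self.last_complete = trans_pos + len(data)
            self.cond.notify_all()

log = Binlog(12)
slots = [(log.reserve(4), data) for data in (b"aaaa", b"bbbb", b"cccc")]
# Start the writers in reverse order to show that commit order is
# still dictated by the reservation order.
threads = [threading.Thread(target=log.log_transaction, args=slot)
           for slot in reversed(slots)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```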
This new algorithm has some advantages, but there are a few things to note:
<ul>
<li>When a transaction is committed, it is guaranteed that
<code>Trans_Pos</code> ≥ <code>Last_Committed</code> for all
threads (recall that <code>Trans_Pos</code> is a thread-local
variable).</li>
<li>Writes are done in parallel, but waiting for the condition
in step 6b still requires a broadcast to wake up all waiting
threads, while only one will be allowed to proceed. This means that
we still have the <span class="math">O(N<sup>2</sup>)</span>
complexity of the sequential algorithm. However, for the parallel
algorithm it is possible to improve the performance significantly,
which we will demonstrate in the third part where we will discuss
optimizations to the algorithms.</li>
<li>Recovery in the sequential algorithm is comparably simple since
there are no partially written transactions. If you consider that a
crash can occur in the situation described in Figure 2, it is
necessary to devise a method for correctly recovering. This we will
discuss in the second part of these posts.</li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com8tag:blogger.com,1999:blog-23496029.post-17436719167373415052010-04-13T20:07:00.003+02:002010-04-13T20:45:25.431+02:00MySQL Conference Replication tutorial: Article and Demo SoftwareThe MySQL Conference and Expo started with Lars Thalmann and me doing the <a href="http://en.oreilly.com/mysql2010/user/proposal/status/12432">replication tutorial</a>. Unfortunately, we cannot at this time distribute the slides (please watch the <a href="http://en.oreilly.com/mysql2010/public/schedule/detail/12432">replication tutorial page at the conference site</a>), but there is a replication tutorial package for easy setup of servers to play around with—including some sample scripts—and a paper that explains how the package can be used and gives some example setups.
<ul>
<li>
The software package can be <a href="http://forge.mysql.com/w/images/8/85/Reptut-scripts.zip">downloaded from the forge</a> and requires Perl at least version 5.6.0 to execute.
</li>
<li>The article can also be <a href="http://www.kindahl.net/mats/ReplicationTutorialBooklet.pdf">downloaded from my site</a> (PDF).</li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com0tag:blogger.com,1999:blog-23496029.post-92076722794674697322010-03-05T14:10:00.002+01:002010-03-05T14:39:45.378+01:00Going to the O'Reilly MySQL Conference & Expo<p>As I've been doing the last couple of years, I will be going to the
<a href="http://en.oreilly.com/mysql2010/">O'Reilly MySQL Conference & Expo</a>. In addition to
the tutorial and the replication sessions that I will be holding
together with <a
href="http://en.oreilly.com/mysql2010/public/schedule/speaker/3109">Lars</a>,
I will be holding a session about the binary log together with <a
href="http://en.oreilly.com/mysql2010/public/schedule/speaker/162">Chuck</a>
from the Backup team which the Replication team normally works very
closely with.</p>
<p>This year, O'Reilly also have a <em>Friend of the Speaker</em>
discount of 25% that you can use when you register using the code
<code>mys10fsp</code>.</p>
The sessions that we are going to hold are listed below. Note that I
am using <a href="http://microformats.org">Microformats</a>, which will
allow you to easily extract and add the events to your calendar using,
for example, the <a
href="https://addons.mozilla.org/en-US/firefox/addon/4106">Operator</a>
plugin for Firefox.
<p>See you there!</p>
<dl>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/13476">Mysteries of the Binary Log</a>
<dd>
<abbr title="2010-04-14T10:50-07:00" class="dtstart">April 14th, 2010 10:50am</abbr> -
<abbr title="2010-04-14T11:50-07:00" class="dtend">11:50am</abbr>
Room: <span class="location">Ballroom F</span>
</div>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12451">New Replication Features</a>
<dd>
<abbr title="2010-04-13T14:00-07:00" class="dtstart">April 13th, 2010 2:00pm</abbr> -
<abbr title="2010-04-13T15:00-07:00" class="dtend">3:00pm</abbr>
Room: <span class="location">Ballroom A</span>
</div>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12444">Replication Tricks & Tips</a>
<dd>
<abbr title="2010-04-14T14:00-07:00" class="dtstart">April 14th, 2010 2:00pm</abbr> -
<abbr title="2010-04-14T15:00-07:00" class="dtend">3:00pm</abbr>
Room: <span class="location">Ballroom B</span>
</div>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12432">The Replication Tutorial</a>
<dd>
<abbr title="2010-04-12T08:30-07:00" class="dtstart">April 12th, 2010 8:30am</abbr> -
<abbr title="2010-04-12T12:00-07:00" class="dtend">12:00pm</abbr>
Room: <span class="location">Ballroom E</span>
</div>
</dl>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com0tag:blogger.com,1999:blog-23496029.post-44836689564824997382010-02-03T21:19:00.003+01:002010-02-03T21:37:24.008+01:00MySQL Replicant: Architecture<table id="class-design" class="figure" noborder>
<caption align="bottom">MySQL Replicant Library<br>Class Design</caption>
<tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT53S4pBAxDST4X96TdVub3YTzjUgqmKCk3SorL5Il8NKjOiNzKqgOjRVflwYMIDapDn99elGw1L68wPy5Jjm7QirrC5Z-3L3nTSiFeZdrgrhGABX0LqZKJ2QBDHzvlLEkXLb6/s1600-h/class-diagram.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 223px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT53S4pBAxDST4X96TdVub3YTzjUgqmKCk3SorL5Il8NKjOiNzKqgOjRVflwYMIDapDn99elGw1L68wPy5Jjm7QirrC5Z-3L3nTSiFeZdrgrhGABX0LqZKJ2QBDHzvlLEkXLb6/s320/class-diagram.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5434117168518305154" /></a>
</td></tr>
</table>
In the <a
href="/2009/12/mysql-replicant-library-for-controlling.html">previous
post</a> I described the first steps of a Python library for
controlling the replication of large installations. The intention of
the library is to provide a uniform interface to such installations
and that will allow procedures for handling various situations to be
written in a uniform language.<p>
For the library to be useful, it is necessary to support installations
that use different operating systems for the machines, as well as
different versions of the servers. Specifically, it is necessary to
allow some aspects of the system to vary.
<ul>
<li><p>Depending on the operating system, or even just how the server
is installed on the machine, the procedures for bringing the server
down and up will differ.</p></li>
<li><p>Configurations are managed in different ways depending on the
deployment, and there are various other tools to manage
configurations of large systems.</p>
<p>As part of the management of the topology, it is necessary to
change the configuration files, but this should play well with
other tools.</p>
<p>In either case, any specific method for configuration handling
should neither be required nor enforced.</p></li>
<li>In the example in the <a
href="/2009/12/mysql-replicant-library-for-controlling.html">previous
article</a>, the technique for cloning a server was demonstrated. In
this case the naive method of copying the database files was
used. For the general case, however, <em>some</em> backup method
will be used, but it depends on the requirements of the
deployment. In other words, it is necessary to parameterize the
backup method as well.</li>
<li><p>Each server in the system has a specific <em>role</em> to
fulfill. Some servers are final slaves whose only purpose is to
answer queries, at least one server is a master, and some servers
are relay servers.</p></li>
</ul>
<p>To allow the system to be parameterized on these aspects, a set of
abstract classes is introduced. In the figure you can see a UML
diagram describing the high-level architecture of the Replicant
library.</p>
In the figure, there are four abstract classes:
<dl>
<dt><code>Machine</code>
<dd>The responsibility of this class is to handle all issues that
are specific to the remote operating system, for example, to fetch
files or issue commands to start and stop the server.
<dt><code>Config</code>
<dd>The responsibility of this class is to maintain the
configuration of a server. To do this, it may need to parse
configuration files to be able to extract the specific section
containing the definition.
<dt><code>BackupMethod</code>
<dd>The responsibility of this class is to provide the primitives to
create a backup and restore a backup. In both cases, the class
supports taking a backup and potentially placing the backup image at
a different machine, and restoring it.
<dt><code>Role</code>
<dd>The responsibility of this class is to provide all the
information necessary to configure a server in a role. Since the
role not only entails pure configuration information, but can
also involve keeping certain tables and other database objects
available, this is modeled as a separate class.
</dl>
The central <code>Server</code> class relies on a <code>Machine</code>
instance and a <code>Config</code> instance to implement the interface
to the machine and to the configuration, respectively.<p>
<h2>Configuration Management</h2>
<p>The configuration of the server is made part of the Replicant library
since manipulating the server configuration is usually necessary when
changing roles of servers.</p>
<p>Depending on the deployment, other configuration managers such as
<a href="http://www.gnu.org/software/cfengine/">cfengine</a> or <a
href="http://reductivelabs.com/trac/puppet">puppet</a> are used to
administer the configuration of all servers, while others hand-edit
the configuration files (which is only feasible for small configurations,
since it would be a pain to administer larger deployments in this
way).</p>
<p>Long-term, there should be support for some safety measures when
working with server configurations, so implementing an interface for
handling server configurations in a safe transaction-like
manner—or maybe this should be called an RCU-style
manner—seems like a good idea. To support that, the following
methods to fetch and replace configurations are introduced.</p>
<dl>
<dt class="func">Server.fetch_config()
<dd>Returns a <code>Config</code> instance of the configuration for
the server.
<dt class="func">Server.replace_config(<var>config</var>)
<dd>Replace the configuration of the server with the modified
configuration instance <var>config</var>.
</dl>
<p>This will allow an implementation to keep version numbers around to
avoid conflicts, but is not required by the interface.</p>
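<p>As an illustration of how version numbers could support such a transaction-like (or RCU-style) replace, here is a minimal sketch. The <code>VersionedConfigStore</code> class, its dictionary-based configuration, and the <code>ConflictError</code> exception are hypothetical illustrations of the idea, not part of the library:</p>

```python
# Minimal sketch of RCU-style configuration replacement: a version
# number recorded at fetch time detects conflicting updates at
# replace time. All names here are hypothetical illustrations.

class ConflictError(Exception):
    """Raised when the configuration changed between fetch and replace."""

class VersionedConfigStore:
    def __init__(self):
        self._options = {}
        self._version = 0

    def fetch_config(self):
        # Hand out a copy together with the version it was based on.
        return dict(self._options), self._version

    def replace_config(self, options, based_on):
        # Refuse the update if someone replaced the config in between.
        if based_on != self._version:
            raise ConflictError("configuration was modified concurrently")
        self._options = dict(options)
        self._version += 1

store = VersionedConfigStore()
config, version = store.fetch_config()
config['log-bin'] = 'master-bin'
store.replace_config(config, version)      # succeeds: no concurrent change
stale, stale_version = store.fetch_config()
store.replace_config({'log-bin': 'other-bin'}, stale_version)
try:
    store.replace_config(stale, stale_version)  # now stale: rejected
except ConflictError:
    pass
```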
<p>Each <code>Config</code> instance can then be manipulated by using
the following methods:</p>
<dl>
<dt class="func">Config.get(<var>option</var>)
<dd>Get the value of <var>option</var> as a string.
<dt class="func">Config.set(<var>option</var>[, <var>value</var>])
<dd>Set the value of <var>option</var> to <var>value</var>. If no
<var>value</var> is supplied, <code>None</code> is used, which
denotes that the option is set but not given a specific string
value.
<dt class="func">Config.remove(<var>option</var>)
<dd>Remove the <var>option</var> from the configuration instance
entirely.
</dl>
So, for example, the <var>log-bin</var> option can be set in the
following manner:
<pre class="code">
config = server.fetch_config()
config.set('log-bin', 'master-bin')
server.replace_config(config)
</pre>
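<p>To give an idea of what the configuration-file parsing mentioned earlier could look like, here is a sketch of a <code>Config</code> stand-in that reads a my.cnf-style text and implements the three methods. The class name and the assumption that the server's options live under a <code>[mysqld]</code> section are mine, not the library's:</p>

```python
# Sketch of a Config stand-in: parse a my.cnf-style text, keep only
# the options from the section containing the server definition, and
# expose the get/set/remove interface described above.

class SimpleConfig:
    def __init__(self, text, section='mysqld'):
        self._options = {}
        current = None
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if line.startswith('[') and line.endswith(']'):
                current = line[1:-1]          # entering a new section
            elif current == section:
                # Options may be bare ("skip-networking") or key=value;
                # a bare option is stored with the value None.
                key, _, value = line.partition('=')
                self._options[key.strip()] = value.strip() or None

    def get(self, option):
        return self._options[option]

    def set(self, option, value=None):
        self._options[option] = value

    def remove(self, option):
        del self._options[option]

config = SimpleConfig("[client]\nport = 3306\n"
                      "[mysqld]\nserver-id = 1\nskip-networking\n")
config.set('log-bin', 'master-bin')
```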
<h2>Machines</h2>
<p>A MySQL server can run on many different machines and in many
setups. A server can run on Linux, Solaris, or Windows, and in each
case there can be multiple servers on a single machine.</p>
<p>For a Linux machine with a single server, one usually uses the
script <file>/etc/init.d/mysql</file> to start and stop the
server—at least on my Ubuntu—but if multiple servers are
used on a single machine, then <file>mysqld_multi</file> should be
used instead.</p>
<p>For Windows and Solaris, the procedures for starting and stopping
servers are entirely different. Windows starts and stops the servers
using <code>net start MySQL</code> and <code>net stop MySQL</code>,
while Solaris uses the <em><a
href="http://docs.sun.com/app/docs/doc/819-2240/svcadm-1m?a=view">svcadm(1M)</a></em> command.</p>
<p>To parameterize the system over the various ways it can be
installed, the concept of a <em>Machine</em> is introduced (I actually
had problems figuring out a name for this, but this was suggested to
me and seems to be good enough).</p>
<p>The responsibility of the <code>Machine</code> class is to provide
an interface to access the installed server together with installation
information such as the location of configuration files.</p>
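<p>As a sketch of this parameterization, here are hypothetical <code>Machine</code> subclasses that answer only the question of which command starts or stops the server on each platform. The method names and the Solaris service name are assumptions; the command lists mirror the procedures described above:</p>

```python
# Sketch of platform-specific Machine classes. Each subclass knows how
# the server is started and stopped on its platform; the rest of the
# library can stay platform-agnostic.

class Machine:
    """Interface to the way a server is installed on a host."""
    def start_command(self):
        raise NotImplementedError
    def stop_command(self):
        raise NotImplementedError

class Linux(Machine):
    # Single-server installation managed through an init script.
    def start_command(self):
        return ["/etc/init.d/mysql", "start"]
    def stop_command(self):
        return ["/etc/init.d/mysql", "stop"]

class Windows(Machine):
    # Windows controls the server as a service via "net".
    def start_command(self):
        return ["net", "start", "MySQL"]
    def stop_command(self):
        return ["net", "stop", "MySQL"]

class Solaris(Machine):
    # Solaris manages services through svcadm(1M).
    def start_command(self):
        return ["svcadm", "enable", "mysql"]
    def stop_command(self):
        return ["svcadm", "disable", "mysql"]
```

A <code>Server.start()</code> could then simply run <code>self.machine.start_command()</code> over the SSH connection, without knowing which platform it is talking to.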
<h2>BackupMethod</h2>
<p>One of the more important techniques when managing a set of servers is
the ability to clone a slave or a master to create new slaves. Cloning
involves taking a backup of a server and then restoring the backup
image on the new slave. Since the techniques for taking backups vary
a lot and different techniques will be used in different situations,
parameterizing over the various backup methods is sensible.</p>
<dl>
<dt class="func">BackupMethod.backup_to(<var>server</var>, <var>url</var>)
<dd>This method will take a backup of <var>server</var> and store it
at the location indicated by <var>url</var>.
<dt class="func">BackupMethod.restore_from(<var>server</var>, <var>url</var>)
<dd>This method will restore the backup image indicated by
<var>url</var> into <var>server</var>.
</dl>
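<p>A naive backup method along these lines could archive the data directory with <code>tar</code> and move the image with <code>scp</code>. The sketch below only illustrates the interface; the <code>TarBackup</code> name, the data directory path, the temporary file location, and the injectable <code>run</code> parameter are assumptions, not the library's implementation:</p>

```python
# Sketch of a BackupMethod stand-in: archive the data directory on the
# server with tar, then move the image with scp. The `run` parameter
# is injectable so the class can be exercised without real scp calls.

from subprocess import call

class TarBackup:
    def __init__(self, datadir="/usr/var/mysql", run=call):
        self.datadir = datadir
        self.run = run   # local command runner

    def backup_to(self, server, url):
        # Create the archive on the server, then copy it to `url`
        # (assumed to be an scp-style host:path destination).
        server.ssh(["tar", "zcf", "/tmp/backup.tar.gz", self.datadir])
        self.run(["scp", server.host + ":/tmp/backup.tar.gz", url])

    def restore_from(self, server, url):
        # Copy the image to the server's host and unpack it in place.
        self.run(["scp", url, server.host + ":/tmp/backup.tar.gz"])
        server.ssh(["tar", "zxf", "/tmp/backup.tar.gz", "-C", "/"])
```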
<h2>Role</h2>
<p>In a deployment, each server is configured to play a specific
<em>role</em>. It can either be acting as a master, a slave, or even a
relay. To represent a role, a separate <code>Role</code> class is
introduced. Once a role is created, a server can be <em>imbued</em>
with it.</p>
<ul>
<li>Not every server has an assigned role.</li>
<li>Each server can have at most one role.</li>
<li>Each role can be assigned to multiple servers.</li>
</ul>
<p>Since a role may encompass much more than just setting some
configuration parameters, this more flexible approach was chosen.
When imbuing a server with a role, a piece of Python code is executed
to configure the server correctly.</p>
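<p>A sketch of what such a piece of Python code could look like when imbuing a server as a master; the <code>Master</code> class, the <code>imbue</code> method name, and the replication account details are hypothetical illustrations:</p>

```python
# Sketch of a Role subclass whose imbue() runs the code needed to turn
# a server into a master: it touches both the configuration and the
# database objects, which is why a plain option list is not enough.

class Role:
    def imbue(self, server):
        raise NotImplementedError

class Master(Role):
    def __init__(self, repl_user="mysql_replicant", repl_password="xyzzy"):
        self.repl_user = repl_user
        self.repl_password = repl_password

    def imbue(self, server):
        # Configuration side: enable the binary log.
        config = server.fetch_config()
        config.set('log-bin', 'master-bin')
        server.replace_config(config)
        # Database-object side: ensure a replication account exists.
        server.sql("GRANT REPLICATION SLAVE ON *.* TO '%s'@'%%' "
                   "IDENTIFIED BY '%s'"
                   % (self.repl_user, self.repl_password))
```

Imbuing then amounts to calling <code>Master().imbue(server)</code> for each server that should act as a master.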
<p>The use of roles in this case is just one of many design choices,
and with this approach there are two different ways
that roles can be used. I am slightly undecided between the two and would
like to hear comments on which one to use.</p>
<ol>
<li>Roles are applied only to the initial deployment and do not
play any part after the system has been deployed. Roles are imbued
into a server initially, and then the configuration of the server
can be changed by procedures that manipulate the deployment.</li>
<li>Roles exist in the entire deployment, and when a server changes
roles in the deployment, the <code>Role</code> instance will also
change. Every server is assigned a role in the system, which is
represented using a subclass of the <code>Role</code> class.</li>
</ol>
<p>The first is by far the easiest to implement, which is why I chose
it at this time. Since the roles are just containers for
configuration options and other items that need to be added, they are
easy to write. Since this is what is used in the library currently, it
is also what you see in the class design above.</p>
<p>The second approach seems better, but it has a number of
consequences:</p>
<ul>
<li>Every server has to have a role class associated with it; even
an "initial" role is required.</li>
<li>If the role changes, another role class will be associated with
the server. This forces the role class to not only be able to imbue a
server in a role, but to also <em>unimbue</em> the server from that
role.</li>
<li>It must not be possible to change the configuration of a server
directly; changes have to take the form of defining a role and then
moving the server to that role. Unimbuing the server from a role
becomes very hard if the configuration of the server is changed
outside the control of the role.</li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com3tag:blogger.com,1999:blog-23496029.post-8043952342303681212009-12-18T17:12:00.005+01:002009-12-28T10:11:14.828+01:00MySQL Replicant: a library for controlling replication deploymentsKeeping a MySQL installation up and running can be quite tricky at
times, especially when having many servers to manage and monitor. In
the replication tutorials at the annual MySQL Users' Conference, we
demonstrate how to set up replication appropriately and also how to
handle various issues that can arise.
Many of these procedures are routine: bring down the server, edit the
configuration file, bring the server up again, start a
<code>mysql</code> client and add a user, etc.<p>
It has always annoyed me that these procedures are perfect candidates
for automation, but that we do not have the necessary interfaces to
manipulate an entire installation of MySQL servers.<p>
If there were an interface with a relatively small set of
primitives&#8212;re-directing servers, bringing servers down, adding a
line to the configuration file, etc.&#8212;it would be possible to
create pre-canned procedures that can just be executed.<p>
To that end, I started writing a library that would provide such an
interface. Although I am more familiar with Perl, I picked Python for
this project, since it seems to be widely used by many
database administrators (it's just a feeling I have, I have no figures
to support it). Just to have a cool name for the library, we call it
<em>MySQL Replicant</em> and it is (of course) <a href="https://launchpad.net/mysql-replicant-python">available at
Launchpad</a>.<p>
So what do we want to achieve with a library like this?
Well... the goal is to provide a generic interface to complete
installations and thereby make administration of large installations
easy.<p>
Such an interface allows procedures to be described in an
executable format, namely as Python scripts.<p>
In
addition to making it easy to implement common tasks for experienced
database administrators, it also promotes sharing by providing a way
to write complete scripts for solving common problems. Having a pool
of such scripts makes it easier for newcomers to get up and
running.<p>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqFITAUw1_8GcoVLuCEsEKzhY47Z7B65XiHUwwYG-Tf-1F9Y1vb4EkQ92b3A1HqGTMvp_u7NlkItHYK_OuJMoiSGUDABCn3UdQIzflBDpSIfpjyr1VVGqilsMnGvmTRwzndj9a/s1600-h/topology-with-model.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 289px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqFITAUw1_8GcoVLuCEsEKzhY47Z7B65XiHUwwYG-Tf-1F9Y1vb4EkQ92b3A1HqGTMvp_u7NlkItHYK_OuJMoiSGUDABCn3UdQIzflBDpSIfpjyr1VVGqilsMnGvmTRwzndj9a/s320/topology-with-model.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5420211919263763954" /></a>
The basic idea is that you create a model of the installation on a
computer and then manipulate the model. When doing these
manipulations, the appropriate commands—either as SQL commands
to a running server or shell commands to the host where the server is
running—will then be sent to the servers in the installation to
configure them correctly. <p>
So, to take small example, how does the code for re-directing a bunch
of servers to a master look?
<pre class="code">
import mysqlrep, my_servers
for slave in my_servers.slaves:
mysqlrep.change_master(slave, my_servers.master)
</pre>
In this case, the installation is defined in a separate file and is
imported as a Python module. Right now, the interface for specifying
a topology is quite rough, but this is going to change.
<pre class="code">
from mysqlrep import Server, User, Linux
servers = [Server(server_id=1, host="server1.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux()),
Server(server_id=2, host="server2.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux()),
Server(server_id=3, host="server3.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux()),
Server(server_id=4, host="server4.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux())]
master = servers[0]
slaves = servers[1:]
</pre>
Here, the <code>Server</code> class represents a server, and to do
its job, it needs one MySQL account on the
server and one shell account on the host machine. Right now, it is
also necessary to specify the server ID, but the plan is to only
require the host, port, socket, SQL account name, and SSH account
information. The remaining information can then be fetched from the
configuration file of the server. Each server has a small set of
primitives on top of which everything else is built:
<dl>
<dt><code>Server.sql(<em>SQL command</em>)</code>
<dd>Execute the SQL command and return a result set.
<dt><code>Server.ssh(<em>command list</em>)</code>
<dd>Execute the command given by the command list and return an
iterator over the output.
<dt><code>Server.start()</code>
<dd>Start the server.
<dt><code>Server.stop()</code>
<dd>Stop the server.
</dl>
There is a small set of commands defined on top of these primitives
that can be used. Here is a list of just a few of them, but there are
some more in the library at Launchpad.
<dl>
<dt><code>change_master(slave, master, position=None)</code>
<dd>Change the master of <code>slave</code> to be
<code>master</code> and start replicating from
<code>position</code>.
<dt><code>fetch_master_pos(server)</code>
<dd>Fetch the master position of <code>server</code>, which is the
position where the last executed statement ends in the binary log.
<dt><code>fetch_slave_pos(server)</code>
<dd>Fetch the slave position of <code>server</code>, which is the
position where the last executed event ends.
<dt><code>flush_and_lock_database(server)</code>
<dd>Flush all tables on <code>server</code> and lock the database
for read.
<dt><code>unlock_database(server)</code>
<dd>Unlock a previously locked database.
</dl>
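<p>To show how such commands can be built on top of the <code>Server.sql</code> primitive, here is a rough sketch of two of them. The result-set shape (a list of dictionaries keyed by column name) and the <code>Position</code> tuple are assumptions for illustration, not the library's actual implementation:</p>

```python
# Sketch of higher-level commands built on the Server.sql primitive.
# Assumes sql() returns a list of dicts keyed by column name.

from collections import namedtuple

Position = namedtuple('Position', ['file', 'pos'])

def fetch_master_pos(server):
    # SHOW MASTER STATUS reports where the last executed statement
    # ends in the binary log.
    row = server.sql("SHOW MASTER STATUS")[0]
    return Position(row['File'], int(row['Position']))

def change_master(slave, master, position=None):
    # Re-direct the slave to the master and start replicating from
    # the given position (or the master's current position).
    if position is None:
        position = fetch_master_pos(master)
    slave.sql("CHANGE MASTER TO MASTER_HOST = '%s', MASTER_PORT = %d, "
              "MASTER_LOG_FILE = '%s', MASTER_LOG_POS = %d"
              % (master.host, master.port, position.file, position.pos))
    slave.sql("START SLAVE")
```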
Using these primitives, it is easy to clone a master by executing the
code below. For this example, I use the quite naive method of backing
up a database by creating an archive of the database files and copying
them to the new slave.
<pre class="code">
from mysqlrep import (flush_and_lock_database, unlock_database,
                      fetch_master_pos, change_master)
from subprocess import call
backup_name = "backup.tar.gz"
flush_and_lock_database(master)
position = fetch_master_pos(master)
master.ssh(["tar", "Pzcf", backup_name, "/usr/var/mysql"])
unlock_database(master)
call(["scp", master.host + ":" + backup_name, slave.host + ":."])
slave.stop()
slave.ssh(["tar", "Pzxf", backup_name, "/usr/var/mysql"])
slave.start()
change_master(slave, master, position)
</pre>
What do you think? Would this be a valuable project to pursue?
Here are some links related to this post:
<ul>
<li><a href="http://mituzas.lt/2009/08/17/dba-python-scripts/">Domas' post on "MySQL DBA, python edition"</a></li>
<li><a href="http://bitbucket.org/david415/mysql-cluster-tools/">Code from Spinn3r</a></li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com4