MySQL Musings: Various musings on mainly the development and technical side of MySQL, by Mats Kindahl.
<h1>MySQL Central: It's that time of the year</h1>
<p><em>2014-09-16</em></p>
<a href="http://www.oracle.com/openworld/mysql/index.html"><img class="figure-right" src="http://www.oracleimg.com/us/dm/h2fy11/oow-imspeaking-125x125-2225039.jpg"></a>
It's that time of the year again: yes, Oracle Open World is coming up, and with that I'll be travelling to San Francisco. New for this year is that we are part of the main Open World event and therefore have our own <em><a href="http://www.oracle.com/openworld/mysql/index.html" >MySQL Central</a></em>. There you will have the opportunity to meet many of the engineers behind MySQL, discuss technical problems you have, and learn how we look at the future of the MySQL ecosystem.<p>
This year, <a href="http://vnwrites.blogspot.se">Narayanan Venkateswaran</a> and I will be presenting two sessions:
<dl>
<div class="vevent">
<dt><strong><em><span class="summary">Elastic Scalability in MySQL Fabric with OpenStack</span></em> (<span class="dtstart" title="2014-10-02T13:15-08">Thursday, Oct 2, 1:15 PM</span>-<span class="dtend" title="2014-10-02T14:00-08">2:00 PM</span> in <span class="location">Moscone South, 252</span>)</strong></dt>
<dd><p class="description">In this session you will see how Fabric can use the new provisioning support to fetch servers from an <a href="https://www.openstack.org/">OpenStack</a> instance. The presentation will cover how to use the provisioning support to fetch servers from OpenStack Nova, OpenStack Trove, and also from Amazon AWS. You will also learn about the provisioning interface and how you can use it to create your own hardware registry support.</p>
</dd></div>
<div class="vevent">
<dt><strong><em><span class="summary">MySQL Fabric: High Availability at Different Levels</span></em> (<span class="dtstart" title="2014-10-01T14:00-08">Wednesday, Oct 1, 2:00 PM</span>-<span class="dtend" title="2014-10-01T14:45-08">2:45 PM</span> in <span class="location">Moscone South - 250</span>)</strong></dt>
<dd><p class="description">MySQL Fabric is a distributed system that requires coordination among different components to provide high availability: connectors, servers, and MySQL Fabric nodes must be orchestrated to create a solution resilient in the face of failures. Ensuring that each component alone is fault-tolerant does not guarantee that applications will continue working in the event of a failure. In this session, you will learn how all components in MySQL Fabric were designed to provide a high-availability solution and how they cooperate to achieve this goal. The presentation shows you how to create your own infrastructure to monitor MySQL servers and manage manual switchover or automatic failover operations together with MySQL Fabric.</p>
</dd></div>
</dl>
As in previous years, it's going to be fun to meet all the people in the MySQL community, both new faces and old, so I'm looking forward to seeing you all there.
<h1>MySQL Fabric: Musings on Release 1.4.3</h1>
<p><em>2014-05-27</em></p>
As you might have noticed in the <a href="http://www.oracle.com/us/corporate/press/2208808">press release</a>, we just released MySQL Utilities 1.4.3, containing MySQL Fabric, as a General Availability (GA) release. This concludes the first chapter of the MySQL Fabric story.<p>
It all started with the idea that it should be as easy to set up and manage a distributed deployment of MySQL servers as it is to manage the MySQL servers themselves. We also noted that the features of most interest were sharding and high-availability. Since we recognized that every user has different needs and wants to customize the solution, we set out to create a <em>framework</em> that would support sharding and high-availability, but also other solutions.<p>
With the release of 1.4.3, we have a range of features that are now available to the community, and all under an open source license and wrapped in an easy-to-use package:
<ul>
<li>High-availability support using built-in slave promotion in a master-slave configuration.</li>
<li>A framework with an execution machinery, monitoring, and interfaces to support management of large server farms.</li>
<li>Sharding using hash and range sharding. Range sharding is currently limited to integers, but hash sharding supports anything that looks like a string.</li>
<li>Shard management support to move and split shards.</li>
<li>Support for failure detectors, both built-in and custom ones.</li>
<li>Connectors with built-in load balancing and fail-over in the event of a master failure.</li>
</ul>
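As a rough illustration of the two sharding schemes above, range sharding compares an integer key against shard boundaries, while hash sharding maps any string-like key to a shard. This is only a sketch of the idea, not MySQL Fabric's actual implementation:

```python
import hashlib

def range_shard(key, boundaries):
    """Pick a shard for an integer key, given the sorted lower bound
    of each shard; e.g. [0, 1000, 2000] defines three shards."""
    shard = 0
    for i, lower in enumerate(boundaries):
        if key >= lower:
            shard = i
    return shard

def hash_shard(key, num_shards):
    """Pick a shard for any string-like key by hashing it. MD5 is used
    here only to get a stable, well-spread integer, not for security."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Integer keys route by range; string keys route by hash.
assert range_shard(1500, [0, 1000, 2000]) == 1
assert 0 <= hash_shard("user-4711", 4) < 4
```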
<h2>Beyond MySQL Fabric 1.4.3</h2>
As the MySQL Fabric story develops, we have a number of challenges ahead.<p>
<strong>Loss-less Fail-over. </strong>MySQL 5.7 has extended the support for semi-synchronous replication so that a transaction that has not been replicated to a slave server will not be committed. With this support, we can have truly loss-less fail-over: you cannot lose a transaction if a single server fails.<p>
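For reference, this behaviour in MySQL 5.7 is controlled by semi-synchronous replication settings along these lines on the master. This is a sketch; the semi-sync plugin must be installed for these variables to exist, and you should check the manual for your exact version:

```ini
# my.cnf fragment for the master; requires the rpl_semi_sync_master
# plugin to be installed.
rpl_semi_sync_master_enabled = 1
# AFTER_SYNC: do not commit until a slave has acknowledged receipt
rpl_semi_sync_master_wait_point = AFTER_SYNC
# fall back to asynchronous replication after 10 seconds without an ack
rpl_semi_sync_master_timeout = 10000
```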
<strong>More Fabric-aware connectors. </strong>We currently have support for Connector/J, Connector/PHP, and Connector/Python, but one common request is a Fabric-aware C API. This would serve both applications developed in C/C++ and connectors based on the MySQL C API, such as the Perl and Ruby connectors.<p>
<strong>Multi-Node Fabric Instance. </strong>Many have pointed out that the Fabric node is a single point of failure. It is indeed a single node, but if the Fabric node goes down, the system does not stop working. Since the connectors cache the data, they can "run on the cache" for the time it takes to bring the Fabric node up again. Procedures being executed will stop, but once the Fabric node is on-line again, execution will resume from where it left off. To ensure that the meta-data (the information about the servers in the farm) is not lost in the event of a machine failure, <a href="http://www.mysql.com/products/cluster">MySQL Cluster</a> can be used as the storage engine, which will ensure that your meta-data is safe.<p>
There are, however, a few advantages in having support for multiple Fabric nodes:
<ul>
<li>The most obvious advantage is that <strong>execution can fail-over to another node</strong> and there will be no interruption in the execution of procedures. If the fail-over is built-in, you avoid the need for external clusterware to manage several Fabric nodes.</li>
<li>If you have several Fabric nodes available to deliver data, you <strong>improve responsiveness to bursts in meta-data requests</strong>. This can happen if you have a large bunch of connectors brought on-line at the same time.</li>
<li>If you have multiple data centers, having a local version of the data to serve the applications deployed in the same center <strong>improve locality of data</strong> and avoid an unnecessary round-trip over WAN to fetch some meta-data.</li>
<li>With several nodes to execute management procedures, you can <strong>improve scaling</strong> by executing several management procedures in parallel. This would require some mechanism to ensure that procedures do not step on each other.</li>
</ul>
<strong>Location Awareness. </strong>In deployments spread over several data centers, the location of all the components suddenly becomes important. There is no reason for a connector to be directed to a remote server when a local one suffices, but that requires some sort of location awareness in the model, allowing the location of servers (or other components) to be given.<p>
Extending the model by adding data centers is not enough, though. The location of components <em>within</em> a data center might also be important. For example, if a connector is located in a particular rack in the data center, going to a different rack to fetch data might be undesirable. For this reason, the location awareness needs to be hierarchical and support several levels, e.g., continent, city, data center, hall, rack, etc.<p>
<strong>Multi-Shard Queries. </strong>Sharding can improve performance significantly since it splits the data horizontally across several machines, and each query therefore goes directly to the right shard of the data. In some cases, however, you also need to send queries to multiple shards. There are a few reasons for this:
<ul>
<li>You do not have the shard key available, so you want to search all servers for some object of interest. This of course affects performance, but in some cases there are few alternatives. Consider, for example, searching for a person given a name and an address when the database is sharded on the SSN.</li>
<li>You want to generate a report of several items in the database, for example, find all customers above 50 that have more than 2 cars.</li>
<li>You want a summary of some statistic over the database, for example, generate a histogram over the age of all your customers.</li>
</ul>
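In all three cases, a multi-shard query is essentially a scatter-gather operation: send the same query to every shard and merge the partial results. A minimal sketch of the pattern, with the per-shard execution stubbed out (the names here are invented, not a real Fabric API):

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, run_query, query):
    """Run the same query on every shard in parallel and concatenate
    the per-shard row lists into one result, in shard order."""
    with ThreadPoolExecutor(max_workers=max(len(shards), 1)) as pool:
        results = pool.map(lambda s: run_query(s, query), shards)
    rows = []
    for partial in results:
        rows.extend(partial)
    return rows

# Stand-in for real per-shard execution: each "shard" is just a dict.
fake_shards = [{"rows": [1, 2]}, {"rows": [3]}, {"rows": []}]
merged = scatter_gather(fake_shards, lambda s, q: s["rows"], "SELECT ...")
assert merged == [1, 2, 3]
```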
<strong>Session Consistency Guarantees. </strong>As <a href="http://alfranio-distributed.blogspot.pt/2014/05/mysql-fabric-server-properties-scaling.html" >Alfranio points out</a>, when you use multiple servers in your farm and transactions are sent to different servers at different times, it may well happen that you write one transaction to the master of a group and then try to read the data back from the same group. If the write has not yet reached the server you read from, you might get an incorrect result. In some cases this is fine, but in other cases you want certain guarantees on your session: for example, that anything you write is visible to reads in later transactions, or that successive reads always see progressively newer data (called "read monotonicity"), or other guarantees on the result sets you get back from the distributed database. This might require connectors to wait for transactions to reach slaves before reading, but it should be transparent to the application.<p>
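One way a connector could provide read-your-writes transparently is to remember an identifier for the session's last write (for example a GTID) and wait for the slave to catch up before reading. The sketch below is purely illustrative: the class and its interface are invented, and only the mention of MySQL's GTID wait function (<code>WAIT_FOR_EXECUTED_GTID_SET</code>, available in MySQL 5.7) refers to a real feature:

```python
class Session:
    """Tracks the last write's GTID so reads can wait for it on a slave.

    wait_for_gtid is injected; against a real slave it could run
    SELECT WAIT_FOR_EXECUTED_GTID_SET(gtid, timeout).
    """
    def __init__(self, wait_for_gtid):
        self._wait = wait_for_gtid
        self._last_gtid = None

    def record_write(self, gtid):
        self._last_gtid = gtid

    def consistent_read(self, slave, run_query, query):
        if self._last_gtid is not None:
            self._wait(slave, self._last_gtid)  # block until slave caught up
        return run_query(slave, query)

# Toy demonstration: the wait hook just records that it was called.
calls = []
session = Session(lambda slave, gtid: calls.append((slave, gtid)))
session.record_write("uuid:42")
session.consistent_read("slave-1", lambda s, q: "row", "SELECT ...")
assert calls == [("slave-1", "uuid:42")]
```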
This is just a small set of the possibilities for the future, so it is really going to be interesting to see how the MySQL Fabric story develops.
<ul>
<li>You can download MySQL Utilities (which includes MySQL Fabric) from <a href="http://dev.mysql.com/downloads/tools/utilities">http://dev.mysql.com/downloads/tools/utilities</a></li>
<li>You can read MySQL Utilities documentation at <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html">http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html</a></li>
<li>You can report bugs or request features on <a href="http://bugs.mysql.com">http://bugs.mysql.com</a></li>
<li>MySQL Forum <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a></li>
</ul>
<h1>MySQL Fabric: Tales and Tails from Percona Live</h1>
<p><em>2014-04-29</em></p>
Going to Percona Live and presenting MySQL Fabric gave me the opportunity to meet a lot of people and get a lot of good feedback. I talked to developers from many different companies, and their input will affect the priorities we make, so to everyone I spoke to: a big "Thank you!" for the interesting discussions we had. Your feedback is very valuable.
It was very interesting to read the comments on <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/fabric.html">MySQL Fabric</a> on the <a href="http://www.mysqlperformanceblog.com/2014/04/25/managing-farms-of-mysql-servers-with-mysql-fabric/">MySQL Performance Blog</a>. The article discusses the <a href="http://dev.mysql.com/downloads/tools/utilities/">current version of MySQL Fabric</a> distributed with <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/">MySQL Utilities</a> and gives some brief points on features of <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/fabric.html">MySQL Fabric</a>. I think it could be good to give some context to the points they raise, both to elaborate on what they mean in practice and to give some background on how we were thinking around these points.
<h1>The Art of Framing the Fabric</h1>
It was a deliberate decision to make MySQL Fabric extensible, so it is not surprising that it has the feel of a framework. By making MySQL Fabric extensible, we allow the community and users to explore ideas or add user-specific support.<p>
In the MySQL Team at Oracle we are strong believers in the open source model and are working hard to keep it that way. There are many reasons why we believe in this model, but <em>one</em> of them is that we do not believe that one size fits all. For any user, there are always minor variations or tweaks required by that user's specific needs, so the ability to adapt the solution is very important. Without MySQL being open source, this would not be possible. As you can see from <a href="http://webscalesql.org">WebScaleSQL</a>, this is not just a theoretical exercise; this is how companies really use MySQL.<p>
From the start, we therefore focused on building a framework and created the sharding and high-availability as plugins; granted, they are very important plugins, but they are nevertheless plugins. This took a little more effort, and a little more thinking, but by doing it this way we can ensure that the system is truly extensible for everybody.
<h1>Hey! I've got a server in my farm!</h1>
As noted, many of the issues related to high-availability and sharding require server-side support to be really solid. This is also something we recognized quite early; the alternative would be to place the logic in the connectors or the Fabric node. We recognized that the right place to solve this is in the server, not in the connector layer, since the latter puts a lot of complexity in the wrong place. Even if it were possible to handle everything in the connector, there is still a chance that something goes wrong if the constraints are not enforced in the server. This could be because of bugs, mistakes in the administration of the server, or any number of other reasons, so to build a solid solution, constraints on the data should be enforced by the servers and not by the connectors or a proxy.<p>
One example given is that there is no way to check that a row ends up in the right shard, which is very true. A generic solution would be to add <code>CHECK</code> constraints on the server, but unfortunately, that is a very big change to the server code-base. Adding triggers to the tables on the server is probably a good short-term solution, but that requires managing and deploying extra code on all servers, which in turn is an additional burden on managing the servers, something we would like to avoid (the more "special" things you have to do with the servers, the higher the risk of something going wrong).
<h1>On the proximity of things...</h1>
One of the central components of MySQL Fabric is the <em>high-availability group</em> (or just <em>group</em>, when it is clear from the context), discussed in <a href="http://mysqlmusings.blogspot.se/2013/10/mysql-fabric-high-availability-groups.html">an earlier post</a>. The central idea is that each group manages the same piece of data, and MySQL Fabric is designed to handle and coordinate multiple groups into a federation of databases. Being able to manage multiple groups is critical to creating a sharded system. One thing quite often raised is that it should be possible for a server to belong to multiple groups, but I think this comes from a misunderstanding of what a group represents. It is not a "replica set", which gives information about the topology, that is, how replication is set up, nor does it say anything about how the group is deployed. It is perfectly OK to have members of the group in different data centers (for geographical redundancy), and it is perfectly OK to have replication <em>between</em> groups to support, for example, functional partitioning. If a server belonged to two different groups, it would mean that it manages two different sets of data at the same time.<p>
The fact that group members can be located in different data centers raises another important aspect, something often mentioned at Percona Live: managing the <em>proximity</em> of components in the system. There is some support for this in Hadoop, where you have rack-awareness, but we need a slightly more flexible model. Imagine that you have a group set up with two servers in different data centers, with scale-out slaves attached locally and connectors deployed in both data centers. When reading data you do <em>not</em> want to go to the other data center to execute the transaction; it should always be done locally. So, is a simple grouping of the components sufficient? No, because there can be multiple levels of proximity, for example, data centers, continents, and even rooms or racks within a data center. You can also have different facets that you want to model, such as latency, throughput, or other properties that are interesting for particular uses. For that reason, whatever proximity model we adopt needs to support a hierarchy and a flexible cost model where you can express different aspects. Given that this problem was raised several times at Percona Live and also by others, it is likely to be something we need to prioritize.<p>
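To make the idea of hierarchical proximity concrete, here is a small sketch (invented names, not part of MySQL Fabric): each component's location is modeled as a path of levels, and a connector prefers the server whose location shares the longest prefix with its own.

```python
def common_prefix_len(a, b):
    """Number of leading location levels two paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def closest(connector_loc, servers):
    """Pick the server whose location shares the longest prefix with
    the connector's location (ties broken arbitrarily)."""
    return max(servers, key=lambda s: common_prefix_len(connector_loc, s["loc"]))

connector = ("eu", "stockholm", "dc1", "rack7")
servers = [
    {"name": "local",  "loc": ("eu", "stockholm", "dc1", "rack2")},
    {"name": "remote", "loc": ("us", "sf", "dc3", "rack1")},
]
# Same continent, city, and data center beats a match on nothing.
assert closest(connector, servers)["name"] == "local"
```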
<h1>The crux of the problem</h1>
As most of you have already noted, there is a single Fabric node running that everybody talks to. Isn't this a single point of failure? It is indeed, but there is more to the story than just this. A single point of failure is a problem because if it goes down, so does the system... but in this case, the system doesn't really go down; it keeps working most of the time.<p>
The Fabric node does a lot of things: it keeps track of the status of all the components of the farm, executes procedures to handle fail-over, and delivers information about the farm on request. However, the connectors are the ones that route transactions to the correct place, and to avoid having to ask the Fabric node for information each time, the connectors maintain caches. This means that in the event of a Fabric node failure, connectors might not even notice that it is gone unless they have to re-fill their caches, and if you restart the Fabric node, it will be able to serve the information again.<p>
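The "run on the cache" behaviour can be pictured as a lookup that falls back to the last known answer when the Fabric node is unreachable. A simplified sketch with an invented interface; the real connectors differ in detail:

```python
class FabricCache:
    """Caches Fabric lookups; if a refresh fails because the Fabric
    node is down, keep serving the last known answer."""
    def __init__(self, fetch):
        self._fetch = fetch  # e.g. an RPC call to the Fabric node
        self._cache = {}

    def lookup(self, key):
        try:
            self._cache[key] = self._fetch(key)
        except ConnectionError:
            # Fabric node unreachable: serve stale data if we have any.
            if key not in self._cache:
                raise
        return self._cache[key]

def fabric_down(key):
    raise ConnectionError("fabric node down")

cache = FabricCache(lambda key: {"master": "server-1"})
assert cache.lookup("shard-1") == {"master": "server-1"}
cache._fetch = fabric_down  # simulate the Fabric node going down
assert cache.lookup("shard-1") == {"master": "server-1"}  # stale but usable
```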
Another thing that stops when the Fabric node goes down is fail-over: no new fail-overs can be done, and ongoing procedures are stopped in their tracks, which could potentially leave the farm in an unknown state. However, the state of execution of any ongoing procedure is stored in the backing store, so when you bring up the Fabric node again, it will restore the procedures from the backing store and continue executing them. This feature alone does not help against a complete loss of the machine where the Fabric node and the backing store reside, but since MySQL Fabric does not rely on specific storage engine features (any transactional engine will do), using MySQL Cluster as the storage engine makes it possible to ensure safe-keeping of the state.<p>
There are still good reasons to support multi-node Fabric instances:
<ul>
<li>If one Fabric node goes down, it should automatically fail over to another and continue execution. This will prevent any downtime in handling executions.</li>
<li>Detecting failure and bringing up a secondary Fabric node can become very complicated in the case of network partitions, since it requires handling split-brain scenarios reliably. It is better to have this built into MySQL Fabric, since that makes deployment and management significantly simpler.</li>
<li>Management of a farm does not put any significant pressure on the database back-end, but having a single Fabric node can be a bottleneck. In this case, it would be good to be able to execute multiple independent procedures on different Fabric nodes and coordinate the updates.</li>
<li>If a lot of connectors are required to fill their caches at the same time, we have a risk of a thundering herd. Having a set of Fabric nodes for read scale-out can then be beneficial.</li>
<li>If a group is deployed in two very remote data centers, it is desirable to have a local Fabric node for read-only purposes instead of having to go to the other data center.</li>
</ul>
<h1>More Fabric-aware Connectors</h1>
Currently we support connectors for Python, Java, and PHP, but one point that pops up quite often (both at Percona Live and elsewhere) is the lack of a Fabric-aware C connector. The MySQL C API is the basis for implementing both the Perl Database Interface MySQL driver <code>DBD::mysql</code> and the Ruby connector, and a Fabric-aware C connector is also desirable in itself for applications using the C or C++ connectors. All I can say at this point is that we are aware of the situation and know that it is something desired and important.
<h1>Interesting links</h1>
<ul>
<li><a href="https://blogs.oracle.com/mysqltesting/entry/mysql_fabric_setup_using_ndb">MySQL Fabric Setup using ndb Cluster</a></li>
</ul>
<h1>MySQL Fabric 1.4.2 Released</h1>
<p><em>2014-04-01</em></p>
As you saw in the <a href="http://www.oracle.com/us/corporate/press/2180737">press release</a>, MySQL Fabric 1.4.2 is now released! If you're interested in learning more about MySQL Fabric, there is a session <span class="vevent"><time class="dtstart" datetime="2014-04-03T11:10-08:00">April 3, 2014 11:10</time>&ndash;<time class="dtend" datetime="2014-04-03T12:00-08:00">12pm</time> titled <a class="url summary" href="https://www.percona.com/live/mysql-conference-2014/sessions/sharding-and-scale-out-using-mysql-fabric">Sharding and Scale-out using MySQL Fabric</a> in <span class="location">Ballroom G</span></span>.
MySQL Fabric is a relatively new project in the MySQL ecosystem, and it focuses on building a framework for working with large deployments of MySQL Servers. The architecture of MySQL Fabric allows extensions to be added, and the first two extensions we added were support for high-availability using <i>High-Availability groups</i> (<i>HA groups</i>) and sharding to manage very large databases. The first version of sharding has hash and range sharding implemented, as well as procedures for moving and splitting shards.<br />
A critical part of working with a collection of servers is the ability to route transactions to the correct servers, and for efficiency reasons we decided quite early to put this routing logic into the connectors. This avoids an extra network hop and hence improves performance by reducing latency, but it does require that the connectors contain routing logic, caches, and support for fetching data from MySQL Fabric. Putting the routing logic into the connector also makes it easy to extend the API with new support that applications may require.<br />
MySQL Fabric 1.4.2 is distributed as part of MySQL Utilities 1.4.2. To avoid confusion, we have changed the version numbering to match the version of MySQL Utilities it is distributed in.
<br />
<ul>
<li>You can download MySQL Utilities 1.4.2 from <a href="http://dev.mysql.com/downloads/tools/utilities">http://dev.mysql.com/downloads/tools/utilities</a></li>
<li>You can read MySQL Utilities 1.4.2 documentation at <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html">http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html</a></li>
</ul>
We have only done a few public releases so far (plus a few internal ones), but a brief history of our releases is:
<br />
<ul>
<li><b>MySQL Fabric 1.4.0</b>
<ul>
<li>First public release</li>
<li>High-Availability groups for modeling farms</li>
<li>Event-driven Executor for execution of management procedures</li>
<li>Simple failure detector with fail-over procedures</li>
<li>Hash and Range sharding allowing management of large databases</li>
<li>Shard move and shard split to support management of a sharded database</li>
<li>Connector interfaces to support federated database systems</li>
<li>Fabric-aware Connector/Python (labs)</li>
<li>Fabric-aware Connector/J (labs)</li>
<li>Fabric-aware Connector/PHP (labs)</li>
</ul>
</li>
<li><b>MySQL Fabric 1.4.1</b>
<ul>
<li>More solid scale-out support in connectors and MySQL Fabric</li>
<li>Improvements to the Executor to avoid stalling reads</li>
<li>Connector/Python 1.2.0 containing:
<ul>
<li>Range and Hash sharding</li>
<li>Load-balancing support</li>
</ul>
</li>
<li>Labs release of Connector/J with Fabric support</li>
</ul>
</li>
<li><b>MySQL Fabric 1.4.2</b>
<ul>
<li>Credentials in MySQL Fabric</li>
<li>External failure reporting interfaces supporting external failure detectors</li>
<li>Support for unreliable failure detectors in MySQL Fabric</li>
<li>Credentials support in Connector/Python</li>
<li>Connector/Python 1.2.1 containing:
<ul>
<li>Failure reporting</li>
<li>Credentials support</li>
</ul>
</li>
<li>Connector/J 5.1.30 containing Fabric support</li>
</ul>
</li>
</ul>
<br />
<h2>
Do you want to participate?</h2>
There is a lot you can do if you want to help improve MySQL Fabric.
<br />
<ul>
<li>If you find bugs or want specific feature, please report a bug at <a href="http://bugs.mysql.com/">http://bugs.mysql.com</a></li>
<li>MySQL Forum <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a></li>
</ul>
<h2>
Blogs about MySQL Fabric</h2>
<ul>
<li><a href="http://alfranio-distributed.blogspot.com/">Alfranio Correia blogs on High-Availability and MySQL Fabric</a>
</li>
<li><a href="http://vnwrites.blogspot.com/">Narayanan Venkateswaran blogs on sharding and MySQL Fabric</a>
</li>
<li><a href="http://geert.vanderkelen.org/">Geert Vanderkelen blogs on Connector/Python and MySQL Fabric</a>
</li>
<li><a href="http://schlueters.de/blog/">Johannes Schlüter blogs on Connector/PHP and MySQL Fabric</a>
</li>
<li><a href="http://blog.ulf-wendel.de/">Ulf Wendel blogs about Connector/PHP and MySQL Fabric</a></li>
</ul>
<h1>MySQL Fabric: High Availability Groups</h1>
<p><em>2013-10-21</em></p>
As you might have noticed, we have released a framework for managing farms (or grids, as <a href="http://swanhart.livejournal.com/">Justin</a> suggested) of MySQL servers called <i>MySQL Fabric</i>. MySQL Fabric is focused on being easy to use and extensible, and two extensions are currently part of the framework: one to manage high-availability and one to implement sharding.<p>
<table class="figure-right"><caption>High-Availability Group</caption> <tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA-y1zugc3JTn08COy_c3jTA8H2fyqmNrNDeTWq4SQJviiMtpCj9iCBbNpWgjpyiooq-N5IrQ_BkXKa_xQau_wlaORZrE42iZSrni2MIzp8gshaMXwqfhBdS2J0-cGVJx-kdWK/s1600/ha_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
<h2>
High-Availability Groups</h2>
One of the central concepts used to construct a farm is the <i>high-availability group</i> (or just <i>group</i> when there is no risk of confusion), introduced by the high-availability extension. As mentioned in the <a href="http://mysqlmusings.blogspot.se/2013/09/brief-introduction-to-mysql-fabric.html">previous post</a>, the group concept does not really represent anything new but is rather a formalization of how we think about and work with the structure of the farm. The key to supporting high-availability is <i>redundancy</i> in the system: if one component fails, another should be ready to pick up the job of the failing component. Hardening the systems (by using hardware less prone to fail or hardware with built-in redundancy) can reduce the chance of a component failing, but not completely eliminate it. Even a hardened system is susceptible to failure in a power outage or an earthquake. With this in mind, we introduced the group concept for managing pieces of data in our farm: each group consists of several machines that are responsible for managing the same piece of data.<p>
The concept of a group is an abstraction to model the basic idea that we're after, but it does not say anything about how it is implemented. This is intentional: it should be concrete enough to support all the operations we need, but abstract enough not to restrict how it is implemented. This is important because connectors (or any other "outside" observer) that work with groups should not have to be updated whenever new implementations are added. For example, it should not make a difference to a connector whether the group is implemented using a traditional master-slave setup, a MySQL Cluster, or replicated storage such as DRBD.<p>
<h2>
Server properties in groups</h2>
There are a few key properties that we assume for groups:
<ul>
<li>A server belongs to (at most) one group.</li>
<li>At any time, each server in the group has a designated <i>status</i>.</li>
<li>At any time, each server has a <i>mode</i> indicating whether it accepts reads, writes, both, or neither.</li>
<li>Each server also has a <i>weight</i>, which is the relative power of the server and is used to balance the load.</li>
</ul>
Note that these properties might change over time, depending on events that happen. For handling load-balancing and high-availability, the properties <i>Status</i>, <i>Mode</i>, and <i>Weight</i> were introduced. The mode and weight properties are used by a connector when deciding where to send a transaction, while the status property is used by Fabric to keep track of the state of the server. Let's take a closer look at the properties.<p>
<table class="figure-right"><caption>Figure 1. Server Status</caption> <tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFHcVGO8eLAQtAekHx7zfOVoBJ-XZbzVmG-b_0459JVfc_8-ogIjyFLrNz3ZKfvD_pO-zjlKERlSOog4F7rxvI5ngq1r_uDwREk6vNVDEkZVv5My_aMRREHQqI9xZOdYkDpi28/s320/server-states.png" height="396" type="image/png" width="492"></object></td></tr>
</tbody></table>
<b>Server Status (or Role).</b> The status of the server provides information about what the server is currently doing in the group. It is Fabric's view of the server and changes as time passes and Fabric notices changes. A <i>primary</i> server accepts both a write and a read load, and sending high-priority read transactions there means that they get current data. A <i>secondary</i> server can handle reads but, in a master-slave configuration, should not accept writes since that would lead to a <i>split-brain</i> situation. Secondary servers are waiting to pick up the job of the primary if it fails. <i>Spare</i> servers accept neither reads nor writes, but are up and running and can therefore change status in the group to replace other servers in the event of failures. In addition, spare servers can be used to handle reads.<p>
In Figure 1 you can see an example of how servers could change status, but note that at this time, we do not track all states. For example, we are considering how to handle the provisioning of new servers in a flexible and extensible way, but more about that in a separate post.<p>
<b>Server Mode.</b> The mode of the server indicates whether it can be read or written and tells the connector how it should send queries. For now, there are only three modes: <i>Offline</i>, <i>Read-only</i>, and <i>Read-Write</i>. <i>Offline</i> servers cannot be read from or written to, and usually do not accept connections. <i>Read-only</i> servers can only be read from; write transactions should not be sent to them. <i>Read-Write</i> servers are usually primaries of the group. They can accept writes and will propagate them correctly to other servers in the group.<p>
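Since the candidate-selection example later in this post tests modes with a bitwise AND, one way to picture the three modes is as bit flags over a read bit and a write bit. The constant names and values below are invented for illustration and are not the actual constants used by the connectors:

```python
# Hypothetical bit-flag encoding of the server modes: one bit for reads
# and one for writes. Names and values are illustrative only.
MODE_OFFLINE    = 0b00   # accepts neither reads nor writes
MODE_READ_ONLY  = 0b01   # read bit set
MODE_READ_WRITE = 0b11   # both read and write bits set

def accepts(server_mode, wanted):
    """True if the server mode covers every access the transaction wants."""
    return (server_mode & wanted) == wanted
```

With this encoding, a read-only transaction matches both read-only and read-write servers, while a read-write transaction matches only read-write servers.<p>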
<b>Server Weight.</b> The weight of a server is used to balance the load between servers and represents the relative power of the server. When balancing the load, the connector figures out which servers are eligible for accepting a transaction and then picks one of them in such a way that the distribution over time is proportional to the weights of the servers.
<div class="sidebar figure-right" width="400pt">
<h2>
What's up with transactions?</h2>
So what is it with transactions that makes it necessary to declare the properties of the transaction <i>before</i> it actually starts? The reason is quite simple: the transaction is fed to the server one statement at a time. This means that it is not always possible to know whether it is a read-only or read-write transaction until the entire transaction has been seen.<p>
An example is this transaction (yes, it is a little contrived, but it is just an example):
<pre class="brush: sql;">START TRANSACTION;
SELECT salary INTO @salary FROM salaries
  WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
UPDATE titles SET to_date = CURRENT_DATE()
  WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
UPDATE salaries SET to_date = CURRENT_DATE()
  WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
INSERT INTO titles VALUES
  (20101, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'));
INSERT INTO salaries VALUES
  (20101, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'));
COMMIT;</pre>
From the first two statements alone it is not possible to tell whether this is a read-only or a read-write transaction; that only becomes clear when the first <code>UPDATE</code> is seen.</div>
<h2>
Transaction properties</h2>
As mentioned <a href="http://mysqlmusings.blogspot.se/2013/09/brief-introduction-to-mysql-fabric.html">before</a>, one of the goals is to support sharding in the presence of transactions, and to make that work correctly it is necessary to declare up-front what the transaction will contain. Not everything, but the key elements of the transaction: what tables it will access, what sharding key is used, and whether it is a read-only or read-write transaction. The first two properties are only necessary if you are working with a sharded system, so we skip those for now; the last one, however, is important for handling load-balancing in the connector.<p>
When executing transactions using a Fabric-aware connector, you provide the information about the transaction using <i>transaction properties</i>. There are several properties available, but we will focus on the ones related to group handling: <code>group</code> and <code>type</code>. The <code>group</code> property is used to provide the name of the group you want to connect to (you can have several), and the <code>type</code> property is used to tell whether this is a read-only or read-write transaction. In the future, we might add more properties, such as <i>priority</i> to indicate that this is an urgent transaction and a prompt reply is needed. For example, the following code uses a Fabric-aware connector to promote an employee.<p>
<pre class="brush: python;">from mysql.connector.fabric import (
    TYPE_READWRITE,
)

def promote_employee(conn, emp_no):
    stmts = [
        ("SELECT salary INTO @salary FROM salaries"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("UPDATE titles SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("UPDATE salaries SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("INSERT INTO titles VALUES"
         " (%s, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'))"),
        ("INSERT INTO salaries VALUES"
         " (%s, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'))"),
    ]

    # Use the group for the ACME company
    conn.set_property('group', 'ACME')
    conn.set_property('type', TYPE_READWRITE)
    conn.start_transaction()
    cur = conn.cursor()
    for stmt in stmts:
        print "Executing:", stmt % (emp_no,)
        cur.execute(stmt, (emp_no,))
    conn.commit()</pre>
The two <code>set_property</code> calls show how the properties of the transaction are set. In this case, we declare the group that we will access (for example, a fictional company "ACME") and also the type of the transaction. After that, a transaction is started and executed as normal. The Fabric-aware connector will pick the right server to send the transaction to, and you will get the result back in the normal fashion.<p>
<div class="note">
Note that the <code>type</code> property is not yet implemented in Connector/Python; some work remains to make it support load-balancing fully.</div>
<h2>
Picking a server</h2>
But are these server and transaction properties sufficient for a connector to decide what to do with a transaction? Let's take a look and see how the server can be selected.<p>
A server can be chosen by first selecting a set of candidates and then picking one of the candidates based on the weight of the server. Picking the candidates is done by matching the transaction properties against the server properties to find all servers that are eligible for accepting the transaction. When a list of candidates is available, you can, for example, pick one at random based on the weights of the servers. The example Python code below illustrates how this could be done. The first function <span class="symbol">find_candidates</span> computes the set of candidates from the set of all servers <span class="symbol">SERVERS</span>, while the second function <span class="symbol">pick_server</span> picks one of the servers at random based on the weight of the server.<p>
<pre class="brush: python;">from random import random

def find_candidates(props):
    candidates = []
    for srv in SERVERS:
        if props.group == srv.group and (props.mode & srv.mode):
            candidates.append(srv)
    return candidates

def pick_server(servers):
    random_weight = random() * sum(srv.weight for srv in servers)
    sum_weight = 0.0
    for srv in servers:
        sum_weight += srv.weight
        if sum_weight > random_weight:
            return srv
    return servers[-1]  # Last server in list

# Example code for picking a server based on transaction properties
pick_server(find_candidates(trans.props))</pre>
<h2>
Implementation of groups</h2>
The reason we introduced the group concept in this manner is to be able to vary the implementation of a group, so the question is: does it work? To see, it is instructive to consider some sample implementations of high-availability groups and check that they can be described in this manner, so let's do that. Note that the only version currently implemented is the <i>primary-secondary</i> approach; the other ones are just food for thought (at this point).<p>
<table class="figure-right"><tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNoJ-tHxZSQ2HsrruEbo8vhpVzPxFtoX5LzleJdYm3HEmdBDNP_q2j50QpLpdpSrSZohLwLuBR4VzKpW90O3c5flY5fJ8uoZMd2spRmZaucnIcZtU54ccBFb3TDLRoOCKBfOo6/s1600/ms_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
The <b>primary-secondary</b> approach (also known as <i>primary-backup</i> or <i>master-slave</i>) is the traditional way to set up MySQL servers for high availability. The idea is that there is a single <i>primary</i> managing the data and one or more <i>secondaries</i> that replicate the data from the primary and are ready to become primary in the event that the primary dies. In addition, there are a number of pure read slaves that are used to scale out reads.<p>
In this approach, the primary would be in read-write mode, while the secondaries can be either in read-only mode or offline: they cannot accept writes, since that might cause a split-brain situation. Not loading the secondaries with read-only transactions can make it easier for them to stay up to date with the primary, but this depends on the general load on the system. Any scale-out slaves added would, of course, be pure read-only servers, and they cannot be promoted to primary because they do not have the binary log enabled. However, if the primary fails, they still need to fail over to the new primary.<p>
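The mode assignment described above can be summarized as a small mapping from status to mode. This is only an illustrative sketch of the policy for a primary-secondary group, not an actual Fabric data structure:

```python
# Illustrative status-to-mode policy for a primary-secondary group.
# Whether secondaries are read-only or offline is a deployment choice.
PRIMARY_SECONDARY_POLICY = {
    'primary':   'read-write',
    'secondary': 'read-only',   # or 'offline' to keep secondaries unloaded
    'scale-out': 'read-only',
    'spare':     'offline',
}

def mode_for(status):
    """Look up the mode a server should be in, given its status."""
    return PRIMARY_SECONDARY_POLICY[status]
```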
If (when?) the primary fails, MySQL Fabric will detect the failure and start executing a procedure to promote one of the secondaries to be primary in place of the one that failed. MySQL Fabric has to do this because the servers do not know how to handle the fail-over themselves, and in addition it is necessary to inform the connectors about the new topology. In this procedure, the scale-out servers have to be moved over to the new primary as well.<p>
<table class="figure-right"><tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8cYlpmcK5jk99T-8hVHO0RbflSQdbcInXyMbh31YQfuAeuQplUpXvAhSYK84fiFEhYCxnXQfXCnCB3hTicnqk-1-xQdksRPmuaMezDQEfLOc9GpafXYdPgPBZn8RmJ0zX5hxx/s1600/drbd_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
Another popular solution for high availability is <b>shared storage</b> (for example, using shared network disks) or <b>replicated storage</b> (for example, using <a href="http://dev.mysql.com/doc/refman/5.0/en/ha-drbd.html">DRBD to replicate the block device</a>). In this case, one of the servers will be online while the other is on standby. For both DRBD and shared storage, it is necessary that the standby is completely offline, and in the case of DRBD the server should not even be running on the standby machine. In addition to the primary and the secondary, you could have read slaves attached to the primary.<p>
In this setup, the primary would then be a read-write server, while the standby server would be in offline mode. Scale-out servers would be in read-only mode, in this case attached to the primary.
<br clear="right" />
<table class="figure-right"><tbody>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_m4rPDII0O5fidchDgfBs_Ypu-4IokmEDwUnoRgdSnhsiFO_I8OOxK6PLg8ILHIZ5joHNH24osC3z9Um6t2ryHvY9fDCBrq9kjQjFoZlvy2d4TkOjIc8Kth0suMgST9FW0DXM/s1600/mysql_cluster_group.png" height="196" type="image/png" width="150"></object></td></tr>
</tbody></table>
Another approach is to use <a href="http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster.html"><b>MySQL Cluster</b></a> as a group. The cluster consists of several <i>data nodes</i> and employs a shared-nothing architecture to ensure high availability. In this case, all the servers will be both read and write servers, and all might be primaries. In the event that an NDB data node fails, the other nodes are always ready to pick up the job, so a MySQL Cluster group is self-managing. (There is an excellent overview of MySQL Cluster at <a href="http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-overview.html">http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-overview.html</a>.)<p>
The first two solutions above employ different fail-over procedures that are executed by the Fabric node when it notices the failure. In contrast, MySQL Cluster is self-governing and does not require any fail-over handling in the Fabric node.
<br clear="right" />
<h2>
Summary and considerations for the future</h2>
For the examples above, the properties we have outlined are definitely sufficient, but there might be other cases where more information is needed.<p>
One property missing in the current implementation is a way to select a server based on its proximity to the connector. For example, it could be possible to put the primaries and secondaries of a group in different data centers to ensure that the group can survive a catastrophic failure. This, however, raises two issues:<p>
<ol>
<li>There will be a set of read servers in each data center that should be connected to the primary or secondary in the same data center.</li>
<li>When the connector picks one of the candidates, it should prefer to use those in the same data center.</li>
</ol>
Both these issues mean that we need some measure of the distance between servers and connectors, so that a new scale-out server is not added in such a way that the same data is shipped several times between data centers, and so that a connector does not connect to a server in a different data center. Adding a complete matrix with distances between each and every server would not really work well, so it is likely that some other way to model the proximity is needed.<p>
Another case that might require additional information is when the Fabric node fails or is temporarily unavailable (for instance, while restarting). Such an event should not block the entire system, and since the connectors have the information cached it would be possible to "run on the cache" for a brief period. The key issue here is that nothing can be updated, so each connector needs to have a fallback plan in the event that a server fails. For example, if a master fails, the fail-over will not be executed, but it would still be possible to read information from the slaves.
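As a thought experiment, the simplest proximity model is to tag each server with a location label and have the connector prefer candidates in its own location, falling back to the full candidate list when no local server qualifies. The <code>location</code> attribute below is invented for illustration and is not part of the current server properties:

```python
from collections import namedtuple

# Minimal stand-in for a server record; 'location' is a hypothetical
# property naming the data center the server runs in.
Server = namedtuple('Server', 'host location weight')

def prefer_local(candidates, my_location):
    """Keep only candidates in the connector's own data center,
    falling back to the full list if none are local."""
    local = [srv for srv in candidates if srv.location == my_location]
    return local or candidates
```

A connector in data center "dc1" would then run the usual weighted pick over the local subset first, only crossing data centers when it has no other choice.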
<h2>
Related Links</h2>
<ul>
<li>MySQL Forum <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a></li>
<li>Presentation from MySQL Connect: <a href="http://www.slideshare.net/mkindahl/mysql-sharding-tools-and-best-practices-for-horizontal-scaling">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</a></li>
<li><a href="http://alfranio-distributed.blogspot.com/2013/09/tips-to-build-fault-tolerant-database.html">Tips to Build a Fault-Tolerant Database Application</a></li>
<li><a href="http://alfranio-distributed.blogspot.com/2013/09/writing-fault-tolerant-database.html">Writing a Fault-tolerant Application using MySQL Fabric</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-introduction.html">Fabric Sharding &mdash; Horizontal Scaling of MySQL</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-migration.html">MySQL Fabric Sharding - Migrating From an Unsharded to a Sharded Setup</a></li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com5tag:blogger.com,1999:blog-23496029.post-32141861927126662912013-10-08T22:00:00.000+02:002013-10-10T16:57:49.784+02:00MySQL Connect presentations on MySQL Fabric available on SlideShareGoing to MySQL Connect was truly a blast. We got a lot of good questions and feedback in the sessions and there was a lot of interest in both MySQL Fabric and the MySQL Applier for Hadoop.<br />
<br />
A big thank you to all who attended the talks; I got a lot of good questions and comments that will help us build good solutions.<br />
<br />
The talks are available on SlideShare: <br />
<ul>
<li><a href="http://www.slideshare.net/mkindahl/mysql-sharding-tools-and-best-practices-for-horizontal-scaling">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</a> </li>
<li><a href="http://www.slideshare.net/mkindahl/my-sql-connect2013hadoopapplier">MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS</a></li>
<li><a href="http://schlueters.de/blog/archives/175-Sharding-PHP-with-MySQL-Fabric.html">Sharding PHP with MySQL Fabric</a></li>
<li><a href="http://www.slideshare.net/alfranio1/mysql-connect-fabrichacon2300">MySQL High Availability: Managing Farms of Distributed Servers</a> </li>
</ul>
Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com2tag:blogger.com,1999:blog-23496029.post-75110564655754350872013-09-21T18:35:00.000+02:002013-09-21T18:57:49.633+02:00A Brief Introduction to MySQL FabricAs you saw in the keynote, we are introducing an integrated framework for managing <i>farms</i> of MySQL servers with support for both high-availability and sharding. It should be noted that this is a very early alpha and that, at this point, it is not ready for production use.<br />
<br />
MySQL Fabric is an integrated system for managing a collection of MySQL servers and is the framework on which high-availability and sharding are built. MySQL Fabric is open source and is intended to be extensible, easy to use, and able to support procedure execution even in the presence of failures, an execution model we call <i>resilient execution</i>.<br />
<br />
To ensure high-availability, it is necessary to have redundancy in the system. For database systems, the redundancy traditionally takes the form of having a primary server acting as a master and using replication to keep secondaries available to take over in case the primary fails. This means that the "server" that the application connects to is in reality a collection of servers, not a single server. In a similar manner, if the application is using a sharded database, it is in reality working with a collection of servers, not a single server. In this case, we refer to a collection of servers as a <i>farm</i>.<br />
<br />
Now, just having a collection of servers does not really help us that much: it is necessary to have some structure imposed on the farm as well as an API to work with the farm, and this is where MySQL Fabric comes in.<br />
<br />
Before going over the concepts, have a look at the farm below. In the figure, there is an application that wants to connect to the farm, and there is a set of database servers at the bottom organized into groups called <i>high-availability groups</i>. To manage the structure of the farm, there is a <i>MySQL Fabric Node</i> available that, among other things, keeps track of the meta-data and handles procedure execution. Both the application and an operator can connect to the Fabric node to get information about the farm being managed.<br />
<br />
<table class="figure-inflow"><tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKWcYvw4bbgpiFow5dC5pw0jLDVPHPO2lR6_3UHeidTejjiIXzjxRIJzLWGamQNoDdT1OgQ3Dxbn6dvY3104mHV1Zx61wyakVNTWidvB6Igc5G38i0UtW-2mKKLPjmA1Sz20LK/s1600/mysql_fabric_farm.png" type="image/png" width="856" height="531"></object></td></tr>
</table><br />
The <i>MySQL Fabric nodes</i> are responsible for keeping track of the structure of the farm as well as all information about the servers' status, and they are also where any management procedures are executed. If a server fails, the Fabric node will handle the procedure of promoting one of the slaves to be the new master, but it also contains the logic for handling shard moving and shard splitting.<br />
<br />
<table class="figure-right"><caption>High-Availability Group</caption>
<tr><td><object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg84VVtciZqoFivRGTbyqCn09SzoUWVm9svy8EXSXUTcuhhuJLcIX3mNJCCS35mAoee9UfxPvZjsU_gnsZ_I1w-2-up0tjRaOaC7VLCuHU1XBWKqqlYoeqvYeeaL311b1dU6a9g/s1600/ha_group.png" type="image/png" width="175" height="182"></object></td></tr>
</table><br />
<h2>High-availability Groups</h2><br />
The central concept to handling high-availability in Fabric are the <i>high-availability groups</i>. These are collections of servers that shall work together to deliver a database service to the application that connects. The groups are introduced to give structure to the farm and allow you to describe how the servers are organized to support the redundancy necessary to ensure high-availability.<br />
<br />
Inside the group, all the servers manage the same data and hence have the same schema. In addition, each server has a distinct role and no server can belong to more than one group. The group concept does not really represent anything new: it is just a way to structure the farm in such a manner that managing it is easy to understand and the roles of the servers are clear. To manage redundancy, a traditional <i>Master-Slave Setup</i> is used, a topology that we are all familiar with (it is often called the <i>Primary-Secondary Approach</i>, hence the names that follow). Each group has a <i>primary</i> that is the master for all the data. Any queries that update data are sent there, and the changes are propagated to the other servers in the group. Redundancy is achieved by keeping one or more <i>secondaries</i> in the group that receive changes from the primary and are ready to take over the role as primary should the primary disappear. To handle scale-out, the group also contains <i>scale-out servers</i>, which are servers that receive changes from the primary but are not eligible for promotion to primary. In the group, there are also <i>spares</i>, which are servers that are available for use but have not been assigned any active role yet.<br />
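To make the role structure concrete, a sanity check over a group could verify that every server has a known role and that there is exactly one primary. The function below is a sketch with invented names, not part of MySQL Fabric:

```python
# Hypothetical sketch: check the role invariants of a high-availability
# group, given a mapping from server name to role.
VALID_ROLES = {'primary', 'secondary', 'scale-out', 'spare'}

def check_group(roles):
    """True if every role is known and the group has exactly one primary."""
    if not all(role in VALID_ROLES for role in roles.values()):
        return False
    return sum(1 for role in roles.values() if role == 'primary') == 1
```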
<br />
<h2>Sharding</h2><br />
In addition to high-availability support, MySQL Fabric also offers support for <i>sharding</i>, which is a technique for handling very large databases and/or very high write loads. The database is split into a large number of <i>shards</i>, where each shard contains a fragment of the data in the database. Each shard is stored on a separate server (or a separate <i>set</i> of servers if you want to ensure high availability) and transactions are directed to the right shard based on the sharding key. Splitting the database in this way allows you to manage a larger database by spreading it over more servers, but it also scales the write traffic because writes can execute independently on different shards.<br />
<br />
When using sharding, MySQL Fabric separates tables into <i>sharded tables</i> and <i>global tables</i>. MySQL Fabric allows you to shard just some of the tables in the database and keep the other tables available on all shards; the latter are the global tables. Since databases usually have multiple tables with foreign key relationships between them, it is critical to be able to shard several tables the same way (but possibly on different columns), which is something that MySQL Fabric supports. Using this support, you specify all the tables to be sharded and which column should be used as the sharding key in each, and MySQL Fabric will shard all the tables and distribute the rows over the shards.<br />
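To see how a sharding key directs a transaction to a shard, consider a minimal sketch of a range-based shard map. The data and function below are invented for illustration; the real mapping is maintained by the Fabric node and consulted by the connectors:

```python
import bisect

# Illustrative range-based shard map: each lower bound marks where a
# shard's key range begins; the shard covers keys up to the next bound.
LOWER_BOUNDS = [0, 10000, 20000]
GROUPS = ['shard-1', 'shard-2', 'shard-3']

def group_for_key(key):
    """Return the group holding the shard whose range contains the key:
    the entry with the largest lower bound not exceeding the key."""
    idx = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    return GROUPS[idx]
```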
<br />
If you want to know more about how Fabric support sharding, or about sharding in general, you should come to the session <span class="vevent"> <i><span class="summary">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</span></i>, <span class="dtstart" title="2013-09-21T16:00-08">September 21, 4:00pm</span>-<span class="dtend" title="2013-09-21T17:00-08">5:00pm</span> in <span class="location">Imperial Ballroom B</span></span>.<br />
<br />
<br />
<h2>Connecting to a MySQL Farm</h2><br />
To provide better control when working with a MySQL farm we have extended the connector API in such a manner that it hides the complexities of handling fail-over in the event of a server failure as well as dispatching transactions to shards correctly. There is currently support for Fabric-aware versions of Connector/J, Connector/PHP, Connector/Python as well as some rudimentary support for Hibernate and Doctrine. If you are interested in how the extensions to the interface look and how you can use them to scale your application, you should come to <span class="vevent"> <i><span class="summary">Scaling PHP Applications</span></i>, <span class="dtstart" title="2013-09-22T10:00-08">September 22, 10:00am</span>-<span class="dtend" title="2013-09-22T11:00-08">11:00am</span> in <span class="location">Union Square Room 3/4</span></span>.<br />
<br />
<br />
<h2>More information about MySQL Fabric</h2><br />
There are several blogs being published on high-availability and sharding from the developers working on the Fabric system or the connectors.<br />
<br />
<ul><li><a href="http://alfranio-distributed.blogspot.com/2013/09/tips-to-build-fault-tolerant-database.html">Tips to Build a Fault-Tolerant Database Application</a></li>
<li><a href="http://alfranio-distributed.blogspot.com/2013/09/writing-fault-tolerant-database.html">Writing a Fault-tolerant Application using MySQL Fabric</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-introduction.html">Fabric Sharding - Horizontal Scaling of MySQL</a></li>
<li><a href="http://vnwrites.blogspot.com/2013/09/mysqlfabric-sharding-migration.html">MySQL Fabric Sharding - Migrating From an Unsharded to a Sharded Setup</a></li>
</ul><br />
If you are interested in discussing and asking questions about MySQL Fabric, or sharding and high-availability in general, the forum on <a href="http://forums.mysql.com/list.php?144">Fabric, Sharding, HA, Utilities</a> is an excellent place for discussions. Also, if you are at MySQL Connect, going to <span class="vevent"><i><span class="summary">MySQL Sharding, Replication, and HA</span></i> (<span class="dtstart" title="2013-09-21T17:30-08:00">September 21, 5:30</span>-<span class="dtend" title="2013-09-21T18:30-08:00">6:30pm</span> in <span class="location">Imperial Ballroom B</span>)</span> is an excellent opportunity to learn more about the project, meet us developers and team leads, and provide feedback to us. The BOF will cover several areas, some overlapping, but the discussion is likely to cover MySQL Fabric and MySQL replication.<br />
<br />
Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com7tag:blogger.com,1999:blog-23496029.post-21156200936829349242013-08-29T20:10:00.001+02:002013-08-29T20:12:03.579+02:00Going to MySQL Connect 2013<a href="http://www.oracle.com/go/?&Src=7328809&Act=414&pcode=WWMK11054264MPP004"><img class="figure-right" src="http://www.oracleimg.com/us/dm/h2fy11/183037-mysql-tk-imspeaking-125x125-1951645.gif"></a><br />
<br />
MySQL Connect 2013 is coming up with several interesting new sessions. Some sessions that I am participating in got accepted for the conference, so if you are going there, you might find the following sessions interesting. For your convenience, the sessions have <a rel="profile" href="http://microformats.org/profile/hcalendar">hCalendar</a> markup, so it should be easier to add them to your calendar.<br />
<br />
<dl><div class="vevent">
<dt><strong><em><span class="summary">MySQL Sharding, Replication, and HA</span></em> (<span class="dtstart" title="2013-09-21T17:30-08:00">September 21, 5:30</span>-<span class="dtend" title="2013-09-21T18:30-08:00">6:30pm</span> in <span class="location">Imperial Ballroom B</span>)</strong></dt>
<dd class="description"> <p>This session is an opportunity for you to meet the MySQL engineering team and discuss the latest tools and best practices for sharding MySQL across distributed server farms while maintaining high availability.</p><p>Come here and meet Lars, Luis, Johannes, and me, and bring up any questions or comments you have regarding sharding, high-availability, and replication. We might have some informal presentations available to discuss sharding and high-availability issues.</p></dd></div><div class="vevent">
<dt><strong><em><span class="summary">MySQL Sharding: Tools and Best Practices for Horizontal Scaling</span></em> (<span class="dtstart" title="2013-09-21T16:00-08">September 21, 4:00pm</span>-<span class="dtend" title="2013-09-21T17:00-08">5:00pm</span> in <span class="location">Imperial Ballroom B</span>)</strong></dt>
<dd><p class="description">In this session, Alfranio and I will discuss how to create and manage a sharded MySQL database system. The session covers tools, techniques, and best practices for handling various aspects of sharding such as: <ul><li>Hybrid approaches mixing sharding and global tables.</li>
<li>Sharding in the presence of transactions and how that affects applications.</li>
<li>Turning an unsharded database into a sharded database<br />
system.</li>
<li>Handling schema changes to a sharded database.</li>
</ul></p></dd></div><div class="vevent">
<dt><strong><em><span class="summary">MySQL and Hadoop: Big Data Integration—Unlocking New Insights</span></em> (<span class="dtstart" title="2013-09-22T11:30-08">September 22,<br />
11:30am</span>-<span class="dtend" title="2013-09-22T12:30-08">12:30pm</span> in <span class="location">Taylor</span>)</strong></dt>
<dd><p class="description">Hadoop enables organizations to gain deeper insight into their customers, partners, and processes. As the world's most popular open source database, MySQL is a key component of many big data platforms. This session discusses technologies that enable integration between MySQL and Hadoop, exploring the lifecycle of big data, from acquisition via SQL and NoSQL APIs through to delivering operational insight and tools such as Sqoop and the Binlog API with Hadoop Applier enabling both batch and real-time integration.</p></dd></div><div class="vevent">
<dt><strong><em><span class="summary">MySQL High Availability: Managing Farms of Distributed Servers</span></em> (<span class="dtstart" title="2013-09-22T17:30-08">September 22, 5:30pm</span>-<span class="dtend" title="2013-09-22T18:30-08">6:30pm</span> in <span class="location">Imperial Ballroom B</span>)</strong></dt>
<dd> <p class="description">In this session, Alfranio and I will discuss tools, best practices, and frameworks for delivering high availability using MySQL. For example, handling such issues as ensuring server redundancy, handling failure detection and executing recovery, re-directing clients in the event of failure.</p></dd> </div><div class="vevent">
<dt><strong><em><span class="summary">Scaling PHP Applications</span></em> (<span class="dtstart" title="2013-09-22T10:00-08">September 22, 10:00am</span>-<span class="dtend" title="2013-09-22T11:00-08">11:00am</span> in <span class="location">Union Square Room 3/4</span>)</strong></dt>
<dd> <p class="description">When building a PHP application, it is necessary to consider both scaling and high-availability issues. In this session, Johannes and I will discuss how to ensure that your PHP application scales well and can work in a high-availability environment.</p></dd></div></dl>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com1tag:blogger.com,1999:blog-23496029.post-23080553972056588002012-10-11T16:25:00.001+02:002013-10-08T22:31:50.693+02:00Round Robin Replication using GTIDIn a <a href="http://mysqlmusings.blogspot.se/2011/04/round-robin-multi-source-in-pure-sql.html">previous post</a> I showed how to implement multi-source round-robin replication in pure SQL using the tables that are needed for crash-safe replication. I also outlined a revised version of this approach in the <a href="http://www.slideshare.net/mkindahl/replication-tips-tricks">Replication Tips & Tricks presentation</a> I gave at <a href="http://www.oracle.com/mysqlconnect">MySQL Connect</a>. This was, however, before the GTID (Global Transaction ID) implementation was done. Now that they are introduced, multi-source replication is even easier since you no longer have to keep track of the positions.<br />
<br />
<table class="figure-right"><caption><b>Figure 1. </b>Tables for storing information about masters</caption> <tbody>
<tr><td><pre class="brush: sql;">CREATE TABLE my_masters (
  idx INT AUTO_INCREMENT,
  host CHAR(50) NOT NULL,
  port INT NOT NULL DEFAULT 3306,
  PRIMARY KEY (idx),
  UNIQUE INDEX (host, port)
);

CREATE TABLE current_master (
  idx INT
);

CREATE TABLE replication_user (
  name CHAR(40),
  passwd CHAR(40)
);</pre></td></tr>
</tbody></table><br />
One caveat is that this only works if you are replicating from servers that have GTID enabled, so if you are replicating from a pre-5.6 server, you can use the original implementation. I have added a re-factored version of that code at the end of this post, where you will also find some utility procedures that I use in the version described here.<br />
<br />
For the version that uses GTID, we still keep the two tables from the original implementation, but we remove the file and position columns, giving the definitions seen in Figure 1. Also, the user and password are stored in a separate table and are assumed to be identical for all machines.<br />
<br />
To fetch the new master, I created a <code>fetch_next_master</code> procedure that fetches the next master in turn and then advances <code>current_master</code> to it. The second select in the code below handles the wrap-around case for a table with masters defined as in Table 1: if there is no master with a higher <code>idx</code>, rotation starts over from the first master.<br />
<br />
<br />
<pre class="brush: sql;">delimiter $$
CREATE PROCEDURE fetch_next_master(
    OUT p_host CHAR(50), OUT p_port INT UNSIGNED)
BEGIN
  DECLARE l_next_idx INT DEFAULT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx > (SELECT idx FROM current_master)
   ORDER BY idx LIMIT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx >= l_next_idx
   ORDER BY idx LIMIT 1;

  UPDATE current_master SET idx = l_next_idx;

  SELECT host, port INTO p_host, p_port
    FROM my_masters WHERE idx = l_next_idx;
END $$
delimiter ;</pre><br />
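The selection logic of the two SELECT statements can be illustrated with a small Python sketch (a toy model for illustration, not part of the post's SQL):

```python
# Toy model of fetch_next_master: my_masters is a list of (idx, host, port)
# rows sorted by idx, current_idx is the value stored in current_master.
def fetch_next_master(my_masters, current_idx):
    # First SELECT: smallest idx strictly greater than the current one.
    next_rows = [row for row in my_masters if row[0] > current_idx]
    # Second SELECT: wrap around to the first master when none is found.
    return next_rows[0] if next_rows else my_masters[0]

masters = [(1, "master-1", 3306), (4, "master-2", 3306), (9, "master-3", 3306)]
fetch_next_master(masters, 4)  # → (9, 'master-3', 3306)
fetch_next_master(masters, 9)  # → (1, 'master-1', 3306), i.e. wrap-around
```

Note that the logic works even when the <code>idx</code> values have gaps, which is why the SQL version uses two selects instead of simple arithmetic on the index.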
Since we no longer need to save the position, the code for the <code>multi_source</code> event is significantly simpler. All that is necessary is to change master to the next master in turn: the server automatically remembers which transactions are missing and will start replicating from the correct position.<br />
<br />
<pre class="brush: sql;">delimiter $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 1 MINUTE DO
BEGIN
  DECLARE l_host CHAR(50);
  DECLARE l_port INT UNSIGNED;
  DECLARE l_user CHAR(40);
  DECLARE l_passwd CHAR(40);

  SET SQL_LOG_BIN = 0;
  CALL stop_slave_gracefully();
  START TRANSACTION;
  CALL fetch_next_master(l_host, l_port);
  SELECT name, passwd INTO l_user, l_passwd FROM replication_user;
  -- No file or position is needed with GTID: pass NULL for both.
  CALL change_master(l_host, l_port, l_user, l_passwd, NULL, NULL);
  COMMIT;
  START SLAVE;
END $$
delimiter ;</pre><br />
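The reason no position is needed can be pictured as simple set arithmetic over GTIDs: each server knows the set of transactions it has executed, and the master sends exactly the ones the slave is missing. A sketch for illustration only (the server's internal GTID-set representation is more compact than this):

```python
def missing_transactions(master_gtids, slave_gtids):
    # The master sends every transaction the slave has not yet executed.
    return sorted(master_gtids - slave_gtids)

master_gtids = {"uuid-1:1", "uuid-1:2", "uuid-1:3", "uuid-2:1"}
slave_gtids = {"uuid-1:1", "uuid-1:2"}
missing_transactions(master_gtids, slave_gtids)  # → ['uuid-1:3', 'uuid-2:1']
```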
<h2>Full code for original implementation</h2><br />
Here is the code for replicating from pre-5.6 to 5.6 using the replication tables added for implementing crash-safe slaves. <br />
<br />
Compared to the version described in the earlier post, I have added a few utility procedures, such as a procedure to stop the slave gracefully. The procedure first stops the I/O thread, and then lets the SQL thread empty the relay log before it too is stopped. This is mainly to avoid having to re-transfer a lot of events from the master. Compared to the <a href="http://mysqlmusings.blogspot.se/2011/04/round-robin-multi-source-in-pure-sql.html">version provided in the previous post</a>, I have also factored out some separate procedures; you can see the re-factored version at the end of the post.<br />
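The graceful-stop logic can be sketched like this in Python; the <code>conn</code> object and its <code>execute</code>/<code>slave_status</code> methods are hypothetical stand-ins for illustration, not a real connector API:

```python
def stop_slave_gracefully(conn):
    # Stop the I/O thread first so no new events enter the relay log.
    conn.execute("STOP SLAVE IO_THREAD")
    # Let the SQL thread drain the relay log before stopping it.
    while True:
        status = conn.slave_status()  # hypothetical: one row of SHOW SLAVE STATUS
        drained = (status["Relay_Master_Log_File"] == status["Master_Log_File"]
                   and status["Exec_Master_Log_Pos"] == status["Read_Master_Log_Pos"])
        if drained:
            break
    conn.execute("STOP SLAVE SQL_THREAD")
```

A production version would of course sleep between polls and give up after a timeout; the sketch only shows the order of operations.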
<br />
<br />
<pre class="brush: sql;">delimiter $$
CREATE PROCEDURE change_master(
    p_host CHAR(40), p_port INT,
    p_user CHAR(40), p_passwd CHAR(40),
    p_file CHAR(40), p_pos BIGINT)
BEGIN
  SET @cmd = CONCAT('CHANGE MASTER TO ',
                    CONCAT('MASTER_HOST = "', p_host, '", '),
                    CONCAT('MASTER_PORT = ', p_port, ', '),
                    CONCAT('MASTER_USER = "', p_user, '", '),
                    CONCAT('MASTER_PASSWORD = "', p_passwd, '"'));
  IF p_file IS NOT NULL AND p_pos IS NOT NULL THEN
    SET @cmd = CONCAT(@cmd,
                      CONCAT(', MASTER_LOG_FILE = "', p_file, '"'),
                      CONCAT(', MASTER_LOG_POS = ', p_pos));
  END IF;
  PREPARE change_master FROM @cmd;
  EXECUTE change_master;
  DEALLOCATE PREPARE change_master;
END $$
delimiter ;
delimiter $$
CREATE PROCEDURE save_position()
BEGIN
  UPDATE my_masters AS m,
         mysql.slave_relay_log_info AS rli
     SET m.log_pos = rli.master_log_pos,
         m.log_file = rli.master_log_name
   WHERE m.idx = (SELECT idx FROM current_master);
END $$
delimiter ;
delimiter $$
CREATE PROCEDURE fetch_next_master(
    OUT p_host CHAR(40), OUT p_port INT UNSIGNED,
    OUT p_file CHAR(40), OUT p_pos BIGINT)
BEGIN
  DECLARE l_next_idx INT DEFAULT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx > (SELECT idx FROM current_master)
   ORDER BY idx LIMIT 1;

  SELECT idx INTO l_next_idx FROM my_masters
   WHERE idx >= l_next_idx
   ORDER BY idx LIMIT 1;

  UPDATE current_master SET idx = l_next_idx;

  SELECT host, port, log_pos, log_file
    INTO p_host, p_port, p_pos, p_file
    FROM my_masters
   WHERE idx = l_next_idx;
END $$
delimiter ;
delimiter $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 10 SECOND DO
BEGIN
  DECLARE l_host CHAR(40);
  DECLARE l_port INT UNSIGNED;
  DECLARE l_user CHAR(40);
  DECLARE l_passwd CHAR(40);
  DECLARE l_file CHAR(40);
  DECLARE l_pos BIGINT;

  SET SQL_LOG_BIN = 0;
  STOP SLAVE;
  START TRANSACTION;
  CALL save_position();
  CALL fetch_next_master(l_host, l_port, l_file, l_pos);
  SELECT name, passwd INTO l_user, l_passwd FROM replication_user;
  CALL change_master(l_host, l_port, l_user, l_passwd, l_file, l_pos);
  COMMIT;
  START SLAVE;
END $$
delimiter ;</pre><br />
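The interplay of <code>save_position</code> and <code>fetch_next_master</code> in the original implementation can be modeled in a few lines of Python (a toy model; in the real code the position comes from <code>mysql.slave_relay_log_info</code>):

```python
# Toy model of pre-GTID rotation: each master row remembers the binlog
# file and position where we stopped reading last time.
masters = {
    1: {"host": "master-1", "file": "binlog.000001", "pos": 0},
    2: {"host": "master-2", "file": "binlog.000007", "pos": 0},
}
current = {"idx": 1, "file": "binlog.000001", "pos": 0}

def save_position(read_file, read_pos):
    # Mirrors the UPDATE against mysql.slave_relay_log_info.
    masters[current["idx"]]["file"] = read_file
    masters[current["idx"]]["pos"] = read_pos

def rotate():
    # Pick the next idx, wrapping around, and restore its saved position.
    idxs = sorted(masters)
    nxt = min([i for i in idxs if i > current["idx"]] or idxs)
    current.update(idx=nxt, file=masters[nxt]["file"], pos=masters[nxt]["pos"])

save_position("binlog.000002", 1500)   # we read master 1 up to here
rotate()                               # now on master 2
rotate()                               # back on master 1, at the saved position
# current == {"idx": 1, "file": "binlog.000002", "pos": 1500}
```

This is exactly the bookkeeping that GTID-based replication makes unnecessary.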
<br />
<br />
<br />
Mats Kindahl

<h2>Binary Log Group Commit in MySQL 5.6 (2012-06-05)</h2><!-- Background --> With the release of MySQL 5.6, binary log group commit is included: a feature focused on improving the performance of a server when the binary log is enabled. In short, binary log group commit improves performance by grouping several writes to the binary log instead of writing them one by one; I will digress a little below on how transactions are logged to the binary log before going into the details. First, though, let us look at what you have to do to turn it on.<p>Nothing.<p>Well... there are actually a few options to tweak it, but nothing is required to turn it on. It even works for existing engines, since we did not have to extend the handlerton interface to implement binary log group commit. However, InnoDB has some optimizations that take advantage of the binary log group commit implementation.<br />
<dl><dt><code>binlog_order_commits={0|1}</code> <dd>This is a global variable that can be set without stopping the server.<p>If this is off (0), transactions may be committed in parallel. In some circumstances, this might offer some performance boost. In the measurements we did, there was no significant improvement in throughput, but we decided to keep the option anyway, since there are special cases where it can offer improvements.<br />
<dt><code>binlog_max_flush_queue_time=<var>microseconds</var></code> <dd>This variable controls when to stop skimming the flush queue (more about that below) and move on as soon as possible. Note that this is <em>not</em> a timeout on how often the binary log should be written to disk since grabbing the queue and writing it to disk takes time. </dl><h2>Transactions galore...</h2>As the server executes transactions the server will collect the changes done by the transaction in a per-connection <em
class="def">transaction cache</em>. If statement-based replication is used, the statements will be written to the transaction cache, and if row-based replication is used, the actual rows changed will be written to the transaction cache. Once the transaction commits, the transaction cache is written to the binary log as one single block. This allows each session to execute independently of the others: a lock on the binary log is needed only when writing the transaction data to it. Since transactions are isolated from each other, it is enough to serialize the transactions on commits. (Of course, this is in an ideal world. Transactions can see other transactions' changes if you set a different transaction isolation level. You would never do that unless you knew exactly what you're doing... right?) <table class="figure-inflow"><caption>Figure 1. Two-Phase Commit Protocol (2PC)</caption>
<tr><td> <object title="Two-Phase Commit Protocol (2PC)"
data="http://www.kindahl.net/images/2pc_msc.svg" type="image/svg+xml" width="600" height="302"> </object> </td></tr>
</table>In order to keep the storage engine and the binary log in sync, the server employs a <em class="def">two-phase commit protocol</em> (or just <em>2PC</em>) that you can see in Figure 1. As you can see, there is one call to <code>write()</code> and one call to <code>fsync()</code> in the diagram: I'll get to that in just a moment, so stay tuned.<p>The entire point of using a two-phase commit protocol is to be able to guarantee that the transaction is either in both the engine and the binary log (or in neither), even in the event that the server crashes after the prepare and subsequently recovers. That is, it should not be possible for the transaction to be in the engine but not in the binary log, or vice versa. Two-phase commit solves this by requiring that once a transaction is prepared in the engine, it can be either fully committed or fully rolled back even if the server crashes and recovers. So, on recovery, the storage engine will provide the server with all the transactions that are prepared but not yet committed, and the server will then commit a transaction if it can be found in the binary log, and roll it back otherwise.<p>This is, however, only possible if the transaction can be guaranteed to be persistent in the binary log before being committed in the engine. Since disks are slow and memory is fast, the operating system tries to improve performance by keeping part of the file in memory instead of writing directly to disk. Once enough changes have been written, or the memory is needed for something else, the changes are written to disk.
This is good for the operating system (and also for anybody using the computer), but causes a problem for the database server: if the server crashes, it is possible that a transaction is committed in the storage engine while there is no trace of it in the binary log.<p>For recovery to work properly, it is therefore necessary to ensure that the file is really on disk, which is why there is a call to <code>fsync()</code> in Figure 1: it forces the in-memory part of the file to be written to disk.<p><h2>The Infamous <code>prepare_commit_mutex</code></h2><table class="figure-right"><caption>Figure 2. Missing un-committed transactions</caption>
<tr><td> <object title="Missing un-committed transactions"
data="http://www.kindahl.net/images/binlog-uncommitted.svg" type="image/svg+xml"
width="325.40039" height="308.55518"> </object> </td></tr>
</table>When the server recovers, it has access to the binary log and can therefore decide what to commit and what to roll back. But what if there is no binary log?<p>In the general case, a recovery can just roll back all prepared transactions and start again. After all, the transactions that were just prepared but not committed are safe to roll back: they just move the database back to the state it had before those transactions started. No connected client has received an indication that the transaction is committed, so they will realize that the transactions have to be re-executed.<p>There is another case where recovery is used in this way, and that is when using on-line backup methods such as InnoDB Hot Backup (which is used in MySQL Enterprise Backup). These tools take a copy of the database files and InnoDB transaction logs directly—which is an easy way to take a backup—but it means that the transaction logs contain transactions that have just been prepared. On recovery, they roll back all those transactions and arrive at a database in a consistent state.<p>Since these on-line backup methods are often used to bootstrap new slaves, the binary log position of the last committed transaction is written in the header of the InnoDB redo log. On recovery, the recovery program prints the binary log position of the last committed transaction and you can use this information with the <code>CHANGE MASTER</code> command to start replicating from the correct position. For this to work correctly, it is necessary that all transactions are committed in the same order as they are written to the binary log. If they are not, there can be "holes" where some transactions are written to the binary log but not yet committed, which would cause the slave to miss transactions that were not committed. The problematic case that can arise is what you see in Figure 3 below. <table class="figure-inflow"><caption>Figure 3. Committing in parallel</caption>
<tr><td> <object title="Commit Procedure Stages"
data="http://www.kindahl.net/images/ibbackup_snapshot.svg" type="image/svg+xml"
width="600" height="376"> </object> </td></tr>
</table>You can see an example of this in Figure 2, where replication will start from the last committed position, but there is a transaction that was just prepared and hence was rolled back when the backup was restored on the slave.<p>To solve this, InnoDB added a mutex called the <code>prepare_commit_mutex</code> that was taken when preparing a transaction and released when committing it. This is a simple solution to the problem, but causes some problems of its own that we will get to in a minute. Basically, the <code>prepare_commit_mutex</code> solves the problem by forcing the call sequence to be as in Figure 4. <table class="figure-inflow"><caption>Figure 4. Sequencing transactions</caption>
<tr><td> <object title="Sequencing transactions"
data="http://www.kindahl.net/images/ibbackup_prepare_commit_mutex.svg" type="image/svg+xml"
width="600" height="291"> </object> </td></tr>
</table><!-- Problem --> <h2>Steady as she goes... NOT!</h2>Since disk writes are slow, writing every transaction to disk will affect performance quite a lot... well, actually very much...<p>To handle that, there is a server option <code>sync_binlog</code> that controls how often the binary log should be synced to disk. If it is set to 0, the operating system will decide when the file pages should be written to disk; if it is set to 1, then <code>fsync()</code> will be called after every transaction is written to the binary log. In general, if you set <code>sync_binlog</code> to <em class="math">N</em>, you can lose at most <em class="math">N-1</em> transactions, so in practice there are just two useful settings: <code>sync_binlog=0</code> means that you accept that some transactions can be lost and handle it some other way, and <code>sync_binlog=1</code> means that you do not accept losing any transactions at all. You could of course set it to some other value to get something in between, but in reality you can either handle transaction loss or not.<p>To improve performance, the common approach is to bundle many writes with each sync: this is what the operating system does, and it is what we should be able to do. However, if you look at Figure 4 you see that there is no way to place an <code>fsync()</code> call in that sequence so that several transactions are written to disk at the same time. Why? Because at any point in that sequence there is at most one prepared and written transaction. However, if you go back to Figure 3, you can see that it would be possible to place an <code>fsync()</code> as shown and write several transactions to disk at the same time: all transactions written to the binary log before the <code>fsync()</code> call would be written to disk at once. But this requires ordering the commits in the same order as the writes without using the <code>prepare_commit_mutex</code>. <!-- Implementation --> <h2>So, how does all this work then...</h2>The binary log group commit implementation splits the commit procedure into several stages, as you can see in Figure 5. The stages are entirely internal to the binary log commit procedure and do not affect anything else. In theory, it would be possible to have another replication implementation with another policy for ordering commits. Since the commit procedure is separated into stages, several threads can be processing transactions at the same time, which also improves throughput. <table class="figure-inflow"><caption>Figure 5.
Commit Procedure Stages</caption>
<tr><td> <object title="Commit Procedure Stages"
data="http://www.kindahl.net/images/stages.svg" type="image/svg+xml"
width="560.28125" height="126"> </object> </td></tr>
</table>For each stage, there is an input queue where sessions queue up for processing. If a session registers in an empty queue, it is considered the <em class="def">stage leader</em>; otherwise, the session is a <em
class="def">follower</em>. Stage leaders will bring all the sessions in the queue through the stage and then register the leader and all followers for the next stage. Followers will move off to the side and wait for a leader to signal that the entire commit is done. Since it is possible that a leader registers in a non-empty queue, a leader can decide to become a follower and go off waiting as well, but a follower can never become a leader.<p>When a leader enters a stage, it will grab the entire queue in one go and process it in order according to the stage. After the queue is grabbed, other sessions can register for the stage while the leader processes the old queue.<p>In the <strong>flush stage</strong>, all the sessions that registered will have their caches written to the binary log. Since all the transactions are written to an internal I/O cache, the last part of the stage is writing the memory cache to disk (which means it is written to the file pages, also in memory).<p>In the <strong>sync stage</strong>, the binary log is synced to disk according to the setting of <code>sync_binlog</code>. If <code>sync_binlog=1</code>, all sessions that were flushed in the flush stage will be synced to disk each time.<p>In the <strong>commit stage</strong>, the sessions that registered for the stage will be committed in the engine in the order they registered; all work here is done by the stage leader. Since order is preserved in each stage of the commit procedure, the writes and the commits will be made in the same order.<p>After the commit stage has finished executing, all sessions that were in the queue for the commit stage will be marked as done and will be signaled that they can continue. Each session will then return from the commit procedure and continue executing.<p>Thanks to the fact that the leader registers for the next queue and is ready to become a follower, the stage that is slowest will accumulate the most work.
This is typically the sync stage, at least for normal hard disks. However, it is critical to fill the flush stage with as many transactions as possible, so the flush stage is treated a little specially.<p>In the flush stage, the leader will skim the sessions one by one from the flush queue (the input queue for the flush stage). This continues as long as there are sessions left in the queue and the first session was unqueued less than <code>binlog_max_flush_queue_time</code> microseconds ago. There are two conditions that can stop the process: <ul><li>If the queue is empty, the leader immediately advances to the next stage and registers all processed sessions in the sync stage queue.</li>
<li>If the timeout was reached, the entire queue is grabbed and the sessions' transaction caches are flushed (as before). The leader then advances to the sync stage.</li>
</ul><!-- Benchmarks --> <table class="figure-right"><caption>Figure 6. Comparing 5.6.5 and 5.6 June labs release</caption>
<tr><td> <object title="Comparing 5.6.5 and 5.6 June labs release"
data="http://www.kindahl.net/images/with-binlog.svg" type="image/svg+xml"
width="650" height="400"> </object> </td></tr>
</table><h2>Performance, performance, performance...</h2>I'm sure you all wonder what the improvements are, so without further delay, let's have a look at the results of some benchmarks we have done on the labs tree. There have been several improvements that are not related to the binary log, so I will just focus on the results involving the binary log. In Figure 6 you see a benchmark comparing the 5.6.5 release with the 5.6 June labs release using the binary log. These benchmarks were executed on a 2.00 GHz 48-core Intel® Xeon® 7540 with 512 GiB memory, using SSD disks.<p>As you can see, the throughput has increased tremendously, with increases ranging from a little less than 2 times to approaching 4 times that of 5.6.5. To a large extent, the improvements are in the server itself, but what is interesting is that with binary log group commit, the server is able to keep the pace. Even with <code>sync_binlog=0</code> on 5.6.5 and <code>sync_binlog=1</code> on the 5.6 labs release, the 5.6 labs release outperforms 5.6.5 by a big margin.<p>Another interesting aspect is that with <code>sync_binlog=1</code> the server performs nearly as well as when using <code>sync_binlog=0</code>. At higher numbers of connections (roughly more than 50), the difference in throughput varies from around 0%, with a more typical difference between 5% and 10%. However, there is a drop of roughly 20% in the lower range. This looks very strange, especially in light of the fact that the performance is almost equal in the higher range, so what is causing that drop and is there anything that can be done about it? <table class="figure-right"><caption>Figure 7. Benchmark of Binary Log Group Commit</caption>
<tr><td> <object title="Benchmarks of Binary Log Group Commit"
data="http://www.kindahl.net/images/detailed.svg" type="image/svg+xml"
width="650" height="400"> </object> </td></tr>
</table>The answer comes from some internal benchmarks done while developing the feature. For these tests we were using Sysbench on a 64-core Intel® Xeon® X7560 running at 2.27 GHz with 126 GB memory and an HDD.<p>In the benchmarks that you can see in Figure 7, the enhanced version of 5.6 is compared with and without binary log group commit. The enhanced version of 5.6 includes some optimizations to improve performance that are not available in the latest 5.6 DMR, but most are available in the labs tree. However, these are benchmarks done while developing, so it is not really possible to compare them with Figure 6 above, but they will help explain why we have a 20% drop at lower numbers of connections.<p>The bottom line in Figure 7 is the enhanced 5.6 branch without binary log group commit and using <code>sync_binlog=1</code>, which does not scale very well. (This is nothing new, and is why implementing binary log group commit is a good idea.) Note that even at <code>sync_binlog=0</code> the staged commit architecture scales better than the old implementation. If you look at the other lines in the figure, you can see that the binary log group commit implementation running with <code>sync_binlog=1</code> outperforms the enhanced 5.6 branch running with <code>sync_binlog=0</code> above roughly 105 simultaneous threads.<p>Also note that the difference between <code>sync_binlog=1</code> and <code>sync_binlog=0</code> diminishes as the number of simultaneous connections increases, to vanish completely at roughly 160 simultaneous connections. We haven't made a deep analysis of this, but while using the performance schema to analyze the performance of each individual stage, we noted that the sync time completely dominated the performance (no surprise there, just giving the background), and that all available transactions "piled up" in the sync stage queue.
Since each connection can have at most one ongoing transaction, it means that at 32 connections there can never be more than 32 transactions in the queue. As a matter of fact, one can expect that over a long run roughly half of the connections are in the queue and half of the connections are inside the sync stage (this was also confirmed in the measurements mentioned above), so at lower numbers of connections it is just not possible to fill the queue enough to utilize the system efficiently.<p>The conclusion is that reducing the sync time would probably make the difference between <code>sync_binlog=0</code> and <code>sync_binlog=1</code> smaller even at low numbers of connections. We didn't do any benchmarks using disks with battery-backed caches (which should reduce the sync time significantly, if not eliminate it entirely), but it would be really interesting to see the effect of that on performance. <!-- Summary and Links --> <h2>Summary and closing remarks</h2><ul><li>The binary logging code has been simplified and optimized, leading to improved performance even when using <code>sync_binlog=0</code>.</li>
<li>The <code>prepare_commit_mutex</code> is removed from the code and instead the server orders transactions correctly.</li>
<li>Transactions can be written and committed as groups without losing any transactions, giving around 3 times better performance with both <code>sync_binlog=1</code> and <code>sync_binlog=0</code>.</li>
<li>The difference between <code>sync_binlog=0</code> and <code>sync_binlog=1</code> is small and diminishes as the load on the system increases.</li>
<li>Existing storage engines benefit from binary log group commit since there are no changes to the handlerton interface.</li>
</ul>Binary log group commit is one of a range of important new enhancements to replication in MySQL 5.6, including global transaction IDs (GTIDs), multi-threaded slave, crash safe slave and binary log, replication event checksums, and some more. You can learn more about all of these from our DevZone article: <blockquote><a href="http://dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html">dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html</a></blockquote><p>You can also try out binary log group commit today by downloading the latest MySQL 5.6 build that is available on <a
href="http://labs.mysql.com">labs.mysql.com</a><br />
<br />
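As a summary of the mechanics, the three stages can be modeled with a deliberately simplified Python sketch: the point is that one fsync() call covers a whole batch of transactions, while both write order and commit order are preserved. This is a toy model for illustration, not server code:

```python
fsync_calls = 0   # how many times the "disk" was synced
binlog = []       # transactions in binary log write order
committed = []    # transactions in engine commit order

def commit_batch(batch):
    """Run one batch of sessions through the flush, sync, and commit stages."""
    global fsync_calls
    # Flush stage: the leader writes every session's cache to the binary log.
    binlog.extend(batch)
    # Sync stage: a single fsync() covers the whole batch (sync_binlog=1).
    fsync_calls += 1
    # Commit stage: commit in the engine in the same order as written.
    committed.extend(batch)

for batch in [["trx1", "trx2", "trx3"], ["trx4", "trx5"]]:
    commit_batch(batch)
# Five transactions are made durable with only two fsync() calls, and the
# commit order matches the binary log order.
```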
<h3>Related links</h3><ul><li>It all started with <a href="http://www.facebook.com/note.php?note_id=438641125932">this post</a> where Mark points out the problem and shows some results of their implementation.<br />
<li>The next year, <a href="http://kristiannielsen.livejournal.com/">Kristian Nielsen</a> implemented binary log group commit for MariaDB and has a lot of <a href="http://kristiannielsen.livejournal.com/12254.html">good</a> <a href="http://kristiannielsen.livejournal.com/12408.html">posts</a> on the <a href="http://kristiannielsen.livejournal.com/12553.html">technical challenges</a> in implementing it. This implementation is using an atomic queue and does flushing and syncing of the transactions as a batch, after which the sessions are signalled in order and commit their transactions.<br />
</ul>Mats Kindahl

<h2>Pythonic Database API: Now with Launchpad (2012-02-20)</h2>In a <a href="http://mysqlmusings.blogspot.com/2011/09/python-interface-to-mysql.html">previous post</a>, I demonstrated a simple Python database API with a syntax similar to jQuery. The goal was to provide a simple API that would allow Python programmers to use a database without having to resort to SQL, nor having to use any of the good, but quite heavy, ORM implementations that exist. The code was just an experimental implementation, and I was considering putting it up on Launchpad.<br />
I did some basic cleaning of the code, turned it into a Python package, and <a href="https://launchpad.net/mysql-python-api">pushed it to Launchpad</a>. I also added some minor changes, such as introducing a <code class="symbol">define</code> function to define new tables instead of automatically creating one when an insert was executed. Automatically constructing a table from values seems neat, but in reality it is quite difficult to ensure that it has the right types for the application. Here is a small code example demonstrating how to use the <code class="symbol">define</code> function together with some other operations. <br />
<pre class="brush: python;">import mysql.api.simple as api

server = api.Server(host="example.com")
server.test_api.tbl.define(
    { 'name': 'more', 'type': int },
    { 'name': 'magic', 'type': str },
)
items = [
    {'more': 3, 'magic': 'just a test'},
    {'more': 3, 'magic': 'just another test'},
    {'more': 4, 'magic': 'quadrant'},
    {'more': 5, 'magic': 'even more magic'},
]
for item in items:
    server.test_api.tbl.insert(item)
</pre>The table is defined by providing a dictionary for each column that you want in the table. The two most important fields in the dictionary are <var>name</var> and <var>type</var>. The <var>name</var> field is used to supply a name for the column, and the <var>type</var> field is used to provide a type for the column. The type is denoted using a basic Python type constructor, which then maps internally to a SQL type. So, for example, <code class="symbol keyword">int</code> maps to the SQL <code>INT</code> type, and <code class="symbol keyword">bool</code> maps to the SQL type <code>BIT(1)</code>. The choice to use Python types is simply because it is more natural for a Python programmer to define the tables from the data that the programmer wants to store in the database. In this case, I would be less concerned with how the types are mapped, just assuming that they are mapped in a way that works. It is currently not possible to register your own mappings, but that is easy to add. So, why provide the type object and not just a string with the type name? The idea I had here is that since Python has introspection (it is a dynamic language, after all), it would be possible to add code that reads the provided type objects and does things with them, such as figuring out what fields there are in the type. It's not that I plan to implement this, but even though this is intended to be a simple database interface, there is no reason to tie one's hands from the start, so this simple approach will provide some flexibility if needed in the future. <br />
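Internally, such a mapping can be as simple as a dictionary keyed on the Python type constructors. The sketch below is illustrative only: the int and bool mappings follow the post, while the remaining entries are made-up placeholders, not the package's actual choices:

```python
# Hypothetical Python-to-SQL type mapping: int -> INT and bool -> BIT(1)
# are stated in the post; str and float are illustrative guesses.
TYPE_MAP = {
    int: "INT",
    bool: "BIT(1)",
    str: "VARCHAR(255)",
    float: "DOUBLE",
}

def column_definition(field):
    # field is a dictionary like {'name': 'more', 'type': int}
    return "%s %s" % (field["name"], TYPE_MAP[field["type"]])

column_definition({"name": "more", "type": int})  # → 'more INT'
```

Registering user-defined mappings would then just be a matter of adding entries to the dictionary.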
<h2>Links</h2>Some additional links that you might find useful:<br />
<dl><dt><a href="http://launchpad.net/myconnpy">Connector/Python</a> </dt>
<dd>You need to have Connector/Python installed to be able to use this package. </dd>
<dt><a href="http://sequelizejs.com/">Sequelize</a> </dt>
<dd>This is a JavaScript library that provides a similar interface to a database. It claims to be an ORM layer, but is not really. It is more similar to what I have written above. </dd>
<dt><a href="http://rpbouman.blogspot.com/">Roland's</a> <a href="http://code.google.com/p/mql-to-sql/">MQL to SQL</a> and <a href="http://www.slideshare.net/rpbouman/mql-tosql-a-jsonbased-rdbms-query-language">Presentation on SlideShare</a> are also some inspiration for alternatives. </dt>
</dl><br />
Mats Kindahl

<h2>MySQL: Python, Meta-Programming, and Interceptors (2012-01-23)</h2>I recently found Todd's <a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-lifecycle-interceptors/" >posts</a> <a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-statement-interceptors/" >on</a> <a href="http://mysqlblog.fivefarmers.com/2011/11/21/connectorj-extension-points-%E2%80%93-exception-interceptors/">interceptors</a>, which allow callbacks (called <em>interceptors</em>) to be registered with the connector so that you can intercept a statement execution, a commit, or any of the many extension points supported by Connector/Java. This feature allows you to implement a number of things without having to change the application code, such as load-balancing policies, profiling of queries or transactions, or debugging of an application.<p>Since Python is a dynamic language, it is easy to add interceptors to <em>any</em> method in Connector/Python, without having to extend the connector with specific code. This is possible in dynamic languages such as Python, Perl, and JavaScript, and even in some lesser known languages such as Lua and Self. In this post, I will describe how, and also give an introduction to some of the (in my view) more powerful features of Python.<p>In order to create an interceptor, you need to be able to do these things: <ul><li>Catch an existing method in a class and replace it with a new one.</li>
<li>Call the original function, if necessary.</li>
<li>For extra points: catch an existing method in an <em>object</em> and replace it with a new one.</li>
</ul>In this post, you will see how all three of these problems are solved in Python. You will see how to use <em>decorators</em> to define methods in existing classes and objects, and closures to call the original version of the methods. With this approach, it is not necessary to change the implementation: in fact, you can use this code to replace <em>any</em> method in <em>any</em> class, not only in Connector/Python.<p><table class="figure-right"><caption>Table 1. Attributes for methods</caption>
<tr> <th></TH> <th colspan="2">Method Instance</TH> </TR>
<tr> <th>Name</TH> <th>Unbound</TH> <th>Bound</TH></TR>
<tr> <td><code class="symbol">__name__</code></TD> <td align="center" colspan="2">Name of Method</TD> </TR>
<tr> <td><code class="symbol">im_func</code></TD> <td align="center" colspan="2">"Inner" function of the method</TD> </TR>
<tr> <td><code class="symbol">im_self</code></TD> <td><code class="constant">None</code></TD> <td align="center">Class instance for the method</TD> </TR>
<tr> <td><code class="symbol">im_class</code></TD> <td align="center" colspan="2">Class that the method belongs to</TD> </TR>
</table>In addition to being able to replace methods in the class, we would also like to be able to replace methods in instances of a class ("objects" in the traditional sense). This is useful to create specialized objects, for example for tracking particular cases where a method is used.<p>In order to understand how the replacement works, you should understand that in Python (and the dynamic languages mentioned above), all objects can have attributes, including classes, functions, and a bunch of other esoteric constructions. Each type of object has a set of pre-defined attributes with well-defined meaning. For classes (and class instances), methods are stored as attributes of the class (or class instance) and can therefore be replaced with other methods that you build dynamically. However, it requires some tinkering to take an existing "normal" function definition and "imbue" it with whatever "tincture" makes it behave as a method of the class or class instance.<p>Depending on where the method comes from, it can be either <em>unbound</em> or <em>bound</em>. Unbound methods are roughly equivalent to member function pointers in C++: they reference a function, but not the instance. In contrast, bound methods have an instance tied to them, so when you call them, they already know what instance they belong to and will use it. Methods have a set of attributes, of which the four in Table 1 interest us. If a method is fetched from a class (to be precise, from a class object), it will be unbound and <code>im_self</code> will be <code>None</code>. If the method is fetched from a class <em>instance</em>, it will be bound and <code class="symbol">im_self</code> will be set to the instance it belongs to. These attributes are all the "tincture" you need to make your own instance methods. The code for doing the replacement described above is simply:<br />
<pre class="brush: python;">import functools, types
def replace_method(orig, func):
    functools.update_wrapper(func, orig.im_func)
    new = types.MethodType(func, orig.im_self, orig.im_class)
    obj = orig.im_self or orig.im_class
    setattr(obj, orig.__name__, new)
</pre>The function uses two standard modules to make the job simpler, but the steps are: <ol><li>Copy the meta-information from the original method function to the new function using <code>update_wrapper</code>. This copies the name, module information, and documentation from the original method function to make it look like the original method.</li>
<li>Create a new method instance from the method information of the original method using the constructor <code>MethodType</code>, but replace the "inner" function with the new function.</li>
<li>Install the new instance method in the class or instance by replacing the attribute denoting the original method with the new method. Depending on whether the function is given a bound or unbound instance, either the method in the class or in the instance is replaced.</li>
</ol>Using this function you can now replace a method in a class like this:<br />
<pre class="brush: python;">from mysql.connector import MySQLCursor
def my_execute(self, operation, params=None):
    ...

replace_method(MySQLCursor.execute, my_execute)
</pre>This is already pretty useful, but note that you can also replace only a specific instance as well by using <code>replace_method(cursor.execute, my_execute)</code>. It was not necessary to change anything inside Connector/Python to intercept a method there, so you can actually apply this to any method in any of the classes in Connector/Python that you already have available. In order to make it even easier to use you'll see how to define a <em>decorator</em> that will install the function in the correct place at the same time as it is defined. The code for defining a decorator and an example usage is: <pre class="brush: python;">import functools, types
from mysql.connector import MySQLCursor
def intercept(orig):
    def wrap(func):
        functools.update_wrapper(func, orig.im_func)
        meth = types.MethodType(func, orig.im_self, orig.im_class)
        obj = orig.im_self or orig.im_class
        setattr(obj, orig.__name__, meth)
        return func
    return wrap
# Define a function using the decorator
@intercept(MySQLCursor.execute)
def my_execute(self, operation, params=None):
    ...
</pre>The <code>@intercept</code> line before the definition of <code>my_execute</code> is where the new decorator is used. The syntax is a shorthand for applying a function to the function being defined. It behaves as if the following code had been executed: <pre class="brush: python;">def _temporary(self, operation, params=None):
    ...
my_execute = intercept(MySQLCursor.execute)(_temporary)
</pre>As you can see here, whatever is given after the <code>@</code> is used as a function and called with the function-being-defined as argument. This explains why the <code>wrap</code> function is returned from the decorator (it will be called with a reference to the function that is being defined), and also why the original function is returned from the <code>wrap</code> function (the result will be assigned to the function name).<p>Using a statement interceptor, you can catch the execution of statements and do some special magic on them. In our case, let's define an interceptor to catch the execution of a statement and log the result using the standard <a href="http://docs.python.org/library/logging.html"><code>logging</code></a> module. If you read the <code>wrap</code> function carefully, you probably noted that it uses a <dfn>closure</dfn> to access the value of <var>orig</var> when the decorator was <em>called</em>, not the value it happens to have when the <code>wrap</code> function is executed. This feature is very useful since a closure can also be used to get access to the original <code>execute</code> function and call it from within the new function. So, to intercept an execute call and log information about the statement using the <a href="http://docs.python.org/library/logging.html"><code>logging</code></a> module, you could use code like this: <pre class="brush: python;">from mysql.connector import MySQLCursor
import logging

original_execute = MySQLCursor.execute

@intercept(MySQLCursor.execute)
def my_execute(self, operation, params=None):
    if params is not None:
        stmt = operation % self._process_params(params)
    else:
        stmt = operation
    result = original_execute(self, operation, params)
    logging.debug("Executed '%s', rowcount: %d", stmt, self.rowcount)
    logging.debug("Columns: %s", ', '.join(c[0] for c in self.description))
    return result
</pre>Now with this, you could implement your own caching layer to, for example, do a memcached lookup before sending the statement to the server for execution. I leave this as an exercise to the reader, or maybe I'll show you in a later post. :) Implementing a lifecycle interceptor is similar, except that you replace, for example, the commit or rollback calls. However, implementing an exception interceptor is not obvious. Catching the exception is straightforward and can be done using the <code>intercept</code> decorator: <pre class="brush: python;">original_init = ProgrammingError.__init__
@intercept(ProgrammingError.__init__)
def catch_error(self, msg, errno):
    logging.debug("This statement didn't work: '%s', errno: %d", msg, errno)
    original_init(self, msg, errno=errno)
</pre>However, in order to do something more interesting, such as asking for some additional information from the database, it is necessary to either get hold of the cursor that was used to execute the query, or at least the connection. It is possible to dig through the interpreter stack, or try to override one of the internal methods that Connector/Python uses, but since that is very dependent on the implementation, I will not present that in this post. It would be good if the cursor were passed down to the exception constructor, but this requires some changes to the connector code.<p>Even though I have been programming in dynamic languages for decades (literally), it always amazes me how easy it is to accomplish things in these languages. If you are interested in playing around with this code, you can always fetch Connector/Python on Launchpad and try out the examples above. Some links and other assorted references related to this post are: <ul><li>Connector/Python is found at <a href="https://launchpad.net/myconnpy" >launchpad.net/myconnpy</a></li>
<li>Geert has a number of excellent posts on Connector/Python under <a href="http://geert.vanderkelen.org/" >geert.vanderkelen.org</a>. Also, <a href="http://geert.vanderkelen.org/post/817" >as you might already know</a>, he is now working on developing Connector/Python and he's always interested in comments and suggestions. :)<br />
<li>Todd's Blog <a href="http://mysqlblog.fivefarmers.com/" >mysqlblog.fivefarmers.com</a> is always interesting to read, and the articles on interceptors that I read are: <ul><li><a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-lifecycle-interceptors/" >Lifecycle Interceptors</a><br />
<li><a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-statement-interceptors/" >Statement Interceptors</a><br />
<li><a href="http://mysqlblog.fivefarmers.com/2011/11/21/connectorj-extension-points-%E2%80%93-exception-interceptors/">Exception Interceptors</a> </ul></ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com3tag:blogger.com,1999:blog-23496029.post-12977034267215179092011-09-26T10:27:00.001+02:002011-09-26T10:27:28.008+02:00Python Interface to MySQLThere has been a lot of discussion lately about various non-SQL languages that provide access to databases without having to resort to using SQL. I wondered how difficult it would be to implement such an interface, so as an experiment, I implemented a simple interface in Python that is similar to the document-oriented interfaces available elsewhere. The interface generates SQL queries to query the database, but does not require any knowledge of SQL to use. The syntax is inspired by <a href="http://jquery.org">JQuery</a>, but since JQuery works with documents, the semantics is slightly different.<p>
A simple example would look like this:
<pre class="code lineno">
from native_db import *
server = Server(host='127.0.0.1')
server.test.t1.insert({'more': 3, 'magic': 'just a test', 'count': 0})
server.test.t1.insert({'more': 3, 'magic': 'just another test', 'count': 0})
server.test.t1.insert({'more': 4, 'magic': 'quadrant', 'count': 0})
server.test.t1.insert({'more': 5, 'magic': 'even more magic', 'count': 0})
for row in server.test.t1.find({'more': 3}):
    print "The magic is:", row['magic']
server.test.t1.update({'more': 3}, {'count': 'count+1'})
for row in server.test.t1.find({'more': 3}, ['magic', 'count']):
    print "The magic is:", row['magic'], "and the count is", row['count']
server.test.t1.delete({'more': 5})
</pre>
The first line defines a server to communicate with, which is simply done by creating a <code>Server</code> object with the necessary parameters. The constructor accepts the normal parameters for Connector/Python (which is what I'm using internally), but the user defaults to whatever <code>getpass.getuser()</code> returns, and the host defaults to 127.0.0.1, even though I've provided it here.<p>
After that, the necessary methods are overridden so that <code><em>server</em>.<em>database</em>.<em>table</em></code> will refer to the table with name <em>table</em> in the database with name <em>database</em> on the given server. One possibility would be to just skip the database and go directly to the table (using some default database name), but since this is just an experiment, I did this instead. In addition, there are various methods defined to support searching, inserting, deleting, and updating.<p>
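The attribute chaining just described can be sketched with <code>__getattr__</code>. This is a minimal illustration of the technique with class names I made up, not the actual native_db code:

```python
class Table(object):
    def __init__(self, database, name):
        self.database, self.name = database, name

    @property
    def qualified_name(self):
        # Fully qualified name, suitable for use in generated SQL
        return "%s.%s" % (self.database.name, self.name)

class Database(object):
    def __init__(self, server, name):
        self.server, self.name = server, name

    def __getattr__(self, name):
        # Any unknown attribute is treated as a table name
        return Table(self, name)

class Server(object):
    def __getattr__(self, name):
        # Any unknown attribute is treated as a database name
        return Database(self, name)
```

With these three classes, <code>Server().test.t1</code> resolves to a <code>Table</code> object whose qualified name is <code>test.t1</code>, which is all the interface needs to build statements.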
Since this is intended to be a simple interface, autocommit is on. Each of the functions generates a single SQL statement, so they will be executed atomically if you're using InnoDB.<p>
<dl>
<dt><em>table</em>.insert(<em>row</em>)
<dd>This function will insert the contents of the dictionary into the table, using the keys of the dictionary as column names. If the table does not exist, it will be created with a "best effort" guess of what types to use for the columns.
<dt><em>table</em>.delete(<em>condition</em>)
<dd>This function will remove all rows in the table that match the supplied dictionary. Currently, only equality mapping is supported, but see below for how it could be extended.
<dt><em>table</em>.find(<em>condition</em>, <em>fields</em>="*")
<dd>This will search the table and return an iterable to the rows that match <em>condition</em>. If <em>fields</em> is supplied (as a list of field names), only those fields are returned.
<dt><em>table</em>.update(<em>condition</em>, <em>update</em>)
<dd>This will search for rows matching <em>condition</em> and update each matching row according to the <em>update</em> dictionary. The values of the dictionary are used on the right side of the assignments of the <code>UPDATE</code> statement, so expressions can be given here as strings.
</dl>
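To make the mapping concrete, here is a rough sketch (my own illustration, not the actual native_db code) of how <code>find</code> could turn a condition dictionary into a <code>SELECT</code> statement with connector-style placeholders:

```python
def find_sql(table, condition, fields="*"):
    # Build the projection: either "*" or a comma-separated field list
    if fields != "*":
        fields = ", ".join(fields)
    # Equality-only conditions, as in the post; values are passed to the
    # connector as parameters rather than spliced into the string.
    where = " AND ".join(
        "%s = %%(%s)s" % (key, key) for key in sorted(condition)
    )
    return "SELECT %s FROM %s WHERE %s" % (fields, table, where)
```

For example, <code>find_sql("test.t1", {'more': 3}, ['magic', 'count'])</code> produces a statement the connector can execute with the condition dictionary as its parameter set.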
<h2>That's all folks!</h2>
The code is available at <a href="http://mats.kindahl.net/python/native_db.py">http://mats.kindahl.net/python/native_db.py</a> if you're interested in trying it out. The code is very basic, and there's potential for a lot of extensions. If there's interest, I could probably create a repository somewhere.<p>
Note that this is not a replacement for an ORM library. The intention is not to allow storing arbitrary objects in the database: the intention is to be able to query the database using a Python interface without resorting to using SQL.<p>
I'm just playing around and testing some things out, and I'm not really sure if there is any interest in anything like this, so what do you think? Personally, I have no problems with using SQL, but since I'm working with MySQL on a daily basis, I'm strongly biased on the subject. For simple jobs, this is probably easier to work with than a "real" SQL interface, but it cannot handle as complex queries as SQL can (at least not without extensions).<p>
There are a number of open issues for the implementation (this is just a small list of obvious ones):
<dl>
<dt><strong>Only equality searching supported</strong>
<dd>Searching can only be done with equality matches, but it is trivial to extend to support more complex comparisons. To allow more complex conditions, the condition supplied to <code>find</code>, <code>delete</code>, and <code>update</code> can actually be a string, in which case it is used "raw".<p>
Conditions could be extended to support something like <code>{'more': '>3'}</code>, or a more object-oriented approach would be to support something similar to <code>{'more': operator.gt(3)}</code>.<p>
<dt><strong>No support for indexes</strong>
<dd>There's no support for indexes yet, but that can easily be added. The complication is <em>what</em> kind of indexes should be generated.<p>
For example, right now rows are identified by their content, but what if we want identical rows to be handled as a set? Imagine the following (not supported) query where we insert some rows and then operate on exactly those rows:
<blockquote><code>server.test.t1.insert(<em>content with some more=3</em>).find({'more': eq(3)})</code></blockquote>
In this case, we have to fetch the row identifiers for the inserted rows to be able to manipulate <em>exactly</em> those rows and no others. I am not sure how to do this right now, but auto-inventing a row identifier would mean that tables lacking one cannot be handled naturally.<p>
<dt><strong>Creating and dropping tables</strong>
<dd>The support for creation of tables is to create tables automatically if they do not exist. A simple heuristic is used to figure out the table definition, but this has obvious flaws if later inserts have more fields than the first one.<p>
To support extending the table, one would have to generate an <code>ALTER TABLE</code> statement to "fix" the table.<p>
There is no support for dropping tables... or databases.
</dl>
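The comparison extension mentioned above could, for instance, map tuple values to SQL operators. This is a sketch of one possible design, not anything implemented in the published code:

```python
# Allowed comparison operators and their SQL spellings
_OPS = {'=': '=', '>': '>', '<': '<', '>=': '>=', '<=': '<=', '!=': '<>'}

def condition_sql(condition):
    """Render a condition dictionary as a SQL boolean expression.

    Plain values mean equality; an (op, value) tuple selects another
    comparison, e.g. {'more': ('>', 3)} renders as "more > 3".
    """
    parts = []
    for field in sorted(condition):
        value = condition[field]
        op = '='
        if isinstance(value, tuple):
            op, value = value
        parts.append("%s %s %r" % (field, _OPS[op], value))
    return " AND ".join(parts)
```

The same idea would also accommodate the raw-string escape hatch: if the whole condition is a string, use it verbatim; otherwise render it as above.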
Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com0tag:blogger.com,1999:blog-23496029.post-5670392311903820002011-07-27T15:12:00.002+02:002011-07-27T15:59:30.805+02:00Binlog Group Commit Experiments<h1>Binlog Group Commit Experiments</h1>
It has been a while since I <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">talked</a> <a href="http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html">about</a> binary log group commit. I had to spend time on a <a href="http://mysqlmusings.blogspot.com/2011/02/slave-type-conversions.html">few</a> <a href="http://mysqlmusings.blogspot.com/2011/04/replication-event-checksum.html">other</a> <a href="http://mysqlmusings.blogspot.com/2011/04/round-robin-multi-source-in-pure-sql.html">things</a>.<p>
Since then, Kristian has <a href="https://lists.launchpad.net/maria-developers/msg03693.html">released a version of binary log group commit</a> that seems to <a href="http://www.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933">work well</a>.
However, for a few reasons that will be outlined below, we decided to do experiments ourselves using the approach that I have described earlier. A <em>very</em> early version of what we will start doing benchmarks on is available at the <a href="">MySQL labs</a>. We have not done any benchmarking on this approach before OSCON, so we'll have to get back on that.<p>
All of this started with <a href="http://www.facebook.com/note.php?note_id=438641125932">Facebook pointing out a problem</a> in how the group commit interacts with the binary log and proposing a way to handle binary log group commit by demonstrating a patch to solve the problem.
<h2>What's in the patch</h2>
The patch involves implementing logic for handling binary log group commit and parallel writing of the binary log, including a minor change to the handler protocol by adding a <code>persist</code> callback. The extension of the handler interface is strictly speaking not necessary for the implementation, but it is natural to extend the interface in this manner and I believe that it can be used by storage engines to execute more efficiently.
In addition to the new logic, three new options were added and one option was created as an alias of an old option.
<dl>
<dt><code>binlog-sync-period=</code><var>N</var>
<dd>This is just a rename of the old <code>sync-period</code> option, which tells the server that <code>fsync</code> should be called for the binary log every <var>N</var> events. For many of the old options, it is not clear what they are configuring, so we are adding the <code>binlog-</code> prefix to options that affect the binary log. The old option is kept as an alias for this option.
<dt><code>binlog-sync-interval=</code><var>msec</var>
<dd>No transaction commit will wait for more than <var>msec</var> milliseconds before calling <code>fsync</code> on the binary log. If set to zero, it is disabled. You can set both this option and the <code>binlog-sync-period</code> option.
<dt><code>binlog-trx-committed={COMPLETE,DURABLE}</code>
<dd>A transaction is considered committed when it is either in durable store or when it is completed. If set to <code>DURABLE</code> either <code>binlog-sync-interval</code> or <code>binlog-sync-period</code> has to be non-zero. If they are both zero, transactions will not be flushed to disk and hence they will never be considered durable.
<dt><code>master-trx-read={COMPLETE,DURABLE}</code>
<dd>A transaction is read from the binary log when it is completed or when it is durable. If set to <code>DURABLE</code>, either <code>binlog-sync-interval</code> or <code>binlog-sync-period</code> has to be non-zero or an error will be generated. If both were zero, no transactions would ever be read from the binary log and hence never sent out.
</dl>
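For illustration, a configuration using these options might look like the fragment below. This is a sketch against the experimental patch only: the option names exist just in that patch, and the values are arbitrary examples.

```ini
[mysqld]
# fsync the binary log at least every 1000 events ...
binlog-sync-period=1000
# ... and never let a commit wait more than 50 ms for durability
binlog-sync-interval=50
# consider a transaction committed once it is completed in memory
binlog-trx-committed=COMPLETE
# but only send durable transactions to connecting slaves
master-trx-read=DURABLE
```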
The patch also contains code to eliminate the <code>prepare_commit_mutex</code> as well as moving the release of row locks inside InnoDB (not completely applied yet, I will get it there as soon as possible) to the prepare phase. The focus of these changes is that we should maintain consistency, so we have not made any aggressive changes like moving the release of the write locks to the prepare phase: that could possibly lead to inconsistencies.<p>
<table class="figure-right">
<caption>Figure 1. Binary log with transaction in different stages</caption>
<tr>
<td>
<object data="http://images.kindahl.net/trans-stages.svg" width="100" height="300" type="image/svg+xml">
</object>
</td>
</tr>
</table>
The main changes are about how a transaction is committed. The details are explained in the previous articles, but for understanding the rest of this blog post, I'll briefly recapitulate how a transaction is committed in this solution. Each transaction passes through three states: <em>prepared</em>, <em>completed</em> (committed to memory), and <em>durable</em> (committed to disk), as seen in Figure 1. The transaction is pushed through these states using the following procedure:
<ol>
<li>The transaction is first <strong>prepared</strong>, which is now split into two steps:
<ol>
<li>In the <strong>reserve</strong> step, a slot is assigned for the transaction in the binary log and the storage engine is asked to check if this transaction can be committed. At this point, the storage engine can abort the transaction if it is unable to fulfill the commit, but if it approves of the commit, the only thing that can abort the transaction after this point is a server crash. This check is currently done using the <code>prepare</code> call. This step is executed with a lock, but is intended to be short.</li>
<li>In the <strong>persist</strong> step, the <code>persist</code> function is called, which asks the storage engine to persist any data that it needs to persist to guarantee that the transaction is fully prepared. After this step is complete, the transaction is fully prepared in the storage engine and in the event of a crash, it will be able to commit the transaction on recovery, if asked to do so. This step is executed without a lock and a storage engine that intends to handle group commit should defer any expensive operations to this step.</li>
</ol></li>
<li>To record the decision, the transaction is written to the reserved slot in the binary log. Since the write is done to a dedicated place in the binary log reserved for this transaction, it is not necessary to hold any locks, which means that several threads can write their transactions to the binary log at the same time.</li>
<li>The <strong>commit</strong> phase is in turn split into two steps:
<ol>
<li>In the <strong>completion</strong> step, the thread waits for all preceding transactions to be fully written to the binary log, after which the transaction is <em>completed</em>, which means that it is logically committed but not necessarily in durable storage.</li>
<li>In the <strong>durability</strong> step, the thread waits for the transaction (and all preceding transactions) to be written to disk. If this does not occur within the given time period, it will itself call <code>fsync</code> for the binary log. This will make all completed transactions durable.</li>
</ol></li>
</ol>
After this procedure is complete, the transaction is fully committed and the thread can proceed with executing the next statement.
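The slot-based logic above can be illustrated with a toy model in Python (my own sketch of the idea, not the actual server code, which is in C++): transactions reserve a slot under a short lock, write to their own slot without a lock, and become completed once every preceding slot has been written.

```python
import threading

class BinlogSketch(object):
    """Toy model of slot-based binary log group commit."""

    def __init__(self):
        self.cond = threading.Condition()
        self.slots = []          # None marks a reserved but unwritten slot
        self.completed_upto = 0  # every slot below this index is written

    def reserve(self):
        # Short critical section: just allocate a slot for the transaction
        with self.cond:
            self.slots.append(None)
            return len(self.slots) - 1

    def write(self, slot, data):
        # Each thread writes to its own slot, so no lock is needed here
        self.slots[slot] = data
        with self.cond:
            # Advance the completion point past contiguously written slots
            while (self.completed_upto < len(self.slots)
                   and self.slots[self.completed_upto] is not None):
                self.completed_upto += 1
            self.cond.notify_all()

    def wait_completed(self, slot):
        # Block until this slot and all preceding slots are written
        with self.cond:
            while self.completed_upto <= slot:
                self.cond.wait()
```

The group effect falls out of the bookkeeping: if slot 1 finishes writing before slot 0, neither transaction is completed, and as soon as slot 0 is written both become completed at once.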
<h2>The different approaches</h2>
So, providing this patch raises the question: why a third version of binary log group commit?
There are three approaches: Facebook's patch (#1), Kristian's patch (#2), and my patch (#3). Before going over the rationale leading to a third version, it is necessary to understand how the Facebook patch and Kristian's patch work at a very high level. If you look at Figure 1, you see a principal diagram showing how the patches work. Both of them maintain a queue of threads with transactions to be written and will ensure that they are written in the correct order to the binary log.<p>
The Facebook patch ensures that the transactions are written in the correct order by signalling each thread waiting in the queue in the correct order, after which the thread will take a lock on the binary log, append the transaction, and release the lock. When the decision to commit the outstanding transactions is made, <code>fsync()</code> is called. It has turned out that this lock-write-unlock loop can only be executed at a certain speed, which means that as the number of threads waiting to write transactions increases, the system chokes and is not able to keep up.<p>
Kristian solves this by designating the first thread in the queue as the leader, and having it write the transactions for <em>all</em> threads in the queue instead of just having each thread do it individually, and then broadcasting to the other threads, which just return from the commit. This improves performance significantly as can be seen from the figures in the <a href="http://www.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933">measurements that Mark did</a>. Note, however, that a lock of the binary log is still kept while writing the transactions.<p>
The approach we are experimenting with goes about this in another way: instead of queueing the data to be written, a place is immediately allocated in the binary log, after which the thread proceeds to write the data. This means that several threads can write to the binary log in parallel without needing to hold any locks. There is a need for a lock when allocating space in the binary log, but that is very short. Since the threads can finish writing in a different order, it is necessary to keep logic around for deciding when a transaction is committed and when it's not. For details, you can look at the <a href="http://forge.mysql.com/worklog/task.php?id=5223">worklog</a> (which is not entirely up to date, but I'll fix that). In this sense, the binary log itself is the queue (there is a queue in the implementation, but this is just for bookkeeping).
The important differences leading us to want to have a look at this third version are:
<ul>
<li>Approaches #1 and #2 keep a lock while writing the binary log while #3 doesn't.</li>
<li>Approaches #1 and #2 keep the transactions on the side (in the queue) and write them to the binary log when they are being committed. Approach #3 writes the transactions directly to the binary log, possibly before they are committed.</li>
</ul>
<table class="figure-right">
<caption>Figure 2. Sources of performance problems</caption>
<tr>
<td>
<object data="http://images.kindahl.net/locking-issues.svg" width="320" height="450" type="image/svg+xml">
</object>
</td>
</tr>
</table>
<h2>Efficiently using Multiple Cores</h2>
Efficiently using a multi-threaded system, especially one with multiple cores, is very hard. It requires knowledge of hardware issues, operating system considerations, algorithms, and some luck. I will not cover all the issues revolving around designing a system for multi-core use, but I will focus on three of the parts that we are considering in this case. We split the sources of performance degradation when committing a transaction into three separate parts: CPU and memory issues, software lock contention, and I/O.
<ul>
<li>The CPU and memory issues have to do with how caches are handled on the CPU level, which can affect performance quite a lot. There are some things that can be done, such as avoiding false sharing, handling data alignment, and checking the cache access patterns, but in general, this is hard to add as an afterthought and requires quite a lot of work to get right. We are not considering this and view it as static.</li>
<li>The I/O can be reduced using either SSDs or RAID solutions (which do not reduce latency, but improve the throughput and therefore reduce the I/O needed for each transaction). Also, reducing the number of accesses to disk using group commits will improve the situation significantly, which is what we're doing here.</li>
<li>To reduce the software lock contention there is only one solution: reduce the time each lock is held. This can be as simple as moving the lock acquire and release, or using atomic primitives instead of locks, but it can also require re-designing algorithms to run without locks.</li>
</ul>
So, assuming that we reduce the I/O portion of committing a transaction—and <em>only</em> the I/O portion—as you can see in the figure on the right, the software lock time starts to become the problem and we need to start working on reducing that. To do this, there are not many options except the approach described above. And if we take this approach to reduce lock contention, there are just a few additions needed to get the group commit as well.<p>
Given this, it is rational to explore whether this solution can solve the group commit problem as well as the other solutions do and improve the scalability of the server at the same time.
<h2>Scaling out</h2>
One of the most central uses for replication is to achieve high availability by duplicating masters and replicating between them to keep both up to date. For this reason, it is important to get the changes over to the other master as fast as possible. In this case, whether the data is durable on the original master or not is of less concern: once the transaction has left the node, a crash will not cause the transaction to disappear since it has already been distributed. This means that for implementing multi-masters, we want replication to send transactions as soon as possible—and maybe even before that—since we can achieve high availability by propagating the information as widely as possible.<p>
On the other hand, transactions sent from the master to the slave <em>might</em> need to be durable on the master since otherwise the slave might be moving into an alternative future—a future where this transaction was committed—if the transactions sent to the slave are lost because of a crash. In this case, it is necessary for the master to not send out the transaction before it is in durable store.
Having a master that is able to send out both completed transactions and durable transactions at the same time, all based on the requirements of the slave that connects, is a great feature and allows the implementation of both an efficient multi-master solution as well as slaves that do not diverge from the master even in the event of crashes. Currently, a master cannot deliver both transactions that are <em>completed</em> and transactions that are <em>durable</em> at the same time. With the patch presented in this article, it is possible to implement this, but in alternatives #1 and #2 described above, all the transactions are kept "on the side" and not written to the binary log until they are being committed. This means that it is harder to support this scenario with the two other alternatives.
<h2>Concluding remarks</h2>
To sum up the discussion above: we are interested in exploring this approach since we think that it provides shorter lock time, hence scales better on multi-core machines, and in addition provides better scale-out capabilities, since the slaves can decide whether they want to receive durable or completed transactions.
Thanks to all in the community for the great work and discussions on binlog group commit. The next steps will be to benchmark this solution to see how it flies, and it would be great to also get some feedback on this approach. As always, we are interested in getting a good and efficient solution that can also be maintained and evolved easily.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com5tag:blogger.com,1999:blog-23496029.post-69039838180732345932011-04-13T16:23:00.004+02:002011-04-18T08:53:27.930+02:00Round-Robin Multi-Source in Pure SQLWith the <a href="crash-safe-replication.html">addition of the new tables to implement crash-safe replication</a> we also get access to replication information through the SQL interface. This might not seem like a big advantage, but it should not be taken lightly. To demonstrate the power of this approach, I will show how to implement the multi-source round-robin replication described in <a href="http://dom.as/2008/05/14/trainwreck-external-mysql-replication-agent/">other places</a> (including <a href="http://oreilly.com/catalog/9780596807290">our book</a>). However, compared to the other implementations—where a client has to parse the output of <code>SHOW SLAVE STATUS</code>—the twist is that the implementation is done entirely in the server, using pure SQL.<p>
If you're familiar with replication, you know that a slave can only replicate from a single master. The trick used to replicate from multiple masters—this is usually called <em>multi-source</em>—is to switch between masters in a time-share fashion as illustrated in Figure 1. The scheme used to pick the next master can vary, but it is common to use a round-robin schedule.<p>
The steps necessary to switch master are:
<ol>
<li>Stop reading events from the master and empty the relay log. Before switching to another master, it is necessary to ensure that there are no outstanding events in the relay log. If this is not done, some events will not be applied and will have to be re-fetched from the master.<ol>
<li>Stop the I/O thread.</li>
<li>Wait for the events in the relay log to be applied.</li>
<li>Stop the SQL thread.</li></ol></li>
<li>Save away the replication information.</li>
<li>Fetch the saved information about the next master to replicate from.</li>
<li>Change master using the new information.</li>
<li>Start the slave threads.</li>
</ol>
Simple, right? So let's make an implementation! What pieces do we need?
<ul>
<li>To handle the periodic switching, we use an SQL event for executing the above procedure.</li>
<li>We need a table to store the state of each master. The table should contain all the necessary information for configuring the master, including the binlog position.</li>
<li>We need to be able to store what master we're currently replicating from.</li>
</ul>
<h4>Saving state information</h4>
<table class="figure-right">
<caption><strong>Figure 1. </strong>Tables for storing information about masters</caption>
<tr><td><pre class="code">CREATE TABLE my_masters (
idx INT AUTO_INCREMENT PRIMARY KEY,
host VARCHAR(50), port INT DEFAULT 3306,
user VARCHAR(50), passwd VARCHAR(50),
  log_file VARCHAR(50), log_pos BIGINT,
UNIQUE INDEX (host,port,user)
) ENGINE=InnoDB;
CREATE TABLE current_master (
idx INT
) ENGINE=InnoDB;
</pre></td></tr></table>
We need two tables: a table <code>my_masters</code> to record information about the available masters and a table <code>current_master</code> that keeps track of the current master. The <code>my_masters</code> table will contain information on how to connect to the masters as well as the last seen position. We assume that the user and password information is stored in the table and won't save that information away when switching master. To store the current master being replicated from, we cannot use a user-defined variable—because each invocation of an event spawns a new session—so we store this information in a table.<p>
<h4>Switching masters</h4>
To be able to execute a <code>CHANGE MASTER</code> statement with the information we need, it would be perfect to use a prepared statement, but unfortunately <code>CHANGE MASTER</code> is one of those statements that cannot be used inside a prepared statement, so we have to build the statement dynamically. To make it easier, we create a <code>change_master</code> procedure that does the job of building, preparing, executing, and deallocating a prepared statement. We also allow the file name and position passed in to be NULL, in which case we start replication without these parameters, essentially starting from the beginning of the master's binary log.
<pre class="code">
delimiter $$
CREATE PROCEDURE change_master(
host VARCHAR(50), port INT,
user VARCHAR(50), passwd VARCHAR(50),
    name VARCHAR(50), pos BIGINT)
BEGIN
SET @cmd = CONCAT('CHANGE MASTER TO ',
CONCAT_WS(', ',
CONCAT('MASTER_HOST = "', host, '"'),
CONCAT('MASTER_PORT = ', port),
CONCAT('MASTER_USER = "', user, '"'),
CONCAT('MASTER_PASSWORD = "', passwd, '"')));
IF name IS NOT NULL AND pos IS NOT NULL THEN
SET @cmd = CONCAT(@cmd,
CONCAT_WS(', ', '',
CONCAT('MASTER_LOG_FILE = "', name, '"'),
CONCAT('MASTER_LOG_POS = ', pos)));
END IF;
PREPARE change_master FROM @cmd;
EXECUTE change_master;
DEALLOCATE PREPARE change_master;
END $$
delimiter ;
</pre>
The last step is to create the event that switches master for us. As a specific feature, we implement the event handling so that rows can be added to and removed from the <code>my_masters</code> table while the event just picks the next master in order. To solve this, we use a query that picks the next master based on the index of the last used master, and an additional query to handle the case of a wrap-around with a missing master at index 1.<p>
To allow the table to be changed while the event is executing, we place all the updates of our tables inside a transaction. That way, any updates done to the table while the event is executing will not affect the logic for picking the next master.<p>
There is some extra logic added to handle the case that there are "holes" in the index numbers: it is possible that there is no master with index 1, and it is possible that the next master does not have the next index in sequence. This would also allow the server ID of the master to be used as the index, but in the current implementation we use a simple index instead.
<table>
<tr><td><pre class="code">delimiter $$
CREATE EVENT multi_source
ON SCHEDULE EVERY 10 SECOND DO
BEGIN
DECLARE l_host VARCHAR(50);
DECLARE l_port INT UNSIGNED;
DECLARE l_user TEXT;
DECLARE l_pass TEXT;
DECLARE l_file VARCHAR(50);
DECLARE l_pos BIGINT;
  DECLARE l_next_idx INT DEFAULT 1;
</pre></td><td>
</td></tr>
<tr><td valign="top"><pre class="code"> SET SQL_LOG_BIN = 0;</pre></td><td valign="top"><em>Don't write any of this to the binary log. Since this is an event, it will automatically be reset at the end of the execution and not affect anything else.</em></td></tr>
<tr><td valign="top"><pre class="code"> STOP SLAVE IO_THREAD;
SELECT master_log_name, master_log_pos
INTO l_file, l_pos
FROM mysql.slave_master_info;
SELECT MASTER_POS_WAIT(l_file, l_pos);
STOP SLAVE;
</pre></td><td valign="top">
<em>Stop the slave I/O thread and empty the relay log before switching master</em></td></tr>
<tr><td><pre class="code"> START TRANSACTION;
</pre></td><td>
</td></tr>
<tr><td valign="top"><pre class="code"> UPDATE my_masters AS m,
mysql.slave_relay_log_info AS rli
SET m.log_pos = rli.master_log_pos,
m.log_file = rli.master_log_name
WHERE idx = (SELECT idx FROM current_master);
</pre></td><td valign="top"><em>Save the position of the current master</em></td></tr>
<tr><td valign="top"><pre class="code"> SELECT idx INTO l_next_idx FROM my_masters
WHERE idx > (SELECT idx FROM current_master)
ORDER BY idx LIMIT 1;
</pre></td><td valign="top"><em>Find the next master in turn. To handle that masters have been removed, we will pick the next one index-wise. Wrap-around is handled by using the default of 1 above.</em></td></tr>
<tr><td valign="top"><pre class="code"> SELECT idx INTO l_next_idx FROM my_masters
WHERE idx >= l_next_idx
ORDER BY idx LIMIT 1;
</pre></td><td valign="top"><em>If we did a wrap-around, it might be the case that master with index 1 does not exist (the default for l_next_idx), so then we have to scan and find the first index that exists which is equal to or greater than l_next_idx.</em></td></tr>
<tr><td valign="top"><pre class="code"> UPDATE current_master SET idx = l_next_idx;
SELECT host, port, user, passwd, log_pos, log_file
INTO l_host, l_port, l_user, l_pass, l_pos, l_file
FROM my_masters
WHERE idx = l_next_idx;
CALL change_master(l_host, l_port, l_user,
l_pass, l_file, l_pos);
COMMIT;
START SLAVE;
END $$
delimiter ;
</pre></td><td valign="top"><em>Extract the information about the new master from our masters table <code>my_masters</code> and change to use that master.</em></td></tr>
</table>
That's all! Now go off and play with it and send me comments.<p>
You can download the MySQL 5.6 Milestone Development Release from the <a href="http://dev.mysql.com">MySQL Developer Zone (<code>dev.mysql.com</code>)</a>, which contains the new replication tables, and you can find information in the <a href="crash-safe-replication.html">previous post</a> on how to set up the server to use the new tables.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com1tag:blogger.com,1999:blog-23496029.post-17082304104696194382011-04-12T17:09:00.004+02:002011-04-12T17:19:45.114+02:00Crash-safe ReplicationA common request is to have replication crash-safe in the sense that the replication progress information is always in sync with what has actually been applied to the database, even in the event of a crash. Although transactions are not lost if the server crashes, it could require some tweaking to bring the slaves up again.<p>
In the latest MySQL 5.6 milestone development release, the replication team has implemented crash-safety for the slave by adding the ability to commit the replication information together with the transaction (see Figure 1). This means that the replication information will always be consistent with what has been applied to the database, even in the event of a server crash. Also, some fixes were done on the master to ensure that it recovers correctly.<p>
If you're familiar with replication, you know that the replication information is stored in two files: <code>master.info</code> and <code>relay-log.info</code>.
These files are updated after the transaction has been applied. This means that if you have a crash between the transaction commit and the update of the files, the replication progress information will be wrong.
In other words, a transaction cannot be lost this way, but there <em>is</em> a risk that a transaction could be applied a second time.
The usual way to avoid this is to have a primary key on all your tables. In that case, a repeated update of the table would cause the slave to stop, and you would have to use <code>SQL_SLAVE_SKIP_COUNTER</code> to skip the transaction and get the slave up and running again. This is better than losing a transaction, but it is nevertheless a nuisance.
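As a sketch, this manual recovery typically looks like the following on the slave (the number of event groups to skip depends on what was actually duplicated, so treat the value here as a placeholder):

```sql
-- Sketch: skip one duplicated event group after a crash,
-- then resume replication.
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
```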
Removing the primary key to prevent the slave from stopping will only solve the problem partially: it means that the transaction would be applied twice, which would both place a burden on the application to handle duplicate entries and also require the tables to be cleaned regularly.
Both of these approaches require either manual intervention or scripting support to handle. This does not affect reliability, but it is so much easier to handle if the replication information is committed in the same transaction as the data being updated.<p>
<h4>Crash-safe masters</h4>
Two problems related to crash-safe replication have been fixed in the master, both of which could cause some annoyance when the master recovered.
<ul>
<li>If the master crashed while a binary log was being rotated, it was possible that some orphan binlog files ended up in the binary log index file. This was fixed in 5.1 but is also a piece in the puzzle of having crash-safe replication.</li>
<li>Writing to the binary log is not an atomic operation, and if a crash occurred while writing to the binary log, there was a possibility of a partial event at the end of the binary log.<p>
Now, the master recovers from this by truncating the binary log to the last known good position, removing the partially written transaction and rolling back the outstanding transactions in the storage engines.</li>
</ul>
<table class="figure-right">
<caption>Figure 1. Moving position information update into transaction</caption>
<tr><td>
<object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiz2bv1DkWpAL2z0_NvVkDJoICfSVp-Bz9CyX2W8_9H2jCdtm52NzGv_C_TQ6hMlCroAlyARxXMJ5StBbX1z7oyn-U3ACEXCsKsXz9HnrduPSMA5PxL7MCtaEYC9tbvYpvNcjkV//" type="image/png" width="240" height="210">
</object>
</td><td>
<object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDCEcdRVenx3qo9K_iS9iQwPBNx7KP-L9L2oiyTXLTBydBtEwpnQubHTF6CxbtdSKgWDQDcIQFADC9EC8uity0NNED-F181t2ugZQcGBscTHqXjJKRJBHKwCFxJFUyZgONwV7I//" type="image/png" width="240" height="210">
</object>
</td></tr>
</table>
<h4>Crash-safe slaves</h4>
Several different solutions for implementing crash-safety—or <em>transactional replication</em>, as it is sometimes known—have been proposed, with Google's <a href="http://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication">TransactionalReplication</a> patch being the best known. This solution stores the replication positions in the InnoDB transaction log, but the MySQL replication team decided to instead implement crash-safety by moving the replication progress information into system tables. This is a more flexible solution and has several advantages compared to storing the positions in the InnoDB transaction log:
<ul>
<li>If the replication information and data is stored in the same storage engine, it will allow both the data and the replication position to be updated as a single transaction, which means that it is crash-safe.</li>
<li>If the replication information and data is stored in different storage engines, but both support XA, they can still be committed as a single transaction.</li>
<li>The replication information is flushed to disk together with the transaction data. Hence, writing the replication information directly to the InnoDB redo log does not offer a speed advantage, while storing it in tables does not prevent the user from reading the replication progress information easily.</li>
<li>The tables can be read from a normal session using SQL commands, which also means that it can be incorporated into such things as stored procedures and stored functions.</li>
</ul>
<TABLE class="simple-table figure-right">
<caption><strong>Table 1. </strong><code>slave_master_info</code></caption>
<TR><TH>Field</TH><TH>Line in file</TH><TH>Slave status column</TH></TR>
<TR>
<TD>Master_id</TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD>Number_of_lines</TD>
<TD align="right">1</TD>
<TD></TD>
</TR>
<TR>
<TD>Master_log_name</TD>
<TD align="right">2</TD>
<TD><code>Master_Log_File</code></TD>
</TR>
<TR>
<TD>Master_log_pos</TD>
<TD align="right">3</TD>
<TD><code>Read_Master_Log_Pos</code></TD>
</TR>
<TR>
<TD>Host</TD>
<TD align="right">3</TD>
<TD><code>Master_Host</code></TD>
</TR>
<TR>
<TD>User_name</TD>
<TD align="right">4</TD>
<TD><code>Master_User</code></TD>
</TR>
<TR>
<TD>User_password</TD>
<TD align="right">5</TD>
<TD></TD>
</TR>
<TR>
<TD>Port</TD>
<TD align="right">6</TD>
<TD><code>Master_Port</code></TD>
</TR>
<TR>
<TD>Connect_retry</TD>
<TD align="right">7</TD>
<TD><code>Connect_Retry</code></TD>
</TR>
<TR>
<TD>Enabled_ssl</TD>
<TD align="right">8</TD>
<TD><code>Master_SSL_Allowed</code></TD>
</TR>
<TR>
<TD>Ssl_ca</TD>
<TD align="right">9</TD>
<TD><code>Master_SSL_CA_File</code></TD>
</TR>
<TR>
<TD>Ssl_capath</TD>
<TD align="right">10</TD>
<TD><code>Master_SSL_CA_Path</code></TD>
</TR>
<TR>
<TD>Ssl_cert</TD>
<TD align="right">11</TD>
<TD><code>Master_SSL_Cert</code></TD>
</TR>
<TR>
<TD>Ssl_cipher</TD>
<TD align="right">12</TD>
<TD><code>Master_SSL_Cipher</code></TD>
</TR>
<TR>
<TD>Ssl_key</TD>
<TD align="right">13</TD>
<TD><code>Master_SSL_Key</code></TD>
</TR>
<TR>
<TD>Ssl_verify_server_cert</TD>
<TD align="right">14</TD>
<TD><code>Master_SSL_Verify_Server_Cert</code></TD>
</TR>
<TR>
<TD>Heartbeat</TD>
<TD align="right">15</TD>
<TD></TD>
</TR>
<TR>
<TD>Bind</TD>
<TD align="right">16</TD>
<TD><code>Master_Bind</code></TD>
</TR>
<TR>
<TD>Ignored_server_ids</TD>
<TD align="right">17</TD>
<TD><code>Replicate_Ignore_Server_Ids</code></TD>
</TR>
<TR>
<TD>Uuid</TD>
<TD align="right">18</TD>
<TD><code>Master_UUID</code></TD>
</TR>
<TR>
<TD>Retry_count</TD>
<TD align="right">19</TD>
<TD><code>Master_Retry_Count</code></TD>
</TR>
</TABLE>
In addition to giving us crash-safe slaves, the last of these advantages should not be taken lightly. Being able to handle replication from pure SQL puts some of the key features in the hands of application developers.<p>
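For example, with table-based repositories the applied position can be read from any session with a plain query, something like this sketch (column names as in Table 2):

```sql
-- How far has the SQL thread applied? Just ask with SQL.
SELECT Master_log_name, Master_log_pos
  FROM mysql.slave_relay_log_info;
```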
As previously mentioned, the replication information is stored in two files:
<dl>
<dt><code>master.info</code>
<dd>This file contains information about the connection to the master—such as hostname, user, and password—but also information about how much of the binary log has been transferred to the slave.
<dt><code>relay-log.info</code>
<dd>This file contains information about the current state of replication, that is, how much of the relay log has been applied.
</dl>
<h4>Options to select replication information repository</h4>
In order to make the solution flexible, we introduced a general API for adding replication information repositories. This means that we can support multiple types of repositories for replication information, but currently only the old system using the files <code>master.info</code> and <code>relay-log.info</code> and the new system using the tables <code>slave_master_info</code> and <code>slave_relay_log_info</code> are supported. To select what type of repository to use, two new options were added. These options are also available as server variables.
<dl>
<dt><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_master-info-repository"><var>master_info_repository</var></a>
<dd>The type of repository to use for the master info data seen in Table 1.
<dt><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary.html#option_mysqld_relay-log-info-repository"><var>relay_log_info_repository</var></a>
<dd>The type of repository to use for the relay log info seen in Table 2.
</dl>
Both of the variables can be set to either <code>FILE</code> or <code>TABLE</code>. If the variable is set to <code>TABLE</code> the new table-based system will be used and if it is set to <code>FILE</code>, the old file-based system will be used. The default is <code>FILE</code>, so make sure to set the value if you want to use the table-based system.<p>
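As a minimal configuration sketch, the two options can be placed in the server's configuration file so that the table-based system is used from startup:

```
# my.cnf fragment: enable table-based replication repositories
[mysqld]
master_info_repository    = TABLE
relay_log_info_repository = TABLE
```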
<TABLE class="simple-table figure-right" id="slave_relay_log_info">
<caption><strong>Table 2. </strong><code>slave_relay_log_info</code></caption>
<TR><TH>Field</TH><TH>Line in file</TH><TH>Slave status column</TH></TR>
<TR><TD>Master_id</TD><TD></TD><TD></TD></TR>
<TR><TD>Number_of_lines</TD><TD>1</TD><TD></TD></TR>
<TR><TD>Relay_log_name</TD><TD>2</TD><TD><code>Relay_Log_File</code></TD></TR>
<TR><TD>Relay_log_pos</TD><TD>3</TD><TD><code>Relay_Log_Pos</code></TD></TR>
<TR><TD>Master_log_name</TD><TD>4</TD><TD><code>Relay_Master_Log_File</code></TD></TR>
<TR><TD>Master_log_pos</TD><TD>5</TD><TD><code>Exec_Master_Log_Pos</code></TD></TR>
<TR><TD>Sql_delay</TD><TD>6</TD><TD><code>SQL_Delay</code></TD></TR>
</TABLE>
If you look at Table 1 and Table 2, you can see the column names used for the tables as well as the line number in the corresponding file and the column name in the output of <code>SHOW SLAVE STATUS</code>. When tables are used, the column names identify where each value is stored; when a file is used, the value is instead stored at the line number given in the table.<p>
The format of the tables has been extended with an additional field that is not present in the files: the <code>Master_id</code> field. The reason we added this is to make it possible to extend the server to track multiple masters. Note that we currently have no definite plans to add multi-source support, but as good engineers we do not want these tables to be a hindrance to adding it.<p>
<h4>Selecting replication repository engine</h4>
In contrast to most of the system tables in the server, the replication repositories can be configured to use any storage engine you prefer. The advantage of this is that you can select the same engine for the replication repositories as for the data you're managing. If you do that, both the data and the replication information will be committed as a single transaction.<p>
The new tables are created at installation by the <code>mysql_install_db</code> script, as usual, and the default engine for these tables is the same as for all system tables: MyISAM. As you know, MyISAM is not very transactional, so it is necessary to switch to InnoDB if you really want crash-safety. To change the engine for these tables you can just use a normal <samp>ALTER TABLE</samp>.
<pre class="code">
slave> ALTER TABLE mysql.slave_master_info ENGINE = InnoDB;
slave> ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;
</pre>
Note that this works for these tables because they were designed to allow any storage engine to be used for them, but it does not mean that you can change the storage engine for other system tables and expect it to work.
<h4>Event processing</h4>
This implementation of crash-safe slaves works naturally with both statement-based and row-based replication, and there is nothing special that needs to be done in the normal cases. However, these tables interleave with the normal processing in slightly different ways.<p>
To understand how transactions are processed by the SQL thread, let us consider the following example transaction:
<pre class="code">
START TRANSACTION;
INSERT INTO articles(user, title, body)
VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
COMMIT;
</pre>
This transaction will be written to the binary log, sent over to the slave, and written to the relay log in the usual way. Once it is read from the relay log for execution, it will be executed as if an update statement were added to the end of the transaction, just before the commit:
<pre class="code">
START TRANSACTION;
INSERT INTO articles(user, title, body)
VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
<span style="color: red">UPDATE mysql.slave_relay_log_info
SET Master_log_pos = <var>@@Exec_Master_Log_Pos</var>,
Master_log_name = <var>@@Relay_Master_Log_File</var>,
Relay_log_name = <var>@@Relay_Log_File</var>,
Relay_log_pos = <var>@@Relay_Log_Pos</var></span>
COMMIT;
</pre>
In this example, there are a number of pseudo-server variables (that is, they don't exist for real) that have the same names as the corresponding fields in the result set from <code>SHOW SLAVE STATUS</code>. As you can see, the update of the position information is now inside the transaction and will be committed with it, so if both <code>articles</code> and <code>mysql.slave_relay_log_info</code> are in <em>the same transactional engine</em>, they will be committed as a unit.<p>
This works well for the SQL thread, but what about the I/O thread? There are no transactions executed here at all, so when is the information in this table committed?<p>
Since a commit to the table is expensive—in the same way as syncing a file to disk is expensive when using files as the replication information repository—the <code>slave_master_info</code> table is not updated for each processed event. Depending on the value of <var>sync_master_info</var>, there are a few alternatives.
<dl>
<dt>If <var>sync_master_info</var> = 0
<dd>In this case, the <code>slave_master_info</code> table is only updated when the slave starts or stops (for any reason, including errors), when the relay log is rotated, or when you execute a <code>CHANGE MASTER</code> command.
<dt>If <var>sync_master_info</var> > 0
<dd>Then the <code>slave_master_info</code> table will be updated every <var>sync_master_info</var> events.
</dl>
This means that while the slave is running, you cannot really see how much data has been read by the slave without stopping it. If it is important to see how the slave progresses in reading events from the master, you have to set <var>sync_master_info</var> to some non-zero value, but you should be aware that there is a cost associated with doing this.<p>
This does not usually pose a problem, since the times you need to read the master replication information while replication is running are few and far between. It is much more common to read it when the slave has stopped for some reason: to figure out where the error is or to perform a master fail-over.
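If you do need to monitor the I/O thread while the slave is running, a sketch of the trade-off looks like this (the value 100 is just an illustration):

```sql
-- Commit slave_master_info every 100 events instead of only at
-- start/stop, rotation, and CHANGE MASTER (costs extra commits).
SET GLOBAL sync_master_info = 100;
SELECT Master_log_name, Master_log_pos FROM mysql.slave_master_info;
```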
<h4>Closing remarks</h4>
We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try this out for yourselves then you can download the MySQL 5.6 Milestone Development Release where all this is implemented from the <a href="http://dev.mysql.com">MySQL Developer Zone (<code>dev.mysql.com</code>)</a>.
If you want to find out more details, the section <a href="http://dev.mysql.com/doc/refman/5.6/en/slave-logs-status.html">Slave Status Logs</a> in the MySQL 5.6 reference manual will provide you with all the information.
This is one of the features presented by Lars Thalmann on April 11, 2011 (yesterday) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and on April 12, 2011 (today) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com14tag:blogger.com,1999:blog-23496029.post-81038332488864940992011-04-11T14:49:00.002+02:002011-04-11T14:55:16.283+02:00Replication Event ChecksumMySQL replication is fast, easy to use, and reliable, but once it
breaks, it can be very hard to figure out what the problem is. One of the concerns often raised is that events become corrupted, either through failing hardware, network failure, or software bugs. Even though it is possible to handle errors during transfer over the network using an SSL connection, errors there are rarely the problem. A more common problem (relatively speaking) is that the events are corrupted due to a software bug or hardware error.<p>
To be able to better handle corrupted events, the replication team has added <em class="def">replication event checksums</em> to the MySQL 5.6 Milestone Development Release.
The replication event checksums are added to each event as it is
written to the binary log and are used to check that nothing happened to the event on the way to the slave. Since the checksums are added to all events in the binary log on the master and are both transferred over the network and written to the relay log on the slave, it is possible to catch corrupted events whether they are caused by hardware problems,
network failures, or software bugs.<p>
<table class="figure-right">
<caption>Figure 1. Master and Slave with Threads</caption>
<tr><td>
<object title="Master and Slave with Threads"
data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpooSKNs9URFGqVW9U-ysTzzWvNTH1ct1mPTWTarbkrUr3aVmpRDdseanigvddADAEnSXC-ruxKXO1jAnFVaXk4s20bW1PhLaDxt3Fdz-MKK_5knQtF-xaWznqzDRuH91tEK66//" type="image/png" width="490" height="230">
</object>
</td></tr>
</table>
The checksum used is a CRC-32 checksum, more precisely ISO-3309, which
is the one supplied with <a href="http://zlib.net/">zlib</a>. This is
an efficient checksum algorithm, but there is of course a penalty
since the checksum needs to be generated. At this time, we don't have
any measurements on the performance impact.<p>
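The checksum function itself is easy to experiment with outside the server. The following Python sketch uses zlib's CRC-32 (the same ISO-3309 polynomial) on a byte string standing in for an event body—the real binlog event layout is not modeled here—and shows that a single changed byte is detected:

```python
import zlib

def checksum_event(event_bytes: bytes) -> int:
    """ISO-3309 CRC-32 of an event body, as supplied by zlib."""
    return zlib.crc32(event_bytes) & 0xFFFFFFFF

# A stand-in for a binlog event body (illustration only).
event = b"INSERT INTO t1(name) VALUES ('Mats'),('Luis')"
crc = checksum_event(event)

# Corrupt a single byte, as a failing disk or network might.
corrupted = event.replace(b"Mats", b"Matz")

# The mismatch is how the receiving end detects the corruption.
assert checksum_event(corrupted) != crc
```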
If you look at Figure 1 you can see an illustration of how events
propagate through the replication system. In the figure, the points
where a checksum <em>could</em> be generated or checked are marked
with numbers. In the diagram, you can see the threads that handle the
processing of events, and an outgoing arrow from a thread can generate
a checksum while an arrow going into a thread can validate a
checksum. Note, however, that for pragmatic reasons not all
validations or generations can be done.<p>
To enable validation or generation, three new options were introduced:
<dl>
<dt><code>binlog_checksum</code>
<dd>This option is used to control checksum generation. Currently,
it can accept two different values: <code>NONE</code> and
<code>CRC32</code>, with <code>NONE</code> being default (for
backward compatibility).<p> Setting <var>binlog_checksum</var>
to <code>NONE</code> means that no checksum is generated, while
setting it to <code>CRC32</code> means that an ISO-3309 CRC-32
checksum is added to each binary log event.<p>
This means that a checksum will be generated by the session thread
and written to the binary log, that is, at point 1 in
Figure 1.
<dt><code>master_verify_checksum</code>
<dd>This option can be set to either 0 or 1 (with default being 0)
and indicates that the master should verify any events read from
the binary log on the master, corresponding to point 2 in
Figure 1. In addition to being read from the binary log by
the dump thread events are also read when a <kbd>SHOW BINLOG
EVENTS</kbd> is issued at the master and a check is done at this
time as well.<p>
Setting this flag can be useful to verify that the event really
written to the binary log is uncorrupted, but it is typically not
needed in a replication setting since the slave should verify the
event on reception.
<dt><code>slave_sql_verify_checksum</code>
<dd>Similar to <code>master_verify_checksum</code>, this option can
be set to either 0 or 1 (but defaults to 1) and indicates that the
<em>SQL thread</em> should verify the checksum when reading it
from the relay log on the slave.
Note that this means that the I/O thread writes a checksum to the
event written to the relay log, regardless of whether it received
an event with a checksum or not.<p>
This means that this option will enable verification at point 5 in
Figure 1 and also enable generation of a checksum at point
4 in the figure.
</dl>
If you paid attention, you probably noticed that there is no checking
for point 3 in the figure. This is not necessary since the checksum is
verified when the event is written to the relay log at point 4, and
the I/O thread just does a straight copy of the event (potentially
adding a checksum, as noted above).<p>
So, how does it look when we encounter a checksum error? Let's try it
out and see what happens.
We start by generating a simple binary log with checksums turned on
and see what we get.
<pre class="code wide">
master> <kbd>CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, name CHAR(50));</kbd>
Query OK, 0 rows affected (0.04 sec)
master> <kbd>INSERT INTO t1(name) VALUES ('Mats'),('Luis');</kbd>
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0
master> <kbd>SHOW BINLOG EVENTS FROM 261;</kbd>
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query | 1 | 333 | BEGIN |
| master-bin.000001 | 333 | Intvar | 1 | 365 | INSERT_ID=1 |
| master-bin.000001 | 365 | Query | 1 | 477 | use `test`; INSERT INTO t1(name) VALUES ('Mats'),('Luis') |
| master-bin.000001 | 477 | Query | 1 | 550 | COMMIT |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)
</pre>
Here, everything looks as before, so no sign of a checksum here, but
let's edit the binlog file directly and change the 's' in 'Mats' to a
'z' and see what happens. First with
<code>MASTER_VERIFY_CHECKSUM</code> set to 0, and then with it set to
1.
<pre class="code wide">
master> <kbd>SHOW BINLOG EVENTS FROM 261;</kbd>
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query | 1 | 333 | BEGIN |
| master-bin.000001 | 333 | Intvar | 1 | 365 | INSERT_ID=1 |
| master-bin.000001 | 365 | Query | 1 | 477 | use `test`; INSERT INTO t1(name) VALUES ('Matz'),('Luis') |
| master-bin.000001 | 477 | Query | 1 | 550 | COMMIT |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)
master> <kbd>SET GLOBAL MASTER_VERIFY_CHECKSUM=1;</kbd>
Query OK, 0 rows affected (0.00 sec)
master> <kbd>SHOW BINLOG EVENTS FROM 261;</kbd>
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error
</pre>
Now, the error message generated is not crystal clear, but there
was an I/O error when reading the binary log: the checksum
verification failed. You can see this because I could show the contents
of the binary log with <code>MASTER_VERIFY_CHECKSUM</code> set to 0,
but not with it set to 1. Since the checksum is checked when reading
events from the binary log, we get a checksum failure when using
<kbd>SHOW BINLOG EVENTS</kbd>.<p>
So, if we undo the corruption and verify that the binary log is
correct by issuing a <kbd>SHOW BINLOG EVENTS</kbd> again, we can
send it over to the slave and see what happens. The steps to do this
(in case you want to try it yourself) are:
<ol>
<li>Start the I/O thread and let it create the relay log using
<kbd>START SLAVE IO_THREAD</kbd>.</li>
<li>Stop the slave using <kbd>STOP SLAVE</kbd> (this is necessary
since the slave buffers part of the relay log).</li>
<li>Manually edit the relay log to corrupt one event (I replaced the
's' with a 'z').</li>
<li>Start the slave using <kbd>START SLAVE</kbd>.</li>
</ol>
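Step 3—corrupting a single byte—can be done with any hex editor, but a short script works just as well. This helper is my own, not part of MySQL, and the file name in the usage comment is only an example:
<pre class="code">
def corrupt_byte(path: str, needle: bytes, replacement: bytes) -> None:
    """Replace the first occurrence of `needle` with `replacement`
    (which must have the same length) in the file, in place."""
    assert len(needle) == len(replacement)
    with open(path, 'r+b') as f:
        data = f.read()
        pos = data.find(needle)
        if pos &lt; 0:
            raise ValueError('pattern not found')
        f.seek(pos)
        f.write(replacement)

# Example: corrupt the row data in a stopped slave's relay log.
# corrupt_byte('slave-relay-bin.000002', b'Mats', b'Matz')
</pre>
Remember to stop the slave first, as noted above, since part of the relay log is buffered in memory.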
The result when doing this is an error, as you can see below. Removing
the corruption and starting the slave again will apply the events as
expected.
<pre class="code wide">
slave> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
.
.
.
Master_Log_File: master-bin.000001
Read_Master_Log_Pos: 550
Relay_Log_File: slave-relay-bin.000002
Relay_Log_Pos: 419
Relay_Master_Log_File: master-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
.
.
.
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1594
Last_SQL_Error: Relay log read failure: Could not parse
relay log event entry. The possible
reasons are: the master's binary log is
corrupted...
.
.
.
Last_SQL_Error_Timestamp: 110406 09:41:40
1 row in set (0.00 sec)
</pre>
Now, this is all very nice, but if you have a corruption, you also
want to find out where the corruption is—preferably
without having to start the server. To handle this, the
<code>mysqlbinlog</code> program was extended to print the CRC
checksum (if there is one) and also to verify it if you give it the
<var>verify-binlog-checksum</var> option.
<pre class="code wide">
$ <kbd>client/mysqlbinlog --verify-binlog-checksum master-bin.000001</kbd>
.
.
.
<em style="color: red"># at 261</em>
#110406 8:35:28 server id 1 end_log_pos 333 <em style="color: red">CRC32 0xed927ef2</em> Query thread_id=1...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406 8:35:28 server id 1 end_log_pos 365 CRC32 0x01ed254d Intvar
SET INSERT_ID=1/*!*/;
<em style="color: red">ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely...</em>
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
</pre>
As you can see, an error is emitted for the offending event, and you
can also see the CRC checksum value (which is 32 bits) in the output
above; the position of the failing event corresponds to the position
where the slave stopped for my corrupted binary log.<p>
This is just the beginning: there are many things that can be done
using checksums, and many new things that are now possible to
implement. If you think that this is a useful feature, please let us
know, and if you think that it needs to be enhanced, changed, or
extended, we would also like to hear from you.
<h2>Closing remarks</h2>
We would be very interested in hearing any comments you have on this
feature and how it is implemented. If you want to try this out for
yourselves then you can download the MySQL 5.6 Milestone Development
Release where all this is implemented from the <a
href="http://dev.mysql.com">MySQL Developer Zone
(<code>dev.mysql.com</code>)</a>.<p>
If you want to find out the details, the reference documentation for
the replication checksum can be found together with the options
mentioned above:
<ul>
<li><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_binlog-checksum">Manual for <var>binlog-checksum</var> option</a></li>
<li><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_master-verify-checksum">Manual for <var>master-verify-checksum</var> option</a></li>
<li><a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_slave-sql-verify-checksum">Manual for <var>slave-sql-verify-checksum</var> option</a></li>
</ul>
This is one of the features presented by Lars Thalmann today (April 11, 2011) at 2:30pm in the "MySQL Replication" talk at Collaborate 11, and tomorrow (April 12, 2011) at 10:50am in the "MySQL Replication Update" talk at the O'Reilly MySQL Conference & Expo.
<h2>Slave Type Conversions (2011-02-08)</h2>
[Note: I'm testing to use <a href="http://code.google.com/p/googlecl/" >googlecl</a> to post this article.]
Replication is typically used to replicate from a master to one or
more slaves using the same definition of tables on the master and
slave, but in some cases you want to replicate to tables with a
different definition on the slave, for example:
<ul>
<li>Adding a timestamp column on the slave to see when the row was
last updated.</li>
<li>Eliminating some columns on the slave because you don't need
them and they take up space that you can use for better
purposes.</li>
<li>Temporarily handling an on-line upgrade of a dual-master or
circular replication setup.</li>
</ul>
Of these alternatives, the last one is critical to any deployment that
wants to stay available. If this case can be handled, most other
changes can also be handled, so let's focus on that.<p>
<table class="figure-right">
<caption>Figure 1. Table with an extra column on slave</caption>
<TR><TH>Master</TH><TH>Slave</TH></TR>
<TR>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name VARCHAR(64),
email VARCHAR(64),
PRIMARY KEY (id))
</pre></TD>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name VARCHAR(64),
email VARCHAR(64),
<strong>ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP</strong>
PRIMARY KEY (id))
</pre></TD>
</TR>
</table>
When using statement-based replication, the plain statements are
replicated—this can at times be an advantage, but not
always, as you will soon see. The most obvious case is when you have
more or fewer columns on the master than you have on the slave.
To illustrate the problem, let us start with the table definitions in
Figure 1. Here a timestamp column was added to the slave to see
when the row was last changed. When using statement-based replication,
we can properly replicate between these tables provided we always give
column names to the statement on the master, for example:<p>
<pre class="code">
master> <strong>INSERT INTO employee(name, email) VALUES ('Mats', 'mats@example.com');</strong>
master> <strong>DELETE FROM employee WHERE email = 'mats@example.com';</strong>
master> <strong>UPDATE employee SET name = 'Matz' WHERE email = 'mats@example.com';</strong>
</pre>
In all these cases, the statements execute perfectly well with both
table definitions since the "missing" column has a default value and
each statement gives exactly the names of the columns to update.
The <code>DELETE</code> and <code>UPDATE</code> statements naturally
refer only to the column on the master, but for <code>INSERT</code> it
is necessary to add the column names even if the tuple matches the
definition on the master since it could be different on the slave.<p>
Having to give the column names all the time is fragile, and if the
user—or the application—makes a mistake and types the
following statement, replication on the slave will stop with an
error:<p>
<pre class="code">
master> <strong>INSERT INTO employee VALUES (DEFAULT, 'Mats', 'mats@example.com');</strong>
</pre>
In contrast to statement-based replication, row-based replication will
do the right thing and throw away extra columns sent by the master or
add default values to extra columns on the slave—if the column
has a default value—<em>provided that the columns are added or
removed last in the table.</em><p>
This works fine for the example above since the extra timestamp column
is last in the table. The effect is to keep track of when the row was
last updated on the slave, which could be used to see if the row is
current.
<div class="note">Depending on what you want to accomplish, there
could be better techniques for this, described in <a
href="http://mysqlhighavailability.com/">our book</a>. The problem is
that the timestamp might not have enough precision in a high-load
situation.</div>
So, row-based replication in MySQL 5.1 contains support for using more
or fewer columns on the slave as compared to the master, but there
was one case that was not supported: replicating between different
column types. This is very important for basic upgrade scenarios
where you, for example, change the size of some column during an
upgrade.<p>
<table class="figure-right">
<caption>Figure 2. Different types on master and slave</caption>
<TR><TH>Master</TH><TH>Slave</TH></TR>
<TR>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name CHAR(64),
email CHAR(64),
PRIMARY KEY (id))
</pre></TD>
<TD valign="top"><pre class="code wide">
CREATE TABLE employee (
id SMALLINT AUTO_INCREMENT,
name VARCHAR(64),
email VARCHAR(64),
PRIMARY KEY (id))
</pre></TD>
</TR>
</table>
For example, consider the table definition in Figure 2.
In this case, the intention is to save space on the slave by storing
the strings in a <code>VARCHAR</code> field instead of a
<code>CHAR</code> field—recall that <code>VARCHAR</code> fields
are variable length strings while <code>CHAR</code> fields occupy a
fixed space in the row. (We don't care too much about the reasons for
using <code>CHAR</code> on the master, we just use this example to
illustrate the problem.)<p>
When using statement-based replication, this works well since the actual
statement is replicated. However, when using row-based replication we
have the additional requirement (in 5.1) that the column types
<em>have to have identical base types</em>. Unfortunately,
<code>CHAR</code> and <code>VARCHAR</code> do not have the same base
type, so replication will stop with an error when you try to execute
the <code>INSERT</code>, which is not very helpful.<p>
Fortunately, the replication team has extended row-based replication
with a new feature in MySQL 5.5: converting between types when
replicating from a master to a slave with a different table
definition. With this feature, stricter type checking and better
error messages are also implemented.<p>
The conversion checks the <em>declared types</em> on the master and
slave and decides before executing the transaction if the conversion
is allowed. This means that it does not investigate the actual
<em>values</em> replicated: only the types of the column on the master
and the slave. In addition to giving better performance compared to
checking each value, this check guarantees that
<em>any</em> value replicated between the tables will work, not just
the values that you happened to have in your test suite.<p>
When dealing with conversions, we are only considering conversions
<em>within</em> the groups below.
<dl>
<dt><strong>Integer types</strong>
<dd><code>TINYINT</code>, <code>SMALLINT</code>,
<code>MEDIUMINT</code>, <code>INT</code>, <code>BIGINT</code>
<dt><strong>Decimal types</strong>
<dd><code>DECIMAL</code>, <code>FLOAT</code>, <code>DOUBLE</code>,
<code>NUMERIC</code>
<dt><strong>String types</strong>
<dd><code>CHAR(<em>N</em>)</code>, <code>VARCHAR(<em>N</em>)</code>,
<code>TEXT</code> even for different values of <em>N</em> on master
and slave.
<dt><strong>Binary types</strong>
<dd><code>BINARY(<em>N</em>)</code>,
<code>VARBINARY(<em>N</em>)</code>, <code>BLOB</code> even for
different values for <em>N</em> on master and slave.
<dt><strong>Bit types</strong>
<dd>Conversion between <code>BIT(<em>N</em>)</code> for different
values of <em>N</em> on master and slave.
</dl>
Since the string and binary types only differ in the character set
they use—<em>and replication is not aware of character sets
yet</em>—replication between string and binary types will be
possible simply because the character set is not known. Don't rely on
this though; as soon as <a
href="http://bugs.mysql.com/bug.php?id=47673" >Bug#47673</a> is fixed,
string and binary types will be separated into distinct groups and
replication will stop if the character sets don't allow conversion.<p>
Within each group, we also have two types of conversions:
<em>non-lossy conversions</em> and <em>lossy conversions</em>. With a
non-lossy conversion you are guaranteed that no information is lost,
but with lossy conversions it is possible that you lose some
information. A typical example of a non-lossy conversion is converting
from a <code>CHAR(32)</code> field to a <code>CHAR(64)</code>
field—since the target field is wider than the source field,
there is no risk that any part of the string is lost. Converting in
the other direction, however, is a lossy conversion since a string
with more than 32 characters cannot fit into a <code>CHAR(32)</code>
field. A more unusual example is conversion between <code>FLOAT</code> and
<code>DECIMAL(N,M)</code>, which is <em>always</em> considered lossy,
regardless of the direction in which the conversion is done, since it
cannot be guaranteed that all floating-point numbers can be converted
to decimal numbers without losing precision, and vice versa.<p>
What conversions are allowed is controlled with a new
server variable <code>SLAVE_TYPE_CONVERSIONS</code>, which is of the
type <code>SET('ALL_LOSSY','ALL_NON_LOSSY')</code>, that is, it is a
<em>set</em> of allowed conversions. The default for this variable is
the empty set, meaning that no conversions are allowed at all.<p>
If the <code>ALL_NON_LOSSY</code> constant is in the set, all
conversions (within each group) that do not lose any information are
allowed. For example, replicating from <code>CHAR(32)</code> to
<code>TINYTEXT</code> is allowed since the conversion goes to a wider
field (even if it is a different type).<p>
If the <code>ALL_LOSSY</code> constant is in the set, all conversions
(again, within the same group) that could potentially lose information
are allowed. For example, conversion to a narrower field on the slave,
such as <code>CHAR(32)</code> to <code>CHAR(16)</code> is
allowed. <em>Note that non-lossy conversions are not automatically
allowed when <code>ALL_LOSSY</code> is set.</em><p>
<div class="note">The prefix <code>ALL</code> is used since we were considering the possibility of allowing conversions within certain groups only; for example, to allow only lossy conversions for strings and only non-lossy conversions for integers, we could set <code>SLAVE_TYPE_CONVERSIONS</code> to <code>'STRING_LOSSY,INTEGER_NON_LOSSY'</code>. This is, however, pure speculation at this time.</div>
If you are interested in the details of how slave type conversions work, you can find more information in the MySQL Reference Manual in <a href="http://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html" >Replication with Differing Tables on Master and Slave</a>.
<h2>Have you seen my replication files? (2010-09-22)</h2>
I recently started looking into how to get information about the relay log files and binary log files using an SQL interface. Being able to do that can be quite handy when working with replication in various ways. In my particular case, I wanted the paths to the relay log index file and the binary log index file so that I could read the binary log files as well as the relay log files directly.<p>
You are probably familiar with the <code>--relay-log-index</code> and <code>--relay-log</code> options that can be set to specify where the index file and the relay log files are created. These options can be given either an absolute or a relative path: if the value starts with a <code>/</code>, it is considered an absolute path (drive letters are allowed on Windows, though); otherwise, the path is relative to the data directory (which is specified through the <code>--datadir</code> option). The values supplied to these options are available from SQL as the system variables <var>relay_log_index</var> and <var>relay_log</var> respectively.
The recommendation is to always set <code>--relay-log</code> and <code>--relay-log-index</code>, since the default values for these options contain the hostname. The problem with this is that if the database files are moved to a machine with a different hostname, the server will not be able to pick up the relay log files correctly and will assume that they do not exist.
The logic for finding the location of the relay log files can be quite daunting; to find the location of the relay log index file:
<ol>
<li>If <var>relay_log_index</var> is set, this is the location of the relay log index file.</li>
<li>If <var>relay_log_index</var> is not set, then the value supplied to the <var>relay_log</var> option is used to figure out the name of the relay log index file.</li>
<li>If neither <var>relay_log_index</var> nor <var>relay_log</var> is set, then the name of the relay log index file is taken by stripping the directory and extension from the <var>pid_file</var> variable (set using the <code>--pid-file</code> option), if supplied, and adding <code>-relay-bin.index</code> to the end of the string.
<ul>
<li>The <var>pid_file</var> variable has a default value which consists of <code><var>datadir</var>/<var>hostname</var>.pid</code>, which would give the relay log index file a name of <code><var>datadir</var>/<var>hostname</var>-relay-bin.index</code>.</li>
</ul></li>
<li>If the path is a relative path—that is, the path does not start with a directory separator—then the value of <var>datadir</var> is prepended to the relay log index file name.</li>
</ol>
Keeping track of all these details is not something I want to spend my time on, so I wrote a stored function for computing the name of the relay log index file which I simply called <code>relay_log_index_file</code>:
<pre class="code">
CREATE FUNCTION relay_log_index_file () RETURNS VARCHAR(256)
DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE rli_name VARCHAR(256);
IF @@relay_log_index IS NOT NULL THEN
SET rli_name = @@relay_log_index;
ELSEIF @@relay_log IS NOT NULL THEN
-- The index file name is the relay log base name plus '.index'
SET rli_name = CONCAT(@@relay_log, '.index');
ELSE
BEGIN
DECLARE l_pid_file VARCHAR(256);
DECLARE l_pid_base VARCHAR(256);
SET l_pid_file = SUBSTRING_INDEX(@@pid_file, '/', -1);
SET l_pid_base = SUBSTRING_INDEX(l_pid_file, '.', 1);
SET rli_name = CONCAT(l_pid_base, '-relay-bin.index');
END;
END IF;
IF rli_name NOT LIKE '/%' THEN
RETURN CONCAT(@@datadir, rli_name);
END IF;
RETURN rli_name;
END
</pre>
This is a quite complicated way of figuring out the location of the relay log files and hardly something that I consider very useful. It would be much better if the <var>relay_log_index</var> variable gave the complete path to the file, regardless of what was given to the <var>--relay-log-index</var> option (or even if the option was given at all).<p>
Being able to fetch the relay log index file is quite convenient, but being able to fetch the binary log index file would be even more convenient. Unfortunately, there is no such variable. The <var>--log-bin</var> option can be used to supply a base name to use for the binary log, but the <var>log_bin</var> variable can only be ON or OFF, which in my book is not very smart. To fix this, I created <a href="http://forge.mysql.com/worklog/task.php?id=5465">WL#5465</a>, which introduces three new variables—<var>log_bin_basename</var>, <var>relay_log_basename</var>, and <var>log_bin_index</var>—and changes behaviour of <var>relay_log_index</var>.
<dl>
<dt><var>log_bin_basename</var>
<dd>This is a global read-only variable that contains the base file name used for the binary log files, that is, the path to the files but omitting the extension.
<ul>
<li>If a full path was given to <var>--log-bin</var>, this will be stored in <var>log_bin_basename</var>.</li>
<li>If a relative path was given to <var>--log-bin</var>, the contents of <var>datadir</var> will be used as the directory and prepended to the value of <var>--log-bin</var>.</li>
<li>Otherwise, the value of <var>datadir</var> will be used as the directory of the file and the base name is created by taking the basename of <var>pid_file</var> (the name without extension) and adding '<code>-bin</code>'.</li>
</ul>
<dt><var>log_bin_index</var>
<dd>This is a global read-only variable containing the full name to the binary log index file. If no value is given, the value of <var>log_bin_basename</var> is used and the extension '<code>.index</code>' is added.
<dt><var>relay_log_basename</var>
<dd>This is a global read-only variable containing the base file name used for the relay log file, that is, the full path to the relay logs but not including the extension. The value of this variable is created in the same way as for <var>log_bin_basename</var> with the only difference that the '<code>-relay-bin</code>' suffix is used instead of '<code>-bin</code>'.
<dt><var>relay_log_index</var>
<dd>This is a global read-only variable containing the full name of the relay log index file. If no value is given, the value of <var>relay_log_basename</var> is used and the extension '<code>.index</code>' is added.
</dl>
With these new variables, fetching the full path of the binary log index file is as easy as doing a '<code>SELECT @@log_bin_index</code>'.
<div class="note">If you're interested in whether the patch for this worklog will be in any particular server version or if it is pushed at all, you have to check the status of the worklog. Even though I have described the architecture and implemented a patch, there is no way to know where it ends up or even if it is pushed at all.</div>
<div class="digression">
<h1>An alternative: let the application do the job</h1>
Creating a stored function for computing the relay log index file name might be overkill in many situations. If the value is needed from several different connections, it makes sense to create it as a stored function so it can be used by different applications. It can, however, just as well be placed in the application code, which would then compute the location of the relay log index file using a single query to the server.<p>
The information you need is the data directory from <var>datadir</var>, the pid file name from <var>pid_file</var> (in the event that the relay log or the relay log index option does not have a value), and the <var>relay_log</var> and <var>relay_log_index</var> values.<p>
For example, the following Python code could be used to compute the data directory, the base use for creating relay log files, and the name of the index file using a single query to the database server:
<pre class="code">
import os.path
def get_relay_log_info(connection):
cursor = connection.cursor()
cursor.execute("SELECT @@datadir, @@pid_file, @@relay_log, @@relay_log_index")
datadir, pid_file, relay_log, relay_log_index = cursor.fetchone()
def _add_datadir(filename):
if os.path.isabs(filename):
return filename
else:
return os.path.join(datadir, filename)
pidfile_base = os.path.basename(os.path.splitext(pid_file)[0])
base_name = _add_datadir(relay_log or pidfile_base + '-relay-bin')
index_file = _add_datadir(relay_log_index or base_name + '.index')
return { 'datadir': datadir, 'base': base_name, 'index': index_file }
</pre>
</div>
<h2>Binary Log Group Commit - Recovery (2010-08-18)</h2>
It was a while since I wrote the <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">previous article</a>, but the merger of Oracle and Sun resulted in quite a lot of time being spent attending various events and courses for legal reasons (one of the reasons I prefer working for smaller companies), and together with a summer vacation spent looking over the house, there was little time for anything else. This is the second post of three, and in the last one I will cover some optimizations that improve performance significantly.<p>
In the <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">previous article</a>, an approach was outlined to handle binary log group commit. The basic idea is to use the binary log as a ticketing system by reserving space in it for the transactions that are going to be written. This provides an order on the transactions as well as allowing them to be written to the binary log in parallel, thereby boosting performance.
As noted in the previous post, a crash while writing transactions to the binary log requires recovery. To understand what needs to be changed, it is necessary to understand the structure of the binary log as well as how recovery after a crash currently works together with the implementation of <a href="http://dev.mysql.com/doc/refman/5.1/en/xa.html">2-phase commit that MySQL uses</a>.<p>
<table class="figure-right">
<caption>Figure 1. Binlog file structure</caption>
<tr><td>
<object title="Binary log structure"
data="http://www.kindahl.net/images/binlog-v4-structure.svg" type="image/svg+xml" width="400" height="196">
<object width="400" height="196" type="image/png" data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaHbanbq7d1Ws8qCtmx_kgyl40A9aF_EhFOx8siYFAJlFFZdkQi3KTfRCVVuVnAxtwUh4sBR8AVhd6-mRxDggGMFUfQ3QGfb6lxrUDRfWMYOf5D8XirkyKbXUIkp4iX-yL0nfW/s320/binlog-v4-structure.png" id="BLOGGER_PHOTO_ID_5506824342835241650">
</object>
</object>
</td></tr>
</table>
<h3>A quick intro to the structure of the binary log</h3>
Figure 1 gives the rough structure of the binary log, with a set of <em>binlog files</em> and a <em>binlog index file</em>. The binlog index file just lists the binlog files that make up the binary log, while each binlog file contains the real contents of the binary log that you can see when executing a <code>SHOW BINLOG EVENTS</code>.<p>
Each binlog file consists of a sequence of <em>binlog events</em>, of which the most important event from our perspective is the <em>Format description event</em>. In addition, each binlog file is normally terminated by a <em>Rotate event</em> that refers to the next binlog file in the sequence.<p>
The Format description event is used to describe the contents of the binlog file and therefore contains a lot of information about it. In this case we are interested in a special flag called <code>LOG_EVENT_BINLOG_IN_USE_F</code>, which is used to tell whether the binlog file is actively being written by the server. When the server opens a new binlog file, this flag is set to indicate that the file is in use, and when the binary log is rotated and a new binlog file created, the flag is cleared when closing the old binlog file.<p>
In the event of a crash, the flag will therefore be set, and the server can see that the file was not closed properly and start performing recovery.
<h3>Recovery and the binary log</h3>
When recovering, the server has to find all transactions that were partially executed and decide if they are going to be rolled back or committed properly. The deciding point when a transaction will be committed instead of rolled back is when the transaction has been written to the binary log. To do this, the server has to find all transactions that were written to the binary log and tell all storage engines to commit these transactions.<p>
The recovery procedure is executed when the binary log is opened—which the server does by calling <code>TC_LOG_BINLOG::open</code> during startup. When the binary log is opened, recovery is done if the last open binlog file was not closed properly. An outline of the procedure executed is:
<ol>
<li>Open the binlog index file and go through it to find the last binlog file mentioned there [<code>TC_LOG_BINLOG::open</code>]</li>
<li>Open this binlog file and check if the <code>LOG_EVENT_BINLOG_IN_USE_F</code> flag is set</li>
<li>If the flag was clear, then the server stopped properly and no recovery is necessary. Otherwise, the server did not stop properly and recovery starts.</li>
<li>The last binlog file is now open, so the entire binlog file is scanned and the XID of each Xid event is recorded. These XIDs denote the transactions that were properly written to the binary log—that is, the transactions that shall be committed [<code>TC_LOG_BINLOG::recover</code>].</li>
<li>Each storage engine is handed the list of XIDs of transactions to commit through the <code>handlerton::recover</code> interface function [<code>ha_recover</code>].</li>
<li>The storage engine will then commit each transaction in the list and roll back all the others.</li>
</ol>
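The XID-collection step of the procedure above can be sketched as follows. This is a simplified model of my own—events are represented as Python dictionaries rather than the binary event format—of how the scan in <code>TC_LOG_BINLOG::recover</code> partitions prepared transactions into those to commit and those to roll back:
<pre class="code">
def recover(binlog_events, prepared_xids):
    """Scan the last binlog file for Xid events; transactions whose
    XID appears there were fully written to the binary log and must
    be committed, while all other prepared transactions are rolled
    back by the storage engines."""
    committed = {e['xid'] for e in binlog_events if e['type'] == 'Xid'}
    to_commit = prepared_xids &amp; committed
    to_rollback = prepared_xids - committed
    return to_commit, to_rollback

events = [{'type': 'Query', 'info': 'BEGIN'},
          {'type': 'Xid', 'xid': 17}]
commit, rollback = recover(events, prepared_xids={17, 18})
assert commit == {17} and rollback == {18}
</pre>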
<table class="figure-right">
<caption>Figure 2. Parallel binary log group commit</caption>
<tr><td>
<object data="http://www.kindahl.net/images/binlog-crash-state.svg" width="300" height="300" type="image/svg+xml">
<object data="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu-UyBpHLaRINZQNu3SFbD-w5UTs_6GQQ-uAB8fqlu7RiP5uNcnhrThtYwhtoLixfGLSX2Vow0m_Lo0CmIfkjToC-tbVJeS2n3_uAo8ETXtqQT6LauRr3bGdulSf-SJ7qMAPrV/s320/binlog-crash-state.png" width="300" height="300" type="image/png" id="BLOGGER_PHOTO_ID_5506825179747519362">
</object>
</object>
</td></tr>
</table>
<h3>So, what's the problem?</h3>
The procedure above works fine, so what are the problems we have to solve to implement the procedure described in the <a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html">previous article</a>? Figure 2 gives a hint of what the problem is.<p>
Now, assume that threads 1, 2, and 3 in Figure 2 are writing transactions to disk (starting at positions <span class="math">Trans_Pos<sub>1</sub></span>, <span class="math">Trans_Pos<sub>2</sub></span>, and <span class="math">Trans_Pos<sub>3</sub></span> respectively) and that a preceding thread (a thread that got a binlog position before <code>Last_Complete</code>) decides that it is time to call <code>fsync</code> to group commit the state this far. The binlog file will then be written in this state—where some transactions are partially written—and <code>Last_Committed</code> will be set to the value of <code>Last_Complete</code>, leading to the situation depicted in Figure 2.<p>
As you can see in the figure, thread 2 has already finished writing data to the binary log, so its transaction data is on durable storage. Since thread 1—which precedes thread 2 in the binary log—has not completed yet, thread 2 has not yet committed and is still waiting for all the preceding transactions to complete. If a crash occurs in this situation, it is necessary to somehow find the XIDs of all transactions that have committed—excluding the transaction that thread 2 has completed—and commit them to the storage engines when recovering.<p>
<h3>A proposal for a new recovery algorithm</h3>
In the original algorithm, the scan of the binlog file stopped when the file ended. Since there can be partially written events in the binlog file after the "real" end of the file (the binlog file ends logically at <code>Last_Committed</code>/<code>Last_Complete</code>), we have to find some other way to detect the logical end of the file.<p>
To handle this, it is necessary to somehow mark events that are not yet committed so that the recovery algorithm can find the correct position where the binlog file ends. The same problem occurs if one wants to persist the end of the binlog file when <a href="http://forge.mysql.com/worklog/task.php?id=4925">preallocating the binlog file</a>. There are basically three ways to handle this:
<ul>
<li>Write the end of the binlog file in the binlog file header (that is, the Format description log event). </li>
<li>Mark each event by zeroing out a field that cannot be zero—for example, the length, the event type, or event position—before writing the event to the binary log. Then write this field with the correct value after the entire event has been written.</li>
<li><a href="http://forge.mysql.com/worklog/task.php?id=2540">Checksum the events</a> and find the end of the binlog by scanning for the first event with an incorrect checksum.</li>
</ul>
<dl>
<dt><strong>Write the length in the binlog file header</strong></dt>
<dd>Finding the length of the binlog in this case is easy: just inspect the header and find the length of the binlog file there. In this case, it is necessary to update the length after the event has been written since there may be an <code>fsync</code> call at any time between starting to write the event data and finishing writing the event. Normally, this means updating two blocks of the file for each event written, which can be a problem since it requires at least the block containing the header and all the blocks that were written since the last group commit to be written when calling <code>fsync</code>. If a large number of events are written between each <code>fsync</code>, this might not impose a large penalty, but with <code>sync_binlog=1</code> it might become quite expensive. Some experiments done by <a href="http://yoshinorimatsunobu.blogspot.com/">Yoshinori</a> showed a drop from 15k events/sec to 10k events/sec, which means that we lose one third in performance.<p>
<strong>Digression.</strong> The measurements that Yoshinori did consisted of one <code>pwrite</code> to write the event, one <code>pwrite</code> to write the length to the header, and then a call to <code>fsync</code>. It is, in other words, most similar to using <code>sync_binlog=1</code>. In reality, however, this will not be the case, since a user that is using binary log group commit will have several events written between each call to <code>fsync</code>. Since these writes will be to memory (the file pages are in memory), performance will not drop as much. To evaluate the behavior in a group commit situation better, writing 10 events at a time was compared as well (pretending to be <code>sync_binlog=10</code>). Straight append (using <code>write</code>) gave at that point 110k events/sec and writing to the header before calling <code>fsync</code> gave 80k events/sec. This means a performance reduction of 27%, which is an improvement but still a very large overhead.
</dd>
<dt><strong>Use a marker field</strong></dt>
<dd>The second alternative is to use one of the fields as a marker field. By setting one of the fields that cannot be zero to zero, it is possible to detect that the event is incorrect and stop at the event before that. Good candidate fields are the length—which cannot be zero for any event and is four bytes—and the event type, which is one byte and where zero denotes an unknown event, something that never occurs naturally in a binlog file. The technique would be to first blank out the type field of the event, write the event to the binlog file, and then use <code>pwrite</code> to fill in the correct type code after the entire event is written. If an <code>fsync</code> occurs before the event type is written, the event will be marked as unknown, and if a crash occurs before the event is completely written (and written to disk), it will be possible to scan the binlog file to find the first event that is marked as unknown. In order for this technique to work, it is necessary to zero the unused part of the binlog file before starting to write anything there (or at least zero out the event type). Otherwise, crash recovery will not be able to correctly detect where the last completely written event is located.<p>
Compared to the previous approach, this does not require writing to locations far apart (except in rare circumstances when the event spans two pages). It also has the advantage of not requiring any change of the binlog format. This technique is likely to be quite efficient. (Note that most of the writes will be to memory, so there will not be any extraneous "seeks" over the disk to zero out parts of the file.)</dd>
<dt><strong>Checksum on each event</strong></dt>
<dd>The third alternative is to rely on an <a href="http://forge.mysql.com/worklog/task.php?id=2540">event checksum</a> to detect events that are incompletely written. This approach is by far the most efficient of the approaches since the event checksum is naturally written last. It also has the advantage of not requiring the unused parts of the binlog file to be zeroed since it is unlikely that the checksum will be correct for the event unless the event has been fully written. This also makes it a very good candidate for detecting the end of the binlog file when preallocating the binlog file. The disadvantage is, of course, that it requires checksums to be enabled and implemented.</dd>
</dl>
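As a sketch of the marker-field technique just described, the following Python fragment blanks the type byte, writes the event, and then patches the real type in with <code>pwrite</code>. The one-byte-type-plus-body event layout is invented for the illustration; a real event has a full header.

```python
import os, tempfile

UNKNOWN_EVENT = 0  # a type code of zero never occurs naturally

def write_event(fd, pos, event_type, body):
    """Write an event with its type byte blanked, then patch the real
    type in afterwards (toy layout: one type byte followed by the body)."""
    os.pwrite(fd, bytes([UNKNOWN_EVENT]) + body, pos)
    # ... a crash or fsync() here leaves a detectably incomplete event ...
    os.pwrite(fd, bytes([event_type]), pos)
    return pos + 1 + len(body)

def find_logical_end(fd, body_len):
    """Scan fixed-size events until the first one still marked unknown."""
    pos = 0
    while True:
        marker = os.pread(fd, 1, pos)
        if len(marker) < 1 or marker[0] == UNKNOWN_EVENT:
            return pos
        pos += 1 + body_len

tmp = tempfile.TemporaryFile()
fd = tmp.fileno()
os.pwrite(fd, b"\0" * 64, 0)   # the unused tail must be pre-zeroed
pos = write_event(fd, 0, 2, b"abcd")
pos = write_event(fd, pos, 2, b"efgh")
```

Note how the pre-zeroing requirement mentioned above shows up in the sketch: without it, stale bytes after the last event could be mistaken for a valid type code.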
With this in mind, the best approach seems to be to checksum each event and use that to detect the end of the binary log. If necessary, the second approach can be implemented when the binlog is not checksummed.<p>
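A sketch of the checksum-based detection could look as follows. The length-body-CRC32 event layout is made up for this example (the actual checksummed event format is defined by the worklog), but the scanning principle is the same: stop at the first event that is truncated or whose checksum does not match.

```python
import struct, zlib

def encode_event(body):
    """Toy event layout: 4-byte length, body, then CRC32 of the rest."""
    data = struct.pack("<I", len(body)) + body
    return data + struct.pack("<I", zlib.crc32(data))

def find_logical_end(log):
    """Return the offset of the first missing or corrupt event, which
    is the logical end of the binlog after a crash."""
    pos = 0
    while pos + 8 <= len(log):
        (length,) = struct.unpack_from("<I", log, pos)
        end = pos + 4 + length + 4
        if end > len(log):
            break                      # event truncated by the crash
        data = log[pos:pos + 4 + length]
        (crc,) = struct.unpack_from("<I", log, pos + 4 + length)
        if zlib.crc32(data) != crc:
            break                      # partially written event
        pos = end
    return pos

# Two complete events followed by one that was cut short by a crash.
log = encode_event(b"trx1") + encode_event(b"trx2")
log += encode_event(b"trx3")[:-2]
```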
The next article will wrap up the description by pointing out some efficiency issues and how to solve them to get an efficient implementation.Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com3tag:blogger.com,1999:blog-23496029.post-74989429084610725332010-04-30T06:23:00.008+02:002010-08-18T21:02:33.749+02:00Binary Log Group Commit - An Implementation ProposalIt is with interest that I read <a href="http://kristiannielsen.livejournal.com/12254.html">Kristian's</a>
<a href="http://kristiannielsen.livejournal.com/12408.html">three</a>
<a href="http://kristiannielsen.livejournal.com/12553.html">blogs</a>
on the binary log group commit. In these articles, he mentions InnoDB's
<code>prepare_commit_mutex</code> as the main hindrance to accomplish group commits—which it indeed is—and proposes to remove it with the motivation that <code>FLUSH TABLES WITH READ LOCK</code> can be used to get a good binlog position instead. That is a solution—but not really a good solution—as Kristian points out in the last post.<p>
The <code>prepare_commit_mutex</code> is used to ensure that the
order of transactions in the binary log is the same as the order of
transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work, so
removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of
transactions.<p>
To understand how to solve the problem, it is necessary to take a
closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.<p>
The transaction data is stored in a per-thread <em>transaction
cache</em> and the <em>transaction size</em> is the size of the data
in the transaction cache.
In addition, each transaction will have a <em>transaction binlog
position</em> (or just <em>transaction position</em>) where the
transaction data is written in the binary log.<p>
The procedure can be outlined in the following steps:<p>
<ol>
<li>Prepare InnoDB [<code>ha_prepare</code>]:</li>
<ol>
<li>Write prepare record to log buffer</li>
<li><code>fsync()</code> log file to disk (this can currently do
group commit)</li>
<li>Take <code>prepare_commit_mutex</code></li>
</ol>
<li>Log transaction to binary log [<code>TC_LOG_BINLOG::log_xid</code>]:</li>
<ol>
<li>Lock binary log</li>
<li>Write transaction data to binary log</li>
<li>Sync binary log based on <code>sync_binlog</code>. This forces
the binlog to always <code>fsync()</code> (no group commit) due to
<code>prepare_commit_mutex</code></li>
<li>Unlock binary log</li>
</ol>
<li>Commit InnoDB:</li>
<ol>
<li>Release <code>prepare_commit_mutex</code></li>
<li>Write commit record to log buffer</li>
<li>Sync log buffer to disk (this can currently do group commit)</li>
<li>InnoDB locks are released</li>
</ol>
</ol>
There are mainly two problems with this approach:
<ul>
<li>The InnoDB row level and table level locks are released very
late in the sequence, which affects concurrency. Ideally, we need
to release the locks very early, preferably as soon as we have
prepared InnoDB.</li>
<li>It is not possible to perform a group commit in step 2</li>
</ul>
As you can see here, the prepare of the storage engines (in this
case just InnoDB) is done before the binary log mutex is taken, and
that means that if the <code>prepare_commit_mutex</code> is removed it
is possible for another thread to overtake a transaction so that the
prepare and the write to the binary log are done in a different order.<p>
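The ordering problem can be made concrete with a toy simulation (pure illustration, not server code): two transactions prepare in one order but reach the binary log in the other.

```python
# Without a mutex spanning prepare and binlog write, another thread can
# slip in between the two steps. We simulate one such interleaving.
innodb_log = []
binary_log = []

def prepare(trx):
    innodb_log.append(trx)

def write_binlog(trx):
    binary_log.append(trx)

prepare("A")        # thread A prepares first ...
prepare("B")        # ... but thread B overtakes it ...
write_binlog("B")   # ... and reaches the binary log first,
write_binlog("A")   # so the two logs now disagree on the order.
```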
To solve this, Mark suggests using a queue or a ticket system to
ensure that transactions are committed in the same order, but we
actually already have such a system that we can use to assign tickets:
namely the binary log.<p>
The idea is to allocate space in the binary log for the transaction to
be written. This gives us a sequence number that we can use to order
the transactions.<p>
In the <a
href="http://forge.mysql.com/worklog/task.php?id=5223">worklog on
binary log group commits</a> you will find the complete description as
well as the status of the evolving work.<p>
In this post, I will outline an approach that <a
href="http://harrison-fisk.blogspot.com">Harrison</a> and I have
discussed, which we think will solve the problems mentioned above. In
this post, I will outline the procedure during normal operations, in
the <a href="http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html">next post</a> I will discuss recovery, and in the third post (but
likely not the last on the subject), I will discuss some optimizations
that can be done.<p>
I want to emphasize that the fact that we have a worklog does not
involve any guarantees or promises of what, when, or even if any
patches will be pushed to any release of MySQL.<p>
In <a
href="http://forge.mysql.com/worklog/task.php?id=4007">Worklog #4007</a> an
approach for writing the binary log is suggested where space is
allocated for the transaction in the binary log before actually
starting to write it. In addition to avoiding unnecessary locking of
the binary log, it also allows us to use the binary log to order the
transactions in-place. We will use this idea of reserving space in the
binary log to implement the binary log group commit.<p>
By re-structuring the procedure above slightly, we can ensure that
the transactions are written in the same order in both the InnoDB
transaction log and the binary log.<p>
There are two ways to re-structure the code: one simple and one more
complicated that potentially can render better performance. To
simplify the presentation, it is assumed that pre-allocation is
handled elsewhere, for example using <a
href="http://forge.mysql.com/worklog/task.php?id=4925" >Worklog
#4925</a>. In a real implementation, pre-allocation can either be
handled when a new binlog file is created, or when transaction data is
being written to the binary log.
<h2>The sequential write approach</h2>
<table class="figure-right">
<caption>Figure 1. Sequential binary log group commit</caption>
<tr>
<td>
<img style="cursor:pointer; cursor:hand;width: 235px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9jaxd6siz3tZpNCJ1Me9M6gyORNcPRSsAZsphI1b33XnPF7aze0ijuoNlLlQ2Me_d-ImI4AArdijQPaHrMbMKyr6Mxq3b0m4JXcl4z5l_-52Os3opqGYjRtpOx_f11P8TTg1O/s320/binlog-write-simple.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5465798204996562242" />
</td>
</tr>
</table>
In the sequential write approach, the transactions are still written
to the binary log in order and the code is just re-ordered to avoid
keeping mutexes when calling <code>fsync()</code>.
To describe the algorithm, three shared variables are introduced to
keep track of the status of replication:
<dl>
<dt><code>Next_Available</code>
<dd>This variable keeps track of where a new transaction can be written
<dt><code>Last_Committed</code>
<dd>This variable keeps track of the last committed transaction,
meaning that all transactions preceding this position are actually
on disc. This variable is not necessary in the real implementation,
but it is kept here to simplify the presentation of the algorithm.
<dt><code>Last_Complete</code>
<dd>This variable keeps track of the last complete transaction. All
transactions preceding this point are actually written to the binary
log, but are not necessarily flushed to disc yet.
</dl>
You can see an illustration of how the variables are used with the
binary log in Figure 1 where you can also see three threads each
waiting to write a transaction. All three variables are initially set
to the beginning of the binary log, and it is always true that <span class="math">
<code>Last_Committed</code> ≤
<code>Last_Complete</code> ≤ <code>Next_Available</code>
</span>.
The procedure can be described in the following steps:
<ol>
<li>Lock the binary log</li>
<li>Save value of <code>Next_Available</code> in a variable
<span class="math">Trans_Pos</span> and increase
<code>Next_Available</code> with the size of the transaction.</li>
<li>Prepare InnoDB:</li>
<ol>
<li>Write prepare record to log buffer (but do not
<code>fsync()</code> buffer here)</li>
<li>Release row locks</li>
</ol>
<li>Unlock binary log</li>
<li>Post prepare InnoDB:</li>
<ol>
<li><code>fsync()</code> log file to disk, which can now be done
using group commit since no mutex is held.</li>
</ol>
<li>Log transaction to binary log:</li>
<ol>
<li>Wait until <code>Last_Complete</code> =
<code>Trans_Pos</code>. (This can be implemented using a
condition variable and a mutex.)</li>
<li>Write transaction data to binary log using
<code>pwrite</code>. At this point, it is not really necessary to
use <code>pwrite</code> since the transaction data is simply
appended, but it will be used in the second algorithm, so we
introduce it here.</li>
<li>Update <code>Last_Complete</code> to
<code>Trans_Pos</code> + transaction size.<br/></li>
<li>Broadcast the new position to all waiting threads to wake
them up.</li>
<li>Call <code>fsync()</code> to persist binary log on disk. This
can now be group committed.</li>
</ol>
<li>Commit InnoDB:</li>
<ol>
<li>Write commit record to log buffer</li>
<li>Sync log buffer to disk, which currently can be group
committed.</li>
</ol>
</ol>
To implement group commit, it is sufficient to have a condition
variable and wait for that for a specified interval. Once the interval
has passed, the transaction data can call <code>fsync()</code>, after
which it broadcasts the fact that data has been flushed to disc to
other waiting threads so that they can skip this. Typically, the code
looks something along these lines (we ignore checking error codes here
to simplify the description):
<pre class="code">
pthread_mutex_lock(&binlog_lock);
while (Last_Committed < Last_Complete) {
  struct timeval now;
  struct timespec timeout;
  gettimeofday(&now, NULL);
  timeout.tv_sec  = now.tv_sec + (now.tv_usec + 1000) / 1000000;
  timeout.tv_nsec = ((now.tv_usec + 1000) % 1000000) * 1000; /* 1 msec */
  int error = pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    fsync(binlog_fd);
    Last_Committed = Last_Complete;
    pthread_cond_broadcast(&binlog_flush);
  }
}
pthread_mutex_unlock(&binlog_lock);
</pre>
There are a few observations regarding this approach:
<ul>
<li>Step 6a requires a condition variable and a mutex when waiting
for <code>Last_Complete</code> to reach
<code>Trans_Pos</code>. Since there is just a single condition
variable, it is necessary to broadcast a wakeup to <em>all</em>
waiting threads, which each will evaluate the condition just to find
a single thread that should continue, while the other threads go to
sleep again.<p>
This means that the condition will be checked <span
class="math">O(N<sup>2</sup>)</span> times to commit <span
class="math">N</span> transactions.
This is a waste of resources, especially if there are a lot of
threads waiting, and if we can avoid this, we can gain
performance.</li>
<li>Since the thread has a good position in the binary log where it
could write, it could just as well start writing instead of
waiting. It will not interfere with any other threads, regardless of
whether locks are kept or not.</li>
</ul>
These observations lead us to the second approach, that of writing
transaction data to the binary log in parallel.
<h2>A parallel write approach</h2>
<table class="figure-right">
<caption>Figure 2. Parallel binary log group commit</caption>
<tr>
<td>
<img style="cursor:pointer; cursor:hand;width: 235px; height: 320px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOmLGHx-hkW1opDRl1p_ZXDLEylycRwucbtz7bJttwsK48hZ2hPvUOKDptRNh01UveAs1hX5E52EJ93dkVGfFg37iPJY3aFwE9co0_1Ay27xfIHnzD483hYqPcGUw6tJCsZW7w/s320/binlog-write-parallel.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5465798641659622354" />
</td>
</tr>
</table>
In this approach, each session is allowed to write to the binary log
at the same time using <code>pwrite</code> since the space for the
transaction data has already been allocated when preparing the
engines. Figure 2 illustrates how the binary log is filled in (grey
areas) by multiple threads at the same time. Similar to the
sequential write approach, we still have the
<code>Last_Complete</code>, <code>Last_Committed</code>, and
<code>Next_Available</code> variables.<p>
Each thread does not have to wait for other threads before writing,
but it <em>does</em> have to wait for the other threads to
<em>commit</em>.
This is necessary since we required the order of commits in the InnoDB
log and the binary log to be the same. In reality, this does not pose
a problem since the I/O is buffered, hence the writes are done to
in-memory file buffers.<p>
The algorithms look quite similar to the sequential write approach,
but notice that in step 6, the transaction data is simply written to
the binary log using <code>pwrite</code>.
<ol>
<li>Lock the binary log</li>
<li>Save value of <code>Next_Available</code> in a local variable
<code>Trans_Pos</code> and increase
<code>Next_Available</code> with the size of the transaction.</li>
<li>Prepare InnoDB:</li>
<ol>
<li>Write prepare record to log buffer (but do not
<code>fsync()</code> buffer here)</li>
<li>Release row locks</li>
</ol>
<li>Unlock binary log</li>
<li>Post prepare InnoDB:</li>
<ol>
<li><code>fsync()</code> log file to disk, which can now be done
using group commit since no mutex is held.</li>
</ol>
<li>Log transaction to binary log:</li>
<ol>
<li>Write transaction data to binary log using
<code>pwrite</code>. There is no need to keep a lock to protect
the binary log here since all threads will write to different
positions.</li>
<li>Wait until <code>Last_Complete</code> =
<code>Trans_Pos</code>.</li>
<li>Update <code>Last_Complete</code> to <code>Trans_Pos</code> +
transaction size.<br/></li>
<li>Broadcast the new position to all waiting threads to wake
them up.</li>
<li>Call <code>fsync()</code> to persist binary log on disk. This
can now be group committed.</li>
</ol>
<li>Commit InnoDB:</li>
<ol>
<li>Write commit record to log</li>
<li>Sync log file to disk</li>
</ol>
</ol>
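The algorithm above can be modeled with a toy implementation, written here in Python for brevity (the class, its names, and the in-memory buffer are illustrative only; a real implementation writes to the binlog file with <code>pwrite</code>). Space is reserved under the lock, the data is written lock-free into the reserved slot, and <code>Last_Complete</code> is advanced strictly in reservation order.

```python
import threading

class Binlog:
    def __init__(self, size):
        self.buf = bytearray(size)      # stand-in for the binlog file
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.next_available = 0
        self.last_complete = 0

    def reserve(self, nbytes):
        """Steps 1, 2, and 4: allocate space under the binlog lock."""
        with self.lock:
            trans_pos = self.next_available
            self.next_available += nbytes
            return trans_pos

    def log_transaction(self, trans_pos, data):
        # Step 6a: write without holding any lock (the "pwrite").
        self.buf[trans_pos:trans_pos + len(data)] = data
        with self.cond:
            # Steps 6b-6d: wait for our turn, advance, and wake everyone.
            while self.last_complete != trans_pos:
                self.cond.wait()
            self.last_complete = trans_pos + len(data)
            self.cond.notify_all()

log = Binlog(12)
slots = [(log.reserve(4), data) for data in (b"aaaa", b"bbbb", b"cccc")]
# Start the writers in reverse order to show that commit order is
# still dictated by the reservation order.
threads = [threading.Thread(target=log.log_transaction, args=slot)
           for slot in reversed(slots)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```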
This new algorithm has some advantages, but there are a few things to note:
<ul>
<li>When a transaction is committed, it is guaranteed that
<code>Trans_Pos</code> ≥ <code>Last_Committed</code> for all
threads (recall that <code>Trans_Pos</code> is a thread-local
variable).</li>
<li>Writes are done in parallel, but waiting for the condition
in step 6b still requires a broadcast to wake up all waiting
threads, while only one will be allowed to proceed. This means that
we still have the <span class="math">O(N<sup>2</sup>)</span>
complexity of the sequential algorithm. However, for the parallel
algorithm it is possible to improve the performance significantly,
which we will demonstrate in the third part where we will discuss
optimizations to the algorithms.</li>
<li>Recovery in the sequential algorithm is comparably simple since
there are no partially written transactions. If you consider that a
crash can occur in the situation described in Figure 2, it is
necessary to devise a method for correctly recovering. This we will
discuss in the second part of these posts.</li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com8tag:blogger.com,1999:blog-23496029.post-17436719167373415052010-04-13T20:07:00.003+02:002010-04-13T20:45:25.431+02:00MySQL Conference Replication tutorial: Article and Demo SoftwareThe MySQL Conference and Expo started with Lars Thalmann and me doing the <a href="http://en.oreilly.com/mysql2010/user/proposal/status/12432">replication tutorial</a>. Unfortunately, we cannot at this time distribute the slides (please watch the <a href="http://en.oreilly.com/mysql2010/public/schedule/detail/12432">replication tutorial page at the conference site</a>), but there is a replication tutorial package for easy setup of servers to play around with—including some sample scripts—and a paper that explains how the package can be used and gives some example setups.
<ul>
<li>
The software package can be <a href="http://forge.mysql.com/w/images/8/85/Reptut-scripts.zip">downloaded from the forge</a> and requires Perl at least version 5.6.0 to execute.
</li>
<li>The article can also be <a href="http://www.kindahl.net/mats/ReplicationTutorialBooklet.pdf">downloaded from my site</a> (PDF).</li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com0tag:blogger.com,1999:blog-23496029.post-92076722794674697322010-03-05T14:10:00.002+01:002010-03-05T14:39:45.378+01:00Going to the O'Reilly MySQL Conference & Expo<p>As I've been doing the last couple of years, I will be going to the
<a href="http://en.oreilly.com/mysql2010/">O'Reilly MySQL Conference & Expo</a>. In addition to
the tutorial and the replication sessions that I will be holding
together with <a
href="http://en.oreilly.com/mysql2010/public/schedule/speaker/3109">Lars</a>,
I will be holding a session about the binary log together with <a
href="http://en.oreilly.com/mysql2010/public/schedule/speaker/162">Chuck</a>
from the Backup team which the Replication team normally works very
closely with.</p>
<p>This year, O'Reilly also have a <em>Friend of the Speaker</em>
discount of 25% that you can use when you register using the code
<code>mys10fsp</code>.</p>
The sessions that we are going to hold are listed below. Note that I
am using <a href="http://microformats.org">Microformats</a>, which will
allow you to easily extract and add the events to your calendar using,
for example, the <a
href="https://addons.mozilla.org/en-US/firefox/addon/4106">Operator</a>
plugin for Firefox.
<p>See you there!</p>
<dl>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/13476">Mysteries of the Binary Log</a>
<dd>
<abbr title="2010-04-14T10:50-07:00" class="dtstart">April 14th, 2010 10:50am</abbr> -
<abbr title="2010-04-14T11:50-07:00" class="dtend">11:50am</abbr>
Room: <span class="location">Ballroom F</span>
</div>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12451">New Replication Features</a>
<dd>
<abbr title="2010-04-13T14:00-07:00" class="dtstart">April 13th, 2010 2:00pm</abbr> -
<abbr title="2010-04-13T15:00-07:00" class="dtend">3:00pm</abbr>
Room: <span class="location">Ballroom A</span>
</div>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12444">Replication Tricks & Tips</a>
<dd>
<abbr title="2010-04-14T14:00-07:00" class="dtstart">April 14th, 2010 2:00pm</abbr> -
<abbr title="2010-04-14T15:00-07:00" class="dtend">3:00pm</abbr>
Room: <span class="location">Ballroom B</span>
</div>
<div class="vevent">
<dt class="summary"><a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12432">The Replication Tutorial</a>
<dd>
<abbr title="2010-04-12T08:30-07:00" class="dtstart">April 12th, 2010 8:30am</abbr> -
<abbr title="2010-04-12T12:00-07:00" class="dtend">12:00pm</abbr>
Room: <span class="location">Ballroom E</span>
</div>
</dl>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com0tag:blogger.com,1999:blog-23496029.post-44836689564824997382010-02-03T21:19:00.003+01:002010-02-03T21:37:24.008+01:00MySQL Replicant: Architecture<table id="class-design" class="figure" noborder>
<caption align="bottom">MySQL Replicant Library<br>Class Design</caption>
<tr><td><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT53S4pBAxDST4X96TdVub3YTzjUgqmKCk3SorL5Il8NKjOiNzKqgOjRVflwYMIDapDn99elGw1L68wPy5Jjm7QirrC5Z-3L3nTSiFeZdrgrhGABX0LqZKJ2QBDHzvlLEkXLb6/s1600-h/class-diagram.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 223px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT53S4pBAxDST4X96TdVub3YTzjUgqmKCk3SorL5Il8NKjOiNzKqgOjRVflwYMIDapDn99elGw1L68wPy5Jjm7QirrC5Z-3L3nTSiFeZdrgrhGABX0LqZKJ2QBDHzvlLEkXLb6/s320/class-diagram.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5434117168518305154" /></a>
</td></tr>
</table>
In the <a
href="/2009/12/mysql-replicant-library-for-controlling.html">previous
post</a> I described the first steps of a Python library for
controlling the replication of large installations. The intention of
the library is to provide a uniform interface to such installations
and that will allow procedures for handling various situations to be
written in a uniform language.<p>
For the library to be useful, it is necessary to support installations
that use different operating systems for the machines, as well as
different versions of the servers. Specifically, it is necessary to
allow some aspects of the system to vary.
<ul>
<li><p>Depending on the operating system, or even just how the server
is installed on the machine, the procedures for bringing the server
down and up will differ.</p></li>
<li><p>Configurations are managed in different ways depending on the
deployment, and there are various other tools to manage
configurations of large systems.</p>
<p>As part of the management of the topology, it is necessary to
change the configuration files, but this should play well with
other tools.</p>
<p>In either case, any specific method for configuration handling
should neither be required nor enforced.</p></li>
<li>In the example in the <a
href="/2009/12/mysql-replicant-library-for-controlling.html">previous
article</a>, the technique for cloning a server was demonstrated. In
this case the naive method of copying the database files was
used. For the general case, however, <em>some</em> backup method
will be used, but it depends on the requirements of the
deployment. In other words, it is necessary to parameterize the
backup method as well.</li>
<li><p>Each server in the system has a specific <em>role</em> to
fulfill. Some servers are final slaves whose only purpose is to
answer queries, at least one server is a master, and some servers
are relay servers.</p></li>
</ul>
<p>To allow the system to be parameterized on these aspects, a set of
abstract classes is introduced. In the figure you can see a UML
diagram describing the high-level architecture of the Replicant
library.</p>
In the figure, there are four abstract classes:
<dl>
<dt><code>Machine</code>
<dd>The responsibility of this class is to handle all issues that
are specific to the remote operating system, for example, to fetch
files or issue commands to start and stop the server.
<dt><code>Config</code>
<dd>The responsibility of this class is to maintain the
configuration of a server. To do this, it may need to parse
configuration files to be able to extract the specific section
containing the definition.
<dt><code>BackupMethod</code>
<dd>The responsibility of this class is to provide the primitives to
create a backup and restore a backup. In both cases, the class
supports taking a backup and potentially placing the backup image at
a different machine, and restoring it.
<dt><code>Role</code>
<dd>The responsibility of this class is to provide all the
information necessary to configure a server in a role. Since the
role not only entails pure configuration information, but can
also involve keeping certain tables and other database objects
available, this is modeled as a separate class.
</dl>
The central <code>Server</code> class relies on a <code>Machine</code>
instance and a <code>Config</code> instance to implement the interface
to the machine and to the configuration, respectively.<p>
<h2>Configuration Management</h2>
<p>The configuration of the server is made part of the Replicant library
since manipulating the server configuration is usually necessary when
changing roles of servers.</p>
<p>Depending on the deployment, other configuration managers such as
<a href="http://www.gnu.org/software/cfengine/">cfengine</a> or <a
href="http://reductivelabs.com/trac/puppet">puppet</a> are used to
administer the configuration of all servers, while others hand-edit
the configuration files (which is only feasible for small configurations,
since it would be a pain to administer larger deployments in this
way).</p>
<p>Long-term, there should be support for some safety measures when
working with server configurations, so implementing an interface for
handling server configurations in a safe transaction-like
manner—or maybe this should be called an RCU-style
manner—seems like a good idea. To support that, the following
methods to fetch and replace configurations are introduced.</p>
<dl>
<dt class="func">Server.fetch_config()
<dd>Returns a <code>Config</code> instance of the configuration for
the server.
<dt class="func">Server.replace_config(<var>config</var>)
<dd>Replace the configuration of the server with the modified
configuration instance <var>config</var>.
</dl>
<p>This will allow an implementation to keep version numbers around to
avoid conflicts, but is not required by the interface.</p>
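<p>As an illustration of how version numbers could support such a transaction-like (or RCU-style) replace, here is a minimal sketch. The <code>VersionedConfigStore</code> class, its dictionary-based configuration, and the <code>ConflictError</code> exception are hypothetical illustrations of the idea, not part of the library:</p>

```python
# Minimal sketch of RCU-style configuration replacement: a version
# number recorded at fetch time detects conflicting updates at
# replace time. All names here are hypothetical illustrations.

class ConflictError(Exception):
    """Raised when the configuration changed between fetch and replace."""

class VersionedConfigStore:
    def __init__(self):
        self._options = {}
        self._version = 0

    def fetch_config(self):
        # Hand out a copy together with the version it was based on.
        return dict(self._options), self._version

    def replace_config(self, options, based_on):
        # Refuse the update if someone replaced the config in between.
        if based_on != self._version:
            raise ConflictError("configuration was modified concurrently")
        self._options = dict(options)
        self._version += 1

store = VersionedConfigStore()
config, version = store.fetch_config()
config['log-bin'] = 'master-bin'
store.replace_config(config, version)      # succeeds: no concurrent change
stale, stale_version = store.fetch_config()
store.replace_config({'log-bin': 'other-bin'}, stale_version)
try:
    store.replace_config(stale, stale_version)  # now stale: rejected
except ConflictError:
    pass
```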
<p>Each <code>Config</code> instance can then be manipulated by using
the following methods:</p>
<dl>
<dt class="func">Config.get(<var>option</var>)
<dd>Get the value of <var>option</var> as a string.
<dt class="func">Config.set(<var>option</var>[, <var>value</var>])
<dd>Set the value of <var>option</var> to <var>value</var>. If no
<var>value</var> is supplied, <code>None</code> is used, which
denotes that the option is set but not given a specific string
value.
<dt class="func">Config.remove(<var>option</var>)
<dd>Remove the <var>option</var> from the configuration instance
entirely.
</dl>
So, for example, the <var>log-bin</var> option can be set in the
following manner:
<pre class="code">
config = server.fetch_config()
config.set('log-bin', 'master-bin')
server.replace_config(config)
</pre>
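<p>To give an idea of what the configuration-file parsing mentioned earlier could look like, here is a sketch of a <code>Config</code> stand-in that reads a my.cnf-style text and implements the three methods. The class name and the assumption that the server's options live under a <code>[mysqld]</code> section are mine, not the library's:</p>

```python
# Sketch of a Config stand-in: parse a my.cnf-style text, keep only
# the options from the section containing the server definition, and
# expose the get/set/remove interface described above.

class SimpleConfig:
    def __init__(self, text, section='mysqld'):
        self._options = {}
        current = None
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            if line.startswith('[') and line.endswith(']'):
                current = line[1:-1]          # entering a new section
            elif current == section:
                # Options may be bare ("skip-networking") or key=value;
                # a bare option is stored with the value None.
                key, _, value = line.partition('=')
                self._options[key.strip()] = value.strip() or None

    def get(self, option):
        return self._options[option]

    def set(self, option, value=None):
        self._options[option] = value

    def remove(self, option):
        del self._options[option]

config = SimpleConfig("[client]\nport = 3306\n"
                      "[mysqld]\nserver-id = 1\nskip-networking\n")
config.set('log-bin', 'master-bin')
```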
<h2>Machines</h2>
<p>A MySQL server can run on many different machines and in many
setups. A server can run on Linux, Solaris, or Windows, and in each
case there can be multiple servers on a single machine.</p>
<p>For a Linux machine with a single server, one usually uses the
script <file>/etc/init.d/mysql</file> to start and stop the
server—at least on my Ubuntu—but if multiple servers are
used on a single machine, then <file>mysqld_multi</file> should be
used instead.</p>
<p>For Windows and Solaris, the procedures for starting and stopping
servers are entirely different. Windows starts and stops the servers
using <code>net start MySQL</code> and <code>net stop MySQL</code>,
while Solaris uses the <em><a
href="http://docs.sun.com/app/docs/doc/819-2240/svcadm-1m?a=view">svcadm(1M)</a></em> command.</p>
<p>To parameterize the system over the various ways it can be
installed, the concept of a <em>Machine</em> is introduced (I actually
had problems figuring out a name for this, but this was suggested to
me and seems to be good enough).</p>
<p>The responsibility of the <code>Machine</code> class is to provide
an interface to access the installed server together with installation
information such as the location of configuration files.</p>
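<p>As a sketch of this parameterization, here are hypothetical <code>Machine</code> subclasses that answer only the question of which command starts or stops the server on each platform. The method names and the Solaris service name are assumptions; the command lists mirror the procedures described above:</p>

```python
# Sketch of platform-specific Machine classes. Each subclass knows how
# the server is started and stopped on its platform; the rest of the
# library can stay platform-agnostic.

class Machine:
    """Interface to the way a server is installed on a host."""
    def start_command(self):
        raise NotImplementedError
    def stop_command(self):
        raise NotImplementedError

class Linux(Machine):
    # Single-server installation managed through an init script.
    def start_command(self):
        return ["/etc/init.d/mysql", "start"]
    def stop_command(self):
        return ["/etc/init.d/mysql", "stop"]

class Windows(Machine):
    # Windows controls the server as a service via "net".
    def start_command(self):
        return ["net", "start", "MySQL"]
    def stop_command(self):
        return ["net", "stop", "MySQL"]

class Solaris(Machine):
    # Solaris manages services through svcadm(1M).
    def start_command(self):
        return ["svcadm", "enable", "mysql"]
    def stop_command(self):
        return ["svcadm", "disable", "mysql"]
```

A <code>Server.start()</code> could then simply run <code>self.machine.start_command()</code> over the SSH connection, without knowing which platform it is talking to.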
<h2>BackupMethod</h2>
<p>One of the more important techniques when managing a set of servers is
the ability to clone a slave or a master to create new slaves. Cloning
involves taking a backup of a server and then restoring the backup
image on the new slave. Since the techniques for taking backups vary
a lot and different techniques will be used in different situations,
parameterizing over the various backup methods is sensible.</p>
<dl>
<dt class="func">BackupMethod.backup_to(<var>server</var>, <var>url</var>)
<dd>This method will take a backup of <var>server</var> and store it
at the location indicated by <var>url</var>.
<dt class="func">BackupMethod.restore_from(<var>server</var>, <var>url</var>)
<dd>This method will restore the backup image indicated by
<var>url</var> into <var>server</var>.
</dl>
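<p>A naive backup method along these lines could archive the data directory with <code>tar</code> and move the image with <code>scp</code>. The sketch below only illustrates the interface; the <code>TarBackup</code> name, the data directory path, the temporary file location, and the injectable <code>run</code> parameter are assumptions, not the library's implementation:</p>

```python
# Sketch of a BackupMethod stand-in: archive the data directory on the
# server with tar, then move the image with scp. The `run` parameter
# is injectable so the class can be exercised without real scp calls.

from subprocess import call

class TarBackup:
    def __init__(self, datadir="/usr/var/mysql", run=call):
        self.datadir = datadir
        self.run = run   # local command runner

    def backup_to(self, server, url):
        # Create the archive on the server, then copy it to `url`
        # (assumed to be an scp-style host:path destination).
        server.ssh(["tar", "zcf", "/tmp/backup.tar.gz", self.datadir])
        self.run(["scp", server.host + ":/tmp/backup.tar.gz", url])

    def restore_from(self, server, url):
        # Copy the image to the server's host and unpack it in place.
        self.run(["scp", url, server.host + ":/tmp/backup.tar.gz"])
        server.ssh(["tar", "zxf", "/tmp/backup.tar.gz", "-C", "/"])
```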
<h2>Role</h2>
<p>In a deployment, each server is configured to play a specific
<em>role</em>. It can either be acting as a master, a slave, or even a
relay. To represent a role, a separate <code>Role</code> class is
introduced. Once a role is created, a server can be <em>imbued</em>
with it.</p>
<ul>
<li>Not every server has an assigned role.</li>
<li>Each server can have at most one role.</li>
<li>Each role can be assigned to multiple servers.</li>
</ul>
<p>Since a role may encompass much more than just setting some
configuration parameters, this more flexible approach was chosen.
When imbuing a server with a role, a piece of Python code is executed
to configure the server correctly.</p>
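<p>A sketch of what such a piece of Python code could look like when imbuing a server as a master; the <code>Master</code> class, the <code>imbue</code> method name, and the replication account details are hypothetical illustrations:</p>

```python
# Sketch of a Role subclass whose imbue() runs the code needed to turn
# a server into a master: it touches both the configuration and the
# database objects, which is why a plain option list is not enough.

class Role:
    def imbue(self, server):
        raise NotImplementedError

class Master(Role):
    def __init__(self, repl_user="mysql_replicant", repl_password="xyzzy"):
        self.repl_user = repl_user
        self.repl_password = repl_password

    def imbue(self, server):
        # Configuration side: enable the binary log.
        config = server.fetch_config()
        config.set('log-bin', 'master-bin')
        server.replace_config(config)
        # Database-object side: ensure a replication account exists.
        server.sql("GRANT REPLICATION SLAVE ON *.* TO '%s'@'%%' "
                   "IDENTIFIED BY '%s'"
                   % (self.repl_user, self.repl_password))
```

Imbuing then amounts to calling <code>Master().imbue(server)</code> for each server that should act as a master.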
<p>The use of roles in this case is just one of many design choices,
and with this approach there are two different ways
that roles can be used. I am slightly undecided between the two and would
like to hear comments on which one to use.</p>
<ol>
<li>Roles are applied only to the initial deployment and do not
play any part after the system has been deployed. Roles are imbued
into a server initially, and then the configuration of the server
can be changed by procedures that manipulate the deployment.</li>
<li>Roles exist in the entire deployment, and when a server changes
roles in the deployment, the <code>Role</code> instance will also
change. Every server is assigned a role in the system, which is
represented using a subclass of the <code>Role</code> class.</li>
</ol>
<p>The first is by far the easiest to implement, which is why I chose
it at this time. Since the roles are just containers for
configuration options and other items that need to be added, they are
easy to write. Since this is what is used in the library currently, it
is also what you see in the class design above.</p>
<p>The second approach seems better, but it has a number of
consequences:</p>
<ul>
<li>Every server has to have a role class associated with it; even
an "initial" role is required.</li>
<li>If the role changes, another role class will be associated with
the server. This forces the role class to not only be able to imbue a
server in a role, but to also <em>unimbue</em> the server from that
role.</li>
<li>It must not be possible to change the configuration of a server
directly; changes have to take the form of defining a role and then
moving the server to that role. Unimbuing the server from a role
becomes very hard if the configuration of the server is changed
outside the control of the role.</li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com3tag:blogger.com,1999:blog-23496029.post-8043952342303681212009-12-18T17:12:00.005+01:002009-12-28T10:11:14.828+01:00MySQL Replicant: a library for controlling replication deploymentsKeeping a MySQL installation up and running can be quite tricky at
times, especially when having many servers to manage and monitor. In
the replication tutorials at the annual MySQL Users' Conference, we
demonstrate how to set up replication appropriately and also how to
handle various issues that can arise.
Many of these procedures are routine: bring down the server, edit the
configuration file, bring the server up again, start a
<code>mysql</code> client and add a user, etc.<p>
It has always annoyed me that these procedures are perfect candidates
for automation, but that we do not have the necessary interfaces to
manipulate an entire installation of MySQL servers.<p>
If there were an interface with a relatively small set of
primitives&#8212;re-directing servers, bringing servers down, adding a
line to the configuration file, etc.&#8212;it would be possible to
create pre-canned procedures that can just be executed.<p>
To that end, I started writing a library that would provide such an
interface. Although I am more familiar with Perl, I picked Python for
this project, since it seems to be widely used by many
database administrators (it's just a feeling I have, I have no figures
to support it). Just to have a cool name for the library, we call it
<em>MySQL Replicant</em> and it is (of course) <a href="https://launchpad.net/mysql-replicant-python">available at
Launchpad</a>.<p>
So what do we want to achieve with a library like this?
Well... the goal is to provide a generic interface to complete
installations and thereby make administration of large installations
easy.<p>
Such an interface allows procedures to be described in an
executable format, namely as Python scripts.<p>
In
addition to making it easy to implement common tasks for experienced
database administrators, it also promotes sharing by providing a way
to write complete scripts for solving common problems. Having a pool
of such scripts makes it easier for newcomers to get up and
running.<p>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqFITAUw1_8GcoVLuCEsEKzhY47Z7B65XiHUwwYG-Tf-1F9Y1vb4EkQ92b3A1HqGTMvp_u7NlkItHYK_OuJMoiSGUDABCn3UdQIzflBDpSIfpjyr1VVGqilsMnGvmTRwzndj9a/s1600-h/topology-with-model.png"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 289px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqFITAUw1_8GcoVLuCEsEKzhY47Z7B65XiHUwwYG-Tf-1F9Y1vb4EkQ92b3A1HqGTMvp_u7NlkItHYK_OuJMoiSGUDABCn3UdQIzflBDpSIfpjyr1VVGqilsMnGvmTRwzndj9a/s320/topology-with-model.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5420211919263763954" /></a>
The basic idea is that you create a model of the installation on a
computer and then manipulate the model. When doing these
manipulations, the appropriate commands—either as SQL commands
to a running server or shell commands to the host where the server is
running—will then be sent to the servers in the installation to
configure them correctly. <p>
So, to take small example, how does the code for re-directing a bunch
of servers to a master look?
<pre class="code">
import mysqlrep, my_servers
for slave in my_servers.slaves:
mysqlrep.change_master(slave, my_servers.master)
</pre>
In this case, the installation is defined in a separate file and is
imported as a Python module. Right now, the interface for specifying
a topology is quite rough, but this is going to change.
<pre class="code">
from mysqlrep import Server, User, Linux
servers = [Server(server_id=1, host="server1.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux()),
Server(server_id=2, host="server2.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux()),
Server(server_id=3, host="server3.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux()),
Server(server_id=4, host="server4.example.com",
sql_user=User("mysql_replicant", "xyzzy"),
ssh_user=User("mysql_replicant"),
machine=Linux())]
master = servers[0]
slaves = servers[1:]
</pre>
Here, the <code>Server</code> class represents a server, and to do
its job, it needs one MySQL account on the
server and one shell account on the host machine. Right now, it is
also necessary to specify the server ID, but the plan is to only
require the host, port, socket, SQL account name, and SSH account
information. The remaining information can then be fetched from the
configuration file of the server. Each server has a small set of
primitives on top of which everything else is built:
<dl>
<dt><code>Server.sql(<em>SQL command</em>)</code>
<dd>Execute the SQL command and return a result set.
<dt><code>Server.ssh(<em>command list</em>)</code>
<dd>Execute the command given by the command list and return an
iterator over the output.
<dt><code>Server.start()</code>
<dd>Start the server.
<dt><code>Server.stop()</code>
<dd>Stop the server.
</dl>
There is a small set of commands defined on top of these primitives
that can be used. Here is a list of just a few of them, but there are
some more in the library at Launchpad.
<dl>
<dt><code>change_master(slave, master, position=None)</code>
<dd>Change the master of <code>slave</code> to be
<code>master</code> and start replicating from
<code>position</code>.
<dt><code>fetch_master_pos(server)</code>
<dd>Fetch the master position of <code>server</code>, which is the
position where the last executed statement ends in the binary log.
<dt><code>fetch_slave_pos(server)</code>
<dd>Fetch the slave position of <code>server</code>, which is the
position where the last executed event ends.
<dt><code>flush_and_lock_database(server)</code>
<dd>Flush all tables on <code>server</code> and lock the database
for read.
<dt><code>unlock_database(server)</code>
<dd>Unlock a previously locked database.
</dl>
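<p>To show how such commands can be built on top of the <code>Server.sql</code> primitive, here is a rough sketch of two of them. The result-set shape (a list of dictionaries keyed by column name) and the <code>Position</code> tuple are assumptions for illustration, not the library's actual implementation:</p>

```python
# Sketch of higher-level commands built on the Server.sql primitive.
# Assumes sql() returns a list of dicts keyed by column name.

from collections import namedtuple

Position = namedtuple('Position', ['file', 'pos'])

def fetch_master_pos(server):
    # SHOW MASTER STATUS reports where the last executed statement
    # ends in the binary log.
    row = server.sql("SHOW MASTER STATUS")[0]
    return Position(row['File'], int(row['Position']))

def change_master(slave, master, position=None):
    # Re-direct the slave to the master and start replicating from
    # the given position (or the master's current position).
    if position is None:
        position = fetch_master_pos(master)
    slave.sql("CHANGE MASTER TO MASTER_HOST = '%s', MASTER_PORT = %d, "
              "MASTER_LOG_FILE = '%s', MASTER_LOG_POS = %d"
              % (master.host, master.port, position.file, position.pos))
    slave.sql("START SLAVE")
```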
Using these primitives, it is easy to clone a master by executing the
code below. For this example, I use the quite naive method of backing
up a database by creating an archive of the database files and copying
them to the new slave.
<pre class="code">
from mysqlrep import (flush_and_lock_database, unlock_database,
                      fetch_master_pos, change_master)
from subprocess import call
backup_name = "backup.tar.gz"
flush_and_lock_database(master)
position = fetch_master_pos(master)
master.ssh(["tar", "Pzcf", backup_name, "/usr/var/mysql"])
unlock_database(master)
call(["scp", master.host + ":" + backup_name, slave.host + ":."])
slave.stop()
slave.ssh(["tar", "Pzxf", backup_name, "/usr/var/mysql"])
slave.start()
change_master(slave, master, position)
</pre>
What do you think? Would this be a valuable project to pursue?
Here are some links related to this post:
<ul>
<li><a href="http://mituzas.lt/2009/08/17/dba-python-scripts/">Domas' post on "MySQL DBA, python edition"</a></li>
<li><a href="http://bitbucket.org/david415/mysql-cluster-tools/">Code from Spinn3r</a></li>
</ul>Mats Kindahlhttp://www.blogger.com/profile/07528917029894926261noreply@blogger.com4