<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-23496029</id><updated>2012-01-26T20:31:48.056+01:00</updated><category term='connector'/><category term='virtual table'/><category term='proxy'/><category term='locators'/><category term='quilt'/><category term='multi-threaded'/><category term='isolation'/><category term='hash'/><category term='status'/><category term='community'/><category term='feisty'/><category term='scaling'/><category term='reengineering'/><category term='poll'/><category term='conference'/><category term='resolution'/><category term='lua'/><category term='protobuf'/><category term='binary'/><category term='concurrent'/><category term='binary log'/><category term='blob'/><category term='mysql replicant'/><category term='detection'/><category term='transactions'/><category term='python'/><category term='planning'/><category term='rss'/><category term='manipulator'/><category term='slave'/><category term='performance'/><category term='c++'/><category term='binlog'/><category term='locator'/><category term='scale-up'/><category term='database'/><category term='transactional'/><category term='feed'/><category term='transaction'/><category term='threads'/><category term='mysql'/><category term='kubuntu'/><category term='programming'/><category term='tutorial'/><category term='sqlite'/><category term='stopping'/><category term='streaming'/><category term='scale-out'/><category term='multi-core'/><category term='syndication'/><category term='bash'/><category term='join'/><category term='concurrency'/><category term='API'/><category term='thread'/><category term='databases'/><category term='online'/><category 
term='building'/><category term='conflict'/><category term='crash-safe'/><category term='drizzle'/><category term='fawn'/><category term='multi-master'/><category term='multi-source'/><category term='feature'/><category term='build'/><category term='interceptor'/><category term='expo'/><category term='log'/><category term='atom'/><category term='round-robin'/><category term='testing'/><category term='master_pos_wait'/><category term='error'/><category term='semi-synchronous'/><category term='replication'/><category term='falcon'/><category term='google'/><category term='bisect'/><category term='checksum'/><title type='text'>MySQL Musings</title><subtitle type='html'>Various comments on using and developing MySQL, especially focused on replication.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>41</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-23496029.post-8853718009688026024</id><published>2012-01-23T21:55:00.002+01:00</published><updated>2012-01-23T22:13:07.369+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='interceptor'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category 
scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='connector'/><title type='text'>MySQL: Python, Meta-Programming, and Interceptors</title><content type='html'>I recently found Todd's &lt;a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-lifecycle-interceptors/" &gt;posts&lt;/a&gt; &lt;a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-statement-interceptors/" &gt;on&lt;/a&gt; &lt;a href="http://mysqlblog.fivefarmers.com/2011/11/21/connectorj-extension-points-%E2%80%93-exception-interceptors/"&gt;interceptors&lt;/a&gt; which allow callbacks (called &lt;em&gt;interceptors&lt;/em&gt;) to be registered with the connector so that you can intercept a statement execution, commit, or any of the many extension points supported by Connector/Java.  This is a feature that allows you to implement a number of new features, such as load-balancing policies, profiling of queries or transactions, or debugging of an application, without having to change the application code.&lt;p&gt;

Since Python is a dynamic language, it is easy to add interceptors to &lt;em&gt;any&lt;/em&gt; method in Connector/Python, without having to extend the connector with specific code. This is something that is possible in dynamic languages such as Python, Perl, JavaScript, and even some lesser known languages such as Lua and Self. In this post, I will describe how and also give an introduction to some of the (in my view) more powerful features of Python.&lt;p&gt;

In order to create an interceptor, you need to be able to do these things:

&lt;ul&gt;
  &lt;li&gt;Catch an existing method in a class and replace it with a new one.&lt;/li&gt;
  &lt;li&gt;Call the original function, if necessary.&lt;/li&gt;
  &lt;li&gt;For extra points: catch an existing method in an &lt;em&gt;object&lt;/em&gt; and replace it with a new one.&lt;/li&gt;
&lt;/ul&gt;

In this post, you will see how all three of these problems are solved in Python. You will see and use &lt;em&gt;decorators&lt;/em&gt; to be able to define methods in existing classes and objects, and closures to be able to call the original version of the methods.  By picking this approach, it will not be necessary to change the implementation: in fact, you can use this code to replace &lt;em&gt;any&lt;/em&gt; method in &lt;em&gt;any&lt;/em&gt; class, not only in Connector/Python.&lt;p&gt;

&lt;table border="" width="" class="figure-right"&gt;
&lt;caption&gt;Table 1. Attributes for methods&lt;/caption&gt;
&lt;TR&gt;
  &lt;TH&gt;&lt;/TH&gt;
  &lt;TH colspan="2"&gt;Method Instance&lt;/TH&gt;
&lt;/TR&gt;
&lt;TR&gt;
  &lt;TH&gt;Name&lt;/TH&gt;
  &lt;TH&gt;Unbound&lt;/TH&gt;
  &lt;TH&gt;Bound&lt;/TH&gt;&lt;/TR&gt;
&lt;TR&gt;
  &lt;TD&gt;&lt;code class="symbol"&gt;__name__&lt;/code&gt;&lt;/TD&gt;
  &lt;TD align="center" colspan="2"&gt;Name of Method&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
  &lt;TD&gt;&lt;code class="symbol"&gt;im_func&lt;/code&gt;&lt;/TD&gt;
  &lt;TD align="center" colspan="2"&gt;"Inner" function of the method&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
  &lt;TD&gt;&lt;code class="symbol"&gt;im_self&lt;/code&gt;&lt;/TD&gt;
  &lt;TD&gt;&lt;code class="constant"&gt;None&lt;/code&gt;&lt;/TD&gt;
  &lt;TD align="center"&gt;Class instance for the method&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
  &lt;TD&gt;&lt;code class="symbol"&gt;im_class&lt;/code&gt;&lt;/TD&gt;
  &lt;TD align="center" colspan="2"&gt;Class that the method belongs to&lt;/TD&gt;
&lt;/TR&gt;
&lt;/table&gt;

In addition to being able to replace methods in the class, we would also like to be able to replace methods in instances of a class ("objects" in the traditional sense). This is useful to create specialized objects, for example for tracking particular cases where a method is used.&lt;p&gt;

In order to understand how the replacement works, you should understand that in Python (and the dynamic languages mentioned above), all objects can have attributes, including classes, functions, and a bunch of other esoteric constructions. Each type of object has a set of pre-defined attributes with well-defined meaning. For classes (and class instances), methods are stored as attributes of the class (or class instance) and can therefore be replaced with other methods that you build dynamically. However, it requires some tinkering to take an existing "normal" function definition and "imbue" it with whatever "tincture" that makes it behave as a method of the class or class instance.&lt;p&gt;

Depending on where the method comes from, it can be either &lt;em&gt;unbound&lt;/em&gt; or &lt;em&gt;bound&lt;/em&gt;. Unbound methods are roughly equivalent to member function pointers in C++: they reference a function, but not the instance. In contrast, bound methods have an instance tied to them, so when you call them, they already know what instance they belong to and will use it. Methods have a set of attributes, of which the four in Table&amp;nbsp;1 interest us.  If a method is fetched from a class (to be precise, from a class object), it will be unbound and &lt;code&gt;im_self&lt;/code&gt; will be &lt;code&gt;None&lt;/code&gt;. If the method is fetched from a class &lt;em&gt;instance&lt;/em&gt;, it will be bound and &lt;code class="symbol"&gt;im_self&lt;/code&gt; will be set to the instance it belongs to. These attributes are all the "tincture" you need to make your own instance methods.
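
As a small illustration of the difference, the following sketch inspects a method fetched from an instance and from a class. The class &lt;code&gt;Greeter&lt;/code&gt; is made up for the example, and the sketch uses the attribute names &lt;code&gt;__self__&lt;/code&gt; and &lt;code&gt;__func__&lt;/code&gt;, which are aliases of &lt;code&gt;im_self&lt;/code&gt; and &lt;code&gt;im_func&lt;/code&gt; since Python 2.6 and the only spelling in Python 3 (where a method fetched from a class is a plain function rather than an unbound method).

```python
class Greeter(object):
    def hello(self):
        return "hello from %s" % type(self).__name__

greeter = Greeter()

# Fetched from the instance: a bound method that remembers its instance.
bound = greeter.hello
bound_self = bound.__self__
inner = bound.__func__

# Fetched from the class: no instance is attached. Using getattr with a
# default covers both Python 2 (unbound method, __self__ is None) and
# Python 3 (plain function without a __self__ attribute).
from_class = Greeter.hello
class_self = getattr(from_class, '__self__', None)

# The bound method supplies the instance itself; otherwise you pass it.
bound_result = bound()
explicit_result = from_class(greeter)
```

Both calls end up in the same inner function; the only difference is who supplies the instance.&lt;p&gt;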

The code for doing the replacement described above is simply:

&lt;pre class="code"&gt;
&lt;code class="keyword"&gt;import&lt;/code&gt; functools, types

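&lt;code class="keyword"&gt;def&lt;/code&gt; replace_method(orig, func):
    &lt;span style="color: blue; font-style: italic"&gt;# Make func look like the original method function&lt;/span&gt;
    functools.update_wrapper(func, orig.im_func)
    &lt;span style="color: blue; font-style: italic"&gt;# Same instance and class as the original, but with func inside&lt;/span&gt;
    new = types.MethodType(func, orig.im_self, orig.im_class)
    &lt;span style="color: blue; font-style: italic"&gt;# Bound method: install in the instance; unbound: install in the class&lt;/span&gt;
    obj = orig.im_self or orig.im_class
    setattr(obj, orig.__name__, new)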
&lt;/pre&gt;

The function uses two standard modules to make the job simpler, but the steps are:
&lt;ol&gt;
  &lt;li&gt;Copy the meta-information from the original method function to the new function using &lt;code&gt;update_wrapper&lt;/code&gt;. This copies the name, module information, and documentation from the original method function to make it look like the original method.&lt;/li&gt;
  &lt;li&gt;Create a new method instance from the method information of the original method using the constructor &lt;code&gt;MethodType&lt;/code&gt;, but replace the "inner" function with the new function.&lt;/li&gt;
  &lt;li&gt;Install the new instance method in the class or instance by replacing the attribute denoting the original method with the new method.  Depending on whether the function is given a bound or unbound instance, either the method in the class or in the instance is replaced.&lt;/li&gt;
&lt;/ol&gt;

Using this function you can now replace a method in a class like this:

&lt;pre class="code"&gt;
&lt;code class="keyword"&gt;from&lt;/code&gt; mysql.connector &lt;code class="keyword"&gt;import&lt;/code&gt; MySQLCursor

&lt;code class="keyword"&gt;def&lt;/code&gt; my_execute(self, operation, params=None):
  ...

replace_method(MySQLCursor.execute, my_execute)
&lt;/pre&gt;

This is already pretty useful, but note that you can also replace the method for a specific instance only by using &lt;code&gt;replace_method(cursor.execute, my_execute)&lt;/code&gt;. It was not necessary to change anything inside Connector/Python to intercept a method there, so you can actually apply this to any method in any of the classes in Connector/Python that you already have available.  In order to make it even easier to use, you'll see how to define a &lt;em&gt;decorator&lt;/em&gt; that will install the function in the correct place at the same time as it is defined. The code for defining a decorator and an example usage is:

&lt;pre class="code"&gt;
&lt;code class="keyword"&gt;import&lt;/code&gt; functools, types
&lt;code class="keyword"&gt;from&lt;/code&gt; mysql.connector &lt;code class="keyword"&gt;import&lt;/code&gt; MySQLCursor

&lt;code class="keyword"&gt;def&lt;/code&gt; intercept(orig):
    &lt;code class="keyword"&gt;def&lt;/code&gt; wrap(func):
        functools.update_wrapper(func, orig.im_func)
        meth = types.MethodType(func, orig.im_self, orig.im_class)
        obj = orig.im_self or orig.im_class
        setattr(obj, orig.__name__, meth)
        return func
    return wrap

&lt;span style="color: blue; font-style: italic"&gt;# Define a function using the decorator&lt;/span&gt;
@intercept(MySQLCursor.execute)
&lt;code class="keyword"&gt;def&lt;/code&gt; my_execute(self, operation, params=None):
  ...
&lt;/pre&gt;

The &lt;code&gt;@intercept&lt;/code&gt; line before the definition of &lt;code&gt;my_execute&lt;/code&gt; is where the new decorator is used. The syntax is a shorthand that can be used to do some things with the function when defining it. It behaves as if the following code had been executed:

&lt;pre class="code"&gt;
&lt;code class="keyword"&gt;def&lt;/code&gt; &lt;var&gt;temporary&lt;/var&gt;(self, operation, params=None):
  ...
my_execute = intercept(MySQLCursor.execute)(&lt;var&gt;temporary&lt;/var&gt;)
&lt;/pre&gt;

As you can see here, whatever is given after the &lt;code&gt;@&lt;/code&gt; is used as a function and called with the function-being-defined as argument. This explains why the &lt;code&gt;wrap&lt;/code&gt; function is returned from the decorator (it will be called with a reference to the function that is being defined), and also why the original function is returned from the &lt;code&gt;wrap&lt;/code&gt; function (the result will be assigned to the function name).&lt;p&gt;
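
To try the technique without a database at hand, here is a minimal, self-contained sketch of the same idea applied to a toy class. The names &lt;code&gt;Cursor&lt;/code&gt;, &lt;code&gt;intercept_attr&lt;/code&gt;, and &lt;code&gt;traced_execute&lt;/code&gt; are made up for this example and are not part of Connector/Python. In contrast to the &lt;code&gt;intercept&lt;/code&gt; decorator above, this variant takes the owner (a class or an instance) and the method name explicitly, an assumption made so that the sketch does not depend on the &lt;code&gt;im_*&lt;/code&gt; attributes and also runs on Python 3.

```python
import functools
import inspect
import types

def intercept_attr(owner, name):
    """Decorator factory: install the decorated function as owner.name."""
    orig = getattr(owner, name)
    def wrap(func):
        # Copy name and documentation from the original inner function.
        functools.update_wrapper(func, getattr(orig, '__func__', orig))
        if inspect.isclass(owner):
            # Functions stored in a class become methods automatically.
            setattr(owner, name, func)
        else:
            # For a single instance, bind the function to it explicitly.
            setattr(owner, name, types.MethodType(func, owner))
        return func
    return wrap

class Cursor(object):
    def execute(self, operation):
        """Pretend to execute a statement."""
        return "ran " + operation

seen = []
original_execute = Cursor.execute   # saved before it is replaced

@intercept_attr(Cursor, 'execute')
def traced_execute(self, operation):
    seen.append(operation)                     # the interception
    return original_execute(self, operation)   # call the original

result = Cursor().execute("SELECT 1")
```

After this runs, every &lt;code&gt;Cursor&lt;/code&gt; records the statement in &lt;code&gt;seen&lt;/code&gt; before delegating to the original method. Passing an instance instead of the class as the first argument replaces the method for that object only.&lt;p&gt;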

Using a statement interceptor, you can catch the execution of statements and do some special magic on them.  In our case, let's define an interceptor to catch the execution of a statement and log the result using the standard &lt;a href="http://docs.python.org/library/logging.html"&gt;&lt;code&gt;logging&lt;/code&gt;&lt;/a&gt; module. If you read the &lt;code&gt;wrap&lt;/code&gt; function carefully, you probably noted that it uses a &lt;dfn&gt;closure&lt;/dfn&gt; to access the value of &lt;var&gt;orig&lt;/var&gt; when the decorator was &lt;em&gt;called&lt;/em&gt;, not the value it happens to have when the &lt;code&gt;wrap&lt;/code&gt; function is executed. This feature is very useful since a closure can also be used to get access to the original &lt;code&gt;execute&lt;/code&gt; function and call it from within the new function.  So, to intercept an execute call and log information about the statement, you could use code like this:

&lt;pre class="code lang-python"&gt;
import logging
from mysql.connector import MySQLCursor

# Save the original method before it is replaced so that it can be called.
original_execute = MySQLCursor.execute

@intercept(MySQLCursor.execute)
def my_execute(self, operation, params=None):
    if params is not None:
        stmt = operation % self._process_params(params)
    else:
        stmt = operation
    result = original_execute(self, operation, params)
    logging.debug("Executed '%s', rowcount: %d", stmt, self.rowcount)
    # description is None for statements that do not return a result set
    if self.description is not None:
        logging.debug("Columns: %s", ', '.join(c[0] for c in self.description))
    return result
&lt;/pre&gt;

Now with this, you could implement your own caching layer to, for example, do a memcached lookup before sending the statement to the server for execution. I leave this as an exercise to the reader, or maybe I'll show you in a later post. :)

Implementing a lifecycle interceptor is similar, only that you replace, for example, the commit or rollback calls. However, implementing an exception interceptor is not obvious.  Catching the exception is straightforward and can be done using the &lt;code&gt;intercept&lt;/code&gt; decorator:

&lt;pre class="code"&gt;
import logging
from mysql.connector.errors import ProgrammingError

original_init = ProgrammingError.__init__
@intercept(ProgrammingError.__init__)
def catch_error(self, msg, errno):
    logging.debug("This statement didn't work: '%s', errno: %d", msg, errno)
    original_init(self, msg, errno=errno)
&lt;/pre&gt;

However, in order to do something more interesting, such as asking for some additional information from the database, it is necessary to either get hold of the cursor that was used to execute the query, or at least the connection. It is possible to dig through the interpreter stack, or try to override one of the internal methods that Connector/Python uses, but since that is very dependent on the implementation, I will not present that in this post.  It would be good if the cursor were passed down to the exception constructor, but this requires some changes to the connector code.&lt;p&gt;

Even though I have been programming in dynamic languages for decades (literally), it always amazes me how easy it is to accomplish things in these languages. If you are interested in playing around with this code, you can always fetch Connector/Python on Launchpad and try out the examples above. Some links and other assorted references related to this post are:

&lt;ul&gt;
  &lt;li&gt;Connector/Python is found at &lt;a href="https://launchpad.net/myconnpy" &gt;launchpad.net/myconnpy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Geert has a number of excellent posts on Connector/Python under &lt;a href="http://geert.vanderkelen.org/" &gt;geert.vanderkelen.org&lt;/a&gt;. Also, &lt;a href="http://geert.vanderkelen.org/post/817" &gt;as you might already know&lt;/a&gt;, he is now working on developing Connector/Python and he's always interested in comments and suggestions. :)
  &lt;li&gt;Todd's Blog &lt;a href="http://mysqlblog.fivefarmers.com/" &gt;mysqlblog.fivefarmers.com&lt;/a&gt; is always interesting to read, and these articles on interceptors are the ones I read
  &lt;ul&gt;
    &lt;li&gt;&lt;a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-lifecycle-interceptors/" &gt;Lifecycle Interceptors&lt;/a&gt;
    &lt;li&gt;&lt;a href="http://mysqlblog.fivefarmers.com/2011/10/17/connectorj-extension-points-statement-interceptors/" &gt;Statement Interceptors&lt;/a&gt;
    &lt;li&gt;&lt;a href="http://mysqlblog.fivefarmers.com/2011/11/21/connectorj-extension-points-%E2%80%93-exception-interceptors/"&gt;Exception Interceptors&lt;/a&gt;
  &lt;/ul&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-8853718009688026024?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/8853718009688026024/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=8853718009688026024' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8853718009688026024'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8853718009688026024'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2012/01/mysql-python-meta-programming-and.html' title='MySQL: Python, Meta-Programming, and Interceptors'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-1297703426721517909</id><published>2011-09-26T10:27:00.001+02:00</published><updated>2011-09-26T10:27:28.008+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='API'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><title type='text'>Python Interface to MySQL</title><content type='html'>There have been a lot of discussions lately about various non-SQL languages that provide access to databases without having to resort to using SQL.  
I wondered how difficult it would be to implement such an interface, so as an experiment, I implemented a simple interface in Python that is similar to the document-oriented interfaces available elsewhere. The interface generates SQL queries to query the database, but does not require any knowledge of SQL to use. The syntax is inspired by &lt;a href="http://jquery.org"&gt;JQuery&lt;/a&gt;, but since JQuery works with documents, the semantics are slightly different.&lt;p&gt;

A simple example would look like this:

&lt;pre class="code lineno"&gt;
from native_db import *
server = Server(host='127.0.0.1')
server.test.t1.insert({'more': 3, 'magic': 'just a test', 'count': 0})
server.test.t1.insert({'more': 3, 'magic': 'just another test', 'count': 0})
server.test.t1.insert({'more': 4, 'magic': 'quadrant', 'count': 0})
server.test.t1.insert({'more': 5, 'magic': 'even more magic', 'count': 0})
for row in server.test.t1.find({'more': 3}):
  print "The magic is:", row['magic']
server.test.t1.update({'more': 3}, {'count': 'count+1'})
for row in server.test.t1.find({'more': 3}, ['magic', 'count']):
  print "The magic is:", row['magic'], "and the count is", row['count']
server.test.t1.delete({'more': 5})
&lt;/pre&gt;

After the import, the first statement defines a server to communicate with, which is simply done by creating a &lt;code&gt;Server&lt;/code&gt; object with the necessary parameters. The constructor accepts the normal parameters for Connector/Python (which is what I'm using internally), but the user defaults to whatever &lt;code&gt;getpass.getuser()&lt;/code&gt; returns, and the host defaults to 127.0.0.1, even though I've provided it here.&lt;p&gt;

After that, the necessary methods are overridden so that &lt;code&gt;&lt;em&gt;server&lt;/em&gt;.&lt;em&gt;database&lt;/em&gt;.&lt;em&gt;table&lt;/em&gt;&lt;/code&gt; will refer to the table with name &lt;em&gt;table&lt;/em&gt; in the database with name &lt;em&gt;database&lt;/em&gt; on the given server. One possibility would be to just skip the database and go directly to the table (using some default database name), but since this is just an experiment, I did this instead. After that, there are various methods defined to support searching, inserting, deleting, and updating.&lt;p&gt;
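
Attribute chaining of this kind can be done by overriding &lt;code&gt;__getattr__&lt;/code&gt;. The sketch below is not the code from &lt;code&gt;native_db.py&lt;/code&gt;: the class bodies and the &lt;code&gt;qualified_name&lt;/code&gt; method are assumptions made for illustration, and no connection to a server is opened.

```python
class Table(object):
    def __init__(self, database, name):
        self.database = database
        self.name = name

    def qualified_name(self):
        # Name usable in generated SQL, for example test.t1
        return "%s.%s" % (self.database.name, self.name)

class Database(object):
    def __init__(self, server, name):
        self.server = server
        self.name = name

    def __getattr__(self, table_name):
        # Called only when normal attribute lookup fails, so the
        # attributes set in __init__ are not affected.
        if table_name.startswith('_'):
            raise AttributeError(table_name)
        return Table(self, table_name)

class Server(object):
    def __init__(self, host='127.0.0.1'):
        self.host = host

    def __getattr__(self, database_name):
        if database_name.startswith('_'):
            raise AttributeError(database_name)
        return Database(self, database_name)

server = Server()
table = server.test.t1
```

Names starting with an underscore are rejected so that Python's own special-method lookups do not accidentally turn into databases or tables.&lt;p&gt;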

Since this is intended to be a simple interface, autocommit is on. Each of the functions generates a single SQL statement, so they will be executed atomically if you're using InnoDB.&lt;p&gt;

&lt;dl&gt;
  &lt;dt&gt;&lt;em&gt;table&lt;/em&gt;.insert(&lt;em&gt;row&lt;/em&gt;)
  &lt;dd&gt;This function will insert the contents of the dictionary into the table, using the keys of the dictionary as column names. If the table does not exist, it will be created with a "best effort" guess of what types to use for the columns.
  &lt;dt&gt;&lt;em&gt;table&lt;/em&gt;.delete(&lt;em&gt;condition&lt;/em&gt;)
  &lt;dd&gt;This function will remove all rows in the table that match the supplied dictionary. Currently, only equality matching is supported, but see below for how it could be extended.
  &lt;dt&gt;&lt;em&gt;table&lt;/em&gt;.find(&lt;em&gt;condition&lt;/em&gt;, &lt;em&gt;fields&lt;/em&gt;="*")
  &lt;dd&gt;This will search the table and return an iterable to the rows that match &lt;em&gt;condition&lt;/em&gt;. If &lt;em&gt;fields&lt;/em&gt; is supplied (as a list of field names), only those fields are returned.
  &lt;dt&gt;&lt;em&gt;table&lt;/em&gt;.update(&lt;em&gt;condition&lt;/em&gt;, &lt;em&gt;update&lt;/em&gt;)
  &lt;dd&gt;This will search for rows matching &lt;em&gt;condition&lt;/em&gt; and update each matching row according to the &lt;em&gt;update&lt;/em&gt; dictionary. The values of the dictionary are used on the right side of the assignments of the &lt;code&gt;UPDATE&lt;/code&gt; statement, so expressions can be given here as strings.
&lt;/dl&gt;
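
To make the mapping from dictionaries to SQL concrete, here is a rough sketch of how &lt;code&gt;find&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; could build their statements. It is not taken from &lt;code&gt;native_db.py&lt;/code&gt;: the function names are hypothetical, only non-empty equality conditions are handled, and table and column names are assumed to be trusted identifiers. Condition values are passed as parameters using the &lt;code&gt;%s&lt;/code&gt; placeholder style that Connector/Python accepts, while the update expressions are inserted as-is, which is what allows &lt;code&gt;'count+1'&lt;/code&gt; to work.

```python
def build_where(condition):
    # Sort the keys so that the generated statement is deterministic.
    keys = sorted(condition)
    clause = " AND ".join("%s = %%s" % key for key in keys)
    params = tuple(condition[key] for key in keys)
    return clause, params

def build_find(table, condition, fields="*"):
    columns = fields if fields == "*" else ", ".join(fields)
    where, params = build_where(condition)
    return "SELECT %s FROM %s WHERE %s" % (columns, table, where), params

def build_update(table, condition, update):
    # The right-hand sides are raw SQL expressions, not parameters.
    assignments = ", ".join(
        "%s = %s" % (col, update[col]) for col in sorted(update))
    where, params = build_where(condition)
    return "UPDATE %s SET %s WHERE %s" % (table, assignments, where), params

find_stmt = build_find("test.t1", {'more': 3}, ['magic', 'count'])
update_stmt = build_update("test.t1", {'more': 3}, {'count': 'count+1'})
```

Each function returns the statement text together with the tuple of parameters, which is the pair that a cursor's &lt;code&gt;execute&lt;/code&gt; method expects.&lt;p&gt;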

&lt;h2&gt;That's all folks!&lt;/h2&gt;

The code is available at &lt;a href="http://mats.kindahl.net/python/native_db.py"&gt;http://mats.kindahl.net/python/native_db.py&lt;/a&gt; if you're interested in trying it out. The code is very basic, and there's potential for a lot of extensions. If there's interest, I could probably create a repository somewhere.&lt;p&gt;

Note that this is not a replacement for an ORM library. The intention is not to allow storing arbitrary objects in the database: the intention is to be able to query the database using a Python interface without resorting to using SQL.&lt;p&gt;

I'm just playing around and testing some things out, and I'm not really sure if there is any interest in anything like this, so what do you think? Personally, I have no problems with using SQL, but since I'm working with MySQL on a daily basis, I'm strongly biased on the subject. For simple jobs, this is probably easier to work with than a "real" SQL interface, but it cannot handle as complex queries as SQL can (at least not without extensions).&lt;p&gt;

There are a number of open issues for the implementation (this is just a small list of obvious ones):

&lt;dl&gt;
  &lt;dt&gt;&lt;strong&gt;Only equality searching supported&lt;/strong&gt;
  &lt;dd&gt;Searching can only be done with equality matches, but it is trivial to extend to support more complex comparisons. To allow more complex conditions, the condition supplied to &lt;code&gt;find&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;, and &lt;code&gt;update&lt;/code&gt; can actually be a string, in which case it is used "raw".&lt;p&gt;
  Conditions could be extended to support something like &lt;code&gt;{'more': '&gt;3'}&lt;/code&gt;, or a more object-oriented approach would be to support something similar to &lt;code&gt;{'more': operator.gt(3)}&lt;/code&gt;.&lt;p&gt;
  &lt;dt&gt;&lt;strong&gt;No support for indexes&lt;/strong&gt;
  &lt;dd&gt;There's no support for indexes yet, but that can easily be added. The complication is &lt;em&gt;what&lt;/em&gt; kind of indexes should be generated.&lt;p&gt;
    For example, right now rows are identified by their content, but what if we want unique rows to be handled as a set? Imagine the following (not supported) query, where we insert rows and then search among exactly those rows:
    &lt;blockquote&gt;&lt;code&gt;server.test.t1.insert(&lt;em&gt;content with some more=3&lt;/em&gt;).find({'more': eq(3)})&lt;/code&gt;&lt;/blockquote&gt;
    In this case, we have to fetch the row identifiers for the inserted rows to be able to manipulate &lt;em&gt;exactly&lt;/em&gt; those rows and none other. Not sure how to do this right now, but auto-inventing a row-identifier would mean that tables lacking it cannot be handled naturally.&lt;p&gt;
  &lt;dt&gt;&lt;strong&gt;Creating and dropping tables&lt;/strong&gt;
  &lt;dd&gt;The support for creation of tables is to create tables automatically if they do not exist. A simple heuristic is used to figure out the table definition, but this has obvious flaws if later inserts have more fields than the first one.&lt;p&gt;
    To support extending the table, one would have to generate an &lt;code&gt;ALTER TABLE&lt;/code&gt; statement to "fix" the table.&lt;p&gt;
  There is no support for dropping tables... or databases.
&lt;/dl&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-1297703426721517909?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/1297703426721517909/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=1297703426721517909' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1297703426721517909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1297703426721517909'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2011/09/python-interface-to-mysql.html' title='Python Interface to MySQL'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-567039231190382000</id><published>2011-07-27T15:12:00.002+02:00</published><updated>2011-07-27T15:59:30.805+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='scale-out'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='multi-core'/><category scheme='http://www.blogger.com/atom/ns#' term='scale-up'/><title type='text'>Binlog Group Commit Experiments</title><content type='html'>&lt;h1&gt;Binlog Group Commit Experiments&lt;/h1&gt;

It has been a while since I &lt;a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html"&gt;talked&lt;/a&gt; &lt;a href="http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html"&gt;about&lt;/a&gt; binary log group commit.  I had to spend time on a &lt;a href="http://mysqlmusings.blogspot.com/2011/02/slave-type-conversions.html"&gt;few&lt;/a&gt; &lt;a href="http://mysqlmusings.blogspot.com/2011/04/replication-event-checksum.html"&gt;other&lt;/a&gt; &lt;a href="http://mysqlmusings.blogspot.com/2011/04/round-robin-multi-source-in-pure-sql.html"&gt;things&lt;/a&gt;.&lt;p&gt;

Since then, Kristian has &lt;a href="https://lists.launchpad.net/maria-developers/msg03693.html"&gt;released a version of binary log group commit&lt;/a&gt; that seems to &lt;a href="http://www.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933"&gt;work well&lt;/a&gt;.

However, for a few reasons that will be outlined below, we decided to do experiments ourselves using the approach that I have described earlier. A &lt;em&gt;very&lt;/em&gt; early version of what we will start doing benchmarks on is available at the &lt;a href=""&gt;MySQL labs&lt;/a&gt;. We did not do any benchmarking on this approach before OSCON, so we'll have to get back to that.&lt;p&gt;

All of this started with &lt;a href="http://www.facebook.com/note.php?note_id=438641125932"&gt;Facebook pointing out a problem&lt;/a&gt; in how the group commit interacts with the binary log and proposing a way to handle the binary log group commit by demonstrating a patch to solve the problem.

&lt;h2&gt;What's in the patch&lt;/h2&gt;

The patch involves implementing logic for handling binary log group commit and parallel writing of the binary log, including a minor change to the handler protocol by adding a &lt;code&gt;persist&lt;/code&gt; callback. The extension of the handler interface is strictly speaking not necessary for the implementation, but it is natural to extend the interface in this manner and I believe that it can be used by storage engines to execute more efficiently.

In addition to the new logic, three new options were added and one option was created as an alias of an old option.

&lt;dl&gt;
  &lt;dt&gt;&lt;code&gt;binlog-sync-period=&lt;/code&gt;&lt;var&gt;N&lt;/var&gt;
  &lt;dd&gt;This is just a rename of the old &lt;code&gt;sync-period&lt;/code&gt; option, which specifies that &lt;code&gt;fsync&lt;/code&gt; should be called for the binary log every &lt;var&gt;N&lt;/var&gt; events. For many of the old options, it is not clear what they are configuring, so we are adding the &lt;code&gt;binlog-&lt;/code&gt; prefix to options that affect the binary log. The old option is kept as an alias for this option.
  &lt;dt&gt;&lt;code&gt;binlog-sync-interval=&lt;/code&gt;&lt;var&gt;msec&lt;/var&gt;
  &lt;dd&gt;No transaction commit will wait for more than &lt;var&gt;msec&lt;/var&gt; milliseconds before calling &lt;code&gt;fsync&lt;/code&gt; on the binary log. If set to zero, it is disabled. You can set both this option and the &lt;code&gt;binlog-sync-period&lt;/code&gt; option.
  &lt;dt&gt;&lt;code&gt;binlog-trx-committed={COMPLETE,DURABLE}&lt;/code&gt;
  &lt;dd&gt;A transaction is considered committed either when it is in durable store or when it is completed. If set to &lt;code&gt;DURABLE&lt;/code&gt;, either &lt;code&gt;binlog-sync-interval&lt;/code&gt; or &lt;code&gt;binlog-sync-period&lt;/code&gt; has to be non-zero. If they are both zero, transactions will not be flushed to disk and hence they will never be considered durable.
  &lt;dt&gt;&lt;code&gt;master-trx-read={COMPLETE,DURABLE}&lt;/code&gt;
  &lt;dd&gt;A transaction is read from the binary log either when it is completed or when it is durable. If set to &lt;code&gt;DURABLE&lt;/code&gt;, either &lt;code&gt;binlog-sync-interval&lt;/code&gt; or &lt;code&gt;binlog-sync-period&lt;/code&gt; has to be non-zero or an error will be generated. If both were allowed to be zero, no transaction would ever be read from the binary log and hence none would ever be sent out.
&lt;/dl&gt;
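
As a rough illustration of how these options fit together, an option file for a server built from the labs tree might contain something like the following. This is only a sketch: the option names are the ones listed above, the values are made up for the example, and it assumes that the experimental build accepts them under &lt;code&gt;[mysqld]&lt;/code&gt; like any other server option.

&lt;pre class="code"&gt;
[mysqld]
# Call fsync on the binary log every 100 events (illustrative value)...
binlog-sync-period   = 100
# ...and never let a commit wait more than 10 ms for an fsync.
binlog-sync-interval = 10
# Consider a transaction committed only once it is on disk. This needs
# at least one of the two options above to be non-zero.
binlog-trx-committed = DURABLE
# Read transactions from the binary log as soon as they are completed.
master-trx-read      = COMPLETE
&lt;/pre&gt;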

The patch also contains code to eliminate the &lt;code&gt;prepare_commit_mutex&lt;/code&gt; and to move the release of row locks inside InnoDB to the prepare phase (not completely applied yet, I will get it there as soon as possible). The focus of these changes is on maintaining consistency, so we have not made any aggressive changes like moving the release of the write locks to the prepare phase: that could possibly lead to inconsistencies.&lt;p&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;1. Binary log with transaction in different stages&lt;/caption&gt;
  &lt;tr&gt;
    &lt;td&gt;
      &lt;object data="http://images.kindahl.net/trans-stages.svg" width="100" height="300" type="image/svg+xml"&gt;
      &lt;/object&gt;
    &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

The main changes are about how a transaction is committed. The details are explained in the previous articles, but for understanding the rest of this blog post, I'll briefly recapitulate how a transaction is committed in this solution. Each transaction passes through three states: &lt;em&gt;prepared&lt;/em&gt;, &lt;em&gt;completed&lt;/em&gt; (committed to memory), and &lt;em&gt;durable&lt;/em&gt; (committed to disk), as seen in Figure&amp;nbsp;1. The transaction is pushed through these states using the following procedure:
&lt;ol&gt;
  &lt;li&gt;The transaction is first &lt;strong&gt;prepared&lt;/strong&gt;, which is now split into two steps:
  &lt;ol&gt;
    &lt;li&gt;In the &lt;strong&gt;reserve&lt;/strong&gt; step, a slot is assigned for the transaction in the binary log and the storage engine is asked to check whether this transaction can be committed. At this point, the storage engine can abort the transaction if it is unable to fulfill the commit, but if it approves of the commit, the only thing that can abort the transaction after this point is a server crash. This check is currently done using the &lt;code&gt;prepare&lt;/code&gt; call. This step is executed with a lock, but is intended to be short.&lt;/li&gt;
    &lt;li&gt;In the &lt;strong&gt;persist&lt;/strong&gt; step, the &lt;code&gt;persist&lt;/code&gt; function is called, which asks the storage engine to persist any data that it needs to persist to guarantee that the transaction is fully prepared. After this step is complete, the transaction is fully prepared in the storage engine and, in the event of a crash, the engine will be able to commit the transaction on recovery if asked to do so. This step is executed without a lock, and a storage engine that intends to handle group commit should defer any expensive operations to this step.&lt;/li&gt;
  &lt;/ol&gt;&lt;/li&gt;
  &lt;li&gt;To record the decision, the transaction is written to the reserved slot in the binary log. Since the write is done to a dedicated place in the binary log reserved for this transaction, it is not necessary to hold any locks, which means that several threads can write their transactions to the binary log at the same time.&lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;commit&lt;/strong&gt; phase is in turn split into two steps:
  &lt;ol&gt;
    &lt;li&gt;In the &lt;strong&gt;completion&lt;/strong&gt; step, the thread waits for all preceding transactions to be fully written to the binary log, after which the transaction is &lt;em&gt;completed&lt;/em&gt;, which means that it is logically committed but not necessarily in durable storage.&lt;/li&gt;
    &lt;li&gt;In the &lt;strong&gt;durability&lt;/strong&gt; step, the thread waits for the transaction (and all preceding transactions) to be written to disk. If this does not occur within the given time period, it will itself call &lt;code&gt;fsync&lt;/code&gt; for the binary log. This will make all completed transactions durable.&lt;/li&gt;
  &lt;/ol&gt;&lt;/li&gt;
&lt;/ol&gt;

After this procedure is complete, the transaction is fully committed and the thread can proceed with executing the next statement.

&lt;h2&gt;The different approaches&lt;/h2&gt;

So, providing this patch raises the question: why a third version of binary log group commit?

There are three approaches: Facebook's patch (#1), Kristian's patch (#2), and my patch (#3).  Before going over the rationale leading to a third version, it is necessary to understand how the Facebook patch and Kristian's patch work at a very high level. If you look at Figure&amp;nbsp;1, you see a schematic diagram showing how the patches work. Both of them maintain a queue of threads with transactions to be written and will ensure that they are written in the correct order to the binary log.&lt;p&gt;

The Facebook patch ensures that the transactions are written in the correct order by signalling each thread waiting in the queue in the correct order, after which the thread will take a lock on the binary log, append the transaction, and release the lock.  When the decision to commit the outstanding transactions is made, &lt;code&gt;fsync()&lt;/code&gt; is called. It has turned out that this lock-write-unlock loop can only be executed at a certain speed, which means that as the number of threads waiting to write transactions increases, the system chokes and is not able to keep up.&lt;p&gt;

Kristian solves this by designating the first thread in the queue as the leader and having it write the transactions for &lt;em&gt;all&lt;/em&gt; threads in the queue, instead of having each thread do it individually. The leader then broadcasts to the other threads, which just return from the commit. This improves performance significantly, as can be seen from the figures in the &lt;a href="http://www.facebook.com/notes/mark-callaghan/group-commit-in-mariadb-is-fast/10150211546215933"&gt;measurements that Mark did&lt;/a&gt;. Note, however, that a lock on the binary log is still held while writing the transactions.&lt;p&gt;

The approach we are experimenting with goes about this in another way: instead of queueing the data to be written, a place is immediately allocated in the binary log, after which the thread proceeds to write the data. This means that several threads can write to the binary log in parallel without needing to hold any locks. A lock is needed when allocating space in the binary log, but it is held only very briefly. Since the threads can finish writing in a different order, it is necessary to keep logic around for deciding when a transaction is committed and when it is not. For details, you can look at the &lt;a href="http://forge.mysql.com/worklog/task.php?id=5223"&gt;worklog&lt;/a&gt; (which is not entirely up to date, but I'll fix that). In this sense, the binary log itself is the queue (there is a queue in the implementation, but this is just for bookkeeping).

The important differences leading us to want to have a look at this third version are:
&lt;ul&gt;
  &lt;li&gt;Approaches #1 and #2 keep a lock while writing the binary log while #3 doesn't.&lt;/li&gt;
  &lt;li&gt;Approaches #1 and #2 keep the transactions on the side (in the queue) and write them to the binary log when they are being committed. Approach #3 writes the transactions directly to the binary log, possibly before they are committed.&lt;/li&gt;
&lt;/ul&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;1. Sources of performance problems&lt;/caption&gt;
  &lt;tr&gt;
    &lt;td&gt;
      &lt;object data="http://images.kindahl.net/locking-issues.svg" width="320" height="450" type="image/svg+xml"&gt;
      &lt;/object&gt;
    &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

&lt;h2&gt;Efficiently using Multiple Cores&lt;/h2&gt;

Efficiently using a multi-threaded system, especially one with multiple cores, is very hard. It requires knowledge of hardware issues, operating system considerations, algorithms, and some luck. I will not cover all the issues revolving around designing a system for multi-core use, but I will focus on three of the parts that we are considering in this case.  We split the sources of performance degradation when committing a transaction into three separate parts: CPU and memory issues, software lock contention, and I/O.

&lt;ul&gt;
  &lt;li&gt;The CPU and memory issues have to do with how caches are handled on the CPU level, which can affect performance quite a lot.  There are some things that can be done, such as avoiding false sharing, handling data alignment, and checking the cache access patterns, but in general, this is hard to add as an afterthought and requires quite a lot of work to get right.  We are not considering this and view it as static.&lt;/li&gt;
  &lt;li&gt;The I/O can be reduced by using either SSDs or RAID solutions (the latter do not reduce latency, but improve throughput and therefore reduce the I/O time needed for each transaction). Also, reducing the number of accesses to disk using group commit will improve the situation significantly, which is what we're doing here.&lt;/li&gt;
  &lt;li&gt;To reduce the software lock contention, there is only one solution: reduce the time each lock is held. This can be as simple as moving the lock acquire and release or using atomic primitives instead of locks, but it can also require redesigning algorithms to be able to run without locks.&lt;/li&gt;
&lt;/ul&gt;

So, assuming that we reduce the I/O portion of committing a transaction&amp;mdash;and &lt;em&gt;only&lt;/em&gt; the I/O portion&amp;mdash;the software lock time starts to become the problem, as the figure on sources of performance problems illustrates, and we need to start working on reducing that. To do this, there are not many options except the approach described above. And if we take this approach to reduce lock contention, only a few additions are needed to get group commit as well.&lt;p&gt;

Given this, it is rational to explore whether this solution can solve the group commit problem as well as the other solutions do and improve the scalability of the server at the same time.

&lt;h2&gt;Scaling out&lt;/h2&gt;

One of the most central uses of replication is to achieve high availability by duplicating masters and replicating between them to keep both up to date. For this reason, it is important to get the changes over to the other master as fast as possible. In this case, whether the data is durable on the original master or not is of lesser concern, since once the transaction has left the node, a crash will not cause the transaction to disappear: it has already been distributed. This means that for implementing multi-master, we want replication to send transactions as soon as possible&amp;mdash;and maybe even before that&amp;mdash;since we can achieve high availability by propagating the information as widely as possible.&lt;p&gt;

On the other hand, transactions sent from the master to the slave &lt;em&gt;might&lt;/em&gt; need to be durable on the master since otherwise the slave might be moving into an alternative future&amp;mdash;a future where this transaction was committed&amp;mdash;if the transactions sent to the slave are lost because of a crash.  In this case, it is necessary for the master to not send out the transaction before it is in durable store.

Having a master that is able to send out both completed transactions and durable transactions at the same time, all based on the requirements of the slave that connects, is a great feature and allows the implementation of both an efficient multi-master solution and slaves that do not diverge from the master even in the event of crashes. Currently, a master cannot deliver both transactions that are &lt;em&gt;completed&lt;/em&gt; and transactions that are &lt;em&gt;durable&lt;/em&gt; at the same time.  With the patch presented in this article, it is possible to implement this, but in alternatives #1 and #2 described above, all the transactions are kept "on the side" and not written to the binary log until they are being committed. This means that it is harder to support this scenario with the two other alternatives.

&lt;h2&gt;Concluding remarks&lt;/h2&gt;

To sum up the discussion above: we are interested in exploring this approach since we think that it provides shorter lock times, hence scales better on multi-core machines, and in addition provides better scale-out capabilities, since the slaves will be able to decide whether they want to receive durable or completed transactions.

Thanks to all in the community for the great work and discussions on binlog group commit.  The next steps will be to benchmark this solution to see how it flies and it would be great to also get some feedback on this approach.  As always, we are interested in getting a good and efficient solution that also can be maintained and evolved easily.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-567039231190382000?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/567039231190382000/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=567039231190382000' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/567039231190382000'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/567039231190382000'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2011/07/binlog-group-commit-experiments.html' title='Binlog Group Commit Experiments'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-6903983818073234593</id><published>2011-04-13T16:23:00.004+02:00</published><updated>2011-04-18T08:53:27.930+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='multi-master'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='round-robin'/><category 
scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='multi-source'/><title type='text'>Round-Robin Multi-Source in Pure SQL</title><content type='html'>With the &lt;a href="crash-safe-replication.html"&gt;addition of the new tables to implement crash-safe replication&lt;/a&gt; we also get access to replication information through the SQL interface.  This might not seem like a big advantage, but it should not be taken lightly.  To demonstrate the power of using this approach, I will show how to implement a multi-source round-robin replication described at &lt;a href="http://dom.as/2008/05/14/trainwreck-external-mysql-replication-agent/"&gt;other places&lt;/a&gt; (including &lt;a href="http://oreilly.com/catalog/9780596807290"&gt;our book&lt;/a&gt;).  However, compared to the other implementations&amp;mdash;where the implementation requires a client to parse the output of &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;&amp;mdash;the twist is that the implementation is entirely done in the server, using pure SQL.&lt;p&gt;

If you're familiar with replication, you know that a slave can only replicate from a single master.  The trick used to replicate from multiple masters&amp;mdash;this is usually called &lt;em&gt;multi-source&lt;/em&gt;&amp;mdash;is to switch between masters in a time-share fashion as illustrated in Figure&amp;nbsp;1.  The scheme used to pick the master to replicate from can vary, but it is common to use a round-robin schedule.&lt;p&gt;

The steps necessary to switch master are:

&lt;ol&gt;
  &lt;li&gt;Stop reading events from the master and empty the relay log.  Before switching to another master, it is necessary to ensure that there are no outstanding events in the relay log.  If this is not done, some events will not be applied and will have to be re-fetched from the master.&lt;ol&gt;
    &lt;li&gt;Stop the I/O thread.&lt;/li&gt;
    &lt;li&gt;Wait for the events in the relay log to be applied.&lt;/li&gt;
    &lt;li&gt;Stop the SQL thread.&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;
  &lt;li&gt;Save away the replication information.&lt;/li&gt;
  &lt;li&gt;Fetch the saved information about the next master to replicate from.&lt;/li&gt;
  &lt;li&gt;Change master using the new information.&lt;/li&gt;
  &lt;li&gt;Start the slave threads.&lt;/li&gt;
&lt;/ol&gt;

Simple, right? So, let's make an implementation! What pieces do we need?
&lt;ul&gt;
  &lt;li&gt;To handle the periodic switching, we use an SQL event for executing the above procedure.&lt;/li&gt;
  &lt;li&gt;We need a table to store the state of each master. The table should contain all the necessary information for configuring the master, including the binlog position.&lt;/li&gt;
  &lt;li&gt;We need to be able to store what master we're currently replicating from.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Saving state information&lt;/h4&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;&lt;strong&gt;Figure&amp;nbsp;1. &lt;/strong&gt;Tables for storing information about masters&lt;/caption&gt;
  &lt;tr&gt;&lt;td&gt;&lt;pre class="code"&gt;CREATE TABLE my_masters (
    idx INT AUTO_INCREMENT PRIMARY KEY,
    host VARCHAR(50), port INT DEFAULT 3306,
    user VARCHAR(50), passwd VARCHAR(50),
    log_file VARCHAR(50), log_pos BIGINT UNSIGNED,
    UNIQUE INDEX (host,port,user)
) ENGINE=InnoDB;

CREATE TABLE current_master (
    idx INT
) ENGINE=InnoDB;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

We need two tables: a table &lt;code&gt;my_masters&lt;/code&gt; to record information about the available masters and a table &lt;code&gt;current_master&lt;/code&gt; that keeps information about the current master.  The &lt;code&gt;my_masters&lt;/code&gt; table will contain information on how to connect to the masters as well as the last seen position. We assume that the user and password information is stored in the table and won't save away that information when switching master. To store the current master being replicated from, we cannot use a user-defined variable&amp;mdash;because each invocation of an event spawns a new session&amp;mdash;so we store this information in a table.&lt;p&gt;
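
To make this concrete, the tables could be populated along the following lines. The host names, user name, and password are placeholders invented for the example. The binary log file and position are left as &lt;code&gt;NULL&lt;/code&gt; so that nothing is assumed about where replication should start.

&lt;pre class="code"&gt;
-- Register two masters; idx is assigned automatically.
INSERT INTO my_masters (host, port, user, passwd)
VALUES ('master1.example.com', 3306, 'repl_user', 'xyzzy'),
       ('master2.example.com', 3306, 'repl_user', 'xyzzy');

-- Record which master the slave is replicating from right now.
-- This sketch assumes it is the one with the lowest index.
INSERT INTO current_master (idx)
SELECT MIN(idx) FROM my_masters;
&lt;/pre&gt;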

&lt;h4&gt;Switching masters&lt;/h4&gt;

To be able to execute a &lt;code&gt;CHANGE MASTER&lt;/code&gt; statement with the information we need, it would be perfect to use a prepared statement with placeholders for the values. Unfortunately, &lt;code&gt;CHANGE MASTER&lt;/code&gt; is one of those statements that does not accept placeholders, so we have to build the statement text dynamically.  To make it easier, we create a &lt;code&gt;change_master&lt;/code&gt; procedure that does the job of building, preparing, executing, and deallocating the statement.  We also allow the file name and position passed to be NULL, in which case we start replication without these parameters, essentially starting from the beginning of the master's binary log.

&lt;pre class="code"&gt;
delimiter $$
CREATE PROCEDURE change_master(
    host VARCHAR(50), port INT,
    user VARCHAR(50), passwd VARCHAR(50),
    name VARCHAR(50), pos BIGINT UNSIGNED)
BEGIN
  SET @cmd = CONCAT('CHANGE MASTER TO ',
                    CONCAT_WS(', ',
                    CONCAT('MASTER_HOST = "', host, '"'),
                    CONCAT('MASTER_PORT = ', port),
                    CONCAT('MASTER_USER = "', user, '"'),
                    CONCAT('MASTER_PASSWORD = "', passwd, '"')));

  IF name IS NOT NULL AND pos IS NOT NULL THEN
    SET @cmd = CONCAT(@cmd,
                      CONCAT_WS(', ', '',
                                CONCAT('MASTER_LOG_FILE = "', name, '"'),
                                CONCAT('MASTER_LOG_POS = ', pos)));
  END IF;
  PREPARE change_master FROM @cmd;
  EXECUTE change_master;
  DEALLOCATE PREPARE change_master;
END $$
delimiter ;
&lt;/pre&gt;
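
As a usage sketch (the host name and credentials are again placeholders, and the slave threads are assumed to be stopped), calling the procedure without a file name and position builds a statement containing only the connection parameters. Since &lt;code&gt;@cmd&lt;/code&gt; is a user-defined variable, it can be inspected afterwards to see the statement that was executed.

&lt;pre class="code"&gt;
CALL change_master('master1.example.com', 3306,
                   'repl_user', 'xyzzy', NULL, NULL);

-- Show the statement text that the procedure built and executed
-- (wrapped here for readability):
--   CHANGE MASTER TO MASTER_HOST = "master1.example.com",
--   MASTER_PORT = 3306, MASTER_USER = "repl_user",
--   MASTER_PASSWORD = "xyzzy"
SELECT @cmd;
&lt;/pre&gt;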

The last step is to create the event that switches master for us. As a specific feature, we implement the event handling so that we can add and remove rows from the &lt;code&gt;my_masters&lt;/code&gt; table and the event will just pick the next one in order. To solve this, we use a query to pick the next master in order based on the index of the last used master, and then an additional query to handle the case of a wrap-around with a missing master at index 1.&lt;p&gt;

To allow the table to be changed while the events are executing, we place all the updates of our tables into a transaction. That way, any updates done to the table while the event is executing will not affect the logic for picking the next master.&lt;p&gt;

There is some extra logic added to handle the case that there are "holes" in the index numbers: it is possible that there is no master with index 1 and it is possible that the next master does not have the next index in sequence.  This also allows the server ID of the master to be used, but in the current implementation, we use a simple index instead.
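
The selection rule can also be tried out on its own. The following query is a compact restatement of it, not the code used in the event: it picks the smallest index greater than the current one and, if there is none, falls back to the smallest index in the table. It assumes that &lt;code&gt;current_master&lt;/code&gt; holds at most one row.

&lt;pre class="code"&gt;
SELECT COALESCE(
         -- Next master in index order, if there is one...
         (SELECT MIN(idx) FROM my_masters
           WHERE idx &gt; (SELECT idx FROM current_master)),
         -- ...otherwise wrap around to the first existing index.
         (SELECT MIN(idx) FROM my_masters)
       ) AS next_idx;
&lt;/pre&gt;

With masters at index 2, 5, and 7, for example, a current index of 2 gives 5, and a current index of 7 wraps around to 2.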

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;&lt;pre class="code"&gt;delimiter $$
CREATE EVENT multi_source
    ON SCHEDULE EVERY 10 SECOND DO
BEGIN
   DECLARE l_host VARCHAR(50);
   DECLARE l_port INT UNSIGNED;
   DECLARE l_user TEXT;
   DECLARE l_pass TEXT;
   DECLARE l_file VARCHAR(50);
   DECLARE l_pos BIGINT;
   DECLARE l_next_idx INT DEFAULT 1;
&lt;/pre&gt;&lt;/td&gt;&lt;td&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;   SET SQL_LOG_BIN = 0;&lt;/pre&gt;&lt;/td&gt;&lt;td valign="top"&gt;&lt;em&gt;Don't write any of this to the binary log. Since this is an event, it will automatically be reset at the end of the execution and not affect anything else.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td  valign="top"&gt;&lt;pre class="code"&gt;   STOP SLAVE IO_THREAD;
   SELECT master_log_name, master_log_pos
     INTO l_file, l_pos
     FROM mysql.slave_master_info;
   SELECT MASTER_POS_WAIT(l_file, l_pos);
   STOP SLAVE;
&lt;/pre&gt;&lt;/td&gt;&lt;td valign="top"&gt;
&lt;em&gt;Stop the slave I/O thread and empty the relay log before switching master&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td&gt;&lt;pre class="code"&gt;   START TRANSACTION;
&lt;/pre&gt;&lt;/td&gt;&lt;td&gt;
&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;   UPDATE my_masters AS m,
          mysql.slave_relay_log_info AS rli
      SET m.log_pos = rli.master_log_pos,
          m.log_file = rli.master_log_name
    WHERE idx = (SELECT idx FROM current_master);
&lt;/pre&gt;&lt;/td&gt;&lt;td valign="top"&gt;&lt;em&gt;Save the position of the current master&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;   SELECT idx INTO l_next_idx FROM my_masters
    WHERE idx &gt; (SELECT idx FROM current_master)
    ORDER BY idx LIMIT 1;
&lt;/pre&gt;&lt;/td&gt;&lt;td valign="top"&gt;&lt;em&gt;Find the next master in turn. To handle that masters have been removed, we will pick the next one index-wise. Wrap-around is handled by using the default of 1 above.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;    SELECT idx INTO l_next_idx FROM my_masters
     WHERE idx &gt;= l_next_idx
     ORDER BY idx LIMIT 1;
&lt;/pre&gt;&lt;/td&gt;&lt;td valign="top"&gt;&lt;em&gt;If we did a wrap-around, it might be the case that master with index 1 does not exist (the default for l_next_idx), so then we have to scan and find the first index that exists which is equal to or greater than l_next_idx.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td valign="top"&gt;&lt;pre class="code"&gt;    UPDATE current_master SET idx = l_next_idx;

    SELECT host, port, user, passwd, log_pos, log_file
      INTO l_host, l_port, l_user, l_pass, l_pos, l_file
      FROM my_masters
      WHERE idx = l_next_idx;

    CALL change_master(l_host, l_port, l_user,
                       l_pass, l_file, l_pos);
    COMMIT;
    START SLAVE;
END $$
delimiter ;
&lt;/pre&gt;&lt;/td&gt;&lt;td valign="top"&gt;&lt;em&gt;Extract the information about the new master from our masters table &lt;code&gt;my_masters&lt;/code&gt; and change to use that master.&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
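
To try the event out, the event scheduler has to be running. A minimal session might look like the following sketch, which assumes an account with the privileges needed to change global variables and alter events, and that it is run in the schema where the tables and the event were created.

&lt;pre class="code"&gt;
-- Events only run while the scheduler is on.
SET GLOBAL event_scheduler = ON;

-- Check that the event is registered and enabled.
SHOW EVENTS LIKE 'multi_source';

-- List the masters with the positions saved so far and mark the
-- one currently being replicated from.
SELECT m.idx, m.host, m.port, m.log_file, m.log_pos,
       (m.idx = c.idx) AS is_current
  FROM my_masters AS m, current_master AS c
 ORDER BY m.idx;

-- Pause the round-robin switching without dropping the event.
ALTER EVENT multi_source DISABLE;
&lt;/pre&gt;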

That's all! Now you go off and play with it and send me comments.&lt;p&gt;

You can download the MySQL 5.6 Milestone Development Release from the &lt;a href="http://dev.mysql.com"&gt;MySQL Developer Zone (&lt;code&gt;dev.mysql.com&lt;/code&gt;)&lt;/a&gt;, which contains the new replication tables, and you can find information in the &lt;a href="crash-safe-replication.html"&gt;previous post&lt;/a&gt; on how to set up the server to use the new tables.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-6903983818073234593?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/6903983818073234593/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=6903983818073234593' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/6903983818073234593'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/6903983818073234593'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2011/04/round-robin-multi-source-in-pure-sql.html' title='Round-Robin Multi-Source in Pure SQL'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-1708230410469619438</id><published>2011-04-12T17:09:00.004+02:00</published><updated>2011-04-12T17:19:45.114+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='transactional'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category 
scheme='http://www.blogger.com/atom/ns#' term='crash-safe'/><title type='text'>Crash-safe Replication</title><content type='html'>A common request is to have replication crash-safe in the sense that the replication progress information always is in sync with what has actually been applied to the database, even in the event of a crash. Although transactions are not lost if the server crashes, it could require some tweaking to bring the slaves up again.&lt;p&gt;

In the latest MySQL 5.6 milestone development release, the replication team has implemented crash-safety for the slave by adding the ability to commit the replication information together with the transaction (see Figure&amp;nbsp;1). This means that replication information will always be consistent with what has been applied to the database, even in the event of a server crash. Also, some fixes were done on the master to ensure that it recovers correctly.&lt;p&gt;

If you're familiar with replication, you know that the replication information is stored in two files: &lt;code&gt;master.info&lt;/code&gt; and &lt;code&gt;relay-log.info&lt;/code&gt;.

These files are updated after the transaction has been applied. This means that if a crash occurs between the transaction commit and the update of the files, the replication progress information will be wrong.

In other words, a transaction cannot be lost this way, but there &lt;em&gt;is&lt;/em&gt; a risk that a transaction could be applied yet another time.

The usual way to avoid this is to have a primary key on all your tables. In that case, a repeated update of the table would cause the slave to stop, and you would have to use &lt;code&gt;SQL_SLAVE_SKIP_COUNTER&lt;/code&gt; to skip the transaction and get the slave up and running again.  This is better than losing a transaction, but it is nevertheless a nuisance.

Removing the primary key to prevent the slave from stopping will only solve the problem partially: it means that the transaction would be applied twice, which would both place a burden on the application to handle duplicate entries and also require the tables to be cleaned regularly.

Both of these approaches require either manual intervention or scripting support. This does not affect reliability, but it is so much easier to handle if the replication information is committed in the same transaction as the data being updated.&lt;p&gt;
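
For reference, the manual intervention usually amounts to something like the following on the slave. This is a sketch of the common recipe, not a recommendation: skipping is only safe once you have verified that the failing transaction really was applied before the crash.

&lt;pre class="code"&gt;
-- The SQL thread has stopped on a duplicate-key error.
-- Inspect Last_SQL_Error in the output to confirm what failed.
SHOW SLAVE STATUS;

-- Skip the next event group from the master and resume.
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
&lt;/pre&gt;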

&lt;h4&gt;Crash-safe masters&lt;/h4&gt;

Two problems related to crash-safe replication have been fixed in the master, both of which could cause some annoyance when the master recovered.

&lt;ul&gt;
  &lt;li&gt;If the master crashed when a binary log was rotated, it was possible that some orphan binlog files ended up in the binary log index file. This was fixed in 5.1 but is also a piece in the puzzle of having crash-safe replication.&lt;/li&gt;
  &lt;li&gt;Writing to the binary log is not an atomic operation, and if a crash occurred while writing to the binary log, there was a possibility of a partial event at the end of the binary log.&lt;p&gt;

  Now, the master recovers from this by truncating the binary log to the last known good position, removing the partially written transaction and rolling back the outstanding transactions in the storage engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;1. Moving position information update into transaction&lt;/caption&gt;
  &lt;tr&gt;&lt;td&gt;
    &lt;object data="http://lh3.ggpht.com/_X_imutbSFuE/TaRqS1QTGbI/AAAAAAAAAIQ/tx1heb9tpNw/trans-repl-before.png" type="image/png" width="240" height="210"&gt;
    &lt;/object&gt;
  &lt;/td&gt;&lt;td&gt;
    &lt;object data="http://lh6.ggpht.com/_X_imutbSFuE/TaRqS6MW3-I/AAAAAAAAAIM/JYHXd6EGgAU/trans-repl-after.png" type="image/png" width="240" height="210"&gt;
    &lt;/object&gt;    
  &lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;h4&gt;Crash-safe slaves&lt;/h4&gt;

Several different solutions for implementing crash-safety&amp;mdash;or &lt;em&gt;transactional replication&lt;/em&gt;, as it is sometimes known&amp;mdash;have been proposed, with Google's &lt;a href="http://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication"&gt;TransactionalReplication&lt;/a&gt; patch being the best known. This solution stores the replication positions in the InnoDB transaction log, but the MySQL replication team decided to instead implement crash-safety by moving the replication progress information into system tables. This is a more flexible solution and has several advantages compared to storing the positions in the InnoDB transaction log:

&lt;ul&gt;
  &lt;li&gt;If the replication information and the data are stored in the same storage engine, both the data and the replication position can be updated as a single transaction, which means that the update is crash-safe.&lt;/li&gt;
  &lt;li&gt;If the replication information and the data are stored in different storage engines, but both support XA, they can still be committed as a single transaction.&lt;/li&gt;
  &lt;li&gt;The replication information is flushed to disk together with the transaction data. Hence writing the replication information directly to the InnoDB redo log does not offer a speed advantage, but it does prevent the user from reading the replication progress information easily.&lt;/li&gt;
  &lt;li&gt;The tables can be read from a normal session using SQL commands, which also means that they can be incorporated into such things as stored procedures and stored functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;TABLE class="simple-table figure-right"&gt;
  &lt;caption&gt;&lt;strong&gt;Table 1. &lt;/strong&gt;&lt;code&gt;slave_master_info&lt;/code&gt;&lt;/caption&gt;
  &lt;TR&gt;&lt;TH&gt;Field&lt;/TH&gt;&lt;TH&gt;Line in file&lt;/TH&gt;&lt;TH&gt;Slave status column&lt;/TH&gt;&lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Master_id&lt;/TD&gt;
    &lt;TD&gt;&lt;/TD&gt;
    &lt;TD&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Number_of_lines&lt;/TD&gt;
    &lt;TD align="right"&gt;1&lt;/TD&gt;
    &lt;TD&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Master_log_name&lt;/TD&gt;
    &lt;TD align="right"&gt;2&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_Log_File&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Master_log_pos&lt;/TD&gt;
    &lt;TD align="right"&gt;3&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Read_Master_Log_Pos&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Host&lt;/TD&gt;
    &lt;TD align="right"&gt;4&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_Host&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;User_name&lt;/TD&gt;
    &lt;TD align="right"&gt;5&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_User&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;User_password&lt;/TD&gt;
    &lt;TD align="right"&gt;6&lt;/TD&gt;
    &lt;TD&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Port&lt;/TD&gt;
    &lt;TD align="right"&gt;7&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_Port&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Connect_retry&lt;/TD&gt;
    &lt;TD align="right"&gt;8&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Connect_Retry&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Enabled_ssl&lt;/TD&gt;
    &lt;TD align="right"&gt;9&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_Allowed&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ssl_ca&lt;/TD&gt;
    &lt;TD align="right"&gt;10&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_CA_File&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ssl_capath&lt;/TD&gt;
    &lt;TD align="right"&gt;11&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_CA_Path&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ssl_cert&lt;/TD&gt;
    &lt;TD align="right"&gt;12&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_Cert&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ssl_cipher&lt;/TD&gt;
    &lt;TD align="right"&gt;13&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_Cipher&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ssl_key&lt;/TD&gt;
    &lt;TD align="right"&gt;14&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_Key&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ssl_verify_servert_cert&lt;/TD&gt;
    &lt;TD align="right"&gt;15&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_SSL_Verify_Server_Cert&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Heartbeat&lt;/TD&gt;
    &lt;TD align="right"&gt;16&lt;/TD&gt;
    &lt;TD&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Bind&lt;/TD&gt;
    &lt;TD align="right"&gt;17&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_Bind&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Ignored_server_ids&lt;/TD&gt;
    &lt;TD align="right"&gt;18&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Replicate_Ignore_Server_Ids&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Uuid&lt;/TD&gt;
    &lt;TD align="right"&gt;19&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_UUID&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
  &lt;TR&gt;
    &lt;TD&gt;Retry_count&lt;/TD&gt;
    &lt;TD align="right"&gt;20&lt;/TD&gt;
    &lt;TD&gt;&lt;code&gt;Master_Retry_Count&lt;/code&gt;&lt;/TD&gt;
  &lt;/TR&gt;
&lt;/TABLE&gt;

In addition to giving us crash-safe slaves, the last of these advantages should not be taken lightly.  Being able to handle replication from pure SQL puts some of the key features in the hands of application developers.&lt;p&gt;

As previously mentioned, the replication information is stored in two files:

&lt;dl&gt;
  &lt;dt&gt;&lt;code&gt;master.info&lt;/code&gt;
  &lt;dd&gt;This file contains information about the connection to the master, such as hostname, user, and password, but also information about how much of the binary log has been transferred to the slave.
  &lt;dt&gt;&lt;code&gt;relay-log.info&lt;/code&gt;
  &lt;dd&gt;This file contains information about the current state of replication, that is, how much of the relay log has been applied.
&lt;/dl&gt;

&lt;h4&gt;Options to select replication information repository&lt;/h4&gt;

In order to make the solution flexible, we introduced a general API for adding replication information repositories. This means that we can support multiple types of repositories for replication information, but currently, only the old system using files &lt;code&gt;master.info&lt;/code&gt; and &lt;code&gt;relay-log.info&lt;/code&gt; and the system using tables &lt;code&gt;slave_master_info&lt;/code&gt; and &lt;code&gt;slave_relay_log_info&lt;/code&gt; are supported. In order to select what type of repository to use, two new options were added. These options are also available as server variables.

&lt;dl&gt;
  &lt;dt&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_master-info-repository"&gt;&lt;var&gt;master_info_repository&lt;/var&gt;&lt;/a&gt;
  &lt;dd&gt;The type of repository to use for the master info data seen in Table&amp;nbsp;1.
  &lt;dt&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary.html#option_mysqld_relay-log-info-repository"&gt;&lt;var&gt;relay_log_info_repository&lt;/var&gt;&lt;/a&gt;
  &lt;dd&gt;The type of repository to use for the relay log info seen in Table&amp;nbsp;2.
&lt;/dl&gt;

Both of the variables can be set to either &lt;code&gt;FILE&lt;/code&gt; or &lt;code&gt;TABLE&lt;/code&gt;. If the variable is set to &lt;code&gt;TABLE&lt;/code&gt; the new table-based system will be used and if it is set to &lt;code&gt;FILE&lt;/code&gt;, the old file-based system will be used. The default is &lt;code&gt;FILE&lt;/code&gt;, so make sure to set the value if you want to use the table-based system.&lt;p&gt;

&lt;TABLE class="simple-table figure-right" id="slave_relay_log_info"&gt;
  &lt;caption&gt;&lt;strong&gt;Table 2. &lt;/strong&gt;&lt;code&gt;slave_relay_log_info&lt;/code&gt;&lt;/caption&gt;
  &lt;TR&gt;&lt;TH&gt;Field&lt;/TH&gt;&lt;TH&gt;Line in file&lt;/TH&gt;&lt;TH&gt;Slave status column&lt;/TH&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Master_id&lt;/TD&gt;&lt;TD&gt;&lt;/TD&gt;&lt;TD&gt;&lt;/TD&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Number_of_lines&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;&lt;/TD&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Relay_log_name&lt;/TD&gt;&lt;TD&gt;2&lt;/TD&gt;&lt;TD&gt;&lt;code&gt;Relay_Log_File&lt;/code&gt;&lt;/TD&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Relay_log_pos&lt;/TD&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;TD&gt;&lt;code&gt;Relay_Log_Pos&lt;/code&gt;&lt;/TD&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Master_log_name&lt;/TD&gt;&lt;TD&gt;4&lt;/TD&gt;&lt;TD&gt;&lt;code&gt;Relay_Master_Log_File&lt;/code&gt;&lt;/TD&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Master_log_pos&lt;/TD&gt;&lt;TD&gt;5&lt;/TD&gt;&lt;TD&gt;&lt;code&gt;Exec_Master_Log_Pos&lt;/code&gt;&lt;/TD&gt;&lt;/TR&gt;
  &lt;TR&gt;&lt;TD&gt;Sql_delay&lt;/TD&gt;&lt;TD&gt;6&lt;/TD&gt;&lt;TD&gt;&lt;code&gt;SQL_Delay&lt;/code&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;/TABLE&gt;

If you look in Table&amp;nbsp;1 and Table&amp;nbsp;2 you can see the column names used for the tables as well as the line number in the corresponding file and the column name in the output of &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;. Since we are using tables, the column names are used for storing the data in the table, but when using a file, the column names are only used to identify the correct row to update and the value is inserted at the line number given in the table.&lt;p&gt;
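To make the mapping between line numbers and field names concrete, here is a small sketch that reads a <code>relay-log.info</code>-style file using the "Line in file" column of Table&amp;nbsp;2. The file content is made up for illustration, and the helper <code>parse_relay_log_info</code> is a hypothetical name, not something shipped with the server.

```python
# Field names in the order given by the "Line in file" column of Table 2.
RELAY_LOG_INFO_FIELDS = [
    "Number_of_lines",   # line 1
    "Relay_log_name",    # line 2
    "Relay_log_pos",     # line 3
    "Master_log_name",   # line 4
    "Master_log_pos",    # line 5
    "Sql_delay",         # line 6
]


def parse_relay_log_info(text: str) -> dict:
    """Pair each line of a relay-log.info style file with its field name."""
    return dict(zip(RELAY_LOG_INFO_FIELDS, text.splitlines()))


# Made-up file content, for illustration only.
sample = "6\n./slave-relay-bin.000002\n419\nmaster-bin.000001\n550\n0\n"
info = parse_relay_log_info(sample)
print(info["Relay_log_name"], info["Relay_log_pos"])
```

With the table-based repository no such positional parsing is needed, since each value is stored in a named column.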

The format of the tables has been extended with an additional field that is not present in the files: the &lt;code&gt;Master_id&lt;/code&gt; field. The reason we added this is to make it possible to extend the server to track multiple masters. Note that we currently have no definite plans to add multi-source support, but as good engineers we do not want these tables to be a hindrance to adding multi-source.&lt;p&gt;

&lt;h4&gt;Selecting replication repository engine&lt;/h4&gt;

In contrast with most of the system tables in the server, the replication repositories can be configured to use any storage engine you prefer.  The advantage of this is that you can select the same engine for the replication repositories as the data you're managing. If you do that, both the data and the replication information will be committed as a single transaction.&lt;p&gt;

The new tables are created at installation using the &lt;code&gt;mysql_install_db&lt;/code&gt; script, as usual, and the default engine for these tables is the same as for all system tables: MyISAM. As you know, MyISAM is not very transactional, so it is necessary to set this to use InnoDB instead if you really want crash-safety.  To change the engine for these tables you can just use a normal &lt;samp&gt;ALTER TABLE&lt;/samp&gt;.

&lt;pre class="code"&gt;
slave&amp;gt; ALTER TABLE mysql.slave_master_info ENGINE = InnoDB;
slave&amp;gt; ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;
&lt;/pre&gt;

Note that this works for these tables because they were designed to allow any storage engine to be used for them, but it does not mean that you can change the storage engine for other system tables and expect it to work.

&lt;h4&gt;Event processing&lt;/h4&gt;

This implementation of crash-safe slaves works naturally with both statement-based and row-based replication and there is nothing special that needs to be done in the normal cases. However, these tables interleave with the normal processing in slightly different ways.&lt;p&gt;

To understand how transactions are processed by the SQL thread, let us consider the following example transaction:

&lt;pre class="code"&gt;
START TRANSACTION;
INSERT INTO articles(user, title, body)
      VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
COMMIT;
&lt;/pre&gt;

This transaction will be written to the binary log and then sent over to the slave and written to the relay log in the usual way. Once it is read from the relay log for execution, it will be executed as if an update statement were added to the end of the transaction, before the commit:

&lt;pre class="code"&gt;
START TRANSACTION;
INSERT INTO articles(user, title, body)
      VALUE (4711, 'Taming the Higgs Boson using Clicker Training', '....');
UPDATE users SET articles = articles + 1 WHERE user_id = 4711;
&lt;span style="color: red"&gt;UPDATE mysql.slave_relay_log_info
   SET Master_log_pos = &lt;var&gt;@@Exec_Master_Log_Pos&lt;/var&gt;,
       Master_log_name = &lt;var&gt;@@Relay_Master_Log_File&lt;/var&gt;,
       Relay_log_name = &lt;var&gt;@@Relay_Log_File&lt;/var&gt;,
       Relay_log_pos = &lt;var&gt;@@Relay_Log_Pos&lt;/var&gt;&lt;/span&gt;
COMMIT;
&lt;/pre&gt;

In this example, there are a number of pseudo-server variables (that is, they don't exist for real) that have the same names as the corresponding fields in the result set from &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;.  As you can see, the update of the position information is now inside the transaction and will be committed with the transaction, so if both &lt;code&gt;articles&lt;/code&gt; and &lt;code&gt;mysql.slave_relay_log_info&lt;/code&gt; are in &lt;em&gt;the same transactional engine&lt;/em&gt;, they will be committed as a unit.&lt;p&gt;
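The effect of committing the position together with the data can be tried out in miniature. The sketch below uses an in-memory SQLite database from Python as a stand-in for a transactional engine; it is not how the slave is implemented, the table <code>relay_log_info</code> is a simplified, hypothetical version of <code>mysql.slave_relay_log_info</code>, and a <code>ROLLBACK</code> plays the role of crash recovery.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, articles INTEGER)")
conn.execute("CREATE TABLE relay_log_info (master_log_name TEXT, master_log_pos INTEGER)")
conn.execute("INSERT INTO users VALUES (4711, 0)")
conn.execute("INSERT INTO relay_log_info VALUES ('master-bin.000001', 261)")


def apply_transaction(end_log_pos, crash_before_commit=False):
    """Apply one replicated change and record the new position atomically."""
    conn.execute("BEGIN")
    conn.execute("UPDATE users SET articles = articles + 1 WHERE user_id = 4711")
    conn.execute("UPDATE relay_log_info SET master_log_pos = ?", (end_log_pos,))
    if crash_before_commit:
        conn.execute("ROLLBACK")  # stands in for recovery after a crash
    else:
        conn.execute("COMMIT")


def state():
    """Return (articles, master_log_pos) as currently stored."""
    articles = conn.execute("SELECT articles FROM users").fetchone()[0]
    pos = conn.execute("SELECT master_log_pos FROM relay_log_info").fetchone()[0]
    return articles, pos


apply_transaction(550)
after_commit = state()
apply_transaction(900, crash_before_commit=True)
after_crash = state()
print(after_commit, after_crash)
```

After the simulated crash, neither the data change nor the new position is visible, so the data and the recorded position never disagree.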

This works well for the SQL thread, but what about the I/O thread? There are no transactions executed here at all, so when is the information in this table committed?&lt;p&gt;

Since a commit to the table is expensive (in the same way as syncing a file to disk is expensive when using files as replication information repository), the &lt;code&gt;slave_master_info&lt;/code&gt; table is not updated with each processed event. Depending on the value of &lt;var&gt;sync_master_info&lt;/var&gt; there are a few alternatives.

&lt;dl&gt;
  &lt;dt&gt;If &lt;var&gt;sync_master_info&lt;/var&gt; = 0
  &lt;dd&gt;In this case, the &lt;code&gt;slave_master_info&lt;/code&gt; table is just updated when the slave starts or stops (for any reason, including errors), if the relay log is rotated, or if you execute a &lt;code&gt;CHANGE MASTER&lt;/code&gt; command.
  &lt;dt&gt;If &lt;var&gt;sync_master_info&lt;/var&gt; &amp;gt; 0
  &lt;dd&gt;Then the &lt;code&gt;slave_master_info&lt;/code&gt; table will be updated every &lt;var&gt;sync_master_info&lt;/var&gt; events.
&lt;/dl&gt;
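The two alternatives can be modelled with a simple counter. This is a toy model under my own assumptions, not the server code: the class <code>MasterInfoRepository</code> and its attributes are hypothetical names, and <code>sync_period</code> plays the role of <var>sync_master_info</var>.

```python
class MasterInfoRepository:
    """Toy model of a repository that is stored every sync_period events."""

    def __init__(self, sync_period):
        self.sync_period = sync_period  # plays the role of sync_master_info
        self.events_since_flush = 0
        self.in_memory_pos = 0          # what the I/O thread knows
        self.stored_pos = 0             # what a reader of the table would see
        self.flushes = 0

    def flush(self):
        """Store the in-memory position (the expensive step)."""
        self.stored_pos = self.in_memory_pos
        self.events_since_flush = 0
        self.flushes += 1

    def event_received(self, end_log_pos):
        """Record one event; store the position only every sync_period events."""
        self.in_memory_pos = end_log_pos
        self.events_since_flush += 1
        if self.sync_period > 0 and self.events_since_flush >= self.sync_period:
            self.flush()

    def stop(self):
        """The position is always stored when the slave stops."""
        self.flush()


never = MasterInfoRepository(sync_period=0)
every_two = MasterInfoRepository(sync_period=2)
for end_log_pos in (333, 365, 477):
    never.event_received(end_log_pos)
    every_two.event_received(end_log_pos)
print(never.stored_pos, every_two.stored_pos)
```

With a period of zero the stored position stays at its initial value until <code>stop()</code> is called, while a period of two stores the position after the second event and lags by one event after the third.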

This means that while the slave is running, you cannot really see how much data has been read to the slave without stopping it. If it is important to see how the slave progresses in reading events from the master, then you have to set &lt;var&gt;sync_master_info&lt;/var&gt; to some non-zero value, but you should be aware that there is a cost associated with doing this.&lt;p&gt;

This does not usually pose a problem since the times you need to read the master replication information on a running replication are few and far between. It is much more common to read it when the slave has stopped for some reason: to figure out where the error is or to perform a master fail-over.

&lt;h4&gt;Closing remarks&lt;/h4&gt;

We would be very interested in hearing any comments you have on this feature and how it is implemented. If you want to try this out for yourselves then you can download the MySQL 5.6 Milestone Development Release where all this is implemented from the &lt;a href="http://dev.mysql.com"&gt;MySQL Developer Zone (&lt;code&gt;dev.mysql.com&lt;/code&gt;)&lt;/a&gt;.

If you want to find out more details, the section &lt;a href="http://dev.mysql.com/doc/refman/5.6/en/slave-logs-status.html"&gt;Slave Status Logs&lt;/a&gt; in the MySQL 5.6 reference manual will provide you with all the information.

This is one of the features that presented by Lars Thalmann April 11, 2011 (yesterday) at 2:30pm, at the "MySQL Replication" talk at Collaborate 11 and April 12, 2011 (today) 10:50am "MySQL Replication Update" at the O'Reilly MySQL Conference &amp;amp; Expo.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-1708230410469619438?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/1708230410469619438/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=1708230410469619438' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1708230410469619438'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1708230410469619438'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2011/04/crash-safe-replication.html' title='Crash-safe Replication'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-8103833248886494099</id><published>2011-04-11T14:49:00.002+02:00</published><updated>2011-04-11T14:55:16.283+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='checksum'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><title type='text'>Replication Event Checksum</title><content type='html'>MySQL replication is fast, easy to use, and reliable, but once it
breaks, it can be very hard to figure out what the problem is.  One of the concerns often raised is that events are corrupted, either through failing hardware, network failure, or software bugs.  Even though it is possible to handle errors during transfer over the network using an SSL connection, errors here are rarely the problem.  A more common problem (relatively) is that the events are corrupted either due to a software bug, or hardware error.&lt;p&gt;

To be able to better handle corrupted events, the replication team has added &lt;a href=""&gt;&lt;em class="def"&gt;replication event checksums&lt;/em&gt;&lt;/a&gt; to the MySQL 5.6 Milestone Development Release.

The replication event checksums are added to each event as it is
written to the binary log and are used to check that nothing happened with the event on the way to the slave.  Since the checksums are added to all events in the binary log on the master and are both transferred over the network and written to the relay log on the slave, it is possible to track down events corrupted by hardware problems,
network failures, or software bugs.&lt;p&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;1. Master and Slave with Threads&lt;/caption&gt;
  &lt;tr&gt;&lt;td&gt;
    &lt;object title="Master and Slave with Threads"
            data="http://lh4.ggpht.com/_X_imutbSFuE/TaL4CmoHXcI/AAAAAAAAAII/_XZWdyzgCzk/master-slave.png" type="image/png" width="490" height="230"&gt;
    &lt;/object&gt;
  &lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

The checksum used is a CRC-32 checksum, more precisely ISO-3309, which
is the one supplied with &lt;a href="http://zlib.net/"&gt;zlib&lt;/a&gt;. This is
an efficient checksum algorithm, but there is of course a penalty
since the checksum needs to be generated.  At this time, we don't have
any measurements on the performance impact.&lt;p&gt;
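Since this is the CRC-32 that comes with zlib, it is easy to experiment with it outside the server. The sketch below uses the <code>zlib.crc32</code> function from the Python standard library. Note that the payload is just the statement text, which is my simplification: the real checksum covers the whole binary log event, not only the query string.

```python
import zlib

# "123456789" is the customary test input for this CRC-32 variant.
check_value = zlib.crc32(b"123456789")

event = b"INSERT INTO t1(name) VALUES ('Mats'),('Luis')"
stored_checksum = zlib.crc32(event)

# Corrupt a single byte of the payload.
corrupted = event.replace(b"Mats", b"Matz")


def is_intact(payload: bytes, checksum: int) -> bool:
    """Recompute the CRC-32 and compare it with the stored checksum."""
    return zlib.crc32(payload) == checksum


print(hex(check_value))
print(is_intact(event, stored_checksum), is_intact(corrupted, stored_checksum))
```

A change confined to a single byte always yields a different CRC-32, so the corrupted payload fails the check while the original passes.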

If you look at Figure&amp;nbsp;1 you can see an illustration of how events
propagate through the replication system.  In the figure, the points
where a checksum &lt;em&gt;could&lt;/em&gt; be generated or checked are marked
with numbers.  In the diagram, you can see the threads that handle the
processing of events, and an outgoing arrow from a thread can generate
a checksum while an arrow going into a thread can validate a
checksum. Note, however, that for pragmatic reasons not all
validations or generations can be done.&lt;p&gt;

To enable validation or generation three new options were introduced:

&lt;dl&gt;
  &lt;dt&gt;&lt;code&gt;binlog_checksum&lt;/code&gt;

  &lt;dd&gt;This option is used to control checksum generation. Currently,
    it can accept two different values: &lt;code&gt;NONE&lt;/code&gt; and
    &lt;code&gt;CRC32&lt;/code&gt;, with &lt;code&gt;NONE&lt;/code&gt; being default (for
    backward compatibility).&lt;p&gt; Setting &lt;var&gt;binlog_checksum&lt;/var&gt;
    to &lt;code&gt;NONE&lt;/code&gt; means that no checksum is generated, while
    setting it to &lt;code&gt;CRC32&lt;/code&gt; means that an ISO-3309 CRC-32
    checksum is added to each binary log event.&lt;p&gt;

    This means that a checksum will be generated by the session thread
    and written to the binary log, that is, at point 1 in
    Figure&amp;nbsp;1.

  &lt;dt&gt;&lt;code&gt;master_verify_checksum&lt;/code&gt;

  &lt;dd&gt;This option can be set to either 0 or 1 (with default being 0)
    and indicates that the master should verify any events read from
    the binary log on the master, corresponding to point 2 in
    Figure&amp;nbsp;1. In addition to being read from the binary log by
    the dump thread events are also read when a &lt;kbd&gt;SHOW BINLOG
    EVENTS&lt;/kbd&gt; is issued at the master and a check is done at this
    time as well.&lt;p&gt;

    Setting this flag can be useful to verify that the event really
    written to the binary log is uncorrupted, but it is typically not
    needed in a replication setting since the slave should verify the
    event on reception.

  &lt;dt&gt;&lt;code&gt;slave_sql_verify_checksum&lt;/code&gt;

  &lt;dd&gt;Similar to &lt;code&gt;master_verify_checksum&lt;/code&gt;, this option can
    be set to either 0 or 1 (but defaults to 1) and indicates that the
    &lt;em&gt;SQL thread&lt;/em&gt; should verify the checksum when reading it
    from the relay log on the slave.

    Note that this means that the I/O thread writes a checksum to the
    event written to the relay log, regardless of whether it received
    an event with a checksum or not.&lt;p&gt;

    This means that this option will enable verification at point 5 in
    Figure&amp;nbsp;1 and also enable generation of a checksum at point
    4 in the figure.

&lt;/dl&gt;

If you paid attention, you probably noticed that there is no checking
for point 3 in the figure. This is not necessary since the checksum is
verified when the event is written to the relay log at point 4, and
the I/O thread just does a straight copy of the event (potentially
adding a checksum, as noted above).&lt;p&gt;

So, how does it look when we encounter a checksum error? Let's try it
out and see what happens.

We start by generating a simple binary log with checksums turned on
and see what we get.

&lt;pre class="code wide"&gt;
master&amp;gt; &lt;kbd&gt;CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, name CHAR(50));&lt;/kbd&gt;
Query OK, 0 rows affected (0.04 sec)

master&amp;gt; &lt;kbd&gt;INSERT INTO t1(name) VALUES ('Mats'),('Luis');&lt;/kbd&gt;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

master&amp;gt; &lt;kbd&gt;SHOW BINLOG EVENTS FROM 261;&lt;/kbd&gt;
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name          | Pos | Event_type | Server_id | End_log_pos | Info                                                      |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query      |         1 |         333 | BEGIN                                                     |
| master-bin.000001 | 333 | Intvar     |         1 |         365 | INSERT_ID=1                                               |
| master-bin.000001 | 365 | Query      |         1 |         477 | use `test`; INSERT INTO t1(name) VALUES ('Mats'),('Luis') |
| master-bin.000001 | 477 | Query      |         1 |         550 | COMMIT                                                    |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)
&lt;/pre&gt;

Here, everything looks as before, so no sign of a checksum here, but
let's edit the binlog file directly and change the 's' in 'Mats' to a
'z' and see what happens. First with
&lt;code&gt;MASTER_VERIFY_CHECKSUM&lt;/code&gt; set to 0, and then with it set to
1.

&lt;pre class="code wide"&gt;
master&amp;gt; &lt;kbd&gt;SHOW BINLOG EVENTS FROM 261;&lt;/kbd&gt;
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| Log_name          | Pos | Event_type | Server_id | End_log_pos | Info                                                      |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
| master-bin.000001 | 261 | Query      |         1 |         333 | BEGIN                                                     |
| master-bin.000001 | 333 | Intvar     |         1 |         365 | INSERT_ID=1                                               |
| master-bin.000001 | 365 | Query      |         1 |         477 | use `test`; INSERT INTO t1(name) VALUES ('Matz'),('Luis') |
| master-bin.000001 | 477 | Query      |         1 |         550 | COMMIT                                                    |
+-------------------+-----+------------+-----------+-------------+-----------------------------------------------------------+
4 rows in set (0.00 sec)

master&amp;gt; &lt;kbd&gt;SET GLOBAL MASTER_VERIFY_CHECKSUM=1;&lt;/kbd&gt;
Query OK, 0 rows affected (0.00 sec)

master&amp;gt; &lt;kbd&gt;SHOW BINLOG EVENTS FROM 261;&lt;/kbd&gt;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error
&lt;/pre&gt;

Now, the error message generated is not crystal clear, but there
was an I/O error when reading the binary log: the checksum
verification failed. You can see this because I could show the content
of the binary log with &lt;code&gt;MASTER_VERIFY_CHECKSUM&lt;/code&gt; set to 0,
but not when set to 1. Since the checksum is checked when reading
events from the binary log, we get a checksum failure when using
&lt;kbd&gt;SHOW BINLOG EVENTS&lt;/kbd&gt;.&lt;p&gt;

So, if we undo the corruption and verify that the binary log is correct by issuing a
&lt;kbd&gt;SHOW BINLOG EVENTS&lt;/kbd&gt; again, we can try to send it over to
the slave and see what happens. The steps to do this (in case you want
to try yourself) are:

&lt;ol&gt;
  &lt;li&gt;Start the I/O thread and let it create the relay log using
  &lt;kbd&gt;START SLAVE IO_THREAD&lt;/kbd&gt;.&lt;/li&gt;
  &lt;li&gt;Stop the slave using &lt;kbd&gt;STOP SLAVE&lt;/kbd&gt; (this is necessary
  since the slave buffers part of the relay log).&lt;/li&gt;
  &lt;li&gt;Manually edit the relay log to corrupt one event (I replaced the
  's' with a 'z').&lt;/li&gt;
  &lt;li&gt;Start the slave using &lt;kbd&gt;START SLAVE&lt;/kbd&gt;.&lt;/li&gt;
&lt;/ol&gt;

The result when doing this is an error, as you can see below. Removing
the corruption and starting the slave again will apply the events as
expected.

&lt;pre class="code wide"&gt;
slave&amp;gt; SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                        .
                        .
                        .
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 550
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 419
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                        .
                        .
                        .
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1594
               Last_SQL_Error: Relay log read failure: Could not parse
                               relay log event entry. The possible
                               reasons are: the master's binary log is
                               corrupted...
                        .
                        .
                        .
     Last_SQL_Error_Timestamp: 110406 09:41:40
1 row in set (0.00 sec)
&lt;/pre&gt;

Now, this is all very nice, but if you have a corruption, you also
want to find out where the corruption is&amp;mdash;and that preferably
without having to start the server.  To handle this, the
&lt;code&gt;mysqlbinlog&lt;/code&gt; program was extended to print the CRC
checksum (if there is one) and also to verify it if you give the
&lt;var&gt;verify-binlog-checksum&lt;/var&gt; option to it. 

&lt;pre class="code wide"&gt;
$ &lt;kbd&gt;client/mysqlbinlog --verify-binlog-checksum master-bin.000001&lt;/kbd&gt;
        .
        .
        .
&lt;em style="color: red"&gt;# at 261&lt;/em&gt;
#110406  8:35:28 server id 1  end_log_pos 333 &lt;em style="color: red"&gt;CRC32 0xed927ef2&lt;/em&gt;  Query   thread_id=1...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406  8:35:28 server id 1  end_log_pos 365 CRC32 0x01ed254d  Intvar
SET INSERT_ID=1/*!*/;
&lt;em style="color: red"&gt;ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely...&lt;/em&gt;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
&lt;/pre&gt;

As you can see, an error is emitted for the offending event, and you
can also see the CRC checksum value (which is 32 bits) in the output
above, and it corresponds to the position where the slave stopped for
my corrupted binary log.&lt;p&gt;
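If you want to pick out the checksums and the failing position with a script, the header lines can be matched with a regular expression. This is a sketch that assumes the output looks exactly like the sample above (with the highlighting removed); the format may differ between versions, and the pattern <code>HEADER_RE</code> is my own, not part of <code>mysqlbinlog</code>.

```python
import re

# Matches header lines such as:
#   #110406  8:35:28 server id 1  end_log_pos 333 CRC32 0xed927ef2  Query ...
HEADER_RE = re.compile(r"end_log_pos (\d+) CRC32 (0x[0-9a-f]{8})\s+(\w+)")

output = """\
# at 261
#110406  8:35:28 server id 1  end_log_pos 333 CRC32 0xed927ef2  Query   thread_id=1...
# at 333
#110406  8:35:28 server id 1  end_log_pos 365 CRC32 0x01ed254d  Intvar
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely...
"""

# One (end position, checksum, event type) tuple per successfully read event.
events = [(int(pos), crc, kind) for pos, crc, kind in HEADER_RE.findall(output)]

# The end position of the last good event is where the corrupted event starts.
last_good_end = events[-1][0] if events else None
failed = any(line.startswith("ERROR:") for line in output.splitlines())
print(events, last_good_end, failed)
```

For the sample above, the last event that was read correctly ends at position 365, which is where the corrupted <code>Query</code> event begins.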

This is just the beginning: there are many things that can be done
using checksums, and many new things that are now possible to
implement. If you think that this is a useful feature, please let us
know, and if you think that it needs to be enhanced, changed, or
extended, we would also like to hear from you.

&lt;h2&gt;Closing remarks&lt;/h2&gt;

We would be very interested in hearing any comments you have on this
feature and how it is implemented. If you want to try this out for
yourselves then you can download the MySQL 5.6 Milestone Development
Release where all this is implemented from the &lt;a
href="http://dev.mysql.com"&gt;MySQL Developer Zone
(&lt;code&gt;dev.mysql.com&lt;/code&gt;)&lt;/a&gt;.&lt;p&gt;

If you want to find out the details, the reference documentation for
the replication checksum can be found together with the options
mentioned above:
&lt;ul&gt;
  &lt;li&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_binlog-checksum"&gt;Manual for &lt;var&gt;binlog-checksum&lt;/var&gt; option&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_master-verify-checksum"&gt;Manual for &lt;var&gt;master-verify-checksum&lt;/var&gt; option&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#option_mysqld_slave-sql-verify-checksum"&gt;Manual for &lt;var&gt;slave-sql-verify-checksum&lt;/var&gt; option&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

This is one of the features that are presented by Lars Thalmann today (April 11, 2011) at 2:30pm, at the "MySQL Replication" talk at Collaborate 11 and tomorrow (April 12, 2011) 10:50am "MySQL Replication Update" at the O'Reilly MySQL Conference &amp;amp; Expo.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-8103833248886494099?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/8103833248886494099/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=8103833248886494099' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8103833248886494099'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8103833248886494099'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2011/04/replication-event-checksum.html' title='Replication Event Checksum'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-1218014504320944688</id><published>2011-02-08T13:05:00.002+01:00</published><updated>2011-02-08T13:08:53.716+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><title type='text'>Slave Type Conversions</title><content type='html'>[Note: I'm testing to use &lt;a href="http://code.google.com/p/googlecl/" &gt;googlecl&lt;/a&gt; to post this article.]

Replication is typically used to replicate from a master to one or
more slaves using the same definition of tables on the master and
slave, but in some cases you want to replicate to tables with a
different definition on the slave, for example:

&lt;ul&gt;

  &lt;li&gt;Adding a timestamp column on the slave to see when the row was
  last updated.&lt;/li&gt;

  &lt;li&gt;Eliminating some columns on the slave because you don't need
  them and they take up space that you can use for better
  purposes.&lt;/li&gt;

  &lt;li&gt;Temporarily handling an on-line upgrade of a dual-master or
  circular replication setup.&lt;/li&gt;

&lt;/ul&gt;

Of these alternatives, the last one is critical to any deployment that
wants to stay available. If this case can be handled, most other
changes can also be handled, so let's focus on that.&lt;p&gt;

&lt;table class="figure-right"&gt;
&lt;caption&gt;Figure&amp;nbsp;1. Table with an extra column on slave&lt;/caption&gt;
  &lt;TR&gt;&lt;TH&gt;Master&lt;/TH&gt;&lt;TH&gt;Slave&lt;/TH&gt;&lt;/TR&gt;
&lt;TR&gt;
  &lt;TD valign="top"&gt;&lt;pre class="code wide"&gt;
CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),

    PRIMARY KEY (id))
&lt;/pre&gt;&lt;/TD&gt;
  &lt;TD valign="top"&gt;&lt;pre class="code wide"&gt;
CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),
    &lt;strong&gt;ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP&lt;/strong&gt;,
    PRIMARY KEY (id))
&lt;/pre&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/table&gt;

When using statement-based replication, the plain statements are
replicated&amp;mdash;this can at times be an advantage, but not
always, as you will soon see. The most obvious case is when you have
more or fewer columns on the master than you have on the slave.

To illustrate the problem, let us start with the table definitions in
Figure&amp;nbsp;1. Here a timestamp column was added to the slave to see
when the row was last changed. When using statement-based replication,
we can properly replicate between these tables provided we always give
column names to the statement on the master, for example:&lt;p&gt;

&lt;pre class="code"&gt;
master&amp;gt; &lt;strong&gt;INSERT INTO employee(name, email) VALUES ('Mats', 'mats@example.com');&lt;/strong&gt;
master&amp;gt; &lt;strong&gt;DELETE FROM employee WHERE email = 'mats@example.com';&lt;/strong&gt;
master&amp;gt; &lt;strong&gt;UPDATE employee SET name = 'Matz' WHERE email = 'mats@example.com';&lt;/strong&gt;
&lt;/pre&gt;

In all these cases, the statements execute perfectly well with both
table definitions since the "missing" column has a default value and
each statement gives exactly the names of the columns to update.

The &lt;code&gt;DELETE&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; statements naturally
refer only to the columns on the master, but for &lt;code&gt;INSERT&lt;/code&gt; it
is necessary to add the column names even if the tuple matches the
definition on the master since it could be different on the slave.&lt;p&gt;

Having to give the column names all the time is fragile and if the
user&amp;mdash;or the application&amp;mdash;makes a mistake and types the
following statement, replication on the slave will stop with an
error:&lt;p&gt;

&lt;pre class="code"&gt;
master&amp;gt; &lt;strong&gt;INSERT INTO employee VALUES (DEFAULT, 'Mats', 'mats@example.com');&lt;/strong&gt;
&lt;/pre&gt;
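
The difference between the two forms of &lt;code&gt;INSERT&lt;/code&gt; can be tried out in miniature. The sketch below is only an illustration under some loud assumptions: it uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for the slave (so it is SQLite, not MySQL, and &lt;code&gt;NULL&lt;/code&gt; is used where the MySQL statement has &lt;code&gt;DEFAULT&lt;/code&gt;), and the helper name &lt;code&gt;apply_on_slave&lt;/code&gt; is made up for this example.

```python
import sqlite3

# Stand-in for the slave: the master's three columns plus the extra ts column.
# Assumption: SQLite is used only for illustration; a real slave runs MySQL.
SLAVE_DDL = """
CREATE TABLE employee (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),
    ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP)
"""

def apply_on_slave(statement):
    """Run one replicated statement against a fresh slave table.

    Returns None on success, or the error text if the statement fails.
    """
    slave = sqlite3.connect(":memory:")
    try:
        slave.execute(SLAVE_DDL)
        slave.execute(statement)
        return None
    except sqlite3.Error as err:
        return str(err)
    finally:
        slave.close()

# With column names the statement applies; ts is filled in from its default.
with_names = apply_on_slave(
    "INSERT INTO employee(name, email) VALUES ('Mats', 'mats@example.com')")

# Without column names the slave expects four values but only gets three.
without_names = apply_on_slave(
    "INSERT INTO employee VALUES (NULL, 'Mats', 'mats@example.com')")

print("with column names:   ", with_names)
print("without column names:", without_names)
```

The first statement is applied without complaint, while the second one is rejected because the number of values does not match the number of columns on the slave.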

In contrast to statement-based replication, row-based replication will
do the right thing and throw away extra columns sent by the master or
add default values to extra columns on the slave&amp;mdash;if the column
has a default value&amp;mdash;&lt;em&gt;provided that the columns are added or
removed last in the table.&lt;/em&gt;&lt;p&gt;

This works fine for the example above since the extra timestamp column
is last in the table. The effect is to keep track of when the row was
last updated on the slave, which could be used to see if the row is
current.
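
As a rough mental model of what row-based replication does with trailing columns, consider the sketch below. It is a toy model and not the server's code: the function name &lt;code&gt;apply_row&lt;/code&gt;, the way defaults are represented, and the placeholder string used for the timestamp default are all assumptions made for the example.

```python
def apply_row(master_row, slave_defaults):
    """Fit a row image from the master into the slave's column list.

    master_row     -- values in the master's column order
    slave_defaults -- one entry per slave column, in column order; each
                      entry is a (has_default, default_value) pair
    """
    slave_width = len(slave_defaults)
    # Extra trailing columns sent by the master are thrown away.
    row = list(master_row[:slave_width])
    # Extra trailing columns on the slave must be filled from their defaults.
    for has_default, default_value in slave_defaults[len(row):]:
        if not has_default:
            raise ValueError("extra slave column has no default value")
        row.append(default_value)
    return row

NO_DEFAULT = (False, None)

# Slave has an extra trailing ts column with a default (Figure 1).
slave_with_ts = [NO_DEFAULT, NO_DEFAULT, NO_DEFAULT, (True, "CURRENT_TIMESTAMP")]
print(apply_row([1, "Mats", "mats@example.com"], slave_with_ts))

# Slave has dropped the trailing email column.
slave_without_email = [NO_DEFAULT, NO_DEFAULT]
print(apply_row([1, "Mats", "mats@example.com"], slave_without_email))
```

Note that the model only ever looks at the tail of the column list, which is why the columns have to be added or removed last in the table.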

&lt;div class="note"&gt;Depending on what you want to accomplish, there
could be better techniques for this, described in &lt;a
href="http://mysqlhighavailability.com/"&gt;our book&lt;/a&gt;. The problem is
that the timestamp might not have enough precision in a high-load
situation.&lt;/div&gt;

So, row-based replication in MySQL 5.1 contains support for using more
or fewer columns on the slave as compared to the master, but there
was one case that was not supported: replicating between different
column types. This is very important for basic upgrade scenarios
where you, for example, change the size of some column during an
upgrade.&lt;p&gt;

&lt;table class="figure-right"&gt;
&lt;caption&gt;Figure&amp;nbsp;2. Different types on master and slave&lt;/caption&gt;
&lt;TR&gt;&lt;TH&gt;Master&lt;/TH&gt;&lt;TH&gt;Slave&lt;/TH&gt;&lt;/TR&gt;
&lt;TR&gt;
  &lt;TD valign="top"&gt;&lt;pre class="code wide"&gt;
CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name CHAR(64),
    email CHAR(64),
    PRIMARY KEY (id))
&lt;/pre&gt;&lt;/TD&gt;
  &lt;TD valign="top"&gt;&lt;pre class="code wide"&gt;
CREATE TABLE employee (
    id SMALLINT AUTO_INCREMENT,
    name VARCHAR(64),
    email VARCHAR(64),
    PRIMARY KEY (id))
&lt;/pre&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;/table&gt;

For example, consider the table definition in Figure&amp;nbsp;2.

In this case, the intention is to save space on the slave by storing
the strings in a &lt;code&gt;VARCHAR&lt;/code&gt; field instead of a
&lt;code&gt;CHAR&lt;/code&gt; field&amp;mdash;recall that &lt;code&gt;VARCHAR&lt;/code&gt; fields
are variable length strings while &lt;code&gt;CHAR&lt;/code&gt; fields occupy a
fixed space in the row. (We don't care too much about the reasons for
using &lt;code&gt;CHAR&lt;/code&gt; on the master, we just use this example to
illustrate the problem.)&lt;p&gt;

When using statement-based replication, this works well since the actual
statement is replicated.  However, when using row-based replication we
have the additional requirement (in 5.1) that the column types
&lt;em&gt;have to have identical base types&lt;/em&gt;. Unfortunately,
&lt;code&gt;CHAR&lt;/code&gt; and &lt;code&gt;VARCHAR&lt;/code&gt; do not have the same base
type, so replication will stop with an error when you try to execute
the &lt;code&gt;INSERT&lt;/code&gt;, which is not very helpful.&lt;p&gt;

Fortunately, the replication team has extended row-based replication
with a new feature in MySQL 5.5: that of converting between types when
replicating from a master to a slave with a different table
definition.  Along with this feature, stricter type checking and
better error messages have also been implemented.&lt;p&gt;

The conversion checks the &lt;em&gt;declared types&lt;/em&gt; on the master and
slave and decides before executing the transaction if the conversion
is allowed. This means that it does not investigate the actual
&lt;em&gt;values&lt;/em&gt; replicated: only the types of the column on the master
and the slave.  Besides giving better performance, since individual values
are not checked, this check is done so that you can be sure that
&lt;em&gt;any&lt;/em&gt; value replicated between the tables will work, not just
the values that you happened to have in your test suite.&lt;p&gt;

When dealing with conversions, we are only considering conversions
&lt;em&gt;within&lt;/em&gt; the groups below.

&lt;dl&gt;

  &lt;dt&gt;&lt;strong&gt;Integer types&lt;/strong&gt;
  &lt;dd&gt;&lt;code&gt;TINYINT&lt;/code&gt;, &lt;code&gt;SMALLINT&lt;/code&gt;,
  &lt;code&gt;MEDIUMINT&lt;/code&gt;, &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;BIGINT&lt;/code&gt;

  &lt;dt&gt;&lt;strong&gt;Decimal types&lt;/strong&gt;
  &lt;dd&gt;&lt;code&gt;DECIMAL&lt;/code&gt;, &lt;code&gt;FLOAT&lt;/code&gt;, &lt;code&gt;DOUBLE&lt;/code&gt;,
  &lt;code&gt;NUMERIC&lt;/code&gt;

  &lt;dt&gt;&lt;strong&gt;String types&lt;/strong&gt;
  &lt;dd&gt;&lt;code&gt;CHAR(&lt;em&gt;N&lt;/em&gt;)&lt;/code&gt;, &lt;code&gt;VARCHAR(&lt;em&gt;N&lt;/em&gt;)&lt;/code&gt;,
  &lt;code&gt;TEXT&lt;/code&gt; even for different values of &lt;em&gt;N&lt;/em&gt; on master
  and slave.

  &lt;dt&gt;&lt;strong&gt;Binary types&lt;/strong&gt;
  &lt;dd&gt;&lt;code&gt;BINARY(&lt;em&gt;N&lt;/em&gt;)&lt;/code&gt;,
  &lt;code&gt;VARBINARY(&lt;em&gt;N&lt;/em&gt;)&lt;/code&gt;, &lt;code&gt;BLOB&lt;/code&gt; even for
  different values for &lt;em&gt;N&lt;/em&gt; on master and slave.

  &lt;dt&gt;&lt;strong&gt;Bit types&lt;/strong&gt;
  &lt;dd&gt;Conversion between &lt;code&gt;BIT(&lt;em&gt;N&lt;/em&gt;)&lt;/code&gt; for different
  values of &lt;em&gt;N&lt;/em&gt; on master and slave.
&lt;/dl&gt;

Since the string and binary types only differ in the character set
they use&amp;mdash;&lt;em&gt;and replication is not aware of character sets
yet&lt;/em&gt;&amp;mdash;replication between string and binary types will be
possible simply because the character set is not known. Don't rely on
this though; as soon as &lt;a
href="http://bugs.mysql.com/bug.php?id=47673" &gt;Bug#47673&lt;/a&gt; is fixed,
string and binary types will be separated into distinct groups and
replication will stop if the character sets don't allow conversion.&lt;p&gt;

Within each group, we also have two types of conversions:
&lt;em&gt;non-lossy conversions&lt;/em&gt; and &lt;em&gt;lossy conversions&lt;/em&gt;. With a
non-lossy conversion you are guaranteed that no information is lost,
but with lossy conversions it is possible that you lose some
information. A typical example of a non-lossy conversion is converting
from a &lt;code&gt;CHAR(32)&lt;/code&gt; field to a &lt;code&gt;CHAR(64)&lt;/code&gt;
field&amp;mdash;since the target field is wider than the source field,
there is no risk that any part of the string is lost. Converting in
the other direction, however, is a lossy conversion since a string
with more than 32 characters cannot fit into a &lt;code&gt;CHAR(32)&lt;/code&gt;
field. A more odd example is conversion between &lt;code&gt;FLOAT&lt;/code&gt; and
&lt;code&gt;DECIMAL(N,M)&lt;/code&gt;, which is &lt;em&gt;always&lt;/em&gt; considered lossy,
regardless of the direction of the conversion, since it
cannot be guaranteed that all floating-point numbers can be converted
to decimal numbers without losing precision, and vice versa.&lt;p&gt;

Which conversions are allowed is controlled with a new
server variable &lt;code&gt;SLAVE_TYPE_CONVERSIONS&lt;/code&gt;, which is of the
type &lt;code&gt;SET('ALL_LOSSY','ALL_NON_LOSSY')&lt;/code&gt;, that is, it is a
&lt;em&gt;set&lt;/em&gt; of allowed conversions.  The default for this variable is
the empty set, meaning that no conversions are allowed at all.&lt;p&gt;

If the &lt;code&gt;ALL_NON_LOSSY&lt;/code&gt; constant is in the set, all
conversions (within each group) that do not lose any information are
allowed. For example, replicating from &lt;code&gt;CHAR(32)&lt;/code&gt; to
&lt;code&gt;TINYTEXT&lt;/code&gt; is allowed since the conversion goes to a wider
field (even if it is a different type).&lt;p&gt;

If the &lt;code&gt;ALL_LOSSY&lt;/code&gt; constant is in the set, all conversions
(again, within the same group) that could potentially lose information
are allowed. For example, conversion to a narrower field on the slave,
such as &lt;code&gt;CHAR(32)&lt;/code&gt; to &lt;code&gt;CHAR(16)&lt;/code&gt; is
allowed. &lt;em&gt;Note that non-lossy conversions are not automatically
allowed when &lt;code&gt;ALL_LOSSY&lt;/code&gt; is set.&lt;/em&gt;&lt;p&gt;
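
To make the rules concrete, here is a minimal sketch of the decision in Python. It is a simplified model and not the server's implementation: it covers only the integer types and &lt;code&gt;CHAR&lt;/code&gt;/&lt;code&gt;VARCHAR&lt;/code&gt;, it compares sizes only, and the names &lt;code&gt;classify&lt;/code&gt; and &lt;code&gt;conversion_allowed&lt;/code&gt; as well as the tuple representation of a column type are assumptions made for the example.

```python
# Storage size in bytes for the integer group (toy model).
INTEGER_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}
# Only two members of the string group are modelled here.
STRING_TYPES = {"CHAR", "VARCHAR"}

def classify(master, slave):
    """Classify a conversion between two (type_name, length) column types.

    Returns 'SAME', 'NON_LOSSY', 'LOSSY', or None when the types belong to
    different groups (or to a group this toy model does not cover).
    """
    if master == slave:
        return "SAME"
    (m_name, m_len), (s_name, s_len) = master, slave
    if m_name in INTEGER_BYTES and s_name in INTEGER_BYTES:
        m_size, s_size = INTEGER_BYTES[m_name], INTEGER_BYTES[s_name]
    elif m_name in STRING_TYPES and s_name in STRING_TYPES:
        m_size, s_size = m_len, s_len
    else:
        return None
    # Only the declared types are inspected, never the replicated values.
    return "NON_LOSSY" if s_size >= m_size else "LOSSY"

def conversion_allowed(master, slave, slave_type_conversions):
    """Decide from the declared types alone whether replication may proceed."""
    kind = classify(master, slave)
    if kind is None:
        return False
    if kind == "SAME":
        return True  # identical types need no conversion at all
    return ("ALL_" + kind) in slave_type_conversions

print(conversion_allowed(("CHAR", 32), ("CHAR", 64), set()))
print(conversion_allowed(("CHAR", 32), ("CHAR", 64), {"ALL_NON_LOSSY"}))
print(conversion_allowed(("CHAR", 32), ("CHAR", 16), {"ALL_LOSSY"}))
print(conversion_allowed(("CHAR", 32), ("CHAR", 64), {"ALL_LOSSY"}))
```

The last call illustrates the point made above: with only &lt;code&gt;ALL_LOSSY&lt;/code&gt; in the set, the widening conversion is refused, so you normally want both constants in the set if you want lossy conversions at all.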

&lt;div class="note"&gt;The prefix &lt;code&gt;ALL&lt;/code&gt; is used since we were considering the possibility of allowing conversions within certain groups only. For example, to allow only lossy conversions for strings and non-lossy conversions for integers, we could set &lt;code&gt;SLAVE_TYPE_CONVERSIONS&lt;/code&gt; to &lt;code&gt;'STRING_LOSSY,INTEGER_NON_LOSSY'&lt;/code&gt;. This is, however, pure speculation at this time.&lt;/div&gt;

If you are interested in the details of how slave type conversions work, you can find more information in the MySQL Reference Manual in &lt;a href="http://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html" &gt;Replication with Differing Tables on Master and Slave&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-1218014504320944688?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/1218014504320944688/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=1218014504320944688' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1218014504320944688'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1218014504320944688'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2011/02/slave-type-conversions.html' title='Slave Type Conversions'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-3560881037600286899</id><published>2010-09-22T23:30:00.002+02:00</published><updated>2010-09-24T15:18:12.314+02:00</updated><title type='text'>Have you seen my replication files?</title><content type='html'>I recently started looking over how to get information about the relay log files and binary log files using an SQL interface. Being able to do that can be quite handy when one is going to work with replication in various ways.  
In my particular case, I wanted to get the path to the relay log index file and binary log index file to be able to read the binary log files as well as the relay log files directly.  You are probably familiar with the &lt;code&gt;--relay-log-index&lt;/code&gt; and &lt;code&gt;--relay-log&lt;/code&gt; options, which specify the name of the relay log index file and the base name of the relay log files. These options can be given either an absolute path or a relative path. If the value starts with a &lt;code&gt;/&lt;/code&gt;, it is considered an absolute path (drive letters are allowed on Windows though), otherwise the path is relative to the data directory (which is specified through the &lt;code&gt;--datadir&lt;/code&gt; option). The values supplied to these options are available from SQL as the system variables &lt;var&gt;relay_log_index&lt;/var&gt; and &lt;var&gt;relay_log&lt;/var&gt; respectively.

The recommendation is to always set the &lt;code&gt;--relay-log&lt;/code&gt; and &lt;code&gt;--relay-log-index&lt;/code&gt; options since the default values for these options contain the hostname. The problem with this is that if the database files are moved to a new machine with a different hostname, the server will not be able to pick up the files correctly and will assume that they do not exist.

The logic for finding the location of the relay log files can be quite daunting; to find the location of the relay log index file:
&lt;ol&gt;
  &lt;li&gt;If &lt;var&gt;relay_log_index&lt;/var&gt; is set, this is the location of the relay log index file.&lt;/li&gt;
  &lt;li&gt;If &lt;var&gt;relay_log_index&lt;/var&gt; is not set, then the value supplied to the &lt;var&gt;relay_log&lt;/var&gt; option is used to figure out the name of the relay log index file.&lt;/li&gt;
  &lt;li&gt;If neither &lt;var&gt;relay_log_index&lt;/var&gt; nor &lt;var&gt;relay_log&lt;/var&gt; is set, then the name of the relay log index file is taken by stripping the directory and extension from the &lt;var&gt;pid_file&lt;/var&gt; variable (set using the &lt;code&gt;--pid-file&lt;/code&gt; option), if supplied, and adding &lt;code&gt;-relay-bin.index&lt;/code&gt; to the end of the string.
  &lt;ul&gt;
    &lt;li&gt;The &lt;var&gt;pid_file&lt;/var&gt; variable has a default value which consists of &lt;code&gt;&lt;var&gt;datadir&lt;/var&gt;/&lt;var&gt;hostname&lt;/var&gt;.pid&lt;/code&gt;, which would give the relay log index file a name of &lt;code&gt;&lt;var&gt;datadir&lt;/var&gt;/&lt;var&gt;hostname&lt;/var&gt;-relay-bin.index&lt;/code&gt;.&lt;/li&gt;
  &lt;/ul&gt;&lt;/li&gt;
  &lt;li&gt;If the path is a relative path&amp;mdash;that is, the path does not start with a directory separator&amp;mdash;then the value of &lt;var&gt;datadir&lt;/var&gt; is prepended to the relay log index file name.&lt;/li&gt;
&lt;/ol&gt;

Keeping track of all these details is not something I want to spend my time on, so I wrote a stored function for computing the name of the relay log index file which I simply called &lt;code&gt;relay_log_index_file&lt;/code&gt;:

&lt;pre class="code"&gt;
CREATE FUNCTION relay_log_index_file () RETURNS VARCHAR(256)
  DETERMINISTIC
  READS SQL DATA
BEGIN
  DECLARE rli_name VARCHAR(256);
  IF @@relay_log_index IS NOT NULL THEN
    SET rli_name = @@relay_log_index;
  ELSEIF @@relay_log IS NOT NULL THEN
    SET rli_name = CONCAT(@@relay_log, '.index');  -- index file is named after the relay log base name
  ELSE
    BEGIN
      DECLARE l_pid_file VARCHAR(256);
      DECLARE l_pid_base VARCHAR(256);
      SET l_pid_file = SUBSTRING_INDEX(@@pid_file, '/', -1);
      SET l_pid_base = SUBSTRING_INDEX(l_pid_file, '.', 1);
      SET rli_name = CONCAT(l_pid_base, '-relay-bin.index');
    END;
  END IF;

  IF rli_name NOT LIKE '/%' THEN
    RETURN CONCAT(@@datadir, rli_name);
  END IF;

  RETURN rli_name;
END
&lt;/pre&gt;


This is a quite complicated way of figuring out the location of the relay log files and hardly something that I consider very useful. It would be much better if the &lt;var&gt;relay_log_index&lt;/var&gt; variable gave the complete path to the file, regardless of what was given to the &lt;var&gt;--relay-log-index&lt;/var&gt; option (or even if the option was given at all).&lt;p&gt;

Being able to fetch the relay log index file is quite convenient, but being able to fetch the binary log index file would be even more convenient. Unfortunately, there is no such variable. The &lt;var&gt;--log-bin&lt;/var&gt; option can be used to supply a base name to use for the binary log, but the &lt;var&gt;log_bin&lt;/var&gt; variable can only be ON or OFF, which in my book is not very smart.  To fix this, I created &lt;a href="http://forge.mysql.com/worklog/task.php?id=5465"&gt;WL#5465&lt;/a&gt;, which introduces three new variables&amp;mdash;&lt;var&gt;log_bin_basename&lt;/var&gt;, &lt;var&gt;relay_log_basename&lt;/var&gt;, and &lt;var&gt;log_bin_index&lt;/var&gt;&amp;mdash;and changes the behaviour of &lt;var&gt;relay_log_index&lt;/var&gt;.

&lt;dl&gt;
  &lt;dt&gt;&lt;var&gt;log_bin_basename&lt;/var&gt;
  &lt;dd&gt;This is a global read-only variable that contains the base file name used for the binary log files, that is, the full path to the files but omitting the extension.
    &lt;ul&gt;
      &lt;li&gt;If a full path was given to &lt;var&gt;--log-bin&lt;/var&gt;, this will be stored in &lt;var&gt;log_bin_basename&lt;/var&gt;.&lt;/li&gt;
      &lt;li&gt;If a relative path was given to &lt;var&gt;--log-bin&lt;/var&gt;, the contents of &lt;var&gt;datadir&lt;/var&gt; will be used as directory and prepended to the value of &lt;var&gt;--log-bin&lt;/var&gt;.&lt;/li&gt;
      &lt;li&gt;Otherwise, the value of &lt;var&gt;datadir&lt;/var&gt;  will be used as the directory of the file and the base name is created by taking the basename of &lt;var&gt;pid_file&lt;/var&gt; (name without extension) and adding '&lt;code&gt;-bin&lt;/code&gt;'.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;dt&gt;&lt;var&gt;log_bin_index&lt;/var&gt;
  &lt;dd&gt;This is a global read-only variable containing the full name of the binary log index file. If no value is given, the value of &lt;var&gt;log_bin_basename&lt;/var&gt; is used and the extension '&lt;code&gt;.index&lt;/code&gt;' is added.
  &lt;dt&gt;&lt;var&gt;relay_log_basename&lt;/var&gt;
  &lt;dd&gt;This is a global read-only variable containing the base file name used for the relay log file, that is, the full path to the relay logs but not including the extension. The value of this variable is created in the same way as for &lt;var&gt;log_bin_basename&lt;/var&gt; with the only difference that the '&lt;code&gt;-relay-bin&lt;/code&gt;' suffix is used instead of '&lt;code&gt;-bin&lt;/code&gt;'.
  &lt;dt&gt;&lt;var&gt;relay_log_index&lt;/var&gt;
  &lt;dd&gt;This is a global read-only variable containing the full name of the relay log index file. If no value is given, the value of &lt;var&gt;relay_log_basename&lt;/var&gt; is used and the extension '&lt;code&gt;.index&lt;/code&gt;' is added.
&lt;/dl&gt;
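
The rules for the two binary log variables can be summarised in a few lines of Python. This is only a sketch of my reading of the description above, assuming that the base name is derived from the value given to &lt;var&gt;--log-bin&lt;/var&gt;; the function name &lt;code&gt;binlog_names&lt;/code&gt; and its parameters are made up for the example, and paths are treated as POSIX paths.

```python
import os.path

def binlog_names(datadir, pid_file, log_bin=None, log_bin_index=None):
    """Compute (log_bin_basename, log_bin_index) from the option values.

    log_bin and log_bin_index are the values given to the --log-bin and
    --log-bin-index options, or None if the option has no value.
    """
    def _add_datadir(filename):
        # Absolute paths are kept as given; relative ones are placed in datadir.
        if os.path.isabs(filename):
            return filename
        return os.path.join(datadir, filename)

    # Default base name: the pid file name without directory and extension.
    pidfile_base = os.path.basename(os.path.splitext(pid_file)[0])
    basename = _add_datadir(log_bin or pidfile_base + "-bin")
    index = _add_datadir(log_bin_index or basename + ".index")
    return basename, index

# No options given: both names are derived from the pid file name.
print(binlog_names("/var/lib/mysql", "/var/lib/mysql/romeo.pid"))
# Relative base name: datadir is prepended.
print(binlog_names("/var/lib/mysql", "/var/lib/mysql/romeo.pid", "master-bin"))
# Absolute base name: used as given.
print(binlog_names("/var/lib/mysql", "/var/lib/mysql/romeo.pid", "/logs/master-bin"))
```

The same function with '&lt;code&gt;-relay-bin&lt;/code&gt;' instead of '&lt;code&gt;-bin&lt;/code&gt;' would give the corresponding relay log names.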

With these new variables, fetching the full path of the binary log index file is as easy as doing a '&lt;code&gt;SELECT @@log_bin_index&lt;/code&gt;'.

&lt;div class="note"&gt;If you're interested in whether the patch for this worklog will be in any particular server version, or if it is pushed at all, you have to check the status of the worklog.  Even though I have described the architecture and implemented a patch, there is no way to know where it ends up or even if it is pushed at all.&lt;/div&gt;

&lt;div class="digression"&gt;
&lt;h1&gt;An alternative: let the application do the job&lt;/h1&gt;
Creating a stored function for computing the relay log index file name might be overkill in many situations. If the value is needed from several different connections, it makes sense to create it as a stored function to allow it to be used by different applications. It can, however, just as well be placed in the application code, which would then compute the location of the relay log index file using a single query to the server.&lt;p&gt;

The information you need is the data directory from &lt;var&gt;datadir&lt;/var&gt;, the pid file name from &lt;var&gt;pid_file&lt;/var&gt; (in the event that the relay log or the relay log index option does not have a value), and the &lt;var&gt;relay_log&lt;/var&gt; and &lt;var&gt;relay_log_index&lt;/var&gt; values.&lt;p&gt;

For example, the following Python code could be used to compute the data directory, the base name used for creating relay log files, and the name of the index file using a single query to the database server:
&lt;pre class="code"&gt;
import os.path

def get_relay_log_info(connection):
    """Return the data directory, relay log base name, and index file name."""
    cursor = connection.cursor()
    cursor.execute("SELECT @@datadir, @@pid_file, @@relay_log, @@relay_log_index")
    datadir, pid_file, relay_log, relay_log_index = cursor.fetchone()

    def _add_datadir(filename):
        # Absolute paths are used as given; relative paths are relative to datadir.
        if os.path.isabs(filename):
            return filename
        else:
            return os.path.join(datadir, filename)

    # Fall back on the pid file name (without directory and extension)
    # when relay_log or relay_log_index does not have a value.
    pidfile_base = os.path.basename(os.path.splitext(pid_file)[0])
    base_name = _add_datadir(relay_log or pidfile_base + '-relay-bin')
    index_file = _add_datadir(relay_log_index or base_name + '.index')

    return { 'datadir': datadir, 'base': base_name, 'index': index_file }
&lt;/pre&gt;
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-3560881037600286899?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/3560881037600286899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=3560881037600286899' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/3560881037600286899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/3560881037600286899'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2010/09/have-you-seen-my-replication-files.html' title='Have you seen my replication files?'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-6762701833587863291</id><published>2010-08-18T20:49:00.005+02:00</published><updated>2010-08-18T21:01:36.291+02:00</updated><title type='text'>Binary Log Group Commit - Recovery</title><content type='html'>It has been a while since I wrote the &lt;a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html"&gt;previous article&lt;/a&gt;, but the merging of Oracle and Sun here resulted in quite a lot of time having to be spent on attending various events and courses for legal reasons (one of the reasons I prefer working for smaller companies) and together with a summer vacation spent on looking over the house, there was little time for anything else.  
This is the second post of three, and in the last one I will cover some optimizations that improve performance significantly.&lt;p&gt;

In the &lt;a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html"&gt;previous article&lt;/a&gt;, an approach was outlined to handle the binary log group commit.  The basic idea is to use the binary log as a ticketing system by reserving space in it for the transactions that are going to be written.  This provides an order on the transactions and also allows the transactions to be written to the binary log in parallel, thereby boosting performance.

As noted in the previous post, a crash while writing transactions to the binary log requires recovery. To understand what needs to be changed, it is necessary to understand the structure of the binary log as well as how recovery after a crash currently works together with the implementation of &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/xa.html"&gt;2-phase commit that MySQL uses&lt;/a&gt;.&lt;p&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;1. Binlog file structure&lt;/caption&gt;
  &lt;tr&gt;&lt;td&gt;
    &lt;object title="Binary log structure"
            data="http://www.kindahl.net/images/binlog-v4-structure.svg" type="image/svg+xml" width="400" height="196"&gt;
    &lt;object width="400" height="196" type="image/png" data="http://2.bp.blogspot.com/_X_imutbSFuE/TGwrkDsrIrI/AAAAAAAAAHQ/AgTJE5--3_4/s320/binlog-v4-structure.png" id="BLOGGER_PHOTO_ID_5506824342835241650"&gt;
    &lt;/object&gt;
    &lt;/object&gt;
  &lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;


&lt;h3&gt;A quick intro to the structure of the binary log&lt;/h3&gt;

Figure&amp;nbsp;1 gives the rough structure of the binary log with a set of &lt;em&gt;binlog files&lt;/em&gt; and a &lt;em&gt;binlog index file&lt;/em&gt;. The binlog index file just lists the binlog files that make up the binary log, while each binlog file has the real contents of the binary log that you can see when executing a &lt;code&gt;SHOW BINLOG EVENTS&lt;/code&gt;.&lt;p&gt;

Each binlog file consists of a sequence of &lt;em&gt;binlog events&lt;/em&gt;, where the most important event from our perspective is the &lt;em&gt;Format description event&lt;/em&gt;. In addition, each binlog file is also normally terminated by a &lt;em&gt;Rotate event&lt;/em&gt; that refers to the next binlog file in the sequence.&lt;p&gt;

The Format description event is used to describe the contents of the binlog file and therefore contains a lot of information about the binlog file.  In this case we are interested in a special flag called &lt;code&gt;LOG_EVENT_BINLOG_IN_USE_F&lt;/code&gt;, which is used to tell if the binlog is actively being written by the server. When the server opens a new binlog file, this flag is set to indicate that the file is in use, and when the binary log is rotated and a new binlog file created, this flag is cleared when closing the old binlog file.&lt;p&gt;

In the event of a crash, the flag will therefore still be set, so the server can see that the file was not closed properly and start by performing recovery.
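
If you want to check the flag yourself, a few lines of Python are enough. The sketch below rests on my understanding of the version 4 binlog layout, so treat the constants as assumptions to verify against the server source: a 4-byte magic number, followed by the Format description event whose 19-byte common header ends with a 2-byte little-endian flags field, where &lt;code&gt;LOG_EVENT_BINLOG_IN_USE_F&lt;/code&gt; is the lowest bit. The function name &lt;code&gt;binlog_in_use&lt;/code&gt; and the path in the usage comment are made up for the example.

```python
BINLOG_MAGIC = b"\xfebin"        # 4-byte magic number at the start of a binlog file
EVENT_HEADER_LENGTH = 19         # common header of a version 4 binlog event
FLAGS_OFFSET = 17                # the 2-byte flags field comes last in the header
LOG_EVENT_BINLOG_IN_USE_F = 0x1  # assumed value of the in-use flag

def binlog_in_use(path):
    """Return True if the first event of the binlog file has the in-use flag set."""
    with open(path, "rb") as binlog:
        if binlog.read(len(BINLOG_MAGIC)) != BINLOG_MAGIC:
            raise ValueError("not a binlog file: %s" % path)
        header = binlog.read(EVENT_HEADER_LENGTH)
    if len(header) != EVENT_HEADER_LENGTH:
        raise ValueError("truncated event header in %s" % path)
    flags = int.from_bytes(header[FLAGS_OFFSET:FLAGS_OFFSET + 2], "little")
    # The flag is set exactly when OR-ing it into the flags changes nothing.
    return (flags | LOG_EVENT_BINLOG_IN_USE_F) == flags

# Example call (hypothetical path):
#   binlog_in_use("/var/lib/mysql/master-bin.000042")
```

Note that the function will also report a file as in use while a running server is writing to it, so it only indicates a crash when the server is known to be stopped.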

&lt;h3&gt;Recovery and the binary log&lt;/h3&gt;

When recovering, the server has to find all transactions that were partially executed and decide if they are going to be rolled back or committed properly. The deciding point when a transaction will be committed instead of rolled back is when the transaction has been written to the binary log. To do this, the server has to find all transactions that were written to the binary log and tell all storage engines to commit these transactions.&lt;p&gt;

The recovery procedure is executed when the binary log is opened&amp;mdash;which the server does by calling &lt;code&gt;TC_LOG_BINLOG::open&lt;/code&gt; during startup. When the binary log is opened, recovery is done if the last open binlog file was not closed properly. An outline of the procedure executed is:

&lt;ol&gt;
  &lt;li&gt;Open the binlog index file and go through it to find the last binlog file mentioned there [&lt;code&gt;TC_LOG_BINLOG::open&lt;/code&gt;]&lt;/li&gt;
  &lt;li&gt;Open this binlog file and check if the &lt;code&gt;LOG_EVENT_BINLOG_IN_USE_F&lt;/code&gt; flag is set&lt;/li&gt;
  &lt;li&gt;If the flag was clear, then the server stopped properly and no recovery is necessary. Otherwise, the server did not stop properly and recovery starts.&lt;/li&gt;
  &lt;li&gt;The last binlog file is now open, so the entire binlog file is scanned and the XID of each Xid event is recorded. These XIDs denote the transactions that were properly written to the binary log&amp;mdash;that is, the transactions that shall be committed [&lt;code&gt;TC_LOG_BINLOG::recover&lt;/code&gt;].&lt;/li&gt;
  &lt;li&gt;Each storage engine is handed the list of XIDs of transactions to commit through the &lt;code&gt;handlerton::recover&lt;/code&gt; interface function [&lt;code&gt;ha_recover&lt;/code&gt;].&lt;/li&gt;
  &lt;li&gt;The storage engine will then commit each transaction in the list and roll back all the others.&lt;/li&gt;
&lt;/ol&gt;
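
The outline above can be condensed into a small Python model. It is a sketch of the procedure only, not the server code: events are represented as made-up (type, XID) pairs, each storage engine is reduced to the set of XIDs it has prepared, and the function names are chosen for the example.

```python
XID_EVENT = "Xid"

def xids_to_commit(binlog_events):
    """Scan the whole binlog file and record the XID of every Xid event."""
    return {xid for event_type, xid in binlog_events if event_type == XID_EVENT}

def recover(binlog_events, in_use_flag, engines):
    """Outline of recovery; engines maps an engine name to its prepared XIDs."""
    if not in_use_flag:
        return {}  # the file was closed properly: nothing to recover
    committed = xids_to_commit(binlog_events)
    decisions = {}
    for name, prepared in engines.items():
        # Each engine commits what reached the binary log and rolls back the rest.
        decisions[name] = {
            "commit": sorted(xid for xid in prepared if xid in committed),
            "rollback": sorted(xid for xid in prepared if xid not in committed),
        }
    return decisions

events = [("Format_desc", None), ("Query", None), ("Xid", 10),
          ("Query", None), ("Xid", 11), ("Query", None)]
print(recover(events, True, {"innodb": {10, 11, 12}}))
```

In this example, transaction 12 was prepared in the engine but never made it to the binary log, so it is the one that gets rolled back.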

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;2. Parallel binary log group commit&lt;/caption&gt;
  &lt;tr&gt;&lt;td&gt;
      &lt;object data="http://www.kindahl.net/images/binlog-crash-state.svg" width="300" height="300" type="image/svg+xml"&gt;
      &lt;object data="http://1.bp.blogspot.com/_X_imutbSFuE/TGwsUxcHC4I/AAAAAAAAAHY/TWfllb04458/s320/binlog-crash-state.png" width="300" height="300" type="image/png" id="BLOGGER_PHOTO_ID_5506825179747519362"&gt;
      &lt;/object&gt;
      &lt;/object&gt;
    &lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;h3&gt;So, what's the problem?&lt;/h3&gt;

The procedure above works fine, so what are the problems we have to solve to implement the procedure described in the &lt;a href="http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html"&gt;previous article&lt;/a&gt;? If you look at Figure&amp;nbsp;2, you get a hint of what the problem is.&lt;p&gt;

Now, assume that threads 1, 2, and 3 in Figure&amp;nbsp;2 are writing transactions to disk (starting at positions &lt;span class="math"&gt;Trans_Pos&lt;sub&gt;1&lt;/sub&gt;&lt;/span&gt;, &lt;span class="math"&gt;Trans_Pos&lt;sub&gt;2&lt;/sub&gt;&lt;/span&gt;, and &lt;span class="math"&gt;Trans_Pos&lt;sub&gt;3&lt;/sub&gt;&lt;/span&gt; respectively) and that a preceding thread (a thread that got a binlog position before &lt;code&gt;Last_Complete&lt;/code&gt;) decides that it is time to call &lt;code&gt;fsync&lt;/code&gt; to group commit the state this far. The binlog file will then be written in this state&amp;mdash;where some transactions are partially written&amp;mdash;and &lt;code&gt;Last_Committed&lt;/code&gt; will be set to the value of &lt;code&gt;Last_Complete&lt;/code&gt;, leading to the situation depicted in Figure&amp;nbsp;2.&lt;p&gt;

As you can see in the figure, thread 2 has already finished writing data to the binary log, and its transaction is therefore written to durable storage. Since thread 1&amp;mdash;which precedes thread 2 in the binary log&amp;mdash;has not completed yet, thread 2 has not yet committed and is still waiting for all the preceding transactions to complete. If a crash occurs in this situation, it is necessary to somehow find the XIDs of all transactions that have committed&amp;mdash;excluding the transaction that thread 2 has completed&amp;mdash;and commit them to the storage engine when recovering.&lt;p&gt;

&lt;h3&gt;A proposal for a new recovery algorithm&lt;/h3&gt;

In the original algorithm, the scan of the binlog file stopped when the file ended. Since there can now be partially written events in the binlog file after the "real" end of the file (the binlog file ends logically at &lt;code&gt;Last_Committed&lt;/code&gt;/&lt;code&gt;Last_Complete&lt;/code&gt;), we have to find some other way to detect the logical end of the file.&lt;p&gt;

To handle this, it is necessary to somehow mark events that are not yet committed so that the recovery algorithm can find the correct position where the binlog file ends. The same problem occurs if one wants to persist the end of the binlog file when &lt;a href="http://forge.mysql.com/worklog/task.php?id=4925"&gt;preallocating the binlog file&lt;/a&gt;. There are basically three ways to handle this:
&lt;ul&gt;
  &lt;li&gt;Write the end of the binlog file in the binlog file header (that is, the Format description log event). &lt;/li&gt;
  &lt;li&gt;Mark each event by zeroing out a field that cannot be zero&amp;mdash;for example, the length, the event type, or event position&amp;mdash;before writing the event to the binary log. Then write this field with the correct value after the entire event has been written.&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://forge.mysql.com/worklog/task.php?id=2540"&gt;Checksum the events&lt;/a&gt; and find the end of the binary log by scanning for the first event with an incorrect checksum.&lt;/li&gt;
&lt;/ul&gt;

&lt;dl&gt;
&lt;dt&gt;&lt;strong&gt;Write the length in the binlog file header&lt;/strong&gt;&lt;/dt&gt;

&lt;dd&gt;Finding the length of the binlog in this case is easy: just inspect the header and find the length of the binlog file there.  In this case, it is necessary to update the length after the event has been written since there may be an &lt;code&gt;fsync&lt;/code&gt; call at any time between starting to write the event data and finishing writing the event. Normally, this means updating two blocks of the file for each event written, which can be a problem since it requires at least the block containing the header and all the blocks that were written since the last group commit to be written when calling &lt;code&gt;fsync&lt;/code&gt;. If a large number of events are written between each &lt;code&gt;fsync&lt;/code&gt;, this might not impose a large penalty, but if &lt;code&gt;sync_binlog=1&lt;/code&gt; it might become quite expensive.  Some experiments done by &lt;a href="http://yoshinorimatsunobu.blogspot.com/"&gt;Yoshinori&lt;/a&gt; showed a drop from 15k events/sec to 10k events/sec, which means that we lose one third in performance.&lt;p&gt;

&lt;strong&gt;Digression.&lt;/strong&gt; The measurements that Yoshinori did consisted of one &lt;code&gt;pwrite&lt;/code&gt; to write the event, one &lt;code&gt;pwrite&lt;/code&gt; to write the length to the header and then a call to &lt;code&gt;fsync&lt;/code&gt;. It is, in other words, most similar to using &lt;code&gt;sync_binlog=1&lt;/code&gt;. In reality, however, this will not be the case since a user that is using the binary log group commit will have several events written between each call to &lt;code&gt;fsync&lt;/code&gt;. Since these writes will be to memory (the file pages are in memory), performance will not drop as much. To evaluate the behavior for a group commit situation better, writing 10 events at a time was compared as well (pretending to be &lt;code&gt;sync_binlog=10&lt;/code&gt;). Straight append (using &lt;code&gt;write&lt;/code&gt;) gave at that point 110k events/sec and writing to the header before calling &lt;code&gt;fsync&lt;/code&gt; gave 80k events/sec. This means a performance reduction of 27%, which is an improvement but still a very large overhead.
&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Use a marker field&lt;/strong&gt;&lt;/dt&gt;

&lt;dd&gt;The second alternative is to use one of the fields as a marker field. By setting one of the fields that cannot be zero to zero, it is possible to detect that the event is incorrect and stop at the event before that. Good candidate fields are the length&amp;mdash;which cannot be zero for any event and is four bytes&amp;mdash;and the event type, which is one byte and where zero denotes an unknown event and never occurs naturally in a binlog file. The technique would be to first blank out the type field of the event, write the event to the binlog file, and then use &lt;code&gt;pwrite&lt;/code&gt; to fill in the correct type code after the entire event is written. If an &lt;code&gt;fsync&lt;/code&gt; occurs before the event type is written, the event will be marked as unknown and if a crash occurs before the event is completely written (and written to disk), it will be possible to scan the binlog file to find the first event that is marked as unknown. In order for this technique to work, it is necessary to zero the unused part of the binlog file before starting to write anything there (or at least zero out the event type). Otherwise, crash recovery will not be able to correctly detect where the last completely written event is located.&lt;p&gt;

Compared to the previous approach, this does not require writing to locations far apart (except in rare circumstances when the event spans two pages). It also has the advantage of not requiring any change of the binlog format. This technique is likely to be quite efficient.  (Note that most of the writes will be to memory, so there will not be any extraneous "seeks" over the disk to zero out parts of the file.)&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Checksum on each event&lt;/strong&gt;&lt;/dt&gt;

&lt;dd&gt;The third alternative is to rely on an &lt;a href="http://forge.mysql.com/worklog/task.php?id=2540"&gt;event checksum&lt;/a&gt; to detect events that are incompletely written. This approach is by far the most efficient of the approaches since the event checksum is naturally written last. It also has the advantage of not requiring the unused parts of the binlog file to be zeroed since it is unlikely that the checksum will be correct for the event unless the event has been fully written.  This also makes it a very good candidate for detecting the end of the binlog file when preallocating the binlog file. The disadvantage is, of course, that it requires checksums to be enabled and implemented.&lt;/dd&gt;
&lt;/dl&gt;
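
As a rough illustration of the third alternative, the following Python sketch scans a log and stops at the first event whose checksum does not verify. The event layout (a 4-byte length, the payload, then a CRC32 over both) and the helper names &lt;code&gt;encode_event&lt;/code&gt; and &lt;code&gt;logical_end&lt;/code&gt; are assumptions made for this example; they are not the real binlog event format.

```python
import zlib


def encode_event(payload):
    """Frame a payload as [length:4][payload][crc32:4] (hypothetical layout)."""
    body = len(payload).to_bytes(4, "little") + payload
    return body + zlib.crc32(body).to_bytes(4, "little")


def logical_end(log):
    """Return the offset just past the last event whose checksum verifies."""
    pos = 0
    while len(log) - pos >= 8:
        length = int.from_bytes(log[pos:pos + 4], "little")
        end = pos + 4 + length + 4
        if end > len(log):
            break  # event claims to extend past the physical end of the file
        stored = int.from_bytes(log[end - 4:end], "little")
        if zlib.crc32(log[pos:end - 4]) != stored:
            break  # incompletely written event: this is the logical end
        pos = end
    return pos


good = encode_event(b"trx-1") + encode_event(b"trx-2")
full = encode_event(b"trx-3")
torn = full[:5] + bytes(3) + full[8:]   # three payload bytes never reached disk
print(logical_end(good + torn) == len(good))  # True
```

The torn write is simulated by zeroing three payload bytes. CRC32 detects any error burst of up to 32 bits, so the scan is certain to stop at the damaged event in this example.&lt;p&gt;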

With this in mind, the best approach seems to be to checksum each event and use that to detect the end of the binary log. If necessary, the second approach can be implemented when the binlog is not checksummed.&lt;p&gt;
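
For the fallback, the marker-field technique could look roughly like the following Python sketch. It assumes a Unix-like system (for &lt;code&gt;os.pwrite&lt;/code&gt; and &lt;code&gt;os.pread&lt;/code&gt;) and a made-up event layout of a 1-byte type followed by a 4-byte length and the payload. The names &lt;code&gt;write_event&lt;/code&gt;, &lt;code&gt;logical_end&lt;/code&gt;, and &lt;code&gt;run_demo&lt;/code&gt; are hypothetical.

```python
import os
import tempfile

UNKNOWN = 0  # type code zero marks an event that is not completely written


def write_event(fd, pos, type_code, payload, finish=True):
    """Write [type:1][length:4][payload] at pos, filling in the type last."""
    header = bytes([UNKNOWN]) + len(payload).to_bytes(4, "little")
    os.pwrite(fd, header + payload, pos)        # type field is still zero
    if finish:
        os.pwrite(fd, bytes([type_code]), pos)  # mark the event as complete
    return pos + len(header) + len(payload)


def logical_end(fd, size):
    """Scan from the start and stop at the first event marked unknown."""
    pos = 0
    while size - pos >= 5:
        header = os.pread(fd, 5, pos)
        if header[0] == UNKNOWN:
            break
        pos += 5 + int.from_bytes(header[1:5], "little")
    return pos


def run_demo():
    with tempfile.TemporaryFile() as f:
        fd, size = f.fileno(), 64
        os.pwrite(fd, bytes(size), 0)           # preallocate and zero the file
        p1 = write_event(fd, 0, 2, b"trx-1")
        p2 = write_event(fd, p1, 2, b"trx-2")
        write_event(fd, p2, 2, b"trx-3", finish=False)  # "crash" before marking
        return logical_end(fd, size), p2


print(run_demo())  # (20, 20)
```

The third event is fully present in the file, but since its type field was never filled in, the scan treats position 20 as the logical end.&lt;p&gt;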

The next article will wrap up the description  by pointing out some efficiency issues and how to solve them to get an efficient implementation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-6762701833587863291?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/6762701833587863291/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=6762701833587863291' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/6762701833587863291'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/6762701833587863291'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html' title='Binary Log Group Commit - Recovery'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-7498942908461072533</id><published>2010-04-30T06:23:00.008+02:00</published><updated>2010-08-18T21:02:33.749+02:00</updated><title type='text'>Binary Log Group Commit - An Implementation Proposal</title><content type='html'>It is with interest that I read &lt;a href="http://kristiannielsen.livejournal.com/12254.html"&gt;Kristian's&lt;/a&gt;
&lt;a href="http://kristiannielsen.livejournal.com/12408.html"&gt;three&lt;/a&gt;
&lt;a href="http://kristiannielsen.livejournal.com/12553.html"&gt;blogs&lt;/a&gt;
on the binary log group commit. In the articles, he mentions InnoDB's
&lt;code&gt;prepare_commit_mutex&lt;/code&gt; as the main hindrance to accomplishing group commits&amp;mdash;which it indeed is&amp;mdash;and proposes to remove it with the motivation that &lt;code&gt;FLUSH TABLES WITH READ LOCK&lt;/code&gt; can be used to get a good binlog position instead. That is a solution&amp;mdash;but not really a good solution&amp;mdash;as Kristian points out in the last post.&lt;p&gt;

The &lt;code&gt;prepare_commit_mutex&lt;/code&gt; is used to ensure that the
order of transactions in the binary log is the same as the order of
transactions in the InnoDB log&amp;mdash;and keeping the same order in the logs is critical for getting a true on-line backup to work, so
removing it is not really an option, which Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of
transactions.&lt;p&gt;

To understand how to solve the problem, it is necessary to take a
closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.&lt;p&gt;

The transaction data is stored in a per-thread &lt;em&gt;transaction
cache&lt;/em&gt; and the &lt;em&gt;transaction size&lt;/em&gt; is the size of the data
in the transaction cache.

In addition, each transaction will have a &lt;em&gt;transaction binlog
position&lt;/em&gt; (or just &lt;em&gt;transaction position&lt;/em&gt;) where the
transaction data is written in the binary log.&lt;p&gt;

The procedure can be outlined in the following steps:&lt;p&gt;

&lt;ol&gt;
  &lt;li&gt;Prepare InnoDB [&lt;code&gt;ha_prepare&lt;/code&gt;]:&lt;/li&gt;

  &lt;ol&gt;
    &lt;li&gt;Write prepare record to log buffer&lt;/li&gt;

    &lt;li&gt;&lt;code&gt;fsync()&lt;/code&gt; log file to disk (this can currently do
    group commit)&lt;/li&gt;

    &lt;li&gt;Take &lt;code&gt;prepare_commit_mutex&lt;/code&gt;&lt;/li&gt;
  &lt;/ol&gt;

  &lt;li&gt;Log transaction to binary log [&lt;code&gt;TC_LOG_BINLOG::log_xid&lt;/code&gt;]:&lt;/li&gt;
  &lt;ol&gt;
    &lt;li&gt;Lock binary log&lt;/li&gt;
    &lt;li&gt;Write transaction data to binary log&lt;/li&gt;
    &lt;li&gt;Sync binary log based on &lt;code&gt;sync_binlog&lt;/code&gt;. This forces
    the binlog to always &lt;code&gt;fsync()&lt;/code&gt; (no group commit) due to
    &lt;code&gt;prepare_commit_mutex&lt;/code&gt;&lt;/li&gt;
    &lt;li&gt;Unlock binary log&lt;/li&gt;
  &lt;/ol&gt;

  &lt;li&gt;Commit InnoDB:&lt;/li&gt;
  &lt;ol&gt;
    &lt;li&gt;Release &lt;code&gt;prepare_commit_mutex&lt;/code&gt;&lt;/li&gt;
    &lt;li&gt;Write commit record to log buffer&lt;/li&gt;
    &lt;li&gt;Sync log buffer to disk (this can currently do group commit)&lt;/li&gt;
    &lt;li&gt;InnoDB locks are released&lt;/li&gt;
  &lt;/ol&gt;
&lt;/ol&gt;

There are mainly two problems with this approach:

&lt;ul&gt;

  &lt;li&gt;The InnoDB row level and table level locks are released very
  late in the sequence, which affects concurrency.  Ideally, we need
  to release the locks very early, preferably as soon as we have
  prepared InnoDB.&lt;/li&gt;

  &lt;li&gt;It is not possible to perform a group commit in step 2.&lt;/li&gt;

&lt;/ul&gt;

As you can see here, the prepare of the storage engines (in this
case just InnoDB) is done before the binary log mutex is taken, and
that means that if the &lt;code&gt;prepare_commit_mutex&lt;/code&gt; is removed it
is possible for another thread to overtake a transaction so that the
prepare and the write to the binary log are done in a different order.&lt;p&gt;

To solve this, Mark suggests using a queue or a ticket system to
ensure that transactions are committed in the same order, but we
actually already have such a system that we can use to assign tickets:
namely the binary log.&lt;p&gt;

The idea is to allocate space in the binary log for the transaction to
be written. This gives us a sequence number that we can use to order
the transactions.&lt;p&gt;
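
A minimal sketch of this ticket idea, in Python: reserving space is just a fetch-and-add on a shared position under a lock, and the returned start position doubles as the sequence number. The class &lt;code&gt;BinlogAllocator&lt;/code&gt; and its method &lt;code&gt;reserve&lt;/code&gt; are hypothetical names chosen for the example, not server code.

```python
import threading


class BinlogAllocator:
    """Hands out non-overlapping byte ranges in a (hypothetical) binlog file."""

    def __init__(self):
        self._lock = threading.Lock()
        self.next_available = 0

    def reserve(self, size):
        """Reserve size bytes and return the start position (the ticket)."""
        with self._lock:
            pos = self.next_available
            self.next_available += size
            return pos


alloc = BinlogAllocator()
tickets = []
record = threading.Lock()


def worker(size):
    pos = alloc.reserve(size)
    with record:
        tickets.append((pos, size))


threads = [threading.Thread(target=worker, args=(10 + i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

tickets.sort()
# Sorted by position, each reservation starts exactly where the previous ends.
contiguous = all(a[0] + a[1] == b[0] for a, b in zip(tickets, tickets[1:]))
print(contiguous, alloc.next_available)  # True 108
```

Whatever order the threads run in, the reserved ranges never overlap and their start positions give a total order of the transactions.&lt;p&gt;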

In the &lt;a
href="http://forge.mysql.com/worklog/task.php?id=5223"&gt;worklog on
binary log group commits&lt;/a&gt; you will find the complete description as
well as the status of the evolving work.&lt;p&gt;

In this post, I will outline an approach that &lt;a
href="http://harrison-fisk.blogspot.com"&gt;Harrison&lt;/a&gt; and I have
discussed, which we think will solve the problems mentioned above. This
post covers the procedure during normal operations, the
&lt;a href="http://mysqlmusings.blogspot.com/2010/08/binary-log-group-commit-recovery.html"&gt;next post&lt;/a&gt; will discuss recovery, and the third post (but
likely not the last on the subject) will discuss some optimizations
that can be done.&lt;p&gt;

I want to emphasize that the fact that we have a worklog does not
involve any guarantees or promises of what, when, or even if any
patches will be pushed to any release of MySQL.&lt;p&gt;

In &lt;a
href="http://forge.mysql.com/worklog/task.php?id=4007"&gt;Worklog #4007&lt;/a&gt; an
approach for writing the binary log is suggested where space is
allocated for the transaction in the binary log before actually
starting to write it. In addition to avoiding unnecessary locking of
the binary log, it also allows us to use the binary log to order the
transactions in-place. We will use this idea of reserving space in the
binary log to implement the binary log group commit.&lt;p&gt;

By re-structuring the procedure above slightly, we can ensure that
the transactions are written in the same order in both the InnoDB
transaction log and the binary log.&lt;p&gt;

There are two ways to re-structure the code: one simple and one more
complicated that potentially can render better performance. To
simplify the presentation, it is assumed that pre-allocation is
handled elsewhere, for example using &lt;a
href="http://forge.mysql.com/worklog/task.php?id=4925" &gt;Worklog
#4925&lt;/a&gt;. In a real implementation, pre-allocation can either be
handled when a new binlog file is created, or when transaction data is
being written to the binary log.

&lt;h2&gt;The sequential write approach&lt;/h2&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure 1. Sequential binary log group commit&lt;/caption&gt;
  &lt;tr&gt;
    &lt;td&gt;
&lt;img style="cursor:pointer; cursor:hand;width: 235px; height: 320px;" src="http://4.bp.blogspot.com/_X_imutbSFuE/S9pqgibJrUI/AAAAAAAAAGM/vPJNJc0PFuM/s320/binlog-write-simple.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5465798204996562242" /&gt;
    &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

In the sequential write approach, the transactions are still written
to the binary log in order and the code is just re-ordered to avoid
keeping mutexes when calling &lt;code&gt;fsync()&lt;/code&gt;.

To describe the algorithm, three shared variables are introduced to
keep track of the status of replication:

&lt;dl&gt;

  &lt;dt&gt;&lt;code&gt;Next_Available&lt;/code&gt; 

  &lt;dd&gt;This variable keeps track of where a new transaction can be written

  &lt;dt&gt;&lt;code&gt;Last_Committed&lt;/code&gt;

  &lt;dd&gt;This variable keeps track of the last committed transaction,
  meaning that all transactions preceding this position are actually
  on disk.  This variable is not necessary in the real implementation,
  but it is kept here to simplify the presentation of the algorithm.

  &lt;dt&gt;&lt;code&gt;Last_Complete&lt;/code&gt;

  &lt;dd&gt;This variable keeps track of the last complete transaction. All
  transactions preceding this point are actually written to the binary
  log, but are not necessarily flushed to disk yet.

&lt;/dl&gt;

You can see an illustration of how the variables are used with the
binary log in Figure&amp;nbsp;1 where you can also see three threads each
waiting to write a transaction.  All three variables are initially set
to the beginning of the binary log and it is always true that &lt;span class="math"&gt;
&lt;code&gt;Last_Committed&lt;/code&gt; &amp;le;
&lt;code&gt;Last_Complete&lt;/code&gt; &amp;le; &lt;code&gt;Next_Available&lt;/code&gt;
&lt;/span&gt;.

The procedure can be described in the following steps:

&lt;ol&gt;
  &lt;li&gt;Lock the binary log&lt;/li&gt;

  &lt;li&gt;Save value of &lt;code&gt;Next_Available&lt;/code&gt; in a variable
  &lt;span class="math"&gt;Trans_Pos&lt;/span&gt; and increase
  &lt;code&gt;Next_Available&lt;/code&gt; with the size of the transaction.&lt;/li&gt;

  &lt;li&gt;Prepare InnoDB:&lt;/li&gt;

  &lt;ol&gt;

    &lt;li&gt;Write prepare record to log buffer (but do not
    &lt;code&gt;fsync()&lt;/code&gt; buffer here)&lt;/li&gt; 

    &lt;li&gt;Release row locks&lt;/li&gt;

  &lt;/ol&gt;
  
  &lt;li&gt;Unlock binary log&lt;/li&gt;
  
  &lt;li&gt;Post prepare InnoDB:&lt;/li&gt;

  &lt;ol&gt;
    &lt;li&gt;&lt;code&gt;fsync()&lt;/code&gt; log file to disk, which can now be done
    using group commit since no mutex is held.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;li&gt;Log transaction to binary log:&lt;/li&gt;

  &lt;ol&gt;

    &lt;li&gt;Wait until &lt;code&gt;Last_Complete&lt;/code&gt; =
    &lt;code&gt;Trans_Pos&lt;/code&gt;. (This can be implemented using a
    condition variable and a mutex.)&lt;/li&gt;
  
    &lt;li&gt;Write transaction data to binary log using
    &lt;code&gt;pwrite&lt;/code&gt;.  At this point, it is not really necessary to
    use &lt;code&gt;pwrite&lt;/code&gt; since the transaction data is simply
    appended, but it will be used in the second algorithm, so we
    introduce it here.&lt;/li&gt;

    &lt;li&gt;Update &lt;code&gt;Last_Complete&lt;/code&gt; to
    &lt;code&gt;Trans_Pos&lt;/code&gt; + transaction size.&lt;br/&gt;&lt;/li&gt;

    &lt;li&gt;Broadcast the new position to all waiting threads to wake
    them up.&lt;/li&gt;

    &lt;li&gt;Call &lt;code&gt;fsync()&lt;/code&gt; to persist binary log on disk. This
    can now be group committed.&lt;/ol&gt;

  &lt;li&gt;Commit InnoDB:&lt;/li&gt;

  &lt;ol&gt;

    &lt;li&gt;Write commit record to log buffer&lt;/li&gt;

    &lt;li&gt;Sync log buffer to disk, which currently can be group
    committed.&lt;/li&gt;

  &lt;/ol&gt;
&lt;/ol&gt;
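
The ordering part of the procedure (steps 1, 2, 4, and 6a to 6d) can be sketched in Python as follows. This is a simplified model, not the server implementation: the class &lt;code&gt;SequentialBinlog&lt;/code&gt; is hypothetical, a &lt;code&gt;bytearray&lt;/code&gt; stands in for the binlog file, and the InnoDB prepare and commit steps as well as the &lt;code&gt;fsync()&lt;/code&gt; calls are left out.

```python
import threading


class SequentialBinlog:
    def __init__(self):
        self.cond = threading.Condition()  # mutex and condition variable in one
        self.next_available = 0
        self.last_complete = 0
        self.data = bytearray()            # stands in for the binlog file

    def reserve(self, size):
        # Steps 1, 2 and 4: take a position under the lock.
        with self.cond:
            pos = self.next_available
            self.next_available += size
            return pos

    def log(self, pos, payload):
        with self.cond:
            # Step 6a: wait until all preceding transactions are written.
            while self.last_complete != pos:
                self.cond.wait()
            # Step 6b: append the transaction data.
            self.data += payload
            # Steps 6c and 6d: advance Last_Complete and wake every waiter.
            self.last_complete = pos + len(payload)
            self.cond.notify_all()


binlog = SequentialBinlog()
payloads = [b"trx-A;", b"trx-B;", b"trx-C;"]
positions = [binlog.reserve(len(p)) for p in payloads]
threads = [threading.Thread(target=binlog.log, args=(pos, p))
           for pos, p in zip(positions, payloads)]
for t in reversed(threads):  # start in reverse order to force waiting
    t.start()
for t in threads:
    t.join()
print(bytes(binlog.data))  # b'trx-A;trx-B;trx-C;'
```

Even though the threads are started in reverse order, the data ends up in the order in which the positions were reserved.&lt;p&gt;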

To implement group commit, it is sufficient to have a condition
variable and wait for that for a specified interval. Once the interval
has passed, the thread can call &lt;code&gt;fsync()&lt;/code&gt;, after
which it broadcasts the fact that data has been flushed to disk to
other waiting threads so that they can skip this. Typically, the code
looks something along these lines (we ignore checking error codes here
to simplify the description):

&lt;pre class="code"&gt;
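/* Needs &amp;lt;pthread.h&amp;gt;, &amp;lt;time.h&amp;gt;, &amp;lt;errno.h&amp;gt;, and &amp;lt;unistd.h&amp;gt;.
   Assumed here: trans_end is this thread's Trans_Pos + transaction size,
   and binlog_fd is the file descriptor of the binlog file. */
pthread_mutex_lock(&amp;amp;binlog_lock);
while (Last_Committed &amp;lt; trans_end) {      /* our data is not on disk yet */
  struct timespec timeout;
  clock_gettime(CLOCK_REALTIME, &amp;amp;timeout);
  timeout.tv_nsec += 1000000;               /* 1 msec */
  if (timeout.tv_nsec &amp;gt;= 1000000000) {     /* carry into seconds */
    timeout.tv_sec += 1;
    timeout.tv_nsec -= 1000000000;
  }
  int error= pthread_cond_timedwait(&amp;amp;binlog_flush, &amp;amp;binlog_lock, &amp;amp;timeout);
  if (error == ETIMEDOUT &amp;amp;&amp;amp; Last_Committed &amp;lt; trans_end) {
    fsync(binlog_fd);                       /* flush everything written so far */
    Last_Committed = Last_Complete;
    pthread_cond_broadcast(&amp;amp;binlog_flush);
  }
}
pthread_mutex_unlock(&amp;amp;binlog_lock);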
&lt;/pre&gt;

There are a few observations regarding this approach:

&lt;ul&gt;

  &lt;li&gt;Step 6a requires a condition variable and a mutex when waiting
  for &lt;code&gt;Last_Complete&lt;/code&gt; to reach
  &lt;code&gt;Trans_Pos&lt;/code&gt;. Since there is just a single condition
  variable, it is necessary to broadcast a wakeup to &lt;em&gt;all&lt;/em&gt;
  waiting threads, which each will evaluate the condition just to find
  a single thread that should continue, while the other threads go to
  sleep again.&lt;p&gt;

  This means that the condition will be checked &lt;span
  class="math"&gt;O(N&lt;sup&gt;2&lt;/sup&gt;)&lt;/span&gt; times to commit &lt;span
  class="math"&gt;N&lt;/span&gt; transactions.

  This is a waste of resources, especially if there are many
  threads waiting, and if we can avoid this, we can gain
  performance.&lt;/li&gt;

  &lt;li&gt;Since the thread has a good position in the binary log where it
  could write, it could just as well start writing instead of
  waiting. It will not interfere with any other threads, regardless of
  whether locks are held or not.&lt;/li&gt;

&lt;/ul&gt;
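
To make the first observation concrete, here is a small worst-case count in Python. It assumes that all &lt;span class="math"&gt;N&lt;/span&gt; threads are already waiting when the first one is allowed to proceed; the function name &lt;code&gt;condition_checks&lt;/code&gt; is just for illustration.

```python
def condition_checks(n):
    """Count condition evaluations when n waiters share one condition variable."""
    checks = 0
    waiting = n
    while waiting > 0:
        checks += waiting  # a broadcast wakes every waiter, each re-checks
        waiting -= 1       # exactly one of them is allowed to proceed
    return checks


for n in (10, 100, 1000):
    print(n, condition_checks(n))
# 10 55
# 100 5050
# 1000 500500
```

The count is N(N+1)/2, so the number of checks grows quadratically with the number of waiting threads.&lt;p&gt;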

These observations lead us to the second approach, that of writing
transaction data to the binary log in parallel.

&lt;h2&gt;A parallel write approach&lt;/h2&gt;

&lt;table class="figure-right"&gt;
  &lt;caption&gt;Figure&amp;nbsp;2. Parallel binary log group commit&lt;/caption&gt;
  &lt;tr&gt;
    &lt;td&gt;
&lt;img style="cursor:pointer; cursor:hand;width: 235px; height: 320px;" src="http://3.bp.blogspot.com/_X_imutbSFuE/S9pq59HvM9I/AAAAAAAAAGU/fzoG6_TKEKo/s320/binlog-write-parallel.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5465798641659622354" /&gt;
    &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

In this approach, each session is allowed to write to the binary log
at the same time using &lt;code&gt;pwrite&lt;/code&gt; since the space for the
transaction data has already been allocated when preparing the
engines.  Figure&amp;nbsp;2 illustrates how the binary log is filled in (grey
areas) by multiple threads at the same time.  Similar to the
sequential write approach, we still have the
&lt;code&gt;Last_Complete&lt;/code&gt;, &lt;code&gt;Last_Committed&lt;/code&gt;, and
&lt;code&gt;Next_Available&lt;/code&gt; variables.&lt;p&gt;

Each thread does not have to wait for other threads before writing,
but it &lt;em&gt;does&lt;/em&gt; have to wait for the other threads to
&lt;em&gt;commit&lt;/em&gt;.

This is necessary since we required the order of commits in the InnoDB
log and the binary log to be the same. In reality, this does not pose
a problem since the I/O is buffered, hence the writes are done to
in-memory file buffers.&lt;p&gt;

The algorithm looks quite similar to the sequential write approach,
but notice that in step 6, the transaction data is simply written to
the binary log using &lt;code&gt;pwrite&lt;/code&gt;.

&lt;ol&gt;
  &lt;li&gt;Lock the binary log&lt;/li&gt;

  &lt;li&gt;Save value of &lt;code&gt;Next_Available&lt;/code&gt; in a local variable
  &lt;code&gt;Trans_Pos&lt;/code&gt; and increase
  &lt;code&gt;Next_Available&lt;/code&gt; with the size of the transaction.&lt;/li&gt;

  &lt;li&gt;Prepare InnoDB:&lt;/li&gt;

  &lt;ol&gt;

    &lt;li&gt;Write prepare record to log buffer (but do not
    &lt;code&gt;fsync()&lt;/code&gt; buffer here)&lt;/li&gt; 

    &lt;li&gt;Release row locks&lt;/li&gt;

  &lt;/ol&gt;
  
  &lt;li&gt;Unlock binary log&lt;/li&gt;
  
  &lt;li&gt;Post prepare InnoDB:&lt;/li&gt;

  &lt;ol&gt;
    &lt;li&gt;&lt;code&gt;fsync()&lt;/code&gt; log file to disk, which can now be done
    using group commit since no mutex is held.&lt;/li&gt;
  &lt;/ol&gt;

  &lt;li&gt;Log transaction to binary log:&lt;/li&gt;

  &lt;ol&gt;

    &lt;li&gt;Write transaction data to binary log using
    &lt;code&gt;pwrite&lt;/code&gt;. There is no need to keep a lock to protect
    the binary log here since all threads will write to different
    positions.&lt;/li&gt;

    &lt;li&gt;Wait until &lt;code&gt;Last_Complete&lt;/code&gt; =
    &lt;code&gt;Trans_Pos&lt;/code&gt;.&lt;/li&gt;
  
    &lt;li&gt;Update &lt;code&gt;Last_Complete&lt;/code&gt; to &lt;code&gt;Trans_Pos&lt;/code&gt; +
    transaction size.&lt;br/&gt;&lt;/li&gt;

    &lt;li&gt;Broadcast the new position to all waiting threads to wake
    them up.&lt;/li&gt;

    &lt;li&gt;Call &lt;code&gt;fsync()&lt;/code&gt; to persist binary log on disk. This
    can now be group committed.

  &lt;/ol&gt;

  &lt;li&gt;Commit InnoDB:&lt;/li&gt;

  &lt;ol&gt;
    &lt;li&gt;Write commit record to log&lt;/li&gt;
    &lt;li&gt;Sync log file to disk&lt;/li&gt;
  &lt;/ol&gt;
&lt;/ol&gt;
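
A sketch of step 6 of the parallel variant, in Python, might look like this. It assumes a Unix-like system (for &lt;code&gt;os.pwrite&lt;/code&gt;), uses a temporary file in place of the binlog, and omits the InnoDB steps. The class &lt;code&gt;ParallelBinlog&lt;/code&gt; and the function &lt;code&gt;run_demo&lt;/code&gt; are hypothetical names for this example.

```python
import os
import tempfile
import threading


class ParallelBinlog:
    def __init__(self, fd):
        self.fd = fd
        self.cond = threading.Condition()
        self.next_available = 0
        self.last_complete = 0

    def reserve(self, size):
        with self.cond:
            pos = self.next_available
            self.next_available += size
            return pos

    def log(self, pos, payload):
        # Step 6a: no lock needed, every thread owns a distinct byte range.
        os.pwrite(self.fd, payload, pos)
        with self.cond:
            # Step 6b: wait for the preceding transactions to complete.
            while self.last_complete != pos:
                self.cond.wait()
            # Steps 6c and 6d: advance Last_Complete and wake the waiters.
            self.last_complete = pos + len(payload)
            self.cond.notify_all()


def run_demo():
    payloads = [b"trx-A;", b"trx-B;", b"trx-C;"]
    with tempfile.TemporaryFile() as f:
        binlog = ParallelBinlog(f.fileno())
        positions = [binlog.reserve(len(p)) for p in payloads]
        threads = [threading.Thread(target=binlog.log, args=(pos, p))
                   for pos, p in zip(positions, payloads)]
        for t in reversed(threads):  # later transactions may write first
            t.start()
        for t in threads:
            t.join()
        os.fsync(f.fileno())         # step 6e, done once here for the group
        return os.pread(f.fileno(), binlog.last_complete, 0)


print(run_demo())  # b'trx-A;trx-B;trx-C;'
```

The writes can happen in any order, but &lt;code&gt;Last_Complete&lt;/code&gt; still advances strictly in reservation order.&lt;p&gt;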

This new algorithm has some advantages, but there are a few things to note:

&lt;ul&gt;

  &lt;li&gt;When a transaction is committed, it is guaranteed that
  &lt;code&gt;Trans_Pos&lt;/code&gt; &amp;ge; &lt;code&gt;Last_Committed&lt;/code&gt; for all
  threads (recall that &lt;code&gt;Trans_Pos&lt;/code&gt; is a thread-local
  variable).&lt;/li&gt;

  &lt;li&gt;Writes are done in parallel, but waiting for the condition
  in step 6b still requires a broadcast to wake up all waiting
  threads, while only one will be allowed to proceed. This means that
  we still have the &lt;span class="math"&gt;O(N&lt;sup&gt;2&lt;/sup&gt;)&lt;/span&gt;
  complexity of the sequential algorithm.  However, for the parallel
  algorithm it is possible to improve the performance significantly,
  which we will demonstrate in the third part where we will discuss
  optimizations to the algorithms.&lt;/li&gt;

  &lt;li&gt;Recovery in the sequential algorithm is comparatively simple since
  there are no partially written transactions. If you consider that a
  crash can occur in the situation described in Figure&amp;nbsp;2, it is
  necessary to devise a method for correctly recovering. This we will
  discuss in the second part of these posts.&lt;/li&gt;

&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-7498942908461072533?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/7498942908461072533/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=7498942908461072533' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/7498942908461072533'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/7498942908461072533'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html' title='Binary Log Group Commit - An Implementation Proposal'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_X_imutbSFuE/S9pqgibJrUI/AAAAAAAAAGM/vPJNJc0PFuM/s72-c/binlog-write-simple.png' height='72' width='72'/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-1743671916737341505</id><published>2010-04-13T20:07:00.003+02:00</published><updated>2010-04-13T20:45:25.431+02:00</updated><title type='text'>MySQL Conference Replication tutorial: Article and Demo Software</title><content type='html'>The MySQL Conference and Expo started with me and Lars Thalmann doing the &lt;a href="http://en.oreilly.com/mysql2010/user/proposal/status/12432"&gt;replication tutorial&lt;/a&gt;. 
Unfortunately, we cannot at this time distribute the slides (please watch the &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/detail/12432"&gt;replication tutorial page at the conference site&lt;/a&gt;), but there is a replication tutorial package for easy setup of servers to play around with&amp;mdash;including some sample scripts&amp;mdash;and a paper that both explains how the package can be used and gives some example setups.

&lt;ul&gt;
&lt;li&gt;
  The software package can be &lt;a href="http://forge.mysql.com/w/images/8/85/Reptut-scripts.zip"&gt;downloaded from the forge&lt;/a&gt; and requires Perl at least version 5.6.0 to execute.
&lt;/li&gt;
&lt;li&gt;The article can also be &lt;a href="http://www.kindahl.net/mats/ReplicationTutorialBooklet.pdf"&gt;downloaded from my site&lt;/a&gt; (PDF).&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-1743671916737341505?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/1743671916737341505/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=1743671916737341505' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1743671916737341505'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1743671916737341505'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2010/04/mysql-conference-replication-tutorial.html' title='MySQL Conference Replication tutorial: Article and Demo Software'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-9207672279467469732</id><published>2010-03-05T14:10:00.002+01:00</published><updated>2010-03-05T14:39:45.378+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conference'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='tutorial'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='expo'/><category scheme='http://www.blogger.com/atom/ns#' term='binary log'/><title type='text'>Going to the O'Reilly MySQL Conference &amp; Expo</title><content type='html'>&lt;p&gt;As I've been doing the last 
couple of years, I will be going to the
&lt;a href=""&gt;O'Reilly MySQL Conference &amp;amp; Expo&lt;/a&gt;. In addition to
the tutorial and the replication sessions that I will be holding
together with &lt;a
href="http://en.oreilly.com/mysql2010/public/schedule/speaker/3109"&gt;Lars&lt;/a&gt;,
I will be holding a session about the binary log together with &lt;a
href="http://en.oreilly.com/mysql2010/public/schedule/speaker/162"&gt;Chuck&lt;/a&gt;
from the Backup team which the Replication team normally works very
closely with.&lt;/p&gt;

&lt;p&gt;This year, O'Reilly also has a &lt;em&gt;Friend of the Speaker&lt;/em&gt;
discount of 25% that you can use when you register using the code
&lt;code&gt;mys10fsp&lt;/code&gt;.&lt;/p&gt;

The sessions that we are going to hold are listed below. Note that I
am using &lt;a href="http://www.microformats.org"&gt;Microformats&lt;/a&gt;, which will
allow you to easily extract and add the events to your calendar using,
for example, the &lt;a
href="https://addons.mozilla.org/en-US/firefox/addon/4106"&gt;Operator&lt;/a&gt;
plugin for Firefox.

&lt;p&gt;See you there!&lt;/p&gt;

&lt;dl&gt;
  &lt;div class="vevent"&gt;
  &lt;dt class="summary"&gt;&lt;a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/13476"&gt;Mysteries of the Binary Log&lt;/a&gt;
  &lt;dd&gt;
    &lt;abbr title="2010-04-14T10:50-07:00" class="dtstart"&gt;April 14th, 2010 10:50am&lt;/abbr&gt; -
    &lt;abbr title="2010-04-14T11:50-07:00" class="dtend"&gt;11:50am&lt;/abbr&gt;
    Room: &lt;span class="location"&gt;Ballroom F&lt;/span&gt;
  &lt;/div&gt;
    
  &lt;div class="vevent"&gt;
  &lt;dt class="summary"&gt;&lt;a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12451"&gt;New Replication Features&lt;/a&gt;
  &lt;dd&gt;
    &lt;abbr title="2010-04-13T14:00-07:00" class="dtstart"&gt;April 13th, 2010 2:00pm&lt;/abbr&gt; -
    &lt;abbr title="2010-04-13T15:00-07:00" class="dtend"&gt;3:00pm&lt;/abbr&gt;
    Room: &lt;span class="location"&gt;Ballroom A&lt;/span&gt;
  &lt;/div&gt;

  &lt;div class="vevent"&gt;
  &lt;dt class="summary"&gt;&lt;a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12444"&gt;Replication Tricks &amp;amp; Tips&lt;/a&gt;
  &lt;dd&gt;
    &lt;abbr title="2010-04-14T14:00-07:00" class="dtstart"&gt;April 14th, 2010 2:00pm&lt;/abbr&gt; -
    &lt;abbr title="2010-04-14T15:00-07:00" class="dtend"&gt;3:00pm&lt;/abbr&gt;
    Room: &lt;span class="location"&gt;Ballroom B&lt;/span&gt;
  &lt;/div&gt;
  
  &lt;div class="vevent"&gt;
  &lt;dt class="summary"&gt;&lt;a class="url" href="http://en.oreilly.com/mysql2010/public/schedule/detail/12432"&gt;The Replication Tutorial&lt;/a&gt;
  &lt;dd&gt;
    &lt;abbr title="2010-04-12T08:30-07:00" class="dtstart"&gt;April 12th, 2010 8:30am&lt;/abbr&gt; -
    &lt;abbr title="2010-04-12T12:00-07:00" class="dtend"&gt;12:00pm&lt;/abbr&gt;
    Room: &lt;span class="location"&gt;Ballroom E&lt;/span&gt;
  &lt;/div&gt;
&lt;/dl&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-9207672279467469732?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/9207672279467469732/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=9207672279467469732' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/9207672279467469732'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/9207672279467469732'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2010/03/going-to-oreilly-mysql-conference-expo.html' title='Going to the O&apos;Reilly MySQL Conference &amp;amp; Expo'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-4483668956482499738</id><published>2010-02-03T21:19:00.003+01:00</published><updated>2010-02-03T21:37:24.008+01:00</updated><title type='text'>MySQL Replicant: Architecture</title><content type='html'>&lt;table id="class-design" class="figure" noborder&gt;
  &lt;caption align="bottom"&gt;MySQL Replicant Library&lt;br&gt;Class Design&lt;/caption&gt;
  &lt;tr&gt;&lt;td&gt;&lt;a href="http://4.bp.blogspot.com/_X_imutbSFuE/S2ncxglilYI/AAAAAAAAAC8/d_8QKtKW2WY/s1600-h/class-diagram.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 223px;" src="http://4.bp.blogspot.com/_X_imutbSFuE/S2ncxglilYI/AAAAAAAAAC8/d_8QKtKW2WY/s320/class-diagram.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5434117168518305154" /&gt;&lt;/a&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

In the &lt;a
href="/2009/12/mysql-replicant-library-for-controlling.html"&gt;previous
post&lt;/a&gt; I described the first steps of a Python library for
controlling the replication of large installations. The intention of
the library is to provide a uniform interface to such installations,
so that procedures for handling various situations can be written in
a uniform language.&lt;p&gt;

For the library to be useful, it is necessary to support installations
that use different operating systems for the machines, as well as
different versions of the servers. Specifically, it is necessary to
allow some aspects of the system to vary.

&lt;ul&gt;

  &lt;li&gt;&lt;p&gt;Depending on the operating system, or even just how the server
  is installed on the machine, the procedures for bringing the server
  down and up will differ.&lt;/p&gt;&lt;/li&gt;

  &lt;li&gt;&lt;p&gt;Configurations are managed in different ways depending on
  the deployment, and there are various other tools for managing the
  configurations of large systems.&lt;/p&gt;

  &lt;p&gt;As part of the management of the topology, it is necessary to
  change the configuration files, but this should play well with
  other tools.&lt;/p&gt;

  &lt;p&gt;In either case, any specific method for configuration handling
  should neither be required nor enforced.&lt;/p&gt;&lt;/li&gt;

  &lt;li&gt;In the example in the &lt;a
  href="/2009/12/mysql-replicant-library-for-controlling.html"&gt;previous
  article&lt;/a&gt;, the technique for cloning a server was demonstrated. In
  this case the naive method of copying the database files was
  used. For the general case, however, &lt;em&gt;some&lt;/em&gt; backup method
  will be used, but it depends on the requirements of the
  deployment. In other words, it is necessary to parameterize the
  backup method as well.&lt;/li&gt;

  &lt;li&gt;&lt;p&gt;Each server in the system has a specific &lt;em&gt;role&lt;/em&gt; to
  fulfill. Some servers are final slaves whose only purpose is to
  answer queries, at least one server is a master, and some servers
  are relay servers.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;To allow the system to be parameterized on these aspects, a set of
abstract classes is introduced. In the figure you can see a UML
diagram describing the high-level architecture of the Replicant
library.&lt;/p&gt;

In the figure, there are four abstract classes:

&lt;dl&gt;

  &lt;dt&gt;&lt;code&gt;Machine&lt;/code&gt;

  &lt;dd&gt;The responsibility of this class is to handle all issues that
  are specific to the remote operating system, for example, to fetch
  files or issue commands to start and stop the server.

  &lt;dt&gt;&lt;code&gt;Config&lt;/code&gt;

  &lt;dd&gt;The responsibility of this class is to maintain the
  configuration of a server. To do this, it may need to parse
  configuration files to be able to extract the specific section
  containing the definition.

  &lt;dt&gt;&lt;code&gt;BackupMethod&lt;/code&gt;

  &lt;dd&gt;The responsibility of this class is to provide the primitives to
  create a backup and restore a backup. The class supports taking a
  backup, potentially placing the backup image on a different machine,
  and restoring it.

  &lt;dt&gt;&lt;code&gt;Role&lt;/code&gt;

  &lt;dd&gt;The responsibility of this class is to provide all the
  information necessary to configure a server in a role. Since a
  role entails not only pure configuration information, but can
  also involve keeping certain tables and other database objects
  available, this is modeled as a separate class.

&lt;/dl&gt;

The central &lt;code&gt;Server&lt;/code&gt; class relies on a &lt;code&gt;Machine&lt;/code&gt;
instance and a &lt;code&gt;Config&lt;/code&gt; instance to implement the interface
to the machine and to the configuration, respectively.&lt;p&gt;
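
&lt;p&gt;To make the division of responsibilities concrete, here is a
minimal sketch of how the four abstract classes and the
&lt;code&gt;Server&lt;/code&gt; class could be laid out in Python. It is not the
library's actual code: the method names &lt;code&gt;start_command&lt;/code&gt;,
&lt;code&gt;stop_command&lt;/code&gt;, and &lt;code&gt;imbue&lt;/code&gt;, as well as the
constructor arguments, are assumptions made for illustration.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
from abc import ABC, abstractmethod


class Machine(ABC):
    """Operating-system specifics: how to start and stop a server."""

    @abstractmethod
    def start_command(self):
        """Return the command (as a list) that starts the server."""

    @abstractmethod
    def stop_command(self):
        """Return the command (as a list) that stops the server."""


class Config(ABC):
    """Configuration of one server."""

    @abstractmethod
    def get(self, option):
        """Return the value of option."""


class BackupMethod(ABC):
    """Primitives for creating and restoring a backup."""

    @abstractmethod
    def backup_to(self, server, url):
        """Back up server to url."""

    @abstractmethod
    def restore_from(self, server, url):
        """Restore the image at url into server."""


class Role(ABC):
    """Everything needed to configure a server for one role."""

    @abstractmethod
    def imbue(self, server):
        """Configure server to play this role."""


class Server:
    """Relies on a Machine and a Config for OS and configuration access."""

    def __init__(self, host, machine, config):
        self.host = host
        self.machine = machine
        self.config = config
```
&lt;/pre&gt;

&lt;p&gt;Because the four classes are abstract, a deployment only has to
supply concrete subclasses for the parts that actually vary.&lt;/p&gt;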

&lt;h2&gt;Configuration Management&lt;/h2&gt;

&lt;p&gt;The configuration of the server is made part of the Replicant library
since manipulating the server configuration is usually necessary when
changing roles of servers.&lt;/p&gt;

&lt;p&gt;Depending on the deployment, some sites use configuration managers
  such as &lt;a href="http://www.gnu.org/software/cfengine/"&gt;cfengine&lt;/a&gt; or &lt;a
  href="http://reductivelabs.com/trac/puppet"&gt;puppet&lt;/a&gt; to
  administer the configuration of all servers, while others hand-edit
  the configuration files (which is only practical for small
  installations, since it would be a pain to administer larger
  deployments in this way).&lt;/p&gt;

&lt;p&gt;Long-term, there should be support for some safety measures when
working with server configurations, so implementing an interface for
handling server configurations in a safe, transaction-like manner (or
maybe this should be called an RCU-style manner) seems like a good
idea. To support that, the following methods to fetch and replace
configurations are introduced.&lt;/p&gt;

&lt;dl&gt;
  &lt;dt class="func"&gt;Server.fetch_config()
  &lt;dd&gt;Returns a &lt;code&gt;Config&lt;/code&gt; instance of the configuration for
  the server.

  &lt;dt class="func"&gt;Server.replace_config(&lt;var&gt;config&lt;/var&gt;)
  &lt;dd&gt;Replace the configuration of the server with the modified
    configuration instance &lt;var&gt;config&lt;/var&gt;.
&lt;/dl&gt;

&lt;p&gt;This will allow an implementation to keep version numbers around to
avoid conflicts, but is not required by the interface.&lt;/p&gt;
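
&lt;p&gt;As an illustration of the version-number idea, the sketch below
shows one way an implementation could reject a stale configuration.
The &lt;code&gt;VersionedStore&lt;/code&gt; class and &lt;code&gt;ConflictError&lt;/code&gt;
are hypothetical stand-ins working on an in-memory dictionary; nothing
here is required by the interface or taken from the library.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
import copy


class ConflictError(Exception):
    """Raised when the stored configuration changed after it was fetched."""


class VersionedStore:
    """Hypothetical in-memory stand-in for a server's configuration file."""

    def __init__(self, options):
        self._options = dict(options)
        self._version = 0

    def fetch_config(self):
        # Hand out a private copy tagged with the version it was read at.
        return {"version": self._version,
                "options": copy.deepcopy(self._options)}

    def replace_config(self, config):
        # Refuse the update if somebody else replaced the configuration
        # after this copy was fetched.
        if config["version"] != self._version:
            raise ConflictError("configuration changed since it was fetched")
        self._options = dict(config["options"])
        self._version += 1


store = VersionedStore({"server-id": "1"})
first = store.fetch_config()
second = store.fetch_config()
first["options"]["log-bin"] = "master-bin"
store.replace_config(first)          # succeeds: version still matches
try:
    store.replace_config(second)     # stale copy: rejected
    stale_rejected = False
except ConflictError:
    stale_rejected = True
```
&lt;/pre&gt;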

&lt;p&gt;Each &lt;code&gt;Config&lt;/code&gt; instance can then be manipulated by using
the following methods:&lt;/p&gt;

&lt;dl&gt;
  &lt;dt class="func"&gt;Config.get(&lt;var&gt;option&lt;/var&gt;)
  &lt;dd&gt;Get the value of &lt;var&gt;option&lt;/var&gt; as a string.

  &lt;dt class="func"&gt;Config.set(&lt;var&gt;option&lt;/var&gt;[, &lt;var&gt;value&lt;/var&gt;])
  &lt;dd&gt;Set the value of &lt;var&gt;option&lt;/var&gt; to &lt;var&gt;value&lt;/var&gt;. If no
  &lt;var&gt;value&lt;/var&gt; is supplied, &lt;code&gt;None&lt;/code&gt; is used, which
  denotes that the option is set but not given a specific string
  value.

  &lt;dt class="func"&gt;Config.remove(&lt;var&gt;option&lt;/var&gt;)
  &lt;dd&gt;Remove the &lt;var&gt;option&lt;/var&gt; from the configuration instance
  entirely.
&lt;/dl&gt;

So, for example, the &lt;var&gt;log-bin&lt;/var&gt; option can be set in the
following manner:

&lt;pre class="code"&gt;
config = server.fetch_config()
config.set('log-bin', 'master-bin')
server.replace_config(config)
&lt;/pre&gt;
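
&lt;p&gt;To show how the three methods fit together, here is a small sketch
of a dictionary-backed configuration class. The name
&lt;code&gt;DictConfig&lt;/code&gt; and the &lt;code&gt;lines&lt;/code&gt; helper are
assumptions made for this example and are not part of the library; the
point is only how an option set to &lt;code&gt;None&lt;/code&gt; differs from one
with a string value.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
class DictConfig:
    """Hypothetical Config backed by a plain dictionary."""

    def __init__(self, options=None):
        self._options = dict(options or {})

    def get(self, option):
        return self._options[option]

    def set(self, option, value=None):
        # None means "option present, but without a value".
        self._options[option] = value

    def remove(self, option):
        self._options.pop(option, None)

    def lines(self):
        """Render the options as they would appear in a my.cnf section."""
        result = []
        for option, value in sorted(self._options.items()):
            if value is None:
                result.append(option)
            else:
                result.append("%s = %s" % (option, value))
        return result


config = DictConfig({"server-id": "1"})
config.set("log-bin", "master-bin")
config.set("skip-slave-start")       # set, but no string value
config.remove("server-id")
print(config.lines())
```
&lt;/pre&gt;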

&lt;h2&gt;Machines&lt;/h2&gt;

&lt;p&gt;A MySQL server can run on many different machines and in many
setups. A server can run on Linux, Solaris, or Windows, and even in
those cases, there can be multiple servers on a single machine.&lt;/p&gt;

&lt;p&gt;For a Linux machine with a single server, one usually uses the
script &lt;file&gt;/etc/init.d/mysql&lt;/file&gt; to start and stop the
server&amp;mdash;at least on my Ubuntu&amp;mdash;but if multiple servers are
used on a single machine, then &lt;file&gt;mysqld_multi&lt;/file&gt; should be
used instead.&lt;/p&gt;

&lt;p&gt;For Windows and Solaris, the procedures for starting and stopping
servers are entirely different. Windows starts and stops the servers
using &lt;code&gt;net start MySQL&lt;/code&gt; and &lt;code&gt;net stop MySQL&lt;/code&gt;,
while Solaris uses &lt;em&gt;&lt;a
href="http://docs.sun.com/app/docs/doc/819-2240/svcadm-1m?a=view"&gt;svcadm(1M)&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;To parameterize the system over the various ways it can be
installed, the concept of a &lt;em&gt;Machine&lt;/em&gt; is introduced (I actually
had problems figuring out a name for this, but this was suggested to
me and seems to be good enough).&lt;/p&gt;

&lt;p&gt;The responsibility of the &lt;code&gt;Machine&lt;/code&gt; class is to provide
an interface to access the installed server together with installation
information such as the location of configuration files.&lt;/p&gt;
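
&lt;p&gt;A rough sketch of what such machine classes might look like is
given below. The class layout and the method names
&lt;code&gt;start_command&lt;/code&gt; and &lt;code&gt;stop_command&lt;/code&gt; are
assumptions for illustration, and the Solaris service name
&lt;code&gt;mysql&lt;/code&gt; is a guess that depends on how the server was
installed. The commands are only returned as lists, not executed.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
class Machine:
    """Base class; subclasses return start/stop commands as argument lists."""

    def start_command(self):
        raise NotImplementedError

    def stop_command(self):
        raise NotImplementedError


class Linux(Machine):
    def start_command(self):
        return ["/etc/init.d/mysql", "start"]

    def stop_command(self):
        return ["/etc/init.d/mysql", "stop"]


class Windows(Machine):
    def start_command(self):
        return ["net", "start", "MySQL"]

    def stop_command(self):
        return ["net", "stop", "MySQL"]


class Solaris(Machine):
    # "mysql" as the service name is an assumption; the actual
    # service identifier depends on how the server was installed.
    def start_command(self):
        return ["svcadm", "enable", "mysql"]

    def stop_command(self):
        return ["svcadm", "disable", "mysql"]


def restart(machine):
    """Return the commands needed to restart a server on machine."""
    return [machine.stop_command(), machine.start_command()]
```
&lt;/pre&gt;

&lt;p&gt;Code such as &lt;code&gt;restart&lt;/code&gt; then works unchanged whichever
machine class is passed in.&lt;/p&gt;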

&lt;h2&gt;BackupMethod&lt;/h2&gt;

&lt;p&gt;One of the more important techniques when managing a set of servers is
the ability to clone a slave or a master to create new slaves. Cloning
involves taking a backup of a server and then restoring the backup
image on the new slave. Since the techniques for taking backups vary
a lot and different techniques will be used in different situations,
parameterizing over the various backup methods is sensible.&lt;/p&gt;

&lt;dl&gt;
  &lt;dt class="func"&gt;BackupMethod.backup_to(&lt;var&gt;server&lt;/var&gt;, &lt;var&gt;url&lt;/var&gt;)

  &lt;dd&gt;This method will take a backup of &lt;var&gt;server&lt;/var&gt; and store it
  at the location indicated by &lt;var&gt;url&lt;/var&gt;.

  &lt;dt class="func"&gt;BackupMethod.restore_from(&lt;var&gt;server&lt;/var&gt;, &lt;var&gt;url&lt;/var&gt;)

  &lt;dd&gt;This method will restore the backup image indicated by
  &lt;var&gt;url&lt;/var&gt; into &lt;var&gt;server&lt;/var&gt;.

&lt;/dl&gt;
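
&lt;p&gt;With these two methods, cloning can be written once and reused
with any backup technique. The sketch below uses a hypothetical
&lt;code&gt;RecordingBackup&lt;/code&gt; class that only records the calls it
receives, and plain host-name strings in place of real server objects;
the &lt;code&gt;clone&lt;/code&gt; function and the URL are likewise made up for
the example.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
class RecordingBackup:
    """Hypothetical BackupMethod that only records what it was asked to do."""

    def __init__(self):
        self.log = []

    def backup_to(self, server, url):
        self.log.append(("backup", server, url))

    def restore_from(self, server, url):
        self.log.append(("restore", server, url))


def clone(master, slave, method, url):
    """Clone master onto slave using whatever backup method is supplied."""
    method.backup_to(master, url)
    method.restore_from(slave, url)


method = RecordingBackup()
clone("master.example.com", "slave.example.com", method,
      "file:///tmp/master-backup.tar.gz")
```
&lt;/pre&gt;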

&lt;h2&gt;Role&lt;/h2&gt;

&lt;p&gt;In a deployment, each server is configured to play a specific
&lt;em&gt;role&lt;/em&gt;. It can either be acting as a master, a slave, or even a
relay. To represent a role, a separate &lt;code&gt;Role&lt;/code&gt; class is
introduced. Once a role is created, a server can be &lt;em&gt;imbued&lt;/em&gt;
with it.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Not every server has an assigned role.&lt;/li&gt;
  &lt;li&gt;Each server can have only a single role.&lt;/li&gt;
  &lt;li&gt;Each role can be assigned to multiple servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since a role may encompass much more than just setting some
configuration parameters, this more flexible approach was chosen.
When imbuing a server with a role, a piece of Python code is executed
to configure the server correctly.&lt;/p&gt;
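
&lt;p&gt;As a sketch of what imbuing could look like, the example below
defines a hypothetical &lt;code&gt;Master&lt;/code&gt; role that both sets a
configuration option and creates a database object. The
&lt;code&gt;FakeServer&lt;/code&gt; stand-in, the &lt;code&gt;imbue&lt;/code&gt; method name,
and the account name are assumptions for illustration; the statement
is recorded rather than sent to a server.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
class FakeServer:
    """Stand-in server: a dictionary of options and a list of SQL sent."""

    def __init__(self):
        self.options = {}
        self.statements = []

    def sql(self, statement):
        self.statements.append(statement)


class Master:
    """Hypothetical role: enable the binary log and add a replication user."""

    def __init__(self, repl_user):
        self.repl_user = repl_user

    def imbue(self, server):
        # Configuration part of the role.
        server.options["log-bin"] = "master-bin"
        # Non-configuration part: an account the role needs on the server.
        server.sql("GRANT REPLICATION SLAVE ON *.* TO '{0}'@'%'"
                   .format(self.repl_user))


role = Master("repl_user")
servers = [FakeServer(), FakeServer()]
for server in servers:          # one role can be imbued into several servers
    role.imbue(server)
```
&lt;/pre&gt;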

&lt;p&gt;The use of roles in this case is just one of many choices, and
with this approach there are two different ways that roles can be
used. I am slightly undecided between the two and would like to hear
comments on which one to use.&lt;/p&gt;

&lt;ol&gt;

  &lt;li&gt;Roles are applied only to the initial deployment and do not
  play any part after the system has been deployed. Roles are imbued
  into a server initially, and then the configuration of the server
  can be changed by procedures to manipulate the deployment.&lt;/li&gt;

  &lt;li&gt;Roles exist throughout the deployment, and when a server changes
  roles in the deployment, the Role instance will also change. Every
  server is assigned a role in the system, which is represented using
  a subclass of the &lt;code&gt;Role&lt;/code&gt; class.&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;The first is by far the easiest to implement, which is why I chose
it for now. Since the roles are just containers for
configuration options and other items that need to be added, they are
easy to write. Since this is what is used in the library currently, it
is also what you see in the class design above.&lt;/p&gt;

&lt;p&gt;The second approach seems better, but it has a number of
consequences:&lt;/p&gt;

&lt;ul&gt;

  &lt;li&gt;Every server has to have a role class associated with it; even
  an "initial" role is required.&lt;/li&gt;

  &lt;li&gt;If the role changes, another role class will be associated with
  it. This forces the role class to not only be able to imbue a server
  in a role, but to also &lt;em&gt;unimbue&lt;/em&gt; the server from that
  role.&lt;/li&gt;

  &lt;li&gt;It must not be possible to change the configuration of a server
  directly; any change has to take the form of defining a role and then
  changing the server to that role. Unimbuing the server from a role
  becomes very hard if the configuration of the server is changed
  outside the control of the role.&lt;/li&gt;

&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-4483668956482499738?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/4483668956482499738/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=4483668956482499738' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/4483668956482499738'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/4483668956482499738'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2010/02/mysql-replicant-library-class-design-in.html' title='MySQL Replicant: Architecture'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_X_imutbSFuE/S2ncxglilYI/AAAAAAAAAC8/d_8QKtKW2WY/s72-c/class-diagram.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-804395234230368121</id><published>2009-12-18T17:12:00.005+01:00</published><updated>2009-12-28T10:11:14.828+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql replicant'/><title type='text'>MySQL Replicant: a library for controlling replication deployments</title><content 
type='html'>Keeping a MySQL installation up and running can be quite tricky at
times, especially when having many servers to manage and monitor. In
the replication tutorials at the annual MySQL Users' Conference, we
demonstrate how to set up replication appropriately and also how to
handle various issues that can arise.

Many of these procedures are routine: bring down the server, edit the
configuration file, bring the server up again, start a
&lt;code&gt;mysql&lt;/code&gt; client and add a user, etc.&lt;p&gt;

It has always annoyed me that these procedures are perfect candidates
for automation, but that we do not have the necessary interfaces to
manipulate an entire installation of MySQL servers.&lt;p&gt;

If there were an interface with a relatively small set of primitives
(re-directing servers, bringing servers down, adding a line to the
configuration file, etc.), it would be possible to create pre-canned
procedures that can just be executed.&lt;p&gt;

To that end, I started writing a library that would provide an
interface like this. Although I am more familiar with Perl, I picked
Python for this project, since it seems to be widely used by
database administrators (it's just a feeling I have; I have no figures
to support it). Just to have a cool name for the library, we call it
&lt;em&gt;MySQL Replicant&lt;/em&gt;, and it is (of course) &lt;a href="https://launchpad.net/mysql-replicant-python"&gt;available at
Launchpad&lt;/a&gt;.&lt;p&gt;

So what do we want to achieve with having a library like this?
Well... the goal is to provide a generic interface to complete
installations and thereby make administration of large installations
easy.&lt;p&gt;

Providing such an interface allows procedures to be described in an
executable format, namely as Python scripts.&lt;p&gt;

In
addition to making it easy to implement common tasks for experienced
database administrators, it also promotes sharing by providing a way
to write complete scripts for solving common problems. Having a pool
of such scripts makes it easier for newcomers to get up and
running.&lt;p&gt;

&lt;a href="http://2.bp.blogspot.com/_X_imutbSFuE/Szh2BecODfI/AAAAAAAAAC0/2LlYm1gsbeI/s1600-h/topology-with-model.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 289px;" src="http://2.bp.blogspot.com/_X_imutbSFuE/Szh2BecODfI/AAAAAAAAAC0/2LlYm1gsbeI/s320/topology-with-model.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5420211919263763954" /&gt;&lt;/a&gt;

The basic idea is that you create a model of the installation on a
computer and then manipulate the model. When doing these
manipulations, the appropriate commands&amp;mdash;either as SQL commands
to a running server or shell commands to the host where the server is
running&amp;mdash;will then be sent to the servers in the installation to
configure them correctly. &lt;p&gt;

So, to take a small example, how does the code for re-directing a bunch
of servers to a master look?

&lt;pre class="code"&gt;
import mysqlrep, my_servers

# Point every slave in the topology at the master.
for slave in my_servers.slaves:
   mysqlrep.change_master(slave, my_servers.master)
&lt;/pre&gt;

In this case, the installation is defined in a separate file and is
imported as a Python module.  Right now, the interface for specifying
a topology is quite rough, but this is going to change.

&lt;pre class="code"&gt;
from mysqlrep import Server, User, Linux

servers = [Server(server_id=1, host="server1.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux()),
           Server(server_id=2, host="server2.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux()),
           Server(server_id=3, host="server3.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux()),
           Server(server_id=4, host="server4.example.com",
                  sql_user=User("mysql_replicant", "xyzzy"),
                  ssh_user=User("mysql_replicant"),
                  machine=Linux())]
master = servers[0]
slaves = servers[1:]
&lt;/pre&gt;
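
&lt;p&gt;Until the interface improves, the repetition in a file like this
can be reduced with an ordinary list comprehension. The sketch below
uses &lt;code&gt;namedtuple&lt;/code&gt; stand-ins for &lt;code&gt;Server&lt;/code&gt; and
&lt;code&gt;User&lt;/code&gt; so that it runs without the library installed; the
&lt;code&gt;make_servers&lt;/code&gt; helper is hypothetical and not part of
&lt;code&gt;mysqlrep&lt;/code&gt;.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
from collections import namedtuple

# Stand-ins so the sketch runs without the library installed; the real
# classes live in mysqlrep and do considerably more.
User = namedtuple("User", ["name", "password"])
Server = namedtuple("Server",
                    ["server_id", "host", "sql_user", "ssh_user", "machine"])


def make_servers(count):
    """Build count servers that differ only in server ID and host name."""
    return [Server(server_id=n,
                   host="server%d.example.com" % n,
                   sql_user=User("mysql_replicant", "xyzzy"),
                   ssh_user=User("mysql_replicant", None),
                   machine="linux")
            for n in range(1, count + 1)]


servers = make_servers(4)
master = servers[0]
slaves = servers[1:]
```
&lt;/pre&gt;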

Here, the &lt;code&gt;Server&lt;/code&gt; class represents a server, and to be able
to do its job, it is necessary to have one MySQL account on the
server and one shell account on the host machine. Right now, it is
also necessary to specify the server ID, but the plan is to just
require the host, port, socket, SQL account name, and SSH account
information. The remaining information can then be fetched from the
configuration file of the server. Each server has a small set of
primitives on top of which everything else is built:

&lt;dl&gt;

  &lt;dt&gt;&lt;code&gt;Server.sql(&lt;em&gt;SQL command&lt;/em&gt;)&lt;/code&gt;

  &lt;dd&gt;Execute the SQL command and return a result set.

  &lt;dt&gt;&lt;code&gt;Server.ssh(&lt;em&gt;command list&lt;/em&gt;)&lt;/code&gt;

  &lt;dd&gt;Execute the command given by the command list and return an
  iterator to the result output.

  &lt;dt&gt;&lt;code&gt;Server.start()&lt;/code&gt;

  &lt;dd&gt;Start the server.

  &lt;dt&gt;&lt;code&gt;Server.stop()&lt;/code&gt;

  &lt;dd&gt;Stop the server.

&lt;/dl&gt;

There is a small set of commands defined on top of these primitives
that can be used. Here is a list of just a few of them, but there are
some more in the library at Launchpad.

&lt;dl&gt;

  &lt;dt&gt;&lt;code&gt;change_master(slave, master, position=None)&lt;/code&gt;

  &lt;dd&gt;Change the master of &lt;code&gt;slave&lt;/code&gt; to be
  &lt;code&gt;master&lt;/code&gt; and start replicating from
  &lt;code&gt;position&lt;/code&gt;.

  &lt;dt&gt;&lt;code&gt;fetch_master_pos(server)&lt;/code&gt;

  &lt;dd&gt;Fetch the master position of &lt;code&gt;server&lt;/code&gt;, which is the
  position where the last executed statement ends in the binary log.
    
  &lt;dt&gt;&lt;code&gt;fetch_slave_pos(server)&lt;/code&gt;

  &lt;dd&gt;Fetch the slave position of &lt;code&gt;server&lt;/code&gt;, which is the
  position where the last executed event ends.

  &lt;dt&gt;&lt;code&gt;flush_and_lock_database(server)&lt;/code&gt;

  &lt;dd&gt;Flush all tables on &lt;code&gt;server&lt;/code&gt; and lock the database
  for read.

  &lt;dt&gt;&lt;code&gt;unlock_database(server)&lt;/code&gt;

  &lt;dd&gt;Unlock a previously locked database.
    
&lt;/dl&gt;
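
&lt;p&gt;To give a feeling for how such commands can sit on top of
&lt;code&gt;Server.sql&lt;/code&gt;, here is a sketch that builds the SQL
statements and records them instead of executing them. It is not the
library's implementation: the &lt;code&gt;RecordingServer&lt;/code&gt; class is a
stand-in, and representing a position as a (file, offset) pair is an
assumption made for the example.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
class RecordingServer:
    """Stand-in for Server that records SQL instead of executing it."""

    def __init__(self, host, port=3306):
        self.host = host
        self.port = port
        self.statements = []

    def sql(self, statement):
        self.statements.append(statement)


def change_master(slave, master, position=None):
    """Point slave at master; position is a (file, offset) pair or None."""
    parts = ["MASTER_HOST='%s'" % master.host,
             "MASTER_PORT=%d" % master.port]
    if position is not None:
        log_file, log_pos = position
        parts.append("MASTER_LOG_FILE='%s'" % log_file)
        parts.append("MASTER_LOG_POS=%d" % log_pos)
    slave.sql("CHANGE MASTER TO " + ", ".join(parts))


def flush_and_lock_database(server):
    server.sql("FLUSH TABLES WITH READ LOCK")


def unlock_database(server):
    server.sql("UNLOCK TABLES")


master = RecordingServer("server1.example.com")
slave = RecordingServer("server2.example.com")
change_master(slave, master, ("master-bin.000001", 4))
print(slave.statements[-1])
```
&lt;/pre&gt;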

Using these primitives, it is easy to clone a master by executing the
code below. For this example, I use the quite naive method of backing
up a database by creating an archive of the database files and copying
them to the new slave.

&lt;pre class="code"&gt;
from subprocess import call
from mysqlrep import (change_master, fetch_master_pos,
                      flush_and_lock_database, start_replication,
                      unlock_database)

# master and slave are Server instances; backup_name is the archive path.
flush_and_lock_database(master)
position = fetch_master_pos(master)
master.ssh(["tar", "Pzcf", backup_name, "/usr/var/mysql"])
unlock_database(master)
call(["scp", master.host + ":" + backup_name, slave.host + ":."])
slave.stop()
slave.ssh(["tar", "Pzxf", backup_name, "/usr/var/mysql"])
slave.start()
change_master(slave, master, position)
start_replication(slave)
&lt;/pre&gt;

What do you think? Would this be a valuable project to pursue?

Here are some links related to this post:

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://mituzas.lt/2009/08/17/dba-python-scripts/"&gt;Domas's post on "MySQL DBA, python edition"&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://bitbucket.org/david415/mysql-cluster-tools/"&gt;Code from Spinn3r&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-804395234230368121?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/804395234230368121/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=804395234230368121' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/804395234230368121'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/804395234230368121'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2009/12/mysql-replicant-library-for-controlling.html' title='MySQL Replicant: a library for controlling replication deployments'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_X_imutbSFuE/Szh2BecODfI/AAAAAAAAAC0/2LlYm1gsbeI/s72-c/topology-with-model.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-8108146698981033750</id><published>2009-12-17T09:50:00.000+01:00</published><updated>2009-12-17T09:51:20.001+01:00</updated><title type='text'>Using mysqld_multi on Karmic</title><content type='html'>I wanted to set up several servers on my machine using the Ubuntu
distribution and control them using &lt;code&gt;mysqld_multi&lt;/code&gt;: the
typical way to manage several servers on your machine. However, I also
wanted to use MySQL 5.1 and not 5.0, which is the default on Jaunty
(Ubuntu 9.04).

About a month ago, I upgraded to Karmic Koala and one of the reasons
was that MySQL 5.1 is used by default. Even though I could install
the latest revision all the time, I usually want to use the real
distributions for my private projects for a number of reasons.&lt;p&gt;

I actually tried to upgrade to MySQL 5.1 on Ubuntu 9.04, but I
discovered that all kinds of applications had dependencies on MySQL
5.0, so I avoided upgrading at that time.&lt;p&gt;

Anyway, the procedure for installing multiple servers on the same
machine is this:

&lt;ol&gt;

  &lt;li&gt;&lt;strong&gt;Shut down the running server.&lt;/strong&gt;

  &lt;p&gt;This is, strictly speaking, not necessary unless you are going
  to edit the options for the running server, but I do this as a
  precaution.&lt;/li&gt;

  &lt;li&gt;&lt;strong&gt;Edit your &lt;file&gt;my.cnf&lt;/file&gt; configuration file and add
  sections for &lt;code&gt;mysqld_multi&lt;/code&gt; and the new servers.&lt;/strong&gt;

  &lt;p&gt;I wanted to add four servers to play with, not counting the one
  that is already installed and running, so I added sections
  &lt;file&gt;mysqld1&lt;/file&gt; to &lt;file&gt;mysqld4&lt;/file&gt;.  Also add a section
  for &lt;file&gt;mysqld_multi&lt;/file&gt;.&lt;/li&gt;

  &lt;li&gt;&lt;strong&gt;Create server directories and database files using
  &lt;file&gt;mysql_install_db&lt;/file&gt;&lt;/strong&gt;

  &lt;p&gt;The new servers need to be bootstrapped so that they have all
  the necessary databases and tables set up.

  &lt;li&gt;&lt;strong&gt;Optionally: install an &lt;file&gt;init.d&lt;/file&gt; script that uses  &lt;file&gt;mysqld_multi&lt;/file&gt;.&lt;/strong&gt;

  &lt;p&gt;This is currently not very well-supported in Debian (there is
  actually a comment saying that it is not supported), so I skipped
  this step. If you feel adventurous, you can always copy the
  &lt;file&gt;/usr/share/mysql/mysqld_multi.server&lt;/file&gt; as
  &lt;file&gt;/etc/init.d/mysql.server&lt;/file&gt; as they suggest in the file,
  but I will not do it, nor recommend it (because I haven't tried it).

  &lt;li&gt;&lt;strong&gt;Start the installed server(s).&lt;/strong&gt;
  &lt;p&gt;Well, not much to say here.
&lt;/ol&gt;

So, on my way, I edited the &lt;file&gt;/etc/mysql/my.cnf&lt;/file&gt; and added
the sections necessary. (You can see a diff of that &lt;a
href="#conf-changes"&gt;below&lt;/a&gt;.)&lt;p&gt;

The important options to add are &lt;em&gt;server-id&lt;/em&gt; so that each
server gets a unique server id (I'm going to replicate between them),
&lt;em&gt;port&lt;/em&gt; and &lt;em&gt;socket&lt;/em&gt; so that you can connect to each of
them both when you're on the local machine and from another machine,
and &lt;em&gt;pid-file&lt;/em&gt; to give each server a unique pid file name (this
is important, since the default will not work at all).&lt;p&gt;
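
&lt;p&gt;Since the sections differ only in a number, they can also be
generated instead of typed by hand. The following sketch uses Python's
standard &lt;code&gt;configparser&lt;/code&gt; module to produce the per-server
options discussed above; the &lt;code&gt;multi_sections&lt;/code&gt; helper and
its &lt;code&gt;first_port&lt;/code&gt; parameter are made up for this example,
and the paths simply follow the pattern I used on my machine.&lt;/p&gt;

&lt;pre class="code"&gt;
```python
import configparser
import io


def multi_sections(count, first_port=3307):
    """Return a ConfigParser with sections mysqld1 to mysqldN."""
    parser = configparser.ConfigParser()
    for n in range(1, count + 1):
        name = "mysqld%d" % n
        parser[name] = {
            "server-id": str(n),
            "pid-file": "/var/run/mysqld/%s.pid" % name,
            "socket": "/var/run/mysqld/%s.sock" % name,
            "port": str(first_port + n - 1),
            "datadir": "/var/lib/mysql%d" % n,
        }
    return parser


parser = multi_sections(4)
buffer = io.StringIO()
parser.write(buffer)        # the text to append to my.cnf
print(buffer.getvalue())
```
&lt;/pre&gt;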

Next step is to install the data directories for the servers, which
should be trivial:

&lt;pre class="code"&gt;
$ &lt;strong&gt;sudo mysql_install_db --user=mysql --datadir=/var/lib/mysqlfoo --basedir=/usr&lt;/strong&gt;
Installing MySQL system tables...
091120  9:40:23 [Warning] Can't create test file /var/lib/mysqlfoo/romeo.lower-test
091120  9:40:23 [Warning] Can't create test file /var/lib/mysqlfoo/romeo.lower-test
ERROR: 1005  Can't create table 'db' (errno: 13)
091120  9:40:23 [ERROR] Aborting

091120  9:40:23 [Warning] Forcing shutdown of 2 plugins
091120  9:40:23 [Note] /usr/sbin/mysqld: Shutdown complete


Installation of system tables failed!  Examine the logs in
/var/lib/mysqlfoo for more information.
    .
    .
    .
&lt;/pre&gt;

OK, the warning is a warning, but it seems I forgot the permissions on
the directory. Checking the write permissions, no
problems. Hmmm... checking that I can create the directories and files
manually as the &lt;code&gt;mysql&lt;/code&gt; user, no problems(!)&lt;p&gt;

What on earth is going on?&lt;p&gt;

After some digging around, I found &lt;a
href="https://bugs.launchpad.net/ubuntu/+source/mysql-dfsg-5.0/+bug/201799"&gt;bug
#201799&lt;/a&gt; which quite clearly explains that what I thought was a
permission problem is actually &lt;a
href="https://help.ubuntu.com/community/AppArmor"&gt;AppArmor&lt;/a&gt; doing
its job.&lt;p&gt;

So updating the AppArmor configuration file
&lt;file&gt;/etc/apparmor.d/usr.sbin.mysqld&lt;/file&gt; with this solved the
problem and I could get on with installing the servers.

&lt;pre class="code"&gt;
diff --git a/apparmor.d/usr.sbin.mysqld b/apparmor.d/usr.sbin.mysqld
index f9f1a37..7a94861 100644
--- a/apparmor.d/usr.sbin.mysqld
+++ b/apparmor.d/usr.sbin.mysqld
@@ -21,10 +25,20 @@
   /etc/mysql/my.cnf r,
   /usr/sbin/mysqld mr,
   /usr/share/mysql/** r,
   /var/log/mysql.log rw,
   /var/log/mysql.err rw,
+  /var/log/mysql[1-9].log rw,
+  /var/log/mysql[1-9].err rw,
   /var/lib/mysql/ r,
   /var/lib/mysql/** rwk,
+  /var/lib/mysql[1-9]/ r,
+  /var/lib/mysql[1-9]/** rwk,
   /var/log/mysql/ r,
   /var/log/mysql/* rw,
+  /var/log/mysql[1-9]/ r,
+  /var/log/mysql[1-9]/* rw,
   /var/run/mysqld/mysqld.pid w,
   /var/run/mysqld/mysqld.sock w,
+  /var/run/mysqld/mysqld[1-9].pid w,
+  /var/run/mysqld/mysqld[1-9].sock w,
 }
&lt;/pre&gt;

&lt;h2 id="conf-changes"&gt;Changes to &lt;file&gt;/etc/mysql/my.cnf&lt;/file&gt;&lt;/h2&gt;

Here is a unified diff of the changes I made to
&lt;file&gt;/etc/mysql/my.cnf&lt;/file&gt; to add some more servers.

&lt;pre class="code"&gt;
$ &lt;strong&gt;git diff mysql/my.cnf&lt;/strong&gt;
--- a/mysql/my.cnf
+++ b/mysql/my.cnf
@@ -111,7 +111,46 @@ max_binlog_size         = 100M
 # ssl-cert=/etc/mysql/server-cert.pem
 # ssl-key=/etc/mysql/server-key.pem
 
+[mysqld_multi]
+mysqld         = /usr/bin/mysqld_safe
+mysqladmin     = /usr/bin/mysqladmin
+user           = root
 
+[mysqld1]
+server-id      = 1
+pid-file = /var/run/mysqld/mysqld1.pid
+socket  = /var/run/mysqld/mysqld1.sock
+port  = 3307
+datadir = /var/lib/mysql1
+log-bin        = /var/lib/mysql1/mysqld1-bin.log
+log-bin-index  = /var/lib/mysql1/mysqld1-bin.index
+
+[mysqld2]
+server-id      = 2
+pid-file = /var/run/mysqld/mysqld2.pid
+socket  = /var/run/mysqld/mysqld2.sock
+port  = 3308
+datadir = /var/lib/mysql2
+log-bin        = /var/lib/mysql2/mysqld2-bin.log
+log-bin-index  = /var/lib/mysql2/mysqld2-bin.index
+
+[mysqld3]
+server-id      = 3
+pid-file = /var/run/mysqld/mysqld3.pid
+socket  = /var/run/mysqld/mysqld3.sock
+port  = 3309
+datadir = /var/lib/mysql3
+log-bin        = /var/lib/mysql3/mysqld3-bin.log
+log-bin-index  = /var/lib/mysql3/mysqld3-bin.index
+
+[mysqld4]
+server-id      = 4
+pid-file = /var/run/mysqld/mysqld4.pid
+socket  = /var/run/mysqld/mysqld4.sock
+port  = 3310
+datadir = /var/lib/mysql4
+log-bin        = /var/lib/mysql4/mysqld4-bin.log
+log-bin-index  = /var/lib/mysql4/mysqld4-bin.index
 
 [mysqldump]
 quick
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-8108146698981033750?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/8108146698981033750/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=8108146698981033750' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8108146698981033750'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8108146698981033750'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2009/12/using-mysqldmulti-on-karmic.html' title='Using mysqld_multi on Karmic'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-7467363913754324621</id><published>2009-11-03T12:04:00.003+01:00</published><updated>2009-11-17T12:13:13.610+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='quilt'/><category scheme='http://www.blogger.com/atom/ns#' term='reengineering'/><category scheme='http://www.blogger.com/atom/ns#' term='bash'/><category scheme='http://www.blogger.com/atom/ns#' term='testing'/><category scheme='http://www.blogger.com/atom/ns#' term='bisect'/><title type='text'>Bisection testing using Quilt</title><content type='html'>Having produced a nice little series of 124 patches (yes, really), I
recently had to find out which patch caused
&lt;code&gt;distcheck&lt;/code&gt; to fail.  Since &lt;code&gt;distcheck&lt;/code&gt; takes
quite some time to execute, I wanted to make as few runs as possible.&lt;p&gt;

In &lt;a href="http://git-scm.com"&gt;Git&lt;/a&gt;, there is the &lt;code&gt;bisect&lt;/code&gt;
command that can be used to perform bisection testing of a series of
patches, but quilt does not have anything like that, so to simplify my
job, I needed to implement that for quilt.&lt;p&gt;

I started by defining a shell function that did the actual test, and
returned the result.

&lt;pre class="code"&gt;
do_test () {
    echo -n "running distcheck..."
    make -j6 distcheck &gt;/dev/null 2&gt;&amp;1
}
&lt;/pre&gt;

After that, I added code to set up the variables used and to
process the options to the script. The script supports two options:
&lt;code&gt;--lower&lt;/code&gt; and &lt;code&gt;--upper&lt;/code&gt;. Both accept a patch
number: the number of the last patch known to be good and the number
of the first patch known to fail the test. I could have supplied the patch
names here, but this was good enough for my purposes.&lt;p&gt;

Note that I am using Bash since it has support for arrays.

&lt;pre class="code"&gt;
series=(`quilt series`)                  # Array of the patch names
lower=0                                  # Lowest item in tested range
upper=$((${#series[@]} - 1))             # Upper limit of range

while test $# -gt 0; do
    case "$1" in
        -l|--lower)
            lower="$2"                   # The value follows the option name
            shift 2
            ;;
        -u|--upper)
            upper="$2"
            shift 2
            ;;
        *)
            break                        # First non-option argument
            ;;
    esac
done

middle=$(($lower + ($upper - $lower) / 2))
&lt;/pre&gt;

Then we start by preparing the looping by moving to the middle of the
patches in the range.

&lt;pre class="code"&gt;
quilt pop -a &gt;/dev/null
quilt push $middle &gt;/dev/null
&lt;/pre&gt;

The main loop will keep pushing or popping depending on whether the
current patch fails the test or succeeds.  The invariant for the loop
is that &lt;var&gt;$middle&lt;/var&gt; holds the number of the current patch
to be tested (the patch that &lt;code&gt;quilt top&lt;/code&gt; would report), and
we keep looping until &lt;var&gt;$lower&lt;/var&gt; == &lt;var&gt;$upper&lt;/var&gt;.  Just to
ensure that the right patch is tested, we check the invariant in the
loop.&lt;p&gt;

&lt;ul&gt;
  &lt;li&gt;If the test succeeds, we know that the first failing patch is
  somewhere between this patch and the last known failing patch. So, we
  compute the next midpoint to be between this patch and the last
  known unsuccessful patch and store it in &lt;var&gt;middle&lt;/var&gt;. We then
  push patches to reach that patch.&lt;/li&gt;

  &lt;li&gt;If the test fails, we know that the first failing patch is
  somewhere between the current patch and the last known successful
  patch. So, we compute the next midpoint to be between this patch and
  the last successful patch and store it in &lt;var&gt;middle&lt;/var&gt;. We then
  pop patches to reach that patch.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="code"&gt;
while test $lower -lt $upper
do
    top=`quilt top`
    echo -n "$top..."

    if test "$top" != "${series[$(($middle-1))]}"; then
        echo "invariant failed ($top != ${series[$(($middle-1))]})!" 1&gt;&amp;2
        exit 2
    fi

    if do_test $lower $upper; then
        lower=$(($middle + 1))
        middle=$(($lower + ($upper - $lower) / 2))
        cnt=$(($middle - $lower + 1))
        echo -n "succeeded..."
        if test $cnt -gt 0; then
            echo -n "pushing $cnt patches..."
            quilt push $cnt &gt;/dev/null
            echo "done"
        fi
    else
        upper=$middle
        middle=$(($lower + ($upper - $lower) / 2))
        cnt=$(($upper - $middle))
        echo -n "failed..."
        if test $cnt -gt 0; then
            echo -n "popping $cnt patches..."
            quilt pop $cnt &gt;/dev/null
            echo "done"
        fi
    fi
done
&lt;/pre&gt;
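
To convince myself that the index arithmetic is right, the loop can be mirrored in a few lines of Python without touching quilt at all. This is only a model under stated assumptions: the test is taken to pass as long as fewer than &lt;code&gt;first_bad&lt;/code&gt; patches are applied, &lt;code&gt;upper&lt;/code&gt; is taken to be a patch count known to fail, and the function name is made up for illustration.

```python
def simulate_bisect(lower, upper, first_bad):
    """Mirror the lower/middle/upper bookkeeping of the quilt loop.

    Returns (number of applied patches at which the test first fails,
    number of test runs needed to find it).
    """
    middle = lower + (upper - lower) // 2
    applied = middle                       # quilt pop -a; quilt push $middle
    runs = 0
    while upper > lower:
        assert applied == middle           # the loop invariant
        runs += 1
        if first_bad > applied:            # test succeeded
            lower = middle + 1
            middle = lower + (upper - lower) // 2
            applied += middle - lower + 1  # quilt push $cnt
        else:                              # test failed
            upper = middle
            middle = lower + (upper - lower) // 2
            applied -= upper - middle      # quilt pop $cnt
    return lower, runs


if __name__ == "__main__":
    for bad in range(1, 8):
        print(bad, simulate_bisect(0, 7, bad))
```

Running it for every possible position of the bad patch is a cheap way to check the push and pop counts before spending hours on real &lt;code&gt;distcheck&lt;/code&gt; runs.&lt;p&gt;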

Next task: extend quilt to support the &lt;code&gt;bisect&lt;/code&gt; command.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-7467363913754324621?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/7467363913754324621/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=7467363913754324621' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/7467363913754324621'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/7467363913754324621'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2009/11/bisection-testing-using-quilt.html' title='Bisection testing using Quilt'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-1218764242625382167</id><published>2009-02-25T13:52:00.003+01:00</published><updated>2009-02-25T18:03:35.629+01:00</updated><title type='text'>Mixing engines in transactions</title><content type='html'>As &lt;a href="http://pento.net/2009/02/13/replication-with-innodb-and-myisam-transactions"&gt;Gary already pointed out&lt;/a&gt;, the replication behavior when mixing
non-transactional and transactional tables has changed between 5.1.30
and 5.1.31. Why did it change? Well, for starters, it never actually
worked, and the existing behavior was fooling people into believing that
it did.&lt;p&gt;

There are several bugs reported on mixing transactional and
non-transactional tables in statements and in transactions. For the two latest
examples, see &lt;a href="http://bugs.mysql.com/bug.php?id=28976"&gt;BUG#28976&lt;/a&gt; and &lt;a href="http://bugs.mysql.com/bug.php?id=40116"&gt;BUG#40116&lt;/a&gt;.

To explain the situation, I will first start with some background...&lt;p&gt;

&lt;h3&gt;The binary log&lt;/h3&gt;

For this discussion, we call a statement that manipulates
transactional tables only a &lt;em&gt;transactional statement&lt;/em&gt; and call
it a &lt;em&gt;non-transactional statement&lt;/em&gt; otherwise.&lt;p&gt;

The binary log is a &lt;em&gt;serial history&lt;/em&gt; of the transactions
executed on the master, that is, each transaction is written to the
binary log at commit time. To handle this, the binary log has a
thread-specific transaction cache as well as the actual binary
log. Whenever a transactional statement arrives at the binary log, it
is cached in the transaction cache, and once the transaction commits,
the transaction cache is written to the binary log. Non-transactional
statements, on the other hand, are written directly to the binary log
since they take effect immediately.  Note that this is an idealized
picture of how it works, and it lacks a lot of critical details, which
we will cover below.&lt;p&gt;
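
To make the idealized picture concrete, here is a minimal sketch in Python. It is not how the server is implemented: the class and method names are made up for illustration, and a session id stands in for the thread that owns the transaction cache.

```python
class BinaryLog:
    """Idealized binary log with one transaction cache per session."""

    def __init__(self):
        self.events = []   # the serial history, in the order it is written
        self.caches = {}   # session id -> statements cached so far

    def begin(self, session):
        self.caches[session] = []

    def log_statement(self, session, statement, transactional):
        if transactional and session in self.caches:
            # Transactional statements wait in the cache until commit.
            self.caches[session].append(statement)
        else:
            # Non-transactional statements take effect immediately,
            # so they go straight to the binary log.
            self.events.append(statement)

    def commit(self, session):
        cached = self.caches.pop(session, [])
        if cached:
            self.events.extend(["BEGIN"] + cached + ["COMMIT"])

    def rollback(self, session):
        # In the idealized picture a rolled-back transaction leaves no trace.
        self.caches.pop(session, None)
```

With two sessions whose statements interleave, the log ends up holding the transactions one after the other in commit order, which is what makes it a serial history.&lt;p&gt;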

&lt;h3&gt;Non-transactional statements in transactions&lt;/h3&gt;

When using a non-transactional table in a transaction, the effects are
actually written directly to the table and not managed by the
transaction at all. Thanks to the &lt;a
href="http://forge.mysql.com/wiki/MySQL_Internals_Locking_Overview#Life_cycle_of_a_table_during_statement_execution"&gt;locking
scheduler&lt;/a&gt;, this can guarantee a serialization order of the
statements, but there are no guarantees of atomicity. Note, however,
that the locks are released at the end of the &lt;em&gt;statement&lt;/em&gt;, not
the transaction. This is a result of the MyISAM legacy, which does not
have a notion of transactions. In other words, this is what you get in
a sample execution:

&lt;pre class="code"&gt;
T1&amp;gt; create table myisam_tbl (a int) engine=myisam;
Query OK, 0 rows affected (0.04 sec)

T1&amp;gt; create table innodb_tbl (a int) engine=innodb;
Query OK, 0 rows affected (0.00 sec)

T1&amp;gt; begin;
Query OK, 0 rows affected (0.00 sec)

T1&amp;gt; insert into myisam_tbl values (1),(2);
Query OK, 2 rows affected (0.04 sec)
Records: 2  Duplicates: 0  Warnings: 0

T1&amp;gt; insert into innodb_tbl values (1),(2);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

T2&amp;gt; select * from myisam_tbl;
+------+
| a    |
+------+
|    1 | 
|    2 | 
+------+
2 rows in set (0.00 sec)

T2&amp;gt; select * from innodb_tbl;
Empty set (0.00 sec)
&lt;/pre&gt;

&lt;h3&gt;Replicating non-transactional statements in transactions is easy&lt;/h3&gt;

So, how should one handle this case in replication? Note that we right
now only consider statement-based replication (but we will consider
row-based replication later).&lt;p&gt;

So, let us say that we have the transaction above, that is:

&lt;pre class="code"&gt;
BEGIN;
INSERT INTO myisam_tbl VALUES (1),(2);
INSERT INTO innodb_tbl VALUES (1),(2);
COMMIT;
&lt;/pre&gt;

As far as possible, we want to mimic the actual behavior, meaning
that the non-transactional change should be replicated to the slave as
soon as possible so that even if the transaction does not commit for a
long time, the MyISAM change should be visible on the slave
immediately.&lt;p&gt;

OK, this is simple: if a non-transactional statement is inside a
transaction, we just write it to the binary log. The effects have
already taken place, so of course we write it directly to the
binary log (right?).&lt;p&gt;

Fine, so what about this case then:

&lt;pre class="code"&gt;
BEGIN;
INSERT INTO innodb_tbl VALUES (1),(2);
INSERT INTO myisam_tbl SELECT * FROM innodb_tbl;
COMMIT;
&lt;/pre&gt;

Ew, um, well... no, that does not work. If the &lt;code&gt;myisam_tbl&lt;/code&gt;
insert is written before the &lt;code&gt;innodb_tbl&lt;/code&gt; insert, it will
not have the 1 and 2.&lt;p&gt;

Bummer...&lt;p&gt;

Hum... OK, so let's say that we only write the non-transactional
statement that is first in the transaction directly to the binary log.
Then we will have it replicated directly to the slave, but for the
other case, when the non-transactional statement is not first in the
transaction, we will have the
non-transactional statement stored in the transaction cache so that it
can read from the transactional table.&lt;p&gt;
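
As a sketch of that rule (my reading of it, not the server code; the function name is hypothetical), a non-transactional statement is written ahead only as long as nothing has been put in the transaction cache yet:

```python
def binlog_order(statements):
    """Order in which one BEGIN ... COMMIT block reaches the binary log.

    statements is a list of (sql, transactional) pairs in execution order.
    """
    written_ahead = []   # written directly, ahead of the transaction
    cache = []           # the transaction cache, written at commit
    for sql, transactional in statements:
        if not transactional and not cache:
            written_ahead.append(sql)
        else:
            cache.append(sql)
    return written_ahead + ["BEGIN"] + cache + ["COMMIT"]


if __name__ == "__main__":
    print(binlog_order([
        ("INSERT INTO myisam_tbl VALUES (1),(2)", False),
        ("INSERT INTO innodb_tbl VALUES (1),(2)", True),
    ]))
    print(binlog_order([
        ("INSERT INTO innodb_tbl VALUES (1),(2)", True),
        ("INSERT INTO myisam_tbl SELECT * FROM innodb_tbl", False),
    ]))
```

In the first transaction the MyISAM insert overtakes the transaction; in the second it stays behind the InnoDB insert it reads from.&lt;p&gt;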

This is how it worked in MySQL at least since 4.1 (I didn't look
earlier), and it serves well...&lt;p&gt;

&lt;h3&gt;... unless you consider all the caveats&lt;/h3&gt;

That was a very rosy picture, but the reality is not that easy to
handle.

&lt;h4&gt;Rolling back a transaction.&lt;/h4&gt;

If the transaction is rolled back, and a non-transactional statement
was in the transaction cache, it is still necessary to write the
transaction to the binary log (with a &lt;code&gt;ROLLBACK&lt;/code&gt; last).
This is wasting resources (disk space) since if the transaction
contained only transactional statements, it could just be
discarded.&lt;p&gt;

Also, it will be &lt;em&gt;incorrect&lt;/em&gt; in the case that the
non-transactional engine is replaced with a transactional engine on
the slave, or if there is a crash after the non-transactional
statement.  It only works if the statement is first in the
transaction, because then it will be committed as a separate
transaction ahead of the transaction that is rolled back.&lt;p&gt;

For this reason...

&lt;h4&gt;Non-transactional statements need to be at the beginning of a
transaction.&lt;/h4&gt;

Since only the statements that are at the beginning are "written
ahead", the user has to remember to put the
non-transactional statements first in the transaction. This lulls an
inexperienced user, or one that just didn't consider replication when
writing the transactions, into the idea that replication can handle
the transaction even when the statement is not first in the
transaction.&lt;p&gt;

Incidentally, since we're talking about non-transactional
statements...

&lt;h4&gt;What statements are non-transactional?&lt;/h4&gt;

So, this is the scenario mentioned in &lt;a
href="http://bugs.mysql.com/bug.php?id=40116"&gt;BUG#40116&lt;/a&gt;. Let us
introduce three tables: one MyISAM log table and two InnoDB tables
holding the business data.  We then add a trigger to log any changes
to one of the InnoDB tables like this:

&lt;pre class="code"&gt;
CREATE TABLE tbl (f INT) ENGINE=INNODB;
CREATE TABLE extra (f INT) ENGINE=INNODB;
CREATE TABLE log (r INT) ENGINE=MYISAM; 
CREATE TRIGGER tbl_tr AFTER INSERT ON tbl FOR EACH ROW
    INSERT INTO log VALUES ( NEW.f );
&lt;/pre&gt;

Now, let's have a look at this transaction (deliberately without a
&lt;code&gt;COMMIT&lt;/code&gt; or &lt;code&gt;ROLLBACK&lt;/code&gt;):

&lt;pre class="code"&gt;
BEGIN;
INSERT INTO tbl VALUES (1);
INSERT INTO extra VALUES (2);
&lt;/pre&gt;

What about the first statement in the transaction? Is it
non-transactional or transactional? What do we do once we have seen
only that statement?&lt;p&gt;

If we treat the statement as non-transactional statement and write it
ahead of the transaction, we have to make a decision there and then on
whether to commit it or to roll it back, which is just not possible
(it will be wrong whatever we decide).&lt;p&gt;

Another alternative is to look at the "top level" table and treat it
as transactional (because &lt;code&gt;tbl&lt;/code&gt; is transactional). This
appears to be what 5.0 is doing. This means that the statement would be put
in the transaction cache, but if it is later decided that the
statement should roll back, we have to write the transaction to the
binary log with a &lt;code&gt;ROLLBACK&lt;/code&gt; last. This would work in an
acceptable manner, but what happens if the engines are switched so
that the non-transactional table is the "top level" table? Does this
mean that the statement suddenly becomes non-transactional and is
written ahead of the transaction? If we do that, we will have all the
problems described in the previous paragraph about being forced to make a
good decision there and then.&lt;p&gt;

Ooooookey... so we have to cache the statement even if it contains
non-transactional changes. Fine. The only case that we can actually
write ahead is when we have a non-transactional statement not containing
any transactional changes and that statement is first in a
transaction... and there is a lot of logic to check that case.&lt;p&gt;

So, we decided to remove that logic and always write statements inside
a transaction to the transaction cache. The only remaining piece of
the logic is that a transaction containing a non-transactional change
is written to the binary log with a &lt;code&gt;ROLLBACK&lt;/code&gt; last. If
there are only transactional changes, the transaction cache is just
tossed and nothing written to the binary log.&lt;p&gt;
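
The remaining logic is small enough to sketch in a few lines of Python. Again, this is an illustration of the rule as described here, with a made-up function name, not the server code:

```python
def end_transaction(cache, committed):
    """Return what is written to the binary log when a transaction ends.

    cache is a list of (sql, has_non_transactional_change) pairs, since
    every statement inside a transaction now goes to the transaction cache.
    """
    statements = [sql for sql, _ in cache]
    if committed:
        return ["BEGIN"] + statements + ["COMMIT"]
    if any(non_transactional for _, non_transactional in cache):
        # The non-transactional change cannot be undone on the master,
        # so the slave has to see the whole transaction, rollback included.
        return ["BEGIN"] + statements + ["ROLLBACK"]
    return []  # purely transactional: toss the cache


if __name__ == "__main__":
    mixed = [("INSERT INTO tbl VALUES (1)", True),
             ("INSERT INTO extra VALUES (2)", False)]
    print(end_transaction(mixed, committed=False))
```

Here the first statement is marked as carrying a non-transactional change because of the trigger on &lt;code&gt;tbl&lt;/code&gt;, so a rollback still produces a logged transaction.&lt;p&gt;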

The only thing remaining is to print a warning when a statement
containing non-transactional changes is put in the transaction
cache. This is not the case right now: the server prints a warning
when a transaction holding a non-transactional change is &lt;em&gt;rolled
back&lt;/em&gt;, which in my view is a tad too late, since the problem
actually occurs &lt;em&gt;when the statement is written to the binary
log&lt;/em&gt;.

&lt;h4&gt;What about row-based replication?&lt;/h4&gt;

Until now, we have discussed statement-based replication only, but for
row-based replication we can actually do better. Any changes that are
non-transactional can be written ahead of the transaction since there
are no dependencies on statements inside the transaction. There are
only two problems:

&lt;ol&gt;

  &lt;li&gt;For performance reasons, rows are cached and written to the
  binary log when the statement commits. This means that if there is a
  crash before the statement finishes, the rows written for the
  statement as far as it got are lost. For most transactional engines,
  this is not a problem, since the changes will be rolled back, but
  for MyISAM, the changes can stay, leading to inconsistencies between
  the tables on the master and the slave.  Since it is not reasonable
  to handle this case, we assume that each statement for a
  non-transactional engine is atomic.&lt;/li&gt;

  &lt;li&gt;There is only one transaction cache, and for the
  non-transactional changes to "overtake" the transactional changes,
  we need an additional cache that is flushed for each statement. This
  will be implemented as part of &lt;a
  href="http://forge.mysql.com/worklog/task.php?id=2687"&gt;WL#2687&lt;/a&gt;.&lt;/li&gt;
  &lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-1218764242625382167?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/1218764242625382167/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=1218764242625382167' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1218764242625382167'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/1218764242625382167'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2009/02/mixing-engines-in-transactions.html' title='Mixing engines in transactions'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-8778095209194708980</id><published>2008-08-20T10:13:00.006+02:00</published><updated>2008-08-29T08:47:59.794+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='protobuf'/><category scheme='http://www.blogger.com/atom/ns#' term='drizzle'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='binlog'/><category scheme='http://www.blogger.com/atom/ns#' term='binary log'/><title type='text'>The missing pieces in the protobuf binary log</title><content type='html'>&lt;a href="http://code.google.com/apis/protocolbuffers/"&gt;Protobuf&lt;/a&gt; 
comes with a minor problem: it does not have support for handling "type tagged structures", that is, something reminiscent of &lt;em&gt;objects&lt;/em&gt; in OOP lingo, so if you are going to have a heterogeneous sequence of
messages, you have to roll it yourself. For that reason, I added a &lt;em&gt;transport frame&lt;/em&gt; for the messages in the binary log that wraps each message with some extra information. In addition to allowing the binary log to be a sequence of messages, it also adds some integrity-checking data and simplifies some administrative tasks.&lt;p&gt;

&lt;table align="right"
  style="border-collapse: collapse; margin: 10pt; padding: 8pt; font-size: large;"&gt;
  &lt;caption style="font-weight: bold; font-size: smaller;"&gt;Transport frame with message&lt;/caption&gt;
  &lt;tr&gt;&lt;th style="border-width: thin; border-style: solid;"&gt;Length&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;th style="border-width: thin; border-style: solid;"&gt;Type Tag&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;th style="border-width: thin; height: 5ex; border-style: solid dashed;"&gt;Message&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;th style="border-width: thin; border-style: solid;"&gt;Checksum&lt;/th&gt;&lt;/tr&gt;
&lt;/table&gt;

The format of each message in the sequence is given in the table in
the margin, where &lt;em&gt;length&lt;/em&gt; is a specially encoded length
that we will go through &lt;a href="#length-encoding"&gt;below&lt;/a&gt;,
&lt;em&gt;type&lt;/em&gt; is a single byte holding the type tag, &lt;em&gt;message&lt;/em&gt;
is one of the messages given in the specification, and
&lt;em&gt;checksum&lt;/em&gt; is a checksum to ensure the integrity of the
transport.&lt;p&gt;

&lt;strong&gt;Checksum.&lt;/strong&gt; As checksum, the plan is to use a
CRC-32. We don't want it to be so large that it affects performance, but
we do want it to catch reasonable losses of integrity. I'm considering
storing this as a varint after the actual message, but for the time
being, it is given as 4 raw bytes (it is not implemented at all
yet). Please give me feedback on this: if we make it a varint, we can
stuff the checksum in there, but that will also run the risk of not
being able to read the checksum due to corruption of the checksum, so
offhand, I would say that a fixed number of bytes is preferable.&lt;p&gt;

&lt;strong&gt;Type Tag.&lt;/strong&gt; The type tag is a single byte giving the
type of the event. This means that we are limited to 255 events, but
considering that we don't even have 26 events in 5.1 right now, I
don't see that we will run into that limit very soon. It is possible
to put the type tag in the message as well, specifying it as an
enum inside the protobuf specification, but that will just provide the
information in two places, so it is better to keep it separate.&lt;p&gt;

&lt;strong&gt;Length.&lt;/strong&gt; The length is the length of the
&lt;em&gt;message&lt;/em&gt;, that is, it does not include the type tag, checksum,
nor the length itself. This simplifies the normal processing, and in
the event that one needs to skip an event, it is easy to compute the
next position of a transport frame by just decoding the length (see
below).
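
To make the frame layout concrete, here is a minimal sketch in Python. It is not the Drizzle implementation, and it fills in several choices that are not settled above: it uses the CRC-32 variant provided by &lt;code&gt;zlib.crc32&lt;/code&gt;, lets the checksum cover the type tag and the message, stores it as 4 little-endian bytes, and only handles messages short enough for a one-byte length. The function names are hypothetical.

```python
import zlib


def pack_frame(type_tag, message):
    """Wrap a serialized message as: length, type tag, message, checksum."""
    # One-byte lengths only in this sketch (assumption, see above).
    assert 256 > len(message) > 1
    body = bytes([type_tag]) + message
    checksum = zlib.crc32(body).to_bytes(4, "little")
    return bytes([len(message)]) + body + checksum


def unpack_frame(buf, pos=0):
    """Return (type tag, message, position of the next frame)."""
    length = buf[pos]                       # length of the message only
    type_tag = buf[pos + 1]
    message = buf[pos + 2:pos + 2 + length]
    stored = int.from_bytes(buf[pos + 2 + length:pos + 6 + length], "little")
    if stored != zlib.crc32(bytes([type_tag]) + message):
        raise ValueError("checksum mismatch in frame at offset %d" % pos)
    return type_tag, message, pos + 6 + length


if __name__ == "__main__":
    log = pack_frame(1, b"first message...") + pack_frame(2, b"second message..")
    tag, msg, nxt = unpack_frame(log)
    print(tag, msg, unpack_frame(log, nxt)[:2])
```

Note how skipping a frame needs nothing but the decoded length: the next frame starts a fixed number of bytes (length field, type tag and checksum) after the message.&lt;p&gt;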

&lt;h3 id="length-encoding"&gt;Length encoding&lt;/h3&gt;

The length is encoded using a special scheme to allow for very little
overhead for small events while still leaving room for giving the
length for very large events. This scheme currently allows for a
compact representation of lengths from 2 bytes to 4 GiB (gibibytes) and,
if you don't need to have a compact representation of 2, you can
represent lengths in the range 3 bytes to 16 EiB (exbibytes)
[sic].&lt;p&gt;

The basic idea is to note that the length of a message can never be
zero, and the minimal length in this case is actually 16 bytes. Since
we will never have a length that is less than 16 for an event, that
leaves the values 0-15 available for denoting other information.  The
obvious solution is to let the first byte denote either the length, if
it is, say, greater than 8, or the number of bytes following the byte
that gives the length if it is less than or equal to 8, but we can
actually do better.&lt;p&gt;

Storing &lt;em&gt;Length&lt;/em&gt; requires
&lt;em&gt;ceil(log&lt;sub&gt;256&lt;/sub&gt;(Length))&lt;/em&gt; bytes, so if we let the first
byte L be 

&lt;div style="font-style: italic; margin-left: 4em;"&gt;
ceil(log&lt;sub&gt;2&lt;/sub&gt;(ceil(log&lt;sub&gt;256&lt;/sub&gt;(Length)))) - 1
&lt;/div&gt;

(that is, the base-2 logarithm of the smallest power of two that is
greater than or equal to the number of bytes needed for the length,
minus 1), we can get away with reserving significantly fewer values.
So, for example:&lt;p&gt;

&lt;table style="border-collapse: collapse; margin-left: 4em;"&gt;
  &lt;tr&gt;
    &lt;th colspan="5"&gt;Bytes&lt;/th&gt;
    &lt;th&gt;Value&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;2A&lt;/td&gt;
    &lt;td colspan="4"&gt;&lt;/td&gt;
    &lt;td style="padding: 0pt 1em;"&gt;42&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;00&lt;/td&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;FF&lt;/td&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;02&lt;/td&gt;
    &lt;td colspan="2"&gt;&lt;/td&gt;
    &lt;td style="padding: 0pt 1em;"&gt;767&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;01&lt;/td&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;FF&lt;/td&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;01&lt;/td&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;01&lt;/td&gt;
    &lt;td style="border-width: thin; border-style: solid;"&gt;00&lt;/td&gt;
    &lt;td style="padding: 0pt 1em;"&gt;66047&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;p&gt;
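
The rows in the table can be checked with a small Python sketch of the scheme (the variant covering 2 bytes to 4 GiB). This is only a model for experimenting with the encoding: it leans on Python's integer helpers instead of computing the logarithm while serializing, and it assumes the length bytes are stored least significant byte first, as in the table.

```python
def length_encode(length):
    """Encode a length as a single byte, or as a prefix byte plus 2 or 4 bytes."""
    assert 2 ** 32 > length > 1
    if 256 > length:
        return bytes([length])            # the byte is the length itself
    nbytes = (length.bit_length() + 7) // 8
    prefix = 0                            # the value L from the formula
    while nbytes > 2 ** (prefix + 1):
        prefix += 1
    # Pad the length up to the next power of two bytes.
    return bytes([prefix]) + length.to_bytes(2 ** (prefix + 1), "little")


def length_decode(buf):
    """Return (length, number of bytes consumed from buf)."""
    first = buf[0]
    if first > 1:
        return first, 1
    nbytes = 2 ** (first + 1)
    return int.from_bytes(buf[1:1 + nbytes], "little"), 1 + nbytes


if __name__ == "__main__":
    for row in ("2A", "00FF02", "01FF010100"):
        print(row, length_decode(bytes.fromhex(row))[0])
```

Since the first byte of a multi-byte length is only ever 0 or 1 in this variant, every other first-byte value can be read directly as the length.&lt;p&gt;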

Computing the number of bytes can be done by computing &lt;code&gt;1
&amp;lt;&amp;lt; (L + 1)&lt;/code&gt;, but computing the inverse is a little more
involved.

The following two functions do the job. Although it looks like a lot
of code, &lt;code&gt;length_encode()&lt;/code&gt; is actually only 29 instructions
on my machine (no function calls), while &lt;code&gt;length_decode()&lt;/code&gt;
is about 7 instructions. The trick here is to compute the logarithm at
the same time as we serialize the bytes into the memory, so the overhead
compared to just serializing the length is just a few instructions.

&lt;pre class="code"&gt;
inline unsigned char *
length_encode(size_t length, unsigned char *buf)
{
  unsigned char *ptr= buf;
  assert(length &amp;gt; 1);
  if (length &amp;lt; 256)
    *ptr++= length &amp;amp; 0xFF;
  else {
    int_fast8_t log2m1= -1;        &lt;em&gt;// ceil(log2(ptr - buf)) - 1&lt;/em&gt;
    uint_fast8_t pow2= 1;          &lt;em&gt;// pow2(log2m1 + 1)&lt;/em&gt;
    while (length &amp;gt; 0) {
      &lt;em&gt;// Check the invariants&lt;/em&gt;
      assert(pow2 == (1 &amp;lt;&amp;lt; (log2m1 + 1)));
      assert((ptr - buf) &amp;lt;= (1 &amp;lt;&amp;lt; (log2m1 + 1)));

      &lt;em&gt;// Write the least significant byte of the current&lt;/em&gt;
      &lt;em&gt;// length. Prefix increment is used to make space for the first&lt;/em&gt;
      &lt;em&gt;// byte that will hold log2m1.&lt;/em&gt;
      *++ptr= length &amp;amp; 0xFF;
      length &amp;gt;&amp;gt;= 8;

      &lt;em&gt;// Ensure the invariant holds by correcting it if it doesn't,&lt;/em&gt;
      &lt;em&gt;// that is, the number of bytes written is greater than the&lt;/em&gt;
      &lt;em&gt;// nearest power of two.&lt;/em&gt;
      if (ptr - buf &amp;gt; pow2) {
        ++log2m1;
        pow2 &amp;lt;&amp;lt;= 1;
      }
    }
    &lt;em&gt;// Clear the remaining bytes up to the next power of two&lt;/em&gt;
    while (++ptr &amp;lt; buf + pow2 + 1)
      *ptr= 0;
    *buf= log2m1;
    assert(ptr == buf + pow2 + 1);
  }
  return ptr;
}

inline const unsigned char *
length_decode(const unsigned char *buf, size_t *plen)
{
  if (*buf &amp;gt; 1) {
    *plen = *buf;
    return buf + 1;
  }

  size_t bytes= 1U &amp;lt;&amp;lt; (*buf + 1);
  const unsigned char *ptr= buf + 1;
  size_t length= 0;
  for (unsigned int i = 0 ; i &amp;lt; bytes ; ++i)
    length |= (size_t) *ptr++ &amp;lt;&amp;lt; (8 * i);  &lt;em&gt;// widen before shifting&lt;/em&gt;
  *plen= length;
  return ptr;
}
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-8778095209194708980?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/8778095209194708980/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=8778095209194708980' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8778095209194708980'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/8778095209194708980'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/08/missing-pieces-in-protobuf-binary-log.html' title='The missing pieces in the protobuf binary log'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-7530214196912126030</id><published>2008-08-19T10:14:00.007+02:00</published><updated>2008-08-19T15:19:32.047+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='protobuf'/><category scheme='http://www.blogger.com/atom/ns#' term='drizzle'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><category scheme='http://www.blogger.com/atom/ns#' term='binlog'/><category scheme='http://www.blogger.com/atom/ns#' term='binary log'/><title type='text'>Using protobuf for designing and implementing replication in Drizzle</title><content type='html'>So, following the lead of &lt;a
href="http://krow.livejournal.com/604544.html"&gt;Brian&lt;/a&gt;, I spent a few hours over the weekend creating a very simple replication scheme for &lt;a
href="http://launchpad.net/drizzle"&gt;Drizzle&lt;/a&gt; using &lt;a
href="http://code.google.com/p/protobuf/"&gt;protobuf&lt;/a&gt; for specifying
the binary log events.&lt;p&gt;

Since we are developing replication for a cloud, there are a few
things we have to consider:

&lt;ul&gt;

  &lt;li&gt;&lt;strong&gt;Servers are unreliable.&lt;/strong&gt; We cannot trust
  servers; we have to expect them to crash at the worst possible time
  (&lt;a href="http://en.wikipedia.org/wiki/Murphy%27s_law"&gt;Murphy&lt;/a&gt; is
  a very good friend of mine, you know. He must be, since he visits me
  very often). This means that we need to be able to send
  statements to the slaves before the transaction is
  complete, which in turn means that we need to support
  &lt;em&gt;interleaving&lt;/em&gt; and hence both &lt;em&gt;commit&lt;/em&gt; and
  &lt;em&gt;rollback&lt;/em&gt; events.&lt;p&gt;

  In order to handle interleaving, we need to have a &lt;em&gt;transaction
  id&lt;/em&gt;, but in order to handle session-specific data in the event
  of a crash (for example, temporary tables), we need to have a
  &lt;em&gt;session id&lt;/em&gt; as well. However, the session id is only needed
  for statements, not for other events, so we add it there. This will
  allow the slave to expire any session objects when necessary.&lt;p&gt;

  Since we cannot always know if the transaction is complete when a
  statement has been executed, we need to have the commit and rollback
  events as separate events, instead of using the alternative approach
  of adding flags to each query event.&lt;/li&gt;
  
  &lt;li&gt;&lt;strong&gt;Reconnections are frequent.&lt;/strong&gt; Since masters go up
  and down all the time, we have to do what we can to make
  reconnections to another master easy. Among other things, it means
  that we cannot interrogate a slave's master after that master has
  crashed to figure out where we should start replication, so we need
  some form of &lt;a
  href="http://jan.kneschke.de/projects/mysql/mysql-proxy-and-a-global-transaction-id"&gt;&lt;em&gt;Global
  Transaction ID&lt;/em&gt;&lt;/a&gt; to be able to decide where to start
  replication when connecting to another master.&lt;p&gt;

  In our case, we want the transaction id to be transferable to other
  servers as well, so we combine the server id and the transaction id
  to form the Global Transaction ID for this replication.&lt;p&gt;

  Since reconnections are frequent, we also need techniques
  for resolving conflicts between events, and comparing
  &lt;em&gt;timestamps&lt;/em&gt; is one such technique. To handle that, we add a timestamp
  to each event, and we make room for a &lt;em&gt;nanosecond-precision
  timestamp&lt;/em&gt; right away, meaning that we need at least 64 bits
  for it.&lt;/li&gt;

  &lt;li&gt;&lt;strong&gt;The network is unreliable.&lt;/strong&gt; We expect the cloud to
  be spread all over the planet, so we cannot really trust the network
  to provide a reliable transport. This means that we need some form
  of &lt;em&gt;checksum&lt;/em&gt; on each event to ensure that it was transferred
  correctly.&lt;/li&gt;
&lt;/ul&gt;
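
As a rough illustration of the three ingredients above (this is a sketch, not the code in the branch): a global transaction id built from the server id and the transaction id, a 64-bit nanosecond timestamp, and a per-event checksum. The names &lt;code&gt;GlobalId&lt;/code&gt;, &lt;code&gt;nano_timestamp&lt;/code&gt;, &lt;code&gt;adler32_checksum&lt;/code&gt;, and &lt;code&gt;event_is_intact&lt;/code&gt; are made up for this example, and Adler-32 is just one possible choice of checksum, picked here only because it is short to write down.

&lt;pre class="code"&gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;string&amp;gt;

// Hypothetical global transaction id: the pair (server id, transaction id).
struct GlobalId {
  std::uint32_t server_id;
  std::uint32_t trans_id;
};

inline bool operator==(const GlobalId&amp;amp; a, const GlobalId&amp;amp; b) {
  return a.server_id == b.server_id &amp;amp;&amp;amp; a.trans_id == b.trans_id;
}

// Nanoseconds since the epoch of the system clock, stored in 64 bits.
inline std::uint64_t nano_timestamp() {
  using namespace std::chrono;
  return static_cast&amp;lt;std::uint64_t&amp;gt;(
      duration_cast&amp;lt;nanoseconds&amp;gt;(system_clock::now().time_since_epoch()).count());
}

// Adler-32 over the serialized event (an assumption; any checksum would do).
inline std::uint32_t adler32_checksum(const std::string&amp;amp; data) {
  const std::uint32_t MOD = 65521;
  std::uint32_t a = 1, b = 0;
  for (std::size_t i = 0; i &amp;lt; data.size(); ++i) {
    a = (a + static_cast&amp;lt;unsigned char&amp;gt;(data[i])) % MOD;
    b = (b + a) % MOD;
  }
  return (b &amp;lt;&amp;lt; 16) | a;
}

// True if the received bytes match the checksum that was sent with the event.
inline bool event_is_intact(const std::string&amp;amp; data, std::uint32_t expected) {
  return adler32_checksum(data) == expected;
}
&lt;/pre&gt;

A slave that reconnects to another master would then ask for everything after the last &lt;code&gt;GlobalId&lt;/code&gt; it has applied, and throw away any event for which &lt;code&gt;event_is_intact()&lt;/code&gt; returns false.&lt;p&gt;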

Now, for the first simple implementation, we're aiming at
statement-based replication, which means that there has to be a way to
transfer session context information. Remember that statement-based
replication just sends the statements that change the state of
the database, but the effect of a statement can also depend on the
context of the session. For example, the &lt;code&gt;INSERT&lt;/code&gt; statement below cannot
be reliably replicated without the value of the user variable
&lt;code&gt;@foo&lt;/code&gt;.

&lt;pre class="code"&gt;
SET @foo = 'Hello world!';
INSERT INTO whatever VALUES (@foo);
&lt;/pre&gt;

This is already handled in MySQL Replication by preceding each query
log event with a sequence of &lt;em&gt;context events&lt;/em&gt;, but for this
solution another approach is more appropriate: adding a
set of name-value pairs to the query event.
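
&lt;p&gt;A minimal sketch of what a query event carrying its own context could look like in memory, and how a slave could turn it back into executable statements. &lt;code&gt;QueryEvent&lt;/code&gt; and &lt;code&gt;to_sql&lt;/code&gt; are hypothetical names used only for this illustration, and the quoting of values is deliberately naive.

&lt;pre class="code"&gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;sstream&amp;gt;
#include &amp;lt;string&amp;gt;

// Hypothetical in-memory form of a query event that carries its context.
struct QueryEvent {
  std::map&amp;lt;std::string, std::string&amp;gt; variables;  // user variables, without the '@'
  std::string query;
};

// Render the event as the statements a slave would have to execute:
// one SET per variable (in name order), followed by the query itself.
// Values are quoted naively; a real implementation must escape them.
inline std::string to_sql(const QueryEvent&amp;amp; ev) {
  std::ostringstream out;
  std::map&amp;lt;std::string, std::string&amp;gt;::const_iterator it;
  for (it = ev.variables.begin(); it != ev.variables.end(); ++it)
    out &amp;lt;&amp;lt; "SET @" &amp;lt;&amp;lt; it-&amp;gt;first &amp;lt;&amp;lt; " = '" &amp;lt;&amp;lt; it-&amp;gt;second &amp;lt;&amp;lt; "';\n";
  out &amp;lt;&amp;lt; ev.query &amp;lt;&amp;lt; ";\n";
  return out.str();
}
&lt;/pre&gt;

The point of keeping the variables inside the event is that the event is self-contained: there is no separate context event that can get lost or separated from the query it belongs to.&lt;p&gt;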

Another problem is that there are functions that are
non-deterministic, or context dependent in other ways, but these can
be handled by rewriting the queries as follows:&lt;p&gt;

&lt;table border="1"&gt;
  &lt;tr&gt;&lt;th&gt;Instead of&lt;/th&gt;&lt;th&gt;...use this&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;
    &lt;td valign="top"&gt;INSERT INTO info VALUES (UUID(), 47);&lt;/td&gt;
    &lt;td valign="top"&gt;SET @tmp = UUID();&lt;br&gt;INSERT INTO info VALUES (@tmp, 47);&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;p&gt;

Now, the protobuf specification for all the above items, including
some events used to control the storage of the binary log, is:

&lt;pre class="code"&gt;
package BinaryLog;

message Header {
  required fixed64 timestamp = 1;
  required uint32 server_id = 2;
  required uint32 trans_id = 3;
}

message Start {
  required Header header = 1;
  required uint32 server_version = 2;
  required string server_signature = 3;
}

message Chain {
  required Header header = 1;
  required uint32 next = 2;            // Sequence number of next file
}

message Query {
  message Variable {
    required string name = 1;
    required string value = 2;
  }

  required Header header = 1;
  repeated Variable variable = 2;
  required uint32 session_id = 3;
  required string query = 4;
}

message Commit {
  required Header header = 1;
}

message Rollback {
  required Header header = 1;
}
&lt;/pre&gt;
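
To illustrate why separate commit and rollback events make interleaving manageable, here is a rough C++ sketch of how a slave could hold on to interleaved statements until it learns the outcome of each transaction. It takes plain arguments rather than the classes that protobuf generates, and &lt;code&gt;TransactionBuffer&lt;/code&gt; and its member functions are hypothetical names invented for this example. A real slave might just as well execute the statements inside an open transaction instead of buffering them.

&lt;pre class="code"&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;map&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical slave-side buffer for interleaved transactions.
class TransactionBuffer {
public:
  typedef std::pair&amp;lt;std::uint32_t, std::uint32_t&amp;gt; Key;  // (server_id, trans_id)
  typedef std::vector&amp;lt;std::string&amp;gt; Statements;

  // A Query event: remember the statement until the outcome is known.
  void query(std::uint32_t server_id, std::uint32_t trans_id,
             const std::string&amp;amp; sql) {
    m_pending[Key(server_id, trans_id)].push_back(sql);
  }

  // A Commit event: hand back the statements to apply, in arrival order.
  Statements commit(std::uint32_t server_id, std::uint32_t trans_id) {
    Statements result;
    Pending::iterator it = m_pending.find(Key(server_id, trans_id));
    if (it != m_pending.end()) {
      result.swap(it-&amp;gt;second);
      m_pending.erase(it);
    }
    return result;
  }

  // A Rollback event: forget everything the transaction did.
  void rollback(std::uint32_t server_id, std::uint32_t trans_id) {
    m_pending.erase(Key(server_id, trans_id));
  }

  // Number of transactions that have neither committed nor rolled back.
  std::size_t open_transactions() const { return m_pending.size(); }

private:
  typedef std::map&amp;lt;Key, Statements&amp;gt; Pending;
  Pending m_pending;
};
&lt;/pre&gt;

Had the commit been a flag on the last query event instead, the master would have needed to know that a statement was the last one of its transaction at the time it wrote the statement, which is exactly what we cannot always know.&lt;p&gt;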

After tossing together a reader and a writer for the format, the
result is:

&lt;pre class="code"&gt;
$ &lt;strong&gt;./binlog_writer --trans 1 \
&amp;gt; --set nick=mkindahl --set name='Mats Kindahl' \
&amp;gt; 'INSERT INTO whatever VALUES (@nick,@name)'&lt;/strong&gt;
$ &lt;strong&gt;./binlog_writer --trans 1 \
&amp;gt; --set nick=krow --set name='Brian Aker' \
&amp;gt; 'INSERT INTO whatever VALUES (@nick,@name)'&lt;/strong&gt;
$ &lt;strong&gt;./binlog_writer --trans 1 \
&amp;gt; --set nick=mtaylor --set name='Monty Taylor' \
&amp;gt; 'INSERT INTO whatever VALUES (@nick,@name)'&lt;/strong&gt;
$ &lt;strong&gt;./binlog_reader&lt;/strong&gt;
# Global Id: (1,0)
# Timestamp: 484270929829911
set @name = 'Mats Kindahl'
set @nick = 'mkindahl'
INSERT INTO whatever VALUES (@nick,@name)
# Global Id: (1,0)
# Timestamp: 484330886264299
set @name = 'Brian Aker'
set @nick = 'krow'
INSERT INTO whatever VALUES (@nick,@name)
# Global Id: (1,0)
# Timestamp: 484391458447787
set @name = 'Monty Taylor'
set @nick = 'mtaylor'
INSERT INTO whatever VALUES (@nick,@name)
$ &lt;strong&gt;ls -l log.bin&lt;/strong&gt;
-rw-r--r-- 1 mats bazaar 311 2008-08-19 14:22 log.bin
&lt;/pre&gt;

Protobuf Rocks!

You can find the branch containing the ongoing development of this at &lt;a
href=""&gt;Launchpad&lt;/a&gt;. Right now, there are no changes to the server;
we want the format to be stable first, so the branch is merged on a
regular basis to the main tree as well.

&lt;ul&gt;
  &lt;li&gt;Branch: &lt;a href="http://code.launchpad.net/~mkindahl/drizzle/replication-simple"&gt;&lt;code&gt;lp:~mkindahl/drizzle/replication-simple&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Blueprint: &lt;a href="https://blueprints.launchpad.net/drizzle/+spec/replication-simple" &gt;&lt;code&gt;drizzle/replication-simple&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-7530214196912126030?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/7530214196912126030/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=7530214196912126030' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/7530214196912126030'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/7530214196912126030'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/08/using-protobuf-for-designing-and.html' title='Using protobuf for designing and implementing replication in Drizzle'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-5560809553311962189</id><published>2008-08-14T18:05:00.006+02:00</published><updated>2008-08-19T10:14:01.427+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='streaming'/><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='drizzle'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='manipulator'/><category scheme='http://www.blogger.com/atom/ns#' term='join'/><title type='text'>A join I/O manipulator for IOStream</title><content type='html'>I started playing around with protobuf when doing some 
stuff in &lt;a href="http://launchpad.net/drizzle"&gt;Drizzle&lt;/a&gt; (more about that later), and since the examples were using IOStreams, the table reader and writer that &lt;a href="http://krow.livejournal.com/604544.html"&gt;Brian&lt;/a&gt; wrote use IOStreams as well. Now, IOStreams is pretty powerful, but it can be a pain to use, so of course I started tossing together some utilities to make it easier to work with.&lt;p&gt;

Having been a serious Perl addict for 20 years, I of course started missing a lot of nice functions for manipulating strings, the most immediate one being &lt;a href=""&gt;join&lt;/a&gt;, so I wrote a C++ IOStream manipulator that joins the elements of an arbitrary sequence and writes them to an &lt;code&gt;std::ostream&lt;/code&gt;.&lt;p&gt;

In this case, since the I/O manipulator takes arguments, it has to be written as a class. Recall that &lt;code&gt;std::cout &amp;lt;&amp;lt; foo(3)&lt;/code&gt; is just shorthand for &lt;code&gt;operator&amp;lt;&amp;lt;(std::cout, foo(3))&lt;/code&gt;. Since we want to avoid constructing the full string before writing it to the output stream, we define our own &lt;code&gt;joiner&lt;/code&gt; class and create an &lt;code&gt;operator&amp;lt;&amp;lt;(std::ostream&amp;amp;, const joiner&amp;amp;)&lt;/code&gt; function that works with the &lt;code&gt;joiner&lt;/code&gt; class in the following manner:

&lt;pre class="code"&gt;
template &amp;lt;class FwdIter&amp;gt; class joiner {
  friend std::ostream&amp;amp; operator&amp;lt;&amp;lt;(std::ostream&amp;amp; out, const joiner&amp;amp; j) {
    j.write(out);
    return out;
  }

public:
  explicit joiner(const std::string&amp;amp; separator, FwdIter start, FwdIter finish)
    : m_sep(separator), m_start(start), m_finish(finish)
  { }

private:
  std::string m_sep;
  FwdIter m_start, m_finish;

  void write(std::ostream&amp;amp; out) const {
    FwdIter fi = m_start;
    &lt;span style="color: red"&gt;if (m_start == m_finish)&lt;/span&gt;
    &lt;span style="color: red"&gt;  return;&lt;/span&gt;
    while (true) {
      out &amp;lt;&amp;lt; *fi;
      if (++fi == m_finish)
        break;
      else
        out &amp;lt;&amp;lt; m_sep;
    }
  }
};
&lt;/pre&gt;

So, now we can write something like:

&lt;pre class="code"&gt;
std::cout &amp;lt;&amp;lt; joiner&amp;lt;std::vector&amp;lt;int&amp;gt;::const_iterator&amp;gt;(",", seq.begin(), seq.end())
          &amp;lt;&amp;lt; std::endl;
&lt;/pre&gt;

This is an awful lot to type, and especially difficult to maintain since the type of the sequence we are printing might change, so we introduce two helper functions to infer the iterator type for us:

&lt;pre class="code"&gt;
template &amp;lt;class FwdIter&amp;gt;
joiner&amp;lt;FwdIter&amp;gt;
join(const std::string&amp;amp; delim, FwdIter start, FwdIter finish) {
  return joiner&amp;lt;FwdIter&amp;gt;(delim, start, finish);
}

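// The container is taken by const reference: iterators into a by-value
// copy would dangle as soon as join() returns.
template &amp;lt;class Container&amp;gt;
joiner&amp;lt;typename Container::const_iterator&amp;gt;
join(const std::string&amp;amp; delim, const Container&amp;amp; seq) {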
  typedef typename Container::const_iterator FwdIter;
  return joiner&amp;lt;FwdIter&amp;gt;(delim, seq.begin(), seq.end());
}
&lt;/pre&gt;

Now we can use the following code to write out a comma-separated sequence and let the compiler infer the types for us.

&lt;pre class="code"&gt;
std::cout &amp;lt;&amp;lt; join(",", seq.begin(), seq.end())
          &amp;lt;&amp;lt; std::endl;
&lt;/pre&gt;

or even more compactly

&lt;pre class="code"&gt;
std::cout &amp;lt;&amp;lt; join(",", seq) &amp;lt;&amp;lt; std::endl;
&lt;/pre&gt;
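
To try this out in isolation, here is a condensed, self-contained variant of the pieces above. It is a sketch, not a drop-in replacement: the loop in &lt;code&gt;write()&lt;/code&gt; is written slightly differently so that an empty sequence needs no special case, and &lt;code&gt;join_to_string&lt;/code&gt; is a hypothetical helper added only to make the result easy to inspect.

&lt;pre class="code"&gt;
#include &amp;lt;list&amp;gt;
#include &amp;lt;ostream&amp;gt;
#include &amp;lt;sstream&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

template &amp;lt;class FwdIter&amp;gt; class joiner {
  friend std::ostream&amp;amp; operator&amp;lt;&amp;lt;(std::ostream&amp;amp; out, const joiner&amp;amp; j) {
    j.write(out);
    return out;
  }

public:
  joiner(const std::string&amp;amp; separator, FwdIter start, FwdIter finish)
    : m_sep(separator), m_start(start), m_finish(finish)
  { }

private:
  std::string m_sep;
  FwdIter m_start, m_finish;

  void write(std::ostream&amp;amp; out) const {
    // Write the separator before every element except the first.
    for (FwdIter fi = m_start; fi != m_finish; ++fi) {
      if (fi != m_start)
        out &amp;lt;&amp;lt; m_sep;
      out &amp;lt;&amp;lt; *fi;
    }
  }
};

// Takes the container by const reference so the iterators stay valid.
template &amp;lt;class Container&amp;gt;
joiner&amp;lt;typename Container::const_iterator&amp;gt;
join(const std::string&amp;amp; delim, const Container&amp;amp; seq) {
  return joiner&amp;lt;typename Container::const_iterator&amp;gt;(delim, seq.begin(), seq.end());
}

// Hypothetical helper: collect the joined output in a string.
template &amp;lt;class Container&amp;gt;
std::string join_to_string(const std::string&amp;amp; delim, const Container&amp;amp; seq) {
  std::ostringstream out;
  out &amp;lt;&amp;lt; join(delim, seq);
  return out.str();
}
&lt;/pre&gt;

With this, joining an empty &lt;code&gt;std::vector&amp;lt;int&amp;gt;&lt;/code&gt; gives an empty string, and joining a vector holding 1, 2, and 3 with a comma gives &lt;code&gt;1,2,3&lt;/code&gt;. Since the helper only relies on &lt;code&gt;const_iterator&lt;/code&gt;, &lt;code&gt;begin()&lt;/code&gt;, and &lt;code&gt;end()&lt;/code&gt;, it works for a &lt;code&gt;std::list&lt;/code&gt; as well.&lt;p&gt;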

&lt;strong&gt;Update:&lt;/strong&gt; There was a bug that caused the &lt;code&gt;write()&lt;/code&gt; function above to try to read the first element of an empty sequence. I have added the code needed to handle empty sequences, marked in &lt;span style="color: red"&gt;red&lt;/span&gt; above.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-5560809553311962189?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/5560809553311962189/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=5560809553311962189' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/5560809553311962189'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/5560809553311962189'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/08/join-io-manipulator-for-iostream.html' title='A join I/O manipulator for IOStream'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-278304241592578840</id><published>2008-06-05T12:15:00.004+02:00</published><updated>2008-06-05T12:39:19.639+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='databases'/><category scheme='http://www.blogger.com/atom/ns#' term='transactions'/><category scheme='http://www.blogger.com/atom/ns#' term='isolation'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category 
scheme='http://www.blogger.com/atom/ns#' term='transaction'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='falcon'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><category scheme='http://www.blogger.com/atom/ns#' term='concurrent'/><category scheme='http://www.blogger.com/atom/ns#' term='concurrency'/><title type='text'>Statement-based replication is disabled for Falcon</title><content type='html'>Contrary to what I said &lt;a href="http://mysqlmusings.blogspot.com/2008/02/statement-based-replication-for-falcon.html"&gt;earlier&lt;/a&gt;, Falcon has decided to deliberately disable statement-based replication using the same capabilities mechanism that InnoDB uses.&lt;p&gt;

The reason is that isolation between concurrent transactions cannot be guaranteed, meaning that two concurrent transactions are not guaranteed to be serializable (the result of a concurrent transaction that has committed can "leak" into an ongoing transaction). Since they are not serializable, they cannot be written to the binary log in an order that produces the same result on the slave as on the master.&lt;p&gt;

However, when using row-based replication they &lt;em&gt;are&lt;/em&gt; serializable, because whatever values are written to the tables are also written to the binary log. If data "leaks" into an ongoing transaction, the leaked values are what end up in the binary log as well, so when the transaction commits, the values written to the table are the same as those written to the binary log.&lt;p&gt;

It is a rational decision, but I hope that Falcon will support statement-based replication as well in the future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-278304241592578840?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/278304241592578840/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=278304241592578840' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/278304241592578840'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/278304241592578840'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/06/statement-based-replication-is-disabled.html' title='Statement-based replication is disabled for Falcon'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-5000415620252200474</id><published>2008-04-13T21:34:00.003+02:00</published><updated>2010-01-08T15:07:20.932+01:00</updated><title type='text'>MySQL Conference Replication tutorial at Monday 9:00am - 12:30pm</title><content type='html'>&lt;div class="vevent"&gt;Lars and I will have the &lt;a class="url summary" href="http://en.oreilly.com/mysql2008/public/schedule/detail/2145"&gt;replication tutorial&lt;/a&gt; &lt;abbr class="dtstart" title="2008-04-14T09:00:00"&gt;Monday 9:00am&lt;/abbr&gt; to &lt;abbr class="dtend" title="2008-04-14T12:30:00"&gt;12:30pm&lt;/abbr&gt; in 
&lt;span class="location"&gt;Ballroom H&lt;/span&gt;.&lt;/div&gt;&lt;p&gt;

In order to make it easy to play around with replication, I threw together a little package of scripts that is available from the &lt;a href="http://forge.mysql.com/wiki/Replication/Tutorial"&gt;tutorial page&lt;/a&gt; at the forge. With these scripts, you can run several servers from a replication tutorial directory without interfering with any existing installations of the server on your machine. To use the package, you need to download a binary distribution of the server without any installers as well as the replication tutorial package, and unpack both in the same directory.&lt;p&gt;

It is, however, not necessary to download the package to be able to attend the tutorial, but it gives you an option to experiment with replication.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-5000415620252200474?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/5000415620252200474/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=5000415620252200474' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/5000415620252200474'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/5000415620252200474'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/04/mysql-conference-replication-tutorial.html' title='MySQL Conference Replication tutorial at Monday 9:00am - 12:30pm'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-6733578046712441017</id><published>2008-02-07T12:01:00.000+01:00</published><updated>2008-02-07T12:19:46.594+01:00</updated><title type='text'>Statement-based replication for Falcon</title><content type='html'>I just read the &lt;a href="http://mysqlha.blogspot.com/2008/01/questions-about-falcon.html"&gt;post from Mark&lt;/a&gt; regarding some questions he had about the Falcon engine. One of the points that made me jump is the following:

&lt;blockquote&gt;
&lt;strong&gt;Row-level replication (instead of statement-based replication) is required when replicating Falcon objects.&lt;/strong&gt;&lt;p&gt;

This might be enough to keep me from upgrading, but I am not sure if this is limited to Falcon. Will future MySQL releases require the use of row based replication? Having SQL statements in the binlog is invaluable to me and I am not willing to give that up.
&lt;/blockquote&gt;

Now, seriously, it is not trivial to disable statement-based replication for any engine so it is basically always on; whether it works as expected is a different story. So, in short: &lt;em&gt;there is nothing specifically done to disable statement-based replication for the Falcon engine. It is just not &lt;strong&gt;supported&lt;/strong&gt; by them (yet).&lt;/em&gt;&lt;p&gt;

You can actually use statement-based replication to replicate any tables using any engine, Cluster tables as well. It is not a good idea to replicate Cluster tables for several reasons, but it is possible.&lt;p&gt;

We will continue to maintain statement-based replication for the foreseeable future, but it is just not possible to handle all the quirks that can occur in odd situations (for some examples, see &lt;a href="http://bugs.mysql.com/bug.php?id=31168"&gt;BUG#31168&lt;/a&gt;, &lt;a href="http://bugs.mysql.com/bug.php?id=3989"&gt;BUG#3989&lt;/a&gt;, and &lt;a href="http://bugs.mysql.com/bug.php?id=19630"&gt;BUG#19630&lt;/a&gt;). If you know how to write queries that avoid problems like these, you will not have any problems, but if you are concerned about whether you're up to it, switch to use row-based replication.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-6733578046712441017?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/6733578046712441017/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=6733578046712441017' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/6733578046712441017'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/6733578046712441017'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/02/statement-based-replication-for-falcon.html' title='Statement-based replication for Falcon'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-2786373943299353014</id><published>2008-01-07T07:21:00.002+01:00</published><updated>2009-04-23T19:10:44.674+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='virtual table'/><category scheme='http://www.blogger.com/atom/ns#' term='atom'/><category scheme='http://www.blogger.com/atom/ns#' term='syndication'/><category scheme='http://www.blogger.com/atom/ns#' term='sqlite'/><category scheme='http://www.blogger.com/atom/ns#' term='feed'/><category scheme='http://www.blogger.com/atom/ns#' term='rss'/><title type='text'>SQLite table to read Atom feeds</title><content type='html'>Ah, Christmas Holidays! Time to take a break from the daily
chores... to spend some time with family... but also time to catch up
on reading and spend some hours on some fun hacking.&lt;p&gt;

When catching up on my reading of &lt;a href="http://www.ddj.com"&gt;Dr. Dobb's Journal&lt;/a&gt;, I came across an &lt;a href="http://www.ddj.com/database/202802959"&gt;interesting article&lt;/a&gt; by &lt;a href="http://www.mikesclutter.com/"&gt;Michael Owens&lt;/a&gt; about writing virtual tables for &lt;a href="http://www.sqlite.org"&gt;SQLite&lt;/a&gt;, which got me thinking about a small hack I've wanted to do for a while: a table that reads an RSS/Atom feed and presents the data to the query engine. Originally, I was planning to implement this as a &lt;a href="http://www.mysql.com"&gt;MySQL&lt;/a&gt; &lt;a href="http://dev.mysql.com/tech-resources/articles/storage-engine/part_1.html"&gt;Storage Engine&lt;/a&gt;, but since I was reading this article and the interface seems easy enough to work with, I decided to just whip together a simple prototype for SQLite instead. Since I currently don't have a good place to publish the repository, I have a distro available at &lt;a href="http://www.kindahl.net/pub/sqlite-feedme-0.01.tar.gz"&gt;http://www.kindahl.net/pub/sqlite-feedme-0.01.tar.gz&lt;/a&gt;.&lt;p&gt;

After building and installing, creating the table is as simple as this:

&lt;pre class="code"&gt;
mats@romeo:~/proj/feedme$ sqlite3
SQLite version 3.4.2
Enter ".help" for instructions
sqlite&amp;gt; &lt;strong&gt;.load libfeedme.so&lt;/strong&gt;
sqlite&amp;gt; &lt;strong&gt;create virtual table onlamp&lt;/strong&gt;
   ...&amp;gt; &lt;strong&gt;using feedme('http://www.oreillynet.com/pub/feed/8');&lt;/strong&gt;
sqlite&amp;gt; &lt;strong&gt;select title from onlamp;&lt;/strong&gt;
PyMOTW: weakref
What the Perl 6 and Parrot Hackers Did on their Christmas Vacation
Least Appropriate Uses of Perl You've Seen
YAP6 Operator: Filetests?
WILFZ (What I Learned From Zope):  Buildout
TPT(Tiny Python Tip):  Watch Jeff Rush's Videos
PyCon 2008 Talks and Tutorials Finalized
TPT(Tiny Python Tip):  Python for Bash Scripters
What the X-Files Taught Us about Real Aliens
Python Web Framework Comparison:  Documentation and Marketing
Python Web Framework Comparison:  Documentation and Marketing
PyMOTW: mmap
Improving Test Performance
YAP6 Operator: Reduce Operators - Part II
WSGI:  Python Web Development's Howard Roark
&lt;/pre&gt;

Note that it is still a prototype. My plans are to at least:

&lt;ul&gt;

  &lt;li&gt;Read the entire feed into memory and parse it from there instead
  of writing the feed to disk before parsing it. Writing it to disk
  was the default for cURL, so I just stuck to that for the prototype
  (yeah, yeah, I know I'm lazy).&lt;/li&gt;

  &lt;li&gt;Detect the feed format automatically and select the
  parser accordingly. Right now, it can only handle &lt;a
  href="http://atomenabled.org/"&gt;Atom feeds&lt;/a&gt;, and does not do a
  great job at that either.&lt;/li&gt;

  &lt;li&gt;Figure out a way to present multi-valued entry data in a useful
  way. For example, an entry can hold several links, but which one is
  really the interesting one?&lt;/li&gt;

&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-2786373943299353014?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/2786373943299353014/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=2786373943299353014' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2786373943299353014'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2786373943299353014'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2008/01/sqlite-table-to-read-atom-feeds.html' title='SQLite table to read Atom feeds'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-2850935712854540823</id><published>2007-08-14T21:02:00.000+02:00</published><updated>2007-09-05T14:15:59.365+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='log'/><category scheme='http://www.blogger.com/atom/ns#' term='stopping'/><category scheme='http://www.blogger.com/atom/ns#' term='binary'/><category scheme='http://www.blogger.com/atom/ns#' term='slave'/><category scheme='http://www.blogger.com/atom/ns#' term='master_pos_wait'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='binlog'/><title type='text'>Stopping the slave exactly at a specified binlog position</title><content 
type='html'>Catching up on some articles on the &lt;a href="http://www.planetmysql.org/"&gt;Planet MySQL&lt;/a&gt; feed, I just read &lt;a href="http://mysqldba.blogspot.com/2007/07/replication-syncing-masterposwait.html"&gt;the post by Dathan&lt;/a&gt; on how to promote a slave to be master by using &lt;code&gt;MASTER_POS_WAIT()&lt;/code&gt;.  &lt;code&gt;MASTER_POS_WAIT()&lt;/code&gt; is an excellent function that lets you wait until the slave reaches a point &lt;em&gt;at or after the given binlog position&lt;/em&gt;. Observe that when the statement issuing a &lt;code&gt;MASTER_POS_WAIT()&lt;/code&gt; returns, the slave threads are still running, so even if a &lt;code&gt;STOP SLAVE&lt;/code&gt; is issued immediately after the statement with &lt;code&gt;MASTER_POS_WAIT()&lt;/code&gt;, the slave is bound to move a little further before actually stopping. For Dathan's situation, stopping at the exact position is not necessary, but wouldn't it be great if you could stop a slave at &lt;em&gt;exactly&lt;/em&gt; the position that you want? Well, that is possible.&lt;p&gt;

There is another command that does exactly what you want, and that is &lt;code&gt;START SLAVE UNTIL&lt;/code&gt;. The problem with this command is that it can only be executed when the slave threads have stopped, but that is trivial to do by just issuing a &lt;code&gt;STOP SLAVE&lt;/code&gt;. In other words, instead of doing:

&lt;pre class="code"&gt;
SELECT MASTER_POS_WAIT('dbmaster1-bin.000002', 4);
SLAVE STOP;
&lt;/pre&gt;

like Dathan suggests, do:

&lt;pre class="code"&gt;
STOP SLAVE;
START SLAVE UNTIL
    MASTER_LOG_FILE='dbmaster1-bin.000002', MASTER_LOG_POS=4;
SELECT MASTER_POS_WAIT('dbmaster1-bin.000002', 4);
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-2850935712854540823?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/2850935712854540823/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=2850935712854540823' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2850935712854540823'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2850935712854540823'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/08/stopping-slave-exactly-at-specified.html' title='Stopping the slave exactly at a specified binlog position'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-116311944731269857</id><published>2007-07-01T21:21:00.000+02:00</published><updated>2007-09-05T21:27:42.214+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='streaming'/><category scheme='http://www.blogger.com/atom/ns#' term='fawn'/><category scheme='http://www.blogger.com/atom/ns#' term='feisty'/><category scheme='http://www.blogger.com/atom/ns#' term='build'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='building'/><category scheme='http://www.blogger.com/atom/ns#' term='kubuntu'/><category 
scheme='http://www.blogger.com/atom/ns#' term='lua'/><category scheme='http://www.blogger.com/atom/ns#' term='proxy'/><category scheme='http://www.blogger.com/atom/ns#' term='blob'/><title type='text'>Musings on MySQL Proxy</title><content type='html'>When seeing that the &lt;a href=""&gt;MySQL Proxy was released&lt;/a&gt;, I decided to try to experiment with it since I see strong potential with this tool, both for replication and for other uses (recall that I'm a replication guy, so this is my primary focus).  I'm actually on vacation, but this will of course not stop me from tinkering with things (I know, I'm just a hopeless case in this aspect ;) ).&lt;p&gt;

After reporting &lt;a href="http://bugs.mysql.com/bug.php?id=29548"&gt;a minor bug&lt;/a&gt;, I managed to build and run it with some sample scripts. I'm using Kubuntu Feisty, and had some initial problems, but it was actually pretty straightforward. I'll repeat the steps anyway, in case anybody else has problems.

&lt;ol&gt;
&lt;li&gt;Get the source from the repository
&lt;blockquote class="code"&gt;svn co http://svn.mysql.com/svnpublic/mysql-proxy/ mysql-proxy&lt;/blockquote&gt;
&lt;li&gt;Make sure you have all packages necessary. Several of the packages below were not installed for me.
&lt;blockquote class="code"&gt;apt-get install pkg-config liblua5.1-0 liblua5.1-dev libevent-dev libevent1 libglib2.0 libglib2.0-dev libmysqlclient-dev&lt;/blockquote&gt;
&lt;li&gt;Switch to the directory where the source is.
&lt;blockquote class="code"&gt;cd mysql-proxy/trunk&lt;/blockquote&gt;
&lt;li&gt;Set up all the stuff necessary for the autotools (Autoconf, Automake, and Libtool)&lt;p&gt;
&lt;code&gt;./autogen.sh&lt;/code&gt;
&lt;li&gt;Run configure, but make sure to tell the configuration script that it should use Lua 5.1&lt;p&gt;
&lt;code&gt;./configure --with-lua=lua5.1&lt;/code&gt;
&lt;li&gt;Build the proxy&lt;p&gt;&lt;code&gt;make&lt;/code&gt;
&lt;/ol&gt;

&lt;h3&gt;Some applications&lt;/h3&gt;

After having experimented a little, I see some potential applications of the MySQL Proxy, but some pieces are missing before these scenarios become possible. I will first present the ideas and then describe what I see as the missing pieces.

&lt;h4&gt;Vertical and horizontal partitioning&lt;/h4&gt;

If it were possible to parse the query and decompose it into its fragments, it could be possible to separate the query into two queries and send them off to different servers. The result sets could then be composed to form a new result set that is then delivered to the final client.  This can, of course, be accomplished through other means, but if you take a look at the next item, you have a variation that is not that simple to handle.&lt;p&gt;

It could be interesting to handle horizontal partitioning in the case that data for, e.g., a user is stored in different machines depending on geographic location. This is something that is interesting for companies like Flickr, Google, and YouTube since contacting a server near yourself geographically significantly improves response times.

&lt;h4&gt;BLOB Streaming&lt;/h4&gt;

In order to handle BLOB streaming, as I outlined in a &lt;a href="http://mysqlmusings.blogspot.com/2007/06/blob-locators-blob-streaming.html"&gt;previous post&lt;/a&gt;, it could be left to the proxy to build a final result set where the &lt;a href="http://forge.mysql.com/worklog/task.php?id=3583"&gt;BLOB Locators&lt;/a&gt; (URIs in my example) are replaced with the actual BLOBs by contacting a dedicated BLOB server that holds the BLOB and building the final result set inside the proxy.
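
To make the idea a bit more concrete, here is a minimal sketch in Python (not Lua, and not the proxy's actual scripting interface) of what "replacing the locators with the actual BLOBs" could look like. Everything in it is an assumption made for illustration: the dictionary stands in for a dedicated BLOB server, and the names fetch_blob and resolve_locators are hypothetical.

```python
# Hypothetical stand-in for a dedicated BLOB server: maps locator URIs to data.
BLOB_SERVER = {
    "http://blobs.example.com/b/1": b"first blob",
    "http://blobs.example.com/b/2": b"second blob",
}


def fetch_blob(uri):
    """Fetch the BLOB behind a locator (a real proxy would issue an HTTP GET)."""
    return BLOB_SERVER[uri]


def resolve_locators(rows, blob_columns):
    """Return a new result set where the locator columns hold the actual BLOBs."""
    resolved = []
    for row in rows:
        new_row = dict(row)  # leave the result set from the server untouched
        for column in blob_columns:
            new_row[column] = fetch_blob(row[column])
        resolved.append(new_row)
    return resolved


# Result set as delivered by the database server: locators instead of BLOBs.
rows = [
    {"id": 1, "image": "http://blobs.example.com/b/1"},
    {"id": 2, "image": "http://blobs.example.com/b/2"},
]
final = resolve_locators(rows, ["image"])
```

The client only ever sees the rows in final, so from its point of view nothing has changed.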

&lt;h4&gt;Pre-heating slave threads&lt;/h4&gt;

By placing the proxy in between a master and a slave, it could be possible to pre-heat the slave by issuing a &lt;code&gt;SELECT&lt;/code&gt; query through a client connection to the slave. This is the &lt;a href="http://jan.prima.de/~jan/plok/archives/62-The-oracle-Algorithm-this-is-a-small-o.html"&gt;oracle algorithm&lt;/a&gt; presented by Paul Tuckfield from YouTube, but the solution in this case is simpler: there is no need to read the events from the relay log, since they can instead be caught "in the air" and acted upon.&lt;p&gt;

This actually requires some parsing of the replication stream, so it might be better to handle this by embedding Lua or Perl into the slave threads. (Personally, I would prefer Perl. Not because it is a better language, or easier to embed, but because I've been using Perl on an almost daily basis since 1988. OTOH, Lua is designed to be efficient and map internal structures to the language and seems very easy to work with, so we'll see what happens.)

&lt;h3&gt;The missing pieces&lt;/h3&gt;

Others have focused on what can be done with the MySQL Proxy, but I see some omissions that, if implemented, would turn the MySQL Proxy into an incredibly flexible tool.

&lt;ul&gt;
&lt;li&gt;It should be possible to parse and rewrite a query before sending it to the server. This is already possible, but a library to parse and build SQL queries would help a lot here.
&lt;li&gt;It should be possible to rewrite the result set in a convenient manner. Jan just added the ability to rewrite the packet that is sent back, so the proxy is heading in that direction, but a more convenient interface would be an advantage here as well.
&lt;li&gt;It should be possible to keep connections to several servers active, and decide what server to send the query to on a per-query basis (even sending queries to all servers, or different queries to different servers). AIUI, there is some rudimentary support for it right now, but not to the extent that I describe here.
&lt;li&gt;It should be possible to send several query-result sequences to servers for each query-result sequence sent to the proxy. This will make it possible to act on the result of a response to one server, and dynamically decide, e.g., what server to contact next in order to get the data that forms the final result set.
&lt;/ul&gt;

All-in-all, the MySQL Proxy is showing incredible potential and I, for one, will see what I can contribute with in order to make it even better.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-116311944731269857?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/116311944731269857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=116311944731269857' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/116311944731269857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/116311944731269857'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/07/musings-on-mysql-proxy.html' title='Musings on MySQL Proxy'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-2984017898494690606</id><published>2007-06-18T10:08:00.000+02:00</published><updated>2007-07-02T06:56:44.218+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='streaming'/><category scheme='http://www.blogger.com/atom/ns#' term='scaling'/><category scheme='http://www.blogger.com/atom/ns#' term='locator'/><category scheme='http://www.blogger.com/atom/ns#' term='scale-out'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='locators'/><category scheme='http://www.blogger.com/atom/ns#' 
term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='blob'/><title type='text'>BLOB locators + BLOB streaming + Replication = Yeah!</title><content type='html'>On the MySQL Conference &amp;amp; Expo 2007, I had the chance of meeting up with &lt;a href="http://pbxt.blogspot.com/"&gt;Paul (the author of PBXT)&lt;/a&gt; and Mikael. We briefly touched the topic of the &lt;a href="http://pbxt.blogspot.com/2007/03/pbxt-and-scalable-blob-streaming.html"&gt;BLOB Streaming Protocol&lt;/a&gt; that Paul is working on, which I find really neat. On the way back home, I traveled with &lt;a href="http://www.mysqluc.com/cs/mysqluc2005/view/e_spkr/2059"&gt;Anders Karlsson&lt;/a&gt; (one of MySQL:s Sales Engineers), who is responsible for the &lt;a href="http://forge.mysql.com/worklog/task.php?id=3583"&gt;BLOB Locator worklog&lt;/a&gt; and he described the concepts from his viewpoint.&lt;p&gt;

Since I work with replication, these things got me thinking about what the impact is for replication and how it affects usability, efficiency, and scale-out. Being a RESTful guy, I started thinking about URIs both when Paul described the BLOB Streaming Protocol and when Anders started describing the BLOB Locators. Apparently, &lt;a href="http://pbxt.blogspot.com/2007/06/geting-blob-out-of-database-with-blob.html"&gt;I wasn't the only one.&lt;/a&gt;&lt;p&gt;

&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_X_imutbSFuE/RnjdQ-cs5JI/AAAAAAAAAAM/MsJIMTw0bmI/s1600-h/BLOBStreaming.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_X_imutbSFuE/RnjdQ-cs5JI/AAAAAAAAAAM/MsJIMTw0bmI/s320/BLOBStreaming.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5078051863571260562" align="left"/&gt;&lt;/a&gt;

Combining BLOB Locators with the BLOB Streaming Protocol has a significant impact on the scalability and performance of replication, and I'm going to show how by giving a typical use of replication: scaling out reads from an installation by replicating from a single master to several slaves.&lt;p&gt;

Now, when a client connects to get a blob from the database, the server delivers a result set containing one or more blob locators. Since we are using URIs and the HTTP protocol, the blobs can be served by a normal web server, and the client can fetch the data in the blobs using HTTP and build the real result set. The existence of blob locators is completely transparent to the client, who sees no difference from the previous implementation.&lt;p&gt;

Now, what does this give us that makes this setup so scalable?

&lt;ul&gt;
&lt;li&gt;Instead of storing the actual blob data, we store a reference to the data (in the form of a URI). When working with the blob and copying it to another table, we will actually just copy the reference, which is very quick compared to copying the blob itself. The use of the BLOB locator is entirely transparent to any operations on the blob: reading is not affected, and changing the blob can be accomplished using copy-on-write semantics (which of course makes the operation slower).&lt;p&gt;
&lt;li&gt;Since we have a unique reference to a blob it is possible to implement caching mechanisms to cache results of, e.g., fulltext searches in the blobs.&lt;p&gt;
&lt;li&gt;The use of a URI makes the blob locator server-agnostic, which means that we can reliably replicate the URI instead of the blob and still expect any client that connects to the slave server to be able to fetch the blob using HTTP. There is no translation necessary when doing the replication, and the URI can be treated as just a string. This means that a scale-out strategy is trivial to implement. This is just a generalization of the recommended practice to store the blobs as files on a server, and save the file name in the tables instead: we just make it transparent to the user and simplify the deployment.&lt;p&gt;
&lt;li&gt;By using a URI as reference, we can put the blob data on a separate server, which can be dedicated to delivering blob data to requesters. Since everything is going via this server, it is very likely that "hot" data is available immediately, and since we are using a URI, delivery over the Internet can rely on Web Caches to avoid re-sending data that is already cached somewhere.&lt;p&gt;

We do not lose the ability to count the number of deliveries of the data, since we can always count the number of blob locators that have been delivered instead of the number of BLOBs that have been delivered.&lt;p&gt;
&lt;li&gt;The HTTP protocol supports both &lt;code&gt;GET&lt;/code&gt; and &lt;code&gt;PUT&lt;/code&gt;, so it can be used both to read data from the server and to write data to it.&lt;p&gt;
&lt;li&gt;We offload a significant amount of "dumb" work from the server, that of assembling result sets consisting of blob data and other data, and therefore allow the server to perform more of the "intelligent" work of doing database searches.&lt;p&gt;
&lt;li&gt;The design is incredibly flexible since it allows, for example, the blob server(s) to be placed anywhere, even in different towns, while the main operating site stays in one location.
&lt;/ul&gt;scale&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-2984017898494690606?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/2984017898494690606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=2984017898494690606' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2984017898494690606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2984017898494690606'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/06/blob-locators-blob-streaming.html' title='BLOB locators + BLOB streaming + Replication = Yeah!'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_X_imutbSFuE/RnjdQ-cs5JI/AAAAAAAAAAM/MsJIMTw0bmI/s72-c/BLOBStreaming.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-4817111782434452264</id><published>2007-06-18T00:08:00.000+02:00</published><updated>2007-06-18T00:12:52.528+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='poll'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='planning'/><title type='text'>Post on replication poll was lost</title><content type='html'>My &lt;a 
href="http://mysqlmusings.blogspot.com/2007/06/replication-poll-and-our-plans-for.html"&gt;last post on the replication poll&lt;/a&gt; was apparently lost from Planet MySQL. If you're interested, I commented on the replication poll, our future plans, and how they were affected by the poll.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-4817111782434452264?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/4817111782434452264/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=4817111782434452264' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/4817111782434452264'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/4817111782434452264'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/06/post-on-replication-poll-was-lost.html' title='Post on replication poll was lost'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-2875456887702898955</id><published>2007-06-03T12:46:00.000+02:00</published><updated>2007-06-15T18:56:46.472+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='resolution'/><category scheme='http://www.blogger.com/atom/ns#' term='online'/><category scheme='http://www.blogger.com/atom/ns#' term='semi-synchronous'/><category scheme='http://www.blogger.com/atom/ns#' term='checksum'/><category 
scheme='http://www.blogger.com/atom/ns#' term='detection'/><category scheme='http://www.blogger.com/atom/ns#' term='planning'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='performance'/><category scheme='http://www.blogger.com/atom/ns#' term='poll'/><category scheme='http://www.blogger.com/atom/ns#' term='multi-threaded'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='hash'/><category scheme='http://www.blogger.com/atom/ns#' term='conflict'/><title type='text'>The replication poll and our plans for the future</title><content type='html'>We've been running &lt;a href="http://dev.mysql.com/tech-resources/quickpolls/replication-features.html"&gt;replication poll&lt;/a&gt; and we've got some answers, so I thought I would comment a little on the results of the poll and what our future plans with respect to replication is as a result of the feedback. As I commented &lt;a href="http://mysqlmusings.blogspot.com/2007/05/coolest-future-replication-features.html"&gt;in the previous post&lt;/a&gt;, there are some items that require a significant development effort, but the feedback we got helps us to prioritize.&lt;p&gt;

The top five items from the poll above stand out, so I thought that I would comment on each of them in turn. The results of the poll were (when this post was written):&lt;p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;td&gt;Online check that Master and Slave tables are consistent&lt;/td&gt;
    &lt;td&gt;45.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multi-source replication: replicating from several masters to one slave &lt;/td&gt;
    &lt;td&gt;36.3%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multi-threaded application of data on slave to improve performance&lt;/td&gt;
    &lt;td&gt;29.2%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Conflict resolution: earlier replicated rows are not applied&lt;/td&gt;
    &lt;td&gt;21.0%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Semi-synchronous replication: transaction copied to slave before commit&lt;/td&gt;
    &lt;td&gt;20.3%&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;h3&gt;Online check that Master and Slave tables are consistent&lt;/h3&gt;

The most natural way to check that tables are consistent is to compute a hash of the contents of the table and then compare that with a hash of the same table on the slave. There are storage engines that have support for incrementally computing a hash, and for the other cases, the &lt;a href="http://www.xaprb.com/blog/2007/06/03/mysql-table-checksum-116-released"&gt;Table checksum&lt;/a&gt; that was released by Baron "Xaprb" can be used. The problem is to do the comparison while the replication is running, since any change to the table between computing the hash on the master and the slave will indicate that the tables are different when they in reality are not.

To solve this, we are planning to introduce support to transfer the table hash and perform the check while replication is running. By adding the hash to the binary log, we have a computed hash for a table at a certain point in time, and the slave can then compare the contents of the tables when it sees the event, being sure that this is the same (relative) point in time as it was on the master.

We will probably add a default hash function for those engines that do not provide one, and allow storage engines to return a hash of the data in the table (probably computed incrementally for efficiency).
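
The reasoning can be illustrated with a small, self-contained Python sketch. This is only a toy model under my own assumptions, not the planned implementation: the "binary log" is a plain list, the event kinds "insert" and "table_hash" are made up for the example, and SHA-256 over the sorted rows stands in for whatever hash a storage engine would provide.

```python
import hashlib


def table_hash(rows):
    """Hash the table contents independently of the order the rows are stored in."""
    digest = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode("utf-8"))
    return digest.hexdigest()


def apply_log(log, table):
    """Apply the events in order; return one True/False per hash event seen."""
    results = []
    for kind, payload in log:
        if kind == "insert":
            table.append(payload)
        elif kind == "table_hash":
            # The slave is now at the same relative point in time as the master
            # was when it computed the hash, so the comparison is meaningful.
            results.append(table_hash(table) == payload)
    return results


# Master: execute changes and write them, plus a hash event, to the log.
master = []
binlog = []
for row in [(1, "a"), (2, "b")]:
    master.append(row)
    binlog.append(("insert", row))
binlog.append(("table_hash", table_hash(master)))

# The master keeps changing after the hash was computed.
master.append((3, "c"))
binlog.append(("insert", (3, "c")))

# Slave: replay the log and check the hash when the event shows up.
slave = []
checks = apply_log(binlog, slave)
```

Note that comparing the hashes after the whole log has been produced, instead of at the hash event, would compare the slave against a master that has already moved on, which is exactly the false alarm described above.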

&lt;h3&gt;Multi-source replication: replicating from several masters to one slave&lt;/h3&gt;

This is something that actually was started a while ago, but for several reasons is not finished yet. A large portion of the work is actually done, but since the code is a tad old (enough to be non-trivial to incorporate into the current clone), there is some work remaining to actually close this one.  Since there seems to be a considerable interest in this, both in the poll and at the &lt;a href="www.mysqlconf.com"&gt;MySQL Conference&lt;/a&gt;, we are considering finishing off this feature sometime in the aftermath of 5.1 GA. No promises here, though. There are a lot of things that we need to consider to build a high-quality replication solution, and we're a tad strained when it comes to manpower in the team.

&lt;h3&gt;Multi-threaded application of data on slave to improve performance&lt;/h3&gt;

This is something that we &lt;em&gt;really&lt;/em&gt; want to do, but which we &lt;em&gt;really&lt;/em&gt; do not have the manpower for currently. It is a significant amount of work, it would be a &lt;em&gt;huge&lt;/em&gt; improvement of the replication, but it would &lt;em&gt;utterly&lt;/em&gt; make us unable to do anything else for a significant period. Sorry folks, but however much I would like to see this happen, it would be irresponsible to promise that we will implement this in the near future. There are also some changes going on internally with the threading model, so it might be easier to implement in the near future. 

&lt;h3&gt;Conflict resolution: earlier replicated rows are not applied&lt;/h3&gt;

When multi-source comes into the picture, it is inevitable that some form of conflict resolution will be needed. We are currently working on providing a simple version of timestamp-based conflict resolution in the form of "latest change wins". This is ongoing work, so you will see it in a post-5.1 release in the near future.
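
As a rough illustration of what "latest change wins" means, here is a minimal Python sketch. It is a simplification under stated assumptions and not the server code: each row carries the timestamp of its last change, the function name apply_change is hypothetical, and the choice to let the already stored row win on equal timestamps is mine.

```python
def apply_change(table, key, value, timestamp):
    """Apply a replicated row change unless the stored row is at least as new."""
    current = table.get(key)
    if current is not None and current[1] >= timestamp:
        return False  # conflict: an equally new or newer change is already stored
    table[key] = (value, timestamp)
    return True


# Changes to the same row arriving from two masters, not in timestamp order.
table = {}
applied = [
    apply_change(table, 1, "from master A", 100),
    apply_change(table, 1, "from master B", 90),   # older change: not applied
    apply_change(table, 1, "from master B", 120),  # newer change: wins
]
```

After the three changes, the row holds the value with timestamp 120, regardless of the order in which the changes reached the slave.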

&lt;h3&gt;Semi-synchronous replication: transaction copied to slave before commit&lt;/h3&gt;

There is already a &lt;a href="http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplication"&gt;MySQL 4 patch for this&lt;/a&gt; written by folks at &lt;a href="http://code.google.com/"&gt;Google Code&lt;/a&gt; under the &lt;a href="http://code.google.com/p/google-mysql-tools/wiki/Mysql4Patches"&gt;Mysql4Patches&lt;/a&gt; work. The idea is to not commit the ongoing transaction until the entire transaction has been successfully transferred to at least one slave. The reason for this is that it should be possible to switch to a slave in the event of a failure of the master, so it has to be certain that the transaction exists somewhere else (at least on disk). We consider this very important for our ongoing work of being the best on-line database server for modern applications, so you will probably see it pretty soon. Compared to the patch above, we would like to generalize it slightly so that it is configurable how many slaves should have received the transaction before it is committed. This will of course reduce performance of the master, but it will provide better redundancy in the case of a serious failure and it is a minor addition to the work anyway.
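
The generalization to a configurable number of slaves can be sketched in a few lines of Python. This is only a toy model with assumptions of my own: the Slave class, the reachable flag, and the required_acks parameter are hypothetical names, and neither the Google patch nor the server works on in-memory lists like this.

```python
class Slave:
    """Hypothetical slave that stores received transactions and acknowledges them."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.relay_log = []

    def receive(self, transaction):
        if not self.reachable:
            return False  # no acknowledgement from this slave
        self.relay_log.append(transaction)
        return True


def semi_sync_commit(transaction, slaves, required_acks=1):
    """Return True (commit) once enough slaves hold the entire transaction."""
    acks = 0
    for slave in slaves:
        if slave.receive(transaction):
            acks += 1
        if acks >= required_acks:
            return True  # the transaction exists elsewhere: safe to commit
    return False  # too few copies exist: the master must not commit yet


slaves = [Slave(reachable=False), Slave(), Slave()]
committed_one = semi_sync_commit("trx-1", slaves, required_acks=1)
committed_three = semi_sync_commit("trx-2", slaves, required_acks=3)
```

With one required acknowledgement the commit goes through as soon as the first reachable slave has the transaction; requiring three fails here because only two slaves are reachable, which shows the trade-off between redundancy and availability of the master.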

&lt;h3&gt;Binary log event checksum&lt;/h3&gt;

In addition to the things we mentioned above, this is very important to both find and repair problems with replication. The relay log is a potential source of problem, as is the code that writes the events to the relay log, so it is prudent to add a simple CRC checksum to each event to check the integrity of the event. Sadly enough, this does not exist currently, so we're trying to make this get into the code base as soon as possible, maybe even for 5.1 (keep your fingers crossed). This is &lt;em&gt;not&lt;/em&gt; a promise: we're doing what we can, but there are no guarantees.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-2875456887702898955?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/2875456887702898955/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=2875456887702898955' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2875456887702898955'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2875456887702898955'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/06/replication-poll-and-our-plans-for.html' title='The replication poll and our plans for the future'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-2899875718180793914</id><published>2007-05-11T08:36:00.000+02:00</published><updated>2007-06-03T12:53:22.802+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='poll'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='community'/><category scheme='http://www.blogger.com/atom/ns#' term='feature'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><title type='text'>The coolest future replication features...</title><content type='html'>...is something that you influence what it will be.&lt;p&gt;

The problem with replication is that we have so many things that we want to do, but we are not that many people. What we do is what everybody does when the to-do list is too long: prioritize. Since the replication features are developed for you (yes, you), we have added a quickpoll on &lt;a href="http://dev.mysql.com/"&gt;http://dev.mysql.com/&lt;/a&gt; where you can pick the three most important replication features that you would like to see us focus on next (after the 5.1 GA).&lt;p&gt;

Do you think that on-line checks for table consistency are for weenies who cannot write a simple little script to do that? Please tell us that.&lt;p&gt;

Do you prefer to live on the edge and think that semi-synchronous replication is for safety junkies? Well, we'll be glad to hear your opinion.&lt;p&gt;

Do you think that the YouTube &lt;a href="http://jan.prima.de/~jan/plok/archives/62-The-oracle-Algorithm-this-is-a-small-o.html"&gt;oracle algorithm hack&lt;/a&gt; is the coolest thing on earth and that we should make sure to have it in a release soon? In this case you should especially tell me, because I think it is a pretty cool idea as well.&lt;p&gt;

We cannot promise that they will be done, and there are some features on the list that requires a substantial amount of work, so it might be that we decide to deliver many small features rather than one big feature... &lt;strong&gt;but we need your input&lt;/strong&gt;, so please go and take the quickpoll on &lt;a href="http://dev.mysql.com/"&gt;http://dev.mysql.com/&lt;/a&gt;, because you can be part of making MySQL the best on-line database for modern applications in the world.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-2899875718180793914?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/2899875718180793914/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=2899875718180793914' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2899875718180793914'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/2899875718180793914'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/05/coolest-future-replication-features.html' title='The coolest future replication features...'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-9020639967776243192</id><published>2007-04-18T11:48:00.003+02:00</published><updated>2010-01-11T17:57:38.424+01:00</updated><title type='text'>Heading off to the MySQL Conference</title><content type='html'>&lt;img 
src="http://conferences.oreillynet.com/images/mysqluc2007/banners/speakers/468x60.jpg" alt="MySQL Conference &amp;amp; Expo" width="468" height="60"&gt;

&lt;br clear="all"&gt;

After some long months of intensive bug fixing, it's time to pack up the stuff and head off to the &lt;a href="http://mysqlconf.com"&gt;MySQL Conference &amp;amp; Expo&lt;/a&gt;. Since this is actually my first MySQL (User's) Conference, it's bound to be interesting.&lt;p&gt;

We've got a full schedule here, so it will definitely not be boring.&lt;p&gt;

&lt;div class="vevent"&gt;&lt;strong&gt;&lt;a class="url" href="http://mysqlconf.com/cs/mysqluc2007/view/e_sess/10950"&gt;&lt;abbr class="dtstart" title="2007-04-23T08:30:00"&gt;Monday, April 23, 8:30am&lt;/abbr&gt; - &lt;abbr class="dtend" title="2007-04-23T17:00:00"&gt;5:00pm&lt;/abbr&gt;, &lt;span class="location"&gt;Ballroom D&lt;/span&gt;&lt;/a&gt; (with lunch 12:00pm to 1:30pm).&lt;/strong&gt; &lt;span class="description"&gt;&lt;a href="http://mysqlconf.com/cs/mysqluc2007/view/e_spkr/2655"&gt;Lars&lt;/a&gt;, &lt;a href="http://mysqlconf.com/cs/mysqluc2007/view/e_spkr/2156"&gt;JDD&lt;/a&gt; and I are going to hold a replication tutorial for a full day where we'll go over some basics on how to set up replication to work properly, but also delve into some advanced stuff like row-based replication and cluster replication to see what they can do for you. This is a must for anybody who is seriously going to work with replication.&lt;/span&gt;&lt;/div&gt;&lt;p&gt;

&lt;div class="vevent"&gt;&lt;strong&gt;&lt;a class="url" href="http://mysqlconf.com/cs/mysqluc2007/view/e_sess/10961"&gt;&lt;abbr class="dtstart" title="2007-04-24T10:45:00"&gt;Tuesday, April 24, 10:45am&lt;/abbr&gt; - &lt;abbr class="dtend" title="2007-04-24T11:45:00"&gt;11:45am&lt;/abbr&gt;, &lt;span class="location"&gt;Ballroom F&lt;/span&gt;.&lt;/a&gt;&lt;/strong&gt; &lt;span class="description"&gt;Lars and I will hold a &lt;span class="summary"&gt;"Tips and Tricks" session on replication&lt;/span&gt; and bring a bunch of useful tricks for getting replication to work the way &lt;em&gt;you&lt;/em&gt; want.&lt;/span&gt;&lt;/div&gt;&lt;p&gt;

&lt;div class="vevent"&gt;&lt;strong&gt;&lt;a class="url" href="http://mysqlconf.com/cs/mysqluc2007/view/e_sess/14355"&gt;&lt;abbr class="dtstart" title="2007-04-24T19:30:00"&gt;Tuesday, April 24, 7:30pm&lt;/abbr&gt; - &lt;abbr class="dtend" title="2007-04-24T21:00:00"&gt;9:00pm&lt;/abbr&gt;, &lt;span class="location"&gt;Bayshore&lt;/span&gt;.&lt;/a&gt;&lt;/strong&gt; Lars, &lt;a href="http://mysqlconf.com/cs/mysqluc2007/view/e_spkr/1763"&gt;Jeremy Cole&lt;/a&gt; and I will have a &lt;span class="description"&gt;&lt;abbr class="summary" title="Replication BoF"&gt;BoF session&lt;/abbr&gt; where we'll have an open discussion about the future of replication. We will also bring some food for thought and a digression into upcoming features like conflict handling&lt;/span&gt;.&lt;/div&gt;&lt;p&gt;
 
&lt;strong&gt;&lt;a href="http://mysqlconf.com/cs/mysqluc2007/view/e_sess/12540"&gt;Tuesday, April 24, 5:30pm - 6:15pm, Ballroom E&lt;/a&gt;.&lt;/strong&gt; There will be a session on the replication roadmap where you will find Lars and me presenting some of the ideas that MySQL envisions for the future. If you want to take part in making MySQL the best on-line database in the world, make sure to be there!&lt;p&gt;

You will also find &lt;a href="http://mysqlconf.com/cs/mysqluc2007/view/e_spkr/3182"&gt;Chuck&lt;/a&gt; and me in the Guru bar from 2pm to 4:30pm the same day, so bring your problems and we'll give them a good beating!&lt;p&gt;

On Friday, you will also find me at the &lt;a href="http://krow.livejournal.com/508872.html"&gt;Storage Engine Summit&lt;/a&gt; hosted by Brian.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-9020639967776243192?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/9020639967776243192'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/9020639967776243192'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2007/04/heading-off-to-mysql-conference.html' title='Heading off to the MySQL Conference'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-116679592766989005</id><published>2006-12-22T14:49:00.000+01:00</published><updated>2007-06-18T08:39:03.706+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='online'/><category scheme='http://www.blogger.com/atom/ns#' term='thread'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='slave'/><category scheme='http://www.blogger.com/atom/ns#' term='threads'/><category scheme='http://www.blogger.com/atom/ns#' term='feature'/><category scheme='http://www.blogger.com/atom/ns#' term='replication'/><category scheme='http://www.blogger.com/atom/ns#' term='error'/><category scheme='http://www.blogger.com/atom/ns#' term='status'/><title type='text'>The invisible I/O thread failures are no more</title><content type='html'>To get the status of the replication slave, it is possible to check the 
&lt;code&gt;Last_Error&lt;/code&gt; and &lt;code&gt;Last_Errno&lt;/code&gt; fields from &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;.  Unfortunately, they only give information about the status of the &lt;em&gt;SQL&lt;/em&gt; thread (and not always that either).  If the I/O thread fails, for example, because the &lt;a href="http://ralpartha.blogspot.com/2007/06/not-all-mysql-errors-are-visible-to.html"&gt;server configuration is not correctly set up&lt;/a&gt;, or if the connection to the master is lost due to a network outage, it is necessary to dig through the error log to find out the reason. This might be possible, although annoying, for a DBA to do since he has access to the files on the machine where the server is running, but when using automatic recovery applications that watch the status of the replication, this is not practical. It is also easier to see the status of the server through a normal client connection, compared to logging into the machine and starting to locate the files.&lt;p&gt;

This is actually quite stupid, especially since it is possible to individually check if the threads are running, so to make it possible to check the status of the threads from a client (an application or a user connecting directly to the server), I just added four new fields to the output from &lt;code&gt;SHOW SLAVE STATUS&lt;/code&gt;: &lt;code&gt;Last_SQL_Error&lt;/code&gt;, &lt;code&gt;Last_SQL_Errno&lt;/code&gt;, &lt;code&gt;Last_IO_Error&lt;/code&gt;, and &lt;code&gt;Last_IO_Errno&lt;/code&gt;. The new fields were added last, and the two old fields &lt;code&gt;Last_Error&lt;/code&gt; and &lt;code&gt;Last_Errno&lt;/code&gt; are just aliases for &lt;code&gt;Last_SQL_Error&lt;/code&gt; and &lt;code&gt;Last_SQL_Errno&lt;/code&gt; respectively. Adding the new fields last and keeping the two old fields intact allow old applications to work as normal since they either use positional arguments or find the column by name. New applications, however, can take advantage of these new fields.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-116679592766989005?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/116679592766989005/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=116679592766989005' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/116679592766989005'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/116679592766989005'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/12/when-slave-threads-fail.html' title='The invisible I/O thread failures are no more'/><author><name>Mats 
Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-116138107798940090</id><published>2006-10-20T23:44:00.000+02:00</published><updated>2006-10-20T23:55:43.853+02:00</updated><title type='text'>Documentation for the unit tests API used by MySQL</title><content type='html'>For some time, MySQL has been using a unit test framework based on the &lt;a href="http://search.cpan.org/dist/Test-Harness/lib/Test/Harness/TAP.pod"&gt;Test Anything Protocol (TAP)&lt;/a&gt; used by Perl and PHP.&lt;p&gt;

The framework consists of a C library that can be used to generate TAP output suitable for processing with, for example, the Test::Harness Perl module.  In order to allow Test::Harness to execute the compiled programs, a simple wrapper called unit.pl exists in the unittest/ directory in the MySQL server tree.&lt;p&gt;

The documentation for the MyTAP API is available at &lt;a href="http://www.kindahl.net/mytap/doc/"&gt;&lt;code&gt;http://www.kindahl.net/mytap/doc/&lt;/code&gt;&lt;/a&gt;, until I can find another home for it.

All comments are welcome.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-116138107798940090?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/116138107798940090/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=116138107798940090' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/116138107798940090'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/116138107798940090'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/10/documentation-for-unit-tests-api-used.html' title='Documentation for the unit tests API used by MySQL'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-115831172107161876</id><published>2006-09-15T11:15:00.000+02:00</published><updated>2006-09-15T11:15:21.083+02:00</updated><title type='text'>Replication and the disappearing statements</title><content type='html'>After reading &lt;a href="http://gabrito.com/post/sql-statements-mysteriously-not-replicating-with-mysql-replication"&gt;Todd Huss's blog post about the gotcha when using statement-based replication&lt;/a&gt;, where statements can "disappear" (that is, not be applied to the slave database), I believe that I can shed some light on the reason for this behavior.&lt;p&gt;

Before that, some background.&lt;p&gt;

Traditionally, MySQL has been using what is called &lt;dfn&gt;statement-based replication&lt;/dfn&gt;. Statement-based replication replicates the changes to the slave by sending the actual statement that was executed on the master over to the slave, and the slave subsequently executes that statement. Of course, only statements that change something will be sent to the slave.&lt;p&gt;

Sometimes, you don't want to send all changes to the slaves. It is therefore possible to prevent the master from sending changes to some databases using the &lt;code&gt;--binlog-do-db&lt;/code&gt; and &lt;code&gt;--binlog-ignore-db&lt;/code&gt; switches, which allow you to filter out statements that update certain databases (this is not the whole story; more about the filtering later).&lt;p&gt;

This works well for most queries, such as:

&lt;pre class='code'&gt;
INSERT INTO products   SET name='Gizmo2000', price='$2000'
&lt;/pre&gt;

But suppose that we have two databases &lt;code&gt;db1&lt;/code&gt; and &lt;code&gt;db2&lt;/code&gt; and we decide to not replicate  changes to &lt;code&gt;db1&lt;/code&gt; but will replicate changes to &lt;code&gt;db2&lt;/code&gt;. Now, consider the following statement:
&lt;pre class='code'&gt;
UPDATE db1.foo, db2.foo
SET db1.foo.a = db2.foo.a,
    db2.foo.b = db1.foo.b;
&lt;/pre&gt;

The statement updates both &lt;code&gt;db1&lt;/code&gt; and &lt;code&gt;db2&lt;/code&gt;, so shall we replicate it or shall we not? Since we need to handle even this situation in a &lt;em&gt;consistent&lt;/em&gt; manner, the current database is used to decide if the statement shall be replicated or not. (I didn't actually write the code, since it pre-dates my start at MySQL, but after being immersed in the code for almost two years, I'm pretty sure this is the reason.) This works well for most users, since one usually works with one database only, and sets the current database to that before actually starting to make changes. However, for some special cases, like the one mentioned by Todd, it starts to look strange.&lt;p&gt;
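The effect of the current database can be sketched as follows. This is a minimal illustration, assuming a master that uses statement-based logging and was started with &lt;code&gt;--binlog-ignore-db=db1&lt;/code&gt; (an assumption made for the example); &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;a&lt;/code&gt; are just the placeholder names from the statement above:

&lt;pre class='code'&gt;
-- Assumption: master started with --binlog-ignore-db=db1,
-- statement-based logging.

USE db1;
UPDATE db2.foo SET a = 1;  -- not logged: the current database (db1) is ignored

USE db2;
UPDATE db1.foo SET a = 1;  -- logged: the current database (db2) is not ignored
&lt;/pre&gt;

In both cases the outcome is decided by the &lt;code&gt;USE&lt;/code&gt; statement, not by the table that is actually changed.&lt;p&gt;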

Recently (that is, in 5.1), MySQL released something called &lt;dfn&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/replication-row-based.html"&gt;row-based replication&lt;/a&gt;&lt;/dfn&gt;, where the master sends the actual &lt;em&gt;rows&lt;/em&gt; that were inserted/deleted/updated to the slave, and the slave subsequently inserts/deletes/updates those rows in the database. For each row, the database and table that the row belongs to are known, so if you are using row-based replication (the option &lt;code&gt;--binlog-format=row&lt;/code&gt; to the server, or the statement &lt;code&gt;SET GLOBAL BINLOG_FORMAT=ROW&lt;/code&gt;), the filtering will be done on the &lt;em&gt;actual table being changed&lt;/em&gt; even if the statement updates several different tables in different databases.&lt;p&gt;
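As a rough sketch of what this means in practice, assume a master started with &lt;code&gt;--binlog-ignore-db=db1&lt;/code&gt; (an assumption made for illustration) and the placeholder table &lt;code&gt;foo&lt;/code&gt; in both databases. With the row-based format, the current database should no longer matter:

&lt;pre class='code'&gt;
-- Assumption: master started with --binlog-ignore-db=db1.
-- SET GLOBAL only affects sessions started afterwards, so the
-- session variable is set here instead (this requires SUPER).
SET SESSION BINLOG_FORMAT = ROW;

USE db1;
UPDATE db2.foo SET a = 1;  -- row changes to db2.foo are logged
UPDATE db1.foo SET a = 1;  -- row changes to db1.foo are filtered out
&lt;/pre&gt;

Here each change is filtered on the database of the table that the rows belong to.&lt;p&gt;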

I'll summarize with some general advice when using statement-based replication:
&lt;ul&gt;
&lt;li&gt;Don't qualify your table names with a database name. If you do, you might have trouble with the replication, so this is something to look for.
&lt;li&gt;If you are going to make changes to tables in a database, always  &lt;code&gt;USE&lt;/code&gt; the database to set the current database correctly.
&lt;li&gt;Don't use multi-table updates (or other statements that manipulate several tables) unless the tables are all in the same database.
&lt;li&gt;Even if you are using multi-table updates on tables in the same database you might have problems. So watch out for any statement that manipulates several tables and make sure that they work by testing them.
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-115831172107161876?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/115831172107161876/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=115831172107161876' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/115831172107161876'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/115831172107161876'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/09/replication-and-disappearing.html' title='Replication and the disappearing statements'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-115521545318539067</id><published>2006-08-10T11:02:00.000+02:00</published><updated>2006-08-10T15:12:55.126+02:00</updated><title type='text'>More ways to encourage ideas</title><content type='html'>&lt;p&gt;Again, I cannot help but follow up on Zack's post on &lt;a href="http://www.theopenforce.com/2006/08/more_brainstorm.html"&gt;How to Come Up With Ideas&lt;/a&gt;. In the modern days of &lt;a href="http://www.strategy-business.com/press/16635507/14886"&gt;hypercompetition&lt;/a&gt;, where today's state-of-the-art solution quickly becomes yesterday's news, you have to set up an environment where you continuously come up with new ideas and new solutions. 
Creating such an environment is not an easy task, since people are... well, people.&lt;/p&gt;

&lt;p&gt;With this in mind, I find Zack's first five points especially critical, but I would like to add two more items that I personally feel are missing.&lt;/p&gt;

&lt;ol&gt;
&lt;li value="9"&gt;
&lt;p&gt;&lt;strong&gt;Set an example&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When I was a doctoral student, it was mandatory for all researchers to attend research seminars given by visiting fellows. The department's head professor always attended these seminars, and it was always the case that the most "stupid" and obvious questions were asked by the head professor. On the surface, the questions looked stupid, but actually they were very much to-the-point.&lt;/p&gt;
&lt;p&gt;This, of course, made all us doctoral students start asking "stupid" questions. Many of the questions asked were indeed stupid, but that didn't stop us since nobody frowned or criticized us. Over time, the questions grew more and more precise and the students started finding many gems of insight into the subjects that were presented.&lt;/p&gt;
&lt;p&gt;As time passed, I started realizing that the questions were asked deliberately, to create inquisitive and questioning researchers out of students who were used to learning just enough to pass the exam.  By setting an example, he created an atmosphere where questions were asked, and where it was expected that some of the questions were stupid and some were silly, but that is nevertheless what you have to do as a researcher: constantly ask questions.&lt;/p&gt;
&lt;/li&gt;
&lt;li value="10"&gt;
&lt;p&gt;&lt;strong&gt;Enjoy yourself&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;People's minds work best when relaxed and when not forced to produce ideas. Treating the session as an experiment and taking the occasional break are good ways to ensure that people are relaxed and not trying to force ideas to come (which never works; at least, it has never worked for me). In general, if you enjoy the session, you will also be more relaxed and more ideas will appear. Some ways to make the occasion enjoyable and more relaxed are: provide food (for some reason, many insights come to me when I am going to fetch an extra sandwich, or pouring a coffee), make sure you have plenty of time ("Hey guys! We need 10 new fresh ideas before lunch!" is a sure killer), and don't keep a strict schedule (it depends on the people involved, but maybe you could bring the lunch to the room instead of going out?).
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-115521545318539067?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/115521545318539067/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=115521545318539067' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/115521545318539067'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/115521545318539067'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/08/more-ways-to-encourage-ideas.html' title='More ways to encourage ideas'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-115504646993279502</id><published>2006-08-08T15:38:00.000+02:00</published><updated>2006-08-08T19:51:02.353+02:00</updated><title type='text'>Two more ways to kill good ideas</title><content type='html'>&lt;p&gt;Since Zack brought up &lt;a href="http://www.theopenforce.com/2006/08/brainstorm.html"&gt;8 ways to kill good ideas&lt;/a&gt;, I thought I'd add two of my own that I see popping up frequently.&lt;/p&gt;

&lt;ol&gt;
&lt;li value="9"&gt;&lt;p&gt;&lt;strong&gt;Insist on following procedure&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;People work differently, and those coming up with ideas that they want to try out are usually not good rule-followers. When they are forced to follow a certain procedure, only because it's company policy or because management wants to reduce the risk (which is usually what the procedures are for), the idea will surely not get implemented.&lt;/p&gt;&lt;/li&gt;

&lt;li value="10"&gt;&lt;p&gt;&lt;strong&gt;Punish failures&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In an interview I was once asked the question "do you have many bad ideas?". I answered that "90% of my ideas are usually bad", to which the interviewer smirked and said "Not more? That's pretty good."  This was for a job where the continuous creativity of the employees was the very essence of the company's survival, so the managers assumed that there would be many bad ideas and only a few good ideas.  They assumed that time would be wasted on bad ideas, but considered that part of the trade.&lt;/p&gt;
&lt;p&gt;By punishing failures, you send a clear message that risk-taking is not looked kindly upon. Therefore nobody will take any risks; you do not waste time on any bad ideas, but in effect you will not get any of those world-changing good ideas either.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-115504646993279502?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/115504646993279502/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=115504646993279502' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/115504646993279502'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/115504646993279502'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/08/two-more-ways-to-kill-good-ideas.html' title='Two more ways to kill good ideas'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-114777553373079720</id><published>2006-05-16T11:47:00.000+02:00</published><updated>2006-08-05T21:01:57.043+02:00</updated><title type='text'>Replication of DELETE FROM versus TRUNCATE TABLE</title><content type='html'>&lt;p&gt;A bug titled &lt;em&gt;DELETE FROM inconsistency for NDB&lt;/em&gt; (&lt;a href="http://bugs.mysql.com/bug.php?id=19066"&gt;Bug#19066&lt;/a&gt;) dropped into my lap, and while fixing it, we had to make some hard decisions on what should be considered the "correct" way to solve this.&lt;/p&gt;

&lt;p&gt;The bug is related to the difference between &lt;code&gt;TRUNCATE TABLE&lt;/code&gt; and &lt;code&gt;DELETE FROM&lt;/code&gt; with no &lt;code&gt;WHERE&lt;/code&gt; clause.  On the surface, they seem to be equivalent, but when digging deeper, we will see that there is a big difference between the statements when replication comes into play.&lt;/p&gt;

&lt;p&gt;Before delving into the problem and the solution, I'll start by recapitulating some selected parts of the manual.
&lt;ul&gt;&lt;li&gt;The &lt;code&gt;TRUNCATE TABLE&lt;/code&gt; and &lt;code&gt;DELETE FROM&lt;/code&gt; with no condition are "logically equivalent": &lt;q cite="http://dev.mysql.com/doc/refman/5.1/en/truncate.html"&gt;&lt;code&gt;TRUNCATE TABLE&lt;/code&gt; empties a table completely. Logically, this is equivalent to a &lt;code&gt;DELETE&lt;/code&gt; statement that deletes all rows, but there are practical differences under some circumstances.&lt;/q&gt; (&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/truncate.html"&gt;MySQL 5.1 Reference Manual, Section 13.2.9&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;TRUNCATE TABLE&lt;/code&gt; is implemented as a &lt;code&gt;DROP + CREATE&lt;/code&gt;: &lt;q cite="http://dev.mysql.com/doc/refman/5.1/en/truncate.html"&gt;Truncate operations drop and re-create the table, which is much faster than deleting rows one by one.&lt;/q&gt; (&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/truncate.html"&gt;MySQL 5.1 Reference Manual, Section 13.2.9&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;DELETE FROM&lt;/code&gt; without a condition will delete each row in the table: &lt;q&gt;...a &lt;code&gt;DELETE&lt;/code&gt; statement with no &lt;code&gt;WHERE&lt;/code&gt; clause deletes all rows.&lt;/q&gt;(&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/delete.html"&gt;MySQL 5.1 Reference Manual, Section 13.2.1&lt;/a&gt;)&lt;/li&gt;&lt;/ul&gt;
In other words, both statements are supposed to empty the table. When dealing with replication, however, the concept of logical equivalence is pushed to the limit. Normally, the operational behaviour and the post-condition are what we replicants use to guide us on what a certain statement should do.  So what are the operational behaviour and the post-condition of the two statements?&lt;/p&gt;

&lt;p&gt;The operational behaviour of &lt;code&gt;TRUNCATE TABLE&lt;/code&gt; (let's just call it &lt;em&gt;TRUNCATE&lt;/em&gt; henceforth) is to drop the table and re-create it, and the post-condition is an empty table.  In contrast, the operational behaviour of a &lt;code&gt;DELETE FROM&lt;/code&gt; with no &lt;code&gt;WHERE&lt;/code&gt; clause (let's call it &lt;em&gt;DELETE-ALL&lt;/em&gt; henceforth) is to remove each row from the table, and the post-condition is that every row that was in the table is removed.&lt;/p&gt;

&lt;p&gt;Wait a minute... there is no difference between these two. In both cases, the end result is an empty table! Isn't it?&lt;/p&gt;

&lt;p&gt;Yes, that is correct... if you are running an isolated statement and executing the operations on the original table. Isolation at the statement level is typically achieved using locks (either table locks or row locks), so there will usually not be any other process interfering with the operation.&lt;/p&gt;

&lt;p&gt;Now, let us see what it is that makes these two statements behave differently even though they superficially appear to be equivalent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NDB Cluster swings, but misses.&lt;/strong&gt; The original bug report mentions NDB, which is special for this particular bug in that the storage engine &lt;em&gt;does not have (global) table locks&lt;/em&gt; [sic].  When NDB deletes the rows of a table using DELETE-ALL, it does it row-by-row.  If another client starts to insert rows into that table, it &lt;em&gt;might&lt;/em&gt; be that those rows remain in the table after the DELETE-ALL has completed. So the table will not be empty.&lt;/p&gt;

&lt;p&gt;Interestingly enough, the TRUNCATE for NDB is implemented as a DELETE-ALL, so &lt;em&gt;neither&lt;/em&gt; of the statements will empty the table if another client starts inserting rows.&lt;/p&gt;
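&lt;p&gt;The interleaving can be pictured as two client sessions. This is only an illustrative timeline, not output from a real cluster; the table &lt;code&gt;t1&lt;/code&gt; and the inserted value are made-up names:&lt;/p&gt;

&lt;pre class="code"&gt;
-- Client 1:
DELETE FROM t1;              -- NDB removes the rows one by one

-- Client 2, while the statement from client 1 is still running:
INSERT INTO t1 VALUES (42);

-- Client 1's DELETE completes; the row from client 2 may still be in t1.
&lt;/pre&gt;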

&lt;p&gt;&lt;strong&gt;Circular replication grapples and throws.&lt;/strong&gt; Suppose that you have a setup like the multi-master setup described by &lt;a href="http://www.onlamp.com/pub/a/onlamp/2006/04/20/advanced-mysql-replication.html"&gt;Giuseppe Maxia&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this case, the replication can progress in a circle. Suppose that you're adding rows to one of the servers (A) in the circle, while at the same time trying to empty the same table on another of the servers (B).  Does it make sense to remove the inserted row together with all the rows that were present in B? Not really... that would make B delete things that it has no idea about. It could well be that the administrator checked the table visually, decided that there's nothing there that was needed, and issued a DELETE-ALL.  It doesn't help if the operator locks the table to make sure that nobody changes it, since the lock only applies on B.  In this sense, DELETE-ALL should delete what the server knows about, and nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-source replication is standing by to finish the job.&lt;/strong&gt; We don't have multi-source replication yet, but we will have it in the near future. When we have it and use it, there is a huge difference between emptying the table and deleting all rows in the table.  Reading data from several different masters and inserting it into one table is a convenient way to aggregate data from, e.g., several different branches of a company (where the table contains a branch id as well, to avoid conflicts). If one branch decides to drop all the data it has, only the rows related to that branch should be deleted, not data about all branches.&lt;/p&gt;
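&lt;p&gt;To make the distinction concrete, here is a sketch with a hypothetical aggregated table. The table and column names are invented for illustration, and since multi-source replication does not exist yet, the comments describe the behaviour argued for above, not something that has been run:&lt;/p&gt;

&lt;pre class="code"&gt;
-- Hypothetical table kept on each branch master and on the aggregating slave.
CREATE TABLE sales (
  branch_id INT NOT NULL,
  order_id  INT NOT NULL,
  amount    DECIMAL(10,2),
  PRIMARY KEY (branch_id, order_id)
);

-- Executed on the master of branch 7:
DELETE FROM sales;     -- intended: remove only the rows this master knows about,
                       -- that is, the rows of branch 7 on the aggregating slave
TRUNCATE TABLE sales;  -- intended: leave an empty table everywhere
&lt;/pre&gt;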

&lt;h3&gt;The moral of the story&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;If you want an empty table, you should use &lt;code&gt;TRUNCATE TABLE&lt;/code&gt;, which is guaranteed to empty the table regardless of how the replication is set up.&lt;/li&gt;
&lt;li&gt;If you want to delete the rows that are in the table (on the master), use &lt;code&gt;DELETE FROM&lt;/code&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-114777553373079720?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/114777553373079720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=114777553373079720' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114777553373079720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114777553373079720'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/05/replication-of-delete-from-versus.html' title='Replication of DELETE FROM versus TRUNCATE TABLE'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-114606226346417485</id><published>2006-04-26T13:19:00.000+02:00</published><updated>2006-09-25T08:32:03.110+02:00</updated><title type='text'>Row-based replication for the future</title><content type='html'>I just read Eric Bergen's blog on &lt;a href="http://ebergen.net/wordpress/2006/04/25/row-based-replication-and-application-developers/"&gt;row-based replication and application development&lt;/a&gt; from the presentation of &lt;a href="http://mysqluc.com/cs/mysqluc2006/view/e_sess/8178"&gt;Row-based replication&lt;/a&gt; at the User's Conference. 
Eric describes all kinds of new ways to use replication, for example that you can now configure replication to replicate only the changes to one of the tables in a multi-table statement.  He is, however, missing the most important aspect: everything that we can do now and will be able to do in the future that we couldn't do with just statement-based replication. Here are some things that you can do with row-based replication that were not possible with statement-based replication.&lt;p&gt;

&lt;span style="font-weight: bold;"&gt;Cluster Replication&lt;/span&gt;

Cluster replication is already in 5.1, but it's worth mentioning since it could not be handled with statement-based replication. Inside the cluster, everything is rows: rows are passed back and forth between the nodes and rows are collected to form the results of queries.

If you are inserting data into the tables using the MySQL NDB Cluster handler &lt;code&gt;ha_ndbcluster&lt;/code&gt;, it will be replicated as usual and both statement-based and row-based replication will work. However, if you are using the NDB API to insert rows into the cluster, the rows are "lost" since the server never sees those rows. To solve this problem, we invented an "injector" whose sole purpose is to inject rows into the binary log. The injector is created inside &lt;code&gt;ha_ndbcluster&lt;/code&gt; and the rows that are inserted into the cluster are injected into the binary log.&lt;p&gt;

&lt;span style="font-weight: bold;"&gt;Replication of individual partitions to different servers&lt;/span&gt;

This is not in any version of the server and there are not (yet) any plans to add it, but it's something that is feasible with row-based replication. Eric mentions ways to separate the rows going to different tables, and only replicate one of the tables. This is not limited to replicating different tables to different servers; you could even replicate different &lt;span style="font-style: italic;"&gt;parts&lt;/span&gt; of the &lt;span style="font-style: italic;"&gt;same&lt;/span&gt; table to different servers. For example, assume that you want to partition your data depending on how frequently it is accessed. Now, imagine that you could set up partitions for your table of blogs like this:
&lt;blockquote&gt;&lt;pre class="code"&gt;CREATE TABLE blogs (id INT ..., freq INT, ...)
  PARTITION BY RANGE (freq) (
    PARTITION blog0 VALUES LESS THAN (10),
    PARTITION blog1 VALUES LESS THAN (100),
    ...
    PARTITION blog5 VALUES LESS THAN (1000000)
  );
&lt;/pre&gt;&lt;/blockquote&gt;
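
To make the range boundaries concrete, here is a small Python sketch of how a row's freq value maps to a partition, and from there to a class of server. It is only an illustration: the bounds between blog1 and blog5 are elided in the table definition, so the powers of ten used here are an assumption, and the names partition_for and server_for are made up for this example.

```python
import bisect

# Exclusive upper bounds for blog0..blog5. Only 10, 100 and 1000000 come from
# the table definition; the values in between are assumed for illustration.
UPPER_BOUNDS = [10, 100, 1000, 10000, 100000, 1000000]


def partition_for(freq):
    """Return the partition a row lands in, following VALUES LESS THAN semantics."""
    # bisect_right counts the bounds that are not greater than freq, which is
    # exactly the index of the first partition whose bound exceeds freq.
    index = bisect.bisect_right(UPPER_BOUNDS, freq)
    if index == len(UPPER_BOUNDS):
        raise ValueError("freq %d does not fit in any partition" % freq)
    return "blog%d" % index


def server_for(freq):
    """Hypothetical routing rule: the busiest partition goes to high-end servers."""
    return "high-end" if partition_for(freq) == "blog5" else "low-end"


print(partition_for(5), server_for(5))
print(partition_for(10), server_for(10))
print(partition_for(500000), server_for(500000))
```

Note that a freq of exactly 10 belongs to blog1, not blog0, since each bound is exclusive.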
Further imagine that you could set up replication to replicate the different partitions to different servers. We could, for example assume that we wanted to place the &lt;span style="font-family:courier new;"&gt;blog5&lt;/span&gt; partition on high-end servers dedicated for handling high loads, while the less frequently accessed blogs would be placed on low-end servers. Each access would then also update the statistics, causing the row to move to other servers as it becomes less frequently accessed. To prevent the row from moving back and forth between partitions when it is bordering one of the ranges, we could do the partitioning on the result of calling a UDF, which implements hysteresis by taking trends into account.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-114606226346417485?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/114606226346417485/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=114606226346417485' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114606226346417485'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114606226346417485'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/04/row-based-replication-for-future.html' title='Row-based replication for the future'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-114563029187199497</id><published>2006-04-21T09:26:00.000+02:00</published><updated>2006-04-27T14:38:59.190+02:00</updated><title type='text'>Replication of ALTER TABLE with AUTO_INCREMENT</title><content type='html'>Just got &lt;a href="http://bugs.mysql.com/bug.php?id=16993"&gt;Bug#16993&lt;/a&gt; closed, which provides a good lesson in the complications of replication.  The original title was &lt;em&gt;RBR: ALTER TABLE ZEROFILL AUTO_INCREMENT is not replicated correctly&lt;/em&gt;, but the problem is related neither to row-based replication (RBR) nor to &lt;code&gt;ZEROFILL&lt;/code&gt;. The culprit is adding an &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; column to a table, and it is not actually a bug (to be frank, it depends on your point of view, but "fixing" this bug causes more headaches than it solves, as you'll see in a moment).

What the example code in the bug description does is create a table on the slave and the master, but insert (identical) rows into the tables in &lt;em&gt;different order&lt;/em&gt; on the slave and the master; for example, in this way:

&lt;blockquote&gt;&lt;pre class="code"&gt;
master&amp;gt; CREATE TABLE ages(name CHAR(30), age INT);
master&amp;gt; SET SQL_LOG_BIN=FALSE;
master&amp;gt; INSERT INTO ages SET name='Mats', age=37;
master&amp;gt; INSERT INTO ages SET name='Lill', age=25;
master&amp;gt; INSERT INTO ages SET name='Jon',  age=4;
master&amp;gt; SET SQL_LOG_BIN=TRUE;
 slave&amp;gt; INSERT INTO ages SET name='Mats', age=37;
 slave&amp;gt; INSERT INTO ages SET name='Jon',  age=4;
 slave&amp;gt; INSERT INTO ages SET name='Lill', age=25;
&lt;/pre&gt;&lt;/blockquote&gt;

Now, if you look at the tables, you will see that the same rows are present in both tables (I'm just showing the result on one of the servers since they are identical):

&lt;blockquote&gt;&lt;pre class="code"&gt;
mysql&amp;gt; SELECT * FROM ages ORDER BY name, age;
+------+------+
| name | age  |
+------+------+
| Jon  |    4 |
| Lill |   25 |
| Mats |   37 |
+------+------+
3 rows in set (0.00 sec)
&lt;/pre&gt;&lt;/blockquote&gt;

Why the &lt;code&gt;ORDER BY&lt;/code&gt;? Well, the rows could potentially be listed in different order on the master and the slave, for one of the following reasons:
&lt;ul&gt;
&lt;li&gt;If this were a real database, we would have been running the master and the slave separately for a while before deciding that we should replicate them and, even though the tables contain the same rows now, the rows would potentially be listed in different order.&lt;/li&gt;
&lt;li&gt;We are using NDB Cluster as storage engine, and there we have no guarantee on the order of the rows, regardless of the order they were inserted.&lt;/li&gt;
&lt;/ul&gt;

Now we (or management) decide that we need to assign a unique id to each person in the table. That is easy: just add a column with an &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; option. Since we have replication running, the same change will be made to the table on the slave.

&lt;blockquote&gt;&lt;pre class="code"&gt;
master&amp;gt; ALTER TABLE ages
    ..&amp;gt;   ADD id INT AUTO_INCREMENT PRIMARY KEY;
&lt;/pre&gt;&lt;/blockquote&gt;

To our surprise, the people are not assigned the same ids on the master as on the slave. Look here:

&lt;blockquote&gt;&lt;pre class="code"&gt;
master&amp;gt; SELECT * FROM ages ORDER BY name, age;
+------+------+----+
| name | age  | id |
+------+------+----+
| Jon  |    4 |  3 |
| Lill |   25 |  2 |
| Mats |   37 |  1 |
+------+------+----+

slave&amp;gt; SELECT * FROM ages ORDER BY name, age;
+------+------+----+
| name | age  | id |
+------+------+----+
| Jon  |    4 |  2 |
| Lill |   25 |  3 |
| Mats |   37 |  1 |
+------+------+----+
&lt;/pre&gt;&lt;/blockquote&gt;
So, what is happening under the hood? When executing an &lt;code&gt;ALTER TABLE&lt;/code&gt;:
&lt;ol&gt;
&lt;li&gt;a new table is created with the extra column&lt;/li&gt;
&lt;li&gt;the rows are copied over one by one from the old table to the new table &lt;em&gt;in a storage engine-specific order&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;the old table is dropped&lt;/li&gt;
&lt;li&gt;the new table is renamed to the name of the old table&lt;/li&gt;
&lt;/ol&gt;
Observe that the &lt;code&gt;ALTER TABLE&lt;/code&gt; statement is replicated as a statement, so it gets executed independently on both the master and the slave, and each server copies its rows in its own order. So, if we're using MyISAM as storage engine, the rows are copied &lt;em&gt;in the order they were inserted&lt;/em&gt;, which differs between the two servers here.  For other storage engines, you get other row orders.
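
The effect is easy to reproduce outside MySQL. The following Python sketch uses the standard sqlite3 module as a stand-in for the server, so it only mimics the four steps above: SQLite's AUTOINCREMENT plays the role of AUTO_INCREMENT, and an explicit ORDER BY rowid models a storage engine that copies rows in insertion order. The function name is made up for the example.

```python
import sqlite3


def add_id_in_storage_order(insert_order):
    """Mimic the table-copying ALTER TABLE, copying rows in insertion order."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ages (name TEXT, age INTEGER)")
    db.executemany("INSERT INTO ages VALUES (?, ?)", insert_order)
    # Step 1: a new table with the extra column.
    db.execute(
        "CREATE TABLE new_ages ("
        "name TEXT, age INTEGER, id INTEGER PRIMARY KEY AUTOINCREMENT)"
    )
    # Step 2: copy the rows one by one in storage order (modelled by rowid).
    db.execute(
        "INSERT INTO new_ages (name, age) "
        "SELECT name, age FROM ages ORDER BY rowid"
    )
    # Steps 3 and 4: drop the old table and rename the new one.
    db.execute("DROP TABLE ages")
    db.execute("ALTER TABLE new_ages RENAME TO ages")
    ids = dict(db.execute("SELECT name, id FROM ages ORDER BY name").fetchall())
    db.close()
    return ids


# Same rows, different insertion order, exactly as in the example above.
master = add_id_in_storage_order([("Mats", 37), ("Lill", 25), ("Jon", 4)])
slave = add_id_in_storage_order([("Mats", 37), ("Jon", 4), ("Lill", 25)])
print(master)  # {'Jon': 3, 'Lill': 2, 'Mats': 1}
print(slave)   # {'Jon': 2, 'Lill': 3, 'Mats': 1}
```

Both "servers" hold the same rows before the change, yet Jon and Lill end up with swapped ids afterwards.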

So much for the problem. What about the solution, and why is this not considered a bug?  I will start with the solution, since it explains why we decided not to produce a bug fix for it, but rather to add a caveat to the documentation.

The solution is simple: sort the rows before adding them to the table. To do this, you have to repeat the steps above yourself, but add an &lt;code&gt;ORDER BY&lt;/code&gt; clause when inserting the rows. So, the steps to add a column are:&lt;ol&gt;
&lt;li&gt;Create a new table with an extra column. Since you might want to add the new column at an arbitrary place, I give a generic solution.&lt;blockquote&gt;&lt;pre class="code"&gt;CREATE TABLE new_ages LIKE ages;
ALTER TABLE new_ages
  ADD id INT AUTO_INCREMENT PRIMARY KEY;
&lt;/pre&gt;&lt;/blockquote&gt;&lt;/li&gt;
&lt;li&gt;Copy the rows into the new table.&lt;blockquote&gt;&lt;pre class="code"&gt;INSERT INTO new_ages(name,age)
  SELECT name,age FROM ages ORDER BY name,age;
&lt;/pre&gt;&lt;/blockquote&gt;&lt;/li&gt;
&lt;li&gt;Drop the old table.&lt;blockquote&gt;&lt;pre class="code"&gt;DROP TABLE ages;
&lt;/pre&gt;&lt;/blockquote&gt;&lt;/li&gt;
&lt;li&gt;Rename the new table to the name of the old table.&lt;blockquote&gt;&lt;pre class="code"&gt;ALTER TABLE new_ages RENAME ages;
&lt;/pre&gt;&lt;/blockquote&gt;&lt;/li&gt;
&lt;/ol&gt;
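
To check that sorting really makes the ids agree, here is the same kind of experiment in Python, again with the standard sqlite3 module standing in for the server (an assumption made purely for illustration; the function name is made up). The only difference from an unsorted copy is the ORDER BY name, age when copying the rows.

```python
import sqlite3


def add_id_sorted(insert_order):
    """Follow the four manual steps, copying the rows in sorted order."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ages (name TEXT, age INTEGER)")
    db.executemany("INSERT INTO ages VALUES (?, ?)", insert_order)
    # Step 1: create the new table with the extra column.
    db.execute(
        "CREATE TABLE new_ages ("
        "name TEXT, age INTEGER, id INTEGER PRIMARY KEY AUTOINCREMENT)"
    )
    # Step 2: copy the rows, sorted so that every server uses the same order.
    db.execute(
        "INSERT INTO new_ages (name, age) "
        "SELECT name, age FROM ages ORDER BY name, age"
    )
    # Steps 3 and 4: drop the old table and rename the new one.
    db.execute("DROP TABLE ages")
    db.execute("ALTER TABLE new_ages RENAME TO ages")
    ids = dict(db.execute("SELECT name, id FROM ages ORDER BY name").fetchall())
    db.close()
    return ids


master = add_id_sorted([("Mats", 37), ("Lill", 25), ("Jon", 4)])
slave = add_id_sorted([("Mats", 37), ("Jon", 4), ("Lill", 25)])
print(master)           # {'Jon': 1, 'Lill': 2, 'Mats': 3}
print(master == slave)  # True
```

The ids now depend only on the contents of the rows, not on the order in which each server happened to store them.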
Now, why do we not write a patch to fix this "bug"?  Simply put, the &lt;code&gt;ORDER BY&lt;/code&gt; involves a complete sort of the table so implementing a patch to do it this way would force sorting the table for &lt;em&gt;every&lt;/em&gt; execution of an &lt;code&gt;ALTER TABLE&lt;/code&gt;. In addition, it might not be necessary to sort the rows on &lt;em&gt;every&lt;/em&gt; column in the table.  Forcing this solution on every user that needs to do an &lt;code&gt;ALTER TABLE&lt;/code&gt; would be a bigger evil than leaving the implementation as it is.

And of course, if you don't care where the new column is, you can replace the first two steps with:&lt;blockquote&gt;&lt;pre class="code"&gt;CREATE TABLE new_ages(id INT AUTO_INCREMENT PRIMARY KEY)
  SELECT name,age FROM ages ORDER BY name,age;&lt;/pre&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-114563029187199497?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/114563029187199497/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=114563029187199497' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114563029187199497'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114563029187199497'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/04/replication-of-alter-table-with.html' title='Replication of ALTER TABLE with AUTO_INCREMENT'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-114322997578785377</id><published>2006-03-24T20:12:00.000+01:00</published><updated>2006-03-24T20:52:55.810+01:00</updated><title type='text'>Row-based replication and user defined functions</title><content type='html'>A little more than a year ago, I was hired to implement row-based replication to the MySQL database server.  Since the principles are easy enough, I thought this would be a straightforward task to be done in a few months, tops.  
As always, I quickly got punished for my hubris: getting basic row-based replication up and running was relatively straightforward but, as the saying goes, &lt;q&gt;the devil is in the details.&lt;/q&gt; Row-based replication is now safely tucked away in MySQL 5.1 for anybody who wishes to use it, but the obvious question is then &lt;q&gt;what does it give me and why should I use it?&lt;/q&gt;

When using statement-based replication, the replication is accomplished by sending the actual SQL statements to the slave server, which executes them directly.  That works fine for most statements, but in some situations it does not work as expected.  This time, we will only look into one such situation. Consider the following SQL statements:
&lt;pre style="font-size: 85%;"&gt;
UPDATE account
 SET   balance = balance + 100
 WHERE name = "Sakila";
INSERT INTO transactions
      VALUES ('Sakila', 'deposit', 100, UUID(), NULL);
&lt;/pre&gt;
The purpose is to update the balance and add a line to the transaction log to keep track of all the transactions done to the accounts, including the id of the clerk that performed the transaction. This works fine when executed on the master, but when the SQL thread on the slave executes the statements, &lt;code&gt;UUID()&lt;/code&gt; is evaluated again on the slave and yields a different value than on the master, which is really not what we want.  It is, of course, possible to handle this for the built-in SQL functions, but for user-defined functions (UDFs), there is no chance at all that we can make it work.

Instead of logging the entire statement, we can log just the row change made to the table, and this is what row-based replication is about.  That way, we will insert &lt;em&gt;exactly&lt;/em&gt; the same row at the slave as we did on the master.
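
The difference can be sketched in a few lines of Python. This is a toy model, not how the server works internally: the tables are plain lists, run_insert is a made-up stand-in for executing the INSERT, and uuid.uuid4() stands in for the UUID() function.

```python
import uuid


def run_insert(table):
    """Execute the INSERT; the UUID is generated wherever the statement runs."""
    row = ("Sakila", "deposit", 100, str(uuid.uuid4()))
    table.append(row)
    return row


# Statement-based: the slave re-executes the statement and gets its own UUID.
master_sbr, slave_sbr = [], []
run_insert(master_sbr)
run_insert(slave_sbr)

# Row-based: the master logs the row it wrote and the slave applies that row.
master_rbr, slave_rbr = [], []
logged_row = run_insert(master_rbr)
slave_rbr.append(logged_row)

print(master_sbr == slave_sbr)  # False: the two servers have diverged
print(master_rbr == slave_rbr)  # True: identical rows on both servers
```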

To control the logging format, we introduced a new server variable &lt;code&gt;BINLOG_FORMAT&lt;/code&gt; that can take the values &lt;code&gt;STATEMENT&lt;/code&gt;, &lt;code&gt;MIXED&lt;/code&gt;, and &lt;code&gt;ROW&lt;/code&gt;.  The formats &lt;code&gt;STATEMENT&lt;/code&gt; and &lt;code&gt;ROW&lt;/code&gt; do what you expect, using statement-based replication or row-based replication respectively, while the &lt;code&gt;MIXED&lt;/code&gt; mode temporarily switches to row-based replication for a statement if it uses a function that would give different results when executed on the master and by the SQL thread on the slave.
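
As a rough mental model of the MIXED mode, you can think of the decision as something like the Python sketch below. It is a deliberate simplification and not the server's actual logic: the set of function names is an illustrative assumption (MY_UDF is a made-up UDF name), and the real server does not scan statement text with a regular expression.

```python
import re

# Illustrative assumption: functions whose result may differ on the slave.
# MY_UDF is a made-up name; the server's real list is different and longer.
UNSAFE_FUNCTIONS = {"UUID", "MY_UDF"}


def logging_format(statement, binlog_format):
    """Return the format used to log one statement under the given setting."""
    if binlog_format != "MIXED":
        # STATEMENT and ROW always log in their own format.
        return binlog_format
    # Collect every identifier that is followed by an opening parenthesis.
    called = {
        name.upper()
        for name in re.findall(r"([A-Za-z_]\w*)\s*\(", statement)
    }
    return "ROW" if called.intersection(UNSAFE_FUNCTIONS) else "STATEMENT"


update = "UPDATE account SET balance = balance + 100 WHERE name = 'Sakila'"
insert = (
    "INSERT INTO transactions "
    "VALUES ('Sakila', 'deposit', 100, UUID(), NULL)"
)
print(logging_format(update, "MIXED"))  # STATEMENT
print(logging_format(insert, "MIXED"))  # ROW
```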

So, assume that I've got the following two tables:
&lt;pre style="font-size: 85%;"&gt;
master&amp;gt; describe transactions;
+--------+------------------------------+------+-----+-------------------+-------+
| Field  | Type                         | Null | Key | Default           | Extra |
+--------+------------------------------+------+-----+-------------------+-------+
| name   | char(20)                     | YES  |     |                   |       |
| kind   | enum('deposit','withdrawal') | YES  |     |                   |       |
| amount | decimal(10,2)                | YES  |     |                   |       |
| clerk  | int(11)                      | YES  |     |                   |       |
| time   | timestamp                    | YES  |     | CURRENT_TIMESTAMP |       |
+--------+------------------------------+------+-----+-------------------+-------+
5 rows in set (0.00 sec)

master&amp;gt; describe account;
+---------+---------------+------+-----+---------+-------+
| Field   | Type          | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| name    | char(20)      | YES  |     |         |       |
| balance | decimal(10,2) | YES  |     |         |       |
+---------+---------------+------+-----+---------+-------+
2 rows in set (0.01 sec)
&lt;/pre&gt;
Here is how to set the &lt;code&gt;MIXED&lt;/code&gt; binlog format and execute the statements above:
&lt;pre style="font-size: 85%;"&gt;
master&amp;gt; SET BINLOG_FORMAT=MIXED;
Query OK, 0 rows affected (0.00 sec)

master&amp;gt; UPDATE account
    -&amp;gt; SET balance = balance + 100
    -&amp;gt; WHERE name = 'Sakila';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master&amp;gt; INSERT INTO transactions
    -&amp;gt; VALUES ('Sakila', 'deposit', 100, UUID(), NULL);
Query OK, 1 row affected, 1 warning (0.00 sec)

master&amp;gt; SHOW BINLOG EVENTS FROM 975;
+-------------------+------+------------+-----------+-------------+------------------------------------------------------------------------------+
| Log_name          | Pos  | Event_type | Server_id | End_log_pos | Info                                                                         |
+-------------------+------+------------+-----------+-------------+------------------------------------------------------------------------------+
| master-bin.000001 | 975  | Query      | 1         | 1102        | use `test`; UPDATE account SET balance = balance + 100 WHERE name = 'Sakila' |
| master-bin.000001 | 1102 | Table_map  | 1         | 1155        | table_id: 17 (test.transactions)                                             |
| master-bin.000001 | 1155 | Write_rows | 1         | 1206        | table_id: 17 flags: STMT_END_F                                               |
+-------------------+------+------------+-----------+-------------+------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
&lt;/pre&gt;
The &lt;code&gt;UPDATE&lt;/code&gt; statement is logged as a statement using a query event, while the &lt;code&gt;INSERT&lt;/code&gt; statement is logged using a table map event and an event containing the rows inserted (one, in this case).

I'll explain more about the table map events and the various forms of row-containing events, but for now you have to trust me when I say that there are rows in the &lt;code&gt;Write_rows&lt;/code&gt; event. :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-114322997578785377?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/114322997578785377/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=114322997578785377' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114322997578785377'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114322997578785377'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/03/row-based-replication-and-user-defined.html' title='Row-based replication and user defined functions'/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-23496029.post-114293211341864768</id><published>2006-03-21T09:35:00.000+01:00</published><updated>2006-03-21T20:29:58.463+01:00</updated><title type='text'></title><content type='html'>First day back from the Developers Conference in Sorrento, and the first post to this blog (thought I might just as well get started, I've been thinking about it for a while but always been distracted). 
Since I've been to Italy before, I knew part of what was waiting: excellent food, terrific coffee, and nice, friendly people. What I wasn't prepared for was the intense discussions that were continuously going on. Last year, in Prague, it was to a large part just the down-to-earth work of getting the code ready for the 5.0 release; this time, however, the conference was vibrant with ideas for the future and ways to leverage the skills of the people in MySQL to produce even better services.

Before the conference, I was mostly busy getting the row-based replication in shape for shipping. All in all, I would say that it holds up very well. Since it's a fresh feature, there is (of course) a list of bugs, but these are mostly annoyances like the occasional extra events in the binary log that don't really affect the result but are not needed. I feel I have pretty good control of the code right now, and know where the problematic spots are.

There's, however, nothing like community testing to find the really difficult problems. I really want this feature to be rock-solid, so I'm prepared to get swamped with bug reports.

Please give me your worst. :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/23496029-114293211341864768?l=mysqlmusings.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mysqlmusings.blogspot.com/feeds/114293211341864768/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=23496029&amp;postID=114293211341864768' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114293211341864768'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/23496029/posts/default/114293211341864768'/><link rel='alternate' type='text/html' href='http://mysqlmusings.blogspot.com/2006/03/first-day-back-from-developers.html' title=''/><author><name>Mats Kindahl</name><uri>http://www.blogger.com/profile/07528917029894926261</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
