Monday, June 18, 2007

BLOB locators + BLOB streaming + Replication = Yeah!

On the MySQL Conference & Expo 2007, I had the chance of meeting up with Paul (the author of PBXT) and Mikael. We briefly touched the topic of the BLOB Streaming Protocol that Paul is working on, which I find really neat. On the way back home, I traveled with Anders Karlsson (one of MySQL:s Sales Engineers), who is responsible for the BLOB Locator worklog and he described the concepts from his viewpoint.

Since I work with replication, these things got me thinking on what the impact is for replication and how it affects usability, efficiency, and scale-out. Being a RESTful guy, I started thinking about URIs both when Paul described the BLOB Streaming Protocol and when Anders starting describing the BLOB Locators. Apparently, I wasn't the only one.

Combining BLOB Locators with the BLOB Streaming Protocol has a significant impact on the scalability and performance of replication, and I'm going to show how by giving a typical use of replication: scaling out reads from a installation by replicating from a single master to several slaves.

Now, when a client connects to get a blob from the database, the server delivers a result set containing one or more blob locators. Since we are using URI:s and the HTTP protocol, the blobs can be served by a normal web server, and the client can fetch the data in the Blobs using HTTP and build the real result set. The existence of blob locators is completely transparent to the client, who sees no difference from the previous implementation.

Now, what does this give us that make this setup so scalable?

  • Instead of storing the actual blob data, we store a reference to the data (in the form of an URI). When working with the blob and copying it to another table, we will actually just copy the reference, which is a very quick operation compared to the size of most blobs. The use of the BLOB locator is entirely transparent to any operations on the blob: reading is not affected, and changing the blob can be accomplished using a copy-on-write semantics (which of course makes the operation slower).

  • Since we have a unique reference to a blob it is possible to implement caching mechanisms to cache results of, e.g., fulltext searches in the blobs.

  • The use of an URI makes the blob locator server-agnostic, which means that we can reliably replicate the URI instead of the blob and still expect any client that connects to the slave server to be able to fetch the blob using HTTP. There is no translation necessary when doing the replication, and the URI can be treated as just a string. This means that a scale-out strategy is trivial to implement. This is just a generalization of the recommended practice to store the blobs as files on a server, and save the file name in the tables instead: we just make it transparent to the user and simplify the deployment.

  • By using an URI as reference, we can put the blob data on a separate server, which can be dedicated to delivering blob data to requesters. Since everything is going via this server, it is very likely that "hot" data is available immediately, and since we are using an URI, delivery over the Internet can rely on Web Caches to avoid re-sending data that is already cached somewhere.

    We do not lose the ability to count the number of deliveries of the data, since we can always count the number of blob locators that we have been delivered instead of the number of BLOBs that have been delivered.

  • The HTTP protocol has support to both PUT and GET to read and write data to the server.

  • We unload a significant amount of "dumb" job from the server, that of assembling result sets consisting of blob data and other data, and therefore allow the server to perform more of the "intelligent" job of doing database searches.

  • The design is incredibly flexible since it is possible to, for example, allowing the blob server(s) to be placed anywhere, even in different towns, and can still keep the main operating site in one location.

Post on replication poll was lost

My last post on the replication poll was apparently lost from Planet MySQL. If you're interested, I commented on the replication poll, our future plans, and how they were affected by the poll.

Sunday, June 03, 2007

The replication poll and our plans for the future

We've been running replication poll and we've got some answers, so I thought I would comment a little on the results of the poll and what our future plans with respect to replication is as a result of the feedback. As I commented in the previous post, there are some items that require a significant development effort, but the feedback we got helps us to prioritize.

The top five items from the poll above stands out, so I thought that I would comment on each of them in turn. The results of the poll were (when this post were written):

Online check that Master and Slave tables are consistent 45.4%
Multi-source replication: replicating from several masters to one slave 36.3%
Multi-threaded application of data on slave to improve performance 29.2%
Conflict resolution: earlier replicated rows are not applied 21.0%
Semi-synchronous replication: transaction copied to slave before commit 20.3%

Online check that Master and Slave tables are consistent

The most natural way to check that tables are consistent is to compute a hash of the contents of the table and then compare that with a hash of the same table on the slave. There are storage engines that have support for incrementally computing a hash, and for the other cases, the Table checksum that was released by Baron "Xaprb" can be used. The problem is to do the comparison while the replication is running, since any change to the table between computing the hash on the master and the slave will indicate that the tables are different when they in reality are not. To solve this, we are planning to introduce support to transfer the table hash and perform the check while replication is running. By adding the hash to the binary log, we have a computed hash for a table at a certain point in time, and the slave can then compare the contents of the tables as it see the event, being sure that this is the same (relative) point in time as it were on the master. We will probably add a default hash function for those engines that do not have something, and allow storage engines to return a hash of the data in the table (probably computed incrementally for efficiency).

Multi-source replication: replicating from several masters to one slave

This is something that actually was started a while ago, but for several reasons is not finished yet. A large portion of the work is actually done, but since the code is a tad old (enough to be non-trivial to incorporate into the current clone), there is some work remaining to actually close this one. Since there seems to be a considerable interest in this, both at the poll and at the MySQL Conference, we are considering finishing off this feature sometime in the aftermath of 5.1 GA. No promises here, though. There's a lot of things that we need to consider to build a high-quality replication solution, and we're a tad strained when it comes to manpower in the team.

Multi-threaded application of data on slave to improve performance

This is something that we really want to do, but which we really do not have the manpower for currently. It is a significant amount of work, it would be a huge improvement of the replication, but it would utterly make us unable to do anything else for a significant period. Sorry folks, but however much I would like to see this happen, it would be irresponsible to promise that we will implement this in the near future. There are also some changes going on internally with the threading model, so it might be easier to implement in the near future.

Conflict resolution: earlier replicated rows are not applied

When multi-source comes into the picture, it is inevitable that some form of conflict resolution will be needed. We are currently working on providing a simple version of timestamp-based conflict resolution in the form of "latest change wins". This is ongoing work, so you will see it in a post-5.1 release in the near future.

Semi-synchronous replication: transaction copied to slave before commit

There is already a MySQL 4 patch for this written by folks at Google Code under the Mysql4Patches work. The idea is to not commit the ongoing transaction until the entire transaction has been successfully transferred to at least one slave. The reason for this is that it should be possible to switch to a slave in the event of a failure of the master, so it has to be certain that the transaction exists somewhere else (at least in disk). We consider this as very important for our ongoing work of being the best on-line database server for modern applications, so you will probably see it pretty soon. Compared to the patch above, we would like to generalize it slightly to allow it to be configurable how many slave should have received it before the transaction is committed. This will of course reduce performance of the master, but it will provide better redundancy in the case of a serious failure and it is a minor addition to the work anyway.

Binary log event checksum

In addition to the things we mentioned above, this is very important to both find and repair problems with replication. The relay log is a potential source of problem, as is the code that writes the events to the relay log, so it is prudent to add a simple CRC checksum to each event to check the integrity of the event. Sadly enough, this does not exist currently, so we're trying to make this get into the code base as soon as possible, maybe even for 5.1 (keep your fingers crossed). This is not a promise: we're doing what we can, but there are no guarantees.