Tuesday, September 16, 2014

MySQL Central: It's that time of the year

It's that time of the year again: yes, Oracle Open World is coming up and with that I'll be travelling to San Francisco. New for this year is that we are part of the main Open World event and therefore have our own MySQL Central. Here you will have the opportunity of meeting many of the engineers behind MySQL, discuss technical problems you have, and also learn some about how we look at the future of the MySQL ecosystem.

This year, me and Narayanan Venkateswaran will be presenting two sessions:

Elastic Scalability in MySQL Fabric with OpenStack (Thursday, Oct 2, 1:15 PM-2:00 PM in Moscone South, 252): In this session you will see how Fabric can use the new provisioning support to fetch servers from an OpenStack instance. The presentation will cover how to use the provisioning support to fetch servers from OpenStack Nova, OpenStack Trove, and also from Amazon AWS. You will also learn about the provisioning interface and how you can use it to create your own hardware registry support.
MySQL Fabric: High Availability at Different Levels (Wednesday, Oct 1, 2:00 PM-2:45 PM in Moscone South - 250): MySQL Fabric is a distributed system that requires coordination among different components to provide high availability: connectors, servers, and MySQL Fabric nodes must be orchestrated to create a solution resilient in the face of failures. Ensuring that each component alone is fault-tolerant does not guarantee that applications will continue working in the event of a failure. In this session, you will learn how all components in MySQL Fabric were designed to provide a high-availability solution and how they cooperate to achieve this goal. The presentation shows you how to create your own infrastructure to monitor MySQL servers and manage manual switchover or automatic failover operations together with MySQL Fabric.

As every year, it's going to be fun to meet all the people in the MySQL community, both new and old, so I'm looking forward to meeting you all there.

Tuesday, May 27, 2014

MySQL Fabric: Musings on Release 1.4.3

As you might have noticed in the press release, we just released MySQL Utilities 1.4.3, containing MySQL Fabric, as a General Availability (GA) release. This concludes the first chapter of the MySQL Fabric story.

It all started with the idea that it should be as easy to manage and setup a distributed deployments with MySQL servers as it is to manage the MySQL servers themselves. We also noted that some of the features that were most interesting were sharding and high-availability. Since we also recognized that every user had different needs and needed to customize the solution, we set of to create a framework that would support sharding and high-availability, but also other solutions.

With the release of 1.4.3, we have a range of features that are now available to the community, and all under an open source license and wrapped in an easy-to-use package:

High-availability support using built-in slave promotion in a master-slave configuration.
A framework with an execution machinery, monitoring, and interfaces to support management of large server farms.
Sharding using hash and range sharding. Range sharding is currently limited to integers, but hash sharding support anything that looks like a string.
Shard management support to move and split shards.
Support for failure detectors, both built-in and custom ones.
Connectors with built-in load balancing and fail-over in the event of a master failure.

Beyond MySQL Fabric 1.4.3

As the MySQL Fabric story develop, we have a number of challenges ahead.

Loss-less Fail-over. MySQL 5.7 have extended the support for semi-sync so that transactions that are not replicated to a slave server will not be committed. With this support, we can truly have a loss-less fail-over so that you cannot lose a transaction if a single server fails.

More Fabric-aware connectors. We currently have support for Connector/J, Connector/PHP, and Connector/Python, but one common request is to have support for a Fabric-aware C API. This is both for applications developed using C/C++, but also to add Fabric support to connectors based on the MySQL C API, such as the Perl and Ruby connector.

Multi-Node Fabric Instance. Many have pointed out that the Fabric node is a single point of failure, and it is instead a single node, but if the Fabric node goes down, the system do not stop working. Since the connectors cache the data, they can "run on the cache" for the time it takes for the Fabric node to be brought up again. Procedures being executed will stop, but once the Fabric node is on-line again, execution will resume from where it left off. To ensure that the meta-data (the information about the servers in the farm) is not lost in the event of a machine failure, MySQL Cluster can be used as storage engine, and will then ensure that your meta-data is safe.

There are, however, a few advantages in having support for multiple Fabric nodes:

The most obvious advantage is that execution can fail-over to another node and there will be no interruption in the execution of procedures. If the fail-over is built-in, you avoid the need for external clusterware to manage several Fabric nodes.
If you have several Fabric nodes available to deliver data, you improve responsiveness to bursts in meta-data requests. This can happen if you have a large bunch of connectors brought on-line at the same time.
If you have multiple data centers, having a local version of the data to serve the applications deployed in the same center improve locality of data and avoid an unnecessary round-trip over WAN to fetch some meta-data.
With several nodes to execute management procedures, you can improve scaling by being able to execute several management procedures in parallel. This would require some solution to avoid that that procedures do no step over each other.

Location Awareness. In deployments spread over several data-centers, the location of all the components suddenly become important. There is no reason for a connector to be directed to a remote server when a local one suffices, but that require some sort of location awareness in the model allowing the location of servers (or other components) to be given.

Extending the model by adding data centers is not enough though. The location of components withing a data center might be important. For example, if a connector is located in a particular rack in the data center, going to a different rack to fetch data might be undesirable. For this reason, the location awareness need to be hierarchical and support several levels, e.g., continent, city, data center, hall, rack, etc.

Multi-Shard Queries. Sharding can improve performance significantly since it split the data horizontally across several machines and each query therefore go directly to the right shard of the data. In some cases, however, you also need to send queries to multiple shards. There are a few reasons for this:

You do not have the shard key available, so you want to search all servers for some object of interest. This of course affect performance, but in some cases there are few alternatives. Consider, for example, searching for a person given name and address when the database is sharded on the SSN.
You want to generate a report of several items in the database, for example, find all customers above 50 that have more than 2 cars.
You want a summary of some statistic over the database, for example, generate a histogram over the age of all your customers.

Session Consistency Guarantees. As Alfranio point out, when you use multiple servers in your farm, and transactions are sent to different servers at different times, it might well be that you write one transaction that goes to the master of a group and then try to read something from the same group. If the write transactions have not reached the server that you read from, then you might get an incorrect result from your transaction. In some cases, this is fine, but in other cases, you have certain guarantees that you want to have on your session. For example, you want to ensure that anything you write will also be available when you read in transactions following the write, you might want to guarantee that multiple reads read later data all the time (called "read monotonicity"), or other forms of guarantees on the result sets you get back from the distributed database. This might require connectors to wait for transactions to reach slaves before reading, but this should be transparent to the application.

This is just a small set of the possibilities for the future, so it is really going to be interesting to see how the MySQL Fabric story develops.

You can download MySQL Utilities (which includes MySQL Fabric) from http://dev.mysql.com/downloads/tools/utilities
You can read MySQL Utilities documentation at http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html
You can report bugs or request features on http://bugs.mysql.com
MySQL Forum Fabric, Sharding, HA, Utilities

Tuesday, April 29, 2014

MySQL Fabric: Tales and Tails from Percona Live

Going to Percona Live and presenting MySQL Fabric gave me the opportunity to meet a lot of people and get a lot of good feedback. I talked to developers from many different companies and got a lot of great feedback that will affect the priorities we make, so to all I spoke to I would like to say a great "Thank you!" for the interesting discussions that we had. Your feedback is very valuable. It was very interesting to read the comments on MySQL Fabric on MySQL Performance Blog. The article discuss the current version of MySQL Fabric distributed with MySQL Utilities and give some brief points on features of MySQL Fabric. I think it could be good to give some context to some of the points they raise, both to elaborate on the points and show what they mean in reality, and also to give some background to how we were thinking around these points.

The Art of Framing the Fabric

It was a deliberate decision to make MySQL Fabric extensible, so it is not surprising that it have the feel of a framework. By making MySQL Fabric extensible, we allow community and users to explore ideas or add user-specific support.

In the MySQL Team at Oracle we are strong believers in the open source model and are working hard to keep it that way. There are many reasons to why we believe in this model, but one of the reasons is that we do not believe that one size fit all. For any users, there are always minor variations or tweaks that are required by the users own specific needs. This means that the ability to tweak and adapt the solution to their specific needs is very important. Without MySQL being open-source, this would not be possible. As you can see from WebScaleSQL, this is not just a theoretical exercise, this is how companies really use MySQL.

From the start, we therefore focused on building a framework and created the sharding and high-availability as plugins; granted, they are very important plugins, but they are nevertheless plugins. This took a little more effort, and a little more thinking, but by doing it this way we can ensure that the system is truly extensible for everybody.

Hey! I've got a server in my farm!

As noted, many if the issues related to high-availability and sharding require server-side support to get it really solid. This is also something we recognized quite early; the alternative would be to place the logic in the connectors or the Fabric node. We recognized that the right place to solve this is in the server, not in connector layer since that put a lot of complexity at the wrong place. Even if it was possible to handle everything in the connector, there is still a chance that something goes wrong if the constraints are not enforced in the server. This could be because of bugs, because of mistakes in the administration of the server, or any other number of reasons, so to build a solid solution, constraints on the data should be enforced by the servers and not in the connectors or in a proxy.

An example given is that there is no way to check that a row ends up in the right shard, which is very true. A generic solution to this would be to add CHECK constraint on the server, but unfortunately, this is a very big change in the server code-base. Adding triggers to the tables on the server is probably a good short-term solution, but that require managing and deploying extra code on all servers, which in turn is an additional burden on managing the servers, which is something we would like to avoid (the more "special" things you have to do with the servers, the higher the risk is of something going wrong).

On the proximity of things...

One of the central components of MySQL Fabric are the high-availability groups (or just groups, when it is clear from the context) that were discussed in an earlier post. The central idea around a group is that each group manages the same piece of data and MySQL Fabric is designed to handle and coordinate multiple groups into a federation of databases. The feature of being able to manage multiple groups is something that is critical to create a sharded system. On thing that is quite often raised is that it should be possible for a server to belong to multiple groups, but I think this comes from a misunderstanding on what a group represents. It is not a "replica set", which gives information about the topology, that is, how replication is set up, nor does it say anything about how the group is deployed. It is perfectly OK to have members of the group in different data centers (for geographical redundancy), and it is perfectly OK to have replication between groups to support, for example, functional partitioning. If a server belonged to two different groups, it would mean that it manages two different sets of data at the same time.

The fact that group members can be located in different data centers raises another important aspect, something that was often mentioned at Percona Live, that of managing the proximity of components in the system. There is some support for this in Hadoop where you have rack-awareness, but we need a slightly more flexible model. Imagine that you have a group set up with two servers in different data centers and you further have scale-out slaves attached locally. You have connectors deployed in both data centers, but when reading data you do not want to go to the other data center to execute the transaction, it should always be done locally. So, is it sufficient to be able to just have a simple grouping of the components? No, because you can have multiple levels of proximity, for example, data centers, continents, and even rooms or racks within a data center. You can also have different facets that you want to model, such as latency, throughput, or other properties that are interesting for particular uses. For that reason, whatever proximity model we deploy, it need to support a hierarchy and also have a more flexible cost model where you can model different aspects. Given that this problem have been raised several times on Percona Live and also by others, it is likely to be something we need to prioritize.

The crux of the problem

As most of you have already noted, there is a single Fabric node running that everybody talk to. Isn't this a single point of failure? It is indeed, but there is more to the story than just this. A single point of failure is a problem because if it goes down, so does the system... but in this case, it doesn't really go down, it will keep running most of the time.

The Fabric node does a lot of things: it keeps track of the status of all the components of the farm, execute procedures to handle fail-over, and deliver information about the farm on request. However, the connectors are the ones that route the transactions to the correct place, and to avoid having to ask the Fabric node about information each time, the connectors maintain caches. This means that in the event of a Fabric node failure, connectors might not even notice that it is gone unless they had to re-fill their caches. This means that if you restart the Fabric node, it will be able to serve the information again.

Another thing that stops when the Fabric node goes down is that no more fail-overs can be done and ongoing procedures are stopped in their tracks, which could potentially leave the farm in an unknown state. However, the state of the execution of any ongoing procedures are stored in the backing store, so when you bring up the Fabric node again, it will restore the procedures from the backing store and continue executing. This feature alone do not help against a complete loss of the machine where the Fabric node and the backing store are put, but, MySQL Fabric is not relying on specific storage engine features, any transactional engine will do, so by using MySQL Cluster as the storage engine it is possible to ensure safe-keeping of the state.

There are still good reasons to support multi-node Fabric instances:

If one Fabric node goes down, it should automatically fail over to another and continue execution. This will prevent any downtime in handling executions.
Detecting and bringing up a secondary Fabric node can become very complicated in the case of network partitions since it require handling split-brain scenarios reliably. It is then better to have this built into MySQL Fabric since it makes deployment and management significantly simpler.
Management of a farm does not put any significant pressure on the database back-end, but having a single Fabric node can be a bottleneck. In this case, it would be good to be able to execute multiple independent procedures on different Fabric nodes and coordinate the updates.
If a lot of connectors are required to fill their caches at the same time, we have a risk of a thundering herd. Having a set of Fabric nodes for read scale-out can then be beneficial.
If a group is deployed in two very remote data centers, it is desirable to have a local Fabric node for read-only purposes instead of having to go to the other data center.

More Fabric-aware Connectors

Currently we support connectors for Python, Java, and PHP, but one point that pop up quite often (both at Percona Live and elsewhere) is the lack of a Fabric-aware C connector. It is the basis for implementing both the Perl Database Interface MySQL driver DBD::mysql and for the Ruby connector, but is also desirable in itself for applications using C or C++ connector. All I can say at this point is that we are aware of the situation and know that it is something desired and important.

Interesting links

MySQL Fabric Setup using ndb Cluster

Tuesday, April 01, 2014

MySQL Fabric 1.4.2 Released

As you saw in the press release, MySQL Fabric 1.4.2 is now released! If you're interested in learning more about MySQL Fabric, there is a session April 3, 2014 11:10–12pm titled Sharding and Scale-out using MySQL Fabric in Ballroom G. MySQL Fabric is a relatively new project in the MySQL ecosystem and it focuses on building a framework for working with large deployments of MySQL Servers. The architecture of MySQL Fabric is such that it allows extensions to be added and the first two extensions that we added were support for high-availability using High-Availability groups (HA groups) and sharding to manage very large databases. The first version of sharding have hash and range sharding implemented as well as procedures for moving and splitting shards.
A critical part of working with a collection of servers is the ability to route transactions to the correct servers, and for efficiency reasons we quite early decided to put this routing logic into the connectors. This avoid one extra network hop and hence improve performance by reducing latency, but it does require that the connectors containing routing logic, caches, and support for fetching data from MySQL Fabric. Putting the routing logic into the connector also make it easy to extend the API to add new support that applications can require.
MySQL Fabric 1.4.2 is distributed as part of MySQL Utilities 1.4.2. To avoid confusion, we have changed the version numbering to match the version of MySQL Utilities it is distributed in.

You can download MySQL Utilities 1.4.2 from http://dev.mysql.com/downloads/tools/utilities
You can read MySQL Utilities 1.4.2 documentation at http://dev.mysql.com/doc/mysql-utilities/1.4/en/index.html

We have just done a few public releases, even though we did a few internal releases as well, but a brief history of our releases this far is:

MySQL Fabric 1.4.0
- First public release
- High-Availability groups for modeling farms
- Event-driven Executor for execution of management procedures.
- Simple failure detector with fail-over procedures.
- Hash and Range sharding allowing management of large databases.
- Shard move and shard split to support management of a sharded database.
- Connector interfaces to support federated database systems.
- Fabric-aware Connector/Python (labs)
- Fabric-aware Connector/J (labs)
- Fabric-aware Connector/PHP (labs)
MySQL Fabric 1.4.1
- More solid scale-out support in connectors and MySQL Fabric
- Improvements to the Executor to avoid stalling reads
- Connector/Python 1.2.0 containing:

Labs release of Connector/J with Fabric-support

MySQL Fabric 1.4.2

Credentials in MySQL Fabric
External failure reporting interfaces supporting external failure detectors
Support for unreliable failure detectors in MySQL Fabric
Credentials support in Connector/Python
Connector/Python 1.2.1 containing:

Failure reporting
Credentials Support

Connector/J 5.1.30 containing Fabric support

Do you want to participate?

There is a lot you can do if you want to help improve MySQL Fabric.

If you find bugs or want specific feature, please report a bug at http://bugs.mysql.com
MySQL Forum Fabric, Sharding, HA, Utilities

Blogs about MySQL Fabric

Monday, October 21, 2013

MySQL Fabric: High Availability Groups

As you might have noticed, we have released a framework for managing farms (or grids, as Justin suggested) of MySQL servers called MySQL Fabric. MySQL Fabric is focused on being easy to use and extensible, and two extensions are currently part of the framework: one to manage high-availability and one to implement sharding.

High-Availability Group

High-Availability Groups

One of the central concepts used to construct a farm is the high-availability group (or just group when there is no risk of confusion) and is introduced by the high-availability extension. As mentioned in the previous post, the group concept does not really represent anything new but is rather a formalization of how we think and work with the structure of the farm. The key to supporting high-availability is to have redundancy in the system: if one component fail, another one should be ready to pick up the job of the failing component. Hardening the systems (by using hardware less prone to fail or hardware with built-in redundancy) can help reduce the chance of a component failing, but not completely eliminate it. Even a hardened system is susceptible to failure in a power outage or an earthquake. With this in mind, we introduced the group concept for managing pieces of data in our farm:

each group consists of several machines that are responsible for managing the same piece of data.The concept of a group is an abstraction to model the basic concept that we're after, but does not really say anything about how it is implemented. This is intentional: it should be concrete enough to support all the operations we need, but abstract enough to not restrict how it is implemented. This is important because connectors (or any other "outside" observer) that work with groups should not have to be updated whenever new implementations are added. For example, it should not make a difference to a connector if the group is implemented using a traditional Master-Slave setup, a MySQL Cluster, or using replicated storage such as DRBD.

Server properties in groups

There are a few key properties that we assume for groups:

A server belong to (at most) one group.
At any time, each server in the group have a designated a status.
At any time, each server has a mode indicating if it accepts reads, writes, both, or neither.
Each server also has a weight, which is the relative power of the server and is used to balance the load.

Note that these properties might change over time, depending on events that happen.For handling load-balancing and high-availability the properties Status, Mode, and Weight where introduced. The mode and weight properties are used by a connector when it comes to deciding where to send a transaction, while the status property is used by the Fabric to keep track of the status of the server. Let's take a closer look at the properties.

Figure 1. Server Status

Server Status (or Role). The status of the server provide information about what the server is currently doing in the group. The status of a server is Fabric's view of the status of the server and changes as time passes and the Fabric notice changes. A primary server accept both a write and a read load and sending high-priority read transactions here mean that they get current data. A secondary server can handle reads but, in the case of a master-slave configuration, it should not accept writes since that would lead to a split-brain situation. Secondary servers are servers waiting to pick up the job of the primary if it fails. Spare servers do not accept reads nor writes, but are ready and running and can therefore change status in the group to replace other servers in the event of failures. In addition, spare servers can be used to handle reads.

In Figure 1 you can see an example of how servers could change status, but note that at this time, we do not track all states. For example, we are considering how to handle the provisioning of new servers in flexible and extensible way, but more about that in a separate post.

Server Mode. The mode of the server gives information on whether it can be read or written and provide information for the connector on how it should send queries. For now, we only have three modes: Offline, Read-only, and Read-Write. Offline servers cannot be read from or written to, and usually does not accept connections. Read-only servers can only be read from and write transactions should not be sent to these. Read-Write servers are usually primaries of the group. They can accept writes and will propagate them correctly to other servers in the group.

Server Weight. The weight of a server is used to balance the load between servers. The weight represent the relative power of the server. When balancing the load between servers, the connector will figure out what servers are eligible for accepting a transaction and then pick one of the servers in such a way that the distribution over time will be proportional to the weight of the server.

What's up with transactions?

So what is it with transactions that makes it necessary to declare the properties of the transaction before it actually? The reason is quite simple: the transaction is fed to the server one line at a time. This means that it is not always possible to know if it is a read-only or read-write transaction until the entire transaction has been seen.

An example is this transaction (yeah, it's a little constructed, but it's just an example):

START TRANSACTION;
SELECT salary INTO @salary FROM salaries
 WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
UPDATE titles SET to_date = CURRENT_DATE()
 WHERE emp_no = 20101 and to_date = DATE('9999-01-01');
UPDATE salaries SET to_date = CURRENT_DATE()
 WHERE emp_no = %s and to_date = DATE('9999-01-01');
INSERT INTO titles VALUES
  (20101, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'));
INSERT INTO salaries VALUES
  (20101, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'));
COMMIT;

On line 1 or 2, it is not possible to know if this is a read-only or a read-write transaction, it cannot be known until line 3, where an UPDATE is executed.

Transaction properties

As mentioned before, one of the goals is to support sharding in the presence of transactions and to make that work correctly, it is necessary to declare up-front what the transaction will contain. Not everything, but the key elements of the transaction: what tables it will access, what sharding key is used, and if it is a read-only or read-write transaction. The first two properties are only necessary if you are working with a sharded system, so we skip those for now; the last one, however, is important for handling load-balancing in the connector.When executing transactions using a Fabric-aware connector, you provide the information about the transaction using transaction properties. There are several properties available, but we will focus on the ones related to group handling: group and type. The group property is used to provide the name of the group you want to connect to (you can have several), and the type property is used to tell if this is a read-only or read-write transaction. In the future, we might add more properties such as priority to indicate that this is an urgent transaction and a prompt reply is needed. For example, the following code is using a Fabric-aware connector to promote an employee.

from mysql.connector.fabric import (
   TYPE_READWRITE,
)

def promote_employee(conn, emp_no):
    stmts = [
        ("SELECT salary INTO @salary FROM salaries"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("UPDATE titles SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s and to_date = DATE('9999-01-01')"),
        ("UPDATE salaries SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s and to_date = DATE('9999-01-01')"),
        ("INSERT INTO titles VALUES"
         " (%s, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'))"),
        ("INSERT INTO salaries VALUES"
         " (%s, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'))"),
        ]

    # Use the group for the ACME company
    conn.set_property('group', 'ACME')
    conn.set_property('type', TYPE_READWRITE)
    conn.start_transaction()
    cur = conn.cursor()
    for stmt in stmts:
        print "Executing:", stmt % (emp_no,)
        cur.execute(stmt, (emp_no,))
    conn.commit()

On line 20 and 21 you see how the properties of the transaction is set. In this case, we declare the group that we will access (for example, a fictional company "ACME") and also the type of the transaction. After that, a transaction is started as normal and executed. The Fabric-aware connector will pick the right server to send the transaction to and you will get the result back in the normal fashion.

Note that the property type is not yet implemented in Connector/Python, some work remains to make it support load-balancing fully.

Picking a server

But are these server and transaction properties sufficient for a connector to make a decision on what to do with a transaction? Let's take a look and see how the server can be selected.A server can be chosen by first selecting a set of candidates and then picking one of the candidates based on the weight of the server. Picking the candidates are done by matching the transaction properties and the server properties to find all server that are eligible for accepting the transaction. When a list of candidates are available you can, for example, pick one at random based on the weight of the servers. You can see an example Python code below that illustrates how this could be done. The first function find_candidates computes the set of candidates from the set of all servers SERVERS, while the second function pick_server pick one of the servers at random based on the weight of the server.

def find_candidates(props):
   candidates = []
   for srv in SERVERS:
      if props.group == srv.group and (props.mode & srv.mode):
         candidates.append(srv)
   return candidates

def pick_server(servers):
   random_weight = random() * sum(srv.weight for srv in servers)
   sum_weight = 0.0
   for idx, srv in enumerate(servers):
      sum_weight += srv.weight
      if sum_weight > random_weight:
         return servers[idx]
   return servers[-1]    # Last server in list

# Example code for picking a server based on transaction properties
pick_server(find_candidates(trans.props))

Implementation of groups

The reason to why we introduced the group concept in this manner is to be able to vary the implementation of a group, so the question is then, does it work? To see if it works, it is good to consider some sample implementations of high-availability groups and see if they can be described in this manner, so let's do that. Note that the only version that is currently implemented is the primary-secondary approach: the other ones are just food for thought (at this point).

The primary-secondary approach (also known as primary-backup or master-slave) is the traditional way to set up MySQL servers for high-availability. The idea is that there is a single primary managing the data and one or more secondaries that replicate the data from the primary and are ready to become primary in the event that the primary dies. In addition, there is a number of pure read slaves that are used to scale-out reads.

In this approach, the primary would be in read-write mode, and the secondaries could either be offline or in read mode. Secondaries cannot accept writes since that might cause a split-brain situation, but they can either be in read-only mode or offline. Not loading the servers with read-only transactions can make it easier for the secondaries to be up to date with the primary, but this depends on the general load on the system. Scale-out slaves added would then, of course, be pure read-only servers, and they cannot be promoted to be masters because they do not have the binary log enabled. However, if the primary master fails, they still need to fail-over to the new primary.

If (when?) the primary master fails, MySQL Fabric will detect the failure and start executing a procedure to promote one of the secondary master to be primary instead of the one that failed. MySQL Fabric have to do this because the servers do not know how to handle the fail-over themselves, and in addition it is necessary to inform the connectors about the new topology. In this procedure, the scale-out servers have to be moved over to the new primary as well.

Another popular solution for high-availability shared storage (for example, using shared network disks) or replicated storage (for example, using DRBD to replicate the block device). In this case, one of the server will be on-line, but the other will be on standby. For both DRBD and shared storage, it is necessary that the standby is completely offline and in the case of DRBD the server should not even be running on the standby machine. In addition to the primary and the secondary, you could have read slaves attached to the primary.

In this setup, the primary would then be a read-write server, while the standby server would be in offline mode. Scale-out servers would be in read-only mode, in this case attached to the primary.

Another approach is to use MySQL Cluster as a group. The cluster consists of several data nodes and employ a shared-nothing architecture to ensure high availability. In this case, all the servers will be both write and read servers, and all might be primaries. In the event that an NDB data node fails, the other nodes are always ready to pick up the job, so a MySQL Cluster group is self-managing. (There is an excellent overview of MySQL Cluster at http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-overview.html.)

The two solutions above employ different fail-over procedures that are executed by the Fabric node when it notices the failure. In contrast with the solutions above, MySQL Cluster is self-governing and does not require any fail-over handling implemented in the Fabric node.

Summary and considerations for the future

For the examples above, the properties we have outlined is definitely sufficient, but there might be other cases where more information is needed.One property that is missing in the current implementation is a way to select a server based on the proximity to the connector. For example, it could be possible to put the primaries and secondaries in a group in different data centers to ensure that it can handle a catastrophic failure. This, however, opens two issues:

There will be a set of read servers in each data center that should be connected to the primary or secondary in the same data center.
When the connector picks one of the candidates, it should prefer to use those in the same data center.

Both these issues mean that we need some measure of the distance between the servers and connectors so that when adding new scale-out servers it is not added in such a way that the same data is shipped several times between data centers, nor should a connector connect to a server in a different data center. Adding a complete matrix with distances between each and every server would not really work well, so it is likely that some other way to model the proximity is needed.Another case that might require some additional information is if the Fabric node fails or is unavailable temporarily (for instance, is restarting). Such an event should not block the entire system and since the connectors have information it would be possible to "run on the cache" for a brief period. The key issues here is that nothing can be updated, so each connector need to have a fallback plan in the event that a server fails. For example, if a master fails, the fail-over will not be executed, but it would still be possible to read information from the slaves.

Tuesday, October 08, 2013

MySQL Connect presentations on MySQL Fabric available on SlideShare

Going to MySQL Connect was truly a blast. We got a lot of good questions and feedback in the sessions and there were a lot of interest in both MySQL Fabric and the MySQL Applier for Hadoop.

A big thank you to all that attended the talks, I got a lot of good questions and comments that will help us build good solutions.

The talks are available on SlideShare:

Saturday, September 21, 2013

A Brief Introduction to MySQL Fabric

As you saw on the keynote, we are introducing an integrated framework for managing farms of MySQL servers with support for both high-availability and sharding. It should be noted that this is a very early alpha and that it at this point is not ready for production use.

MySQL Fabric is an integrated system for managing a collection of MySQL servers and is the framework on which high-availability and sharding is built. MySQL Fabric is open-source and is intended to be extensible, easy to use, and support procedure execution even in the presence of failure, an execution model we call resilient execution.

To ensure high-availability, it is necessary to have redundancy in the system. For database systems, the redundancy traditionally takes the form of having a primary server acting as a master and using replication to keep secondaries available to take over in case the primary fails. This means that the "server" that the application connects to is in reality a collection of servers, not a single server. In a similar manner, if the application is using a sharded database, it is in reality working with a collection of servers, not a single server. In this case, we refer to a collection of servers as a farm.

Now, just having a collection of servers does not really help us that much: it is necessary to have some structure imposed on the farm as well as an API to work with the farm, and this is where MySQL Fabric comes in.

Before going over the concepts, have a look at the farm below. In the figure, there is an application that want to connect to the farm and there are a set of database servers at the bottom organized into groups called high-availability groups. To manage the structure of the farm, there is a MySQL Fabric Node available that keeps, among other things, track of the meta-data as well as handles procedure execution. Both the application and an operator can connect to the Fabric node to get information about the farm being managed.

The MySQL Fabric nodes are responsible for keeping track of the structure of the farm as well as all information about their status and is also where any management procedures are executed. If a server fails, a the Fabric node will handle the procedure of promoting one of the slaves to be the new master, but it also contain the logic for handling shard moving and shard splitting.

High-Availability Group

High-availability Groups

The central concept to handling high-availability in Fabric are the high-availability groups. These are collections of servers that shall work together to deliver a database service to the application that connects. The groups are introduced to give structure to the farm and allow you to describe how the servers are organized to support the redundancy necessary to ensure high-availability.

Inside the group, all the servers manage the same data and hence have the same schema. In addition, each server have distinct roles and no server can belong to more than one group. The group concept does not really represent anything new: it is just a way to structure the farm in such a manner that managing it is easy to understand and the roles of servers are clear. To manage redundancy, a traditional Master-Slave Setup is used, a topology that we all are familiar with (it is often called the Primary-Secondary Approach, hence the names that follow). Each group have a primary that is is the master for all the data. Any queries that update data is sent here and that data is propagated to the other servers in the group. Redundancy is achieved by keeping one or more secondaries in the group that receive changes from the primary and are ready to take over the role as primary should the primary dissapear. To handle scale-out, the group also contain scale-out servers which are severs that receive changes from the primary but are not eligable for being promoted to primary. In the group, there are also spares, which are servers that are available for use but which are not assigned any active role yet.

Sharding

In addition to high-availability support, MySQL Fabric also offer support for sharding, which is a technique for handling very large databases and/or very high write loads. The database is split into a large number of shards, where each shard contain a fragment of the data in the database. Each shard is stored on a separate server (or a separate set of servers if you want to ensure high-availability) and the transactions are directed to each shard based on the shard key. Splitting the database in this way allow you both to manage a larger database by separating it onto more servers, but it also scale the write traffic because you can execute writes independently.

When using sharding, MySQL Fabric separate tables into sharded tables MySQL Fabric allow you to shard just some of the tables in the database and keep the other tables available on all shards, which we call global tablesand global tables. Since databases usually have multiple tables with foreign key relationships between, it is critical to be able to shard several tables the same way (but possibly on different columns), which is something that MySQL Fabric supports. Using this support, you can give all the tables that are sharded and what column should be used as the sharding key and MySQL Fabric will shard all tables and distribute the rows on the shards. Tables that are not sharded are the global tables and they will be available on all shards.

If you want to know more about how Fabric support sharding, or about sharding in general, you should come to the session MySQL Sharding: Tools and Best Practices for Horizontal Scaling, September 21, 4:00pm-5:00pm in Imperial Ballroom B.

Connecting to a MySQL Farm

To provide better control when working with a MySQL farm we have extended the connector API in such a manner that it hides the complexities of handling fail-over in the event of a server failure as well as dispatching transactions to shards correctly. There is currently support for Fabric-aware versions of Connector/J, Connector/PHP, Connector/Python as well as some rudimentary support for Hibernate and Doctrine. If you are interested in how the extensions to the interface look and how you can use them to scale your application, you should come to Scaling PHP Applications, September 22, 10:00am-11:00am in Union Square Room 3/4.

More information about MySQL Fabric

There are several blogs being published on high-availability and sharding from the developers working on the Fabric system or the connectors.

If you are interested in discussing and asking questions about MySQL Fabric, or sharding and high-availability in general, the forum on Fabric, Sharding, HA, Utilities is an excellent place for discussions. Also, if you are at MySQL Connect, going to MySQL Sharding, Replication, and HA (September 21, 5:30-6:30pm in Imperial Ballroom B) is an excellent opportunity to learn more about the project, meet us developers and team leads, and provide feedback to us. The BOF will cover several areas, some overlapping, but the discussion is likely to cover MySQL Fabric and MySQL replication.

MySQL Musings