MySQL Musings: The missing pieces in the protobuf binary log

Protobuf comes with a minor problem: it does not have support for handling "type tagged structures", that is, something reminiscent of objects in OOP lingo, so if one is going to have a heterogeneous sequences of messages, you have to roll it yourself. For that reason, I added a transport frame for the messages in the binary log that wraps each with some extra information. In addition to allowing the binary log to be a sequence of messages, it also adds some integrity-checking data and simplifies some administrative tasks.

Transport frame with message
Length
Type Tag
Message
Checksum

The format of each message in the sequences is given in the table in the margin. where the length is a specially encoded length that we will go through below, type is a single byte being the type tag, message being one of the messages given in the specification, and checksum being a checksum to ensure the integrity of the transport.

Checksum. As checksum, the plan is to use a CRC-32. We don't want it to be too large to affect performance, and we want it to catch reasonable losses of integrity. I'm considering storing this as a varint after the actual message, but for the time being, it is given as 4 raw bytes (it is not implemented at all yet). Please give me feedback on this: if we make it a varint, we can stuff the checksum in there, but that will also run the risk of not being able to read the checksum due to corruption of the checksum, so offhand, I would say that a fixed number of bytes is preferable.

Type Tag. The type tag is a single byte giving the type of the event. This means that we are limited to 255 events, but considering that we don't even have 26 events in 5.1 right now, I don't see that we will run into that limit very soon. It is possible to put the type tag in the message as well, and specifying it as an enum inside the protobuf specification, but that will just provide the information in two places, so it is better to keep it separately.

Length. The length is the length of the message, that is, it does not include the type tag, checksum, nor the length itself. This simplifies the normal processing, and in the event that one needs to skip an event, it is easy to compute the next position of a transport frame by just decoding the length (see below).

Length encoding

The length is encoded using a special scheme to allow for very little overhead for small events while still leaving room for giving the length for very large events. This scheme currently allows for a compact representation of lengths from 2 bytes 4 GiB (Gigabytes) and, if you don't need to have a compact representation of 2, you can represent lengths in the range 3 bytes to to 16 EiB (Exabytes) [sic].

The basic idea is to note that the length of a message can never be zero, and the minimal length in this case is actually 16 bytes. Since we will never have a length that is less than 16 for an event, that leaves the lengths 0-16 available for denoting other information. The obvious solution is to let the first byte denote either the length, if it is, say greater than 8, or the number of bytes following the byte that gives the length if it is less than or equal to 8, but we can actually do better.

Storing Length requires ceil(log₂₅₆(Length)) bytes, so if we let the first byte L be

ceil(log₂(ceil(log₂₅₆(Length)))) - 1

(that is, the smallest power of two that is greater than the number of bytes needed for the length, minus 1), we can get away with reserving significantly fewer values. So, for example:

Bytes					Value
2A					42
00	FF	02			767
01	FF	01	01	00	66047

Computing the number of bytes can be done by computing 1 << (L + 1), but computing the inverse is a little more involved. The following two functions does the job. Although it looks like a lot of code, length_encode() is actually only 29 instructions on my machine (no function calls), while length_decode() is about 7 instructions. The trick here is to compute the logarithm at the same time as we serialize the bytes into the memory, so the overhead compared to just serializing the length is just a few instructions.

inline unsigned char *
length_encode(size_t length, unsigned char *buf)
{
  unsigned char *ptr= buf;
  assert(length > 1);
  if (length < 256)
    *ptr++= length & 0xFF;
  else {
    int_fast8_t log2m1= -1;        // ceil(log2(ptr - buf)) - 1
    uint_fast8_t pow2= 1;          // pow2(log2m1 + 1)
    while (length > 0) {
      // Check the invariants
      assert(pow2 == (1 << (log2m1 + 1)));
      assert((ptr - buf) <= (1 << (log2m1 + 1)));

      // Write the least significant byte of the current
      // length. Prefix increment is used to make space for the first
      // byte that will hold log2m1.
      *++ptr= length & 0xFF;
      length >>= 8;

      // Ensure the invariant holds by correcting it if it doesn't,
      // that is, the number of bytes written is greater than the
      // nearest power of two.
      if (ptr - buf > pow2) {
        ++log2m1;
        pow2 <<= 1;
      }
    }
    // Clear the remaining bytes up to the next power of two
    while (++ptr < buf + pow2 + 1)
      *ptr= 0;
    *buf= log2m1;
    assert(ptr == buf + pow2 + 1);
  }
  return ptr;
}

inline const unsigned char *
length_decode(const unsigned char *buf, size_t *plen)
{
  if (*buf > 1) {
    *plen = *buf;
    return buf + 1;
  }

  size_t bytes= 1U << (*buf + 1);
  const unsigned char *ptr= buf + 1;
  size_t length= 0;
  for (unsigned int i = 0 ; i < bytes ; ++i)
    length |= *ptr++ << (8 * i);
  *plen= length;
  return ptr;
}

MySQL Musings

Wednesday, August 20, 2008

The missing pieces in the protobuf binary log

Length encoding

No comments: