Secrets of the ANTs Data Server 03 -- Mutli-Version Concurrency Control

The speciality of ADS was its ability to handle large numbers of concurrent transactions, all modifying the data. On the 4-core and 8-core hardware of the day, ADS could handle one or two orders of magnitude more active connections than other commercial databases such as Oracle. Part of the reason that it could do this was the use of multi-version concurrency control (MVCC) at the cell level --that is, at the level of one column data item in one row. It was possible for multiple transactions to modify the same row simultaneously as long as they didn’t have any conflicts.

In fact, it was possible for multiple transactions to modify the same cell if the modifications used an operator that was commutative and associative. That is, given a set of values with a binary operation such that

then you can do a sequence of updates in any order and the results will be the same. We called these “commutative updates” and I’ll describe them in more detail later. ADS only implemented addition as a commutative operator on the theory that it would speed up financial transactions and similar things. We also considered implementing identity updates (that is, updating a cell the the same value it already has) but I don’t think we ever did.

Other databases have used MVCC, but to my knowledge, none of them have allowed simultaneous updates at the cell level, none of them implemented commutative updates (which was patented by ANTs), and all of them allow for uncommitted transactions to go to disk. ADS required the entire transaction space to fit in (virtual) memory, which allowed certain optimizations in the way that updates were handled.

By not allowing uncommitted transactions to go to disk, ADS was relieved of the complexity and overhead of creating on-disk structures for updated-but-not-committed data, but there are disadvantages. This strategy doesn’t really limit the size of your transactions if you are willing to have a large enough swap space, but when you let your work exceed the size of RAM, you are leaving it up to the operating system to page memory in and out for you, and the operating system doesn’t usually have enough information to do a good job of this --or not a as good a job as you could do yourself with your knowledge of what’s going on. ADS actually went to considerable trouble to make sure that it didn’t hit the swap space except for very large transactions and in those cases ADS lost most of its performance advantages.

Since the ADS work, 64-bit machines have become more prevalent and the server operating systems now all have memory-mapped files, so I’ve been contemplating how we might have changed the implementation to use memory-mapped files if they’d been ready when we started. Maybe I’ll write up some speculation about it later.

The following will describe the in-memory data structures and algorithms used to implement MVCC in ADS. It is based on my own memory and I haven’t seen the source code in half a dozen years, so it may have some inaccuracies, but I it should give you a general idea of how it worked.

In-memory Row Structure

ADS rows were in 64KB pages. This is quite a bit larger than more traditional disk-reading programs which typically use page sizes of 8KB or less. The justification for using such large reads is that modern disks read in chunks on the order of a megabyte so it doesn’t cost more to read 64K off the disk. It does still costs more to transfer a bigger chunk over the I/O bus but the I/O bus is very fast these days and relatively faster with larger chunks so in those cases where we only need a small piece of data, reading 64KB really doesn’t cost that much more than reading 8KB. But in cases where we need all 64KB, doing one read can make a big difference vs. doing 8 reads. Furthermore, if our data is laid out well, we will tend to use more than one data item close together, so once we read a 64KB page to use one data item, there is a good chance that we will need it again for another data item. The chances of this happening with smaller pages is smaller. And finally, using 64KB pages allows us to handle much larger data items without doing anything special for multi-page items.

All cells in a row were 8 bytes, so the size of a row was a fixed size of 8 bytes per column plus a rowid, some bitfields, and a rowlock used when deleting a row. Rows were allocated in a page starting at the low-numbered end and counting up. Variable-sized data was kept in the same 64KB page but was allocated from the high-numbered end of the page counting down.

For example, suppose that you have a table with two columns: an integer, i, and a varchar, s. Each row will be represented by three 64-bit numbers: a header field, a 64-bit integer, and a 64-bit pointer to a piece of memory representing the varchar (when on-disk, the pointer is replaced with an offset into the page). If a page has three rows:

i	s
0	‘abc’
1	‘def’
2	‘ghi’

Then the 64K page might look something like this.

Note that the first row is pointing to the last string because rows are allocated from the front and strings from the end. Each string consists of a length and a sequence of characters so they could hold binary data.

The pointers could be represented either as a real machine pointer (in which case the string could be anywhere in memory, or as an offset into the page. ADS had a bit field in the row to distinguish these two cases, but the bit field wasn’t necessary. It is easy to ensure that the lowest 64KB in memory are not used for string pointers so all string pointers would be represented by integers larger than 64K. If you do this then you can distinguish between page offsets and pointers just by looking at the number. If the number is less than 64K then it’s an offset, otherwise it’s a pointer.

The first eight bytes in each row was used to hold a bit field containing two 1-bit flags for each cell in the row. The first flag was to distinguish pointers from offsets as discussed above. The second flag was to distinguish the regular on-disk representation of the cell from a cell with an update or history. This eight bytes of bit field supported 32 columns (64 bits at 2-bits per column). If the table had more than 32 columns then after column 32 there would be another 8 bytes of bit field added after column 32 for additional columns.

Update Blocks and History Blocks

Suppose the third row in the example above is to be updated to change the 2 to 14. Originally the row would look like this

To indicate the update we create an update block that contains the new value, 14, and a history block containing the old value 2. Then we replace the value 2 with a pointer to this structure and change the flag to indicate that the first cell is a pointer to an update block rather than a value:

The hnext field points to the next block in the history list. The snext field points to the next update block in the session list or a chain of all history blocks to be garbage collected together. All update blocks for a particular transaction were linked together by the snext field. The ts field is used to timestamp the history. I’m not sure what it was set to at this point because there will be no timestamp until the session doing the update commits.

There were additional fields as well, including hprev and sprev fields associated with hnext and snext to form doubly-linked lists. There was also a pointer to another object that had various transaction-specific information including three function pointers to handle the operations of commit, rollback, and cleanup specifically for the transaction.

Non-Blocking Reads

Reading and writing the flag and the value field has to be done in a safe manner. If we were to set the value field to point to the update block before setting the bit telling what that field is, then a reader could come in between the two assignments and pick up the pointer, thinking it was an integer value. If we set the bit first, then the reader could come in between and pick up an integer thinking it was a pointer.

I don’t recall the details of how this was handled, but it was probably something like this: the writer would set the value field first, and then the pointer. The reader would read the flag first. If the flag says it is a pointer, then the reader can read the value field knowing it is a pointer because there was no regular operation to go back the other way. If the flag says it is an integer, then the reader reads the value field. After reading the value field, the reader checks the flag again. If the flag has not changed, then the reader has an integer. If the flag has changed, then the reader has to read the value field again. This time it knows that it has picked up the pointer because it would have been set to a pointer before the flag was changed.

If the reader finds that it has a pointer to an update block then it checks to see if the update block has the same session ID as the reader (this is another field that I left off the diagram because I don’t recall if it was in the update block itself or reached indirectly). If the session ID of the reader matches the session ID of the update block, then the reader returns the value in the update block, 14. Otherwise, the reader traverses the list of history blocks and finds the one that was current when the reader’s transaction started. I’ll explain later how that happens but in this example, there is only one history block in the list, which indicates that there are no updates that are more recent than the most recently started transaction. Therefore the reader returns the value in the history block.

Blocking and Conflicts

If a writer from session A tries to update a cell with an update block from session A, then this was allowed. The result of the transaction when finally committed will be the last update that the session made in the transaction.

If a writer from session A tries to update a cell with an update block from session B, then session A has to block on session B. This is necessary because there may be another cell where B tries to update a cell that already has an update from session A. If both sessions commit, then we don’t know what consistent value to assign to the two cells to make it appear that one transaction happened before the other.

If session A is blocked on session B and then B tries to block on session A, this would create a deadlock. In this case, session B gets rolled back and restarted. This aborting and restarting is a fairly heavy-weight solution. It is possible that the transaction that gets aborted has already done a lot of work and not only do we lose this work, we have to do additional work to roll back. There was a fallback mechanism to prevent permanent starvation of this type: eventually a transaction would just lock the entire table to do its updates if it had been rolled back often enough.

Finally, If session A tries to merely read (not write) a cell that has an update block from session B it would normally just go through and read the history. But if A is currently blocked on B then it can’t do this because if A is blocked on B then this means that there is another cell somewhere such that if B commits, then A will read the value the B put in the cell, but if B rolls back then A will read the historic value. Since we don’t know whether A logically happens after B or if B logically never happens, we don’t have enough information to decide whether to read the value in B’s update block or read the historic value. In this case, the task suspends just as if it had been a writer.

Committing

If the session commits its transaction, then it goes through the snext field of all of its update blocks and turns them into history blocks by changing the type field, setting the timestamp field to the current timestamp and making a few other changes. The resulting structure looks like this:

All blocks were updated in one atomic operation as I will describe in the section on phased execution. For future reads, if the reading session started after TS then it will read the new history block. If it started before TS then it will read the next history block. The block will be cleaned up at some future time when there is no longer any active transaction that may need the value.

The new value is in an in-memory format so it can’t be saved to disk. ADS would commit without writing the updated page to disk but it would write the new value to the transaction log so that the most recent committed state could be recovered in case of a crash.

Rolling Back

When a transaction rolled back, it would traverse the snext links, removing the update blocks and leaving a structure like this:

The update block would then be freed back to the storage management system.

Commutative Updates

Two writers that want to write to the same cell don’t always conflict in ADS. A statement like

UPDATE t SET i = i+8 WHERE ...

can stack up with another statement such as

UPDATE t SET i = i-7 WHERE ...

Using the same row as in the examples above, the first UPDATE will produce this structure:

Notice that the type of the update block is “upd+”, not just “upd”. This means that it is a commutative update with addition. Also, the val field contains, not the new value, but the value that is to be added.

Now when the next update statement comes in, the writer will see that it is looking at a commutative update of the same type that it wants to add, so it just adds it:

A reader that comes in now has to go all the way to the history block without hitting any conflicts before it can return. When one of the writers rolled back, it just removed the commutative update block as before. The trick happens on commit. On commit, ADS would move the newly committed update block down to just before the first history block, and change it to a proper history block with a value. No locking was necessary because of phased execution. There were also upper and lower bounds on the value for each history block but I don’t recall what these were used for.