Numeric B+tree reference

This article describes a new append-only, multiversion B+tree specialized for time-series data. The data structure has built-in compression tailored to time-series data, and it also stores aggregates that can be used to speed up some queries. A new crash recovery mechanism that doesn't rely on logging (a write-ahead log or a command log) was designed for this data structure.

Problems and goals

Many time-series storage systems use aggregation hierarchies or materialized views. E.g., with Graphite's Whisper database one can configure a set of rollups (http://graphite.readthedocs.io/en/latest/whisper.html#rollup-aggregation). Whisper creates lower-precision archives (one archive per rollup) and stores resampled data in these archives. Graphite can then use these archives to draw graphs, perform aggregations on data, and retrieve resampled time-series when needed. Data can be resampled using several functions: average, min, max, last, and sum. The problem with this approach is that the user has to think about rollups and retention policies in advance.

Related work

The TSM tree (used in InfluxDB) is based on the LSM-tree and inherits all its downsides. One of the main disadvantages of the LSM-tree is that compactions have to be performed from time to time. This can cause latency spikes and introduces additional write amplification.

OpenTSDB uses HBase as its primary storage, so it's essentially the same LSM-tree based design. Users of this database have to maintain a Hadoop deployment, which requires operational expertise.

UC Berkeley introduced BTrDB (Berkeley Tree Database) as part of the OpenBAS project. It's somewhat close to the NB+tree design because it precomputes statistics in the same way, but the data layout is a bit different. NB+tree stores data in fixed-size blocks, and each block covers a time range of variable size. BTrDB stores time ranges aligned to powers of two in each block, which makes it less suitable for irregular data sources. Also, the paper that describes the system doesn't provide any detail about the data retention policy.

Facebook recently introduced an in-memory time-series database called Gorilla and described the compression algorithms used in this system. Compression in InfluxDB is based on this paper. Akumuli uses a similar but more sophisticated and performant compression scheme. Gorilla uses Delta-Delta followed by hardcoded variable bit-width encoding for timestamps and encodes XOR'd floating-point values using variable bit-width encoding. This requires a lot of bit manipulation. Akumuli, on the other hand, uses byte-level encoding and thus works much faster.

Reference

NB+tree is an augmented B+tree optimized for numeric time-series data. Each tree is composed of fixed-size blocks (default: 4KB). Data points are stored in leaf nodes in compressed format. Each leaf node can store a variable number of data points (up to several thousand) and covers a time range of variable size. NB+tree is an append-only data structure: it's impossible to alter data inside tree nodes, and all updates are performed via path copying.

It is possible to have millions of NB+tree instances in one archive (one NB+tree instance per series) and to work with each of them separately. This helps to utilize the high internal parallelism of an SSD. NB+tree writes data using fixed-size aligned blocks and never overwrites anything, minimizing write amplification and pressure on the FTL (Flash Translation Layer) of the SSD.

Leaf nodes are grouped together by inner nodes. Each inner node contains up to 32 links to other inner nodes or leaf nodes. Each link to a subtree (leaf node or inner node) contains some aggregates. These aggregates are:

All these aggregates are composable. Unlike averages, medians, and percentiles, they can be calculated from other aggregates, so when we add a link to an inner node we don't need to examine all the leaf nodes in that subtree. Subtrees never overlap in the time range they cover, so we need to calculate aggregates from leaf nodes only at the edges of the search interval.
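As an illustration of this composability, here is a minimal sketch (the struct name and the exact set of fields are assumptions, not the actual Akumuli types) of how the aggregates of two adjacent subtrees could be merged without touching their leaf nodes:

    #include <algorithm>
    #include <cstdint>

    // Hypothetical aggregate record stored in a link to a subtree.
    struct SubtreeAggregates {
        uint64_t count;
        double   sum;
        double   min;
        double   max;
    };

    // Combine aggregates of two non-overlapping subtrees. Because the fields
    // are composable, the parent link can be computed from the child links
    // alone, without reading any leaf node.
    SubtreeAggregates combine(SubtreeAggregates const& a, SubtreeAggregates const& b) {
        return SubtreeAggregates{
            a.count + b.count,
            a.sum + b.sum,
            std::min(a.min, b.min),
            std::max(a.max, b.max),
        };
    }

An average over a subtree can still be derived as sum/count, but it is not stored directly, because averages of averages are not composable.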

All nodes (both leaf and inner nodes) have links to their predecessors. These backreferences are used during crash recovery.

(fig. 1)

Note that a back-reference is added only if the node is not the first child of its parent (fig. 1 shows this for leaf nodes, but the rule applies at every level of the tree). This allows us to use path copying to modify the tree (updates/deletes). E.g., to modify leaf 3 we need to copy leaves 3, 4, 5 and all higher-level nodes. The number of nodes modified during path copying is, in the worst case:

N = Fan_out_ratio * Tree_height

For example, with a fan-out of 32 and a tree of height 4, at most 128 nodes are copied.

Storage is composed of many NB+tree instances that share the same underlying archive (blocks from different trees are interleaved in one file). This complicates the crash recovery process because we can't use a command log or write-ahead log: a single log would have to be shared between many trees, which is hard to achieve, and a log per tree would defeat the purpose of this storage engine. NB+tree uses a novel crash recovery scheme that doesn't rely on logging. It uses external storage (sqlite in the current implementation) to store a small amount of data that is used by the recovery procedure.

Because NB+tree is an append-only data structure, updates and deletes create unreachable blocks. These garbage blocks aren't reclaimed by the existing implementation, and there are no plans for this because such events should occur rarely. The normal ingestion process doesn't generate any garbage.

Queries

The Scan query is a simple depth-first search that works the same way as a B+tree search. The search starts from the root of the tree, and the tree is traversed recursively from top to bottom. Depending on the query, the tree can be traversed in the forward or backward direction (in time order). During ingestion the tree can be split into several independent extents; in this case the scan is performed on each individual extent in order and the results are concatenated.

The Aggregate query works the same way as Scan, but instead of returning all data in the range it calculates aggregates. This query can use the aggregates stored inside inner nodes to prune the search space: it can use precalculated aggregates if an entire subtree lies inside the searched time range.
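The following sketch shows the pruning idea with a simplified, fully in-memory node layout (the types and function names are assumptions; real NB+tree leaf nodes are compressed blocks). The Scan query uses the same top-down traversal but collects data points instead of merging aggregates.

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <utility>
    #include <vector>

    using Timestamp = uint64_t;

    // Same shape of aggregates as in the earlier sketch, with helpers for
    // adding a single point and merging a precomputed subtree.
    struct Aggregates {
        uint64_t count = 0;
        double sum = 0.0;
        double min = std::numeric_limits<double>::infinity();
        double max = -std::numeric_limits<double>::infinity();

        void add(double v) {
            count += 1; sum += v;
            min = std::min(min, v); max = std::max(max, v);
        }
        void merge(Aggregates const& o) {
            count += o.count; sum += o.sum;
            min = std::min(min, o.min); max = std::max(max, o.max);
        }
    };

    // Simplified node: a real NB+tree leaf stores compressed data and an inner
    // node stores up to 32 links with aggregates; here everything is in memory.
    struct Node {
        Timestamp begin, end;   // time range covered by this subtree
        Aggregates agg;         // precomputed aggregates for the whole subtree
        bool is_leaf;
        std::vector<std::pair<Timestamp, double>> points;  // leaf payload
        std::vector<Node> children;                        // inner-node payload
    };

    void aggregate_query(Node const& node, Timestamp qbegin, Timestamp qend,
                         Aggregates& out) {
        if (node.end < qbegin || node.begin > qend) {
            return;               // subtree lies entirely outside the query range
        }
        if (qbegin <= node.begin && node.end <= qend) {
            out.merge(node.agg);  // fully covered: reuse the precomputed aggregates
            return;
        }
        if (node.is_leaf) {
            // Leaf at the edge of the interval: examine individual data points.
            for (auto const& p : node.points) {
                if (p.first >= qbegin && p.first <= qend) {
                    out.add(p.second);
                }
            }
            return;
        }
        for (auto const& child : node.children) {
            aggregate_query(child, qbegin, qend, out);  // descend, pruning as we go
        }
    }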

The Candlesticks query returns precalculated aggregates stored inside inner nodes at a particular tree level. Candlesticks have a variable step because each subtree or leaf node covers a variable time interval. This property limits their usability, but candlesticks can be used to quickly draw graphs using piecewise curve fitting (each candlestick has four data points: open, close, min, and max).

Trimming

NB+tree is designed to operate with limited disk space. Old data is removed and disk space is reclaimed without rebalancing the tree. Rebalancing each tree is infeasible because the number of trees in one database can be large. Old data is deleted by simply trimming the underlying storage (e.g. by truncating the file at the front or removing an entire archive). NB+tree tolerates this kind of data loss: if old blocks of the tree are not available, it is still possible to read and write new data.

(fig. 2)

In the picture above we can remove the first half of the archive and both trees will still be readable (because the root nodes are stored in the second half of the archive). An inner node (or root node) is always closer to the end of the archive than its children. This means that if an inner node becomes old (which can be detected by looking at its address), then all leaves referenced by it must be even older and can be safely removed.
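A short sketch of this rule, assuming block addresses grow monotonically toward the end of the archive (the names are hypothetical):

    #include <cstdint>
    #include <vector>

    using LogicAddr = uint64_t;   // block address inside the archive

    struct SubtreeRef {
        LogicAddr addr;           // where the child subtree is stored
        // ... time range, aggregates, etc.
    };

    // After the front of the archive has been trimmed, links whose addresses
    // fall before the trim point refer to deleted history and are skipped.
    // Since a parent is always written after its children, a trimmed inner
    // node can only reference children that are also trimmed, so nothing
    // still reachable is lost by accident.
    std::vector<SubtreeRef> live_children(std::vector<SubtreeRef> const& links,
                                          LogicAddr trim_point) {
        std::vector<SubtreeRef> live;
        for (auto const& link : links) {
            if (link.addr >= trim_point) {
                live.push_back(link);
            }
        }
        return live;
    }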

Ingestion

Note: data from each source must be ordered by timestamp before ingestion.

The data is compressed using a specialized compression algorithm. This algorithm allows for incremental writes and partial reads (we can read data points even if the leaf node is not full yet).

NB+tree doesn't produce any inaccessible disk blocks during the ingestion process because there is no node splitting (TBD: link to some B+tree reference). Instead of node splitting, NB+tree maintains a series of extents and merges them periodically. An extent is an NB+tree instance of a certain height. The root of each extent is always memory resident; all other nodes are committed to disk. Different extents never overlap in time range or contain the same nodes.

E.g.

(fig. 3 - NB+tree extents)

We can see that NB+tree can be very shallow. Because of the high fan-out ratio of the tree and the compression, we can store 1B data points using only four levels. The maximum number of data points in one time-series in Akumuli is about 1e17 because the tree height is limited (max tree height is 10): with a fan-out of 32 and several thousand points per leaf, 32^9 leaves × ~10^3 points ≈ 10^17.

When the root of an extent becomes full (the root is always memory resident), it gets committed to disk and is then added to the higher-level extent as a subtree. This is a recursive process: if the root of the higher-level extent becomes full, it is committed and added as a subtree to the extent above it. Overflows occur most often at level 0, then at level 1, and so on. An important property of the NB+tree is that each inner node is younger (was committed later) than the nodes it links to.
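A minimal sketch of this roll-up rule with a toy extent type (the names, the fake addresses, and the roll_up helper are illustrative assumptions, not the actual Akumuli code):

    #include <cstdint>
    #include <vector>

    using LogicAddr = uint64_t;

    // Toy stand-in for an extent: it only tracks how many subtree links the
    // in-memory root currently holds (a real root holds up to 32 links plus
    // aggregates and time ranges).
    struct Extent {
        static constexpr int kFanOut = 32;
        int links_in_root = 0;

        LogicAddr commit_root() {            // pretend to write the root block
            static LogicAddr next_addr = 1;  // fake, monotonically growing address
            links_in_root = 0;               // the root is replaced by a fresh one
            return next_addr++;
        }
        bool add_subtree(LogicAddr) {        // add a link; fail if the root is full
            if (links_in_root == kFanOut) {
                return false;
            }
            ++links_in_root;
            return true;
        }
    };

    // Roll-up rule: when the root of extent `level` is full, commit it to disk
    // and register the committed subtree in the next (taller) extent, creating
    // that extent if it doesn't exist yet. Overflows propagate recursively.
    void roll_up(std::vector<Extent>& extents, size_t level) {
        LogicAddr subtree = extents[level].commit_root();
        if (level + 1 == extents.size()) {
            extents.emplace_back();
        }
        if (!extents[level + 1].add_subtree(subtree)) {
            roll_up(extents, level + 1);          // the parent root overflowed too
            extents[level + 1].add_subtree(subtree);
        }
    }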

Compression

The compression algorithm is used to write data into 4KB leaf nodes. It should be able to use all available bytes to minimize internal fragmentation of the storage system. The algorithm has to be incremental: it should be possible to add data to each block without reading and decoding all previously written data. Partially filled blocks should be readable, otherwise we would be forced to maintain another copy of the data somewhere.

General-purpose algorithms achieve good performance when used on large continuous data streams. They are tailored for large streams of data and not for fine-grained incremental reads and writes; the leaf nodes used by NB+tree are too small for them. In my tests zlib was slower than this custom algorithm by a factor of 30 (the performance difference is data dependent, so numbers can vary).

Each data point consists of a 64-bit timestamp and a 64-bit floating-point value.

To compress values, Akumuli relies on a DFCM (differential finite context method) predictor. Previous values are used to predict the next one. The predicted value is then XORed with the current one and the difference is encoded.

Here is algorithm outline (pseudocode):

This algorithm returns the XORed value without trailing bits and a set of flags (4 bits). The flags are used to encode the length of the XORed value and which part of the u64 value was truncated (the algorithm can truncate leading or trailing bits). Only 3 bits are used to encode the length, so the value length is in the range 1 to 8. If the predicted value is the same as the actual observed value, we still store one byte. Because each value takes an integral number of bytes and the flags can be stored in half a byte, Akumuli groups consecutive values into pairs:
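To make this concrete, here is a rough sketch of the per-value encoding step and the pairing of flags. The exact flag layout and the handling of leading vs. trailing zeros are simplified assumptions, and the DFCM predictor that produces `predicted` is omitted; the real Akumuli encoder differs in details.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Encode one value: XOR it with the prediction, drop the run of zero bytes
    // from one end, and return a 4-bit flag (1 bit: which end was dropped,
    // 3 bits: payload length minus one). Payload bytes are appended to `out`.
    uint8_t encode_value(double value, double predicted, std::vector<uint8_t>& out) {
        uint64_t a, b;
        std::memcpy(&a, &value, sizeof a);
        std::memcpy(&b, &predicted, sizeof b);
        uint64_t diff = a ^ b;

        // Count zero bytes at both ends of the XORed value.
        int leading = 0, trailing = 0;
        for (int i = 7; i >= 0 && ((diff >> (8 * i)) & 0xFF) == 0; --i) ++leading;
        for (int i = 0; i < 8 && ((diff >> (8 * i)) & 0xFF) == 0; ++i) ++trailing;

        // Truncate whichever end has the longer run of zeros, but always keep
        // at least one byte (even if the prediction was exact).
        bool drop_trailing = trailing > leading;
        int zeros = drop_trailing ? trailing : leading;
        if (zeros > 7) zeros = 7;
        int nbytes = 8 - zeros;                       // payload length, 1..8

        uint64_t payload = drop_trailing ? (diff >> (8 * zeros)) : diff;
        for (int i = 0; i < nbytes; ++i) {
            out.push_back(static_cast<uint8_t>((payload >> (8 * i)) & 0xFF));
        }
        return static_cast<uint8_t>((drop_trailing ? 0x8 : 0x0) | (nbytes - 1));
    }

    // Values are encoded in pairs: the two 4-bit flags share one leading byte,
    // followed by the two byte-aligned payloads.
    void encode_pair(double v1, double p1, double v2, double p2,
                     std::vector<uint8_t>& out) {
        std::vector<uint8_t> payload;
        uint8_t f1 = encode_value(v1, p1, payload);
        uint8_t f2 = encode_value(v2, p2, payload);
        out.push_back(static_cast<uint8_t>((f2 << 4) | f1));
        out.insert(out.end(), payload.begin(), payload.end());
    }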

This technique allows us to write data only on byte boundaries, which improves performance.

Timestamps are encoded using a combination of Delta-Delta encoding and LEB128 encoding. It's efficient and fast enough for most cases. In my experience, value compression has the biggest impact on the total compression ratio.
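A small sketch of the timestamp path. How negative second differences are mapped to unsigned integers is an assumption here (zig-zag is one common choice), and the real encoder also handles blocks of timestamps and special cases that are omitted:

    #include <cstdint>
    #include <vector>

    // Write an unsigned integer as LEB128: 7 bits per byte, high bit = "more".
    void write_leb128(uint64_t value, std::vector<uint8_t>& out) {
        do {
            uint8_t byte = value & 0x7F;
            value >>= 7;
            if (value != 0) {
                byte |= 0x80;
            }
            out.push_back(byte);
        } while (value != 0);
    }

    // Zig-zag: map small negative numbers to small unsigned numbers so that an
    // irregular (slightly early) timestamp still encodes into few bytes.
    uint64_t zigzag(int64_t v) {
        return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
    }

    // Delta-delta encoder: store the difference between consecutive deltas,
    // which is close to zero for regularly sampled series.
    struct TimestampEncoder {
        uint64_t prev_ts = 0;
        int64_t  prev_delta = 0;

        void append(uint64_t ts, std::vector<uint8_t>& out) {
            int64_t delta = static_cast<int64_t>(ts - prev_ts);
            int64_t dd    = delta - prev_delta;
            write_leb128(zigzag(dd), out);
            prev_ts = ts;
            prev_delta = delta;
        }
    };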

Closing NB+tree

An NB+tree instance can be closed for writing. Outline of the algorithm:

At the end of this process we should get the address of a single root node. Note that this process produces some fragmentation because it writes nodes that aren't necessarily full. Akumuli uses copy-on-write if a node is not at least ¾ full. Also note that subsequent open-close operations performed in a loop can potentially eat up all the storage. Akumuli fights this pathological pattern by using lazy tree initialization.

Crash recovery

Each extent has a memory-resident root node. This node can be lost during a crash. There are two major cases here:

For each tree, Akumuli maintains a list of rescue points. The rescue points list is a simple array that contains node addresses. Each time an extent at level i gets committed to persistent storage, its address is saved in the i-th element of this array. We can then use the i-th element of this array to restore the root node of the (i+1)-th extent.

Note that the last element of the rescue points list is normally empty: when the last extent becomes full and gets committed, a new extent is created and a new empty element is added to the end of the list. But if the tree was properly closed, the last element of the rescue points list will contain the address of the root node. This property is used to determine whether the tree was closed or a crash occurred.
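The following sketch shows how such a list could be maintained (the structure, the EMPTY_ADDR marker, and the method names are assumptions; Akumuli persists this data in external storage such as sqlite):

    #include <cstdint>
    #include <vector>

    using LogicAddr = uint64_t;
    constexpr LogicAddr EMPTY_ADDR = ~0ull;  // marker for "no address yet"

    struct RescuePoints {
        std::vector<LogicAddr> addrs;  // addrs[i] = last committed root of extent i

        // Called whenever the root of extent `level` is committed to disk.
        void on_commit(size_t level, LogicAddr addr) {
            if (addrs.size() <= level) {
                addrs.resize(level + 1, EMPTY_ADDR);
            }
            addrs[level] = addr;
            if (level + 1 == addrs.size()) {
                addrs.push_back(EMPTY_ADDR);  // empty slot for the next, taller extent
            }
        }

        // Called on a clean close: the final root address goes into the last slot.
        void on_close(LogicAddr root_addr) {
            if (addrs.empty()) {
                addrs.push_back(EMPTY_ADDR);
            }
            addrs.back() = root_addr;
        }

        // On restart: a non-empty last slot means the tree was closed properly,
        // an empty one means a crash occurred and recovery is needed.
        bool was_closed_cleanly() const {
            return !addrs.empty() && addrs.back() != EMPTY_ADDR;
        }
    };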

Recovery algorithm outline:

This procedure restores extents starting from the oldest and largest one. Each inner node can be restored based on information stored in its children. The children can be either inner nodes or leaf nodes.

Pseudocode for restore_inner_node procedure:

Note that the `read_node` and `calculate_aggregates` functions have to work differently for inner nodes and leaf nodes.
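Based on the description here and the example in fig. 4 below, a rough sketch of this procedure could look as follows (the node layout is simplified, and `read_node` / `calculate_aggregates` are only declared, since their implementations depend on the block store and on the node type):

    #include <cstdint>
    #include <memory>
    #include <utility>
    #include <vector>

    using LogicAddr = uint64_t;
    constexpr LogicAddr EMPTY_ADDR = ~0ull;

    struct Aggregates { /* count, sum, min, max, ... */ };

    struct Node {
        LogicAddr addr;  // where this node lives on disk
        LogicAddr prev;  // backreference to the previous sibling, or EMPTY_ADDR
        // ... payload
    };

    // In-memory replacement for the lost root of an extent.
    struct InnerNode {
        std::vector<std::pair<LogicAddr, Aggregates>> links;
        // Children are discovered newest-first, so links are added at the front.
        void add_front(LogicAddr child, Aggregates const& agg) {
            links.insert(links.begin(), {child, agg});
        }
    };

    // Helpers named in the text; they must distinguish inner and leaf nodes.
    Node read_node(LogicAddr addr);
    Aggregates calculate_aggregates(Node const& node);

    // Rebuild the lost in-memory root from its last committed child, whose
    // address comes from the rescue points list, following backreferences.
    std::unique_ptr<InnerNode> restore_inner_node(LogicAddr last_child) {
        auto root = std::make_unique<InnerNode>();
        LogicAddr addr = last_child;
        while (addr != EMPTY_ADDR) {
            Node child = read_node(addr);  // in fig. 4: node B first, then node A
            root->add_front(addr, calculate_aggregates(child));
            addr = child.prev;             // follow the backreference chain
        }
        return root;
    }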

 

(fig. 4)

Figure 4 shows an extent before a crash. Nodes A and B are persisted to disk and node C is memory resident. The `restore_inner_node` function will read node B first, because its address is stored in the rescue points list (it was saved last). It will calculate the aggregates and add them to the newly created replacement for node C. Then the procedure will do the same with node A.

Note that the crash recovery process is itself crash safe because it restores only memory-resident nodes and doesn't write anything to disk.

Performance evaluation

NB+tree was implemented in the Akumuli project and this implementation was benchmarked. The number of points to write was chosen high enough to measure sustained write throughput. The test writes 400 000 000 data points through four OS threads. Each thread has its own set of NB+tree instances and writes data sequentially in time order (this is the worst case because each write hits a different NB+tree instance). The total write time is measured and reported. All tests were performed on a Dell XPS 13 laptop with 8GB of RAM and an NVMe SSD.

# Threads | # Trees | # Data points | Time (s) | Avg. throughput
4         | 10 000  | 400 000 000   | 210      | 1 900 000 points/second
4         | 100     | 400 000 000   | 79.5     | 5 000 000 points/second
1         | 1       | 400 000 000   | 302      | 1 300 000 points/second

We can see that throughput degrades when the number of trees becomes large. Profiling showed that the reason for this behaviour is TLB misses. This can be fixed by batching and performing the writes in a different order. The second test works the same way but writes data into each NB+tree one by one in large chunks.

# Threads | # Trees | # Data points | Time (s) | Avg. throughput
4         | 10 000  | 400 000 000   | 91       | 4 300 000 points/second
4         | 100     | 400 000 000   | 94       | 4 250 000 points/second

A crash-recovery performance test showed that a database with one billion data points and 10 000 NB+tree instances can be restored in less than ten seconds.

Conclusion

NB+tree provides a solid foundation for a time-series database. It achieves high write and read throughput and supports low-overhead statistical queries and graphing. Crash safety, recovery, and data retention were also described in this article.