Compression algorithms in Akumuli

This document describes compression algorithms used in Akumuli Time-Series Database.

General purpose compression libraries (lz4, libz) is not a good fit for Time-Series Database. All this libraries are designed with large data volumes in mind. They require large amount of memory per data stream to maintain a sliding window of previously seen samples. The larger the context size the better compression ratio gets. This general purpose compression algorithms usually achieve very good compression ratio on time-series data at the expense of memory use. E.g. lz4 requires about 12KB of memory per context (stream).

Akumuli uses NB+tree data structure to store data. This data-structure consists of 4KB blocks. Each block should have it's own compression context (otherwise it wouldn't be possible to read and decompress individual data blocks). With most general purpose libraries this context gets larger then the block itself. Of cause only one compression context needed per tree most of the time. But the problem is that we have many trees (Akumuli maintains NB+tree instance per time-series) and for each tree we will need individual compression context. E.g. if we're dealing with 100'000 time-series we'll need about 1GB of memory only for compression contexts.

This is the reason why we need specialized compression algorithm. This algorithm should have small memory footprint in the first place. It should deal with both timestamps and values. General purpose encoders can process both timestamps and values (compression ratio can benefit from preprocessing). Specialized encoder should use different algorithms for timestamps and values. There are different kinds of time-series and corner cases that encoder should deal with. E.g. timestamps in regular time-series can be easily compressed using Delta-RLE encoder but if time-series is irregular or contains noise in lower bits Delta-RLE becomes ineffective. In this case Delta-Delta encoder is a better option.

Table 1.

High precision | Low precision | |

Periodic | Delta-Delta | Delta-RLE |

Non-periodic | Delta-Delta | Delta-Delta |

Lot of data sources are periodic and has low precision timestamps thus can be compressed using Delta-RLE. But there are other cases in which Delta-Delta is a better option. Some time-series databases uses Delta-Delta encoding for everything but Akumuli is trying to get the best result by combining both approaches.

- FCM algorithm.
- FPC algorithm.
- Gorilla paper.

This algorithm works only with sorted timestamps. On the lowest level encoder uses variant of LEB128 encoding. Original LEB128 encoding uses 7 bits of each byte to store data and the most significant bit is used as a word separator. In this encoding 64-bit integer can be represented using up to nine bytes. Each encoded byte should be shifted because bits is not aligned with original word.

As an example, here is how the unsigned number 624485 gets encoded (this is an example from Wikipedia article):

10011000011101100101 In raw binary

010011000011101100101 Padded to a multiple of 7 bits

0100110 0001110 1100101 Split into 7-bit groups

00100110 10001110 11100101 Add high 1 bits on all but last group

0x26 0x8E 0xE5 In hexadecimal

0xE5 0x8E 0x26 Output stream

In modified LEB128 version used by Akumuli each byte stores 8-bits of the original word. Each word has 4-bit flag that stores number in [0-8] range. This number represent how many bytes is used to encode original 64-bit word. Word are encoded in pairs and 4-bit flags are combined into single byte. This is basically the same LEB128 encoding with the most significant bits moved into separate byte. It is also very similar to VByte encoding.

Here is how the unsigned number 624485 gets encoded:

10011000011101100101 In raw binary

1001 10000111 01100101 Split into 8-bit groups

0x09 0x87 0x65 In hexadecimal

0x?3 0x09 0x87 0x65 Output stream

(Note: the first byte in output stream is a flag and ‘?’ mark indicates that higher bits of the flag will be used to store flags of the next value.)

This encoding is better because it reduces branch misprediction. With regular LEB128 series of branches should be checked:

- if value < (1 << 7):

- put_byte(value & 0x7F)

- else if value < (1 << 14):

- put_byte((value & 0x7F) | (1 << 7))
- value >>= 7
- put_byte(value & 0x7F)

- else if value < (1 << 21):

- …

Up to five branches should be taken for each 32-bit value and up to nine for each 64-bit value. Experiments showed that this can cause a lot of branch mispredictions if each value stored in two or more bytes.

Pseudo-code for modified LEB128 encoding is a lot more simpler:

- len1 = clzl(value1)
- len2 = clzl(value2)
- ctrl1 = 8 - len1 / 8
- ctrl2 = 8 - len2 / 8
- flags = ctrl1 | (ctrl2 << 4)
- put_byte(flags)
- put_nbytes(value1, len1)
- put_nbytes(value2, len2)

Function clzl returns number of leading zeroes. It can be implemented using __builtin_clzl compiler intrinsic function.

Note that this encoding can represent zeroes using only flags. With original LEB128 we will store each zero value in one byte. With modified LEB128 two consequent zero values will occupy only one byte.

The entire stream is encoded using chunked Delta Delta encoding first. Stream is divided into 16-element frames and each frame is Delta-coded first. Previous value is carried between frames so there is no visible difference between framed and normal Delta encoding. After that smallest element of the frame is subtracted from each element of the frame. As result we have smallest element and 16-element frame with smaller values. After that step smallest element is saved using normal LEB128 encoding and 16-element frame is saved using modified LEB128 encoding.

- frame = get_next_frame()
- min = frame[0] - prev_value
- for i = 0; i < 16; i++:

- delta = frame[i] - prev_value
- outbuf[i] = delta
- prev_value = frame[i]
- min = MIN(delta, min)

- for i = 0; i < 16; i++:

- outbuf[i] = outbuf[i] - min

- outstream.encode_leb128(min)
- outstream.encode_frame(outbuf)

Here `outstream` is an object that implements described LEB128 and modified LEB128 encodings. Function `encode_leb128` writes single value using LEB128 encoding and function `encode_frame` write sixteen consequent values using modified LEB128.

There is a special case. When data source is periodic we will get some non-zero value as a smallest element and 16-element frame will consist from zeroes (because all elements of the frame will be equal after Delta-coding step). Modified LEB128 encoding uses 0xFF flag value to indicate this case. If encoder sees that frame consists of sixteen zeroes it saves smallest element using normal LEB128 encoding and then it inserts 0xFF flag (this can be done in `encode_frame` function). So for periodic data sources we will store only one LEB128 encoded word plus one byte per frame (sixteen elements). This means that compression ratio in this case will be in x20-x30 range. This is a bit worse than Delta-RLE but far better than Delta-Delta encoding that achieves x6.4 compression ratio at best.

To compress values akumuli relies on DFCM (differential finite context method) predictor. Previous values are used to predict the next one. Predicted value then being XORed with the current one and difference is encoded.

Here is algorithm outline (pseudocode):

- Tuple<u64, byte> process_value(predictor, input):

- u64 predicted = predictor.predict_next()
- u64 curr = input.get_next()
- predictor.update(next)
- u64 bits = curr XOR predicted
- leading_zeroes = count_leading_zeroes(bits)
- trailing_zeroes = count_trailing_zeroes(bits)
- If trailing_zeroes > leading_zeroes:

- nbytes = 8 - trailing_zeroes / 8
- If nbytes > 0:

- nbytes = nbytes - 1

- flags = 00001000b | ( nbytes & 00000111b )

- Else:

- nbytes = 8 - leading_zeroes / 8
- If nbytes > 0:

- nbytes = nbytes - 1

- flags = nbytes & 00000111b

- nshift = (64 - nbytes*8)*(flags & 00001000b)
- Return (bits >> nshift), flags

This algorithm returns xor-ed value without trailing bits and the set of flags (4-bits). Flags is used to encode length of the xor-ed value and what part of the u64 value was truncated (algorithm can truncate leading or trailing bits). Only 3 bits is used to encode length so value length should be in range (1, 8). If predicted value is the same as actual observed value we will store one byte anyway. Because each value takes integral number of bytes and flags can be stored in a half of the byte akumuli groups consequent values into pairs:

- lobits, loflags = process_value(predictor, input)
- hibits, hiflags = process_value(predictor, input)
- nbytes_lo = loflags & 00000111b + 1
- nbytes_hi = hiflags & 00000111b + 1
- output.write_n_bytes(&lobits, nbytes_lo)
- output.write_n_bytes(&hibits, nbytes_hi)
- output.write_n_bytes(loflags & (hiflags << 4), 1)

This technique allows us to write data only on byte boundary to improve performance.

The flag 0xF can't be produced by the algorithm (if `bits` variable is 0 the code above will generate 0x7 flag). This allows us to use it for the different purpose. Akumuli uses flag 0xF to indicate that entire row consists of zero values thus can be omitted (this is possible because encoder works with rows of values composed of 16 elements). As result, if decoder meets this flag it just repeats the last value 16 times. This means that only half of bit will be used to store each value.

Combined algorithm was able to achieve pick throughput of 1.1GB/s on high precision data (random walk) with non-periodic timestamps.

Compression ratio varies depending on the data. On low precision data with periodic timestamps it was able to store each data point in 1.9 bytes per element. On full precision data (double values with all mantissa bits changing) with periodic timestamps with noise in microsecond domain it was able to store each data point in 8.3 bytes per element in average.

Each version of the compression algorithm was fuzzed using AFL (https://github.com/akumuli/Akumuli/blob/master/fuzzers/afl_compression.cpp) and diverse set of samples.