
Improving Cassandra’s performance with byte order and tries

Branimir Lambov, DataStax

ApacheCon 2022


Motivation/Background

Cassandra relies heavily on comparison-based structures and algorithms:

  • SkipList + BTree memtables
  • Search in summary + primary index
  • MergeIterator for compaction and constructing results


Comparisons are inefficient

  • Comparisons require full keys in comparable forms
    • Deserialization / pointer hop
  • Keys can be long and their length varies
    • Difficult to package / inefficient to cache
  • Repeated prefixes vs. map hierarchies
    • More work/space vs. more complex code
  • O(k) multiplier on all operations
    • O(k∙log n) insertion/lookup
    • O(k∙n∙log m) compaction


Byte-comparable keys and tries

  • Lexicographically comparable
  • Prefix differences define order
  • Data structures can take advantage:
    • Prefix sharing
    • Limited node size
    • O(k) lookup/insertion
    • O(n∙log m) compaction


Byte-comparable representations (CASSANDRA-6936)

  • Typed value ↔ byte-comparable representation
  • Sequence of bytes which lexicographically compare like the typed value
  • For example:
    • Unsigned integers: as is
    • Signed integers: flip sign bit
    • IEEE floating point: flip sign bit, flip all other bits if negative
    • UUIDs: move bits around
  • Flat multi-component keys
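As an illustration, here is a minimal Java sketch of two of these translations (helper names are ours, not Cassandra's); big-endian output makes an unsigned lexicographic byte compare match the typed order:

```java
import java.nio.ByteBuffer;

// Signed integers: flipping the sign bit maps Integer.MIN_VALUE to
// 0x00000000 and Integer.MAX_VALUE to 0xFFFFFFFF.
static byte[] encodeInt(int v) {
    return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
}

// IEEE floating point: flip the sign bit, and flip all the other bits
// too when the value is negative.
static byte[] encodeDouble(double v) {
    long bits = Double.doubleToRawLongBits(v);
    bits ^= (bits < 0) ? 0xFFFFFFFFFFFFFFFFL : 0x8000000000000000L;
    return ByteBuffer.allocate(8).putLong(bits).array();
}
```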


Byte-comparability applications

  • Trie memtable
  • BTI (“Big Trie-Indexed”) SSTable format:
    • Trie-based partition index
    • Trie-based row index


Legacy Memtables

  • A hierarchy of comparison-based maps
    • Partition map is a concurrent skip list
    • Row maps and below are B-Trees
  • Data may be stored off heap
  • All maps (organizing structures) are on-heap
    • On-heap size dominates
    • Complex on-heap structure
    • Churn of objects with medium-term lifecycle


New memtable solution (CEP-19/CASSANDRA-17240)

  • Byte-comparable key translation
  • Partition map using a custom byte trie
    • Typed nodes with pointer tagging
    • Internal memory management with fixed block size
    • Traversal, merging and slicing using a “Cursor” paradigm
    • Single writer, multiple concurrent readers
  • Sharding


Typed Trie Nodes

  • Dense
    • Lots of children, often consulted, pointer arithmetic
  • Sparse
    • Few children, search in children list
  • Chain
    • Single child sequences, comparison
  • Leaf
    • No child, data pointer
  • Prefix
    • Add data to node with child
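A sketch of the two main child-lookup styles; the layouts and names here are illustrative, not the actual memtable trie code:

```java
static final int NONE = -1; // hypothetical "no child" marker

// Dense: the child array is indexed directly by the transition byte.
static int denseChild(int[] children, int transition) {
    return children[transition];
}

// Sparse: a short unordered list of (transition, child) pairs, searched linearly.
static int sparseChild(byte[] transitions, int[] children, int count, int transition) {
    for (int i = 0; i < count; i++)
        if ((transitions[i] & 0xFF) == transition)
            return children[i];
    return NONE;
}
```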


Fixed Block Size

  • Split dense nodes into 2-3-3 bit subtransitions
  • Combine up to 28 chain nodes into one block
    • Put content prefixes in unused space
  • Achieves:
    • Cache efficiency
    • Memory management simplicity
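One plausible reading of the 2-3-3 split (the exact bit assignment is our assumption): the 256 possible transitions of a dense node are spread over three levels of sub-nodes indexed by 2, 3 and 3 bits of the transition byte, so no single node needs 256 child slots in a fixed-size block.

```java
// Indices of the three sub-transition levels for a transition byte b (0-255).
static int hi(int b)  { return (b >> 6) & 0x03; } // top 2 bits: 4-way node
static int mid(int b) { return (b >> 3) & 0x07; } // middle 3 bits: 8-way node
static int lo(int b)  { return b & 0x07; }        // bottom 3 bits: 8-way node
```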


Pointer tagging

  • Use pointer bits to identify:
    • Type of node
    • Exact node in a chain
  • No space needed for leaf nodes
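An illustrative tagging scheme (the actual bit layout in the memtable trie is an assumption here): when nodes are aligned within the trie's buffer, the low bits of an int "pointer" are free to carry the node type.

```java
static final int TYPE_MASK = 0x3; // assumes 4-byte alignment frees 2 low bits
static final int DENSE = 0, SPARSE = 1, CHAIN = 2, LEAF = 3;

static int tag(int offset, int type) { return offset | type; }
static int typeOf(int pointer)       { return pointer & TYPE_MASK; }
static int offsetOf(int pointer)     { return pointer & ~TYPE_MASK; }
```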


Microbenchmark summary

  • Trie vs. skip-list on 10,000,000-entry key-value memtable:
    • 1.8x faster random reads
    • 2.5x faster single-threaded insertion
    • 25-35% more data for the same on-heap size


Short-term throughput in Apache Cassandra

10% reads, key-value workload using NoSQLBench / fallout / i3.4xlarge

  • 1.8x higher burst throughput


Sustained performance with improved compaction

Using datastax/cassandra: ds-trunk

  • >2x higher sustained throughput
  • ~30% latency reduction at fixed rate (10% and 50% read workloads @ 110k ops/s)
  • ~30% more data per memtable flush
  • 2.5x reduction in total garbage collection time


Legacy partition index (BIG format)

  • Binary search in in-memory index summary
  • Linear search in sorted index file
    • Includes deserialization and decoration
    • Has to skip over row index
  • Needs key cache


Trie-based partition index (BTI format)

  • Token, partition key encoded as byte-ordered sequence
  • Trie stores unique prefixes
    • Typed nodes, sized pointers
    • Written page-packed
  • Points to data or row index file
  • Includes hash bits


Unique prefix

Only store each key up to its unique prefix and rely on the full key in the data file to verify a match.

Store and check hash bits to minimize reading the data file on a mismatch.
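A minimal sketch of the unique-prefix computation, assuming keys arrive in sorted order (helper names are ours, not the BTI writer's):

```java
import java.util.Arrays;

// The shortest prefix that distinguishes key from both sorted neighbours.
static byte[] uniquePrefix(byte[] prev, byte[] key, byte[] next) {
    int len = Math.max(commonPrefix(prev, key), commonPrefix(key, next)) + 1;
    return Arrays.copyOf(key, Math.min(len, key.length));
}

static int commonPrefix(byte[] a, byte[] b) {
    int i = 0;
    while (i < a.length && i < b.length && a[i] == b[i])
        i++;
    return i;
}
```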


Page packing

Include as much substructure as possible inside a disk page.

Only write a page once a branch grows larger than a page.

Treat pointers to already-written pages like leaves.
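A high-level sketch of this packing rule; the node type, size accounting and page writer below are stand-ins, not the BTI implementation:

```java
import java.util.List;

interface PackableNode {
    List<PackableNode> children();
    int branchSize();               // serialized size of this subtree
    void redirectToPage(long page); // replace the subtree with a page pointer
}

interface PageWriter {
    long writePage(PackableNode root); // write a subtree as one page, return its position
}

static final int PAGE_SIZE = 4096;

static long pack(PackableNode node, PageWriter out) {
    for (PackableNode child : node.children()) {
        // A branch that outgrows a page is written first; the pointer to
        // its page then behaves like a leaf in the parent's page.
        if (child.branchSize() > PAGE_SIZE)
            child.redirectToPage(pack(child, out));
    }
    return out.writePage(node); // what remains fits within one page
}
```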


Key-value microbenchmark



Partition index performance

  • ~2x faster random reads in microbenchmarks
  • ~30% smaller index size
  • ~10% latency reduction on 1TB fixed-throughput test


Operational benefits

  • No key cache
    • Cached by chunk cache / OS page cache
  • No index summary
    • Small non-leaf page set stays cached
  • No key deserialization
    • Much less object churn during queries
  • No skipping over row index during search
    • Typically a single leaf page fetch


Legacy row index (BIG format)

  • Binary search in an index of blocks of row keys
    • A page fetch per comparison, or deserialization of the whole row index to memory
  • Blocks are formed at a given granularity (64 kb by default)
    • Blocks need to be large
  • Linear search within a block
  • Full deserialization of a block on reversed queries


Trie-based row index (BTI format)

  • Clustering key encoded as byte-ordered sequence
  • Index blocks formed at a given granularity
  • Stores a “separator” between index blocks (sketch below)
    • Key ≥ separator means no earlier block can contain the entry
    • Cut off at first different byte
  • Plus deletion information
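A sketch of the "cut off at first different byte" rule, assuming the byte-ordered encoding makes keys prefix-free (so the differing byte always exists):

```java
import java.util.Arrays;

// Shortest separator between the last key of one block and the first key
// of the next: keep nextFirst's bytes up to and including the first byte
// where the two keys differ.
static byte[] separator(byte[] prevLast, byte[] nextFirst) {
    int i = 0;
    while (i < prevLast.length && prevLast[i] == nextFirst[i])
        i++;
    return Arrays.copyOf(nextFirst, i + 1);
}
```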


Row index performance

  • 3-4x more compact index → better index granularity → close to 4x reduction in latency
  • Improved reversed queries
  • 4/2/1 kb column_index_size is usable
  • As is 0 kb (full indexing)


Future work

  • Memtable trie to rows and cells
  • Trie-based PartitionUpdate in commit log
  • On-disk TrieTables
  • Compaction and retrieval with trie cursors
  • Partition-level segregation of tombstones from data


Thank you!



Backup slides


Sharding and performance under skew

  • Split the hash range equally into separate tries (sketch below)
    • Hashed keys → good distribution of work
  • Better at handling congestion on many cores
  • Potential problems:
    • Dominating partitions
    • Non-hashing partitioners
  • Tests show improvement for up to 20% skew
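A sketch of the even split, treating the signed 64-bit token as unsigned and taking its top k bits (k ≥ 1 assumed; not the exact sharding code):

```java
// Map a token to one of 2^k shards, each holding an independent trie.
static int shardFor(long token, int k) {
    return (int) ((token ^ Long.MIN_VALUE) >>> (64 - k));
}
```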


Merging, Slicing and Traversal

  • Traversal using stateful `Cursor`
    • `advance` moves to next node in lexicographic order
    • Returns depth and incoming byte
    • Sufficient to decide order in parallel walks
    • Sufficient to recreate paths
  • Union and intersection on cursors
  • No intermediate objects
  • A single wrapper per transformation
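An illustrative shape of the cursor idea (not the exact Trie.java API):

```java
interface TrieCursor {
    int advance();      // move to the next node in lexicographic order;
                        // returns its depth, or -1 at the end of the trie
    int incomingByte(); // the transition byte that led into the current node
}

// In a parallel walk over several tries, (depth, incomingByte) alone decides
// which cursor is positioned earlier: greater depth first, then smaller byte.
static boolean isEarlier(int depthA, int byteA, int depthB, int byteB) {
    return depthA > depthB || (depthA == depthB && byteA < byteB);
}
```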


Memtables

  • Sort and organize data in memory
  • On the synchronous path of every write
  • Source for reads of recently-written data
  • Eventually “flushed” to on-disk SSTables
  • Efficiency determines peak write throughput
  • Size is a major factor in sustained write throughput
  • Structure complexity affects garbage collection pauses


Trie

Trie.java

Tree with labelled transitions. Key → value map. Key = path from root to final node.
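A toy illustration of "key = path from root": follow one labelled transition per key byte, and the node reached at the end holds the value. The Node type here is hypothetical.

```java
interface Node<V> {
    Node<V> child(int transition); // null if no such transition exists
    V content();                   // null for nodes that hold no value
}

static <V> V get(Node<V> root, byte[] key) {
    Node<V> n = root;
    for (byte b : key) {
        n = n.child(b & 0xFF);
        if (n == null)
            return null; // path does not exist in the trie
    }
    return n.content();
}
```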


Chains


Range

RangeTrieSet.java

“bit” inclusive to “thing” exclusive


Intersection

SetIntersectionTrie.java

Intersection with [bit, thing)


Byte-ordered sequences

ByteSource.java, ByteComparable.java, doc

Comparable by lexicographic comparison of the bytes

  • Integers — flip sign bit
  • IEEE floats — flip sign bit if positive, flip all bits if negative
  • UUIDs — move bits around
  • Strings, blobs — terminate with 00, escape 00
  • BigInteger, BigDecimal — complex encoding
  • Tuples, lists, maps — encode components, add separators and concatenate

Two versions (6.0/indexes, 6.8/memtables)
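A sketch of the "terminate with 00, escape 00" idea for strings and blobs; the concrete escape and terminator byte values in Cassandra's two encoding versions may differ from the ones chosen here:

```java
import java.io.ByteArrayOutputStream;

static byte[] encodeBytes(byte[] value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte b : value) {
        out.write(b);
        if (b == 0)
            out.write(0xFF); // escape an embedded zero so it keeps sorting after shorter keys
    }
    out.write(0); // terminator: a key sorts before any of its extensions
    return out.toByteArray();
}
```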


Mutation and Concurrent Reads

  • Single attachment point
    • Adding a new child
    • Remapping an existing child
  • No invalid intermediate state
    • Sparse nodes store children unordered
  • Volatile write establishes happens-before
  • Obsolete nodes stay until reads complete
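A minimal sketch of the publication pattern described above, assuming a node with a single mutable attachment point: build the new child fully, then publish it with one volatile write, so concurrent readers see either the old structure or the complete new one.

```java
class AttachPoint {
    volatile int childPointer; // the only field readers race on

    void attach(int fullyBuiltChild) {
        // Everything reachable from fullyBuiltChild was written beforehand;
        // this volatile store establishes the happens-before edge for readers.
        childPointer = fullyBuiltChild;
    }
}
```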
