Improving Cassandra’s performance with byte order and tries
Branimir Lambov, DataStax
ApacheCon 2022
Motivation/Background
Cassandra relies heavily on comparison-based structures and algorithms:
Comparisons are inefficient
Byte-comparable keys and tries
Byte-comparable representations (CASSANDRA-6936)
Byte-comparability applications
Legacy Memtables
New memtable solution (CEP-19/CASSANDRA-17240)
Typed Trie Nodes
Fixed Block Size
Pointer tagging
Microbenchmark summary
Microbenchmark summary
Short-term throughput in Apache Cassandra
10% reads, key-value workload using �NoSQLBench / fallout / i3.4xlarge
Sustained performance with improved compaction
Using datastax/cassandra : ds-trunk
Legacy partition index (BIG format)
Trie-based partition index (BTI format)
Unique prefix
Only store up to unique prefix and rely on full key in data file to check full match.
Store and check hash bits to minimize reading data file on mismatch.
Page packing
Include as much substructure as possible inside disk page.
Only write page once branch is greater than page.
Treat pointers to placed pages like leaves.
Key-value microbenchmark
Partition index performance
Operational benefits
Legacy row index (BIG format)
Trie-based row index (BTI format)
Row index performance
Future work
Thank you!
Backup slides
Sharding and performance under skew
Merging, Slicing and Traversal
Memtables
Trie
Trie.java
Tree with labelled transitions.�Key → value map. �Key = path from root to final node.
Chains
Range
RangeTrieSet.java
“bit” inclusive to “thing” exclusive
Intersection
SetIntersectionTrie.java
Intersection with [bit, thing)
Byte-ordered sequences
ByteSource.java, ByteComparable.java, doc
Comparable by lexicographic compare on the bytes
Two versions (6.0/indexes, 6.8/memtables)
Mutation and Concurrent Reads