A Success In Implementing A Fast Hash Table
Eduardo Madrid, Scott Bruce
CppCon 2022, 2022-9-16
Authors
Eduardo Madrid
Consultant
Ex-Snap
Ex FinTech
Repeat CppCon Presenter
Scott Bruce
Rockset
Ex-Snap
Ex-Google
First time CppCon Presenter
Design and Implementation of a fast hash table
What is required for fast hash tables?
2. Parallel computation
3. Efficient encoding of metadata
Hashtable Briefing
|
|
|
|
|
|
|
|
|
|
4
1
7
| | 1 | 4 | | | 7 | |
11
homeIndex
available
Chaining
Probing
Hang a linked list off every slot of array.
Extremely bad locality!
Find first open slot after homeIndex.
Hashtables: Key Checks and Entropy
Robin Hood Fundamentals
Robin Hood Hash Tables
0 | 0 | 0 | 25, 1 | 0 | 0 | 0 | 0 | 0 | 0 |
homeIndex
actualIndex
0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 25,4 | 22,2 | 14,3 | 0 |
homeIndex
0 | 0 | 0 | 76, 1 | 33,2 | 42,3 | 14,1 | 22,2 | 0 | 0 |
Probe sequence length
Before insert
After insert
Empty table
Robin Hood Table: Insert
Elements are value, probe sequence length (PSL). Empty cells are PSL 0.
Find hit!
0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 25,4 | 22,2 | 14,3 | 0 |
homeIndex
0 | 0 | 0 | 76, 1 | 33,2 | 42,3 | 14,1 | 22,2 | 0 | 0 |
Probe sequence length
Find, missing
Find, present
Robin Hood Table: Find
Elements are value, probe sequence length (PSL). Empty cells are PSL 0.
Find missed!
homeIndex
25
0 | 0 | 0 | 76, 1 | 33,2 | 42,3 | 22,1 | 14,2 | 0 | 0 |
Probe sequence length
Delete, after
Robin Hood Table: Delete
Elements are value, probe sequence length (PSL). Empty cells are PSL 0.
Remove, shift back!
homeIndex
Delete hit!
0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 25,4 | 22,2 | 14,3 | 0 |
homeIndex
Delete
25
Robin Hood Hash Table Properties
Some other hashtable designs
Skarupke’s Robin Hood Hash
Martinus’ Robin Hood Hash
Google/Abseil’s SwissTable / related hash tables
F14 SIMD hash map
Design and Goals
Goals
Design Strategies
SWAR (SIMD within a register)
SWAR (SIMD within a register)
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
+
=
Why SWAR? Safety.
Why SWAR? Speed!
When doing SETS of useful operations, SWAR has at least not-worse performance.
Why SWAR? SIMD!
Generic Programming
Keep Abstractions Healthy
Generic Programming
Better Instruction Cost Models!
Traditional Cost Models
Count Compares!
Count Cachelines!
Tiered Instruction Cost Model
Ops | Cost |
Bitwise Ops, Add, Subtract | 0-1 |
Multiply, Branch Hit, Predicted Indirect | 1-3 |
Division, Modulo | ~40 |
Branch Miss, Indirect Jump | 10-60 |
L2 Cache Miss, Bad Indirect Jump | 50-150 |
Design for High Load Factor
Table Parameterization: Metadata
Table Parameterization: Data Layout
Type Erasure for Runtime Optimization
Implementation
Metadata Bit Packing
psl|hash | psl|hash | psl|hash | psl|hash | psl|hash | psl|hash | psl|hash | psl|hash |
00|00 | 01|eb | 02|f3 | 01|cc | 00|00 | 00|00 | 00|00 | 00|00 |
Metadata Bit Packing
The Sharpest Needle, corrected
6|c | 2|d | 3|e | 4|d | 5|c | 1|a | 0|0 | 0|0 |
1|c | 2|c | 3|c | 4|c | 5|c | 6|c | 7|c | 8|c |
X|! | !|X | !|X | !|X | !|! | X|< | X|< | X|< |
Needle
Haystack
Result
Match!
Deadline!
The Sharpest Needle, as presented
6|c | 2|e | 3|f | 4|d | 5|c | 1|a | 0|0 | 0|0 |
1|c | 2|c | 3|c | 4|c | 5|c | 6|c | 7|c | 8|c |
X|F | F|X | F|X | F|X | F|F | X|X | X|X | X|X |
Needle
Haystack
Result
Match!
Deadline!
Hash vs Scatter and Hoist
Deterministic Code
auto toAbsolute = [](auto v, auto ma) { auto shiftedLeft = v.shiftLanesLeft(ma); auto shiftedRight = v.shiftLanesRight(Metadata::NSlots - ma - 1).shiftLanesRight(1); return Metadata{shiftedLeft | shiftedRight}; }; |
Align SWARs to find.
Insertion against Unaligned SWAR.
Insertion and Eviction Chains
Division is wildly expensive
Deletion by Shifting Back
Resizing!
Results
Clang++-13 300k elements, find hit
Unordered Map load factor 0.854552 | Robin Hood load factor 0.99944| Element Count 300000 RH Cap 300168
Lookup, find, increment value.
benchmark name mean low mean high mean
-------------------------------------------------------------------------------
baseline 46.054 us 44.289 us 47.836 us
Unordered Map 91.8874 ms 91.3968 ms 92.3027 ms
Robin Hood 6.47753 ms 6.29374 ms 6.79329 ms
std::map 299.269 ms 296.54 ms 303.723 ms
Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux
Clang++-13 300k elements, find miss
Unordered Map load factor 0.854552 | Robin Hood load factor 0.99944 | Element Count 300000 RH Cap 300168
Lookup value not present in table.
benchmark name mean low mean high mean
---------------------------------------------------
baseline 31.513 us 30.24 us 36.331 us
Unordered Map 58.3777 ms 57.3793 ms 59.3649 ms
Robin Hood 7.57569 ms 7.29033 ms 7.90574 ms
std::map 13.5118 ms 13.0835 ms 14.1473 ms
Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux
Clang++-13 string keys
Benchmark is a frequency count of input corpus.
Number of different words: 4914
benchmark name mean low mean high mean
------------------------------------------------------
RH 1.6437 ms 1.61575 ms 1.68692 ms
std::unordered_map 2.67309 ms 2.63686 ms 2.72369 ms
std::map 8.91768 ms 8.81113 ms 9.09036 ms
Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux
Clang++-13 Insertion
Benchmark name is ‘<count inserted> - <type>’. Type is std for unordered_map, psl/hash bits for Robin Hood
benchmark name mean low mean high mean
---------------------------------------------------
1000 - 6/2 32.329 us 31.994 us 32.894 us
1000 - std 101.048 us 100.158 us 102.995 us
1000 - 5/3 31.51 us 31.247 us 31.912 us
2000 - 6/2 78.693 us 77.957 us 79.534 us
2000 - std 213.445 us 212.585 us 214.675 us
50000 - std 6.34328 ms 6.29092 ms 6.41162 ms
50000 - 6/2 2.33733 ms 2.33137 ms 2.34393 ms
Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux
Compiler Explorer Output
No-inlining https://godbolt.org/z/vhaobobzW
Compiler Explorer
Inlining working: https://godbolt.org/z/d6TocPKxY
Important operations: https://godbolt.org/z/7bjh7P5zv
Future Work
Key Rotation!
The Rust Robin Hood Quadratic Copy
Productionization!
Benchmarks are Lies
End Goal
A Success In Implementing Fast Hash Tables
Eduardo Madrid, Scott Bruce
CppCon 2022, 2022-9-16
Related work
Eduardo Madrid, SWAR Techniques , CppCon 2019
Eduardo Madrid, Rehashing Hash Tables and Associative Containers, C++ Now 2022
Malte Skarupke New Improvements to Hash Table Performance, C++Now 2018
Matt Kulukundis Fast, Efficient, Cache-friendly Hash Table, CppCon 2017
Nathan Bronson, Xiao Shi, F14, Facebook Engineering Blog
Finally, one of the most important papers of Computer Science, ever:
Claude Shannon, A Mathematical Theory of Information