1 of 67

A Success In Implementing A Fast Hash Table

Eduardo Madrid, Scott Bruce

CppCon 2022, 2022-9-16

2 of 67

Authors

Eduardo Madrid

Consultant

Ex-Snap

Ex-FinTech

Repeat CppCon Presenter

Scott Bruce

Rockset

Ex-Snap

Ex-Google

First time CppCon Presenter

3 of 67

Design and Implementation of a fast hash table

4 of 67

What is required for fast hash tables?

5 of 67

1. Locality friendly algorithm

6 of 67

2. Parallel computation

7 of 67

3. Efficient encoding of metadata

8 of 67

Hashtable Briefing

9 of 67

[Diagram: the same keys placed by chaining vs. probing; homeIndex and available slots marked.]

Chaining: hang a linked list off every slot of the array. Extremely bad locality!

Probing: find the first open slot after homeIndex.

10 of 67

Hashtables: Key Checks and Entropy

  • Full key equality tests are expensive.
  • Construct some proxy (à la Bloom filter) that we can cheaply check to rule out matches.
  • Entropy is unexpectedness. Increasing entropy in whatever bit structure we use reduces false positive rate.
  • Reducing false positive rate is vital for performant parallel key checks.
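
A minimal sketch of the proxy idea (names and layout are ours, not the library's): keep one byte of the hash next to each slot and compare it before touching the full key.

#include <cstdint>
#include <string>

// Hypothetical slot layout: one hoisted hash byte per entry acts as the proxy.
struct Slot { uint8_t hashByte; std::string key; int value; };

// Cheap filter: a one-byte compare; false positives possible, false negatives not.
bool mayMatch(const Slot& s, uint8_t probeHashByte) {
    return s.hashByte == probeHashByte;
}

// The expensive full key compare only runs when the cheap filter passes.
bool matches(const Slot& s, uint8_t probeHashByte, const std::string& probeKey) {
    return mayMatch(s, probeHashByte) && s.key == probeKey;
}

The more entropy packed into hashByte, the lower the false positive rate and the fewer expensive key compares actually run.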

11 of 67

Robin Hood Fundamentals

12 of 67

Robin Hood Hash Tables

  • Linear probed hash table.
  • ‘Home index’ is the slot an element maps directly to.
  • ‘Actual index’ is the slot an element ends up in.
  • Keep track of ‘probe sequence length’ (PSL), distance from ‘home index’.
  • During a find, insert or delete, track PSL.
  • When inserting, if the element to insert is poorer (farther from home) than the element in the table (closer to home), evict the richer (lower PSL) and take its slot.
  • Continue the insertion, now with the evicted element (see the sketch below).
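
A scalar sketch of that insert loop (illustrative only; the talk's table keeps PSL packed in SWAR metadata rather than a per-entry int, and a real insert first runs a find for the key).

#include <cstddef>
#include <utility>
#include <vector>

struct Entry { int key = 0; int psl = 0; };   // psl == 0 means the slot is empty

void robinHoodInsert(std::vector<Entry>& table, int key, std::size_t homeIndex) {
    Entry incoming{key, 1};                              // PSL 1: sitting at its home slot
    for (std::size_t i = homeIndex; i < table.size(); ++i, ++incoming.psl) {
        Entry& resident = table[i];
        if (resident.psl == 0) { resident = incoming; return; }  // empty slot: done
        if (resident.psl < incoming.psl)     // incoming is poorer than the resident:
            std::swap(resident, incoming);   // evict the richer one and keep inserting it
    }
    // A real table would grow (or throw) when it runs off the end.
}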

13 of 67

Robin Hood Table: Insert

Elements are value,PSL. Empty cells are PSL 0.

Insert 25 into an empty table (homeIndex and actualIndex coincide):
  0 | 0 | 0 | 25,1 | 0 | 0 | 0 | 0 | 0 | 0

Before insert of 25 into a populated table (homeIndex is the slot holding 76,1):
  0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 14,1 | 22,2 | 0 | 0

After insert: 25 probes to PSL 4, evicts the richer 14,1, and the evicted 14 continues to the first empty slot:
  0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 25,4 | 22,2 | 14,3 | 0

14 of 67

Robin Hood Table: Find (looking up 25)

Elements are value,PSL. Empty cells are PSL 0.

Find, present: the probe walks from homeIndex and reaches 25,4. Find hit!
  0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 25,4 | 22,2 | 14,3 | 0

Find, missing: the probe reaches 14,1, whose PSL is already lower than the probe's would be. Find missed!
  0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 14,1 | 22,2 | 0 | 0

15 of 67

Robin Hood Table: Delete (deleting 25)

Elements are value,PSL. Empty cells are PSL 0.

Delete hit! 25,4 is found:
  0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 25,4 | 22,2 | 14,3 | 0

Delete, after. Remove, shift back! The trailing block moves back one slot and each PSL is decremented:
  0 | 0 | 0 | 76,1 | 33,2 | 42,3 | 22,1 | 14,2 | 0 | 0

16 of 67

Robin Hood Hash Table Properties

  • With a non-adversarial hash function the maximum length of a find operation is O(log(table size)).
    • Very deterministic finding behavior; find is a precursor to all hash table operations.
  • Locality friendly: max probe lengths grow slowly.
  • Probe sequence length can grow by at most 1 per slot, but can fall arbitrarily (a scalar find sketch follows).
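
The scalar find sketch referenced above (again illustrative; the real table does this lane-parallel on packed metadata). The probe stops as soon as it meets a slot whose resident is richer than the probe would be there, because the key could not legally live past that point.

#include <cstddef>
#include <vector>

struct Entry { int key = 0; int psl = 0; };   // psl == 0 means the slot is empty

// Returns the slot index of `key`, or -1 on a miss.
long robinHoodFind(const std::vector<Entry>& table, int key, std::size_t homeIndex) {
    int probePsl = 1;
    for (std::size_t i = homeIndex; i < table.size(); ++i, ++probePsl) {
        const Entry& e = table[i];
        if (e.psl < probePsl) return -1;                          // empty or richer: miss
        if (e.psl == probePsl && e.key == key) return long(i);    // hit
    }
    return -1;
}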

17 of 67

Some other hashtable designs

18 of 67

Skarupke’s Robin Hood Hash

  • C++Now 2018:
    • Thorough presentation on the performance of hash table implementations
    • Covered Robin Hood (“optimization 4/7”, https://youtu.be/M2fKMP47slQ?t=1565), but later claims other strategies may be better.
      • The implementation of Robin Hood is straightforward and not optimal.
        • Separate comparison of probe sequence length and key, no simplification of key comparison.
    • For Robin Hood, he invented(?) the technique we call ‘Skarupke Tail’
    • He never put together in code all of the performance techniques from the talk:
      • We put all we thought was advantageous together!
    • Bytell uses the concept of a block with a linked list of the elements in a probe sequence; the links in the list are “jump forward” counts, and supporting jumps longer than 7 bits (128) requires a level of indirection:
      • Complex code that seems to be a dead end.

19 of 67

Martinus’ Robin Hood Hash

  • Most popular Robin Hood implementation.
  • Good stock performance
  • Several techniques to improve performance:
    • Metadata encoding with hoisted hash codes.
  • However, it makes some design decisions we don’t agree with.
  • We suspect a low-probability bug; we cannot ascertain it either way because the code crosses levels of abstraction all the time and there is some entanglement (coupling) between elements.
  • Martinus has written a blog post on potential improvements, including the concept of “fast forwarding” and a metadata encoding for it, which we are not convinced are productive.
    • https://martin.ankerl.com/2016/09/21/very-fast-hashmap-in-c-part-2/

20 of 67

Google/Abseil’s SwissTable / related hash tables

  • Matt Kulukundis presented at CppCon on this subject in 2017 and 2019:
    • Points out open addressing schemes have bad performance.
      • Rapidly approaches worst case behavior
      • Worst case behavior is a full table scan.
    • Characteristics of Swiss Table:
      • Open addressed
      • Offset based linked node structure
      • Hoisted hash bits
      • Explicit metadata blocks with small hoisted hash bit counts
        • Hoisted hash collides with any other key with that hash
      • SIMD used for broad metadata block check
        • Fallback to hoisted hash bit equivalence check in 64 bits if no SIMD
    • Mentioned trying a Robin Hood variant, seeing roughly equal perf.

21 of 67

F14 SIMD hash map

  • Facebook’s Folly library, Nathan Bronson and Xiao Shi
    • Double hashing table that avoids overhead by chunking.
    • Specifically designed to take massive advantage of speedups available to SIMD designs.
    • Hoists 7 bits of the hash function into a 128-bit line to execute the SIMD check.
    • 16 bytes of overhead per Chunk, 1.14-1.33 bytes per entry.
    • F14 seems to make an intense effort to be fast and productionized, but the design and code seem overspecified because of it.

22 of 67

Design and Goals

23 of 67

Goals

  • Very high performance.
  • Very low overhead/high load factor.
  • Configurable metadata characteristics.
  • Type erased dynamic tables.

Design Strategies

  • Use SWAR for bit parallelization!
  • Generic programming!
  • Zero cost abstraction layers!
  • Avoid overspecified code!
  • Use improved instruction cost models!

24 of 67

SWAR (SIMD within a register)

25 of 67

SWAR (SIMD within a register)

  • SWAR is an abstraction around bit parallelization: the idea of doing multiple computations within a single register.
    • SIMD hardware will create hardware ‘lanes’ that do not interfere with each other.
    • SWAR creates software lanes in 64 bits, but the lanes can interfere with each other.
  • SWAR allows doing parallel operations with generic registers.

  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
+ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
= 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16
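
A minimal sketch of the diagram above in plain C++ (our helper names, not the library's API): eight 8-bit lanes packed into one uint64_t, added without letting carries leak across lane boundaries.

#include <cstdint>
#include <cstdio>

constexpr uint64_t kMSBs = 0x8080'8080'8080'8080ull;   // top bit of every lane

constexpr uint64_t addLanes(uint64_t a, uint64_t b) {
    uint64_t low = (a & ~kMSBs) + (b & ~kMSBs);   // 7-bit adds, carries stay inside each lane
    return low ^ ((a ^ b) & kMSBs);               // recombine the top bit of each lane
}

int main() {
    uint64_t a = 0x0807'0605'0403'0201ull;        // lanes 1..8
    uint64_t b = a;
    std::printf("%016llx\n", (unsigned long long)addLanes(a, b));  // 100e0c0a08060402: lanes 2..16
}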

26 of 67

Why SWAR? Safety.

  • SWAR is a safe, low-cognitive-load abstraction wrapped around complicated, non-obvious, or subtle techniques. This includes all varieties of bit twiddling.
  • Abstracting away these techniques allows for generic implementation. Different arch targets will get different code.

27 of 67

Why SWAR? Speed!

  • For some ops, SWAR allows us to execute against all elements at once.
  • For some ops, SWAR allows us to execute ops cheaper than if implemented otherwise.
  • For some ops, SWAR has about the same cost as other techniques.

When doing SETS of useful operations, SWAR has at least not-worse performance.

28 of 67

Why SWAR? SIMD!

  • The library may need to be renamed…
  • Techniques used still apply to SIMD hardware, and we’ve prototyped code that will issue SIMD instructions via the SWAR library.
    • In at least one case, adversarially!

29 of 67

Generic Programming

30 of 67

Keep Abstractions Healthy

  • C++ provides the ability to create zero cost abstractions.
  • Our code strives to keep abstraction layers from leaking.
    • Provides powerful performance improvements by being optimizer friendly.
    • Keeps code comprehensible, easier to audit for errors.
  • We try very hard to avoid the common pattern in other high performance code in which the code base is very monolithic.
    • F14 has its tag count embedded in the name, and near zero parameterization.

31 of 67

Generic Programming

  • Avoid overspecification of code.
    • Almost always auto, type inference, template metaprogramming, etc.
    • Allows parameterization of implementation details that might otherwise be fixed.
  • Lax constraints give the theorem prover that is the optimizing compiler more opportunity.
    • Results in better and faster output binaries.

32 of 67

Better Instruction Cost Models!

33 of 67

Traditional Cost Models

Count Compares!

  • Treats compares as a reliable proxy for code cost.
  • Ok in theory and for getting moderate performance.
  • Falls absolutely flat when chasing high performance.

Count Cachelines!

  • Treats cacheline fetches as dominating factor.
  • Treats all other operations as free.
  • Ok for getting good performance.
  • Falls flat when chasing high performance.
    • Memory hierarchy is too complex.
    • Dominating factors vary by scale.
    • Easy to do more work than a cacheline fetch takes.
    • Prefetching
    • Etc

34 of 67

Tiered Instruction Cost Model

  • Bucket instructions and hardware actions by rough cost into levels.
  • Good enough to let the optimizing compiler work, simple enough to be usable.

Ops                                          Cost (approx. cycles)
Bitwise Ops, Add, Subtract                   0-1
Multiply, Branch Hit, Predicted Indirect     1-3
Division, Modulo                             ~40
Branch Miss, Indirect Jump                   10-60
L2 Cache Miss, Bad Indirect Jump             50-150

35 of 67

Design for High Load Factor

  • Memory efficiency in hash tables is incredibly important.
    • ~4% of Google’s RAM is in hash tables.
  • Load factor dominates overhead.
    • A 50% load factor guarantees an absolute minimum of 2x space overhead.
    • A 99% load factor gets very low overhead (see the arithmetic below).
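
Back-of-the-envelope arithmetic for those two bullets (our numbers; per-slot metadata is ignored): the space overhead multiplier is roughly the reciprocal of the load factor.

\[
\text{slots} = \frac{n}{\mathrm{LF}}
\;\Rightarrow\;
\text{space overhead} \approx \frac{1}{\mathrm{LF}}:
\qquad \frac{1}{0.5} = 2\times,
\qquad \frac{1}{0.99} \approx 1.01\times
\]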

36 of 67

Table Parameterization: Metadata

  • Selectable metadata configuration (bits devoted to PSL, bits devoted to hoisted hash bits)
    • See the benchmarks later
  • Specific design goal to allow these changes during runtime.

37 of 67

Table Parameterization: Data Layout

  • Metadata / data layout as a mixin.
  • Configurable internal layouts:
    • Metadata block then key block
    • Interleaved metadata with keys
  • The current design supports these; research is ongoing.

38 of 67

Type Erasure for Runtime Optimization

  • Uniform type erased interface.
  • Ability to change parameterization of table at runtime based on use of table.
  • Allowed by design, not implemented.

39 of 67

Implementation

40 of 67

Metadata Bit Packing

  • Checking PSL, then hash, then key is too expensive.
  • Pack PSL and ‘hoisted’ hash bits into metadata blocks using SWAR.
  • Specify how many bits of each to keep!
  • SWARWithSublanes abstraction: two sublanes per lane.

Lanes:    psl|hash  psl|hash  psl|hash  psl|hash  psl|hash  psl|hash  psl|hash  psl|hash
Example:  00|00     01|eb     02|f3     01|cc     00|00     00|00     00|00     00|00

41 of 67

Metadata Bit Packing

  • PSL is very strong for checking equality within a single chain.
  • Use hoisted hashes to strengthen checking across chains.
  • Using bits for PSL or hoisted hash has exponential benefit.
  • Keep PSL in least significant bit sublane.
    • MSB guaranteed off: fast compare, boolean repr.
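
One plausible packing, matching the 4-bit/4-bit example used on the next slides (the helper names are ours, not SWARWithSublanes' real API): PSL in the low sublane, hoisted hash in the high sublane of each 8-bit lane.

#include <cstdint>

constexpr int kPSLBits = 4;     // example split: 4 bits PSL, 4 bits hoisted hash

constexpr uint8_t packLane(uint8_t psl, uint8_t hoistedHash) {
    return uint8_t((hoistedHash << kPSLBits) | (psl & 0xF));   // PSL in the low sublane
}
constexpr uint8_t pslOf(uint8_t lane)  { return lane & 0xF; }
constexpr uint8_t hashOf(uint8_t lane) { return uint8_t(lane >> kPSLBits); }

static_assert(packLane(2, 0xB) == 0xB2);
static_assert(pslOf(0xB2) == 2 && hashOf(0xB2) == 0xB);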

42 of 67

The Sharpest Needle, corrected

  • Check lanewidth entries at once using SWAR!
  • Example is 4 bits PSL, 4 bits hoisted hash:
    • X means no match, “!” means a match, “<” means the invariant is broken

Needle:    1|c   2|c   3|c   4|c   5|c   6|c   7|c   8|c
Haystack:  6|c   2|d   3|e   4|d   5|c   1|a   0|0   0|0
Result:    X|!   !|X   !|X   !|X   !|!   X|<   X|<   X|<

Match! at the fifth lane (5|c in both); Deadline! where the haystack PSL drops below the needle's.
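
A simplified sketch of the lane-parallel equality part of that check (the real code also derives the deadline and goes through the SWAR library's types; the constants and helpers here are ours). Each 8-bit lane holds hash << 4 | psl; XOR against the needle and detect all-zero lanes.

#include <bit>
#include <cstdint>
#include <cstdio>

constexpr uint64_t kLow7s = 0x7F7F'7F7F'7F7F'7F7Full;

// MSB of each 8-bit lane is set iff that lane of v is zero (exact, no cross-lane carries).
constexpr uint64_t zeroLanes(uint64_t v) {
    return ~(((v & kLow7s) + kLow7s) | v | kLow7s);
}

int main() {
    // Haystack from the slide, one byte per lane, PSL in the low nibble:
    // (6,c) (2,d) (3,e) (4,d) (5,c) (1,a) (0,0) (0,0)
    constexpr uint64_t haystack = 0x0000'A1C5'D4E3'D2C6ull;
    // Needle: hoisted hash 0xc in every lane, PSL counting 1..8 across the lanes.
    uint64_t needle = 0;
    for (int i = 0; i < 8; ++i)
        needle |= uint64_t((0xC << 4) | (i + 1)) << (8 * i);

    uint64_t matchMask = zeroLanes(haystack ^ needle);
    // Prints 4 (0-based), i.e. the slide's fifth lane, the one holding 5|c.
    std::printf("first matching lane: %d\n", std::countr_zero(matchMask) / 8);
    // A real find would also stop at the "deadline": the first lane whose haystack
    // PSL is smaller than the needle PSL for that lane.
}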

43 of 67

The Sharpest Needle, as presented

  • Check lanewidth entries at once using SWAR!
  • Example is 4 bits PSL, 4 bits hoisted hash, X means no match

Needle:    1|c   2|c   3|c   4|c   5|c   6|c   7|c   8|c
Haystack:  6|c   2|e   3|f   4|d   5|c   1|a   0|0   0|0
Result:    X|F   F|X   F|X   F|X   F|F   X|X   X|X   X|X

Match! at the fifth lane; Deadline! at the trailing lanes.

44 of 67

Hash vs Scatter and Hoist

  • Hash func used for ‘find index’ and ‘cheap equality check’
  • Break into ‘Scatter’ and ‘Hoist’
    • Scatter provides home index
    • Hoist provides hash bits
  • Allows use of different functions for increased entropy.
  • We use Fibonacci hash, Lemire reductions, multiply-pick-high-bits
  • Pick the weakest, cheapest bit mixer that still gets good results.
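
An illustrative pair of mixers (constants and names are ours; the table uses whatever cheap mixers measure well): Fibonacci hashing for the scatter and a different multiply for the hoisted bits, so the two cheap checks do not share entropy. Both shift counts are assumed to be in [1, 63].

#include <cstdint>

constexpr uint64_t kFib   = 0x9E37'79B9'7F4A'7C15ull;   // ~2^64 / golden ratio
constexpr uint64_t kOther = 0xC2B2'AE3D'27D4'EB4Full;   // a different odd constant

// Scatter: Fibonacci hash, multiply then keep the top indexBits as the home index.
// (A Lemire reduction is another option for non-power-of-two table sizes.)
constexpr uint64_t scatter(uint64_t hashedKey, unsigned indexBits) {
    return (hashedKey * kFib) >> (64 - indexBits);
}

// Hoist: a different multiply-pick-high-bits mix supplies the metadata hash bits.
constexpr uint8_t hoist(uint64_t hashedKey, unsigned hashBits) {
    return uint8_t((hashedKey * kOther) >> (64 - hashBits));
}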

45 of 67

Deterministic Code

  • Reduces branch predictor workload.
  • Deterministic code has better execution patterns.
  • Cost model de facto bans divisions.
  • Sometimes computing both sides of a branch and squashing together is faster.

auto toAbsolute = [](auto v, auto ma) {
    // Compute both pieces unconditionally and OR them together: no branch.
    auto shiftedLeft = v.shiftLanesLeft(ma);
    // Shift right by NSlots - ma lanes, performed in two steps.
    auto shiftedRight =
        v.shiftLanesRight(Metadata::NSlots - ma - 1).shiftLanesRight(1);
    return Metadata{shiftedLeft | shiftedRight};
};

46 of 67

Align SWARs to find.

  • The first lookup point is likely in the middle of a SWAR register: on average, half of the SWAR is unused.
  • Combine neighboring SWARs together
  • Prioritizes the first comparison at a cost of a few shifts.
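
A sketch of that combination step (our helper, not the library's API): take the two adjacent 64-bit metadata words and build a view that starts exactly at the home lane.

#include <cstdint>

// laneOffset is the home index within the low word, in [0, 7].
inline uint64_t alignedView(uint64_t lowWord, uint64_t highWord, unsigned laneOffset) {
    unsigned shift = 8 * laneOffset;
    if (shift == 0) return lowWord;                      // avoid a full-width shift
    return (lowWord >> shift) | (highWord << (64 - shift));
}

Two shifts and an OR sit in the cheapest tier of the cost model, a small price for making the first SWAR comparison cover a full register.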

47 of 67

Insertion against Unaligned SWAR.

  • Insert starts after a successful find.
  • Evictions cause a ‘causality horizon’; there is no advantage to repeatedly constructing an alignment on insert.

48 of 67

Insertion and Eviction Chains

  • Inserts that evict cause another insert.
  • It is cheaper and prefetch-friendly to journal key moves, then apply them at the end (a sketch follows).
  • Metadata moves can be journaled and applied at the end, or applied optimistically.
    • Running out of PSL throws, so this is both a performance choice and an exception-safety choice.
  • Eviction chains approach the table size at extreme load factors.
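
A hedged sketch of the journaling idea (our data structures, not the library's): record where each displaced key must go while walking the eviction chain, then apply the moves in one pass.

#include <cstddef>
#include <vector>

struct Move { std::size_t from, to; };   // one displaced key per eviction

// Apply in reverse chain order so each key is copied out of its slot before an
// earlier move overwrites that slot.
void applyJournal(std::vector<int>& keys, std::vector<Move>& journal) {
    for (auto it = journal.rbegin(); it != journal.rend(); ++it)
        keys[it->to] = keys[it->from];
    journal.clear();
}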

49 of 67

Division is wildly expensive

  • It costs as much as a missed branch! It’s silly!
  • Lane widths (PSL bits + hash bits) that are not a power of two are possible, but result in a modulo-compile-time-constant (multiplication and shifts) instead of simple shifts.
  • Open benchmarking and implementation challenge to characterize how much of a slowdown it is:
    • Trashes the performance of finding the lane-index of the LSB
    • Puts even more pressure on multiplication (we use it a lot for entropy and broadcasting)

50 of 67

Deletion by Shifting Back

  • Find element to delete.
  • Find end of block to shift back: first 0 or 1 PSL.
  • Shift entire block back, decrement PSL.
  • We choose not to use tombstones: they are very bad for load factor and performance (a scalar sketch of the shift-back follows).
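
The scalar sketch referenced above (illustrative; the real table shifts packed metadata and keys separately).

#include <cstddef>
#include <vector>

struct Entry { int key = 0; int psl = 0; };   // psl == 0 means the slot is empty

void robinHoodErase(std::vector<Entry>& table, std::size_t hitIndex) {
    std::size_t i = hitIndex;
    // Shift the rest of the displaced block back one slot, decrementing PSLs,
    // until the next slot is empty (PSL 0) or already at home (PSL 1).
    while (i + 1 < table.size() && table[i + 1].psl > 1) {
        table[i] = table[i + 1];
        --table[i].psl;
        ++i;
    }
    table[i] = Entry{};   // the last vacated slot becomes empty
}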

51 of 67

Resizing!

  • Table layout for a given key insertion is deterministic.
  • Can change PSL bits to hash bits and vice versa. No rehash.
    • Grow by ‘stealing’ hash bits (just zero hash LSB, done)
  • Growth by memcpy to create a larger PSL. No rehash.
  • Increasing table size does not require rehoist of hash bits.
  • Increasing hoisted hash bit size only requires rehoist of hash bits.
  • Stealing hash bits for PSL bits just works!

52 of 67

Results

53 of 67

Clang++-13 300k elements, find hit

Unordered Map load factor 0.854552 | Robin Hood load factor 0.99944 | Element Count 300000 RH Cap 300168

Lookup, find, increment value.

benchmark name      mean          low mean      high mean
----------------------------------------------------------
baseline            46.054 us     44.289 us     47.836 us
Unordered Map       91.8874 ms    91.3968 ms    92.3027 ms
Robin Hood          6.47753 ms    6.29374 ms    6.79329 ms
std::map            299.269 ms    296.54 ms     303.723 ms

Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux

54 of 67

Clang++-13 300k elements, find miss

Unordered Map load factor 0.854552 | Robin Hood load factor 0.99944 | Element Count 300000 RH Cap 300168

Lookup value not present in table.

benchmark name      mean          low mean      high mean
----------------------------------------------------------
baseline            31.513 us     30.24 us      36.331 us
Unordered Map       58.3777 ms    57.3793 ms    59.3649 ms
Robin Hood          7.57569 ms    7.29033 ms    7.90574 ms
std::map            13.5118 ms    13.0835 ms    14.1473 ms

Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux

55 of 67

Clang++-13 string keys

Benchmark is a frequency count of input corpus.

Number of different words: 4914

benchmark name        mean          low mean      high mean
------------------------------------------------------------
RH                    1.6437 ms     1.61575 ms    1.68692 ms
std::unordered_map    2.67309 ms    2.63686 ms    2.72369 ms
std::map              8.91768 ms    8.81113 ms    9.09036 ms

Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux

56 of 67

Clang++-13 Insertion

Benchmark name is ‘<count inserted> - <type>’. Type is std for unordered_map, psl/hash bits for Robin Hood

benchmark name      mean          low mean      high mean
----------------------------------------------------------
1000 - 6/2          32.329 us     31.994 us     32.894 us
1000 - std          101.048 us    100.158 us    102.995 us
1000 - 5/3          31.51 us      31.247 us     31.912 us
2000 - 6/2          78.693 us     77.957 us     79.534 us
2000 - std          213.445 us    212.585 us    214.675 us
50000 - std         6.34328 ms    6.29092 ms    6.41162 ms
50000 - 6/2         2.33733 ms    2.33137 ms    2.34393 ms

Debian clang version 13.0.1-6~deb11u1 Intel(R) Core(TM) m3-8100Y CPU @ 1.10GHz, Linux penguin 5.4.163-17364-gfe3d4f499cf1 x86_64 GNU/Linux

57 of 67

Compiler Explorer Output

58 of 67

59 of 67

Compiler Explorer

60 of 67

Future Work

61 of 67

Key Rotation!

  • A given ‘probe chain’ contains all the elements that ‘belong’ to some home index.
  • Any of those elements can be swapped to any position occupied by another of those elements safely.
  • Allows a ‘rotation’ at lookup time to improve table performance for hot keys.

62 of 67

The Rust Robin Hood Quadratic Copy

  • We need to test whether we hit the performance failure mode around copying a table into an identically sized table.
  • We think we are safe since we use different entropy for finding slot and keeping hash bits.
  • Requires benchmarking and checking to ensure we don’t hit the same failure mode.

63 of 67

Productionization!

  • We are working to complete this table to be solid enough for real use.
  • Sane default table, guidance on parameterization for various workloads.
  • Learn from other productionization efforts:
    • Induce random layout in debug mode.
    • Chasing hidden access patterns that spike resource consumption.

64 of 67

Benchmarks are Lies

  • Need to do more benchmarking.
    • Use of standard hash benchmarking suites.
    • Benchmarking in heavily loaded systems.
    • Benchmarking in mixed-use compute workloads.
  • Benchmarking has a nice ability to flush out subtle bugs.

65 of 67

End Goal

  • Type erased, dynamically configured, auto balancing hash table that is suitable for a wide range of workloads.
  • Not an easy goal, but achievable.

66 of 67

A Success In Implementing Fast Hash Tables

Eduardo Madrid, Scott Bruce

CppCon 2022, 2022-9-16

67 of 67

Related work

Eduardo Madrid, SWAR Techniques, CppCon 2019

Eduardo Madrid, Rehashing Hash Tables and Associative Containers, C++Now 2022

Malte Skarupke, New Improvements to Hash Table Performance, C++Now 2018

Matt Kulukundis, Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step, CppCon 2017

Nathan Bronson, Xiao Shi, F14, Facebook Engineering Blog

Finally, one of the most important papers of Computer Science, ever:

Claude Shannon, A Mathematical Theory of Communication