1 of 32

Caffeine

Design of a Modern Cache

https://github.com/ben-manes/caffeine

2 of 32

Cache

Caffeine is a high-performance, near-optimal cache

Provides the familiar Guava Cache API

Lots of features (memoization, maximum size, expiration, …)

LoadingCache<Key, Graph> graphs = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(5, TimeUnit.MINUTES)
    .refreshAfterWrite(1, TimeUnit.MINUTES)
    .build(key -> createExpensiveGraph(key));

Let’s focus on the data structures

3 of 32

Design Goals

Use O(1) algorithms for predictable performance

Optimize the system, not a single metric

Tradeoffs based on simulations

Must be correct & maintainable

APIs with low conceptual weight

In-memory caching only

4 of 32

Least Recently Used

Orders entries by access and evicts the least recently used, preferring to retain the most recent

Suffers from cache pollution (not scan resistant, “one hit wonders”)

O(1) time for add, remove, update operations

[Diagram: a doubly-linked list ordered from least recent (head) to most recent (tail); an access reorders the entry to the tail]
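A minimal sketch of this policy (not Caffeine's implementation): `java.util.LinkedHashMap` in access order gives O(1) add, lookup, and reorder, and `removeEldestEntry` discards the least recently used head. The class name is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch: access order moves an entry to the tail on every
// get/put; removeEldestEntry evicts the head once over capacity.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maximumSize;

    LruCache(int maximumSize) {
        super(16, 0.75f, /* accessOrder */ true);
        this.maximumSize = maximumSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maximumSize;
    }
}
```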

5 of 32

Window TinyLFU

Admit only if the estimated frequency is higher than the victim’s

LRU before the filter ➜ recency-biased

LRU after the filter ➜ frequency-biased

[Diagram: an LRU window feeds evicted candidates through the TinyLFU admission filter into a Segmented LRU main space (probation and protected segments); eviction compares the candidate against the main space's victim]
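The admission rule above can be sketched in a few lines. This is illustrative, not Caffeine's code: a plain `HashMap` counter stands in for the compact CountMin sketch Caffeine actually uses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical admission filter: the window's candidate replaces the
// main space's victim only if its estimated frequency is higher.
class Admittor {
    private final Map<Object, Integer> freq = new HashMap<>();

    void recordAccess(Object key) {
        freq.merge(key, 1, Integer::sum);
    }

    boolean admit(Object candidate, Object victim) {
        return freq.getOrDefault(candidate, 0) > freq.getOrDefault(victim, 0);
    }
}
```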

6 of 32

Optimize by Hill Climbing

7 of 32

Adaptive Window

Recency ➜ Frequency ➜ Recency

8 of 32

Efficiency (1)

Web page Popularity

Multi-Pass Analysis

9 of 32

Efficiency (2)

SQL Database

Document Search

10 of 32

Hash Flooding

If attackers control the input, they can cause worst-case performance

[Diagram: normal operation of a hash table — keys spread evenly across buckets 0–5]

[Diagram: worst-case hash collisions — every key chained into a single bucket]

11 of 32

Avalanche Effect

A small change in the input should produce a change in the output that is indistinguishable from random.

One row = one input bit. The 32 squares in each row show that bit’s influence on each of the hash’s output bits.

[Avalanche diagrams: Simple Hash, Jenkins96 Hash, FNV-1 Hash]

Green is acceptable; orange is unacceptable; red indicates no mixing at all

12 of 32

Hash Prospecting

Good hash functions by simulated annealing!

Important criteria:

Avalanche effect: Flip one input bit ➜ each output is flipped with 50% chance

Low bias: No correlation between flipped output and input bit

// Thomas Mueller’s hash
static int hash(int x) {
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    return (x >>> 16) ^ x;
}

13 of 32

HashDoS in the wild

2MB of POST data consisting of 200,000 colliding strings ≈ 40B comparisons
(Hash-flooding DoS reloaded: attacks and defenses)

~6 kbit/s keeps one i7 core busy (1 Gbit/s ➜ 10k cores)
(Efficient Denial of Service Attacks on Web Application Platforms)

“A dial-up modem... can bring a dedicated server to its knees.”
(Denial of Service via Algorithmic Complexity Attacks)

14 of 32

Popularity Attack

CountMin is used to approximate heavy hitters

Attackers can exploit collisions so that the wrong entries are evicted

[Diagram: a CountMin sketch — the key is hashed into one counter per row (depth rows × width counters) and each is incremented; the frequency estimate is the minimum of those counters]
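A minimal CountMin sketch, to make the attack surface concrete. This is illustrative only: the sizes, seeds, and mixing function are assumptions, and Caffeine's production version uses packed 4-bit counters with periodic aging.

```java
// Minimal CountMin sketch: depth rows of width counters; increment
// touches one counter per row, estimate takes the minimum over rows.
class CountMinSketch {
    private final int depth = 4;
    private final int width = 16;  // illustrative sizes
    private final int[][] table = new int[depth][width];
    private final int[] seeds = {0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F};

    private int index(int item, int row) {
        int h = (item ^ seeds[row]) * 0x45d9f3b;  // simple per-row mixing
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    void increment(int item) {
        for (int row = 0; row < depth; row++) {
            table[row][index(item, row)]++;
        }
    }

    int estimate(int item) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][index(item, row)]);
        }
        return min;  // never underestimates; collisions only inflate it
    }
}
```

An attacker who can find keys colliding with a victim's counters in every row inflates a cold key's estimate, which is exactly the popularity attack above.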

15 of 32

HashDoS mitigations

Randomization

Jitter for unpredictable outcomes

Add noise and change it often, e.g. pick a new random seed whenever the hash table resizes

h(x) = hash(x + random seed)

Graceful degradation

Switch algorithm if needed

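The seeded-hash idea can be sketched as follows. This is an assumed illustration (class and method names are hypothetical); it reuses the strong finalizer from the hash-prospecting slide.

```java
import java.security.SecureRandom;

// Illustrative randomized hashing: mix a per-table random seed into
// each key's hash so an attacker cannot precompute colliding inputs.
// Reseeding (e.g. on resize) invalidates any collisions they learned.
final class SeededHasher {
    private int seed = new SecureRandom().nextInt();

    int hash(Object key) {
        int x = key.hashCode() ^ seed;
        x = ((x >>> 16) ^ x) * 0x45d9f3b;  // Mueller finalizer for mixing
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        return (x >>> 16) ^ x;
    }

    void reseed() {  // e.g. called when the hash table resizes
        seed = new SecureRandom().nextInt();
    }
}
```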

16 of 32

Cache Stampede

Latency spikes may be caused by concurrent loads for the same entry

Racy get-compute-put ➜ redundant work

Outages at Facebook, Honeycomb, and Coinbase

17 of 32

Cache Stampede

DoorDash: p99 latency reduced by 87%

Perform the action once, atomically

value = cache.getIfPresent(key)
if (value == null):
    lock = locks[hash(key) % locks.length]
    with lock:
        value = cache.getIfPresent(key)
        if (value == null):
            value = compute(key)
            cache.put(key, value)
return value
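In plain Java the same guarantee comes from `ConcurrentHashMap.computeIfAbsent`, which runs the loader at most once per key at a time (Caffeine's `cache.get(key, mappingFunction)` behaves similarly). A sketch, with hypothetical class names:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Stampede-free loading sketch: computeIfAbsent performs the
// check-compute-insert atomically per key, so concurrent callers
// wait for one load instead of duplicating the work.
class StampedeFreeCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    StampedeFreeCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        return map.computeIfAbsent(key, loader);
    }
}
```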

18 of 32

False Sharing

CPU caches are split into cache lines, each holding a 64-byte contiguous block

May cause hardware-level contention due to cache invalidation

[Diagram: Core 1 updates A2 and Core 2 updates A6 — different fields on the same cache line — forcing each core to repeatedly invalidate the other’s copy]

19 of 32

False Sharing

Pad hot variables to update independently

Example: Statistics like the hit & miss counters

[Diagram: after padding, A’s and B’s counters land on separate cache lines, so Core 1 updating A0 and Core 2 updating B0 proceed concurrently without invalidations]

20 of 32

Concurrent Hash Tables

Caching may degrade performance due to synchronization!

Use read/write locks and many sub-maps?

Zipf’s law causes some locks to be used far more often than others

[Diagram: a key is hashed to one of several sub-maps guarded by read/write locks; the Zipf-distributed item frequency concentrates traffic on a few stripes]

21 of 32

Java’s ConcurrentHashMap

Lock-free reads

Locks the bin's head node during a write

Concurrently resizes the table based on the load factor

ReservationNodes are used as placeholders while computing values

22 of 32

Concurrent Eviction

Linked List

Access: Move to the tail

Evict: Discard the head

Not thread safe!

Clock

Access: Mark the entry

Evict: Sweep, reset, and discard if unmarked

Sampled

Access: Increment the entry’s utility counter

Evict: Uniformly sample entries and discard the one with the lowest utility value

[Diagrams: Clock — a ring of marked (+) and unmarked (−) entries swept by a hand; Sampled — a pool of entries with per-entry utility counters, evicting the sampled minimum]
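The Clock (second-chance) variant above can be sketched in a few lines over a fixed slot array; this is an illustrative toy, not Caffeine's eviction code.

```java
// Minimal Clock sketch: an access marks a slot; eviction sweeps from
// the hand, clearing marks (a "second chance") until it finds an
// unmarked slot to discard.
class ClockPolicy {
    private final boolean[] marked;
    private int hand;

    ClockPolicy(int capacity) {
        marked = new boolean[capacity];
    }

    void recordAccess(int slot) {
        marked[slot] = true;
    }

    int evict() {  // returns the victim slot
        while (marked[hand]) {
            marked[hand] = false;              // second chance
            hand = (hand + 1) % marked.length;
        }
        int victim = hand;
        hand = (hand + 1) % marked.length;
        return victim;
    }
}
```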

23 of 32

Log-based Concurrency

Schedule the work, not the thread!

1. Apply the operation immediately to the hash table

2. Record the change into a read or write buffer

3. Replay under lock in asynchronous batches

[Diagram: producers record events into a buffer; a single consumer replays them in batches under a lock]

24 of 32

Read Buffer

Stripe ring buffers by the thread’s id

Add buffers when contention is detected

Drop additions when full or when retries are exhausted

Trigger a maintenance cycle when a buffer is full

[Diagram: per-thread striped ring buffers, selected by hashing the thread’s id, drained into the policy under a try-lock]

25 of 32

26 of 32

Write Buffer

Bounded ring buffer that resizes up to a maximum (via JCTools)

Back pressure when full

  • Writers assist in performing the maintenance work
  • Rarely occurs, as it requires a write rate much greater than the replay rate

Triggers a maintenance cycle immediately


27 of 32

Log Replay

Schedules the async draining under a tryLock

Concurrently, other writes may require reprocessing

Out-of-order consistency by using per-entry states (alive, retired, or dead)

Idle

Req

PIdle

PReq

Replay States

  • Idle
  • Required
  • Processing to Idle
  • Processing to Required
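The four states above can be sketched as a small CAS-driven state machine. This is an illustrative reconstruction of the idea, not Caffeine's drain-status code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Replay-state sketch: a writer requests a drain by moving IDLE to
// REQUIRED; the maintenance task claims work with REQUIRED to
// PROCESSING_TO_IDLE; writes arriving mid-drain bump the state to
// PROCESSING_TO_REQUIRED so the drain is scheduled again.
class DrainStatus {
    enum State { IDLE, REQUIRED, PROCESSING_TO_IDLE, PROCESSING_TO_REQUIRED }

    private final AtomicReference<State> state = new AtomicReference<>(State.IDLE);

    void afterWrite() {  // called by writers
        state.compareAndSet(State.IDLE, State.REQUIRED);
        state.compareAndSet(State.PROCESSING_TO_IDLE, State.PROCESSING_TO_REQUIRED);
    }

    boolean tryStartDrain() {  // called by the maintenance task
        return state.compareAndSet(State.REQUIRED, State.PROCESSING_TO_IDLE);
    }

    void finishDrain() {
        if (!state.compareAndSet(State.PROCESSING_TO_IDLE, State.IDLE)) {
            state.set(State.REQUIRED);  // more work arrived; run again
        }
    }
}
```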

28 of 32

Performance

29 of 32

Expiration

Lazy

Relies on size eviction

Causes cache pollution

May be mitigated with a background sweeper

Fixed

Treats all entries uniformly

Simple to implement and has O(1) eviction

Limited applicability

Variable

Different lifetimes per entry

Typically uses a tree, O(lg n)

May be slow at scale

30 of 32

Fixed Expiration

  • A time-bounded Least Recently Used policy
  • Time-to-Idle reorders on every access
  • Time-to-Live reorders on every write

[Diagram: a time-ordered queue from oldest (head) to youngest (tail); an access or write reorders the entry to the tail]
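The O(1) property of fixed expiration can be sketched as follows: because every entry shares one lifetime, write order is expiration order, so only the head needs inspecting. Class and method names are hypothetical.

```java
import java.util.ArrayDeque;

// Fixed time-to-live sketch: the deque head is always the oldest
// entry, so expiring is an O(1) peek-and-poll rather than a scan.
class TtlQueue {
    private record Timed(String key, long writeTime) {}

    private final ArrayDeque<Timed> writeOrder = new ArrayDeque<>();
    private final long ttl;

    TtlQueue(long ttl) {
        this.ttl = ttl;
    }

    void recordWrite(String key, long now) {
        writeOrder.addLast(new Timed(key, now));
    }

    String expireOne(long now) {  // returns an expired key, or null
        Timed head = writeOrder.peekFirst();
        if (head != null && now - head.writeTime() >= ttl) {
            writeOrder.pollFirst();
            return head.key();
        }
        return null;
    }
}
```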

31 of 32

Variable Expiration

Timer Wheels provide approximate timeouts in O(1) time!

Adding to a bucket requires hashing to the coarse resolution

A clock hand “ticks” as buckets expire and can be swept

Advancing a larger wheel causes its timers to be inserted into the smaller wheels

[Diagram: a timer wheel of 16 buckets, numbered 0–15, swept by the clock hand]
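The coarse-resolution hashing can be sketched as bit shifts and masks. The wheel sizes and shifts below are illustrative assumptions, not Caffeine's actual spans.

```java
// Illustrative hierarchical timer-wheel indexing: SHIFTS are assumed
// power-of-two resolutions, each wheel 16x coarser than the last.
// A timer lands in the coarsest wheel its delay requires, hashed to a
// bucket by shifting and masking its deadline (O(1), no tree needed).
final class TimerWheels {
    static final int BUCKETS = 16;               // buckets per wheel
    static final int[] SHIFTS = {0, 4, 8, 12};   // ticks per bucket = 2^shift

    // Returns {wheel, bucket} for a timer with the given delay/deadline.
    static int[] locate(long delayTicks, long deadlineTicks) {
        for (int wheel = SHIFTS.length - 1; wheel > 0; wheel--) {
            if ((delayTicks >>> SHIFTS[wheel]) > 0) {
                int bucket = (int) ((deadlineTicks >>> SHIFTS[wheel]) & (BUCKETS - 1));
                return new int[] { wheel, bucket };
            }
        }
        return new int[] { 0, (int) (deadlineTicks & (BUCKETS - 1)) };
    }
}
```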

32 of 32

Last Words

  • Decouple and break down problems to optimize them individually
  • Combine simple data structures for efficient, powerful features
  • Utilize these performance mantras
    • Don't do it
    • Do it, but don't do it again
    • Do it cheaper
    • Do it less
    • Do it later
    • Do it when they're not looking
    • Do it concurrently