Learning Relaxed Belady
for Content Delivery Network Caching
NSDI 2020
CDN Caching Goal: Minimize WAN Traffic
2
Edge cache
Requests
Miss
Hit
User
Requests
Wide Area Network (WAN)
traffic is expensive
miss bytes
total bytes
Key metric byte miss ratio:
Caching Remains Challenging
Heuristic-based algorithms (1965–): LRU, LRUK, GDSF, ARC, ...
ML-based adaptation of heuristics (2017–): UCB, LeCAR, ...
Belady’s MIN algorithm (1966)
3
Introducing Learning Relaxed Belady (LRB)
New approach: mimic Belady using machine learning
4
General Overview of our Approach
5
R
R
R
R
R
R
······
···
Now
Cache
R
Past information
ML architecture
Training data
Eviction candidates
Challenge 1: Past Information
6
R
R
R
R
R
R
······
···
Now
Cache
R
What past information to use?
Past information
ML architecture
Training data
Eviction candidates
More data improves training but increases mem overhead
Challenge 2: Generate Online Training Data
7
R
R
R
R
R
R
······
···
Now
Cache
R
What past information to use?
Generate online training data?
Past information
ML architecture
Training data
Eviction candidates
Challenge 3: ML Architecture
8
R
R
R
R
R
R
······
···
Now
Cache
R
What past information to use?
Generate online training data?
What ML architecture to select?
Past information
ML architecture
Training data
Eviction candidates
Large design space: �features, model, prediction target, loss function
Challenge 4: Eviction Candidates
9
R
R
R
R
R
R
······
···
Now
Cache
R
Past information
ML architecture
Training data
Eviction candidates
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Challenge 5: Quickly Evaluate Design Decisions
10
R
R
R
R
R
R
······
···
Now
Cache
R
End-to-end evaluation: days
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Past information
ML architecture
Training data
Eviction candidates
Solutions: Relaxed Belady Algorithm & Good Decision Ratio
11
End-to-end evaluation: days
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Relaxed Belady algorithm
Good decision ratio
Solutions: Relaxed Belady Algorithm & Good Decision Ratio
12
Relaxed Belady algorithm
Good decision ratio: mins
End-to-end evaluation: days
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Solutions: Relaxed Belady Algorithm & Good Decision Ratio
13
End-to-end evaluation: days
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Relaxed Belady algorithm
Good decision ratio: mins
Challenge: Hard to Mimic Belady (Oracle) Algorithm
Mimicking exact Belady is impractical
Belady: evict object with next access farthest in the future
14
Cache
(now)
A
······
B
C
D
Time to next request
D
B
A
C
······
······
Evict
Introducing the Relaxed Belady Algorithm
Observation: many objects are good candidates for eviction
Relaxed Belady evicts an objects beyond boundary
15
Cache
(now)
A
······
B
C
D
Time to next request
D
B
A
C
······
······
Evict
Belady boundary
Good Decision Ratio: Directly Measures Eviction Decisions
Insight: relaxed Belady enables evaluating eviction decisions
Good decision ratio:
16
Bad eviction decision
evicted obj’s next access < boundary
# good eviction decisions
# total eviction decisions
Good eviction decision
evicted obj’s next access > boundary
Belady boundary
Time to next request
Challenge 5: Quickly Evaluate Design Decisions
17
R
R
R
R
R
R
······
···
Now
Cache
R
End-to-end evaluation: days
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Past information
ML architecture
Training data
Eviction candidates
Evaluate Design Decisions w/o Simulation
Evaluate designs on log using good decision ratio in minutes
18
ML architecture
Training data
Eviction candidates
Log
Simulate once,
reuse for many designs
Challenge 1: Past Information
19
R
R
R
R
R
R
······
···
ML architecture
Training data
Eviction candidates
More data improves training but increases mem overhead
Past information
What past information to use?
Now
Cache
R
Track Objects within a Sliding Memory Window
Per object features
Only track objects within memory window
Window size is LRB’s main hyperparameter
20
R
R
R
R
R
R
······
···
Now
R
Sliding memory window mimics Belady boundary
Challenge 2: Training Data
21
R
R
R
R
R
R
······
···
Now
Cache
R
What past information to use?
Generate online training data?
Past information
ML architecture
Training data
Eviction candidates
Sample Training Data & Label on Access or Boundary
Per object features
Unlabeled training data
Labeled training data
22
R
R
R
R
R
R
······
···
Now
R
Sliding memory window
Sample
Past memory window
Access
Challenge 3: ML Architecture
23
Large potential design space
R
R
R
R
R
R
······
···
Now
Cache
R
What past information to use?
Generate online training data?
What ML architecture to select?
Past information
ML architecture
Training data
Eviction candidates
Solution 3: Feature & Model Selection
Gradient boosting decision trees
Lightweight & high good decision ratio
Training ~300 ms, prediction ~30 us
Use good decision ratio to evaluate new designs
24
Features |
Object size |
Object type |
Inter-request distances (recency) |
Exponential decay counters (long-term frequencies) |
Challenge 4: Eviction Candidates
25
R
R
R
R
R
R
······
···
Now
Cache
R
Past information
ML architecture
Training data
Eviction candidates
How to select evict candidates?
What past information to use?
Generate online training data?
What ML architecture to select?
Solution 4: Random Sampling for Eviction
Can mimic relaxed Belady if we can�find 1 object beyond the boundary
k=64 candidates; more does not improve �good decision ratio
26
R
R
R
R
R
R
······
···
Now
Cache
R
Past information
Random k candidates
Learning Relaxed Belady
27
Label
Labeled dataset
Sample
Unlabeled dataset
Now
Cache
R
R
R
R
R
R
R
···
···
Memory window
R
R
Train
Model
Eviction Candidates
···
Sample
Predict
Evict
Implementation
28
Evaluation Setup
29
LRB Reduces WAN Traffic
30
20% traffic reduction over B-LRU
10% reduction over the best SOA
Wikipedia trace
Industry standard
LRB Consistently Improves on the State of the Art
31
CDN-B1
CDN-B3
CDN-B2
Wikipedia
CDN-A1
CDN-A2
LRB Overhead Is Modest
Throughput: 11.7 Gbps vs 11.7 Gbps (unmodified)
Memory overhead=1‒3% cache size
Peak CPU: 16% vs 9% (unmodified)
32
Edge cache
Requests
User
Conclusion
→ Simplifies machine learning & reduces system overhead
→ Good decision ratio enables fast design evaluation & design iteration
Code & Wikipedia trace: https://github.com/sunnyszy/lrb
33
34
Learning Relaxed Belady
35
Label
Labeled dataset
Sample
Unlabeled dataset
Now
Cache
R
R
R
R
R
R
R
···
···
Memory window
R
R
Train
Model
Eviction Candidates
···
Sample
Predict
Evict