1 of 35

Learning Relaxed Belady

for Content Delivery Network Caching

NSDI 2020

2 of 35

CDN Caching Goal: Minimize WAN Traffic

2

Edge cache

Requests

Miss

Hit

User

Requests

Wide Area Network (WAN)

traffic is expensive

miss bytes

total bytes

Key metric byte miss ratio:

3 of 35

Caching Remains Challenging

Heuristic-based algorithms (1965–): LRU, LRUK, GDSF, ARC, ...

  • Work well for some workloads, but work poorly for other

ML-based adaptation of heuristics (2017–): UCB, LeCAR, ...

  • Also work well for some workloads, but poorly for others

Belady’s MIN algorithm (1966)

  • Oracle: requires future knowledge
  • Large gap in byte miss ratio between state-of-the-art and Belady:
  • 20–40% on production traces

3

4 of 35

Introducing Learning Relaxed Belady (LRB)

New approach: mimic Belady using machine learning

4

5 of 35

General Overview of our Approach

5

R

R

R

R

R

R

······

···

Now

Cache

R

Past information

ML architecture

Training data

Eviction candidates

6 of 35

Challenge 1: Past Information

6

R

R

R

R

R

R

······

···

Now

Cache

R

What past information to use?

Past information

ML architecture

Training data

Eviction candidates

More data improves training but increases mem overhead

7 of 35

Challenge 2: Generate Online Training Data

7

R

R

R

R

R

R

······

···

Now

Cache

R

What past information to use?

Generate online training data?

Past information

ML architecture

Training data

Eviction candidates

8 of 35

Challenge 3: ML Architecture

8

R

R

R

R

R

R

······

···

Now

Cache

R

What past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

Large design space: �features, model, prediction target, loss function

9 of 35

Challenge 4: Eviction Candidates

9

R

R

R

R

R

R

······

···

Now

Cache

R

Past information

ML architecture

Training data

Eviction candidates

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

10 of 35

Challenge 5: Quickly Evaluate Design Decisions

10

R

R

R

R

R

R

······

···

Now

Cache

R

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

11 of 35

Solutions: Relaxed Belady Algorithm & Good Decision Ratio

11

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

Relaxed Belady algorithm

Good decision ratio

12 of 35

Solutions: Relaxed Belady Algorithm & Good Decision Ratio

12

Relaxed Belady algorithm

Good decision ratio: mins

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

13 of 35

Solutions: Relaxed Belady Algorithm & Good Decision Ratio

13

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

Relaxed Belady algorithm

Good decision ratio: mins

14 of 35

Challenge: Hard to Mimic Belady (Oracle) Algorithm

Mimicking exact Belady is impractical

  • Need predictions for all objects → prohibitive computational cost
  • Need exact prediction of next access → further prediction are harder

Belady: evict object with next access farthest in the future

14

Cache

(now)

A

······

B

C

D

Time to next request

D

B

A

C

······

······

Evict

15 of 35

Introducing the Relaxed Belady Algorithm

Observation: many objects are good candidates for eviction

Relaxed Belady evicts an objects beyond boundary

  • Do not need predictions for all objects → reasonable computation
  • No need to differentiate beyond boundary → simplifies the prediction

15

Cache

(now)

A

······

B

C

D

Time to next request

D

B

A

C

······

······

Evict

Belady boundary

16 of 35

Good Decision Ratio: Directly Measures Eviction Decisions

Insight: relaxed Belady enables evaluating eviction decisions

Good decision ratio:

16

Bad eviction decision

evicted obj’s next access < boundary

# good eviction decisions

# total eviction decisions

Good eviction decision

evicted obj’s next access > boundary

Belady boundary

Time to next request

17 of 35

Challenge 5: Quickly Evaluate Design Decisions

17

R

R

R

R

R

R

······

···

Now

Cache

R

End-to-end evaluation: days

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

18 of 35

Evaluate Design Decisions w/o Simulation

Evaluate designs on log using good decision ratio in minutes

18

ML architecture

Training data

Eviction candidates

Log

Simulate once,

reuse for many designs

19 of 35

Challenge 1: Past Information

19

R

R

R

R

R

R

······

···

ML architecture

Training data

Eviction candidates

More data improves training but increases mem overhead

Past information

What past information to use?

Now

Cache

R

20 of 35

Track Objects within a Sliding Memory Window

Per object features

Only track objects within memory window

Window size is LRB’s main hyperparameter

20

R

R

R

R

R

R

······

···

Now

R

Sliding memory window mimics Belady boundary

21 of 35

Challenge 2: Training Data

21

R

R

R

R

R

R

······

···

Now

Cache

R

What past information to use?

Generate online training data?

Past information

ML architecture

Training data

Eviction candidates

22 of 35

Sample Training Data & Label on Access or Boundary

Per object features

Unlabeled training data

Labeled training data

22

R

R

R

R

R

R

······

···

Now

R

Sliding memory window

Sample

Past memory window

Access

23 of 35

Challenge 3: ML Architecture

23

Large potential design space

R

R

R

R

R

R

······

···

Now

Cache

R

What past information to use?

Generate online training data?

What ML architecture to select?

Past information

ML architecture

Training data

Eviction candidates

24 of 35

Solution 3: Feature & Model Selection

Gradient boosting decision trees

Lightweight & high good decision ratio

Training ~300 ms, prediction ~30 us

Use good decision ratio to evaluate new designs

24

Features

Object size

Object type

Inter-request distances

(recency)

Exponential decay counters (long-term frequencies)

25 of 35

Challenge 4: Eviction Candidates

25

R

R

R

R

R

R

······

···

Now

Cache

R

Past information

ML architecture

Training data

Eviction candidates

How to select evict candidates?

What past information to use?

Generate online training data?

What ML architecture to select?

26 of 35

Solution 4: Random Sampling for Eviction

Can mimic relaxed Belady if we can�find 1 object beyond the boundary

k=64 candidates; more does not improve �good decision ratio

26

R

R

R

R

R

R

······

···

Now

Cache

R

Past information

Random k candidates

27 of 35

Learning Relaxed Belady

27

Label

Labeled dataset

Sample

Unlabeled dataset

Now

Cache

R

R

R

R

R

R

R

···

···

Memory window

R

R

Train

Model

Eviction Candidates

···

Sample

Predict

Evict

28 of 35

Implementation

  • Simulator implementation
    • LRB + 14 other algorithms
  • Prototype implementation
    • C++ on top of production system (Apache Traffic Server)
    • Many optimizations

28

29 of 35

Evaluation Setup

  • Q1: Learning Relaxed Belady (LRB) traffic reduction vs state-of-the-art
  • Q2: overhead of LRB vs CDN production system
  • Traces: 6 production traces from 3 CDNs
  • Hyperparameter (memory window/model/...) tuned on 20% of trace

29

30 of 35

LRB Reduces WAN Traffic

30

20% traffic reduction over B-LRU

10% reduction over the best SOA

Wikipedia trace

Industry standard

31 of 35

LRB Consistently Improves on the State of the Art

31

CDN-B1

CDN-B3

CDN-B2

Wikipedia

CDN-A1

CDN-A2

32 of 35

LRB Overhead Is Modest

Throughput: 11.7 Gbps vs 11.7 Gbps (unmodified)

Memory overhead=1‒3% cache size

Peak CPU: 16% vs 9% (unmodified)

32

Edge cache

Requests

User

33 of 35

Conclusion

  • LRB reduces WAN traffic with modest overhead
  • Key insight: relaxed Belady

→ Simplifies machine learning & reduces system overhead

Good decision ratio enables fast design evaluation & design iteration

Code & Wikipedia trace: https://github.com/sunnyszy/lrb

33

34 of 35

34

35 of 35

Learning Relaxed Belady

35

Label

Labeled dataset

Sample

Unlabeled dataset

Now

Cache

R

R

R

R

R

R

R

···

···

Memory window

R

R

Train

Model

Eviction Candidates

···

Sample

Predict

Evict