1 of 43

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran,

David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

Accelerating Long-Latency Load Requests

via Perceptron-Based Off-Chip Load Prediction

2 of 43

The Key Problem

Long-latency off-chip load requests often block instruction retirement from the Reorder Buffer (ROB), limiting performance.

3 of 43

Traditional Solutions

  1. Employ sophisticated prefetchers
  2. Increase the size of on-chip caches

4 of 43

Key Observations

Observation 1

Nearly 50% of the off-chip requests in a no-prefetching system still go to main memory even in the presence of a state-of-the-art prefetcher.

70% of these off-chip loads block the ROB.

5 of 43

Key Observations

Observation 2

On-chip cache access latency significantly contributes to total off-chip load latency.

40% of the stalls caused by an off-chip load can be reduced by removing on-chip cache access latency from its critical path.

[Figure: latency timeline. The baseline load serially traverses L1 → L2 → LLC → Main Memory; removing the on-chip cache walk from the critical path saves those cycles.]

6 of 43

And Things are Getting Worse…


7 of 43

Our Goal

Improve processor performance by removing on-chip cache access latency from the critical path of off-chip loads.

8 of 43

Hermes: Key Idea

  • Predict which load requests might go off-chip
  • Start fetching data directly from main memory while concurrently accessing the cache hierarchy

9 of 43

Hermes Overview

[Figure: baseline. A load traverses Core → L1-D → L2 → LLC → MC → off-chip Main Memory; once the load's latency exceeds the ROB's latency tolerance limit, the processor stalls.]

10 of 43

Hermes Overview

[Figure: baseline vs. Hermes timelines. In the baseline, the load serially traverses L1 → L2 → LLC → Main Memory, and the processor stalls once the ROB's latency tolerance limit is exceeded. With Hermes, the off-chip predictor (POPET) predicts at load-queue (LQ) allocation, a Hermes request is issued directly to main memory while the cache hierarchy is accessed concurrently, and the predictor is trained at LQ release, saving stall cycles.]
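The overlap shown above can be expressed as a minimal latency model. This is a sketch under assumptions: all cycle counts and function names are illustrative (the 6-cycle issue latency matches the Hermes-Optimistic configuration used later).

```cpp
#include <algorithm>
#include <cassert>

// Baseline: an off-chip load serially pays the L1, L2, and LLC lookup
// latencies before the request finally reaches main memory.
int baseline_offchip_latency(int l1, int l2, int llc, int dram) {
    return l1 + l2 + llc + dram;
}

// Hermes: a speculative request goes straight to the memory controller at
// prediction time while the normal cache walk proceeds concurrently; the
// load completes when the earlier of the two paths returns the data.
int hermes_offchip_latency(int issue, int l1, int l2, int llc, int dram) {
    int direct = issue + dram;          // predicted-off-chip fast path
    int walk   = l1 + l2 + llc + dram;  // regular demand path
    return std::min(direct, walk);
}
```

With, say, a 5/15/40-cycle hierarchy and 200-cycle DRAM, the saved stall cycles equal the on-chip walk latency minus the issue latency.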

11 of 43

Designing The Off-Chip Predictor

Tracking cache contents:
  • Large metadata: metadata size increases with cache hierarchy size
  • Must track every cache operation, which gets tricky depending on the cache hierarchy configuration (e.g., inclusivity, bypassing, …)

Learning from program behavior:
  • Correlates different program features with off-chip loads
  • Lower storage overhead
  • Lower design complexity

12 of 43

POPET: Perceptron-Based Off-Chip Predictor

  • Multi-feature hashed perceptron model [1]
    • Each feature has its own weight table
      • Stores the correlation between a feature value and the off-chip prediction

[1] D. Tarjan and K. Skadron, “Merging Path and Gshare Indexing in Perceptron Branch Prediction,” TACO, 2005

13 of 43

Predicting using POPET

  • Using simple table lookups, addition, and comparison

[Figure: prediction example. Each feature value (e.g., PC 0x7ffe0 with cacheline offset 12) indexes its own weight table; the selected weights (-4, +12, -5) sum to 3, which meets the activation threshold (3 >= -2), so the load is predicted to go off-chip.]
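The prediction step can be sketched as below. The feature count, table size, hash, and weight values are illustrative assumptions; only the "sum the selected weights and compare against an activation threshold" logic and the -2 threshold come from the slide.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kNumFeatures = 3;           // POPET uses several program features
constexpr int kTableSize = 128;           // per-feature weight table entries
constexpr int kActivationThreshold = -2;  // as in the slide's example

using WeightTable = std::array<int, kTableSize>;

// Stand-in hash; the real design hashes each feature into its own table.
int hash_feature(uint64_t value) {
    return static_cast<int>(value % kTableSize);
}

// Simple table lookups, addition, and a comparison: sum the weight each
// feature selects, then predict off-chip iff the sum clears the threshold.
bool predict_offchip(const std::array<WeightTable, kNumFeatures>& tables,
                     const std::array<uint64_t, kNumFeatures>& features) {
    int sum = 0;
    for (int f = 0; f < kNumFeatures; ++f)
        sum += tables[f][hash_feature(features[f])];
    return sum >= kActivationThreshold;
}
```

With selected weights -4, +12, and -5 the sum is 3, and 3 >= -2, so the load is predicted to go off-chip.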

14 of 43

Training POPET

  • Using simple increment or decrement of feature weights

[Figure: training example. At load-queue release, the actual outcome is compared with the prediction; each feature's selected weight is incremented if the load went off-chip and decremented otherwise.]
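The training step can be sketched the same way; the saturation limits below are an assumed detail, not values from the slides.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kTableSize = 128;
constexpr int kWeightMax = 15;   // assumed saturation limits
constexpr int kWeightMin = -16;

using WeightTable = std::array<int, kTableSize>;

int hash_feature(uint64_t value) {
    return static_cast<int>(value % kTableSize);
}

// At load-queue release the true outcome is known: nudge the weight the
// feature selected toward the observed label, saturating so that one
// feature value cannot grow unboundedly.
void train_feature(WeightTable& table, uint64_t feature, bool went_offchip) {
    int& w = table[hash_feature(feature)];
    if (went_offchip && w < kWeightMax) ++w;
    else if (!went_offchip && w > kWeightMin) --w;
}
```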

15 of 43

Evaluation

16 of 43

Simulation Methodology

  • ChampSim trace-driven simulator
  • 110 single-core memory-intensive traces
    • SPEC CPU 2006 and 2017
    • PARSEC 2.1
    • Ligra
    • Real-world applications
  • 220 eight-core homogeneous and heterogeneous trace mixes

Off-Chip Predictors
  • HMP [Yoaz+, ISCA’99]
  • Address Tag-Tracking based Predictor (TTP)

LLC Prefetchers
  • Pythia [Bera+, MICRO’21]
  • Bingo [Bakhshalipour+, HPCA’19]
  • MLOP [Shakerinava+, 3rd Prefetching Championship ’19]
  • SPP + Perceptron filter [Bhatia+, ISCA’19]
  • SMS [Somogyi+, ISCA’06]

17 of 43

Simulation Methodology

Hermes request issue latency: Hermes-Optimistic assumes 6 cycles; Hermes-Pessimistic assumes 18 cycles.

18 of 43

Performance with Varying Memory Bandwidth

[Figure: speedup of Pythia, Hermes, and Pythia+Hermes over a no-prefetching baseline across main-memory bandwidth configurations, from bandwidth-constrained points up to configurations resembling an AMD Threadripper 3990x (Zen 2, 64C/4ch), an AMD EPYC Rome 7702P (Zen 2, 64C/8ch), and an Intel Xeon 6258R (Cascade Lake, 28C/6ch); labeled gains include 20.3%, 11.5%, 5.4%, and 6.2%.]

Hermes alone provides nearly 50% of the performance benefit of Pythia.

In extremely bandwidth-constrained configurations, Hermes alone outperforms Pythia.

Pythia+Hermes outperforms Pythia across a wide range of bandwidth configurations.

19 of 43

Bandwidth Efficiency

[Figure: performance gain vs. increase in main-memory requests. Pythia gains 20% for 38.5% more requests; Hermes alone gains 11% for 5.5% more; Hermes on top of Pythia adds 5% for 5.9% more.]

For every 1% performance benefit, main memory requests increase by:
  • Pythia: 2%
  • Hermes on top of Pythia: 1%
  • Hermes alone: 0.5%
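The per-1% figures follow from dividing each configuration's traffic increase by its performance gain; note the pairing of the slide's numbers is inferred from context.

```cpp
#include <cassert>
#include <cmath>

// Extra main-memory traffic incurred per 1% of speedup.
double traffic_per_percent_speedup(double traffic_increase_pct,
                                   double perf_gain_pct) {
    return traffic_increase_pct / perf_gain_pct;
}
```

Pythia: 38.5 / 20 ≈ 2. Hermes on top of Pythia: 5.9 / 5 ≈ 1. Hermes alone: 5.5 / 11 = 0.5.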

20 of 43

Performance with Varying Baseline Prefetcher

[Figure: Hermes's speedup on top of different baseline prefetchers, with gains of 5.4%, 6.2%, 5.1%, 7.6%, and 7.7%.]

Hermes consistently improves performance on top of a wide range of baseline prefetchers.

21 of 43

Overhead of Hermes

  • 4 KB storage overhead
  • 1.5% power overhead

On top of an Intel Alder Lake-like performance-core configuration [2]

[2] https://www.anandtech.com/show/16881/a-deep-dive-into-intels-alder-lake-microarchitectures/3

22 of 43

More in the Paper

  • Wide range of sensitivity studies by varying
    • Cache hierarchy access latency
    • Hermes request issue latency
    • Activation threshold

  • Comparison of POPET’s accuracy and coverage against HMP and TTP

  • Understanding usefulness of each program feature

  • Hermes’s effect on stall cycle reduction

  • Performance analysis in eight-core systems


24 of 43

To Summarize…

25 of 43

Summary

Hermes advocates off-chip load prediction, a different form of speculation than the load address prediction performed by prefetchers.

Off-chip load prediction can be applied by itself, or combined with load address prediction, to improve performance.

26 of 43

Summary

Hermes…

Identifies 74% of the off-chip loads that miss a 4.5 MB cache hierarchy with 77% accuracy, using only 4 KB of storage overhead, and provides a 5.4% performance gain.

27 of 43

Hermes is Open-Sourced

  • All workload traces
  • 13 prefetchers
  • 9 off-chip predictors

28 of 43

Easy To Define Your Own Off-Chip Predictor

  • Just extend the OffchipPredBase class

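A minimal sketch of that extension point: `OffchipPredBase` is the class named on the slide, but the member names, signatures, and the toy policy below are illustrative assumptions, not the repo's exact interface.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the simulator's per-load metadata.
struct LoadInfo {
    uint64_t pc;
    uint64_t vaddr;
};

// Base class to extend, as named on the slide (interface assumed).
class OffchipPredBase {
public:
    virtual ~OffchipPredBase() = default;
    virtual bool predict(const LoadInfo& load) = 0;
    virtual void train(const LoadInfo& load, bool went_offchip) = 0;
};

// Toy predictor purely to show the extension point: flags a load as
// going off-chip whenever the low PC bits are zero.
class ToyPredictor : public OffchipPredBase {
public:
    bool predict(const LoadInfo& load) override {
        return (load.pc & 0xF) == 0;
    }
    void train(const LoadInfo&, bool) override {}  // stateless toy
};
```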

29 of 43

Easy To Define Your Own Off-Chip Predictor

  • Define your own train() and predict() functions

  • Get statistics like precision (aka accuracy) and recall (aka coverage) out of the box

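Precision and recall, as the slides use the terms, can be computed from three counters; this is a sketch of what such out-of-the-box statistics track, with names assumed.

```cpp
#include <cassert>
#include <cmath>

// precision ("accuracy"): of the loads predicted off-chip, the fraction
// that truly went off-chip; recall ("coverage"): of the loads that went
// off-chip, the fraction that were predicted.
struct PredStats {
    long tp = 0, fp = 0, fn = 0;  // true pos, false pos, false neg
    void record(bool predicted_offchip, bool went_offchip) {
        if (predicted_offchip && went_offchip) ++tp;
        else if (predicted_offchip && !went_offchip) ++fp;
        else if (!predicted_offchip && went_offchip) ++fn;
    }
    double precision() const { return tp + fp ? double(tp) / (tp + fp) : 0.0; }
    double recall()    const { return tp + fn ? double(tp) / (tp + fn) : 0.0; }
};
```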

30 of 43

Off-Chip Prediction Can Further Enable…

Prioritizing loads that are likely to go off-chip in cache queues and in on-chip network routing

Better instruction scheduling of data-dependent instructions

…and many more

31 of 43

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran,

David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

Accelerating Long-Latency Load Requests

via Perceptron-Based Off-Chip Load Prediction

32 of 43

BACKUP

33 of 43

Key Observations

Observation 2

On-chip cache access latency significantly contributes to total off-chip load latency.

40% of the stalls caused by an off-chip load can be reduced by removing on-chip cache access latency from its critical path.

34 of 43

The Problem and Its Traditional Solutions

Key Problem

Long-latency off-chip load requests significantly limit performance by stalling the processor.

Traditional Solutions

  1. Employ sophisticated data prefetchers
  2. Increase the size of the on-chip cache hierarchy

35 of 43

Observation: Not All Off-Chip Loads are Prefetched

Nearly 50% of the off-chip loads are still not prefetched.

36 of 43

Observation: Not All Off-Chip Loads are Prefetched

70% of these off-chip loads block the ROB.

37 of 43

Observation: With Large Cache Comes Longer Latency

  • On-chip cache access latency significantly contributes to the latency of an off-chip load

[Figure: breakdown of off-chip load latency, highlighting the on-chip cache hierarchy access latency component.]

38 of 43

Observation: With Large Cache Comes Longer Latency

  • On-chip cache access latency significantly contributes to the latency of an off-chip load

[Figure: breakdown of off-chip load latency, highlighting the on-chip cache hierarchy access latency component.]

40% of the stall cycles caused by an off-chip load can be eliminated by removing on-chip cache access latency from its critical path.

39 of 43

Bandwidth Efficiency

[Figure: performance gain vs. increase in main-memory requests. Pythia gains 20% for 38.5% more requests; Hermes alone gains 11% for 5.5% more; Hermes on top of Pythia adds 5% for 5.9% more.]

For every 1% performance benefit:
  • Pythia increases main memory requests by 2%
  • Hermes on top of Pythia increases main memory requests by 1%
  • Hermes alone increases main memory requests by 0.5%

40 of 43

Key Problem

Long-latency load requests that go off-chip significantly limit a processor’s performance.

[Figure: load path: Core → L1-D → L2 → LLC → MC → off-chip Main Memory.]


42 of 43

Executive Summary

The problem
Long-latency off-chip load requests limit a processor’s performance by stalling the core.

How is it addressed today?
  • Employ sophisticated data prefetchers
  • Increase the size of the on-chip cache hierarchy

Key observations
  • Nearly 50% of the off-chip loads are not prefetched even by a sophisticated state-of-the-art prefetcher
  • 40% of the stall cycles caused by an off-chip load can be eliminated by removing on-chip cache access latency from its critical path

Our goal
Improve processor performance by removing on-chip cache access latency from the critical path of an off-chip load.

43 of 43

Executive Summary

Hermes: Key idea
  • Predict which load requests will go off-chip
  • Start fetching data directly from main memory while concurrently accessing the cache hierarchy

Key mechanism
POPET: a perceptron-based off-chip load predictor that learns to accurately predict off-chip loads using multiple program features

Key results
  • Evaluated using a wide range of workloads from SPEC CPU, PARSEC, Ligra, and real-world applications
  • Identifies 74% of off-chip loads with 77% accuracy
  • Improves performance by 5.4%, 5.1%, and 6.2% in single-core, eight-core, and bandwidth-limited configurations, respectively
  • Incurs only 4 KB of storage and 1.5% power overhead per core