Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran,
David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu
Accelerating Long-Latency Load Requests
via Perceptron-Based Off-Chip Load Prediction
The Key Problem
Long-latency off-chip load requests
Often block instruction retirement from the
Reorder Buffer (ROB)
Limit performance
Traditional Solutions
1. Employ sophisticated prefetchers
2. Increase size of on-chip caches
Key Observations
Nearly 50% of the off-chip requests
in a no-prefetching system
still go to main memory
even in the presence of a state-of-the-art prefetcher
Observation 1
70% of these off-chip loads block the ROB
Key Observations
40% of the stalls caused by an off-chip load
can be reduced by removing on-chip cache access latency from its critical path
On-chip cache access latency significantly contributes to total off-chip load latency
Observation 2
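[Figure: latency timelines of an off-chip load. First timeline: L1, L2, LLC, then Main Memory. Second timeline: the same stages with the on-chip cache access latency taken off the critical path, marked as saved cycles]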
And Things are Getting Worse…
Improve processor performance
by removing on-chip cache access latency
from the critical path of off-chip loads
Our Goal
Predicts which load requests might go off-chip
Starts fetching data directly from main memory while concurrently accessing the cache hierarchy
Hermes Overview
[Figure: baseline. Datapath: Core, L1-D, L2, LLC, MC, off-chip Main Memory. Timeline: L1, L2, LLC, then Main Memory; the processor is stalled once the load exceeds the latency tolerance limit of the ROB]
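[Figure: Hermes. Datapath: Core, L1-D, L2, LLC, MC, off-chip Main Memory, with POPET added. Timelines: the baseline accesses L1, L2, LLC, then Main Memory, and the processor is stalled past the latency tolerance limit of the ROB; with Hermes, the Main Memory access starts while the caches are still being accessed, giving saved stall cycles]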
Predict @ LQ alloc
Issue Hermes request
Wait
Train @ LQ release
Off-chip predictor
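The four steps above (predict at load-queue allocation, issue a Hermes request, wait, train at load-queue release) can be sketched as a toy latency model. This is only an illustration under stated assumptions: every cycle count below is made up, `LastOutcomePredictor` is a trivial stand-in rather than POPET, and the names `load_latency` and `process_load` are hypothetical.

```python
# Illustrative latencies in cycles; none of these values come from the slides.
ON_CHIP_LOOKUP = 5 + 15 + 40   # assumed L1 + L2 + LLC lookup time
MAIN_MEMORY = 150              # assumed memory controller + DRAM time
HERMES_ISSUE = 6               # assumed delay before a Hermes request reaches the MC


def load_latency(goes_off_chip: bool, predicted_off_chip: bool) -> int:
    """Latency of one load in this toy model."""
    if not goes_off_chip:
        # Simplification: every on-chip hit is charged the full lookup time.
        # A wrongly issued Hermes request only costs memory bandwidth here.
        return ON_CHIP_LOOKUP
    if predicted_off_chip:
        # The memory access overlaps the cache lookups.
        return max(ON_CHIP_LOOKUP, HERMES_ISSUE + MAIN_MEMORY)
    # No Hermes request: cache lookups first, then memory (baseline behavior).
    return ON_CHIP_LOOKUP + MAIN_MEMORY


class LastOutcomePredictor:
    """Stand-in predictor (not POPET): remembers the last outcome per load PC."""

    def __init__(self):
        self.last = {}

    def predict(self, pc: int) -> bool:
        return self.last.get(pc, False)

    def train(self, pc: int, went_off_chip: bool) -> None:
        self.last[pc] = went_off_chip


def process_load(pc: int, goes_off_chip: bool, predictor) -> int:
    predicted = predictor.predict(pc)                  # 1. predict @ LQ alloc
    latency = load_latency(goes_off_chip, predicted)   # 2. issue Hermes request, 3. wait
    predictor.train(pc, goes_off_chip)                 # 4. train @ LQ release
    return latency


predictor = LastOutcomePredictor()
first = process_load(0x400, True, predictor)    # not yet predicted: baseline latency
second = process_load(0x400, True, predictor)   # predicted off-chip: overlapped latency
saved_cycles = first - second
```

With these assumed numbers the second access to the same load PC saves the cycles that the cache lookups would otherwise add in front of the memory access.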
Designing The Off-Chip Predictor
(e.g., inclusivity, bypassing,…)
Correlate different program features with off-chip loads
Tracking cache contents
Learning from program behavior
POPET: Perceptron-Based Off-Chip Predictor
[1] D. Tarjan and K. Skadron, “Merging Path and Gshare Indexing in Perceptron Branch Prediction,” TACO, 2005
Predicting using POPET
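[Figure: worked prediction example. Feature value 0x7ffe0+12; table entry 42; weights -4, 12, and -5 sum to 3; since 3 >= -2, the load is predicted to go off-chip]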
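A minimal sketch of this kind of perceptron prediction, assuming one weight table per program feature, in the style of hashed perceptron predictors such as [1]. The weights (-4, 12, -5), their sum (3), and the threshold (-2) are the numbers from the example; the table size, the modulo hash, the second and third feature values, and all function names are hypothetical.

```python
ACTIVATION_THRESHOLD = -2   # threshold used in the example (3 >= -2)
TABLE_SIZE = 128            # hypothetical number of weights per table


def table_index(feature_value: int) -> int:
    """Placeholder hash: map a feature value to a weight-table index."""
    return feature_value % TABLE_SIZE


def predict_off_chip(tables, feature_values):
    """Sum one weight per feature and compare against the threshold."""
    total = sum(
        table[table_index(value)]
        for table, value in zip(tables, feature_values)
    )
    return total >= ACTIVATION_THRESHOLD, total


# Three hypothetical features; only the first value appears in the example.
feature_values = [0x7ffe0 + 12, 0x1234, 0x42]
tables = [[0] * TABLE_SIZE for _ in feature_values]

# Place the example's weights where each feature value lands.
for table, value, weight in zip(tables, feature_values, [-4, 12, -5]):
    table[table_index(value)] = weight

predicted, total = predict_off_chip(tables, feature_values)
# total is -4 + 12 - 5 = 3, and 3 >= -2, so predicted is True
```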
Training POPET
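[Figure: the same worked example reused for training. Feature value 0x7ffe0+12; table entry 42; weights -4, 12, and -5; sum 3; threshold check 3 >= -2]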
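The slide text does not spell out the update rule, so the sketch below assumes the common perceptron-predictor rule: update the weights on a misprediction or when the sum is not confidently far from zero, moving them up if the load really went off-chip and down otherwise. The weight range, the training threshold, and the function name are hypothetical; only the weights and the sum come from the example.

```python
WEIGHT_MIN, WEIGHT_MAX = -16, 15   # hypothetical saturating weight range
TRAINING_THRESHOLD = 10            # hypothetical confidence threshold


def train(weights, total, predicted_off_chip, went_off_chip):
    """Return updated copies of the weights that produced `total`."""
    correct = predicted_off_chip == went_off_chip
    if correct and abs(total) > TRAINING_THRESHOLD:
        return list(weights)       # confident and correct: no update
    step = 1 if went_off_chip else -1
    return [min(WEIGHT_MAX, max(WEIGHT_MIN, w + step)) for w in weights]


weights = [-4, 12, -5]             # weights from the example, sum = 3

# Predicted off-chip (3 >= -2) but the load hit on-chip: decrement.
after_mispredict = train(weights, 3, True, False)   # [-5, 11, -6]

# Predicted off-chip and correct, but the sum is small: reinforce.
after_weak_hit = train(weights, 3, True, True)      # [-3, 13, -4]
```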
Evaluation
Simulation Methodology
Off-Chip Predictors
LLC Prefetchers
Hermes-Optimistic: 6 cycles
Hermes-Pessimistic: 18 cycles
Performance with Varying Memory Bandwidth
~AMD Threadripper 3990x (Zen 2, 64C/4ch)
~AMD EPYC Rome 7702P (Zen 2, 64C/8ch)
~Intel Xeon 6258R (Cascade Lake, 28C/6ch)
[Figure: performance of Pythia, Hermes, and Pythia+Hermes across memory bandwidth configurations; labeled values: 20.3%, 11.5%, 5.4%, 6.2%]
Hermes alone provides nearly 50% of the performance benefit of Pythia
In extremely bandwidth-constrained configurations,
Hermes alone outperforms Pythia
Hermes+Pythia outperforms Pythia
in a wide range of bandwidth configurations
Bandwidth Efficiency
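Performance improvement: Pythia 20%, Hermes 11%, Hermes on top of Pythia 5%
Increase in main memory requests: Pythia 38.5%, Hermes 5.5%, Hermes on top of Pythia 5.9%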
For every 1% performance benefit,
main memory requests increase by
Pythia: 2%
Hermes on top of Pythia: 1%
Hermes alone: 0.5%
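The per-1% figures can be checked with a short calculation. The pairing of the six percentages on this slide into (performance improvement, increase in main memory requests) per configuration is an assumption made here; it is the pairing under which the ratios come out close to the rounded 2%, 1%, and 0.5%.

```python
# (performance improvement %, increase in main memory requests %)
# The pairing of values to configurations is assumed, not stated on the slide.
configs = {
    "Pythia": (20.0, 38.5),
    "Hermes on top of Pythia": (5.0, 5.9),
    "Hermes alone": (11.0, 5.5),
}


def requests_per_percent_gain(gain: float, extra_requests: float) -> float:
    """Extra main memory requests (%) paid per 1% of performance benefit."""
    return extra_requests / gain


ratios = {
    name: requests_per_percent_gain(gain, extra)
    for name, (gain, extra) in configs.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}% more requests per 1% performance benefit")
```

Under this pairing the ratios are roughly 1.9, 1.2, and 0.5, which round to the 2%, 1%, and 0.5% quoted above.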
Performance with Varying Baseline Prefetcher
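[Figure: performance improvement of Hermes on top of each baseline prefetcher; labeled values: 5.4%, 6.2%, 5.1%, 7.6%, 7.7%]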
Hermes consistently improves performance
on top of a wide range of baseline prefetchers
Overhead of Hermes
4 KB storage overhead
1.5% power overhead
On top of an Intel Alder Lake-like performance-core [2] configuration
[2] https://www.anandtech.com/show/16881/a-deep-dive-into-intels-alder-lake-microarchitectures/3
More in the Paper
To Summarize…
Summary
Hermes advocates off-chip load prediction,
a different form of speculation than
load address prediction by prefetchers
Off-chip load prediction can be applied by itself
or combined with load address prediction
to provide performance improvement
Summary
Hermes…
Identifies 74% of the loads
that miss a 4.5 MB cache hierarchy,
with 77% accuracy,
using only 4 KB storage overhead,
and provides 5.4% performance gain
Hermes is Open Sourced
All workload traces
13 prefetchers
9 off-chip predictors
Easy To Define Your Own Off-Chip Predictor
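The code shown on this slide did not survive extraction. As a rough illustration of what an off-chip predictor needs to provide, the sketch below uses a predict hook and a train hook. This is not the interface of the open-source Hermes code: the class names, method names, and the 2-bit counter scheme are all hypothetical.

```python
from abc import ABC, abstractmethod


class OffChipPredictor(ABC):
    """Hypothetical interface: one hook to predict, one hook to train."""

    @abstractmethod
    def predict(self, pc: int, address: int) -> bool:
        """Return True if the load is expected to go off-chip."""

    @abstractmethod
    def train(self, pc: int, address: int, went_off_chip: bool) -> None:
        """Update internal state with the load's true outcome."""


class PcCounterPredictor(OffChipPredictor):
    """Toy predictor: one 2-bit saturating counter per load PC."""

    def __init__(self):
        self.counters = {}

    def predict(self, pc: int, address: int) -> bool:
        return self.counters.get(pc, 0) >= 2

    def train(self, pc: int, address: int, went_off_chip: bool) -> None:
        count = self.counters.get(pc, 0)
        if went_off_chip:
            self.counters[pc] = min(3, count + 1)
        else:
            self.counters[pc] = max(0, count - 1)


predictor = PcCounterPredictor()
before = predictor.predict(0x400, 0x1000)   # False: nothing learned yet
predictor.train(0x400, 0x1000, True)
predictor.train(0x400, 0x1040, True)
after = predictor.predict(0x400, 0x1080)    # True: counter reached 2
```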
Off-Chip Prediction Can Further Enable…
Prioritizing loads that are likely to go off-chip
in cache queues and on-chip network routing
Better instruction scheduling
of data-dependent instructions
and many more…
BACKUP
Key Observations
40% of the stalls caused by
an off-chip load can be reduced by removing on-chip cache access latency from its critical path
On-chip cache access latency significantly contributes to total off-chip load latency
Observation 2
The Problem and Its Traditional Solutions
Long-latency off-chip load requests
significantly limit performance by stalling the processor
Key Problem
Traditional Solutions
Observation: Not All Off-Chip Loads are Prefetched
50%
1. Nearly 50% of the off-chip loads are still not prefetched
2. 70% of these off-chip loads block the ROB
Observation: With Large Cache Comes Longer Latency
58
On-chip cache hierarchy access latency
40% of stall cycles caused by an off-chip load can be eliminated
by removing on-chip cache access latency from its critical path
Bandwidth Efficiency
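Performance improvement: Pythia 20%, Hermes 11%, Hermes on top of Pythia 5%
Increase in main memory requests: Pythia 38.5%, Hermes 5.5%, Hermes on top of Pythia 5.9%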
For every 1% performance benefit
Pythia increases
main memory requests by 2%
Hermes on top of Pythia increases main memory requests by 1%
Hermes alone increases
main memory requests by 0.5%
Long-latency load requests
that go off-chip significantly limit
a processor’s performance
Key Problem
[Figure: datapath of a load request: Core, L1-D, L2, LLC, MC, off-chip Main Memory]
Executive Summary
Long-latency off-chip load requests limit processor performance
by stalling the core
The problem
How is it addressed today?
Key observations
Improve processor performance by removing on-chip cache access latency from the critical path of an off-chip load
Our goal
Hermes: Key idea
POPET – A perceptron-based off-chip load predictor
Key mechanism
Key results