1 of 43

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran,

David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

Accelerating Long-Latency Load Requests

via Perceptron-Based Off-Chip Load Prediction

2 of 43

The Key Problem

Long-latency off-chip load requests often block instruction retirement from the Reorder Buffer (ROB), limiting performance.

3 of 43

Traditional Solutions

  1. Employ sophisticated prefetchers
  2. Increase the size of on-chip caches

4 of 43

Key Observations

Observation 1

Nearly 50% of the off-chip requests in a no-prefetching system still go to main memory even in the presence of a state-of-the-art prefetcher.

70% of these off-chip loads block the ROB.

5 of 43

Key Observations

Observation 2

On-chip cache access latency significantly contributes to total off-chip load latency.

40% of the stalls caused by an off-chip load can be reduced by removing on-chip cache access latency from its critical path.

[Figure: latency timeline. The baseline load serially traverses L1 → L2 → LLC → Main Memory; removing the on-chip cache walk from the critical path saves those cycles.]

6 of 43

And Things are Getting Worse…


7 of 43

Our Goal

Improve processor performance by removing on-chip cache access latency from the critical path of off-chip loads.

8 of 43

Hermes: Key Idea

  • Predict which load requests might go off-chip
  • Start fetching data directly from main memory while concurrently accessing the cache hierarchy

9 of 43

Hermes Overview

[Figure: baseline. A load traverses Core → L1-D → L2 → LLC → MC → off-chip Main Memory; once the load's latency exceeds the ROB's latency tolerance limit, the processor stalls.]

10 of 43

Hermes Overview

[Figure: baseline vs. Hermes timelines. In the baseline, the load serially traverses L1 → L2 → LLC → Main Memory, and the processor stalls once the ROB's latency tolerance limit is exceeded. With Hermes, the off-chip predictor (POPET) predicts at load-queue (LQ) allocation, a Hermes request is issued directly to main memory while the cache hierarchy is accessed concurrently, and the predictor is trained at LQ release, saving stall cycles.]
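The overlap shown above can be expressed as a minimal latency model. This is a sketch under assumptions: all cycle counts and function names are illustrative (the 6-cycle issue latency matches the Hermes-Optimistic configuration used later).

```cpp
#include <algorithm>
#include <cassert>

// Baseline: an off-chip load serially pays the L1, L2, and LLC lookup
// latencies before the request finally reaches main memory.
int baseline_offchip_latency(int l1, int l2, int llc, int dram) {
    return l1 + l2 + llc + dram;
}

// Hermes: a speculative request goes straight to the memory controller at
// prediction time while the normal cache walk proceeds concurrently; the
// load completes when the earlier of the two paths returns the data.
int hermes_offchip_latency(int issue, int l1, int l2, int llc, int dram) {
    int direct = issue + dram;          // predicted-off-chip fast path
    int walk   = l1 + l2 + llc + dram;  // regular demand path
    return std::min(direct, walk);
}
```

With, say, a 5/15/40-cycle hierarchy and 200-cycle DRAM, the saved stall cycles equal the on-chip walk latency minus the issue latency.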

11 of 43

Designing The Off-Chip Predictor

Tracking cache contents:
  • Large metadata: metadata size increases with cache hierarchy size
  • Must track every cache operation, which gets tricky depending on the cache hierarchy configuration (e.g., inclusivity, bypassing, …)

Learning from program behavior:
  • Correlates different program features with off-chip loads
  • Lower storage overhead
  • Lower design complexity

12 of 43

POPET: Perceptron-Based Off-Chip Predictor

  • Multi-feature hashed perceptron model [1]
    • Each feature has its own weight table
      • Stores the correlation between a feature value and the off-chip prediction

[1] D. Tarjan and K. Skadron, “Merging Path and Gshare Indexing in Perceptron Branch Prediction,” TACO, 2005

13 of 43

Predicting using POPET

  • Using simple table lookups, addition, and comparison

[Figure: prediction example. Each feature value (e.g., PC 0x7ffe0 with cacheline offset 12) indexes its own weight table; the selected weights (-4, +12, -5) sum to 3, which meets the activation threshold (3 >= -2), so the load is predicted to go off-chip.]
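The prediction step can be sketched as below. The feature count, table size, hash, and weight values are illustrative assumptions; only the "sum the selected weights and compare against an activation threshold" logic and the -2 threshold come from the slide.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kNumFeatures = 3;           // POPET uses several program features
constexpr int kTableSize = 128;           // per-feature weight table entries
constexpr int kActivationThreshold = -2;  // as in the slide's example

using WeightTable = std::array<int, kTableSize>;

// Stand-in hash; the real design hashes each feature into its own table.
int hash_feature(uint64_t value) {
    return static_cast<int>(value % kTableSize);
}

// Simple table lookups, addition, and a comparison: sum the weight each
// feature selects, then predict off-chip iff the sum clears the threshold.
bool predict_offchip(const std::array<WeightTable, kNumFeatures>& tables,
                     const std::array<uint64_t, kNumFeatures>& features) {
    int sum = 0;
    for (int f = 0; f < kNumFeatures; ++f)
        sum += tables[f][hash_feature(features[f])];
    return sum >= kActivationThreshold;
}
```

With selected weights -4, +12, and -5 the sum is 3, and 3 >= -2, so the load is predicted to go off-chip.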

14 of 43

Training POPET

  • Using simple increment or decrement of feature weights

[Figure: training example. At load-queue release, the actual outcome is compared with the prediction; each feature's selected weight is incremented if the load went off-chip and decremented otherwise.]
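The training step can be sketched the same way; the saturation limits below are an assumed detail, not values from the slides.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kTableSize = 128;
constexpr int kWeightMax = 15;   // assumed saturation limits
constexpr int kWeightMin = -16;

using WeightTable = std::array<int, kTableSize>;

int hash_feature(uint64_t value) {
    return static_cast<int>(value % kTableSize);
}

// At load-queue release the true outcome is known: nudge the weight the
// feature selected toward the observed label, saturating so that one
// feature value cannot grow unboundedly.
void train_feature(WeightTable& table, uint64_t feature, bool went_offchip) {
    int& w = table[hash_feature(feature)];
    if (went_offchip && w < kWeightMax) ++w;
    else if (!went_offchip && w > kWeightMin) --w;
}
```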

15 of 43

Evaluation

16 of 43

Simulation Methodology

  • ChampSim trace-driven simulator
  • 110 single-core memory-intensive traces
    • SPEC CPU 2006 and 2017
    • PARSEC 2.1
    • Ligra
    • Real-world applications
  • 220 eight-core homogeneous and heterogeneous trace mixes

Off-Chip Predictors
  • HMP [Yoaz+, ISCA’99]
  • Address Tag-Tracking based Predictor (TTP)

LLC Prefetchers
  • Pythia [Bera+, MICRO’21]
  • Bingo [Bakhshalipour+, HPCA’19]
  • MLOP [Shakerinava+, 3rd Prefetching Championship ’19]
  • SPP + Perceptron filter [Bhatia+, ISCA’19]
  • SMS [Somogyi+, ISCA’06]

17 of 43

Simulation Methodology

Hermes request issue latency: Hermes-Optimistic assumes 6 cycles; Hermes-Pessimistic assumes 18 cycles.

18 of 43

Performance with Varying Memory Bandwidth

[Figure: speedup of Pythia, Hermes, and Pythia+Hermes over a no-prefetching baseline across main-memory bandwidth configurations, from bandwidth-constrained points up to configurations resembling an AMD Threadripper 3990x (Zen 2, 64C/4ch), an AMD EPYC Rome 7702P (Zen 2, 64C/8ch), and an Intel Xeon 6258R (Cascade Lake, 28C/6ch); labeled gains include 20.3%, 11.5%, 5.4%, and 6.2%.]

Hermes alone provides nearly 50% of the performance benefit of Pythia.

In extremely bandwidth-constrained configurations, Hermes alone outperforms Pythia.

Pythia+Hermes outperforms Pythia across a wide range of bandwidth configurations.

19 of 43

Bandwidth Efficiency

[Figure: performance gain vs. increase in main-memory requests. Pythia gains 20% for 38.5% more requests; Hermes alone gains 11% for 5.5% more; Hermes on top of Pythia adds 5% for 5.9% more.]

For every 1% performance benefit, main memory requests increase by:
  • Pythia: 2%
  • Hermes on top of Pythia: 1%
  • Hermes alone: 0.5%
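The per-1% figures follow from dividing each configuration's traffic increase by its performance gain; note the pairing of the slide's numbers is inferred from context.

```cpp
#include <cassert>
#include <cmath>

// Extra main-memory traffic incurred per 1% of speedup.
double traffic_per_percent_speedup(double traffic_increase_pct,
                                   double perf_gain_pct) {
    return traffic_increase_pct / perf_gain_pct;
}
```

Pythia: 38.5 / 20 ≈ 2. Hermes on top of Pythia: 5.9 / 5 ≈ 1. Hermes alone: 5.5 / 11 = 0.5.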

20 of 43

Performance with Varying Baseline Prefetcher

[Figure: Hermes's speedup on top of different baseline prefetchers, with gains of 5.4%, 6.2%, 5.1%, 7.6%, and 7.7%.]

Hermes consistently improves performance on top of a wide range of baseline prefetchers.

21 of 43

Overhead of Hermes

  • 4 KB storage overhead
  • 1.5% power overhead

On top of an Intel Alder Lake-like performance-core configuration [2]

[2] https://www.anandtech.com/show/16881/a-deep-dive-into-intels-alder-lake-microarchitectures/3

22 of 43

More in the Paper

  • Wide range of sensitivity studies by varying
    • Cache hierarchy access latency
    • Hermes request issue latency
    • Activation threshold

  • Comparison of POPET’s accuracy and coverage against HMP and TTP

  • Understanding usefulness of each program feature

  • Hermes’s effect on stall cycle reduction

  • Performance analysis in eight-core systems


24 of 43

To Summarize…

25 of 43

Summary

Hermes advocates off-chip load prediction, a different form of speculation than the load address prediction performed by prefetchers.

Off-chip load prediction can be applied by itself, or combined with load address prediction, to improve performance.

26 of 43

Summary

Hermes…

Identifies 74% of the off-chip loads that miss a 4.5 MB cache hierarchy with 77% accuracy, using only 4 KB of storage overhead, and provides a 5.4% performance gain.

27 of 43

Hermes is Open-Sourced

  • All workload traces
  • 13 prefetchers
  • 9 off-chip predictors

28 of 43

Easy To Define Your Own Off-Chip Predictor

  • Just extend the OffchipPredBase class

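A minimal sketch of that extension point: `OffchipPredBase` is the class named on the slide, but the member names, signatures, and the toy policy below are illustrative assumptions, not the repo's exact interface.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the simulator's per-load metadata.
struct LoadInfo {
    uint64_t pc;
    uint64_t vaddr;
};

// Base class to extend, as named on the slide (interface assumed).
class OffchipPredBase {
public:
    virtual ~OffchipPredBase() = default;
    virtual bool predict(const LoadInfo& load) = 0;
    virtual void train(const LoadInfo& load, bool went_offchip) = 0;
};

// Toy predictor purely to show the extension point: flags a load as
// going off-chip whenever the low PC bits are zero.
class ToyPredictor : public OffchipPredBase {
public:
    bool predict(const LoadInfo& load) override {
        return (load.pc & 0xF) == 0;
    }
    void train(const LoadInfo&, bool) override {}  // stateless toy
};
```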

29 of 43

Easy To Define Your Own Off-Chip Predictor

  • Define your own train() and predict() functions

  • Get statistics like precision (aka accuracy) and recall (aka coverage) out of the box

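Precision and recall, as the slides use the terms, can be computed from three counters; this is a sketch of what such out-of-the-box statistics track, with names assumed.

```cpp
#include <cassert>
#include <cmath>

// precision ("accuracy"): of the loads predicted off-chip, the fraction
// that truly went off-chip; recall ("coverage"): of the loads that went
// off-chip, the fraction that were predicted.
struct PredStats {
    long tp = 0, fp = 0, fn = 0;  // true pos, false pos, false neg
    void record(bool predicted_offchip, bool went_offchip) {
        if (predicted_offchip && went_offchip) ++tp;
        else if (predicted_offchip && !went_offchip) ++fp;
        else if (!predicted_offchip && went_offchip) ++fn;
    }
    double precision() const { return tp + fp ? double(tp) / (tp + fp) : 0.0; }
    double recall()    const { return tp + fn ? double(tp) / (tp + fn) : 0.0; }
};
```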

30 of 43

Off-Chip Prediction Can Further Enable…

Prioritizing loads that are likely to go off-chip in cache queues and in on-chip network routing

Better instruction scheduling of data-dependent instructions

…and many more

31 of 43

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran,

David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

Accelerating Long-Latency Load Requests

via Perceptron-Based Off-Chip Load Prediction

32 of 43

BACKUP

33 of 43

Key Observations

Observation 2

On-chip cache access latency significantly contributes to total off-chip load latency.

40% of the stalls caused by an off-chip load can be reduced by removing on-chip cache access latency from its critical path.

34 of 43

The Problem and Its Traditional Solutions

Key Problem

Long-latency off-chip load requests significantly limit performance by stalling the processor.

Traditional Solutions

  1. Employ sophisticated data prefetchers
  2. Increase the size of the on-chip cache hierarchy

35 of 43

Observation: Not All Off-Chip Loads are Prefetched

Nearly 50% of the off-chip loads are still not prefetched.

36 of 43

Observation: Not All Off-Chip Loads are Prefetched

70% of these off-chip loads block the ROB.

37 of 43

Observation: With Large Cache Comes Longer Latency

  • On-chip cache access latency significantly contributes to the latency of an off-chip load

[Figure: breakdown of off-chip load latency, highlighting the on-chip cache hierarchy access latency component.]

38 of 43

Observation: With Large Cache Comes Longer Latency

  • On-chip cache access latency significantly contributes to the latency of an off-chip load

[Figure: breakdown of off-chip load latency, highlighting the on-chip cache hierarchy access latency component.]

40% of the stall cycles caused by an off-chip load can be eliminated by removing on-chip cache access latency from its critical path.

39 of 43

Bandwidth Efficiency

[Figure: performance gain vs. increase in main-memory requests. Pythia gains 20% for 38.5% more requests; Hermes alone gains 11% for 5.5% more; Hermes on top of Pythia adds 5% for 5.9% more.]

For every 1% performance benefit:
  • Pythia increases main memory requests by 2%
  • Hermes on top of Pythia increases main memory requests by 1%
  • Hermes alone increases main memory requests by 0.5%

40 of 43

Key Problem

Long-latency load requests that go off-chip significantly limit a processor’s performance.

[Figure: load path: Core → L1-D → L2 → LLC → MC → off-chip Main Memory.]


42 of 43

Executive Summary

The problem
Long-latency off-chip load requests limit a processor’s performance by stalling the core.

How is it addressed today?
  • Employ sophisticated data prefetchers
  • Increase the size of the on-chip cache hierarchy

Key observations
  • Nearly 50% of the off-chip loads are not prefetched even by a sophisticated state-of-the-art prefetcher
  • 40% of the stall cycles caused by an off-chip load can be eliminated by removing on-chip cache access latency from its critical path

Our goal
Improve processor performance by removing on-chip cache access latency from the critical path of an off-chip load.

43 of 43

Executive Summary

Hermes: Key idea
  • Predict which load requests will go off-chip
  • Start fetching data directly from main memory while concurrently accessing the cache hierarchy

Key mechanism
POPET: a perceptron-based off-chip load predictor that learns to accurately predict off-chip loads using multiple program features

Key results
  • Evaluated using a wide range of workloads from SPEC CPU, PARSEC, Ligra, and real-world applications
  • Identifies 74% of off-chip loads with 77% accuracy
  • Improves performance by 5.4%, 5.1%, and 6.2% in single-core, eight-core, and bandwidth-limited configurations, respectively
  • Incurs only 4 KB of storage and 1.5% power overhead per core