1 of 86

Rahul Bera* Adithya Ranganathan* Joydeep Rakshit Sujit Mahto

Anant V. Nori Jayesh Gaur Ataberk Olgun Konstantinos Kanellopoulos

Mohammad Sadrosadati Sreenivas Subramoney Onur Mutlu

Improving Performance and Power Efficiency

by Safely Eliminating Load Instruction Execution

2 of 86

Key Problem

Load instructions are a key limiter of

instruction-level parallelism (ILP)

Data Dependence: load-dependent instructions (I) stall due to the long execution latency of a load (L)

Resource Dependence: other loads (L2) stall due to contention with a load (L1) for load execution resources (R)

(e.g., address generation unit, load port, ...)

2

3 of 86

Prior Works on Tolerating Load Latency

  • Load value prediction (LVP) [Lipasti+, ASPLOS’96; Sazeides+, MICRO’96; ...]
  • Memory renaming (MRN) [Moshovos+, ISCA’97; Tyson+, MICRO’97; ...]

Mitigate Data Dependence (L → I)

by speculatively executing load-dependent instructions

using a predicted load value

Do Not Mitigate Resource Dependence (L1, L2, R)

The predicted load still gets executed to verify speculation,

consuming execution resources

3

4 of 86

Motivation

Safely breaking load data dependency

without executing a load instruction

may provide additional performance benefits

By finding load instructions that repeatedly produce

identical results across dynamic instances

How do we start?

4

5 of 86

Key Finding I: Global-Stable Loads

  • Some loads repeatedly fetch the same data value from the same load address across the entire workload

    • Both operations, address generation & data fetch, produce identical results across all dynamic instances

    • Prime targets for breaking data dependency without execution

Global-Stable Load

5
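To make this concrete, below is a minimal C++ sketch of how a trace-based tool could classify global-stable loads; the record format, class and field names, and the single-pass approach are assumptions for illustration, not the paper's methodology or the Load Inspector implementation.

    #include <cstdint>
    #include <unordered_map>

    // One dynamic load observed in an execution trace (assumed record format).
    struct LoadRecord {
        uint64_t pc;       // static load instruction address
        uint64_t address;  // effective load address
        uint64_t value;    // loaded data value
    };

    // Per-static-load bookkeeping: remember the first (address, value) pair and
    // clear the stable flag as soon as any later dynamic instance differs.
    struct Stability {
        uint64_t address = 0;
        uint64_t value = 0;
        bool seen = false;
        bool stable = true;
    };

    class GlobalStableClassifier {
    public:
        void observe(const LoadRecord& r) {
            Stability& s = table_[r.pc];
            if (!s.seen) {
                s.address = r.address;
                s.value = r.value;
                s.seen = true;
                return;
            }
            if (r.address != s.address || r.value != s.value)
                s.stable = false;  // differs from an earlier dynamic instance
        }
        // Global-stable: every dynamic instance fetched the same data value
        // from the same load address across the entire trace.
        bool is_global_stable(uint64_t pc) const {
            auto it = table_.find(pc);
            return it != table_.end() && it->second.seen && it->second.stable;
        }
    private:
        std::unordered_map<uint64_t, Stability> table_;
    };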

6 of 86

Key Finding I: Global-Stable Loads

Fraction of dynamic loads

34.2%

Nearly 1 in every 3 dynamic loads

is a global-stable load

Global-Stable Loads

Across a wide range of 90 workloads

6

7 of 86

In the Paper: Analysis of Global-Stable Loads

  • Why do these loads even exist in well-optimized real-world workloads?
    • Accessing global-scope variables
    • Accessing local variables of inline functions
    • Limited set of architectural registers

  • Can increasing architectural registers help?
    • Very small change even after doubling x64 registers

  • Deeper characterization of global-stable loads
    • Which addressing mode do they use?
    • How far away do they appear in a workload?

7


9 of 86

9

A significant fraction of loads are global-stable

But do they limit ILP even when using

load value prediction and memory renaming?

10 of 86

Key Finding II: Global-Stable Loads Cause Resource Dependence

In an aggressive OoO processor with 6-wide issue, 3 load ports,

a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled

All execution cycles where

at least one load port is utilized

10

11 of 86

Key Finding II: Global-Stable Loads Cause Resource Dependence

In an aggressive OoO processor with 6-wide issue, 3 load ports,

a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled

In 23% of all execution cycles where at least one load port is utilized,

a global-stable load utilizes a load port, blocking a non-global-stable load

11

12 of 86

Key Finding II: Global-Stable Loads Cause Resource Dependence

In an aggressive OoO processor with 6-wide issue, 3 load ports,

a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled

In 23% of all execution cycles where at least one load port is utilized,

a global-stable load utilizes a load port, blocking a non-global-stable load

Even when using load value prediction and memory renaming,

global-stable loads limit ILP due to resource dependence

What’s the performance headroom of mitigating the resource dependence?

12

13 of 86

Key Finding III: High Performance Headroom

Ideally value-predicting all global-stable loads

(mitigating only load data dependence): 4.3%

Ideally eliminating all global-stable loads

(mitigating load data + resource dependence): 9.1%

2x load execution width: 8.8%

Mitigating both data and resource dependence has

more than 2x the performance benefit

of mitigating only data dependence of global-stable loads

Ideal elimination of global-stable loads exceeds performance

of a processor with 2x wider load execution

13

14 of 86

Load Execution Resources Lag Behind

(Chart: resource scaling across processor generations: 3.4x, 3x, vs. only 1.5x for load execution resources)

14

15 of 86

Load Execution Resources Lag Behind

(Chart: resource scaling across processor generations: 3.4x, 3x, vs. only 1.5x for load execution resources)

Mitigating load resource dependence has

high performance potential

in recent and future generation processors

15

16 of 86

Our Goal

To improve instruction-level parallelism by mitigating

both load data dependence and resource dependence

16

17 of 86

17

A purely microarchitectural technique

Mitigates both load data dependence

and load resource dependence

By safely eliminating

the entire execution of a load instruction

18 of 86

Constable: Key Insight

Dynamic instruction stream: two successive dynamic instances (LD1 and LD2)

of the same static load instruction

LD1:  mov r8, [rbp+0x8]
      sub rax, r8
      cmp rsi, rax
      jle 0x40230e
      add rax, 0x10
LD2:  mov r8, [rbp+0x8]
      sub rax, r8
      cmp rsi, rax
      jle 0x40230e

18

19 of 86

Constable: Key Insight

Dynamic instruction stream (as before): LD1 and LD2 are two successive

dynamic instances of mov r8, [rbp+0x8]

If the source register rbp has not been modified between LD1 and LD2:

LD2 would have the same address as LD1

→ Address generation of LD2 can be eliminated

If no store or snoop request to address [rbp+0x8] occurred between LD1 and LD2:

LD2 would fetch the same data as LD1

→ Data fetching of LD2 can be eliminated

19
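The insight reduces to a simple predicate over what happened between LD1 and LD2; the sketch below uses assumed names and fields and is only a functional illustration, not Constable's hardware bookkeeping.

    // Events observed between two dynamic instances (LD1, LD2) of the same
    // static load (illustrative fields).
    struct IntervalEvents {
        bool base_register_written;  // e.g., rbp was overwritten after LD1
        bool store_to_address;       // a store wrote to [rbp+0x8]
        bool snoop_to_address;       // another core wrote to the same cacheline
    };

    // LD2's address generation can be skipped if its source register(s) were not
    // modified; its data fetch can also be skipped if the location was not written
    // locally (store) or remotely (snoop). LD2 then reuses LD1's value.
    bool can_eliminate(const IntervalEvents& e) {
        const bool same_address = !e.base_register_written;
        const bool same_data =
            same_address && !e.store_to_address && !e.snoop_to_address;
        return same_data;
    }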

20 of 86

Constable: Key Steps

Dynamically identify load instructions

that have historically fetched

the same data from the same load address

(i.e., likely-stable)

Eliminate execution of likely-stable loads

by tracking modifications to

their source registers and their load addresses

20

21 of 86

Prior Related Literature

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

These works aim to memoize every instruction,

including multiple dynamic instances of each instruction

They require a large memoization buffer,

often bigger than the L1 data cache

21

22 of 86

Key Improvements over Literature

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

Focus only on loads that are likely stable

Lower storage overhead

with high load elimination coverage

Lower design complexity

Fewer port requirements, lower power

22

23 of 86

Key Improvements over Literature

Focus only on loads that are likely stable

Eliminate loads early in the pipeline

Elimination at rename stage

by explicitly monitoring changes to the source registers

and load address of a likely-stable load

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

23

24 of 86

Key Improvements over Literature

Focus only on loads that are likely stable

Eliminate loads early in the pipeline

Ensure correctness in today’s processors

  • Maintain correctness in presence of out-of-order load issue
  • Maintain coherence in multi-threaded & multi-core execution

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

24

25 of 86

Constable: Design Overview

25

26 of 86

Constable: Key Steps

Identify

likely-stable loads

Eliminate

by tracking modifications

26

27 of 86

Identify a Likely-Stable Load

  • Using a stability confidence counter per load instruction

Dynamic instruction stream: three successive dynamic instances

of the same static load, mov r8, [rbp+0x8]

Stability Confidence update:

  • +1 if same data & address as the last dynamic instance (e.g., 5 → 6)
  • /2 if different data or different address (e.g., 6 → 3)

27
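As a minimal software model of this update rule (assumed counter width and saturation point; the threshold of 30 is the example value shown on the next slide):

    #include <algorithm>
    #include <cstdint>

    // Per-static-load stability confidence: +1 when a dynamic instance matches
    // the previous one (same address and data), halved otherwise.
    struct StabilityCounter {
        static constexpr uint32_t kMax = 63;        // assumed saturation point
        static constexpr uint32_t kThreshold = 30;  // example value from the slides

        uint32_t confidence = 0;
        uint64_t last_address = 0;
        uint64_t last_value = 0;
        bool seen = false;

        void update(uint64_t address, uint64_t value) {
            if (seen && address == last_address && value == last_value)
                confidence = std::min(confidence + 1, kMax);  // same data & address
            else
                confidence /= 2;                              // different data or address
            last_address = address;
            last_value = value;
            seen = true;
        }
        bool likely_stable() const { return confidence >= kThreshold; }
    };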

28 of 86

Eliminate a Likely-Stable Load

Dynamic instruction stream: successive dynamic instances of mov r8, [rbp+0x8],

each observed with stability confidence 30

Stability confidence crosses threshold, so Constable inserts:

  • Register Monitor: source register rbp → PCx
  • Address Monitor: load address 0x4200e0 → PCx
  • Elimination Table: PCx → last value (0x2ae), eliminate flag (to handle correct elimination of in-flight loads)

Lookup on the next instance of PCx: the load is eliminated

  • No reservation station
  • No address generation unit
  • No load port
  • Still takes ROB and load buffer

28

29 of 86

Stop Elimination of a Likely-Stable Load

State from the previous slide: Register Monitor (rbp → PCx), Address Monitor (0x4200e0 → PCx),

Elimination Table (PCx → last value 0x2ae, eliminate flag)

Dynamic instruction stream: between two instances of mov r8, [rbp+0x8] (stability confidence 30),

an add rbp, 0xd8 modifies the monitored source register rbp

The Register Monitor detects the modification and clears the eliminate flag of PCx

Next instance of PCx: elimination flag not set.

Gets executed; its stability confidence is halved (30 → 15)

29
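Putting the last two slides together, here is a simplified software model of the Elimination Table with its Register and Address Monitors. The structure names follow the slides, but the table organization, field layout, and function names are assumptions; the sketch captures only functional behavior, not timing or port constraints.

    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    // One Elimination Table entry for a likely-stable load, keyed by load PC.
    struct EliminationEntry {
        int base_register;       // architectural id of the monitored source register (e.g., rbp)
        uint64_t address;        // monitored load address (e.g., 0x4200e0)
        uint64_t last_value;     // value supplied to dependents (e.g., 0x2ae)
        bool eliminate = false;  // set once stability confidence crossed the threshold
    };

    using EliminationTable = std::unordered_map<uint64_t, EliminationEntry>;

    // At rename: a hit with the eliminate flag set returns the recorded value so
    // dependents can be satisfied immediately; the load skips the reservation
    // station, address generation, and the load port, but still allocates ROB
    // and load-buffer entries for correctness.
    std::optional<uint64_t> try_eliminate_at_rename(const EliminationTable& table,
                                                    uint64_t load_pc) {
        auto it = table.find(load_pc);
        if (it != table.end() && it->second.eliminate)
            return it->second.last_value;
        return std::nullopt;  // flag not set or table miss: execute normally
    }

    // Register Monitor: a write to a monitored source register (e.g.,
    // "add rbp, 0xd8" on the previous slide) stops elimination of dependent loads.
    void on_register_write(EliminationTable& table, int written_register) {
        for (auto& [pc, entry] : table)
            if (entry.base_register == written_register)
                entry.eliminate = false;
    }

    // Address Monitor: a local store or a remote snoop to a monitored address
    // likewise stops elimination until stability is re-established.
    void on_store_or_snoop(EliminationTable& table, uint64_t target_address) {
        for (auto& [pc, entry] : table)
            if (entry.address == target_address)
                entry.eliminate = false;
    }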

30 of 86

More in the Paper

  • Ensuring safe and correct elimination in the presence of
    • Out-of-order load issue
    • Multi-threaded & multi-core execution
    • Wrong-path execution

  • Integration of Constable into the processor pipeline

  • Microarchitecture for breaking data dependence on the eliminated loads

  • Microarchitecture of Constable’s own structures
    • Read and write port requirements

30

31 of 86

More in the Paper

  • Ensuring safe and correct elimination in the presence of
    • Out-of-order load issue
    • Multi-core execution
    • Wrong-path execution

  • Integration of Constable into the processor pipeline

  • Microarchitecture for breaking data dependence on the eliminated loads

  • Microarchitecture of Constable’s own structures
    • Read and write port requirements

31

32 of 86

Evaluation

32

33 of 86

Methodology

  • Industry-grade x86-64 simulator modeling an aggressive OoO processor
    • 8-wide fetch, 6-wide issue to 3 load ports, 512-entry ROB
    • With memory renaming, zero/constant/move elimination, and branch folding
    • Five prefetchers throughout the cache hierarchy

  • 90 workloads of a wide variety
    • All from SPEC CPU 2017
    • Client (SYSMark, DaCapo, ...)
    • Enterprise (SPECjbb, SPECjEnterprise, ...)
    • Server (BigBench, Hadoop, ...)

Mechanisms compared against

  • EVES, the state-of-the-art load value predictor [Seznec, CVP’18]
  • Early Load Address Resolution [Bekerman+, ISCA’00]
  • Register File Prefetching [Shukla+, ISCA’22]

Configurations

  • No simultaneous multi-threading (SMT)
  • 2-way SMT

33

34 of 86

Performance Improvement in noSMT

4.7%

5.1%

3.4%

EVES (the state-of-the-art load value predictor)

Constable

EVES + Constable

Constable alone provides performance similar to EVES

with only ½ of EVES’ storage overhead

Constable on top of EVES outperforms EVES alone

34

35 of 86

Performance Improvement in 2-way SMT

3.6%

8.8%

11.3%

EVES (the state-of-the-art load value predictor)

Constable

EVES + Constable

35

36 of 86

Performance Improvement in 2-way SMT

3.6%

8.8%

11.3%

EVES (the state-of-the-art load value predictor)

Constable

EVES + Constable

Constable provides higher performance benefits

in a 2-way SMT processor

36

37 of 86

Improvement in Resource Efficiency

Reduction in Reservation Station Allocation: 8.8% average

Reduction in L1 Data Cache Accesses: 26% average

37

38 of 86

Improvement in Resource Efficiency

Reduction in Reservation Station Allocation: 8.8% average

Reduction in L1 Data Cache Accesses: 26% average

Constable significantly improves resource efficiency

by eliminating load instruction execution

38

39 of 86

Reduction in Dynamic Power

OoO Unit Power: 5.1% average reduction

Memory Execution Unit Power: 9.1% average reduction

39

40 of 86

Reduction in Dynamic Power

OoO Unit Power: 5.1% reduction

Memory Execution Unit Power: 9.1% reduction

By eliminating load instruction execution,

Constable reduces dynamic power consumption

40

41 of 86

Area and Power Overhead of Constable’s Own Structures

Storage overhead per core: 12.4 KB

Area: 0.232 mm² (0.0061% of the area of an Intel Alderlake-S processor)

Low energy: up to 10.8 pJ/read and 16.7 pJ/write

41

42 of 86

More in the Paper

  • Load elimination coverage of Constable
    • 23.5% of all dynamic loads are eliminated

  • Per-workload performance analysis
    • Up to 31.2% over baseline
    • Outperforms EVES by more than 5% in 60 out of 90 workloads

  • Performance contribution per load category
    • Stack loads contribute the most

  • Performance improvement over prior works
    • 4.7% over Early load address resolution
    • 3.6% over Register file prefetching

  • Performance sensitivity:
    • Higher performance in every configuration up to 2X load execution width
    • Higher performance in every configuration up to 2X pipeline depth

42


44 of 86

To Summarize...

44

45 of 86

Our Key Findings

A large fraction (34%) of dynamic loads fetch

the same data from the same address

throughout the entire workload

These global-stable loads cause significant ILP loss due to resource dependence

Eliminating global-stable load execution provides

more than 2x the performance benefit

of just breaking their load data dependency

45

46 of 86

Our Proposal

Constable

Identifies and eliminates loads

that repeatedly fetch the same data from the same address

High performance benefit

over a strong baseline system

without SMT (5.1%) and with SMT (8.8%)

Improves resource efficiency

L1-D access reduction by 26%

RS allocation reduction by 8.8%

Reduces dynamic power

L1-D power by 9.1%

RS power by 5.1%

Low storage overhead

Only 12.4 KB/core,

0.232 mm² in 14-nm technology

46

47 of 86

There’s Still Headroom...

Constable successfully eliminates

57% of all global-stable loads at runtime

43% of global-stable loads

do not get eliminated

We need a deeper understanding of the

software primitives that generate global-stable loads

47

48 of 86

Open-Source Tool

A tool to analyze load instructions in any off-the-shelf x86(-64) program

48

49 of 86

Open-Source Tool

A tool to analyze load instructions in any off-the-shelf x86(-64) program

Study global-stable loads

Study the effects of increasing architectural registers

using the APX extension to the x86-64 ISA

49

50 of 86

Improving Performance and Power Efficiency

by Safely Eliminating Load Instruction Execution

arXiv

Load Inspector

51 of 86

BACKUP

52 of 86

Index

52

53 of 86

Why Do Global-Stable Loads Exist?

Random* Random::s_rng = 0;

Random* Random::get_Rng(void)
{
    if (s_rng == 0)
        s_rng = new Random();
    return s_rng;
}

Example code from 541.leela_r

Global scope variable

Function to access the variable

53

54 of 86

Why Do Global-Stable Loads Exist?

Random* Random::s_rng = 0;

Random* Random::get_Rng(void)

{

if (s_rng == 0)

s_rng = new Random();

return s_rng;

}

624: mov rax, [rip+0x1f4ac5]

62b: test rax,rax

62e: je 0x638

630: ret

631: nop

638: sub rsp,0x8

63c: mov edi,0xc

641: call 0x460

Example code from 541.leela_r

Disassembly (compiled by GCC* with -O3)

*GNU GCC 13.2

54

55 of 86

Why Do Global-Stable Loads Exist?

Random* Random::s_rng = 0;

Random* Random::get_Rng(void)

{

if (s_rng == 0)

s_rng = new Random();

return s_rng;

}

624: mov rax, [rip+0x1f4ac5]

62b: test rax,rax

62e: je 0x638

630: ret

631: nop

638: sub rsp,0x8

63c: mov edi,0xc

641: call 0x460

Example code from 541.leela_r

Gets initialized only once

and never changes

Global-stable load

Disassembly (compiled by GCC* with -O3)

*GNU GCC 13.2

55

56 of 86

Why Do Global-Stable Loads Exist?

56

57 of 86

Why Do Global-Stable Loads Exist?

Global-stable loads exist for many reasons:

  • Accessing global variables

  • Accessing local variables of an inline function

  • Limited architectural registers
  • ...

57
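As a hypothetical illustration of the register-pressure case (not code from the evaluated workloads; whether the compiler actually spills depends on the surrounding code and optimization level):

    // If a hot loop has more live values than available architectural registers,
    // the compiler may spill the loop-invariant 'scale' to the stack; every
    // iteration then reloads the same value from the same stack slot, making
    // that reload a global-stable load.
    long dot_scaled(const long* a, const long* b, long n, long scale) {
        long sum = 0;
        for (long i = 0; i < n; ++i)
            sum += (a[i] + b[i]) * scale;  // 'scale' may be reloaded from its
                                           // spill slot in each iteration
        return sum;
    }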

58 of 86

Effects of Increasing Architectural Registers

The fraction of global-stable loads

is nearly the same without and with APX

Compiled with Clang 18.1.3 with and without -mapxf

58

59 of 86

Effects of Increasing Architectural Registers

The fraction of global-stable loads (i.e., Constable’s opportunity)

is much higher than the reduction in loads from APX

Compiled with Clang 18.1.3 with and without -mapxf

59

60 of 86

Effects of Increasing Architectural Registers

The profile of global-stable loads stays largely unchanged

after doubling registers using APX

60

61 of 86

Characterization of Global-Stable Loads (I)

61

62 of 86

Characterization of Global-Stable Loads (II)

62

63 of 86

Characterization of Global-Stable Loads (III)

63

64 of 86

Resource Dependence by Global-Stable Loads

Global-stable loads cause significant resource dependence

64

65 of 86

Performance Headroom Analysis

65

66 of 86

Constable Overview

66

67 of 86

Architecting SLD

67

68 of 86

Effect of Wrong-Path Update

68

69 of 86

An Example of Constable’s Operation

69

70 of 86

Ensuring Coherence

  • Constable relies on snoop requests to observe modifications to a memory address by other cores

70

71 of 86

Ensuring Coherence

  • Constable relies on snoop requests to observe modifications to a memory address by other cores

  • Evicting a cacheline from the core-private cache resets the core-valid (CV) bit in the directory
    • What if the eviction is clean?
      • Unnecessary loss of elimination opportunity

  • Constable pins the CV-bit of an eliminated load’s cacheline
    • Even if the cacheline gets evicted from the core-private cache, the snoop request still gets delivered

71
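A simplified software model of the pinning idea is sketched below; the directory organization, the per-core pin bit, and all names are assumptions for illustration, not the exact mechanism in the paper.

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kNumCores = 8;  // assumed core count for the sketch

    // Minimal directory state per cacheline: core-valid (CV) bits plus a pin
    // bit set while an eliminated load on that core depends on the line.
    struct DirectoryEntry {
        std::bitset<kNumCores> core_valid;
        std::bitset<kNumCores> pinned_for_elimination;
    };

    struct Directory {
        std::unordered_map<uint64_t, DirectoryEntry> lines;

        void on_eliminated_load(uint64_t line, int core) {
            lines[line].core_valid.set(core);
            lines[line].pinned_for_elimination.set(core);  // pin the CV bit
        }
        // A clean eviction normally clears the CV bit; pinning keeps it set so
        // a later write by another core still snoops the eliminating core.
        void on_clean_eviction(uint64_t line, int core) {
            DirectoryEntry& e = lines[line];
            if (!e.pinned_for_elimination.test(core))
                e.core_valid.reset(core);
        }
        // Snoops go to every core whose CV bit is set (pinned or not); the
        // snooped core can then stop eliminating loads that depend on this line.
        std::bitset<kNumCores> snoop_targets(uint64_t line, int writer) {
            std::bitset<kNumCores> targets = lines[line].core_valid;
            targets.reset(writer);
            return targets;
        }
    };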

72 of 86

Constable’s Storage Overhead

72

73 of 86

Area and Power Overhead of Constable’s Structures

73

74 of 86

Simulated System Parameters

74

75 of 86

Evaluated Workloads

75

76 of 86

Workload-Wise Performance

76

77 of 86

Load Category-Wise Performance

77

78 of 86

Performance Comparison with Prior Works

78

79 of 86

Elimination Coverage

79

80 of 86

Coverage of Global-Stable Loads

80

81 of 86

Reduction in Dynamic Power

81

82 of 86

Performance Sensitivity to Load Execution Width Scaling

82

83 of 86

Performance Sensitivity to Pipeline Depth Scaling

83

84 of 86

Eliminated Loads that Violate Memory Ordering

Only 0.09% of all eliminated loads

violate memory ordering

84

85 of 86

Eliminated Loads that Violate Memory Ordering

Eliminated loads that violate memory ordering

increase the number of allocated instructions by only 0.3%

85

86 of 86

Executive Summary

Key Findings

  • A significant fraction of loads always fetch the same data from the same address throughout the entire workload
  • These loads cause significant resource dependence even when using value prediction and memory renaming
  • Mitigating their resource dependence has significant performance headroom

Key Mechanism: Constable

  • Key Idea: To mitigate both load data and load resource dependence by safely eliminating the entire execution of a load
  • Key Result: Simultaneously improves performance and power efficiency of a strong baseline OoO processor

86