1 of 86

Rahul Bera* Adithya Ranganathan* Joydeep Rakshit Sujit Mahto

Anant V. Nori Jayesh Gaur Ataberk Olgun Konstantinos Kanellopoulos

Mohammad Sadrosadati Sreenivas Subramoney Onur Mutlu

Improving Performance and Power Efficiency

by Safely Eliminating Load Instruction Execution

2 of 86

Key Problem

Load instructions are a key limiter of

instruction-level parallelism (ILP)

Data Dependence: load-dependent instructions (I) stall due to the long execution latency of a load (L)

Resource Dependence: other loads (L2) stall due to contention with a load (L1) for load execution resources (R)

(e.g., address generation unit, load port, ...)

2

3 of 86

Prior Works on Tolerating Load Latency

  • Load value prediction (LVP) [Lipasti+, ASPLOS’96; Sazeides+, MICRO’96; ...]
  • Memory renaming (MRN) [Moshovos+, ISCA’97; Tyson+, MICRO’97; ...]

Mitigate Data Dependence (L → I)

by speculatively executing load-dependent instructions

using a predicted load value

Do Not Mitigate Resource Dependence (L1, L2, R)

The predicted load still gets executed to verify speculation,

consuming execution resources

3

4 of 86

Motivation

Safely breaking load data dependency

without executing a load instruction

may provide additional performance benefits

By finding load instructions that repeatedly produce

identical results across dynamic instances

How do we start?

4

5 of 86

Key Finding I: Global-Stable Loads

  • Some loads repeatedly fetch the same data value from the same load address across the entire workload

    • Both operations, address generation & data fetch, produce identical results across all dynamic instances

    • Prime targets for breaking data dependency without execution

Global-Stable Load

5
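To make this concrete, below is a minimal C++ sketch of how a trace-based tool could classify global-stable loads; the record format, class and field names, and the single-pass approach are assumptions for illustration, not the paper's methodology or the Load Inspector implementation.

    #include <cstdint>
    #include <unordered_map>

    // One dynamic load observed in an execution trace (assumed record format).
    struct LoadRecord {
        uint64_t pc;       // static load instruction address
        uint64_t address;  // effective load address
        uint64_t value;    // loaded data value
    };

    // Per-static-load bookkeeping: remember the first (address, value) pair and
    // clear the stable flag as soon as any later dynamic instance differs.
    struct Stability {
        uint64_t address = 0;
        uint64_t value = 0;
        bool seen = false;
        bool stable = true;
    };

    class GlobalStableClassifier {
    public:
        void observe(const LoadRecord& r) {
            Stability& s = table_[r.pc];
            if (!s.seen) {
                s.address = r.address;
                s.value = r.value;
                s.seen = true;
                return;
            }
            if (r.address != s.address || r.value != s.value)
                s.stable = false;  // differs from an earlier dynamic instance
        }
        // Global-stable: every dynamic instance fetched the same data value
        // from the same load address across the entire trace.
        bool is_global_stable(uint64_t pc) const {
            auto it = table_.find(pc);
            return it != table_.end() && it->second.seen && it->second.stable;
        }
    private:
        std::unordered_map<uint64_t, Stability> table_;
    };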

6 of 86

Key Finding I: Global-Stable Loads

Fraction of dynamic loads

34.2%

Nearly 1 in every 3 dynamic loads

is a global-stable load

Global-Stable Loads

Across a wide range of 90 workloads

6

7 of 86

In the Paper: Analysis of Global-Stable Loads

  • Why do these loads even exist in well-optimized real-world workloads?
    • Accessing global-scope variables
    • Accessing local variables of inline functions
    • Limited set of architectural registers

  • Can increasing architectural registers help?
    • Very small change even after doubling x64 registers

  • Deeper characterization of global-stable loads
    • Which addressing mode do they use?
    • How far away do they appear in a workload?

7


9 of 86

9

A significant fraction of loads are global-stable

But do they limit ILP even when using

load value prediction and memory renaming?

10 of 86

Key Finding II: Global-Stable Loads Cause Resource Dependence

In an aggressive OoO processor with 6-wide issue, 3 load ports,

a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled

All execution cycles where

at least one load port is utilized

10

11 of 86

Key Finding II: Global-Stable Loads Cause Resource Dependence

In an aggressive OoO processor with 6-wide issue, 3 load ports,

a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled

In 23% of all execution cycles where at least one load port is utilized,

a global-stable load utilizes a load port, blocking a non-global-stable load

11

12 of 86

Key Finding II: Global-Stable Loads Cause Resource Dependence

In an aggressive OoO processor with 6-wide issue, 3 load ports,

a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled

In 23% of all execution cycles where at least one load port is utilized,

a global-stable load utilizes a load port, blocking a non-global-stable load

Even when using load value prediction and memory renaming,

global-stable loads limit ILP due to resource dependence

What’s the performance headroom of mitigating the resource dependence?

12

13 of 86

Key Finding III: High Performance Headroom

Ideally value-predicting all global-stable loads

(mitigating only load data dependence): 4.3%

Ideally eliminating all global-stable loads

(mitigating load data + resource dependence): 9.1%

2x load execution width: 8.8%

Mitigating both data and resource dependence has

more than 2x the performance benefit

of mitigating only data dependence of global-stable loads

Ideal elimination of global-stable loads exceeds performance

of a processor with 2x wider load execution

13

14 of 86

Load Execution Resources Lag Behind

(Chart: resource scaling across processor generations: 3.4x, 3x, vs. only 1.5x for load execution resources)

14

15 of 86

Load Execution Resources Lag Behind

(Chart: resource scaling across processor generations: 3.4x, 3x, vs. only 1.5x for load execution resources)

Mitigating load resource dependence has

high performance potential

in recent and future generation processors

15

16 of 86

Our Goal

To improve instruction-level parallelism by mitigating

both load data dependence and resource dependence

16

17 of 86

17

A purely microarchitectural technique

Mitigates both load data dependence

and load resource dependence

By safely eliminating

the entire execution of a load instruction

18 of 86

Constable: Key Insight

Dynamic instruction stream: two successive dynamic instances (LD1 and LD2)

of the same static load instruction

LD1:  mov r8, [rbp+0x8]
      sub rax, r8
      cmp rsi, rax
      jle 0x40230e
      add rax, 0x10
LD2:  mov r8, [rbp+0x8]
      sub rax, r8
      cmp rsi, rax
      jle 0x40230e

18

19 of 86

Constable: Key Insight

Dynamic instruction stream (as before): LD1 and LD2 are two successive

dynamic instances of mov r8, [rbp+0x8]

If the source register rbp has not been modified between LD1 and LD2:

LD2 would have the same address as LD1

→ Address generation of LD2 can be eliminated

If no store or snoop request to address [rbp+0x8] occurred between LD1 and LD2:

LD2 would fetch the same data as LD1

→ Data fetching of LD2 can be eliminated

19
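The insight reduces to a simple predicate over what happened between LD1 and LD2; the sketch below uses assumed names and fields and is only a functional illustration, not Constable's hardware bookkeeping.

    // Events observed between two dynamic instances (LD1, LD2) of the same
    // static load (illustrative fields).
    struct IntervalEvents {
        bool base_register_written;  // e.g., rbp was overwritten after LD1
        bool store_to_address;       // a store wrote to [rbp+0x8]
        bool snoop_to_address;       // another core wrote to the same cacheline
    };

    // LD2's address generation can be skipped if its source register(s) were not
    // modified; its data fetch can also be skipped if the location was not written
    // locally (store) or remotely (snoop). LD2 then reuses LD1's value.
    bool can_eliminate(const IntervalEvents& e) {
        const bool same_address = !e.base_register_written;
        const bool same_data =
            same_address && !e.store_to_address && !e.snoop_to_address;
        return same_data;
    }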

20 of 86

Constable: Key Steps

Dynamically identify load instructions

that have historically fetched

the same data from the same load address

(i.e., likely-stable)

Eliminate execution of likely-stable loads

by tracking modifications to

their source registers and their load addresses

20

21 of 86

Prior Related Literature

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

These works aim to memoize every instruction,

including multiple dynamic instances of each instruction

They require a large memoization buffer,

often bigger than the L1 data cache

21

22 of 86

Key Improvements over Literature

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

Focus only on loads that are likely stable

Lower storage overhead

with high load elimination coverage

Lower design complexity

Fewer port requirements, lower power

22

23 of 86

Key Improvements over Literature

Focus only on loads that are likely stable

Eliminate loads early in the pipeline

Elimination at rename stage

by explicitly monitoring changes to the source registers

and load address of a likely-stable load

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

23

24 of 86

Key Improvements over Literature

Focus only on loads that are likely stable

Eliminate loads early in the pipeline

Ensure correctness in today’s processors

  • Maintain correctness in presence of out-of-order load issue
  • Maintain coherence in multi-threaded & multi-core execution

Rich literature on skipping redundant computations

by memoizing previously-computed results

[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]

24

25 of 86

Constable: Design Overview

25

26 of 86

Constable: Key Steps

Identify

likely-stable loads

Eliminate

by tracking modifications

26

27 of 86

Identify a Likely-Stable Load

  • Using a stability confidence counter per load instruction

Dynamic instruction stream: three successive dynamic instances

of the same static load, mov r8, [rbp+0x8]

Stability Confidence update:

  • +1 if same data & address as the last dynamic instance (e.g., 5 → 6)
  • /2 if different data or different address (e.g., 6 → 3)

27
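As a minimal software model of this update rule (assumed counter width and saturation point; the threshold of 30 is the example value shown on the next slide):

    #include <algorithm>
    #include <cstdint>

    // Per-static-load stability confidence: +1 when a dynamic instance matches
    // the previous one (same address and data), halved otherwise.
    struct StabilityCounter {
        static constexpr uint32_t kMax = 63;        // assumed saturation point
        static constexpr uint32_t kThreshold = 30;  // example value from the slides

        uint32_t confidence = 0;
        uint64_t last_address = 0;
        uint64_t last_value = 0;
        bool seen = false;

        void update(uint64_t address, uint64_t value) {
            if (seen && address == last_address && value == last_value)
                confidence = std::min(confidence + 1, kMax);  // same data & address
            else
                confidence /= 2;                              // different data or address
            last_address = address;
            last_value = value;
            seen = true;
        }
        bool likely_stable() const { return confidence >= kThreshold; }
    };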

28 of 86

Eliminate a Likely-Stable Load

Dynamic instruction stream: successive dynamic instances of mov r8, [rbp+0x8],

each observed with stability confidence 30

Stability confidence crosses threshold, so Constable inserts:

  • Register Monitor: source register rbp → PCx
  • Address Monitor: load address 0x4200e0 → PCx
  • Elimination Table: PCx → last value (0x2ae), eliminate flag (to handle correct elimination of in-flight loads)

Lookup on the next instance of PCx: the load is eliminated

  • No reservation station
  • No address generation unit
  • No load port
  • Still takes ROB and load buffer

28

29 of 86

Stop Elimination of a Likely-Stable Load

State from the previous slide: Register Monitor (rbp → PCx), Address Monitor (0x4200e0 → PCx),

Elimination Table (PCx → last value 0x2ae, eliminate flag)

Dynamic instruction stream: between two instances of mov r8, [rbp+0x8] (stability confidence 30),

an add rbp, 0xd8 modifies the monitored source register rbp

The Register Monitor detects the modification and clears the eliminate flag of PCx

Next instance of PCx: elimination flag not set.

Gets executed; its stability confidence is halved (30 → 15)

29
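Putting the last two slides together, here is a simplified software model of the Elimination Table with its Register and Address Monitors. The structure names follow the slides, but the table organization, field layout, and function names are assumptions; the sketch captures only functional behavior, not timing or port constraints.

    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    // One Elimination Table entry for a likely-stable load, keyed by load PC.
    struct EliminationEntry {
        int base_register;       // architectural id of the monitored source register (e.g., rbp)
        uint64_t address;        // monitored load address (e.g., 0x4200e0)
        uint64_t last_value;     // value supplied to dependents (e.g., 0x2ae)
        bool eliminate = false;  // set once stability confidence crossed the threshold
    };

    using EliminationTable = std::unordered_map<uint64_t, EliminationEntry>;

    // At rename: a hit with the eliminate flag set returns the recorded value so
    // dependents can be satisfied immediately; the load skips the reservation
    // station, address generation, and the load port, but still allocates ROB
    // and load-buffer entries for correctness.
    std::optional<uint64_t> try_eliminate_at_rename(const EliminationTable& table,
                                                    uint64_t load_pc) {
        auto it = table.find(load_pc);
        if (it != table.end() && it->second.eliminate)
            return it->second.last_value;
        return std::nullopt;  // flag not set or table miss: execute normally
    }

    // Register Monitor: a write to a monitored source register (e.g.,
    // "add rbp, 0xd8" on the previous slide) stops elimination of dependent loads.
    void on_register_write(EliminationTable& table, int written_register) {
        for (auto& [pc, entry] : table)
            if (entry.base_register == written_register)
                entry.eliminate = false;
    }

    // Address Monitor: a local store or a remote snoop to a monitored address
    // likewise stops elimination until stability is re-established.
    void on_store_or_snoop(EliminationTable& table, uint64_t target_address) {
        for (auto& [pc, entry] : table)
            if (entry.address == target_address)
                entry.eliminate = false;
    }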

30 of 86

More in the Paper

  • Ensuring safe and correct elimination in the presence of
    • Out-of-order load issue
    • Multi-threaded & multi-core execution
    • Wrong-path execution

  • Integration of Constable into the processor pipeline

  • Microarchitecture for breaking data dependence on the eliminated loads

  • Microarchitecture of Constable’s own structures
    • Read and write port requirements

30

31 of 86

More in the Paper

  • Ensuring safe and correct elimination in the presence of
    • Out-of-order load issue
    • Multi-core execution
    • Wrong-path execution

  • Integration of Constable into the processor pipeline

  • Microarchitecture for breaking data dependence on the eliminated loads

  • Microarchitecture of Constable’s own structures
    • Read and write port requirements

31

32 of 86

Evaluation

32

33 of 86

Methodology

  • Industry-grade x86-64 simulator modeling an aggressive OoO processor
    • 8-wide fetch, 6-wide issue to 3 load ports, 512-entry ROB
    • With memory renaming, zero/constant/move elimination, and branch folding
    • Five prefetchers throughout the cache hierarchy

  • 90 workloads of a wide variety
    • All from SPEC CPU 2017
    • Client (SYSMark, DaCapo, ...)
    • Enterprise (SPECjbb, SPECjEnterprise, ...)
    • Server (BigBench, Hadoop, ...)

Mechanisms compared against

  • EVES, the state-of-the-art load value predictor [Seznec, CVP’18]
  • Early Load Address Resolution [Bekerman+, ISCA’00]
  • Register File Prefetching [Shukla+, ISCA’22]

Configurations

  • No simultaneous multi-threading (SMT)
  • 2-way SMT

33

34 of 86

Performance Improvement in noSMT

4.7%

5.1%

3.4%

EVES (the state-of-the-art load value predictor)

Constable

EVES + Constable

Constable alone provides performance similar to EVES

with only ½ of EVES’ storage overhead

Constable on top of EVES outperforms EVES alone

34

35 of 86

Performance Improvement in 2-way SMT

3.6%

8.8%

11.3%

EVES (the state-of-the-art load value predictor)

Constable

EVES + Constable

35

36 of 86

Performance Improvement in 2-way SMT

3.6%

8.8%

11.3%

EVES (the state-of-the-art load value predictor)

Constable

EVES + Constable

Constable provides higher performance benefits

in a 2-way SMT processor

36

37 of 86

Improvement in Resource Efficiency

Reduction in Reservation Station Allocation: 8.8% average

Reduction in L1 Data Cache Accesses: 26% average

37

38 of 86

Improvement in Resource Efficiency

Reduction in Reservation Station Allocation: 8.8% average

Reduction in L1 Data Cache Accesses: 26% average

Constable significantly improves resource efficiency

by eliminating load instruction execution

38

39 of 86

Reduction in Dynamic Power

OoO Unit Power: 5.1% average reduction

Memory Execution Unit Power: 9.1% average reduction

39

40 of 86

Reduction in Dynamic Power

OoO Unit Power: 5.1% reduction

Memory Execution Unit Power: 9.1% reduction

By eliminating load instruction execution,

Constable reduces dynamic power consumption

40

41 of 86

Area and Power Overhead of Constable’s Own Structures

Storage overhead per core: 12.4 KB

Area: 0.232 mm² (0.0061% of the area of an Intel Alderlake-S processor)

Low energy: up to 10.8 pJ/read and 16.7 pJ/write

41

42 of 86

More in the Paper

  • Load elimination coverage of Constable
    • 23.5% of all dynamic loads are eliminated

  • Per-workload performance analysis
    • Up to 31.2% over baseline
    • Outperforms EVES by more than 5% in 60 out of 90 workloads

  • Performance contribution per load category
    • Stack loads contribute the most

  • Performance improvement over prior works
    • 4.7% over Early load address resolution
    • 3.6% over Register file prefetching

  • Performance sensitivity:
    • Higher performance in every configuration up to 2X load execution width
    • Higher performance in every configuration up to 2X pipeline depth

42


44 of 86

To Summarize...

44

45 of 86

Our Key Findings

A large fraction (34%) of dynamic loads fetch

the same data from the same address

throughout the entire workload

These global-stable loads cause significant ILP loss due to resource dependence

Eliminating global-stable load execution provides

more than 2x the performance benefit

of just breaking their load data dependency

45

46 of 86

Our Proposal

Constable

Identifies and eliminates loads

that repeatedly fetch the same data from the same address

High performance benefit

over a strong baseline system

without SMT (5.1%) and with SMT (8.8%)

Improves resource efficiency

L1-D access reduction by 26%

RS allocation reduction by 8.8%

Reduces dynamic power

L1-D power by 9.1%

RS power by 5.1%

Low storage overhead

Only 12.4 KB/core,

0.232 mm² in 14-nm technology

46

47 of 86

There’s Still Headroom...

Constable successfully eliminates

57% of all global-stable loads at runtime

43% of global-stable loads

do not get eliminated

We need a deeper understanding of the

software primitives that generate global-stable loads

47

48 of 86

Open-Source Tool

A tool to analyze load instructions in any off-the-shelf x86(-64) program

48

49 of 86

Open-Source Tool

A tool to analyze load instructions in any off-the-shelf x86(-64) program

Study global-stable loads

Study the effects of increasing architectural registers

using the APX extension to the x86-64 ISA

49

50 of 86

Improving Performance and Power Efficiency

by Safely Eliminating Load Instruction Execution

arXiv

Load Inspector

51 of 86

BACKUP

52 of 86

Index

52

53 of 86

Why Do Global-Stable Loads Exist?

Random* Random::s_rng = 0;

Random* Random::get_Rng(void)
{
    if (s_rng == 0)
        s_rng = new Random();
    return s_rng;
}

Example code from 541.leela_r

Global scope variable

Function to access the variable

53

54 of 86

Why Do Global-Stable Loads Exist?

Random* Random::s_rng = 0;

Random* Random::get_Rng(void)

{

if (s_rng == 0)

s_rng = new Random();

return s_rng;

}

624: mov rax, [rip+0x1f4ac5]

62b: test rax,rax

62e: je 0x638

630: ret

631: nop

638: sub rsp,0x8

63c: mov edi,0xc

641: call 0x460

Example code from 541.leela_r

Disassembly (compiled by GCC* with -O3)

*GNU GCC 13.2

54

55 of 86

Why Do Global-Stable Loads Exist?

Random* Random::s_rng = 0;

Random* Random::get_Rng(void)

{

if (s_rng == 0)

s_rng = new Random();

return s_rng;

}

624: mov rax, [rip+0x1f4ac5]

62b: test rax,rax

62e: je 0x638

630: ret

631: nop

638: sub rsp,0x8

63c: mov edi,0xc

641: call 0x460

Example code from 541.leela_r

Gets initialized only once

and never changes

Global-stable load

Disassembly (compiled by GCC* with -O3)

*GNU GCC 13.2

55

56 of 86

Why Do Global-Stable Loads Exist?

56

57 of 86

Why Do Global-Stable Loads Exist?

Global-stable loads exist for many reasons:

  • Accessing global variables

  • Accessing local variables of an inline function

  • Limited architectural registers
  • ...

57
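As a hypothetical illustration of the register-pressure case (not code from the evaluated workloads; whether the compiler actually spills depends on the surrounding code and optimization level):

    // If a hot loop has more live values than available architectural registers,
    // the compiler may spill the loop-invariant 'scale' to the stack; every
    // iteration then reloads the same value from the same stack slot, making
    // that reload a global-stable load.
    long dot_scaled(const long* a, const long* b, long n, long scale) {
        long sum = 0;
        for (long i = 0; i < n; ++i)
            sum += (a[i] + b[i]) * scale;  // 'scale' may be reloaded from its
                                           // spill slot in each iteration
        return sum;
    }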

58 of 86

Effects of Increasing Architectural Registers

The fraction of global-stable loads

is nearly the same without and with APX

Compiled with Clang 18.1.3 with and without -mapxf

58

59 of 86

Effects of Increasing Architectural Registers

The fraction of global-stable loads (i.e., Constable’s opportunity)

is much higher than the reduction in loads from APX

Compiled with Clang 18.1.3 with and without -mapxf

59

60 of 86

Effects of Increasing Architectural Registers

The profile of global-stable loads stays largely unchanged

after doubling registers using APX

60

61 of 86

Characterization of Global-Stable Loads (I)

61

62 of 86

Characterization of Global-Stable Loads (II)

62

63 of 86

Characterization of Global-Stable Loads (III)

63

64 of 86

Resource Dependence by Global-Stable Loads

Global-stable loads cause significant resource dependence

64

65 of 86

Performance Headroom Analysis

65

66 of 86

Constable Overview

66

67 of 86

Architecting SLD

67

68 of 86

Effect of Wrong-Path Update

68

69 of 86

An Example of Constable’s Operation

69

70 of 86

Ensuring Coherence

  • Constable relies on snoop requests to observe modifications to a memory address by other cores

70

71 of 86

Ensuring Coherence

  • Constable relies on snoop requests to observe modifications to a memory address by other cores

  • Evicting a cacheline from the core-private cache resets the core-valid (CV) bit in the directory
    • What if the eviction is clean?
      • Unnecessary loss of elimination opportunity

  • Constable pins the CV-bit of an eliminated load’s cacheline
    • Even if the cacheline gets evicted from the core-private cache, the snoop request still gets delivered

71
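A simplified software model of the pinning idea is sketched below; the directory organization, the per-core pin bit, and all names are assumptions for illustration, not the exact mechanism in the paper.

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>

    constexpr int kNumCores = 8;  // assumed core count for the sketch

    // Minimal directory state per cacheline: core-valid (CV) bits plus a pin
    // bit set while an eliminated load on that core depends on the line.
    struct DirectoryEntry {
        std::bitset<kNumCores> core_valid;
        std::bitset<kNumCores> pinned_for_elimination;
    };

    struct Directory {
        std::unordered_map<uint64_t, DirectoryEntry> lines;

        void on_eliminated_load(uint64_t line, int core) {
            lines[line].core_valid.set(core);
            lines[line].pinned_for_elimination.set(core);  // pin the CV bit
        }
        // A clean eviction normally clears the CV bit; pinning keeps it set so
        // a later write by another core still snoops the eliminating core.
        void on_clean_eviction(uint64_t line, int core) {
            DirectoryEntry& e = lines[line];
            if (!e.pinned_for_elimination.test(core))
                e.core_valid.reset(core);
        }
        // Snoops go to every core whose CV bit is set (pinned or not); the
        // snooped core can then stop eliminating loads that depend on this line.
        std::bitset<kNumCores> snoop_targets(uint64_t line, int writer) {
            std::bitset<kNumCores> targets = lines[line].core_valid;
            targets.reset(writer);
            return targets;
        }
    };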

72 of 86

Constable’s Storage Overhead

72

73 of 86

Area and Power Overhead of Constable’s Structures

73

74 of 86

Simulated System Parameters

74

75 of 86

Evaluated Workloads

75

76 of 86

Workload-Wise Performance

76

77 of 86

Load Category-Wise Performance

77

78 of 86

Performance Comparison with Prior Works

78

79 of 86

Elimination Coverage

79

80 of 86

Coverage of Global-Stable Loads

80

81 of 86

Reduction in Dynamic Power

81

82 of 86

Performance Sensitivity to Load Execution Width Scaling

82

83 of 86

Performance Sensitivity to Pipeline Depth Scaling

83

84 of 86

Eliminated Loads that Violate Memory Ordering

Only 0.09% of all eliminated loads

violate memory ordering

84

85 of 86

Eliminated Loads that Violate Memory Ordering

Eliminated loads that violate memory ordering

increase the number of allocated instructions by only 0.3%

85

86 of 86

Executive Summary

Key Findings

  • A significant fraction of loads always fetch the same data from the same address throughout the entire workload
  • These loads cause significant resource dependence even when using value prediction and memory renaming
  • Mitigating their resource dependence has significant performance headroom

Key Mechanism: Constable

  • Key Idea: To mitigate both load data and load resource dependence by safely eliminating the entire execution of a load
  • Key Result: Simultaneously improves performance and power efficiency of a strong baseline OoO processor

86