Rahul Bera* Adithya Ranganathan* Joydeep Rakshit Sujit Mahto
Anant V. Nori Jayesh Gaur Ataberk Olgun Konstantinos Kanellopoulos
Mohammad Sadrosadati Sreenivas Subramoney Onur Mutlu
Improving Performance and Power Efficiency
by Safely Eliminating Load Instruction Execution
Key Problem
Load instructions are a key limiter of instruction-level parallelism (ILP)
Data Dependence: stall load-dependent instructions due to long load execution latency
Resource Dependence: stall other loads due to contention in load execution resources (e.g., address generation unit, load port, ...)
[Figure: instruction I depends on load L (data dependence); loads L1 and L2 contend for a shared resource R (resource dependence)]
Prior Works on Tolerating Load Latency
Mitigate Data Dependence: by speculatively executing load-dependent instructions using a predicted load value
Do Not Mitigate Resource Dependence: the predicted load still gets executed to verify speculation, consuming execution resources
Motivation
Safely breaking load data dependency
without executing a load instruction
may provide additional performance benefits
How do we start?
By finding load instructions that repeatedly produce identical results across dynamic instances
Key Finding I: Global-Stable Loads
Global-Stable Load: a load that fetches the same data from the same address throughout the entire workload
Fraction of dynamic loads that are global-stable: 34.2%, across a wide range of 90 workloads
Nearly 1 in every 3 dynamic loads is a global-stable load
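A minimal sketch of how the fraction of global-stable dynamic loads could be measured from a load trace. The function name, the (pc, address, value) trace format, and the example addresses are hypothetical and are not taken from the authors' tooling; a static load seen only once also counts as stable in this simplified version.

```python
from collections import defaultdict

def global_stable_fraction(trace):
    """Return the fraction of dynamic loads issued by global-stable static loads.

    trace: iterable of (pc, address, value) tuples, one per dynamic load.
    A static load (pc) counts as global-stable here if every dynamic
    instance fetched the same value from the same address.
    """
    first = {}                # pc -> (address, value) of its first instance
    stable = {}               # pc -> True while no instance has differed
    count = defaultdict(int)  # pc -> number of dynamic instances
    for pc, address, value in trace:
        count[pc] += 1
        if pc not in first:
            first[pc] = (address, value)
            stable[pc] = True
        elif first[pc] != (address, value):
            stable[pc] = False
    total = sum(count.values())
    if total == 0:
        return 0.0
    return sum(n for pc, n in count.items() if stable[pc]) / total

# Hypothetical four-load trace with two static loads.
trace = [
    (0x624, 0x1F50F0, 0x2AE),  # same address and value every time
    (0x700, 0x1000, 1),
    (0x624, 0x1F50F0, 0x2AE),
    (0x700, 0x1008, 2),        # address and value change: not stable
]
fraction = global_stable_fraction(trace)
```

In this toy trace, two of the four dynamic loads come from the stable static load, so the function reports one half.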
In the Paper: Analysis of Global-Stable Loads
A significant fraction of loads are global-stable
But do they limit ILP even when using load value prediction and memory renaming?
Key Finding II: Global-Stable Loads Cause Resource Dependence
In an aggressive OoO processor with 6-wide issue, 3 load ports,
a load value predictor (EVES [Seznec, CVP’18]), and memory renaming enabled
In 23% of all execution cycles where at least one load port is utilized,
a global-stable load utilizes a load port, blocking a non-global-stable load
Even when using load value prediction and memory renaming,
global-stable loads limit ILP due to resource dependence
What’s the performance headroom of mitigating the resource dependence?
Key Finding III: High Performance Headroom
Ideally value-predicting all global-stable loads (mitigating only load data dependence): 4.3%
Ideally eliminating all global-stable loads (mitigating load data + resource dependence): 9.1%
2x load execution width: 8.8%
Mitigating both data and resource dependence has
more than 2x the performance benefit
of mitigating only data dependence of global-stable loads
Ideal elimination of global-stable loads exceeds performance
of a processor with 2x wider load execution
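The two headroom claims can be checked with a few lines of arithmetic. This is a sketch under one assumption: that the three reported numbers map to the configurations as 4.3% for ideal value prediction, 9.1% for ideal elimination, and 8.8% for 2x load execution width, which is the only mapping consistent with both claims on the slide. The variable names are illustrative.

```python
# Performance improvement over the baseline, in percent.
# Assumed mapping of the three reported values to configurations.
ideal_value_prediction = 4.3  # mitigates only load data dependence
ideal_elimination = 9.1       # mitigates data + resource dependence
double_load_width = 8.8       # 2x load execution width

# Claim 1: elimination gives more than 2x the benefit of value prediction.
ratio = ideal_elimination / ideal_value_prediction

# Claim 2: ideal elimination exceeds a processor with 2x wider load execution.
exceeds_wider_core = ideal_elimination > double_load_width
```

Under this mapping the ratio comes out at roughly 2.1, and ideal elimination edges out the 2x-wide configuration by 0.3 percentage points.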
Load Execution Resources Lag Behind
[Figure: growth factors of 3.4x, 3x, and 1.5x; series labels not recoverable]
Mitigating load resource dependence has
high performance potential
in recent and future generation processors
Our Goal
To improve instruction-level parallelism by mitigating
both load data dependence and resource dependence
Constable
A purely-microarchitectural technique
Mitigates both load data dependence
and load resource dependence
by safely eliminating
the entire execution of a load instruction
Constable: Key Insight
Dynamic instruction stream with two successive dynamic instances (LD1, LD2)
of the same static load instruction
mov r8, [rbp+0x8]    ; LD1
sub rax, r8
cmp rsi, rax
jle 0x40230e
add rax, 0x10
mov r8, [rbp+0x8]    ; LD2
sub rax, r8
cmp rsi, rax
jle 0x40230e
add rax, 0x10
If the source register rbp has not been modified between LD1 and LD2,
LD2 would have the same address as LD1:
address generation of LD2 can be eliminated
If there is no store or snoop request to address [rbp+0x8] between LD1 and LD2,
LD2 would fetch the same data as LD1:
data fetching of LD2 can be eliminated
Constable: Key Steps
Dynamically identify load instructions
that have historically fetched
the same data from the same load address
(i.e., likely-stable)
Eliminate execution of likely-stable loads
by tracking modifications to
their source registers and their load addresses
Prior Related Literature
Rich literature on skipping redundant computations
by memoizing previously-computed results
[Michie, Nature’68; Harbison+, ASPLOS’82; Richardson, SCA’93; Sodani+, ISCA’97; González+, ICPP’99; ...]
Aim to memoize every instruction
including multiple dynamic instances of each instruction
Require large memoization buffer
Often bigger than the size of L1 data cache
Key Improvements over Literature
Focus only on loads that are likely stable
Lower storage overhead with high load elimination coverage
Lower design complexity: fewer port requirements, lower power
Eliminate loads early in the pipeline
Elimination at rename stage
by explicitly monitoring changes to the source registers
and load address of a likely-stable load
Ensure correctness in today’s processors
Design Overview
Constable: Key Steps
Identify likely-stable loads
Eliminate by tracking modifications
Identify a Likely-Stable Load
[Figure: three dynamic instances of the static load mov r8, [rbp+0x8] in the instruction stream, each annotated with the load's stability confidence]
Stability Confidence
+1: same data & address as last dynamic instance
/2: different data or different address
Example confidence values across the three instances: 5, 6 (+1), 3 (/2)
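The confidence update rule can be sketched as a small counter model: increment when a load repeats its last address and data, halve otherwise. The class name, the threshold of 30, and the saturation value of 31 are assumptions made for illustration; the slides only show that a confidence of 30 crosses the threshold.

```python
class StabilityTracker:
    """Per-static-load stability confidence counter (hypothetical model)."""

    def __init__(self, threshold=30, max_confidence=31):
        self.threshold = threshold            # assumed threshold
        self.max_confidence = max_confidence  # assumed saturation point
        self.entries = {}  # pc -> [last_address, last_value, confidence]

    def observe(self, pc, address, value):
        """Update confidence for one executed load; return the new confidence."""
        entry = self.entries.get(pc)
        if entry is None:
            # First instance: nothing to compare against yet.
            self.entries[pc] = [address, value, 0]
            return 0
        if entry[0] == address and entry[1] == value:
            # Same data and address as the last dynamic instance: +1.
            entry[2] = min(entry[2] + 1, self.max_confidence)
        else:
            # Different data or different address: halve the confidence.
            entry[2] //= 2
            entry[0], entry[1] = address, value
        return entry[2]

    def is_likely_stable(self, pc):
        entry = self.entries.get(pc)
        return entry is not None and entry[2] >= self.threshold

# Reproduce the 5 -> 6 -> 3 sequence with a hypothetical PC, address, and value.
tracker = StabilityTracker()
PC = 0x400
for _ in range(6):  # first call allocates the entry, the next five reach 5
    tracker.observe(PC, 0x4200E8, 0x2AE)
conf_same = tracker.observe(PC, 0x4200E8, 0x2AE)  # same data & address: +1
conf_diff = tracker.observe(PC, 0x4200E8, 0x111)  # different data: /2
```

Halving on a mismatch, rather than resetting to zero, lets a load that changes only occasionally regain confidence quickly.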
Eliminate a Likely-Stable Load
[Figure: dynamic instances of the load mov r8, [rbp+0x8] at PCx, each with stability confidence 30]
Stability confidence crosses threshold: insert the load into three structures
Elimination Table: PCx, with last value (0x2ae) and eliminate flag
(to handle correct elimination of in-flight loads)
Register Monitor: rbp, tagged with PCx
Address Monitor: 0x4200e0, tagged with PCx
Later instances of the load look up the Elimination Table
Stop Elimination of a Likely-Stable Load
[Figure: Elimination Table (PCx, last value 0x2ae), Register Monitor (rbp, PCx), and Address Monitor (0x4200e0, PCx), next to an instruction stream containing add rbp, 0xd8, which writes the load's source register rbp]
A later instance of mov r8, [rbp+0x8]: elimination flag not set. Gets executed
Stability confidence values shown: 30, 30, 15
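The interplay of the three structures can be sketched as a toy software model: inserting a load arms elimination, and a write to a monitored source register, or a store or snoop to a monitored address, disarms it. All class and method names are hypothetical, and the model leaves out in-flight loads, memory ordering, and wrong-path effects, so it is an illustration of the idea rather than the hardware design.

```python
class EliminationModel:
    """Toy model of an elimination table with register and address monitors."""

    def __init__(self):
        self.table = {}             # pc -> {"value": ..., "eliminate": bool}
        self.register_monitor = {}  # register name -> set of pcs
        self.address_monitor = {}   # load address -> set of pcs

    def insert(self, pc, src_regs, address, value):
        """Start eliminating a load whose confidence crossed the threshold."""
        self.table[pc] = {"value": value, "eliminate": True}
        for reg in src_regs:
            self.register_monitor.setdefault(reg, set()).add(pc)
        self.address_monitor.setdefault(address, set()).add(pc)

    def lookup(self, pc):
        """Return the stored value if the load can be eliminated, else None."""
        entry = self.table.get(pc)
        if entry is not None and entry["eliminate"]:
            return entry["value"]
        return None

    def _stop(self, pcs):
        # Clear the eliminate flag so the next instance executes normally.
        for pc in pcs:
            self.table[pc]["eliminate"] = False

    def on_register_write(self, reg):
        """A source register was modified: the load address may change."""
        self._stop(self.register_monitor.pop(reg, set()))

    def on_store_or_snoop(self, address):
        """The monitored address was written: the loaded data may change."""
        self._stop(self.address_monitor.pop(address, set()))

# Values follow the slide's example; PCX is a hypothetical program counter.
model = EliminationModel()
PCX = 0x400
model.insert(PCX, ["rbp"], 0x4200E0, 0x2AE)
hit = model.lookup(PCX)          # eliminated: value comes from the table
model.on_register_write("rax")   # unrelated register, nothing changes
still_hit = model.lookup(PCX)
model.on_register_write("rbp")   # e.g. add rbp, 0xd8
miss = model.lookup(PCX)         # flag cleared: the load gets executed
```

Indexing the monitors by register and by address means each register write or store needs only one lookup to find every load it invalidates.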
More in the Paper
Evaluation
Methodology
Mechanisms compared against
Configurations
Performance Improvement in noSMT
4.7%
5.1%
3.4%
EVES (the state-of-the-art load value predictor)
Constable
EVES + Constable
Constable alone provides performance similar to EVES
with only half of EVES’ storage overhead
Constable on top of EVES outperforms EVES alone
Performance Improvement in 2-way SMT
EVES (the state-of-the-art load value predictor): 3.6%
Constable: 8.8%
EVES + Constable: 11.3%
Constable provides higher performance benefits
in a 2-way SMT processor
Improvement in Resource Efficiency
Reduction in Reservation Station Allocation: 8.8% average
Reduction in L1 Data Cache Accesses: 26% average
Constable significantly improves resource efficiency
by eliminating load instruction execution
Reduction in Dynamic Power
OoO Unit Power: 5.1% average reduction
Memory Execution Unit Power: 9.1% average reduction
By eliminating load instruction execution,
Constable reduces dynamic power consumption
Area and Power Overhead of Constable’s Own Structures
Storage overhead per core: 12.4 KB
Area: 0.232 mm², 0.0061% of the area of an Intel Alder Lake-S processor
Low energy: up to 10.8 pJ/read and 16.7 pJ/write
More in the Paper
To Summarize...
Our Key Findings
A large fraction (34%) of dynamic loads fetch
the same data from the same address
throughout the entire workload
These global-stable loads cause significant ILP loss due to resource dependence
Eliminating global-stable load execution provides
more than 2x the performance benefit
of just breaking their load data dependency
Our Proposal: Constable
Identifies and eliminates loads
that repeatedly fetch the same data from the same address
High performance benefit over a strong baseline system,
without SMT (5.1%) and with 2-way SMT (8.8%)
Improves resource efficiency: L1-D access reduction by 26%, RS allocation reduction by 8.8%
Reduces dynamic power: L1-D power by 9.1%, RS power by 5.1%
Low storage overhead: only 12.4 KB/core, 0.232 mm² in 14-nm technology
There’s Still Headroom...
Constable successfully eliminates
57% of all global-stable loads at runtime
43% of global-stable loads
do not get eliminated
We need to understand more
software primitives that generate global-stable loads
Open-Source Tool
Load Inspector: a tool to analyze load instructions in any off-the-shelf x86(-64) program
Study global-stable loads
Study the effects of increasing architectural registers
using the APX extension to the x86-64 ISA
Improving Performance and Power Efficiency
by Safely Eliminating Load Instruction Execution
arXiv
Load Inspector
BACKUP
Index
Motivation & Design
Methodology & Evaluation
Why Do Global-Stable Loads Exist?
Example code from 541.leela_r
Global scope variable:
Random* Random::s_rng = 0;
Function to access the variable:
Random* Random::get_Rng(void)
{
    if (s_rng == 0)
        s_rng = new Random();
    return s_rng;
}
Disassembly (compiled by GNU GCC 13.2 with -O3):
624: mov rax, [rip+0x1f4ac5]    ; load of s_rng: global-stable load
62b: test rax,rax
62e: je 0x638
630: ret
631: nop
638: sub rsp,0x8
63c: mov edi,0xc
641: call 0x460
s_rng gets initialized only once and never changes,
so the load at 624 is a global-stable load
Global-stable loads exist for many reasons.
Effects of Increasing Architectural Registers
Compiled with Clang 18.1.3 with and without -mapxf
The fraction of global-stable loads
is nearly the same with and without APX
The fraction of global-stable loads (i.e., Constable’s opportunities)
is much higher than the reduction in loads by APX
The profile of global-stable loads stays largely unchanged
after doubling registers using APX
Characterization of Global-Stable Loads (I)
Characterization of Global-Stable Loads (II)
Characterization of Global-Stable Loads (III)
Resource Dependence by Global-Stable Loads
Global-stable loads cause significant resource dependence
Performance Headroom Analysis
Constable Overview
Architecting SLD
Effect of Wrong-Path Update
An Example of Constable’s Operation
Ensuring Coherence
Constable’s Storage Overhead
Area and Power Overhead of Constable’s Structures
Simulated System Parameters
Evaluated Workloads
Workload-Wise Performance
Load Category-Wise Performance
Performance Comparison with Prior Works
Elimination Coverage
Coverage of Global-Stable Loads
Reduction in Dynamic Power
Performance Sensitivity to Load Execution Width Scaling
Performance Sensitivity to Pipeline Depth Scaling
Eliminated Loads that Violate Memory Ordering
Only 0.09% of all eliminated loads
violate memory ordering
Eliminated loads that violate memory ordering
increase the number of allocated instructions by only 0.3%
Executive Summary
Key Findings
Key Result: Simultaneously improves performance and power efficiency of a strong baseline OoO processor
Key Mechanism