1 of 22

Emender: Optimizing Prefetch Priority and Throttling in VBerti+Pythia

DPC4 Championship

Jiajie Chen · Tingji Zhang (Presenter)

Xiaoyi Liu · Xuefeng Zhang · Peng Qu · Youhui Zhang

Tsinghua University


2 of 22

Presentation Overview

1. Background & Motivation
2. Competition Approach
3. Problem Analysis
4. Four Key Optimizations
5. Evaluation Results
6. Conclusion


3 of 22

Background - Data Prefetching

The Memory Wall Problem

  • Growing performance gap between processor and memory speeds
  • Data prefetching has become increasingly critical

What is Data Prefetching?

  • Hardware/software component that predicts future memory accesses
  • Proactively loads data from main memory into cache
  • Reduces or hides data access latency

Ideal: Data already in cache when processor needs it


4 of 22

Competition Approach - Finding SOTA

Step 1: Evaluate Existing Prefetchers

Evaluated 13 Prefetchers

  • SMS (ISCA '06), ISB (MICRO-46), SPP (MICRO-49), Bingo (HPCA '19), Triage (TC '21)
  • BOP (DPC-2), Berti (DPC-3), IPCP (DPC-3)
  • Pythia (MICRO-54), PMP (MICRO-55), VBerti (MICRO-55), Bandit (MICRO-56), Gaze (HPCA '25)

Methodology

  • Used existing implementations where available; otherwise built them from scratch
  • Ported to DPC4 ChampSim platform
  • Replaced baseline L1/L2/L3 prefetchers

Step 2: Comprehensive Evaluation


  • Tested on DPC4 trace suite
  • Measured performance across configurations
  • Searched for optimal combination

VBerti + Pythia = Best Performing Combination


5 of 22

Challenge: Evaluation Time

The Problem


  • Full evaluation takes ~1 day per prefetcher combination
  • Too many combinations to test exhaustively

Our Solution: Scale Validation

Experiment

  • Tested traces at 1%, 10%, and 100% scale
  • Compared relative rankings

Finding

Relative performance rankings are consistent across scales

Methodology

  • Use 1% scale for rapid evaluation of many combinations
  • Validate promising candidates at full scale

Result

VBerti + Pythia = Best Performing Combination

Key Insight: Scale validation enabled us to evaluate 100x more combinations in the same timeframe


6 of 22

Problem Analysis

Findings

Observation: VBerti+Pythia generates massive prefetch request volume

  • Prefetch requests generated: very high
  • Prefetch queue capacity: limited

Key Insight

Must use the limited prefetch queue more efficiently

  • Prioritization: Send more useful prefetches first
  • Filtering: Don't send useless prefetches at all


7 of 22

Two Optimization Directions

1. Prioritization

If only some prefetches can enter the queue, prioritize high-confidence ones

Current VBerti Limitation

  • Only prioritizes prefetches from the same load instruction
  • Multiple load instructions → no cross-instruction prioritization

Opportunity

Global prioritization across all load instructions

2. Filter Useless Prefetches

Some prefetches target lines that are already in the cache

Challenge

  • Waste of precious prefetch queue space
  • No benefit, only cost
  • Need a cache-membership query with low hardware overhead

Need

Approximate, efficient cache membership test


8 of 22

Optimization 1 - Pending Target Buffer

Design: Global Prefetch Prioritization

Load Instructions → Generate Prefetches

Pending Target Buffer

(sorts by confidence)

Issue high-confidence first

Key Features

  • Buffers prefetch requests from ALL load instructions
  • Sorts by confidence globally
  • Issues high-confidence prefetches first

Implementation Details

Each Entry Contains

  • Cache line address (58 bits)
  • Confidence level (5 bits)
  • Status, IP hash, Sequence ID
  • Failure count (1 bit)

Policies

  • Duplicate filtering within buffer
  • Replace lowest-confidence when full
  • Periodic eviction of old entries
  • Evict entries with excessive failures

Hardware Cost: 6,884 bits (~861 bytes)
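
A minimal C++ sketch of the entry layout and the policies above, assuming a small fully-associative buffer. Field names, the linear scans, and the replacement condition are illustrative readings of the slide, not the exact championship logic:

```cpp
#include <cstdint>
#include <vector>

// Pending Target Buffer sketch: buffers prefetch targets from ALL load
// instructions and issues the highest-confidence target first.
struct PTBEntry {
  uint64_t line_addr : 58;   // prefetch target (cache-line address)
  uint64_t confidence : 5;   // prediction confidence
  uint64_t failed : 1;       // 1-bit failure count (per slide)
  uint64_t status : 2;       // e.g., pending/issued (assumed encoding)
  uint16_t ip_hash = 0;      // hash of the generating load PC
  uint16_t seq_id = 0;       // insertion order, used for periodic aging
  bool valid = false;
};

struct PendingTargetBuffer {
  std::vector<PTBEntry> entries;                  // zero-initialized
  explicit PendingTargetBuffer(size_t n) : entries(n) {}

  void insert(PTBEntry e) {
    // Policy: duplicate filtering within the buffer.
    for (const auto& cur : entries)
      if (cur.valid && cur.line_addr == e.line_addr) return;
    // Policy: prefer a free slot; when full, victimize the
    // lowest-confidence entry.
    PTBEntry* victim = &entries[0];
    for (auto& cur : entries) {
      if (!cur.valid) { victim = &cur; break; }
      if (cur.confidence < victim->confidence) victim = &cur;
    }
    if (!victim->valid || e.confidence >= victim->confidence) {
      e.valid = true;
      *victim = e;
    }
  }

  // Global prioritization: the highest-confidence target enters the
  // prefetch queue first, regardless of which load produced it.
  PTBEntry* next_to_issue() {
    PTBEntry* best = nullptr;
    for (auto& cur : entries)
      if (cur.valid && (!best || cur.confidence > best->confidence))
        best = &cur;
    return best;  // caller marks it issued or invalidates it
  }
};
```

Periodic aging via seq_id and the failure-based eviction would run in the same scans; they are omitted here for brevity.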


9 of 22

Optimization 2 - Cuckoo Filter

Goal

Eliminate prefetches for data already in cache

Challenge: Direct Cache Lookup

Why not just check the cache?

  • Cache read port contention
  • Additional read ports increase hardware cost
  • Prefetcher should not interfere with cache operations

Solution: Cuckoo Filter

What is it?

  • Probabilistic data structure for set membership
  • Like Bloom Filter, but supports deletions
  • Better space efficiency

Our Application

  • Tracks which virtual addresses are in L1D cache
  • O(1) query time
  • Supports insertions and deletions
  • High accuracy, low false positive rate (zero false negatives)

Hardware Cost: 53,248 bits (6.5 KB)

1024 sets x 4 ways

Each entry: 1-bit valid + 12-bit fingerprint

Key Benefit: Zero false negatives - when filter says "not in cache", confidently prefetch
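
For concreteness, a C++ sketch of a cuckoo filter with the geometry above (1024 buckets x 4 ways, 12-bit fingerprints). The hash constants are placeholders, and a zero fingerprint stands in for the valid bit to keep the sketch short:

```cpp
#include <cstdint>
#include <random>
#include <utility>

class CuckooFilter {
  static constexpr uint32_t kBuckets = 1024;
  static constexpr int kWays = 4, kMaxKicks = 32;
  uint16_t slot[kBuckets][kWays] = {};  // 0 = empty, else 12-bit fingerprint

  static uint16_t fingerprint(uint64_t line) {
    uint16_t fp = uint16_t((line * 0x9E3779B97F4A7C15ull) >> 52);  // 12 bits
    return fp ? fp : 1;                        // reserve 0 for "empty"
  }
  static uint32_t index1(uint64_t line) {      // `line` = VA >> 6
    return uint32_t(line) & (kBuckets - 1);    // a real design would hash
  }
  static uint32_t altIndex(uint32_t i, uint16_t fp) {
    return (i ^ (fp * 0x5bd1e995u)) & (kBuckets - 1);  // partial-key cuckoo
  }

public:
  // Zero false negatives: if this returns false, the line was never
  // inserted (or was erased), so prefetching it is never redundant.
  bool contains(uint64_t line) const {
    uint16_t fp = fingerprint(line);
    uint32_t i1 = index1(line), i2 = altIndex(i1, fp);
    for (int w = 0; w < kWays; w++)
      if (slot[i1][w] == fp || slot[i2][w] == fp) return true;
    return false;
  }

  bool insert(uint64_t line) {
    uint16_t fp = fingerprint(line);
    uint32_t i = index1(line);
    for (uint32_t b : {i, altIndex(i, fp)})
      for (int w = 0; w < kWays; w++)
        if (slot[b][w] == 0) { slot[b][w] = fp; return true; }
    static std::minstd_rand rng;               // both buckets full: relocate
    for (int k = 0; k < kMaxKicks; k++) {
      std::swap(fp, slot[i][rng() % kWays]);
      i = altIndex(i, fp);
      for (int w = 0; w < kWays; w++)
        if (slot[i][w] == 0) { slot[i][w] = fp; return true; }
    }
    return false;                              // a real design must handle overflow
  }

  bool erase(uint64_t line) {
    uint16_t fp = fingerprint(line);
    uint32_t i1 = index1(line);
    for (uint32_t b : {i1, altIndex(i1, fp)})
      for (int w = 0; w < kWays; w++)
        if (slot[b][w] == fp) { slot[b][w] = 0; return true; }
    return false;
  }
};
```

The prefetcher gates issuing on `!filter.contains(line)`, which is safe precisely because a "not present" answer is never wrong.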


10 of 22

VIPT Cache Aliasing Problem for Cuckoo Filter

The Challenge

Our Design Choice

  • Use virtual addresses for prefetching (better performance)

Hardware Reality

  • L1D cache uses VIPT (Virtually Indexed, Physically Tagged)
  • Multiple VAs can map to same PA (aliasing)

Problem

VA₁ → PA → Cache Line

VA₂ → PA → Same Cache Line (aliasing!)

Our Solution

Track Last-Accessed VA

  • Cache records last-accessed VA for each cacheline
  • Store extra 22-bit hash alongside cacheline
  • Used for cuckoo filter updates

Implementation

  • Extra storage in Shadow Cache
  • 22-bit index+fingerprint per cacheline
  • Handles most common cases efficiently

Note on False Negatives

Cuckoo Filter itself has zero false negatives. However, due to VIPT aliasing, we only track the last-accessed VA per cacheline. This may cause unnecessary prefetches when different VAs alias to the same cacheline, but the impact is minimal since aliasing is rare.
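
A sketch of how the stored 22-bit tag might drive filter maintenance. The `Filter` hooks, function names, and update points are assumptions layered on the cuckoo-filter sketch above:

```cpp
#include <cstdint>

// Per-cacheline shadow metadata: 10-bit filter index + 12-bit fingerprint
// of the LAST virtual address that touched the line (22 bits, per slide).
struct ShadowMeta {
  uint16_t idx = 0;
  uint16_t fp = 0;
  bool valid = false;
};

// Stand-in for the cuckoo filter, extended with direct (index, fingerprint)
// hooks so eviction needs only the stored 22 bits, never the full VA.
struct Filter {
  void insert_at(uint16_t idx, uint16_t fp) { /* put fp into bucket idx */ }
  void erase_at(uint16_t idx, uint16_t fp) { /* clear fp from bucket idx */ }
};

// Access (or fill) through a VA hashing to (idx, fp): retire the previous
// alias's tag, then track the new one. Only the last-accessed VA survives,
// which is where the rare aliasing-induced extra prefetches come from.
void on_access(Filter& f, ShadowMeta& m, uint16_t idx, uint16_t fp) {
  if (m.valid && (m.idx != idx || m.fp != fp)) f.erase_at(m.idx, m.fp);
  f.insert_at(idx, fp);
  m = {idx, fp, true};
}

// Eviction: the 22-bit tag is sufficient to remove the filter entry.
void on_evict(Filter& f, ShadowMeta& m) {
  if (m.valid) f.erase_at(m.idx, m.fp);
  m.valid = false;
}
```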


11 of 22

Opt. 3 - Dynamic Confidence Threshold

Observation

Some Load PCs Have Semi-unpredictable Patterns

  • Accesses: A, A+2, B, B+2, C, C+2, ...
  • Where A, B, C are random addresses

Problem

  • Delta=2 appears frequently
  • Prefetches A+4, B+4, C+4 (often useless!)
  • Short access intervals → prefetches arrive too late even for A+2, B+2, ...

Solution: Per-PC Throttling

Track Per-Load Instruction Effectiveness

  • Maintain saturating counter per load PC
  • Increment on cache miss
  • Decrement probabilistically on hit

Dynamic Adjustment

  • High miss rate → Increase threshold → Less prefetching
  • Low miss rate → Decrease threshold → Normal prefetching

Hardware Cost: 10-bit miss counter per entry
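
A sketch of the per-PC mechanism, assuming a 1/8 decrement probability and an illustrative counter-to-threshold mapping; the slide specifies only the 10-bit saturating counter and the increment/decrement events:

```cpp
#include <cstdint>
#include <random>

// Per-load-PC throttle: persistent misses despite prefetching indicate a
// semi-unpredictable pattern, so demand higher confidence before issuing.
struct PCThrottle {
  static constexpr uint16_t kMax = (1u << 10) - 1;  // 10-bit saturating
  uint16_t miss_ctr = 0;

  void on_demand_access(bool was_miss) {
    static std::minstd_rand rng;
    if (was_miss) {
      if (miss_ctr < kMax) miss_ctr++;   // increment on cache miss
    } else if (miss_ctr > 0 && rng() % 8 == 0) {
      miss_ctr--;                        // probabilistic decay on hit
    }
  }

  // High miss counter -> higher threshold -> less prefetching for this PC.
  // Thresholds are on the 5-bit (0..31) confidence scale; cut points are
  // illustrative.
  uint8_t confidence_threshold() const {
    if (miss_ctr > 768) return 24;   // mostly missing: highest confidence only
    if (miss_ctr > 256) return 12;   // mixed: be selective
    return 0;                        // behaving well: normal prefetching
  }
};

// Usage: gate each candidate prefetch on its PC's current threshold, i.e.
// issue only when the predicted confidence meets confidence_threshold().
```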


12 of 22

Optimization 4 - L3 Fairness-based Throttling

Problem: Multi-core Resource Contention

VBerti+Pythia is Very Aggressive

  • Single core can monopolize memory bandwidth
  • Other cores are starved of resources
  • Poor overall multi-core performance
  • Competition metric: Harmonic Speedup penalizes unfairness

Core 0: ████████████████████ (hogs)

Core 1: ██ (starved)

Core 2: █ (starved)

Core 3: ███ (starved)

Solution: Cross-core Coordination

Mechanism

  • L2 reports useless prefetch statistics to L3
  • L3 aggregates statistics from all cores
  • Identifies worst offender
  • Throttles that core's prefetching

Metadata Exchange

  • Inter-prefetcher communication channel
  • Periodic statistics collection
  • Distributed throttling decisions

Hardware Cost: 64B counters
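
A sketch of the coordination loop under stated assumptions (per-core 32-bit counters, a minimum activity level before acting, and a throttle hook on the offending core's L2 prefetcher); the actual heuristic and epoch length are not specified here:

```cpp
#include <array>
#include <cstdint>

constexpr int kCores = 4;

// L3-side aggregator: each core's L2 periodically reports how many of its
// prefetches turned out useless; L3 throttles only the worst offender so
// well-behaved cores keep their full prefetch aggressiveness.
struct L3Coordinator {
  std::array<uint32_t, kCores> useless = {};

  void report(int core, uint32_t useless_prefetches) {
    useless[core] = useless_prefetches;  // metadata exchange from L2
  }

  // Once per epoch: pick the core wasting the most prefetches, provided
  // some core is above a minimum activity level worth acting on.
  int worst_offender(uint32_t min_to_act = 256) const {
    int worst = -1;
    uint32_t worst_cnt = min_to_act;
    for (int c = 0; c < kCores; c++)
      if (useless[c] > worst_cnt) { worst = c; worst_cnt = useless[c]; }
    return worst;  // -1: nobody is bad enough to throttle
  }
};

// Epoch loop (pseudo-usage):
//   int victim = coord.worst_offender();
//   if (victim >= 0) signal that core's L2 to reduce aggressiveness;
//   then clear the counters for the next epoch.
```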


13 of 22

Emender Architecture Overview

L1D Cache (Emender-L1D)

Base: VBerti prefetcher

Enhancements:

  • Pending Target Buffer
  • Cuckoo Filter
  • Dynamic Confidence Threshold

22.8 KB (within 32 KB limit)

L2 Cache (Emender-L2)

Base: Pythia prefetcher

Enhancements:

  • Extended throttling response
  • Metadata exchange with L3

25.5 KB (within 128 KB limit)

L3 Cache (Emender-L3)

Design: No prefetching at L3

Rationale:

  • Access patterns reaching L3 are difficult to exploit
  • Aggressive L3 prefetching hurts multi-core performance
  • Under DPC4 scoring, L3 prefetching lowered our overall score

Role: Fairness-based throttling coordinator

64 B (within 256 KB limit)

Key Design: Hierarchical optimization with coordinated metadata exchange


14 of 22

Evaluation Setup

Platform

DPC4 ChampSim Framework

Configurations

1C.fullBW (Single-core Full)

  • 1 core, 4800 MT/s
  • Full memory bandwidth

1C.limitBW (Single-core Limited)

  • 1 core, 800 MT/s
  • Memory-constrained

4C (Multi-core)

  • 4 cores, 4x LLC
  • Tests fairness

Traces

  • DPC4-Traces: Publicly available
  • Subset used (time constraints)
  • Trends consistent with full set

Scoring Methodology

Score = (1C.fullBW × 1C.limitBW × 4C)^(1/3)

Single-core: Geometric mean of IPC

Multi-core: Geometric mean of harmonic speedup
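
Worked check with the headline gains reported later: (1.066 × 1.020 × 1.061)^(1/3) ≈ 1.048, i.e. the per-configuration improvements of +6.6%, +2.0%, and +6.1% compose into the +4.8% overall score.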

Baselines

Competition baseline: Berti (L1D) + Pythia (L2)


15 of 22

Overall Results

Speedup vs. Baseline and other prefetchers


Chart Interpretation

Key Observations

  • Emender outperforms all compared prefetchers
  • VBerti alone performs slightly better than the other prior prefetchers
  • Consistent gains across all three configurations

Overall Score: +4.8% improvement

16 of 22

Ablation Study

Individual Feature Impact

Feature                         1C.fullBW   1C.limitBW   4C      Overall
Cuckoo Filter                   1.3%        0.2%         1.0%    0.8%
L3 Fairness Throttling          0.0%        0.0%         2.6%    0.9%
Dynamic Confidence Threshold    0.1%        0.2%         0.2%    0.1%
Pending Target Buffer           0.4%        0.0%         0.4%    0.2%
Combined (Emender)              6.6%        2.0%         6.1%    4.8%

Cuckoo Filter

  • Largest benefit for single-core
  • Eliminates redundant prefetches
  • More effective with plentiful bandwidth

L3 Fairness Throttling

  • Critical for multi-core performance
  • No single-core impact (as expected)
  • 2.6% improvement in 4C

Other Features

  • DCT: modest but consistent gains
  • PTB: foundation for the other optimizations
  • All features contribute positively


17 of 22

S-Curve Analysis (1C.fullBW)

Distribution of Speedup

Best Case

  • >1.4x speedup
  • Predictable workloads benefit greatly

Median Case

  • ~1.1x speedup
  • Majority of workloads benefit

Worst Case

  • <10% slowdown
  • Very few workloads see regression
  • Robust design avoids significant degradation

(S-curve chart: per-workload speedup, sorted)

Interpretation: the majority of workloads see improvement, with few significant regressions

Robust across diverse applications


18 of 22

Conclusion

Built upon VBerti+Pythia with Four Optimizations

1. Pending Target Buffer
  • Global prioritization
  • ~861 bytes

2. Cuckoo Filter
  • Eliminates redundant prefetches
  • 6.5 KB; most impactful for single-core

3. Dynamic Confidence Threshold
  • Per-PC throttling
  • Self-adaptive

4. L3 Fairness-based Throttling
  • Multi-core coordination
  • Critical for multi-core performance

Results: Overall Score Improvement +4.8%

  • Single-core Full BW: +6.6%
  • Single-core Limited BW: +2.0%
  • Multi-core: +6.1%


19 of 22

Discussion and Future Work

Hardware Implementability

Berti Table Design

  • Current: Fully-associative design
  • Future: Set-associative (e.g., 16-set, 4-way with NRU)
  • Reference: Berto approach for practical implementation

Pending Target Buffer Sorting

  • Challenge: Sorting overhead in hardware
  • Solution: Bucket sort-like mechanism
  • Partition entries by priority levels
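
One way to realize this (a sketch; bucket count and the confidence-to-bucket mapping are illustrative): partition targets into a few confidence-indexed FIFOs and always issue from the highest non-empty bucket, turning the sort into a constant-depth priority scan.

```cpp
#include <array>
#include <cstdint>
#include <deque>

// Bucket-sort-like Pending Target Buffer: no full sort in hardware; a
// fixed priority encoder over a handful of buckets picks the next target.
constexpr int kLevels = 4;

struct BucketedPTB {
  std::array<std::deque<uint64_t>, kLevels> bucket;  // FIFO per priority

  static int level(uint8_t confidence) {  // 5-bit confidence -> 0..3
    return confidence >> 3;
  }
  void insert(uint64_t line_addr, uint8_t confidence) {
    bucket[level(confidence)].push_back(line_addr);
  }
  // Issue from the highest non-empty priority bucket; FIFO order within a
  // bucket doubles as simple aging.
  bool next(uint64_t& line_addr) {
    for (int b = kLevels - 1; b >= 0; b--)
      if (!bucket[b].empty()) {
        line_addr = bucket[b].front();
        bucket[b].pop_front();
        return true;
      }
    return false;  // nothing pending
  }
};
```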

Prefetch Filtering Techniques

Multiple Filtering Layers in Emender

  • Cuckoo Filter: Eliminates redundant L1D prefetches
  • Pending Target Buffer: Deduplication
  • Prefetch Queue: Additional filtering

Existing Techniques

  • T-SKID: Shadow Cache + Prefetch Queue filtering
  • Sangam: 40-entry Recent Access Filter
  • PPF: Perceptron-based filtering

Future Research: Comprehensive comparison of trade-offs in hardware overhead, accuracy, and performance impact.


20 of 22

Thank You!

Acknowledgments

  • DPC4 organizers and ChampSim developers
  • Authors of VBerti and Pythia prefetchers

Questions?


21 of 22

Appendix: Evaluation on DPC4-All-Traces

Performance Results

Methodology

Training vs Evaluation Traces

  • Training: DPC4-Traces (public)
  • Evaluation: DPC4-All-Traces (official)
  • Earlier slides used a 1% instruction subset
  • This appendix reports results on the complete traces

Key Differences in Trace Composition

  • Different benchmark distribution
  • Google-Traces-v2 proportion: 32% → 59%
  • More SPEC17 benchmarks with non-memory bottlenecks

22 of 22

Appendix: Analysis and Insights

Single-Core Performance

Trends Consistent with Training

  • Most categories match training trace results
  • SPEC17 shows slight regression
  • Due to non-memory-bound benchmarks (e.g., leela)

Why the Overall Score Drops

  • Google-Traces-v2 proportion increased from 32% to 59% in the evaluation traces
  • Our optimizations provide minimal benefit on these traces

Multi-Core Performance

Divergent Characteristics

  • Training: Random sampling across all benchmarks
  • Evaluation: Category-independent sampling
  • Results in more homogeneous behavior across cores

Throttling Mechanism Impact

  • Training: high cross-core diversity amplifies throttling effectiveness
  • Evaluation: homogeneous behavior reduces throttling benefits
  • Our current throttling proves suboptimal in the evaluation scenario

Insight: Throttling effectiveness depends on workload diversity across cores.