1 of 22

Emender: Optimizing Prefetch Priority and Throttling in VBerti+Pythia

DPC4 Championship

Jiajie Chen · Tingji Zhang (Presenter)

Xiaoyi Liu · Xuefeng Zhang · Peng Qu · Youhui Zhang

Tsinghua University


2 of 22

Presentation Overview

1. Background & Motivation
2. Competition Approach
3. Problem Analysis
4. Four Key Optimizations
5. Evaluation Results
6. Conclusion


3 of 22

Background - Data Prefetching

The Memory Wall Problem

  • Growing performance gap between processor and memory speeds
  • Data prefetching has become increasingly critical

What is Data Prefetching?

  • Hardware/software component that predicts future memory accesses
  • Proactively loads data from main memory into cache
  • Reduces or hides data access latency

Ideal: Data already in cache when processor needs it


4 of 22

Competition Approach - Finding SOTA

Step 1: Evaluate Existing Prefetchers

Evaluated 13 Prefetchers

  • SMS (ISCA '06), ISB (MICRO-46), SPP (MICRO-49), Bingo (HPCA '19), Triage (TC '21)
  • BOP (DPC-2), Berti (DPC-3), IPCP (DPC-3)
  • Pythia (MICRO-54), PMP (MICRO-55), VBerti (MICRO-55), Bandit (MICRO-56), Gaze (HPCA '25)

Methodology

  • Used existing implementations where available; otherwise built them from scratch
  • Ported to DPC4 ChampSim platform
  • Replaced baseline L1/L2/L3 prefetchers

Step 2: Comprehensive Evaluation


  • Tested on DPC4 trace suite
  • Measured performance across configurations
  • Searched for optimal combination

VBerti + Pythia = Best Performing Combination


5 of 22

Challenge: Evaluation Time

The Problem


  • Full evaluation takes ~1 day per prefetcher combination
  • Too many combinations to test exhaustively

Our Solution: Scale Validation

Experiment

  • Tested traces at 1%, 10%, and 100% scale
  • Compared relative rankings

Finding

Relative performance rankings are consistent across scales

Methodology

  • Use 1% scale for rapid evaluation of many combinations
  • Validate promising candidates at full scale

Result

VBerti + Pythia = Best Performing Combination

Key Insight: Scale validation enabled us to evaluate 100x more combinations in the same timeframe


6 of 22

Problem Analysis

Findings

Observation: VBerti+Pythia generates massive prefetch request volume

  • Prefetch requests generated: very high
  • Prefetch queue capacity: limited

Key Insight

Must use the limited prefetch queue more efficiently

  • Prioritization: Send more useful prefetches first
  • Filtering: Don't send useless prefetches at all


7 of 22

Two Optimization Directions

1. Prioritization

If only some prefetches can enter the queue, prioritize high-confidence ones

Current VBerti Limitation

  • Only prioritizes prefetches from the same load instruction
  • Multiple load instructions → no cross-instruction prioritization

Opportunity

Global prioritization across all load instructions

2. Filter Useless Prefetches

Some prefetches target lines that are already in the cache

Challenge

  • Waste of precious prefetch queue space
  • No benefit, only cost
  • Need a cache-membership query with low hardware overhead

Need

Approximate, efficient cache membership test


8 of 22

Optimization 1 - Pending Target Buffer

Design: Global Prefetch Prioritization

Load Instructions → Generate Prefetches

Pending Target Buffer

(sorts by confidence)

Issue high-confidence first

Key Features

  • Buffers prefetch requests from ALL load instructions
  • Sorts by confidence globally
  • Issues high-confidence prefetches first

Implementation Details

Each Entry Contains

  • Cache line address (58 bits)
  • Confidence level (5 bits)
  • Status, IP hash, Sequence ID
  • Failure count (1 bit)

Policies

  • Duplicate filtering within buffer
  • Replace lowest-confidence when full
  • Periodic eviction of old entries
  • Evict entries with excessive failures

Hardware Cost: 6,884 bits (~861 bytes)
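
A minimal C++ sketch of the entry layout and the policies above, assuming a small fully-associative buffer. Field names, the linear scans, and the replacement condition are illustrative readings of the slide, not the exact championship logic:

```cpp
#include <cstdint>
#include <vector>

// Pending Target Buffer sketch: buffers prefetch targets from ALL load
// instructions and issues the highest-confidence target first.
struct PTBEntry {
  uint64_t line_addr : 58;   // prefetch target (cache-line address)
  uint64_t confidence : 5;   // prediction confidence
  uint64_t failed : 1;       // 1-bit failure count (per slide)
  uint64_t status : 2;       // e.g., pending/issued (assumed encoding)
  uint16_t ip_hash = 0;      // hash of the generating load PC
  uint16_t seq_id = 0;       // insertion order, used for periodic aging
  bool valid = false;
};

struct PendingTargetBuffer {
  std::vector<PTBEntry> entries;                  // zero-initialized
  explicit PendingTargetBuffer(size_t n) : entries(n) {}

  void insert(PTBEntry e) {
    // Policy: duplicate filtering within the buffer.
    for (const auto& cur : entries)
      if (cur.valid && cur.line_addr == e.line_addr) return;
    // Policy: prefer a free slot; when full, victimize the
    // lowest-confidence entry.
    PTBEntry* victim = &entries[0];
    for (auto& cur : entries) {
      if (!cur.valid) { victim = &cur; break; }
      if (cur.confidence < victim->confidence) victim = &cur;
    }
    if (!victim->valid || e.confidence >= victim->confidence) {
      e.valid = true;
      *victim = e;
    }
  }

  // Global prioritization: the highest-confidence target enters the
  // prefetch queue first, regardless of which load produced it.
  PTBEntry* next_to_issue() {
    PTBEntry* best = nullptr;
    for (auto& cur : entries)
      if (cur.valid && (!best || cur.confidence > best->confidence))
        best = &cur;
    return best;  // caller marks it issued or invalidates it
  }
};
```

Periodic aging via seq_id and the failure-based eviction would run in the same scans; they are omitted here for brevity.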


9 of 22

Optimization 2 - Cuckoo Filter

Goal

Eliminate prefetches for data already in cache

Challenge: Direct Cache Lookup

Why not just check the cache?

  • Cache read port contention
  • Additional read ports increase hardware cost
  • Prefetcher should not interfere with cache operations

Solution: Cuckoo Filter

What is it?

  • Probabilistic data structure for set membership
  • Like Bloom Filter, but supports deletions
  • Better space efficiency

Our Application

  • Tracks which virtual addresses are in L1D cache
  • O(1) query time
  • Supports insertions and deletions
  • High accuracy, low false positive rate (zero false negatives)

Hardware Cost: 53,248 bits (6.5 KB)

1024 sets x 4 ways

Each entry: 1-bit valid + 12-bit fingerprint

Key Benefit: Zero false negatives - when filter says "not in cache", confidently prefetch
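
For concreteness, a C++ sketch of a cuckoo filter with the geometry above (1024 buckets x 4 ways, 12-bit fingerprints). The hash constants are placeholders, and a zero fingerprint stands in for the valid bit to keep the sketch short:

```cpp
#include <cstdint>
#include <random>
#include <utility>

class CuckooFilter {
  static constexpr uint32_t kBuckets = 1024;
  static constexpr int kWays = 4, kMaxKicks = 32;
  uint16_t slot[kBuckets][kWays] = {};  // 0 = empty, else 12-bit fingerprint

  static uint16_t fingerprint(uint64_t line) {
    uint16_t fp = uint16_t((line * 0x9E3779B97F4A7C15ull) >> 52);  // 12 bits
    return fp ? fp : 1;                        // reserve 0 for "empty"
  }
  static uint32_t index1(uint64_t line) {      // `line` = VA >> 6
    return uint32_t(line) & (kBuckets - 1);    // a real design would hash
  }
  static uint32_t altIndex(uint32_t i, uint16_t fp) {
    return (i ^ (fp * 0x5bd1e995u)) & (kBuckets - 1);  // partial-key cuckoo
  }

public:
  // Zero false negatives: if this returns false, the line was never
  // inserted (or was erased), so prefetching it is never redundant.
  bool contains(uint64_t line) const {
    uint16_t fp = fingerprint(line);
    uint32_t i1 = index1(line), i2 = altIndex(i1, fp);
    for (int w = 0; w < kWays; w++)
      if (slot[i1][w] == fp || slot[i2][w] == fp) return true;
    return false;
  }

  bool insert(uint64_t line) {
    uint16_t fp = fingerprint(line);
    uint32_t i = index1(line);
    for (uint32_t b : {i, altIndex(i, fp)})
      for (int w = 0; w < kWays; w++)
        if (slot[b][w] == 0) { slot[b][w] = fp; return true; }
    static std::minstd_rand rng;               // both buckets full: relocate
    for (int k = 0; k < kMaxKicks; k++) {
      std::swap(fp, slot[i][rng() % kWays]);
      i = altIndex(i, fp);
      for (int w = 0; w < kWays; w++)
        if (slot[i][w] == 0) { slot[i][w] = fp; return true; }
    }
    return false;                              // a real design must handle overflow
  }

  bool erase(uint64_t line) {
    uint16_t fp = fingerprint(line);
    uint32_t i1 = index1(line);
    for (uint32_t b : {i1, altIndex(i1, fp)})
      for (int w = 0; w < kWays; w++)
        if (slot[b][w] == fp) { slot[b][w] = 0; return true; }
    return false;
  }
};
```

The prefetcher gates issuing on `!filter.contains(line)`, which is safe precisely because a "not present" answer is never wrong.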


10 of 22

VIPT Cache Aliasing Problem for Cuckoo Filter

The Challenge

Our Design Choice

  • Use virtual addresses for prefetching (better performance)

Hardware Reality

  • L1D cache uses VIPT (Virtually Indexed, Physically Tagged)
  • Multiple VAs can map to same PA (aliasing)

Problem

VA₁ → PA → Cache Line

VA₂ → PA → Same Cache Line (aliasing!)

Our Solution

Track Last-Accessed VA

  • Cache records last-accessed VA for each cacheline
  • Store extra 22-bit hash alongside cacheline
  • Used for cuckoo filter updates

Implementation

  • Extra storage in Shadow Cache
  • 22-bit index+fingerprint per cacheline
  • Handles most common cases efficiently

Note on False Negatives

Cuckoo Filter itself has zero false negatives. However, due to VIPT aliasing, we only track the last-accessed VA per cacheline. This may cause unnecessary prefetches when different VAs alias to the same cacheline, but the impact is minimal since aliasing is rare.
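
A sketch of how the stored 22-bit tag might drive filter maintenance. The `Filter` hooks, function names, and update points are assumptions layered on the cuckoo-filter sketch above:

```cpp
#include <cstdint>

// Per-cacheline shadow metadata: 10-bit filter index + 12-bit fingerprint
// of the LAST virtual address that touched the line (22 bits, per slide).
struct ShadowMeta {
  uint16_t idx = 0;
  uint16_t fp = 0;
  bool valid = false;
};

// Stand-in for the cuckoo filter, extended with direct (index, fingerprint)
// hooks so eviction needs only the stored 22 bits, never the full VA.
struct Filter {
  void insert_at(uint16_t idx, uint16_t fp) { /* put fp into bucket idx */ }
  void erase_at(uint16_t idx, uint16_t fp) { /* clear fp from bucket idx */ }
};

// Access (or fill) through a VA hashing to (idx, fp): retire the previous
// alias's tag, then track the new one. Only the last-accessed VA survives,
// which is where the rare aliasing-induced extra prefetches come from.
void on_access(Filter& f, ShadowMeta& m, uint16_t idx, uint16_t fp) {
  if (m.valid && (m.idx != idx || m.fp != fp)) f.erase_at(m.idx, m.fp);
  f.insert_at(idx, fp);
  m = {idx, fp, true};
}

// Eviction: the 22-bit tag is sufficient to remove the filter entry.
void on_evict(Filter& f, ShadowMeta& m) {
  if (m.valid) f.erase_at(m.idx, m.fp);
  m.valid = false;
}
```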


11 of 22

Opt. 3 - Dynamic Confidence Threshold

Observation

Some Load PCs Have Semi-unpredictable Patterns

  • Accesses: A, A+2, B, B+2, C, C+2, ...
  • Where A, B, C are random addresses

Problem

  • Delta=2 appears frequently
  • Prefetches A+4, B+4, C+4 (often useless!)
  • Short access intervals → prefetches arrive too late even for A+2, B+2, ...

Solution: Per-PC Throttling

Track Per-Load Instruction Effectiveness

  • Maintain saturating counter per load PC
  • Increment on cache miss
  • Decrement probabilistically on hit

Dynamic Adjustment

  • High miss rate → Increase threshold → Less prefetching
  • Low miss rate → Decrease threshold → Normal prefetching

Hardware Cost: 10-bit miss counter per entry
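
A sketch of the per-PC mechanism, assuming a 1/8 decrement probability and an illustrative counter-to-threshold mapping; the slide specifies only the 10-bit saturating counter and the increment/decrement events:

```cpp
#include <cstdint>
#include <random>

// Per-load-PC throttle: persistent misses despite prefetching indicate a
// semi-unpredictable pattern, so demand higher confidence before issuing.
struct PCThrottle {
  static constexpr uint16_t kMax = (1u << 10) - 1;  // 10-bit saturating
  uint16_t miss_ctr = 0;

  void on_demand_access(bool was_miss) {
    static std::minstd_rand rng;
    if (was_miss) {
      if (miss_ctr < kMax) miss_ctr++;   // increment on cache miss
    } else if (miss_ctr > 0 && rng() % 8 == 0) {
      miss_ctr--;                        // probabilistic decay on hit
    }
  }

  // High miss counter -> higher threshold -> less prefetching for this PC.
  // Thresholds are on the 5-bit (0..31) confidence scale; cut points are
  // illustrative.
  uint8_t confidence_threshold() const {
    if (miss_ctr > 768) return 24;   // mostly missing: highest confidence only
    if (miss_ctr > 256) return 12;   // mixed: be selective
    return 0;                        // behaving well: normal prefetching
  }
};

// Usage: gate each candidate prefetch on its PC's current threshold, i.e.
// issue only when the predicted confidence meets confidence_threshold().
```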


12 of 22

Optimization 4 - L3 Fairness-based Throttling

Problem: Multi-core Resource Contention

VBerti+Pythia is Very Aggressive

  • Single core can monopolize memory bandwidth
  • Other cores are starved of resources
  • Poor overall multi-core performance
  • Competition metric: Harmonic Speedup penalizes unfairness

Core 0: ████████████████████ (hogs)

Core 1: ██ (starved)

Core 2: █ (starved)

Core 3: ███ (starved)

Solution: Cross-core Coordination

Mechanism

  • L2 reports useless prefetch statistics to L3
  • L3 aggregates statistics from all cores
  • Identifies worst offender
  • Throttles that core's prefetching

Metadata Exchange

  • Inter-prefetcher communication channel
  • Periodic statistics collection
  • Distributed throttling decisions

Hardware Cost: 64B counters
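
A sketch of the coordination loop under stated assumptions (per-core 32-bit counters, a minimum activity level before acting, and a throttle hook on the offending core's L2 prefetcher); the actual heuristic and epoch length are not specified here:

```cpp
#include <array>
#include <cstdint>

constexpr int kCores = 4;

// L3-side aggregator: each core's L2 periodically reports how many of its
// prefetches turned out useless; L3 throttles only the worst offender so
// well-behaved cores keep their full prefetch aggressiveness.
struct L3Coordinator {
  std::array<uint32_t, kCores> useless = {};

  void report(int core, uint32_t useless_prefetches) {
    useless[core] = useless_prefetches;  // metadata exchange from L2
  }

  // Once per epoch: pick the core wasting the most prefetches, provided
  // some core is above a minimum activity level worth acting on.
  int worst_offender(uint32_t min_to_act = 256) const {
    int worst = -1;
    uint32_t worst_cnt = min_to_act;
    for (int c = 0; c < kCores; c++)
      if (useless[c] > worst_cnt) { worst = c; worst_cnt = useless[c]; }
    return worst;  // -1: nobody is bad enough to throttle
  }
};

// Epoch loop (pseudo-usage):
//   int victim = coord.worst_offender();
//   if (victim >= 0) signal that core's L2 to reduce aggressiveness;
//   then clear the counters for the next epoch.
```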


13 of 22

Emender Architecture Overview

L1D Cache (Emender-L1D)

Base: VBerti prefetcher

Enhancements:

  • Pending Target Buffer
  • Cuckoo Filter
  • Dynamic Confidence Threshold

22.8 KB (within 32 KB limit)

L2 Cache (Emender-L2)

Base: Pythia prefetcher

Enhancements:

  • Extended throttling response
  • Metadata exchange with L3

25.5 KB (within 128 KB limit)

L3 Cache (Emender-L3)

Design: No prefetching at L3

Rationale:

  • Access patterns reaching L3 are difficult to exploit
  • Aggressive L3 prefetching hurts multi-core performance
  • Under DPC4 scoring, L3 prefetching lowered our overall score

Role: Fairness-based throttling coordinator

64 B (within 256 KB limit)

Key Design: Hierarchical optimization with coordinated metadata exchange


14 of 22

Evaluation Setup

Platform

DPC4 ChampSim Framework

Configurations

1C.fullBW (Single-core Full)

  • 1 core, 4800 MT/s
  • Full memory bandwidth

1C.limitBW (Single-core Limited)

  • 1 core, 800 MT/s
  • Memory-constrained

4C (Multi-core)

  • 4 cores, 4x LLC
  • Tests fairness

Traces

  • DPC4-Traces: Publicly available
  • Subset used (time constraints)
  • Trends consistent with full set

Scoring Methodology

Score = (1C.fullBW × 1C.limitBW × 4C)^(1/3)

Single-core: Geometric mean of IPC

Multi-core: Geometric mean of harmonic speedup
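
Worked check with the headline gains reported later: (1.066 × 1.020 × 1.061)^(1/3) ≈ 1.048, i.e. the per-configuration improvements of +6.6%, +2.0%, and +6.1% compose into the +4.8% overall score.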

Baselines

Competition baseline: Berti (L1D) + Pythia (L2)


15 of 22

Overall Results

Speedup vs. Baseline and other prefetchers


Chart Interpretation

Key Observations

  • Emender outperforms all compared prefetchers
  • VBerti alone performs slightly better than the other prior prefetchers
  • Consistent gains across all three configurations

Overall Score: +4.8% improvement

16 of 22

Ablation Study

Individual Feature Impact

Feature                         1C.fullBW   1C.limitBW   4C      Overall
Cuckoo Filter                   1.3%        0.2%         1.0%    0.8%
L3 Fairness Throttling          0.0%        0.0%         2.6%    0.9%
Dynamic Confidence Threshold    0.1%        0.2%         0.2%    0.1%
Pending Target Buffer           0.4%        0.0%         0.4%    0.2%
Combined (Emender)              6.6%        2.0%         6.1%    4.8%

Cuckoo Filter

  • Largest benefit for single-core
  • Eliminates redundant prefetches
  • More effective with plentiful bandwidth

L3 Fairness Throttling

  • Critical for multi-core performance
  • No single-core impact (as expected)
  • 2.6% improvement in 4C

Other Features

  • DCT: modest but consistent gains
  • PTB: foundation for the other optimizations
  • All features contribute positively


17 of 22

S-Curve Analysis (1C.fullBW)

Distribution of Speedup

Best Case

  • >1.4x speedup
  • Predictable workloads benefit greatly

Median Case

  • ~1.1x speedup
  • Majority of workloads benefit

Worst Case

  • <10% slowdown
  • Very few workloads see regression
  • Robust design avoids significant degradation

(S-curve chart: per-workload speedup, sorted)

Interpretation: the majority of workloads see improvement, with few significant regressions

Robust across diverse applications


18 of 22

Conclusion

Built upon VBerti+Pythia with Four Optimizations

1. Pending Target Buffer
  • Global prioritization
  • ~861 bytes

2. Cuckoo Filter
  • Eliminates redundant prefetches
  • 6.5 KB; most impactful for single-core

3. Dynamic Confidence Threshold
  • Per-PC throttling
  • Self-adaptive

4. L3 Fairness-based Throttling
  • Multi-core coordination
  • Critical for multi-core performance

Results: Overall Score Improvement +4.8%

  • Single-core Full BW: +6.6%
  • Single-core Limited BW: +2.0%
  • Multi-core: +6.1%


19 of 22

Discussion and Future Work

Hardware Implementability

Berti Table Design

  • Current: Fully-associative design
  • Future: Set-associative (e.g., 16-set, 4-way with NRU)
  • Reference: Berto approach for practical implementation

Pending Target Buffer Sorting

  • Challenge: Sorting overhead in hardware
  • Solution: Bucket sort-like mechanism
  • Partition entries by priority levels
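
One way to realize this (a sketch; bucket count and the confidence-to-bucket mapping are illustrative): partition targets into a few confidence-indexed FIFOs and always issue from the highest non-empty bucket, turning the sort into a constant-depth priority scan.

```cpp
#include <array>
#include <cstdint>
#include <deque>

// Bucket-sort-like Pending Target Buffer: no full sort in hardware; a
// fixed priority encoder over a handful of buckets picks the next target.
constexpr int kLevels = 4;

struct BucketedPTB {
  std::array<std::deque<uint64_t>, kLevels> bucket;  // FIFO per priority

  static int level(uint8_t confidence) {  // 5-bit confidence -> 0..3
    return confidence >> 3;
  }
  void insert(uint64_t line_addr, uint8_t confidence) {
    bucket[level(confidence)].push_back(line_addr);
  }
  // Issue from the highest non-empty priority bucket; FIFO order within a
  // bucket doubles as simple aging.
  bool next(uint64_t& line_addr) {
    for (int b = kLevels - 1; b >= 0; b--)
      if (!bucket[b].empty()) {
        line_addr = bucket[b].front();
        bucket[b].pop_front();
        return true;
      }
    return false;  // nothing pending
  }
};
```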

Prefetch Filtering Techniques

Multiple Filtering Layers in Emender

  • Cuckoo Filter: Eliminates redundant L1D prefetches
  • Pending Target Buffer: Deduplication
  • Prefetch Queue: Additional filtering

Existing Techniques

  • T-SKID: Shadow Cache + Prefetch Queue filtering
  • Sangam: 40-entry Recent Access Filter
  • PPF: Perceptron-based filtering

Future Research: Comprehensive comparison of trade-offs in hardware overhead, accuracy, and performance impact.


20 of 22

Thank You!

Acknowledgments

  • DPC4 organizers and ChampSim developers
  • Authors of VBerti and Pythia prefetchers

Questions?


21 of 22

Appendix: Evaluation on DPC4-All-Traces

Performance Results

Methodology

Training vs Evaluation Traces

  • Training: DPC4-Traces (public)
  • Evaluation: DPC4-All-Traces (official)
  • Earlier slides used a 1% instruction subset
  • This appendix reports results on the complete traces

Key Differences in Trace Composition

  • Different benchmark distribution
  • Google-Traces-v2 proportion: 32% → 59%
  • More SPEC17 benchmarks with non-memory bottlenecks

22 of 22

Appendix: Analysis and Insights

Single-Core Performance

Trends Consistent with Training

  • Most categories match training trace results
  • SPEC17 shows slight regression
  • Due to non-memory-bound benchmarks (e.g., leela)

Why the Overall Score Drops

  • Google-Traces-v2 proportion increased from 32% to 59% in the evaluation traces
  • Our optimizations provide minimal benefit on these traces

Multi-Core Performance

Divergent Characteristics

  • Training: Random sampling across all benchmarks
  • Evaluation: Category-independent sampling
  • Results in more homogeneous behavior across cores

Throttling Mechanism Impact

  • Training: high cross-core diversity amplifies throttling effectiveness
  • Evaluation: homogeneous behavior reduces throttling benefits
  • Our current throttling proves suboptimal in the evaluation scenario

Insight: Throttling effectiveness depends on workload diversity across cores.