Emender: Optimizing Prefetch Priority and Throttling in VBerti+Pythia
DPC4 Championship
Jiajie Chen · Tingji Zhang (Presenter)
Xiaoyi Liu · Xuefeng Zhang · Peng Qu · Youhui Zhang
Tsinghua University
Presentation Overview
1. Background & Motivation
2. Competition Approach
3. Problem Analysis
4. Four Key Optimizations
5. Evaluation Results
6. Conclusion
Background - Data Prefetching
The Memory Wall Problem
What is Data Prefetching?
Ideal: Data already in cache when processor needs it
Competition Approach - Finding SOTA
Step 1: Evaluate Existing Prefetchers
Evaluated 13 Prefetchers
Methodology
Step 2: Comprehensive Evaluation
Evaluation
VBerti + Pythia = Best Performing Combination
Challenge: Evaluation Time
The Problem
Challenge
Our Solution: Scale Validation
Experiment
Finding
Relative performance consistent across scales
Methodology
Result
VBerti + Pythia = Best Performing Combination
Key Insight: Scale validation enabled us to evaluate 100x more combinations in the same timeframe
Problem Analysis
Findings
Observation: VBerti+Pythia generates a very large volume of prefetch requests
Prefetch Requests Generated
Very high
Prefetch Queue Capacity
Limited
Key Insight
Must use limited prefetch queue more efficiently
Two Optimization Directions
1. Prioritization
If only some prefetches can enter the queue, prioritize the high-confidence ones
Current VBerti Limitation
Opportunity
Global prioritization across all load instructions
2. Filter Useless Prefetches
Some prefetches target cachelines that are already cached
Challenge
Need
Approximate, efficient cache membership test
Optimization 1 - Pending Target Buffer
Design: Global Prefetch Prioritization
Load Instructions → Generate Prefetches
↓
Pending Target Buffer
(sorts by confidence)
↓
Issue high-confidence first
Key Features
Implementation Details
Each Entry Contains
Policies
Hardware Cost: 6,884 bits (~861 bytes)
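The flow above (collect prefetch candidates from all load instructions, rank by confidence, issue the best first) can be sketched in software as follows. This is a minimal illustrative model, not the authors' hardware design: the class name, the `capacity` value, the integer confidence scale, and the "evict the lowest-confidence entry when full" policy are all assumptions made for the example.

```python
class PendingTargetBuffer:
    """Toy model of a global, confidence-ordered prefetch candidate buffer.

    All names and policies here are hypothetical; the slides only state that
    the buffer sorts by confidence and issues high-confidence targets first.
    """

    def __init__(self, capacity=16):
        self.capacity = capacity      # assumed size, not taken from the slides
        self.entries = {}             # cacheline address -> confidence

    def add(self, line_addr, confidence):
        """Offer a candidate; returns True if it is now held in the buffer."""
        if line_addr in self.entries:
            # Same target proposed again: keep the stronger confidence.
            self.entries[line_addr] = max(self.entries[line_addr], confidence)
            return True
        if len(self.entries) < self.capacity:
            self.entries[line_addr] = confidence
            return True
        # Buffer full: assumed policy is to replace the weakest entry,
        # but only if the newcomer is strictly more confident.
        victim = min(self.entries, key=self.entries.get)
        if self.entries[victim] >= confidence:
            return False
        del self.entries[victim]
        self.entries[line_addr] = confidence
        return True

    def issue(self, free_slots):
        """Pop up to free_slots targets, highest confidence first."""
        ranked = sorted(self.entries.items(), key=lambda kv: kv[1], reverse=True)
        chosen = [addr for addr, _ in ranked[:free_slots]]
        for addr in chosen:
            del self.entries[addr]
        return chosen


ptb = PendingTargetBuffer(capacity=3)
ptb.add(0x100, 40)
ptb.add(0x140, 90)
ptb.add(0x180, 60)
rejected = ptb.add(0x1C0, 20)   # weaker than everything held, so dropped
accepted = ptb.add(0x200, 80)   # displaces the weakest entry (0x100)
issued = ptb.issue(free_slots=2)
print([hex(a) for a in issued])
```

Because candidates from every load PC share one buffer, a low-confidence prefetch from one instruction cannot crowd out a high-confidence prefetch from another, which is the point of global prioritization.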
Optimization 2 - Cuckoo Filter
Goal
Eliminate prefetches for data already in cache
Challenge: Direct Cache Lookup
Why not just check the cache?
Solution: Cuckoo Filter
What is it?
Our Application
Hardware Cost: 53,248 bits (6.5 KB)
1024 sets × 4 ways
Each entry: 1-bit valid + 12-bit fingerprint
Key Benefit: Zero false negatives. When the filter says "not in cache", the prefetch can be issued with confidence
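As a rough software model of the filter geometry listed above, the sketch below implements a small cuckoo filter with partial-key hashing. Only the geometry (1024 sets × 4 ways, a valid bit plus a 12-bit fingerprint per entry) comes from the slide. The hash mixer, the `max_kicks` limit, the deterministic victim choice, and every method name are assumptions for illustration, and the real design may differ.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def mix64(x):
    """64-bit mixing function (constants borrowed from the splitmix64 finalizer)."""
    x &= MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


class CuckooFilter:
    def __init__(self, num_sets=1024, ways=4, fp_bits=12, max_kicks=16):
        assert num_sets & (num_sets - 1) == 0, "power of two keeps XOR indexing in range"
        self.num_sets, self.ways, self.fp_bits = num_sets, ways, fp_bits
        self.max_kicks = max_kicks                # assumed relocation limit
        # None models a cleared valid bit; an int is a stored fingerprint.
        self.sets = [[None] * ways for _ in range(num_sets)]

    def storage_bits(self):
        return self.num_sets * self.ways * (1 + self.fp_bits)

    def _fp(self, line_addr):
        return (mix64(line_addr) >> 32) & ((1 << self.fp_bits) - 1)

    def _index(self, line_addr):
        return mix64(line_addr) & (self.num_sets - 1)

    def _alt(self, index, fp):
        # Partial-key cuckoo hashing: the other bucket depends only on (index, fp).
        return (index ^ mix64(fp)) & (self.num_sets - 1)

    def _place(self, index, fp):
        for way in range(self.ways):
            if self.sets[index][way] is None:
                self.sets[index][way] = fp
                return True
        return False

    def insert(self, line_addr):
        """Record a cacheline as resident (call on cache fill)."""
        fp, i1 = self._fp(line_addr), self._index(line_addr)
        if self._place(i1, fp) or self._place(self._alt(i1, fp), fp):
            return True
        index = i1
        for kick in range(self.max_kicks):
            way = kick % self.ways                # deterministic victim (assumption)
            fp, self.sets[index][way] = self.sets[index][way], fp
            index = self._alt(index, fp)
            if self._place(index, fp):
                return True
        return False  # the last displaced fingerprint is dropped in this sketch

    def contains(self, line_addr):
        fp, i1 = self._fp(line_addr), self._index(line_addr)
        return fp in self.sets[i1] or fp in self.sets[self._alt(i1, fp)]

    def remove(self, line_addr):
        """Forget a cacheline (call on cache eviction)."""
        fp, i1 = self._fp(line_addr), self._index(line_addr)
        for index in (i1, self._alt(i1, fp)):
            if fp in self.sets[index]:
                self.sets[index][self.sets[index].index(fp)] = None
                return True
        return False


def should_prefetch(cache_filter, line_addr):
    # Skip prefetches whose target the filter believes is already cached.
    return not cache_filter.contains(line_addr)


cache_filter = CuckooFilter()
resident = list(range(0x4000, 0x4040))
inserted = [cache_filter.insert(a) for a in resident]
print(cache_filter.storage_bits())  # 1024 * 4 * 13
```

The storage figure follows directly from the geometry: 1024 × 4 entries of 13 bits each. Note that when `insert` gives up after `max_kicks` relocations, this sketch silently loses one fingerprint; how the actual design handles that case is not stated in the slides.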
VIPT Cache Aliasing Problem for Cuckoo Filter
The Challenge
Our Design Choice
Hardware Reality
Problem
VA₁ → PA → Cache Line
VA₂ → PA → Same Cache Line (aliasing!)
Our Solution
Track Last-Accessed VA
Implementation
Note on False Negatives
Cuckoo Filter itself has zero false negatives. However, due to VIPT aliasing, we only track the last-accessed VA per cacheline. This may cause unnecessary prefetches when different VAs alias to the same cacheline, but the impact is minimal since aliasing is rare.
Optimization 3 - Dynamic Confidence Threshold
Observation
Some Load PCs Have Semi-unpredictable Patterns
Problem
Solution: Per-PC Throttling
Track Per-Load Instruction Effectiveness
Dynamic Adjustment
Hardware Cost: 10-bit miss counter per entry
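One way to picture per-PC throttling is the sketch below. The only detail taken from the slide is the 10-bit saturating miss counter per entry. The window length, the miss limit, the threshold range and step, and the rule "raise the threshold for a PC that keeps missing, relax it otherwise" are assumptions chosen to make the idea concrete; they are not the authors' parameters.

```python
class PerPCThreshold:
    """Hypothetical per-load-PC confidence threshold with a 10-bit miss counter."""

    MISS_MAX = (1 << 10) - 1  # 10-bit saturating counter, as on the slide

    def __init__(self, base=50, lo=50, hi=90, step=10, window=256, miss_limit=128):
        # All of these values are illustrative assumptions.
        self.base, self.lo, self.hi, self.step = base, lo, hi, step
        self.window, self.miss_limit = window, miss_limit
        self.table = {}  # pc -> {"accesses", "misses", "threshold"}

    def _entry(self, pc):
        return self.table.setdefault(
            pc, {"accesses": 0, "misses": 0, "threshold": self.base}
        )

    def record_access(self, pc, was_miss):
        """Update the PC's counters on every demand access."""
        e = self._entry(pc)
        e["accesses"] += 1
        if was_miss:
            e["misses"] = min(e["misses"] + 1, self.MISS_MAX)
        if e["accesses"] >= self.window:
            if e["misses"] > self.miss_limit:
                # Prefetching is not covering this PC: demand more confidence.
                e["threshold"] = min(self.hi, e["threshold"] + self.step)
            else:
                e["threshold"] = max(self.lo, e["threshold"] - self.step)
            e["accesses"] = 0
            e["misses"] = 0

    def should_issue(self, pc, confidence):
        """Gate a prefetch on the PC's current threshold."""
        return confidence >= self._entry(pc)["threshold"]


throttle = PerPCThreshold(window=4, miss_limit=2)
noisy_pc, steady_pc = 0x401000, 0x402000
for _ in range(4):
    throttle.record_access(noisy_pc, was_miss=True)
after_misses = throttle.should_issue(noisy_pc, confidence=55)
untouched = throttle.should_issue(steady_pc, confidence=55)
for _ in range(4):
    throttle.record_access(noisy_pc, was_miss=False)
after_hits = throttle.should_issue(noisy_pc, confidence=55)
print(after_misses, untouched, after_hits)
```

The effect is that a load instruction with a semi-unpredictable pattern is throttled on its own, without lowering prefetch aggressiveness for well-behaved PCs.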
Optimization 4 - L3 Fairness-based Throttling
Problem: Multi-core Resource Contention
VBerti+Pythia is Very Aggressive
Core 0: ████████████████████ (hogs)
Core 1: ██ (starved)
Core 2: █ (starved)
Core 3: ███ (starved)
Solution: Cross-core Coordination
Mechanism
Metadata Exchange
Hardware Cost: 64B counters
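A simplified model of cross-core coordination might look like the following. The slides state only that the L3 acts as a fairness-based throttling coordinator backed by a small set of counters. The epoch length, the `slack` factor, the choice to count prefetch requests per core, and the boolean throttle signal are assumptions made for this sketch.

```python
class FairnessCoordinator:
    """Hypothetical L3-side coordinator that flags cores hogging prefetch bandwidth."""

    def __init__(self, num_cores=4, epoch=1024, slack=1.5):
        self.num_cores = num_cores
        self.epoch = epoch            # prefetches observed per decision epoch (assumed)
        self.slack = slack            # tolerated multiple of the fair share (assumed)
        self.counts = [0] * num_cores
        self.total = 0
        self.throttled = [False] * num_cores

    def record_prefetch(self, core):
        """Count one prefetch request arriving at the L3 from the given core."""
        self.counts[core] += 1
        self.total += 1
        if self.total >= self.epoch:
            self._end_epoch()

    def _end_epoch(self):
        fair_share = self.total / self.num_cores
        limit = self.slack * fair_share
        # A core exceeding its tolerated share is asked to throttle next epoch.
        self.throttled = [count > limit for count in self.counts]
        self.counts = [0] * self.num_cores
        self.total = 0

    def must_throttle(self, core):
        """Metadata a core's L1D/L2 prefetcher would read back."""
        return self.throttled[core]


coordinator = FairnessCoordinator(num_cores=4, epoch=100, slack=1.5)
# Epoch 1: core 0 issues 70 of 100 prefetches, the others 10 each.
for core, n in [(0, 70), (1, 10), (2, 10), (3, 10)]:
    for _ in range(n):
        coordinator.record_prefetch(core)
skewed = list(coordinator.throttled)
# Epoch 2: every core issues 25 prefetches.
for core in range(4):
    for _ in range(25):
        coordinator.record_prefetch(core)
balanced = list(coordinator.throttled)
print(skewed, balanced)
```

In this model only the hogging core is slowed down, so a single-core run (where there is nobody to starve) is unaffected, which is consistent with the mechanism mattering only in the 4C configuration.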
Emender Architecture Overview
L1D Cache (Emender-L1D)
Base: VBerti prefetcher
Enhancements:
22.8 KB (within 32 KB limit)
L2 Cache (Emender-L2)
Base: Pythia prefetcher
Enhancements:
25.5 KB (within 128 KB limit)
L3 Cache (Emender-L3)
Design: No prefetching at L3
Rationale:
Role: Fairness-based throttling coordinator
64 B (within 256 KB limit)
Key Design: Hierarchical optimization with coordinated metadata exchange
Evaluation Setup
Platform
DPC4 ChampSim Framework
Configurations
1C.fullBW (Single-core Full)
1C.limitBW (Single-core Limited)
4C (Multi-core)
Traces
Scoring Methodology
Score = (1C.fullBW × 1C.limitBW × 4C)^(1/3)
Single-core: Geometric mean of IPC
Multi-core: Geometric mean of harmonic speedup
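The overall score is the geometric mean of the three per-configuration results. The short calculation below applies that formula to the speedups reported later in the ablation table (for example, +6.6% is entered as 1.066). The function names are illustrative. Because the published per-configuration figures are rounded, the recomputed values only approximately match the reported Overall column.

```python
import math


def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


def overall_score(full_bw, limit_bw, multi_core):
    """Score = (1C.fullBW x 1C.limitBW x 4C)^(1/3)."""
    return geometric_mean([full_bw, limit_bw, multi_core])


def as_percent(speedup):
    """Convert a speedup ratio such as 1.066 to a percentage gain."""
    return (speedup - 1.0) * 100.0


emender = overall_score(1.066, 1.020, 1.061)
cuckoo_only = overall_score(1.013, 1.002, 1.010)
fairness_only = overall_score(1.000, 1.000, 1.026)

for name, score in [
    ("Combined (Emender)", emender),
    ("Cuckoo Filter", cuckoo_only),
    ("L3 Fairness Throttling", fairness_only),
]:
    print(f"{name}: {as_percent(score):+.2f}%")
```

The geometric mean explains why a feature that helps only one configuration, such as L3 fairness throttling in 4C, contributes roughly a third of its single-configuration gain to the overall score.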
Baselines
Competition: Berti (L1D) + Pythia (L2)
Overall Results
Speedup vs. Baseline and other prefetchers
Chart Interpretation
Key Observations
Overall Score: +4.8% improvement
Ablation Study
Individual Feature Impact
Feature                         1C.fullBW   1C.limitBW   4C      Overall
Cuckoo Filter                   1.3%        0.2%         1.0%    0.8%
L3 Fairness Throttling          0.0%        0.0%         2.6%    0.9%
Dynamic Confidence Threshold    0.1%        0.2%         0.2%    0.1%
Pending Target Buffer           0.4%        0.0%         0.4%    0.2%
Combined (Emender)              6.6%        2.0%         6.1%    4.8%
Cuckoo Filter
L3 Fairness Throttling
Other Features
S-Curve Analysis (1C.fullBW)
Distribution of Speedup
Best Case
Median Case
Worst Case
Visualize S-Curve
Interpretation: Majority of workloads see improvement, few significant regressions
Robust across diverse applications
Conclusion
Built upon VBerti+Pythia with Four Optimizations
1. Pending Target Buffer
2. Cuckoo Filter
3. Dynamic Confidence Threshold
4. L3 Fairness-based Throttling
Results: Overall Score Improvement +4.8%
Single-core Full BW: +6.6%
Single-core Limited BW: +2.0%
Multi-core: +6.1%
Discussion and Future Work
Hardware Implementability
Berti Table Design
Pending Target Buffer Sorting
Prefetch Filtering Techniques
Multiple Filtering Layers in Emender
Existing Techniques
Future Research: Comprehensive comparison of trade-offs in hardware overhead, accuracy, and performance impact.
Thank You!
Acknowledgments
Questions?
Appendix: Evaluation on DPC4-All-Traces
Performance Results
Methodology
Training vs Evaluation Traces
Key Differences in Trace Composition
Appendix: Analysis and Insights
Single-Core Performance
Trends Consistent with Training
Overall Score Reduction Factor
Multi-Core Performance
Divergent Characteristics
Throttling Mechanism Impact
Insight: Throttling effectiveness depends on workload diversity across cores.