1 of 13

Learning Generalizable Program and Architecture Representations for Performance Modeling

SC 2024

Atlanta GA

Lingda Li, Thomas Flynn, Adolfy Hoisie

Brookhaven National Laboratory

2 of 13

Computer Architecture Modeling and Simulation

  • Essential to computer architecture research and engineering
    • New design evaluation, design space exploration, resource scheduling, …
  • Goals: speed, accuracy, and generalizability

Methodology                          | Speed  | Accuracy | Generalizability
Analytical Modeling                  | Fast   | Low      | High
Discrete Event Simulation            | Slow   | Variable | High
Emulation                            | Medium | Variable | Medium
Machine Learning (ML)-based Modeling | Fast   | High     | Low
ML-based Simulation                  | Medium | Variable | Medium
PerfVec (this work)                  | Fast?  | High?    | High?

Goal: explore a better tradeoff using ML, especially on generalizability

3 of 13

Generalizable Performance Modeling

  • Performance is determined by both software and hardware.
  • A generic performance model should separate the impact of software (program) and hardware (microarchitecture).
    • When one changes, no need to re-model the other
  • Not trivial to achieve such separation
    • Complex interplay between program and microarchitecture
    • Several analytical models have tried to achieve this separation manually.


PerfVec learns the separation automatically.

4 of 13

Learning Separation

  • The same program representation is used to predict its performance on any microarchitecture, and vice versa.

Key idea 1: have two independent ML models to capture the performance impacts of program and microarchitecture, respectively (the two models are independent of each other; the program side is the focus of this work)

5 of 13

Learning Program Representation

  • Challenge: programs typically execute billions of instructions or more, and no ML model can handle such long sequences.


Input                                          | Level of Detail | Limitation
Profiling info (e.g., performance counters)    | Low             | Often microarchitecture dependent; lack of low-level details
Static program info (e.g., control flow graph) | Medium          | Cannot capture dynamic execution (e.g., input impact); huge graphs to learn from
Instruction execution trace                    | High            | Huge amounts of instructions to learn from
Key idea 2: 1) learn the representations of individual instructions and 2) compose a program representation from those of all its executed instructions (divide-and-conquer)
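The divide-and-conquer idea can be sketched as follows; the function name, dimensions, and the use of a plain sum are illustrative assumptions, not PerfVec's actual API (the paper details the real composition).

```python
import numpy as np

def compose_program_representation(instr_reps):
    """Compose a program representation from per-instruction representations.

    A plain sum is one simple, length-independent composition; see the
    paper for how PerfVec actually composes instruction representations.
    """
    return np.sum(instr_reps, axis=0)

# Toy example: 4 executed instructions with 8-dimensional representations.
instr_reps = np.random.rand(4, 8)
program_rep = compose_program_representation(instr_reps)
print(program_rep.shape)  # (8,)
```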

6 of 13

Learning Instruction Representation

  • Performance determination factors of an instruction


Factor                  | Microarchitecture-Independent Features
Own properties          | Static properties (e.g., operation type); dynamic behavior (e.g., branch direction); reuse distance; branch entropy
Co-running instructions | Co-running instructions' properties

[Figure: a sequence model (e.g., LSTM, Transformer) takes the features of instruction i together with those of its co-running predecessors (instructions i-1 through i-c) and outputs instruction i's representation; the representations of instructions 1 through n are then composed into the program representation. Refer to the paper on how to compose the program representation.]

Generalizable across programs: program traces are different sequences of instructions drawn from the same set.
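A minimal sketch of such a sequence model over an instruction's context window; the tanh RNN cell, dimensions, and random weights are stand-ins for illustration (PerfVec's default is a trained 2-layer LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; PerfVec's defaults differ.
FEAT_DIM, HIDDEN_DIM, CONTEXT = 6, 8, 3

# Random weights stand in for trained parameters.
W_in = rng.normal(size=(FEAT_DIM, HIDDEN_DIM))
W_rec = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))

def instruction_representation(feature_window):
    """Fold the features of instructions i-c, ..., i-1, i into a single
    vector with a minimal tanh RNN cell (an LSTM/Transformer in PerfVec)."""
    h = np.zeros(HIDDEN_DIM)
    for f in feature_window:               # oldest to newest
        h = np.tanh(f @ W_in + h @ W_rec)  # fold each instruction in
    return h

window = rng.normal(size=(CONTEXT + 1, FEAT_DIM))
rep = instruction_representation(window)
print(rep.shape)  # (8,)
```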

7 of 13

PerfVec Training

  • Natural solution: train all models end-to-end
  • Challenge
    • Irregular and alterable microarchitectural space
    • Difficult to train a universal microarchitecture representation model
  • Hypothesis: training with a sufficient number of diverse microarchitectures enables generalizability to others.


Key idea 3: train the instruction representation model to predict instruction latencies on randomly selected microarchitectures for generalizability

[Figure: training setup — the representations of the sampled microarchitectures are trained jointly, and the program representation model is replaced.]
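The training idea can be sketched as below; the dot-product latency predictor, single instruction, learning rate, and manual gradient updates are simplifying assumptions for illustration, not the paper's actual training setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N_UARCH, REP_DIM = 77, 8  # the paper trains with 77 microarchitectures

# Microarchitecture representations trained jointly (key idea 3).
uarch_reps = rng.normal(size=(N_UARCH, REP_DIM))
instr_rep = rng.normal(size=REP_DIM)               # one instruction, for brevity
latencies = rng.uniform(1.0, 10.0, size=N_UARCH)   # made-up latency targets

def eval_loss():
    """MSE of predicted vs. target latency over all microarchitectures."""
    return float(((uarch_reps @ instr_rep - latencies) ** 2).mean())

loss_before = eval_loss()
for _ in range(1000):
    # Randomly sample microarchitectures for this step (key idea 3).
    batch = rng.choice(N_UARCH, size=8, replace=False)
    err = uarch_reps[batch] @ instr_rep - latencies[batch]
    # Manual MSE gradient steps for the dot-product predictor.
    grad_r = (err[:, None] * uarch_reps[batch]).mean(axis=0)
    uarch_reps[batch] -= 0.002 * err[:, None] * instr_rep
    instr_rep -= 0.002 * grad_r
loss_after = eval_loss()  # with these settings the fit improves
```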

8 of 13

Data Acquisition & Model Architecture

  • Instruction traces from gem5 simulation
    • Easy to obtain instruction level latency
    • Easy to configure microarchitectures
  • Programs
    • 17 SPEC CPU2017 benchmarks
    • 10 for training, 7 for testing
  • Microarchitectures
    • Randomly generated gem5 configurations
      • In-order/out-of-order cores, caches, memory, etc.
    • 77 for training, 10 for testing
  • Model architecture
    • 2-layer LSTM by default
    • See the paper for other architectures


9 of 13

Generalizability Evaluation

[Figure: prediction error ranges against gem5 simulation across all microarchitectures, shown for unseen microarchitectures and for unseen programs.]

The trained model generalizes well to unseen programs and microarchitectures.

10 of 13

PerfVec Use Cases

  • Performance modeling is essential for many tasks.
    • PerfVec can be applied to all of them.
  • A case study: design space exploration (DSE)
    • Find the optimal design(s) given one/multiple objective function(s)
  • DSE example: L1 and L2 cache size exploration
    • Objective function: execution_time * (1000 + 10*L1_size + L2_size)
      • Similar to latency area product
    • Select the best cache sizes for 17 SPEC CPU2017 benchmarks
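The objective can be evaluated directly once PerfVec predicts execution times; the candidate configurations and times below are made up for illustration:

```python
# Objective from the slide: execution_time * (1000 + 10*L1_size + L2_size),
# similar to a latency-area product (lower is better).
def objective(execution_time, l1_kb, l2_kb):
    return execution_time * (1000 + 10 * l1_kb + l2_kb)

# Hypothetical (L1 KB, L2 KB) -> predicted execution time pairs; in the
# real flow these times come from PerfVec, not from simulating every point.
candidates = {
    (32, 256): 10.0,
    (64, 512): 8.0,
    (128, 1024): 7.5,
}
best = min(candidates, key=lambda cfg: objective(candidates[cfg], *cfg))
print(best)  # (32, 256)
```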


11 of 13

DSE Procedure

  • Train a microarchitecture representation model

  • Training data: gem5 simulation traces of 3 benchmarks on selected configurations
  • Predict the performance of all benchmarks
    • Use the trained microarchitecture representation model and existing program representations


[Figure: the program representation model is frozen during training; the microarchitecture representation model is a 2-layer MLP.]
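The procedure can be sketched as follows; the parameter encoding, layer sizes, and random (untrained) weights are illustrative assumptions — in the actual flow the MLP is trained on gem5 traces of 3 benchmarks while the program representations stay frozen:

```python
import numpy as np

rng = np.random.default_rng(2)
REP_DIM = 8

# Program representations learned earlier; frozen during DSE training.
program_reps = {"bench_a": rng.normal(size=REP_DIM)}

# A 2-layer MLP maps microarchitecture parameters (here just L1/L2 cache
# sizes, an illustrative encoding) to a microarchitecture representation.
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, REP_DIM)), np.zeros(REP_DIM)

def uarch_representation(params):
    h = np.maximum(np.asarray(params) @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

def predict_time(bench, params):
    """Predicted performance = program rep combined with uarch rep
    (a dot product here; the simple combination keeps prediction fast)."""
    return float(program_reps[bench] @ uarch_representation(params))

t = predict_time("bench_a", [64.0, 512.0])
```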

12 of 13

DSE Results

  • Complete gem5 simulation: 600 hours
  • Previous ML-based DSE methods [1, 2, 3]
    • Use selected simulation results to train program-specific models
    • Need to simulate many configurations for each program
  • PerfVec
    • Incurs significantly less overhead with comparable quality
    • 11 hours = 5 (data collection) + 6 (training)
  1. İpek et al. Efficiently exploring architectural design spaces via predictive modeling. ASPLOS 2006.
  2. Dubach et al. Microarchitectural design space exploration using an architecture-centric approach. MICRO 2007.
  3. Li et al. Efficient design space exploration via statistical sampling and adaboost learning. DAC 2016.

Overhead: the time to construct models (hours)

Quality: how close the selected design is to the optimal design (lower is better).

Method   | Overhead (hours) | Quality
ASPLOS06 | 150              | 4.4%
MICRO07  | 84               | 4.7%
DAC16    | 170              | 3.6%
PerfVec  | 11               | 3.6%

13 of 13

PerfVec Summary


  • High generalizability
    • Learn independent program and microarchitecture representations
  • Good accuracy
    • Learn program representations from instruction execution traces
  • Fast speed
    • Simple combination of program and microarchitecture representations
  • Many potential applications
    • Design space exploration, performance analysis, …
  • Code: https://github.com/PerfVec/PerfVec