1 of 13

Learning Generalizable Program and Architecture Representations for Performance Modeling

SC 2024

Atlanta GA

Lingda Li, Thomas Flynn, Adolfy Hoisie

Brookhaven National Laboratory

2 of 13

Computer Architecture Modeling and Simulation

  • Essential to computer architecture research and engineering
    • New design evaluation, design space exploration, resource scheduling, …
  • Goals: speed, accuracy, and generalizability

Methodology                          | Speed  | Accuracy | Generalizability
Analytical Modeling                  | Fast   | Low      | High
Discrete Event Simulation            | Slow   | Variable | High
Emulation                            | Medium | Variable | Medium
Machine Learning (ML)-based Modeling | Fast   | High     | Low
ML-based Simulation                  | Medium | Variable | Medium
PerfVec (this work)                  | Fast?  | High?    | High?

Goal: explore a better tradeoff using ML, especially on generalizability

3 of 13

Generalizable Performance Modeling

  • Performance is determined by both software and hardware.
  • A generic performance model should separate the impact of software (program) and hardware (microarchitecture).
    • When one changes, no need to re-model the other
  • Not trivial to achieve such separation
    • Complex interplay between program and microarchitecture
    • Several analytical models have tried to achieve this separation manually.


PerfVec learns the separation automatically.

4 of 13

Learning Separation

  • The same program representation is used to predict its performance on any microarchitecture, and vice versa.

Key idea 1: have two independent ML models to capture the performance impacts of program and microarchitecture, respectively (the two models are independent of each other; the program side is the focus of this work)

5 of 13

Learning Program Representation

  • Challenge: programs typically execute billions of instructions or more, and no ML model can handle such long sequences.


Input                                          | Level of Detail | Limitation
Profiling info (e.g., performance counters)    | Low             | Often microarchitecture dependent; lack of low-level details
Static program info (e.g., control flow graph) | Medium          | Cannot capture dynamic execution (e.g., input impact); huge graphs to learn from
Instruction execution trace                    | High            | Huge amounts of instructions to learn from
Key idea 2: 1) learn the representations of individual instructions and 2) compose a program representation from those of all its executed instructions (divide-and-conquer)
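The divide-and-conquer idea can be sketched as follows; the function name, dimensions, and the use of a plain sum are illustrative assumptions, not PerfVec's actual API (the paper details the real composition).

```python
import numpy as np

def compose_program_representation(instr_reps):
    """Compose a program representation from per-instruction representations.

    A plain sum is one simple, length-independent composition; see the
    paper for how PerfVec actually composes instruction representations.
    """
    return np.sum(instr_reps, axis=0)

# Toy example: 4 executed instructions with 8-dimensional representations.
instr_reps = np.random.rand(4, 8)
program_rep = compose_program_representation(instr_reps)
print(program_rep.shape)  # (8,)
```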

6 of 13

Learning Instruction Representation

  • Performance determination factors of an instruction


Factor                  | Microarchitecture-Independent Features
Own properties          | Static properties (e.g., operation type); dynamic behavior (e.g., branch direction); reuse distance; branch entropy
Co-running instructions | Co-running instructions' properties

[Figure: a sequence model (e.g., LSTM, Transformer) takes the features of instruction i together with those of its co-running predecessors (instructions i-1 through i-c) and outputs instruction i's representation; the representations of instructions 1 through n are then composed into the program representation. Refer to the paper on how to compose the program representation.]

Generalizable across programs: program traces are different sequences of instructions drawn from the same set.
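A minimal sketch of such a sequence model over an instruction's context window; the tanh RNN cell, dimensions, and random weights are stand-ins for illustration (PerfVec's default is a trained 2-layer LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; PerfVec's defaults differ.
FEAT_DIM, HIDDEN_DIM, CONTEXT = 6, 8, 3

# Random weights stand in for trained parameters.
W_in = rng.normal(size=(FEAT_DIM, HIDDEN_DIM))
W_rec = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))

def instruction_representation(feature_window):
    """Fold the features of instructions i-c, ..., i-1, i into a single
    vector with a minimal tanh RNN cell (an LSTM/Transformer in PerfVec)."""
    h = np.zeros(HIDDEN_DIM)
    for f in feature_window:               # oldest to newest
        h = np.tanh(f @ W_in + h @ W_rec)  # fold each instruction in
    return h

window = rng.normal(size=(CONTEXT + 1, FEAT_DIM))
rep = instruction_representation(window)
print(rep.shape)  # (8,)
```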

7 of 13

PerfVec Training

  • Natural solution: train all models end-to-end
  • Challenge
    • Irregular and alterable microarchitectural space
    • Difficult to train a universal microarchitecture representation model
  • Hypothesis: training with a sufficient number of diverse microarchitectures enables generalizability to others.


Key idea 3: train the instruction representation model to predict instruction latencies on randomly selected microarchitectures for generalizability

[Figure: training setup — the representations of the sampled microarchitectures are trained jointly, and the program representation model is replaced.]
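The training idea can be sketched as below; the dot-product latency predictor, single instruction, learning rate, and manual gradient updates are simplifying assumptions for illustration, not the paper's actual training setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N_UARCH, REP_DIM = 77, 8  # the paper trains with 77 microarchitectures

# Microarchitecture representations trained jointly (key idea 3).
uarch_reps = rng.normal(size=(N_UARCH, REP_DIM))
instr_rep = rng.normal(size=REP_DIM)               # one instruction, for brevity
latencies = rng.uniform(1.0, 10.0, size=N_UARCH)   # made-up latency targets

def eval_loss():
    """MSE of predicted vs. target latency over all microarchitectures."""
    return float(((uarch_reps @ instr_rep - latencies) ** 2).mean())

loss_before = eval_loss()
for _ in range(1000):
    # Randomly sample microarchitectures for this step (key idea 3).
    batch = rng.choice(N_UARCH, size=8, replace=False)
    err = uarch_reps[batch] @ instr_rep - latencies[batch]
    # Manual MSE gradient steps for the dot-product predictor.
    grad_r = (err[:, None] * uarch_reps[batch]).mean(axis=0)
    uarch_reps[batch] -= 0.002 * err[:, None] * instr_rep
    instr_rep -= 0.002 * grad_r
loss_after = eval_loss()  # with these settings the fit improves
```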

8 of 13

Data Acquisition & Model Architecture

  • Instruction traces from gem5 simulation
    • Easy to obtain instruction level latency
    • Easy to configure microarchitectures
  • Programs
    • 17 SPEC CPU2017 benchmarks
    • 10 for training, 7 for testing
  • Microarchitectures
    • Randomly generated gem5 configurations
      • In-order/out-of-order cores, caches, memory, etc.
    • 77 for training, 10 for testing
  • Model architecture
    • 2-layer LSTM by default
    • See the paper for other architectures


9 of 13

Generalizability Evaluation

[Figure: prediction error ranges against gem5 simulation across all microarchitectures, shown for unseen microarchitectures and for unseen programs.]

The trained model generalizes well to unseen programs and microarchitectures.

10 of 13

PerfVec Use Cases

  • Performance modeling is essential for many tasks.
    • PerfVec can be applied to all of them.
  • A case study: design space exploration (DSE)
    • Find the optimal design(s) given one/multiple objective function(s)
  • DSE example: L1 and L2 cache size exploration
    • Objective function: execution_time * (1000 + 10*L1_size + L2_size)
      • Similar to latency area product
    • Select the best cache sizes for 17 SPEC CPU2017 benchmarks
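The objective can be evaluated directly once PerfVec predicts execution times; the candidate configurations and times below are made up for illustration:

```python
# Objective from the slide: execution_time * (1000 + 10*L1_size + L2_size),
# similar to a latency-area product (lower is better).
def objective(execution_time, l1_kb, l2_kb):
    return execution_time * (1000 + 10 * l1_kb + l2_kb)

# Hypothetical (L1 KB, L2 KB) -> predicted execution time pairs; in the
# real flow these times come from PerfVec, not from simulating every point.
candidates = {
    (32, 256): 10.0,
    (64, 512): 8.0,
    (128, 1024): 7.5,
}
best = min(candidates, key=lambda cfg: objective(candidates[cfg], *cfg))
print(best)  # (32, 256)
```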


11 of 13

DSE Procedure

  • Train a microarchitecture representation model

  • Training data: gem5 simulation traces of 3 benchmarks on selected configurations
  • Predict the performance of all benchmarks
    • Use the trained microarchitecture representation model and existing program representations


[Figure: the program representation model is frozen during training; the microarchitecture representation model is a 2-layer MLP.]
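The procedure can be sketched as follows; the parameter encoding, layer sizes, and random (untrained) weights are illustrative assumptions — in the actual flow the MLP is trained on gem5 traces of 3 benchmarks while the program representations stay frozen:

```python
import numpy as np

rng = np.random.default_rng(2)
REP_DIM = 8

# Program representations learned earlier; frozen during DSE training.
program_reps = {"bench_a": rng.normal(size=REP_DIM)}

# A 2-layer MLP maps microarchitecture parameters (here just L1/L2 cache
# sizes, an illustrative encoding) to a microarchitecture representation.
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, REP_DIM)), np.zeros(REP_DIM)

def uarch_representation(params):
    h = np.maximum(np.asarray(params) @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

def predict_time(bench, params):
    """Predicted performance = program rep combined with uarch rep
    (a dot product here; the simple combination keeps prediction fast)."""
    return float(program_reps[bench] @ uarch_representation(params))

t = predict_time("bench_a", [64.0, 512.0])
```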

12 of 13

DSE Results

  • Complete gem5 simulation: 600 hours
  • Previous ML-based DSE methods [1, 2, 3]
    • Use selected simulation results to train program-specific models
    • Need to simulate many configurations for each program
  • PerfVec
    • Incurs significantly less overhead with comparable quality
    • 11 hours = 5 (data collection) + 6 (training)
  1. İpek et al. Efficiently exploring architectural design spaces via predictive modeling. ASPLOS 2006.
  2. Dubach et al. Microarchitectural design space exploration using an architecture-centric approach. MICRO 2007.
  3. Li et al. Efficient design space exploration via statistical sampling and adaboost learning. DAC 2016.

Overhead: the time to construct models (hours)

Quality: how close the selected design is to the optimal design (lower is better).

Method   | Overhead (hours) | Quality
ASPLOS06 | 150              | 4.4%
MICRO07  | 84               | 4.7%
DAC16    | 170              | 3.6%
PerfVec  | 11               | 3.6%

13 of 13

PerfVec Summary


  • High generalizability
    • Learn independent program and microarchitecture representations
  • Good accuracy
    • Learn program representations from instruction execution traces
  • Fast speed
    • Simple combination of program and microarchitecture representations
  • Many potential applications
    • Design space exploration, performance analysis, …
  • Code: https://github.com/PerfVec/PerfVec