1 of 22

Predicting Post-Route QoR Estimates for HLS Designs using Machine Learning

Pingakshya Goswami

2 of 22

Speaker Info

Name: Pingakshya Goswami

PhD Student, Department of Electrical Engineering

University of Texas at Dallas

Research Interest:

  • ML based Electronic Design Automation
  • FPGA Prototyping
  • Hardware Accelerator design on FPGA

3 of 22

High Level Synthesis FPGA Design Flow

[Figure: HLS FPGA design flow. Specification (C/C++/SystemC) → Compile → LLVM IR → Allocation → Scheduling → Binding → RTL Generation (Verilog) → Logic Synthesis → Tech Mapping → Optimization → Placement → Routing. Also shown: Functional Unit Library.]

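int a, b, c, d;
int p, q;

int main() {
    p = a * b;        /* first multiplication */
    q = p * (c + d);  /* addition feeds the second multiplication */
    return 0;
}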
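The scheduling step of the flow can be illustrated on this snippet. The following is a minimal sketch of as-soon-as-possible (ASAP) scheduling, assuming every operation takes exactly one control step and resources are unlimited; the operation names (mul1, add1, mul2) are made up for illustration, and a real HLS tool schedules under resource and timing constraints.

```python
# ASAP scheduling of the data-flow graph for: p = a*b; q = p*(c+d)
# Assumption: unit latency per operation, no resource limits.

# Each operation maps to the operations whose results it consumes.
deps = {
    "mul1": [],                # p = a * b
    "add1": [],                # t = c + d
    "mul2": ["mul1", "add1"],  # q = p * t
}


def asap_schedule(deps):
    """Return {operation: earliest control step}, with steps starting at 0."""
    step = {}

    def visit(op):
        if op not in step:
            # An operation starts one step after its latest predecessor.
            step[op] = max((visit(p) + 1 for p in deps[op]), default=0)
        return step[op]

    for op in deps:
        visit(op)
    return step


schedule = asap_schedule(deps)
print(schedule)  # {'mul1': 0, 'add1': 0, 'mul2': 1}
```

Under these assumptions the first multiplication and the addition share control step 0, and the second multiplication waits for both.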

Pareto Optimal Points

Design Space Exploration
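As a rough illustration of what "Pareto optimal points" means in design space exploration, the sketch below filters a set of (area, latency) pairs down to the non-dominated ones. The design points are hypothetical values, not results from the talk, and lower is assumed to be better on both axes.

```python
def pareto_front(points):
    """Keep the points that no other point dominates (lower is better)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (area, latency) pairs for six design variants.
designs = [(100, 50), (120, 40), (150, 40), (90, 80), (200, 20), (130, 60)]
front = pareto_front(designs)
print(front)  # [(90, 80), (100, 50), (120, 40), (200, 20)]
```

This quadratic filter is fine for a few hundred points; a design space with more than 100,000 variants would call for a sort-based approach.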

4 of 22

High Level Synthesis

  • HLS converts a program written at the algorithmic level in a high-level language such as C, C++, or SystemC into a synthesizable RTL design.

  • Popular Commercial HLS Tools:
    • Vivado HLS
    • NEC CyberWorkBench
    • Mentor Catapult
    • Cadence C2S
    • LegUp HLS

5 of 22

Characteristics and Implementation Time of ADPCM

FPGA Used: Xilinx Zynq UltraScale+ XCZU7EV

[Figure: bar chart of ADPCM run time in minutes by physical design stage. Stage labels: HLS, C-Synthesis, Logic Synthesis, Place and Route. Bar values shown: 5.77, 3.85, 3316. Total time ≈ 3500 mins.]

[Figure: ADPCM call graph with functions adpcm_main, reset, encode, decode, quantl, uppol1, uppol2, upzero, logscl, filtep, scalel, filtez.]

| Design Name | # of loops | # of arrays | # of functions | Design Space |
|---|---|---|---|---|
| adpcm | 12 | 9 | 11 | > 100,000 |

6 of 22

Quality of Results of a Design

  • Latency:  Latency is defined as the number of clock cycles required to produce an output.
  • Resource Requirement: # of logic resources. These include
    • Flip Flops,
    • LUTs,
    • BRAM,
    • DSPs,
    • IOBs used by a design.
  • Timing: Delay of the critical path of a circuit. This impacts the maximum frequency of operation.
  • Power: Mainly a concern in portable and mobile devices. Smaller technology nodes have higher leakage power.
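The relationship between critical-path delay, maximum frequency, and latency can be made concrete with a short sketch. The function names are illustrative, and the wall-clock formula assumes the design runs at exactly the given clock period.

```python
def fmax_mhz(critical_path_ns):
    """Maximum operating frequency in MHz for a critical-path delay in ns."""
    return 1000.0 / critical_path_ns


def execution_time_ns(latency_cycles, clock_period_ns):
    """Wall-clock time to produce an output: cycles times clock period."""
    return latency_cycles * clock_period_ns


print(fmax_mhz(10.0))                # 100.0
print(fmax_mhz(5.0))                 # 200.0
print(execution_time_ns(1000, 5.0))  # 5000.0
```

A design with fewer cycles is not automatically faster: a lower latency in cycles can be offset by a longer clock period.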

7 of 22

Vivado HLS vs Post Route Values

8 of 22

Current State of the Art

COMBA HLS Predict

Contributions

  1. Proposes graph-based, metrics-guided DSE.
  2. Suggests pragmas based on LLVM graphs.
  3. Predicts latency pre-synthesis.

Drawbacks

  1. Predicts only latency.
  2. Compares against Vivado HLS results (no post-route QoR).
  3. Reports no results for resource and timing predictions.

Fast & Accurate HLS Predict

Contributions

  1. Predicts post route timing and resource requirement
  2. Uses features from post synth log files
  3. Achieves good accuracy for the predicted parameters

Drawbacks

  1. Needs to synthesize designs
  2. Did not predict on individual benchmarks
  3. Did not predict latency.
  4. Calibrates post C-synthesis results to match post-route results.

Pyramid HLS Predict

Contributions

  1. Predicts post route resource and timing of HLS designs.
  2. Extracts features from HLS log files.
  3. Post route labels generated from Minerva tool.

Drawbacks

  1. Does not predict latency.
  2. Synthesis required to generate features.
  3. Did not predict on individual benchmarks.

J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2017.

H. M. Makrani, F. Farahmand, H. Sayadi, S. Bondi, S. Dinakarrao, H. Homayoun, and S. Rafatirad, "Pyramid: Machine Learning Framework to Estimate the Optimal Timing and Resource Usage of a High-Level Synthesis Design," arXiv, 2019.

S. Dai, Y. Zhou, H. Zhang, E. Ustun, E. F. Y. Young, and Z. Zhang, "Fast and Accurate Estimation of Quality of Results in High-Level Synthesis with Machine Learning," Field-Programmable Custom Computing Machines (FCCM), 2018.

Ironman

Contributions

  1. Predicts PPA in HLS designs.
  2. Extends the same work to ML based DSE.
  3. GNN and Reinforcement Learning are used

Drawbacks

  1. Needs a Code Transformer to generate scheduled DFG
  2. Can predict Resource and Frequency; Not Latency
  3. Did not predict on individual benchmarks
  4. Uses only one pragma: "#pragma HLS allocation instance=mul".

Nan Wu, Yuan Xie, and Cong Hao. 2021. “IRONMAN: GNN-assisted Design Space Exploration in HLS using Reinforcement Learning” Proceedings of the 2021 on Great Lakes Symposium on VLSI

9 of 22

Comparison with Existing Works

| Reference | Predicts Resource | Predicts Latency | Predicts Clock Period | C-synthesis Required? | Feature Source | Labels |
|---|---|---|---|---|---|---|
| Fast and Accurate | ✔️ | - | ✔️ | Yes | Synthesis Files | Post Route |
| Pyramid | ✔️ | - | ✔️ | Yes | Synthesis Files | Post Route |
| Comba | - | ✔️ | - | No | Analytical | C-synthesis |
| Ironman | ✔️ | - | ✔️ | Yes | Scheduled DFG | Post Route |
| This Work | ✔️ | ✔️ | ✔️ | No | C++/LLVM IR | Post Route |

10 of 22

Problem Statement

Given a dataset of synthesizable C/C++-based HLS code with all the pragma information, is it possible to create a machine-learning-based model that predicts the post-route clock period, latency, and resource requirement of a design without synthesizing it?

11 of 22

Overview of LLVM

  • LLVM is a compiler infrastructure that converts high-level C/C++ or Python code into a technology-independent, assembly-like "Intermediate Representation" (IR).
  • LLVM consists of 3 parts:
    • Clang frontend parser: C/C++ → IR code
    • Optimizer: IR code → IR code
      • Redundant code removal
      • Memory optimization
      • Change in code structure
      • Bitwidth reduction
    • Tech-specific backend: IR code → FPGA HLS input, Arm assembly, x86 assembly, GPU-specific code

12 of 22

Proposed Flow

13 of 22

Feature Analysis

| Source Name | Example Features | Number of Features |
|---|---|---|
| High Level C/C++ Code | Max and average unroll factor; max and average batch size of loops; max and average pipeline Initiation Interval; max pipelined loop name | 13 |
| IR File | Max and average number of instructions per BB; max, average and total number of (math, logic, sign extn, zero extn, vector and memory) operations | 44 |
| Control Flow Graph | Total number of nodes; max length of critical path; number of FCUs; max path length; max and average number of incoming/outgoing edges per node | 6 |
| Callgraph | Max and min latency of child functions; max and min clock period of child functions; max and mean FCU count of child functions | 6 |
| Total | | 69 |
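To make the IR-file features more concrete, here is a minimal sketch that counts instructions per basic block and operations per category from textual LLVM IR. It is not the extractor used in this work: the line-based parsing, the opcode groupings, the feature names, and the sample IR are all assumptions for illustration. It also ignores opcode prefixes such as "tail call".

```python
import re
from collections import Counter

# Assumed opcode groupings; the talk does not list the exact sets.
MATH = {"add", "sub", "mul", "sdiv", "udiv", "srem", "urem",
        "fadd", "fsub", "fmul", "fdiv"}
LOGIC = {"and", "or", "xor", "shl", "lshr", "ashr"}
MEMORY = {"load", "store", "alloca", "getelementptr"}
LABEL = re.compile(r"^([\w.]+):")


def ir_features(ir_text):
    """Count per-basic-block instructions and per-category operations."""
    blocks, ops = [], Counter()
    in_function = block_open = False
    for raw in ir_text.splitlines():
        line = raw.strip()
        if line.startswith("define"):
            in_function, block_open = True, False
            continue
        if line == "}":
            in_function = False
            continue
        if not in_function or not line or line.startswith(";"):
            continue
        if LABEL.match(line):          # a label opens a new basic block
            blocks.append(0)
            block_open = True
            continue
        if not block_open:             # unlabeled entry block
            blocks.append(0)
            block_open = True
        blocks[-1] += 1
        # The opcode follows "%x = " if the instruction defines a value.
        ops[line.split(" = ", 1)[-1].split()[0]] += 1
    return {
        "num_basic_blocks": len(blocks),
        "max_instr_per_bb": max(blocks, default=0),
        "avg_instr_per_bb": sum(blocks) / len(blocks) if blocks else 0.0,
        "total_math_ops": sum(ops[o] for o in MATH),
        "total_logic_ops": sum(ops[o] for o in LOGIC),
        "total_memory_ops": sum(ops[o] for o in MEMORY),
        "total_sext": ops["sext"],
        "total_zext": ops["zext"],
    }


SAMPLE_IR = """
define i32 @f(i32 %a, i32 %b, i32* %p) {
entry:
  %mul = mul nsw i32 %a, %b
  %cmp = icmp sgt i32 %mul, 0
  br i1 %cmp, label %then, label %exit
then:
  %v = load i32, i32* %p
  %sum = add nsw i32 %v, %mul
  store i32 %sum, i32* %p
  br label %exit
exit:
  ret i32 %mul
}
"""

feats = ir_features(SAMPLE_IR)
print(feats)
```

For robust extraction, an LLVM pass or the LLVM bindings would be a better fit than text matching; the text approach is used here only to keep the sketch self-contained.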

14 of 22

Callgraph

Control Flow Graph

Data Flow Graph

Graphs in LLVM
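Graph-level features such as node count, edge degrees, and maximum path length can be computed from a control flow graph stored as an adjacency map. The sketch below is an assumed, simplified version: the feature names are illustrative, and the longest-path routine requires an acyclic graph, so loop back edges would have to be removed first.

```python
from functools import lru_cache


def cfg_features(succ):
    """succ maps each basic block to the list of its successor blocks."""
    nodes = set(succ) | {s for targets in succ.values() for s in targets}
    out_deg = {n: len(succ.get(n, [])) for n in nodes}
    in_deg = {n: 0 for n in nodes}
    for targets in succ.values():
        for s in targets:
            in_deg[s] += 1

    @lru_cache(maxsize=None)
    def longest_from(n):
        # Edges on the longest path starting at n (acyclic graphs only).
        return max((1 + longest_from(s) for s in succ.get(n, [])), default=0)

    return {
        "total_nodes": len(nodes),
        "max_out_edges": max(out_deg.values()),
        "avg_out_edges": sum(out_deg.values()) / len(nodes),
        "max_in_edges": max(in_deg.values()),
        "avg_in_edges": sum(in_deg.values()) / len(nodes),
        "max_path_length": max(longest_from(n) for n in nodes),
    }


# Hypothetical CFG of a function with one if-branch.
cfg = {"entry": ["then", "exit"], "then": ["exit"], "exit": []}
graph_feats = cfg_features(cfg)
print(graph_feats)
```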

15 of 22

Study of Feature Importance

Tool Used: xgboost plot_importance

16 of 22

Selection of Training/Test Data

  • Data division: Training: 120 designs; Test: 280 designs
  • Results are reported as Mean Absolute Percentage Error (MAPE)

| Model Used | Clock Period | Latency | LUT |
|---|---|---|---|
| Vivado HLS | 102% | NA | 553% |
| Linear Regression | NP | NP | NP |
| MLP Regression | 9.69% | 16.58% | 44.14% |
| Random Forest | 7.98% | 18.25% | 16.28% |
| Gradient Boost | 6.29% | 10.22% | 10.32% |
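The slides do not spell out the error metric, so the standard definition of Mean Absolute Percentage Error is assumed here, with $y_i$ the post-route (actual) value of design $i$, $\hat{y}_i$ the predicted value, and $n$ the number of test designs:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
```

Under this definition, an estimate that is about twice the actual value yields an error near 100%, and an estimate several times the actual value yields an error of several hundred percent.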

17 of 22

Design of Experiments

  • M1: Fully Localized Training Model
    • Used in Design Space Exploration
    • A fraction of the DSE designs is generated using actual high-level synthesis
    • The remaining designs are predicted using fast ML-based predictive DSE
  • M2: Frequency Sweep Experiment
    • Trained on 3 different frequencies with varying pragmas
    • Tested on baseline design for 8 different frequencies
    • Predicted resource, latency and timing at 8 frequencies on the baseline design
  • M3: Multi FPGA robustness test
    • Trained and tested on 3 different FPGA devices
    • Trained on 300 versions of adpcm design
    • Tested on 64 totally new unseen designs

18 of 22

Analysis of Results

 

| Design Name | Clock Period: This Work | Clock Period: Vivado HLS | Latency: This Work | Resource: This Work | Resource: Vivado HLS |
|---|---|---|---|---|---|
| adpcm | 7.75 | 198.09 | 17.54 | 18.94 | 529.27 |
| ave8 | 7.28 | 182.14 | 16.41 | 7.82 | 73.18 |
| matmul | 7.92 | 67.15 | 11.12 | 8.37 | 560.78 |
| sobel | 0.6 | 163.23 | 1.76 | 0.94 | 584.47 |
| dfadd | 6.33 | 87.88 | 4.3 | 2.04 | 98.58 |
| dfdiv | 0.19 | 26.19 | NA | 0.03 | 114.38 |
| dfsin | 2.78 | 39.39 | NA | 3.78 | 167.21 |
| aes | 9.44 | 103.46 | NA | 20.78 | 911.25 |
| blowfish | 14.35 | 36.57 | NA | 30.25 | 491.91 |
| Average | 6.29 | 100.45 | 10.22 | 10.32 | 392.33 |

  • M1: Fully Localized Training Model
    • Mean Absolute Percentage Error Values

19 of 22

Analysis of Results

  • M2: Frequency Sweep Experiment
    • Actual vs Predicted Values

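| Frequency (MHz) | Clock Period (ns): Actual | Clock Period (ns): Predicted | Latency (clock cycles): Actual | Latency (clock cycles): Predicted | LUT: Actual | LUT: Predicted |
|---|---|---|---|---|---|---|
| 100 | 7.74 | 7.44 | 20104 | 21659 | 2782 | 3244 |
| 125 | 6.64 | 7.41 | 23604 | 21677 | 2774 | 3244 |
| 150 | 6.64 | 6.29 | 28254 | 26916 | 2772 | 3244 |
| 175 | 4.55 | 4.49 | 34004 | 40351 | 2772 | 2605 |
| 200 | 4.55 | 4.60 | 43154 | 41332 | 2588 | 2735 |
| 225 | 4.55 | 4.60 | 43204 | 41332 | 2636 | 2735 |
| 300 | 3.00 | 3.11 | 55954 | 43381 | 2740 | 2749 |
| 500 | 3.00 | 3.45 | 111654 | 47566 | 4443 | 5095 |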
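The actual and predicted values in the frequency sweep can be summarized per parameter with a mean absolute percentage error. The sketch below assumes the standard MAPE definition and copies the eight rows from the table; the variable names are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Rows for 100, 125, 150, 175, 200, 225, 300 and 500 MHz.
clock_actual = [7.74, 6.64, 6.64, 4.55, 4.55, 4.55, 3.00, 3.00]
clock_pred = [7.44, 7.41, 6.29, 4.49, 4.60, 4.60, 3.11, 3.45]
latency_actual = [20104, 23604, 28254, 34004, 43154, 43204, 55954, 111654]
latency_pred = [21659, 21677, 26916, 40351, 41332, 41332, 43381, 47566]
lut_actual = [2782, 2774, 2772, 2772, 2588, 2636, 2740, 4443]
lut_pred = [3244, 3244, 3244, 2605, 2735, 2735, 2749, 5095]

clock_mape = mape(clock_actual, clock_pred)
latency_mape = mape(latency_actual, latency_pred)
lut_mape = mape(lut_actual, lut_pred)

print(f"Clock period MAPE: {clock_mape:.2f}%")
print(f"Latency MAPE:      {latency_mape:.2f}%")
print(f"LUT MAPE:          {lut_mape:.2f}%")
```

Computing the error per row as well shows that the 500 MHz point contributes the largest latency error in this sweep.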

20 of 22

Analysis of Results

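| Device | Clock Period: Validation | Clock Period: Test | Latency: Validation | Latency: Test | LUT: Validation | LUT: Test |
|---|---|---|---|---|---|---|
| Zynq 7000 | 5.55 | 7.75 | 20.24 | 17.54 | 19.02 | 18.94 |
| Virtex 7 | 4.11 | 6.50 | 18.36 | 17.10 | 12.36 | 17.89 |
| Kintex 7 | 5.51 | 5.06 | 17.69 | 19.50 | 14.14 | 11.70 |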

  • M3: Multi FPGA Results
    • MAPE values using Gradient Boost Regression

21 of 22

Summary and Conclusion

  • First work to predict post-route metrics without synthesis.
  • Tested on 10 benchmarks from CHStone and S2CBench.
  • Can be easily integrated with commercial HLS tools.
  • Achieved MAPE of < 10% for all 3 predicted parameters for all three proposed training methods.
    • Proposing to analyze features rigorously to create more robust models.
  • Achieved an average speed-up of 3.48x using the fully localized model as compared to Vivado HLS.
  • Tested robustness on multi-frequency designs and multiple FPGA devices.

22 of 22

Contact

    • Pingakshya Goswami: pxg131330@utdallas.edu
    • Dinesh Bhatia: dinesh@utdallas.edu