3 of 22

High Level Synthesis FPGA Design Flow

Specification (C/C++/SystemC) -> Compile -> LLVM IR -> Allocation -> Scheduling -> Binding -> RTL Generation (Verilog) -> Logic Synthesis -> Tech Mapping -> Optimization -> Placement -> Routing

Functional Unit Library

int a, b, c, d;
int p, q;

int main() {
    p = a * b;
    q = p * (c + d);
}

Pareto Optimal Points

Design Space Exploration
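
The Allocation, Scheduling and Binding stages can be illustrated on the small example from this slide (p = a*b; q = p*(c+d)). The sketch below is only an illustration, not the algorithm of any particular HLS tool: it assumes every operation takes one control step, uses a simple as-soon-as-possible (ASAP) schedule, and the operation names (mul1, add1, mul2) are made up for the example.

```python
# Data-flow graph of the slide's example: p = a*b; q = p*(c+d).
# Each entry maps an operation name to (operation type, predecessors).
# The names mul1/add1/mul2 are illustrative labels, not tool output.
dfg = {
    "mul1": ("mul", []),                # p = a * b
    "add1": ("add", []),                # t = c + d (temporary value)
    "mul2": ("mul", ["mul1", "add1"]),  # q = p * t
}


def asap_schedule(graph):
    """Give each operation the earliest control step, assuming unit latency."""
    step = {}

    def visit(op):
        if op not in step:
            preds = graph[op][1]
            step[op] = 1 + max((visit(p) for p in preds), default=0)
        return step[op]

    for op in graph:
        visit(op)
    return step


def units_needed(graph, step):
    """Allocation estimate: peak count of same-type operations in one step."""
    usage = {}
    for op, (kind, _) in graph.items():
        key = (kind, step[op])
        usage[key] = usage.get(key, 0) + 1
    peak = {}
    for (kind, _), count in usage.items():
        peak[kind] = max(peak.get(kind, 0), count)
    return peak


schedule = asap_schedule(dfg)
print(schedule)
print(units_needed(dfg, schedule))
```

Under these assumptions the two multiplications land in different control steps, so a binding step could map both onto a single multiplier from the functional unit library.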

4 of 22

High Level Synthesis

High-level synthesis (HLS) converts a program written at the algorithmic level in a high-level language such as C, C++ or SystemC into a synthesizable RTL design.

Popular Commercial HLS Tools:

Vivado HLS
NEC CyberWorkBench
Mentor Catapult
Cadence C2S
LegUp HLS

5 of 22

Characteristics and Implementation Time of ADPCM

FPGA Used: Xilinx Zynq UltraScale+ XCZU7EV

Design: ADPCM

Physical Design Stage	Run time in Minutes
C-Synthesis (HLS)	3316
Logic Synthesis	5.77
Place and Route	3.85

Total time ≈ 3500 mins

Call graph of adpcm (functions shown): adpcm_main, reset, encode, decode, quantl, uppol1, uppol2, upzero, logscl, filtep, scalel, filtez

Design Name	# of loops	# of arrays	# of functions	Design Space
adpcm	12	9	11	> 100,000

6 of 22

Quality of Results of a Design

Latency: the number of clock cycles required to produce an output.
Resource Requirement: the number of logic resources used by a design. These include

Flip-flops,
LUTs,
BRAMs,
DSPs,
IOBs.

Timing: the delay of the critical path of the circuit, which determines the maximum frequency of operation.
Power: mainly a concern in portable and mobile devices. Smaller technology nodes have higher leakage power.
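
The relation between timing, latency and wall-clock execution time can be made concrete with a short calculation. This is a minimal sketch: the function names and the numbers (an 8.0 ns critical path, 20000 cycles) are illustrative assumptions, not measurements from this work.

```python
def max_frequency_mhz(critical_path_ns):
    """Maximum clock frequency (MHz) allowed by a critical path delay in ns."""
    return 1000.0 / critical_path_ns


def execution_time_us(latency_cycles, clock_period_ns):
    """Wall-clock time (microseconds) = latency in cycles x clock period."""
    return latency_cycles * clock_period_ns / 1000.0


# Illustrative values only (hypothetical design).
critical_path_ns = 8.0
latency_cycles = 20000

print(max_frequency_mhz(critical_path_ns))
print(execution_time_us(latency_cycles, critical_path_ns))
```

The point of the sketch is that latency in cycles alone does not give execution time: a design with fewer cycles but a longer critical path can still be slower.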

7 of 22

Vivado HLS vs Post Route Values

8 of 22

Current State of the Art

COMBA HLS Predict
Contributions
Proposes graph-based, metrics-guided DSE.
Suggests pragmas based on LLVM graphs.
Predicts latency pre-synthesis.

Drawbacks
Predicts only latency.
Compared against Vivado HLS results (no post-route QoR).
No results on resource and timing prediction.

Fast & Accurate HLS Predict

Contributions

Predicts post route timing and resource requirement
Uses features from post synth log files
Achieves good accuracy for the predicted parameters

Drawbacks

Needs to synthesize designs
Did not predict on individual benchmarks
Did not predict latency.
Calibrates post C-synthesis results to match post-route results.

Pyramid HLS Predict

Contributions

Predicts post route resource and timing of HLS designs.
Extracts features from HLS log files.
Post route labels generated from Minerva tool.

Drawbacks

Does not predict latency.
Synthesis required to generate features.
Did not predict on individual benchmarks.

J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang and B. He, "COMBA: A comprehensive model-based analysis framework for high level synthesis of real applications," 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

H. M. Makrani, F. Farahmand, H. Sayadi, S. Bondi, S. M. P. Dinakarrao, H. Homayoun and S. Rafatirad, "Pyramid: Machine Learning Framework to Estimate the Optimal Timing and Resource Usage of a High-Level Synthesis Design," arXiv, 2019.

S. Dai, Y. Zhou, H. Zhang, E. Ustun, E. F. Y. Young and Z. Zhang, "Fast and Accurate Estimation of Quality of Results in High-Level Synthesis with Machine Learning," Field-Programmable Custom Computing Machines (FCCM), 2018.

Ironman

Contributions

Predicts PPA in HLS designs.
Extends the same work to ML based DSE.
Uses GNNs and reinforcement learning.

Drawbacks

Needs a Code Transformer to generate scheduled DFG
Can predict Resource and Frequency; Not Latency
Did not predict on individual benchmarks
Uses only one pragma: "#pragma HLS allocation instance=mul".

N. Wu, Y. Xie and C. Hao, "IRONMAN: GNN-assisted Design Space Exploration in High-Level Synthesis via Reinforcement Learning," Proceedings of the 2021 Great Lakes Symposium on VLSI (GLSVLSI), 2021.

9 of 22

Comparison with Existing Works

	Predicted Parameters					
Reference	Resource	Latency	Clock Period	C-synthesis Required?	Feature Source	Labels
Fast and Accurate	✔️	❌	✔️	Yes	Synthesis Files	Post Route
Pyramid	✔️	❌	✔️	Yes	Synthesis Files	Post Route
COMBA	❌	✔️	❌	No	Analytical	C-synthesis
Ironman	✔️	❌	✔️	Yes	Scheduled DFG	Post Route
This Work	✔️	✔️	✔️	No	C++/LLVM IR	Post Route

10 of 22

Problem Statement

Given a dataset of synthesizable C/C++ based HLS code with all its pragma information, is it possible to create a machine learning based model that predicts the post-route clock period, latency and resource requirement of a design without synthesizing the design?

11 of 22

Overview of LLVM

LLVM is a compiler infrastructure that converts high-level code such as C/C++ (or Python, through a suitable front end) into a technology-independent, assembly-like "Intermediate Representation" (IR).
LLVM consists of 3 parts:

Clang frontend parser: translates C/C++ into IR code
Optimizer: redundant code removal, memory optimization, change in code structure, bitwidth reduction
Technology-specific backend: generates FPGA HLS input, Arm assembly, x86 assembly or GPU-specific code

12 of 22

Proposed Flow

13 of 22

Feature Analysis

Source Name	Example Features	Number of Features
High Level C/C++ Code	Max and average unroll factor; max and average batch size of loops; max and average pipeline initiation interval; max pipelined loop name	13
IR File	Max and average number of instructions per basic block (BB); max, average and total number of math, logic, sign extension, zero extension, vector and memory operations	44
Control Flow Graph	Total number of nodes; max length of critical path; number of FCUs; max path length; max and average number of incoming/outgoing edges per node	6
Callgraph	Max and min latency of child functions; max and min clock period of child functions; max and mean FCU count of child functions	6
Total		69
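
The IR-file features in the table (instructions per basic block, operation counts by type) can be sketched with a small text parser. This is a minimal illustration under stated assumptions: the IR snippet is hand-written for the example, the opcode categories are a guess at what "math", "logic" and "memory" cover, and the function name ir_features is hypothetical. It is not the feature extractor used in this work.

```python
import re
from collections import Counter

# Hand-written LLVM IR used only as sample input (assumption, not tool output).
SAMPLE_IR = """
define i32 @f(i32 %a, i32 %b, i8 %c) {
entry:
  %p = mul nsw i32 %a, %b
  %w = sext i8 %c to i32
  %cmp = icmp sgt i32 %p, %w
  br i1 %cmp, label %then, label %done
then:
  %s = add nsw i32 %p, %w
  %m = and i32 %s, 255
  br label %done
done:
  %r = phi i32 [ %m, %then ], [ %p, %entry ]
  ret i32 %r
}
"""

# Assumed grouping of opcodes into feature categories.
CATEGORIES = {
    "math": {"add", "sub", "mul", "sdiv", "udiv", "fadd", "fsub", "fmul", "fdiv"},
    "logic": {"and", "or", "xor", "shl", "lshr", "ashr"},
    "sext": {"sext"},
    "zext": {"zext"},
    "memory": {"load", "store", "getelementptr", "alloca"},
}


def ir_features(ir_text):
    """Count instructions per basic block and operations per category."""
    blocks = {}      # basic block label -> list of opcodes
    current = None
    for raw in ir_text.splitlines():
        line = raw.split(";")[0].strip()          # drop IR comments
        if not line or line.startswith(("define", "declare", "}")):
            continue
        label = re.fullmatch(r"([\w.]+):", line)
        if label:
            current = label.group(1)
            blocks[current] = []
            continue
        if current is None:
            continue
        # "%x = opcode ..." or "opcode ..." (e.g. br, ret, store)
        body = line.split("=", 1)[1] if re.match(r"%[\w.]+\s*=", line) else line
        blocks[current].append(body.split()[0])

    sizes = [len(ops) for ops in blocks.values()]
    counts = Counter(op for ops in blocks.values() for op in ops)
    features = {
        "num_bb": len(blocks),
        "max_instr_per_bb": max(sizes),
        "avg_instr_per_bb": sum(sizes) / len(sizes),
    }
    for name, opcodes in CATEGORIES.items():
        features["total_" + name] = sum(counts[op] for op in opcodes)
    return features


print(ir_features(SAMPLE_IR))
```

A production extractor would more likely walk the IR through LLVM's own APIs than parse text; the text parser is used here only to keep the sketch self-contained.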

14 of 22

Graphs in LLVM

Callgraph

Control Flow Graph

Data Flow Graph
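
Graph-level features such as node count, longest path and edge counts per node can be computed once a control flow graph is available as an adjacency list. The sketch below assumes an acyclic CFG (loop back edges removed beforehand) and uses a hypothetical three-block graph; the function name cfg_features and the feature names are illustrative, not those of this work.

```python
from functools import lru_cache

# Hypothetical CFG: basic block -> successor blocks (assumed acyclic).
cfg = {
    "entry": ["then", "done"],
    "then": ["done"],
    "done": [],
}


def cfg_features(graph):
    """Compute simple structural features of an acyclic control flow graph."""
    out_deg = {node: len(succs) for node, succs in graph.items()}
    in_deg = {node: 0 for node in graph}
    for succs in graph.values():
        for s in succs:
            in_deg[s] += 1

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Longest path starting at node, measured in number of nodes.
        return 1 + max((longest_from(s) for s in graph[node]), default=0)

    n = len(graph)
    return {
        "num_nodes": n,
        "max_path_length": max(longest_from(node) for node in graph),
        "max_out_edges": max(out_deg.values()),
        "avg_out_edges": sum(out_deg.values()) / n,
        "max_in_edges": max(in_deg.values()),
        "avg_in_edges": sum(in_deg.values()) / n,
    }


print(cfg_features(cfg))
```

The same traversal applies to a callgraph or data flow graph, as long as cycles are handled before the longest-path computation.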

15 of 22

Study of Feature Importance

Tool Used: xgboost plot_importance

16 of 22

Selection of Training/Test Data

Data division: Training: 120 designs; Test: 280 designs
Results are reported as Mean Absolute Percentage Error (MAPE)

Model Used	Clock Period	Latency	LUT
Vivado HLS	102%	NA	553%
Linear Regression	NP	NP	NP
MLP Regression	9.69%	16.58%	44.14%
Random Forest	7.98%	18.25%	16.28%
Gradient Boost	6.29%	10.22%	10.32%

17 of 22

Design of Experiments

M1: Fully Localized Training Model

Used in Design Space Exploration (DSE)
A fraction of the DSE designs is generated using actual high-level synthesis
The remaining designs are predicted using the fast ML-based predictor
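
Once synthesized and predicted design points are pooled, the DSE step reduces to keeping the Pareto-optimal points. The sketch below is a minimal illustration assuming two objectives that are both minimized (latency in cycles and LUT count); the design points are hypothetical and the function name pareto_front is not from this work.

```python
def pareto_front(points):
    """Return the points that no other point dominates (minimize both values)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (latency_cycles, luts) pairs, mixing synthesized and predicted designs.
designs = [
    (20000, 3200),
    (24000, 2800),
    (28000, 2800),
    (43000, 2600),
    (45000, 3000),
]

print(pareto_front(designs))
```

This quadratic-time filter is adequate for a few thousand points; a very large design space would call for sorting by one objective first.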

M2: Frequency Sweep Experiment

Trained on 3 different frequencies with varying pragmas
Tested on the baseline design at 8 different frequencies
Predicted resource, latency and timing at 8 frequencies on the baseline design

M3: Multi FPGA robustness test

Trained and tested on 3 different FPGA devices
Trained on 300 versions of adpcm design
Tested on 64 totally new unseen designs

18 of 22

Analysis of Results

	Clock Period		Latency	Resource
Design Name	This Work	Vivado HLS	This Work	This Work	Vivado HLS
adpcm	7.75	198.09	17.54	18.94	529.27
ave8	7.28	182.14	16.41	7.82	73.18
matmul	7.92	67.15	11.12	8.37	560.78
sobel	0.6	163.23	1.76	0.94	584.47
dfadd	6.33	87.88	4.3	2.04	98.58
dfdiv	0.19	26.19	NA	0.03	114.38
dfsin	2.78	39.39	NA	3.78	167.21
aes	9.44	103.46	NA	20.78	911.25
blowfish	14.35	36.57	NA	30.25	491.91
Average	6.29	100.45	10.22	10.32	392.33

M1: Fully Localized Training Model

Mean Absolute Percentage Error Values

19 of 22

Analysis of Results

M2: Frequency Sweep Experiment

Actual vs Predicted Values

	Clock Period (ns)		Latency (clock cycle)		LUT
Frequency (MHz)	Actual	Predicted	Actual	Predicted	Actual	Predicted
100	7.74	7.44	20104	21659	2782	3244
125	6.64	7.41	23604	21677	2774	3244
150	6.64	6.29	28254	26916	2772	3244
175	4.55	4.49	34004	40351	2772	2605
200	4.55	4.60	43154	41332	2588	2735
225	4.55	4.60	43204	41332	2636	2735
300	3.00	3.11	55954	43381	2740	2749
500	3.00	3.45	111654	47566	4443	5095
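
Mean Absolute Percentage Error (MAPE), the metric used in the other results tables, can also be computed from actual/predicted pairs like those above. The sketch below applies a plain MAPE definition to the clock period and LUT columns of this table; the helper name mape is an assumption, and the resulting figures are a reader-side calculation, not numbers reported by the authors.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Columns copied from the frequency sweep table (100 MHz to 500 MHz rows).
clock_actual = [7.74, 6.64, 6.64, 4.55, 4.55, 4.55, 3.00, 3.00]
clock_pred = [7.44, 7.41, 6.29, 4.49, 4.60, 4.60, 3.11, 3.45]
lut_actual = [2782, 2774, 2772, 2772, 2588, 2636, 2740, 4443]
lut_pred = [3244, 3244, 3244, 2605, 2735, 2735, 2749, 5095]

print(f"clock period MAPE: {mape(clock_actual, clock_pred):.2f}%")
print(f"LUT MAPE: {mape(lut_actual, lut_pred):.2f}%")
```

Because MAPE divides by the actual value, rows with small actual values weigh more heavily, which is worth keeping in mind when comparing columns with very different ranges.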

20 of 22

Analysis of Results

	Clock Period		Latency		LUT
Device	Validation	Test	Validation	Test	Validation	Test
Zynq 7000	5.55	7.75	20.24	17.54	19.02	18.94
Virtex 7	4.11	6.50	18.36	17.10	12.36	17.89
Kintex 7	5.51	5.06	17.69	19.50	14.14	11.70

M3: Multi FPGA Results

MAPE values using Gradient Boost Regression

21 of 22

Summary and Conclusion

First work to predict post-route metrics without synthesis.
Tested on 10 benchmarks from CHStone and S2CBench.
Can be easily integrated with commercial HLS tools.
Achieved an average MAPE of about 10% or lower for all 3 predicted parameters with the fully localized model (6.29% clock period, 10.22% latency, 10.32% resource).

Future work: analyze the features more rigorously to create more robust models.

Achieved an average speed-up of 3.48x using the fully localized model compared to Vivado HLS.
Tested robustness on multi-frequency designs and multiple FPGA devices.
