High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives
Jinming Zhuang, Zhuoping Yang, Peipei Zhou
University of Pittsburgh
jinming.zhuang@pitt.edu
peipei.zhou@pitt.edu
https://github.com/arc-research-lab/CHARM
https://peipeizhou-eecs.github.io/
Architecture Overview
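[Figure: ACAP architecture. An AIE Array (IO plus AIEs, each pairing a VLIW AIE core with 32KB of local memory) sits alongside the Programmable Logic (BRAM, URAM, CLB, DSP), the NoC, and the ARM Processor System, with off-chip DDR4-DIMM. Bandwidth labels: 25.6 GB/s (DDR4-DIMM) and 1.2 TB/s.]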
Challenge 1
AIE Core (150 lines of code): how each AIE behaves in the design
AIE Array (900 lines of code): how the AIEs are connected in the graph
PL (1000+ lines of code): how data moves between the AIE array and the PL
Proposed Framework
Challenge 2
[Chart: FP32 Performance (TFLOPS)]
NVIDIA Jetson TX2 GPU (16nm): 0.75
AMD U250 FPGA (16nm): 1.47
AMD VCK190 ACAP (7nm): 6.4
NVIDIA A100 GPU (7nm): 19.5
[Chart: Off-Chip Bandwidth (GB/s)]
NVIDIA Jetson TX2 GPU (16nm): 51.2
AMD U250 FPGA (16nm): 77
AMD VCK190 ACAP (7nm): 25.6
NVIDIA A100 GPU (7nm): 1555
[Chart: Required CTC Ratio (FLOP/B)]
NVIDIA Jetson TX2 GPU (16nm): 14.7
AMD U250 FPGA (16nm): 19.1
AMD VCK190 ACAP (7nm): 250
NVIDIA A100 GPU (7nm): 12.5
The VCK190 requirement is 17.0x that of the Jetson TX2, 13.0x that of the U250, and 20.0x that of the A100.
[Chart: Implemented Energy Efficiency (GFLOPS/Watt)]
NVIDIA Jetson TX2 GPU (16nm): 27.7
AMD U250 FPGA (16nm): 8.9
AMD VCK190 ACAP (7nm): 64.2
NVIDIA A100 GPU (7nm): 60.5
VCK190 gains: 2.3x over the Jetson TX2, 7.2x over the U250, 1.1x over the A100.
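The required CTC (computation-to-communication) ratio shown in the chart is the peak throughput divided by the off-chip bandwidth, which comes out in FLOP per byte. The following is a minimal Python sketch of that arithmetic. It assumes the pairing of chart values to platforms used here; the dictionary and function names are illustrative and do not come from the talk.

```python
# Peak FP32 throughput (TFLOPS) and off-chip bandwidth (GB/s), as read from the charts.
# Assumption: the value-to-platform pairing below is inferred from the slide labels.
PLATFORMS = {
    "NVIDIA Jetson TX2 GPU": (0.75, 51.2),
    "AMD U250 FPGA": (1.47, 77.0),
    "AMD VCK190 ACAP": (6.4, 25.6),
    "NVIDIA A100 GPU": (19.5, 1555.0),
}


def required_ctc(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """FLOPs that must be performed per off-chip byte to keep the compute units busy."""
    # TFLOPS -> GFLOPS, then GFLOPS / (GB/s) = FLOP per byte.
    return peak_tflops * 1e3 / bandwidth_gb_s


ctc = {name: required_ctc(*spec) for name, spec in PLATFORMS.items()}

if __name__ == "__main__":
    for name, value in ctc.items():
        print(f"{name}: {value:.1f} FLOP/B")
```

Under these inputs the VCK190 needs roughly 250 FLOP per byte, an order of magnitude more data reuse than the other three platforms, which is why off-chip bandwidth is the dominant constraint on this device. The other results agree with the chart labels to within rounding.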
More Harvesting on CTC
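[Figure: system dataflow. LHS, RHS and Output data live in DDR4 (DDR space); DMA, Sender and Receiver modules in the PL connect through PLIO to the AIE Array (each AIE: VLIW processor with 32KB memory), alongside the NoC and the ARM Processor System. LHS and RHS tiles are numbered 0 to 7; status label: Sending. Bandwidth labels: 25.6 GB/s and 1.2 TB/s.]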
Proposed Dataflow
[Figure: proposed dataflow, animated. Components: DDR4 and DDR space, NoC, ARM Processor System, PL with DMA, Sender and Receiver, PLIO, and the AIE Array of VLIW processors with 32KB memory. LHS and RHS tiles are numbered 0 to 7 and move through the LHS, RHS and Output buffers; the frames show tiles labeled 0, 1 and 2 while "Sending". Bandwidth labels: 25.6 GB/s and 1.2 TB/s.]
Challenges
“High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives”
Solutions
https://github.com/arc-research-lab/CHARM
AutoMM Framework Overview
AutoMM Input and Output (IOP)
AutoMM DSE and ACG
2) AIE Domain: PLIO Reuse, Placement
3) PL Domain: Buffer Reuse, Efficient DMA
2) PL: Vitis HLS C/C++
3) Host CPU: AMD/Xilinx Runtime C/C++
AutoMM Design Methodology
1) Highly Efficient Single AIE Design Using VLIW
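[Figure: single-AIE matrix multiply, LHS x RHS = Output. AIE local memory holds LHS elements L0 to L3, RHS elements R00 to R03 and R10 to R13, and outputs O1 and O2. The AIE core has doubled operand registers (Register A: A0, A1; Register B: B0, B1) and two accumulation registers (Acc0, Acc1).]
[Pipeline diagram with columns Load L, Load R, MAC and Store over cycles 0 to N. Cycle 0 preloads A0 ← L0 and B0 ← R0. In the following cycles the next operands are loaded into the other register pair (A1 ← L1, B1 ← R1) while the MAC unit computes Acc0 += L0*R00 and Acc1 += L0*R10 on operands already loaded, so the load latency is hidden (DoubleReg). The pattern continues (A0 ← L4, B0 ← R4; Acc0 += L3*R03; Acc1 += L3*R13) until Acc0 is stored at cycle N-1 and Acc1 at cycle N.]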
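The double-register schedule in the pipeline diagram can be illustrated with a toy cycle model: while the MAC unit consumes one register pair, the next operands are loaded into the other pair. The sketch below is a simplification under stated assumptions. It models a single accumulator and a one-cycle load, it is plain Python rather than AIE intrinsics, and the function names are hypothetical.

```python
def pipelined_dot(lhs, rhs):
    """Toy model: two (A, B) register pairs let each load overlap the previous MAC."""
    assert lhs and len(lhs) == len(rhs)
    n = len(lhs)
    regs = [None, None]           # slot 0 = (A0, B0), slot 1 = (A1, B1)
    acc = 0
    regs[0] = (lhs[0], rhs[0])    # cycle 0: preload slot 0, no MAC yet
    cycles = 1
    for k in range(n):
        # Same cycle: load operand k+1 into the idle slot ...
        if k + 1 < n:
            regs[(k + 1) % 2] = (lhs[k + 1], rhs[k + 1])
        # ... while the MAC unit consumes the slot loaded one cycle earlier.
        a, b = regs[k % 2]
        acc += a * b
        cycles += 1
    return acc, cycles


def serialized_dot(lhs, rhs):
    """Reference without overlap: every MAC waits for its own load."""
    acc, cycles = 0, 0
    for a, b in zip(lhs, rhs):
        cycles += 1               # load
        acc += a * b
        cycles += 1               # MAC
    return acc, cycles


if __name__ == "__main__":
    print(pipelined_dot([1, 2, 3, 4], [5, 6, 7, 8]))
    print(serialized_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

In this model the overlapped schedule takes N + 1 cycles for N multiply-accumulates instead of 2N, which is the effect the "Latency Hidden" annotation points at.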
2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connection
[Figure: two ways for one PLIO to feed four AIEs: Broadcast, and Packet-switch with packets 0, 1, 2, 3. Example: A (LHS) x B (RHS) = C, tile size label 32*32, mapped onto an AIE Array of 16 AIEs in Rows 0 to 3, fed from a PLIO holding packets 3, 2, 1, 0.]
[Figure, animated over Time 0 to Time 3: the PLIO queue holds packets 0, 1, 2, 3. At each time step the next packet leaves the queue and reaches a group of four AIEs (packet 0 at Time 0, packet 1 at Time 1, packet 2 at Time 2, packet 3 at Time 3), so after Time 3 all 16 AIEs in Rows 0 to 3 hold data delivered from the PLIO.]
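The Time 0 to Time 3 frames combine two mechanisms: packet switching time-multiplexes one PLIO across several destinations, and broadcast copies each packet to a group of AIEs. The sketch below models that schedule. It assumes a 4 x 4 grid and that packet p goes to row p; both are guesses made for illustration, and the function names are hypothetical.

```python
def packet_broadcast_schedule(num_packets=4, group_size=4):
    """List of (packet id, destination AIEs), one entry per time step.

    Assumption: packet p is routed by its ID to row p (packet switching),
    then copied to every AIE in that row (broadcast).
    """
    schedule = []
    for t in range(num_packets):
        destinations = [(t, col) for col in range(group_size)]
        schedule.append((t, destinations))
    return schedule


def delivered(schedule):
    """Return {(row, col): packet id} after every time step has run."""
    state = {}
    for packet_id, destinations in schedule:
        for aie in destinations:
            state[aie] = packet_id
    return state


if __name__ == "__main__":
    schedule = packet_broadcast_schedule()
    state = delivered(schedule)
    print(f"{len(state)} AIEs fed by one PLIO in {len(schedule)} time steps")
```

With these defaults a single PLIO serves 16 AIEs in four time steps, where a one-PLIO-per-AIE design would need 16 ports.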
3) Routing Optimization on AIE Array Design
[Figure: routing on a 16-AIE array through switch boxes (routers), using packet-switch and broadcast connections from the PLIO. Two configurations are shown: one with a single interface (PLIO 0) and one with two interfaces (PLIO 0 and PLIO 1). Annotations: 10 connections from West → East and 4 connections from West → East.]
Experiment Setup
Experiment Results
[Chart: INT16 Energy Efficiency (GOPS/Watt)]
NVIDIA Jetson TX2 GPU (16nm): N/A
AMD U250 FPGA (16nm): 40.6
AMD VCK190 ACAP (7nm): 132.2
NVIDIA A100 GPU (7nm): N/A
VCK190 gain: 3.3x over the U250.
[Chart: Implemented Energy Efficiency (GFLOPS/Watt)]
NVIDIA Jetson TX2 GPU (16nm): 27.7
AMD U250 FPGA (16nm): 8.9
AMD VCK190 ACAP (7nm): 64.2
NVIDIA A100 GPU (7nm): 60.5
VCK190 gains: 2.3x over the Jetson TX2, 7.2x over the U250, 1.1x over the A100.
[Chart: INT8 Energy Efficiency (GOPS/Watt)]
NVIDIA Jetson TX2 GPU (16nm): N/A
AMD U250 FPGA (16nm): 74.2
AMD VCK190 ACAP (7nm): 461.7
NVIDIA A100 GPU (7nm): 270.9
VCK190 gains: 6.2x over the U250, 1.7x over the A100.
[Chart: FP32 NCF and MLP Energy Efficiency (GFLOPS/Watt), AMD VCK190 ACAP vs. NVIDIA A100 GPU]
Bar values: 51.6, 49.4, 55.1, 63.9. VCK190-to-A100 ratios: 0.96x and 1.16x.
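The gain labels on the energy-efficiency charts are the VCK190 figure divided by each baseline. The sketch below recomputes them. It is an illustration only: the pairing of bar values to platforms is inferred from the ratio labels and should be treated as an assumption, and the names are hypothetical.

```python
# Energy efficiency per platform as read from the charts (None = N/A).
# Assumption: the value-to-platform pairing is inferred from the ratio labels.
EFFICIENCY = {
    "FP32 (GFLOPS/Watt)": {"TX2": 27.7, "U250": 8.9, "VCK190": 64.2, "A100": 60.5},
    "INT16 (GOPS/Watt)": {"TX2": None, "U250": 40.6, "VCK190": 132.2, "A100": None},
    "INT8 (GOPS/Watt)": {"TX2": None, "U250": 74.2, "VCK190": 461.7, "A100": 270.9},
}


def gains_over(values, target="VCK190"):
    """Ratio of the target's efficiency to every baseline that has a measurement."""
    reference = values[target]
    return {
        name: round(reference / value, 1)
        for name, value in values.items()
        if name != target and value is not None
    }


if __name__ == "__main__":
    for precision, values in EFFICIENCY.items():
        print(precision, gains_over(values))
```

Baselines marked N/A are skipped rather than treated as zero, so a missing measurement never shows up as an infinite gain.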
Thank you & Welcome to Questions