1 of 23

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives

Jinming Zhuang, Zhuoping Yang, Peipei Zhou

University of Pittsburgh

jinming.zhuang@pitt.edu

peipei.zhou@pitt.edu

https://github.com/arc-research-lab/CHARM

https://peipeizhou-eecs.github.io/

2 of 23

Architecture Overview

  • High Heterogeneity: CPU + Programmable Logic + VLIW AIE Cores

[Figure: Versal ACAP architecture. An AIE array of VLIW AIE cores, each with 32 KB of local memory, connects through I/O interfaces (1.2 TB/s) to the Programmable Logic (BRAM, URAM, CLB, DSP); the NoC links the PL, the ARM Processor System, and a DDR4 DIMM providing 25.6 GB/s.]

3 of 23

Challenge 1

  • High Heterogeneity → Non-Trivial Programming Effort
    - AIE Core (150 lines of code): how each AIE core behaves in the design
    - AIE Array (900 lines of code): how the AIEs are connected in the graph
    - PL (1000+ lines of code): how data moves between the AIEs and the PL

4 of 23

Proposed Framework

  • AutoMM: a Python-based automatic framework for matrix multiply

5 of 23

Challenge 2

  • Theoretical computation capability is scaling much faster than off-chip communication bandwidth

Platform                       FP32 Peak (TFLOPS)   Off-Chip Bandwidth (GB/s)
NVIDIA Jetson TX2 GPU (16nm)   0.75                 51.2
AMD U250 FPGA (16nm)           1.47                 77
NVIDIA A100 GPU (7nm)          19.5                 1555
AMD VCK190 ACAP (7nm)          6.4                  25.6
6 of 23

Challenge 2

  • Required compute-to-communication (CTC) ratio and AutoMM implemented energy efficiency
  • The required CTC ratio is the number of FLOPs a design must perform per byte fetched off-chip to sustain peak performance, i.e., peak throughput divided by off-chip bandwidth.

Required CTC Ratio (GFLOP/Byte)
Platform                       Required CTC   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   14.7           17.0x
AMD U250 FPGA (16nm)           19.1           13.0x
NVIDIA A100 GPU (7nm)          12.5           20.0x
AMD VCK190 ACAP (7nm)          250            --

Implemented FP32 Energy Efficiency (GFLOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   60.5                1.1x
AMD U250 FPGA (16nm)           8.9                 7.2x
NVIDIA A100 GPU (7nm)          27.7                2.3x
AMD VCK190 ACAP (7nm)          64.2                --

With the lowest off-chip bandwidth, VCK190 must harvest 13x-20x more on-chip reuse (CTC) than the other platforms; even so, the AutoMM design achieves the highest implemented energy efficiency.
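To make the arithmetic behind these ratios explicit, here is a small Python check that derives each platform's required CTC ratio from the peak-throughput and bandwidth numbers above (a standard roofline-style bound; the values are the ones from this deck):

```python
# Required CTC ratio = peak compute / off-chip bandwidth:
# the FLOPs a design must deliver per byte fetched off-chip
# to sustain peak throughput.
platforms = {
    # name: (FP32 peak, TFLOPS; off-chip bandwidth, GB/s)
    "NVIDIA Jetson TX2 GPU": (0.75, 51.2),
    "AMD U250 FPGA":         (1.47, 77.0),
    "NVIDIA A100 GPU":       (19.5, 1555.0),
    "AMD VCK190 ACAP":       (6.4,  25.6),
}

for name, (tflops, bw_gbs) in platforms.items():
    ctc = tflops * 1e3 / bw_gbs  # (GFLOP/s) / (GB/s) = FLOPs per byte
    print(f"{name:>22}: {ctc:6.1f} GFLOP/Byte")
# VCK190's required ratio (~250) is 13x-20x the other platforms',
# hence the need for much more aggressive on-chip data reuse.
```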

7 of 23

Proposed Dataflow

[Figure: proposed dataflow. A DMA engine in the PL, with dedicated sender and receiver modules, moves LHS, RHS, and output tiles between the DDR4 space (25.6 GB/s, via the NoC and ARM Processor System) and the AIE array over PLIO (1.2 TB/s); each AIE couples a VLIW processor with 32 KB of local memory. Tiles 0-7 are sent in sequence.]

8 of 23

Proposed Dataflow

[Figure (continued): the same dataflow over time. The DMA sender streams the numbered LHS/RHS tiles from the DDR space in a pipelined sequence, so off-chip transfers of later tiles overlap with AIE computation on earlier ones.]

9 of 23

Challenges → Solutions

“High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives”

  • Off-chip bandwidth doesn't scale as fast as computation
    → Aggressively exploit on-chip data reuse to overlap off-chip data movement with computation
  • Experiments are needed on multiple platforms under different data types
    → On-board results for INT8, INT16, and FP32 on GPUs and FPGAs
  • Heterogeneity makes it non-trivial to build a high-performance system design on Versal
    → A Python-based interface in the open-sourced framework AutoMM

https://github.com/arc-research-lab/CHARM

10 of 23

AutoMM Framework Overview

AutoMM Input and Output (IOP)

  • Inputs (Python-based interface):
    1) NumPy arrays of the matrix multiply
    2) Hardware platform
    3) Python APIs (DSE, ACG)
  • Outputs:
    1) Bitstream running on the AIE and PL (AIE / V++ compilers)
    2) Executable running on the ARM CPU (GCC cross-compilation)
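To make this flow concrete, here is a minimal sketch of what driving AutoMM from Python could look like. The module name automm and the functions dse/codegen are illustrative assumptions, not the framework's published API; see the CHARM repository for the actual interface.

```python
import numpy as np

# Hypothetical driver script: the module and function names below are
# assumptions for illustration, not AutoMM's actual API.
import automm  # assumed package name

# 1) Inputs: NumPy operands and a target platform.
A = np.random.rand(1024, 1024).astype(np.float32)  # LHS
B = np.random.rand(1024, 1024).astype(np.float32)  # RHS

# 2) DSE: search the tiling/mapping for this MM size on the target device.
cfg = automm.dse(M=1024, K=1024, N=1024, platform="vck190", dtype="fp32")

# 3) ACG: emit AIE C/C++, Vitis HLS C/C++, and host XRT code, then build
#    the bitstream (AIE/V++ compilers) and the ARM executable (GCC cross).
automm.codegen(cfg, out_dir="./build")
```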

11 of 23

AutoMM Framework Overview

AutoMM DSE and ACG

  • DSE (Design Space Exploration)
    1) AIE & PL: overall tiling
    2) AIE domain: PLIO reuse, placement
    3) PL domain: buffer reuse, efficient DMA
  • ACG (Automatic Code Generator)
    1) AIE: single-core instruction & array graph C/C++
    2) PL: Vitis HLS C/C++
    3) Host CPU: AMD/Xilinx Runtime (XRT) C/C++
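As a flavor of the overall-tiling step, here is a toy DSE sketch under a deliberately simplified cost model: it maximizes the compute-to-communication ratio of a (TM, TK, TN) tile subject to an on-chip buffer budget. The budget, candidate tile sizes, and cost model are illustrative assumptions, not AutoMM's actual search.

```python
import itertools

DTYPE_BYTES = 4                    # FP32
ONCHIP_BYTES = 4 * 1024 * 1024     # assumed on-chip buffer budget

def footprint(tm, tk, tn):
    # LHS tile + RHS tile + output tile must all fit on chip.
    return DTYPE_BYTES * (tm * tk + tk * tn + tm * tn)

def ctc(tm, tk, tn):
    # FLOPs performed per byte moved off-chip for one tile.
    return (2 * tm * tk * tn) / footprint(tm, tk, tn)

candidates = [t for t in itertools.product([32, 64, 128, 256], repeat=3)
              if footprint(*t) <= ONCHIP_BYTES]
best = max(candidates, key=lambda t: ctc(*t))
print(f"best (TM, TK, TN) = {best}, CTC = {ctc(*best):.1f} FLOP/Byte")
```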

12 of 23

AutoMM Design Methodology

1) Highly Efficient Single AIE Design Using VLIW

[Figure: single-AIE design. The 32 KB AIE local memory holds the LHS tile (L0-L3), RHS tile (R00-R13), and output tiles (O1, O2) for LHS x RHS = Output. Double vector registers (A0/A1, B0/B1) and two accumulation registers (Acc0, Acc1) keep the VLIW core's load, MAC, and store slots busy every cycle.]

VLIW schedule (load, MAC, and store slots issue in parallel each cycle):

Cycle   Load L    Load R    MAC               Store
0       A0<-L0    B0<-R0    (preload)
1       A1<-L1    B1<-R1    Acc0 += L0*R00
2       A0<-L2    B0<-R2    Acc1 += L0*R10
...     (double registers keep load/store latency hidden behind MACs)
N-1     A0<-L4    B0<-R4    Acc0 += L3*R03    Store Acc0
N                           Acc1 += L3*R13    Store Acc1
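A scalar Python model of the schedule above, assuming the dual-accumulator pattern from the figure: Acc0 and Acc1 share each loaded LHS element (L0*R00 and L0*R10), and the next operands load while the current MACs issue. This only illustrates the software pipeline; real kernels use AIE VLIW vector intrinsics.

```python
import numpy as np

def single_aie_mac(L, R0, R1):
    """Two dot products sharing one LHS stream, mirroring how Acc0/Acc1
    reuse each loaded L element in the VLIW schedule above."""
    acc0 = acc1 = 0.0
    a, b0, b1 = L[0], R0[0], R1[0]          # cycle 0: preload
    for i in range(len(L)):
        acc0 += a * b0                      # MAC slot for Acc0
        acc1 += a * b1                      # MAC slot for Acc1 (same a)
        if i + 1 < len(L):                  # next loads issue in parallel slots
            a, b0, b1 = L[i + 1], R0[i + 1], R1[i + 1]
    return acc0, acc1                       # stored in cycles N-1 and N

L, R0, R1 = np.arange(4.0), np.ones(4), 2 * np.ones(4)
assert single_aie_mac(L, R0, R1) == (L @ R0, L @ R1)
```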

13 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: two ways one PLIO can feed four AIEs: broadcast, where every AIE receives the same data stream, and packet-switch, where the PLIO time-multiplexes distinct data to each AIE.]

14 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: a 4x4 AIE array (Rows 0-3) computing C = A x B on 32x32 tiles; LHS and RHS tiles, numbered 0-3, are delivered through shared PLIOs.]

15 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 0: tile 0 streams into the AIE array (Rows 0-3) over the shared PLIOs; tiles 1-3 are still queued.]

16 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 1: tile 1 enters the array while tile 0 propagates and accumulates inside; tiles 2-3 are queued.]

17 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 2: tile 2 enters while tiles 0-1 continue through the array; tile 3 is queued.]

18 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 3: tile 3, the last tile, enters; all four tiles are now pipelined through the array.]
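A small NumPy model of this feeding scheme, under assumptions read off the figures: a 4x4 AIE grid computes C = A x B on 32x32 tiles, one K-tile per time step; each LHS tile is broadcast to every AIE in its row, while a packet-switched PLIO delivers a distinct RHS tile to each AIE. The exact tile-to-AIE mapping is my own illustrative choice.

```python
import numpy as np

T, G = 32, 4                                  # tile size, grid dimension
A = np.random.rand(G * T, G * T).astype(np.float32)   # LHS
B = np.random.rand(G * T, G * T).astype(np.float32)   # RHS
C = np.zeros((G * T, G * T), dtype=np.float32)

for t in range(G):                            # Time 0..3: one K-tile per step
    for r in range(G):                        # broadcast: all AIEs in row r
        a_tile = A[r*T:(r+1)*T, t*T:(t+1)*T]  # share the same LHS tile
        for c in range(G):                    # packet-switch: one PLIO sends a
            b_tile = B[t*T:(t+1)*T, c*T:(c+1)*T]  # distinct RHS tile per AIE
            C[r*T:(r+1)*T, c*T:(c+1)*T] += a_tile @ b_tile

assert np.allclose(C, A @ B, rtol=1e-4)       # matches the full matmul
```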

19 of 23

AutoMM Design Methodology

3) Routing Optimization on AIE Array Design

[Figure: switch-box routing with one vs. two PLIO interfaces. Feeding the whole AIE-array net (broadcast + packet-switch) from a single interface (PLIO 0) requires 10 west→east connections through the switch boxes (routers); splitting it across two interfaces (PLIO 0 and PLIO 1) reduces this to 4 west→east connections, easing routing congestion.]

20 of 23

Experiment Setup

  • Implemented platform: AMD Versal VCK190 ACAP
  • Frequency: PL @ 230 MHz, AIE @ 1 GHz
  • Baseline platforms: AMD U250 FPGA, NVIDIA Jetson TX2 and A100 GPUs
  • Applications: Matrix Multiply, NCF, and MLP
  • Software tools: Vitis 2021.1/2019.2, CUDA Toolkit 10.2/11.3
  • Resource utilization of VCK190

21 of 23

Experiment Results

  • AutoMM Implemented Energy Efficiency Comparison

INT16 Energy Efficiency (GOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   N/A                 --
AMD U250 FPGA (16nm)           N/A                 --
NVIDIA A100 GPU (7nm)          40.6                3.3x
AMD VCK190 ACAP (7nm)          132.2               --

Implemented FP32 Energy Efficiency (GFLOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   60.5                1.1x
AMD U250 FPGA (16nm)           8.9                 7.2x
NVIDIA A100 GPU (7nm)          27.7                2.3x
AMD VCK190 ACAP (7nm)          64.2                --
22 of 23

Experiment Results

  • AutoMM Implemented Energy Efficiency Comparison

INT8 Energy Efficiency (GOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   74.2                6.2x
AMD U250 FPGA (16nm)           N/A                 --
NVIDIA A100 GPU (7nm)          270.9               1.7x
AMD VCK190 ACAP (7nm)          461.7               --

FP32 NCF and MLP Energy Efficiency (GFLOPS/Watt), VCK190 vs. A100
Application   AMD VCK190 ACAP   NVIDIA A100 GPU   VCK190 / A100
NCF           49.4              51.6              0.96x
MLP           63.9              55.1              1.16x

23 of 23

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives

Jinming Zhuang, Zhuoping Yang, Peipei Zhou

University of Pittsburgh

jinming.zhuang@pitt.edu

peipei.zhou@pitt.edu

https://github.com/arc-research-lab/CHARM

https://peipeizhou-eecs.github.io/

Thank you & Welcome to Questions