1 of 23

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives

Jinming Zhuang, Zhuoping Yang, Peipei Zhou

University of Pittsburgh

jinming.zhuang@pitt.edu

peipei.zhou@pitt.edu

https://github.com/arc-research-lab/CHARM

https://peipeizhou-eecs.github.io/

2 of 23

Architecture Overview

  • High Heterogeneity: CPU + Programmable Logic + VLIW AIE Cores

[Figure: Versal ACAP architecture. An AIE array of VLIW AIE cores, each with 32 KB of local memory, connects through I/O interfaces (1.2 TB/s) to the Programmable Logic (BRAM, URAM, CLB, DSP); the NoC links the PL, the ARM Processor System, and a DDR4 DIMM providing 25.6 GB/s.]

3 of 23

Challenge 1

  • High Heterogeneity → Non-Trivial Programming Effort
    - AIE Core (150 lines of code): how each AIE core behaves in the design
    - AIE Array (900 lines of code): how the AIEs are connected in the graph
    - PL (1000+ lines of code): how data moves between the AIEs and the PL

4 of 23

Proposed Framework

  • AutoMM: a Python-based automatic framework for matrix multiply

5 of 23

Challenge 2

  • Theoretical computation capability is scaling much faster than off-chip communication bandwidth

Platform                       FP32 Peak (TFLOPS)   Off-Chip Bandwidth (GB/s)
NVIDIA Jetson TX2 GPU (16nm)   0.75                 51.2
AMD U250 FPGA (16nm)           1.47                 77
NVIDIA A100 GPU (7nm)          19.5                 1555
AMD VCK190 ACAP (7nm)          6.4                  25.6
6 of 23

Challenge 2

  • Required compute-to-communication (CTC) ratio and AutoMM implemented energy efficiency
  • The required CTC ratio is the number of FLOPs a design must perform per byte fetched off-chip to sustain peak performance, i.e., peak throughput divided by off-chip bandwidth.

Required CTC Ratio (GFLOP/Byte)
Platform                       Required CTC   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   14.7           17.0x
AMD U250 FPGA (16nm)           19.1           13.0x
NVIDIA A100 GPU (7nm)          12.5           20.0x
AMD VCK190 ACAP (7nm)          250            --

Implemented FP32 Energy Efficiency (GFLOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   60.5                1.1x
AMD U250 FPGA (16nm)           8.9                 7.2x
NVIDIA A100 GPU (7nm)          27.7                2.3x
AMD VCK190 ACAP (7nm)          64.2                --

With the lowest off-chip bandwidth, VCK190 must harvest 13x-20x more on-chip reuse (CTC) than the other platforms; even so, the AutoMM design achieves the highest implemented energy efficiency.
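To make the arithmetic behind these ratios explicit, here is a small Python check that derives each platform's required CTC ratio from the peak-throughput and bandwidth numbers above (a standard roofline-style bound; the values are the ones from this deck):

```python
# Required CTC ratio = peak compute / off-chip bandwidth:
# the FLOPs a design must deliver per byte fetched off-chip
# to sustain peak throughput.
platforms = {
    # name: (FP32 peak, TFLOPS; off-chip bandwidth, GB/s)
    "NVIDIA Jetson TX2 GPU": (0.75, 51.2),
    "AMD U250 FPGA":         (1.47, 77.0),
    "NVIDIA A100 GPU":       (19.5, 1555.0),
    "AMD VCK190 ACAP":       (6.4,  25.6),
}

for name, (tflops, bw_gbs) in platforms.items():
    ctc = tflops * 1e3 / bw_gbs  # (GFLOP/s) / (GB/s) = FLOPs per byte
    print(f"{name:>22}: {ctc:6.1f} GFLOP/Byte")
# VCK190's required ratio (~250) is 13x-20x the other platforms',
# hence the need for much more aggressive on-chip data reuse.
```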

7 of 23

Proposed Dataflow

[Figure: proposed dataflow. A DMA engine in the PL, with dedicated sender and receiver modules, moves LHS, RHS, and output tiles between the DDR4 space (25.6 GB/s, via the NoC and ARM Processor System) and the AIE array over PLIO (1.2 TB/s); each AIE couples a VLIW processor with 32 KB of local memory. Tiles 0-7 are sent in sequence.]

8 of 23

Proposed Dataflow

[Figure (continued): the same dataflow over time. The DMA sender streams the numbered LHS/RHS tiles from the DDR space in a pipelined sequence, so off-chip transfers of later tiles overlap with AIE computation on earlier ones.]

9 of 23

Challenges → Solutions

“High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives”

  • Off-chip bandwidth doesn't scale as fast as computation
    → Aggressively exploit on-chip data reuse to overlap off-chip data movement with computation
  • Experiments are needed on multiple platforms under different data types
    → On-board results for INT8, INT16, and FP32 on GPUs and FPGAs
  • Heterogeneity makes it non-trivial to build a high-performance system design on Versal
    → A Python-based interface in the open-sourced framework AutoMM

https://github.com/arc-research-lab/CHARM

10 of 23

AutoMM Framework Overview

AutoMM Input and Output (IOP)

  • Inputs (Python-based interface):
    1) NumPy arrays of the matrix multiply
    2) Hardware platform
    3) Python APIs (DSE, ACG)
  • Outputs:
    1) Bitstream running on the AIE and PL (AIE / V++ compilers)
    2) Executable running on the ARM CPU (GCC cross-compilation)
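To make this flow concrete, here is a minimal sketch of what driving AutoMM from Python could look like. The module name automm and the functions dse/codegen are illustrative assumptions, not the framework's published API; see the CHARM repository for the actual interface.

```python
import numpy as np

# Hypothetical driver script: the module and function names below are
# assumptions for illustration, not AutoMM's actual API.
import automm  # assumed package name

# 1) Inputs: NumPy operands and a target platform.
A = np.random.rand(1024, 1024).astype(np.float32)  # LHS
B = np.random.rand(1024, 1024).astype(np.float32)  # RHS

# 2) DSE: search the tiling/mapping for this MM size on the target device.
cfg = automm.dse(M=1024, K=1024, N=1024, platform="vck190", dtype="fp32")

# 3) ACG: emit AIE C/C++, Vitis HLS C/C++, and host XRT code, then build
#    the bitstream (AIE/V++ compilers) and the ARM executable (GCC cross).
automm.codegen(cfg, out_dir="./build")
```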

11 of 23

AutoMM Framework Overview

AutoMM DSE and ACG

  • DSE (Design Space Exploration)
    1) AIE & PL: overall tiling
    2) AIE domain: PLIO reuse, placement
    3) PL domain: buffer reuse, efficient DMA
  • ACG (Automatic Code Generator)
    1) AIE: single-core instruction & array graph C/C++
    2) PL: Vitis HLS C/C++
    3) Host CPU: AMD/Xilinx Runtime (XRT) C/C++
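As a flavor of the overall-tiling step, here is a toy DSE sketch under a deliberately simplified cost model: it maximizes the compute-to-communication ratio of a (TM, TK, TN) tile subject to an on-chip buffer budget. The budget, candidate tile sizes, and cost model are illustrative assumptions, not AutoMM's actual search.

```python
import itertools

DTYPE_BYTES = 4                    # FP32
ONCHIP_BYTES = 4 * 1024 * 1024     # assumed on-chip buffer budget

def footprint(tm, tk, tn):
    # LHS tile + RHS tile + output tile must all fit on chip.
    return DTYPE_BYTES * (tm * tk + tk * tn + tm * tn)

def ctc(tm, tk, tn):
    # FLOPs performed per byte moved off-chip for one tile.
    return (2 * tm * tk * tn) / footprint(tm, tk, tn)

candidates = [t for t in itertools.product([32, 64, 128, 256], repeat=3)
              if footprint(*t) <= ONCHIP_BYTES]
best = max(candidates, key=lambda t: ctc(*t))
print(f"best (TM, TK, TN) = {best}, CTC = {ctc(*best):.1f} FLOP/Byte")
```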

12 of 23

AutoMM Design Methodology

1) Highly Efficient Single AIE Design Using VLIW

[Figure: single-AIE design. The 32 KB AIE local memory holds the LHS tile (L0-L3), RHS tile (R00-R13), and output tiles (O1, O2) for LHS x RHS = Output. Double vector registers (A0/A1, B0/B1) and two accumulation registers (Acc0, Acc1) keep the VLIW core's load, MAC, and store slots busy every cycle.]

VLIW schedule (load, MAC, and store slots issue in parallel each cycle):

Cycle   Load L    Load R    MAC               Store
0       A0<-L0    B0<-R0    (preload)
1       A1<-L1    B1<-R1    Acc0 += L0*R00
2       A0<-L2    B0<-R2    Acc1 += L0*R10
...     (double registers keep load/store latency hidden behind MACs)
N-1     A0<-L4    B0<-R4    Acc0 += L3*R03    Store Acc0
N                           Acc1 += L3*R13    Store Acc1
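A scalar Python model of the schedule above, assuming the dual-accumulator pattern from the figure: Acc0 and Acc1 share each loaded LHS element (L0*R00 and L0*R10), and the next operands load while the current MACs issue. This only illustrates the software pipeline; real kernels use AIE VLIW vector intrinsics.

```python
import numpy as np

def single_aie_mac(L, R0, R1):
    """Two dot products sharing one LHS stream, mirroring how Acc0/Acc1
    reuse each loaded L element in the VLIW schedule above."""
    acc0 = acc1 = 0.0
    a, b0, b1 = L[0], R0[0], R1[0]          # cycle 0: preload
    for i in range(len(L)):
        acc0 += a * b0                      # MAC slot for Acc0
        acc1 += a * b1                      # MAC slot for Acc1 (same a)
        if i + 1 < len(L):                  # next loads issue in parallel slots
            a, b0, b1 = L[i + 1], R0[i + 1], R1[i + 1]
    return acc0, acc1                       # stored in cycles N-1 and N

L, R0, R1 = np.arange(4.0), np.ones(4), 2 * np.ones(4)
assert single_aie_mac(L, R0, R1) == (L @ R0, L @ R1)
```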

13 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: two ways one PLIO can feed four AIEs: broadcast, where every AIE receives the same data stream, and packet-switch, where the PLIO time-multiplexes distinct data to each AIE.]

14 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: a 4x4 AIE array (Rows 0-3) computing C = A x B on 32x32 tiles; LHS and RHS tiles, numbered 0-3, are delivered through shared PLIOs.]

15 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 0: tile 0 streams into the AIE array (Rows 0-3) over the shared PLIOs; tiles 1-3 are still queued.]

16 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 1: tile 1 enters the array while tile 0 propagates and accumulates inside; tiles 2-3 are queued.]

17 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 2: tile 2 enters while tiles 0-1 continue through the array; tile 3 is queued.]

18 of 23

AutoMM Design Methodology

2) Highly IO-Reused AIE Array Design: Broadcast and Packet-Switch Connections

[Figure: Time 3: tile 3, the last tile, enters; all four tiles are now pipelined through the array.]
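A small NumPy model of this feeding scheme, under assumptions read off the figures: a 4x4 AIE grid computes C = A x B on 32x32 tiles, one K-tile per time step; each LHS tile is broadcast to every AIE in its row, while a packet-switched PLIO delivers a distinct RHS tile to each AIE. The exact tile-to-AIE mapping is my own illustrative choice.

```python
import numpy as np

T, G = 32, 4                                  # tile size, grid dimension
A = np.random.rand(G * T, G * T).astype(np.float32)   # LHS
B = np.random.rand(G * T, G * T).astype(np.float32)   # RHS
C = np.zeros((G * T, G * T), dtype=np.float32)

for t in range(G):                            # Time 0..3: one K-tile per step
    for r in range(G):                        # broadcast: all AIEs in row r
        a_tile = A[r*T:(r+1)*T, t*T:(t+1)*T]  # share the same LHS tile
        for c in range(G):                    # packet-switch: one PLIO sends a
            b_tile = B[t*T:(t+1)*T, c*T:(c+1)*T]  # distinct RHS tile per AIE
            C[r*T:(r+1)*T, c*T:(c+1)*T] += a_tile @ b_tile

assert np.allclose(C, A @ B, rtol=1e-4)       # matches the full matmul
```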

19 of 23

AutoMM Design Methodology

3) Routing Optimization on AIE Array Design

[Figure: switch-box routing with one vs. two PLIO interfaces. Feeding the whole AIE-array net (broadcast + packet-switch) from a single interface (PLIO 0) requires 10 west→east connections through the switch boxes (routers); splitting it across two interfaces (PLIO 0 and PLIO 1) reduces this to 4 west→east connections, easing routing congestion.]

20 of 23

Experiment Setup

  • Implemented platform: AMD Versal VCK190 ACAP
  • Frequency: PL @ 230 MHz, AIE @ 1 GHz
  • Baseline platforms: AMD U250 FPGA, NVIDIA Jetson TX2 and A100 GPUs
  • Applications: Matrix Multiply, NCF, and MLP
  • Software tools: Vitis 2021.1/2019.2, CUDA Toolkit 10.2/11.3
  • Resource utilization of VCK190

21 of 23

Experiment Results

  • AutoMM Implemented Energy Efficiency Comparison

INT16 Energy Efficiency (GOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   N/A                 --
AMD U250 FPGA (16nm)           N/A                 --
NVIDIA A100 GPU (7nm)          40.6                3.3x
AMD VCK190 ACAP (7nm)          132.2               --

Implemented FP32 Energy Efficiency (GFLOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   60.5                1.1x
AMD U250 FPGA (16nm)           8.9                 7.2x
NVIDIA A100 GPU (7nm)          27.7                2.3x
AMD VCK190 ACAP (7nm)          64.2                --
22 of 23

Experiment Results

  • AutoMM Implemented Energy Efficiency Comparison

INT8 Energy Efficiency (GOPS/Watt)
Platform                       Energy Efficiency   VCK190 / This Platform
NVIDIA Jetson TX2 GPU (16nm)   74.2                6.2x
AMD U250 FPGA (16nm)           N/A                 --
NVIDIA A100 GPU (7nm)          270.9               1.7x
AMD VCK190 ACAP (7nm)          461.7               --

FP32 NCF and MLP Energy Efficiency (GFLOPS/Watt), VCK190 vs. A100
Application   AMD VCK190 ACAP   NVIDIA A100 GPU   VCK190 / A100
NCF           49.4              51.6              0.96x
MLP           63.9              55.1              1.16x

23 of 23

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives

Jinming Zhuang, Zhuoping Yang, Peipei Zhou

University of Pittsburgh

jinming.zhuang@pitt.edu

peipei.zhou@pitt.edu

https://github.com/arc-research-lab/CHARM

https://peipeizhou-eecs.github.io/

Thank you & Welcome to Questions