2022 IEEE International Conference on Artificial Intelligence Circuits and Systems

Virtual & Hybrid Conference

Optimizing Accelerator Configurability for

Mobile Transformer Networks

Steven Colleman*, Peter Zhu**, Wei Sun**, Marian Verhelst*

(*) Dept. of Electrical Engineering, MICAS-ESAT, KU Leuven, Belgium

(**) OPPO Electronics


Outline

  • Introduction
    • What is SU&TU?
    • Background & problem statement
  • Experiments and discussion
    • Impact of number of PEs
    • Impact of memory hierarchy on SUs
    • Impact of memory hierarchy on TUs
  • Conclusion



What is SU&TU?

  • SU (spatial unrolling) = which computations are executed in parallel across the PE array
  • TU (temporal unrolling) = the order in which the computations are executed over time (the for-loop ordering)

[Figure: PE array with input (I) and output (O) memories, controlled by an FSM]


Impact of SU: spatial utilization

[Figure: a layer with C = 64 and K = 64 mapped onto a C/K-unrolled PE array; the work is tiled as (C 1-40, K 1-40), (C 1-40, K 41-64), (C 41-64, K 1-40) and (C 41-64, K 41-64)]


Impact of SU: spatial utilization

  • Spatial utilization: (64/80)*(64/80) = 64%
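The utilization number above can be reproduced with a small sketch (a hypothetical helper, assuming the remainder of each dimension is handled by temporal tiling):

```python
from math import ceil, prod

def spatial_utilization(layer_dims, unroll_dims):
    """Fraction of PEs doing useful work when mapping layer dimensions
    onto the spatially unrolled array dimensions; leftovers are tiled
    temporally, so the last tile of each dimension is partially idle."""
    return prod(
        layer_dims[d] / (ceil(layer_dims[d] / u) * u)
        for d, u in unroll_dims.items()
    )

# A C = K = 64 layer on an array that unrolls C and K by 80 each:
u = spatial_utilization({"C": 64, "K": 64}, {"C": 80, "K": 80})
print(f"{u:.0%}")  # (64/80)*(64/80) = 64%
```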



Impact of SU: spatial utilization, illustration

  • ResNet101 layer (K = 256, C = 1024, Ox = Oy = 14):
    • 96.2% with C/K unrolling (C, K present in layer)
    • 0.7% with G/Fx/Fy unrolling (G, Fx, Fy absent from layer)
  • MobileNetv2 layer (G = 32, Fx = Fy = 3, Ox = Oy = 112):
    • 0.7% with C/K unrolling (C, K absent from layer)
    • 100% with G/Fx/Fy unrolling (G, Fx, Fy present in layer)


Impact of TU: temporal utilization

  • Temporal utilization: depends on the number of stall cycles

  • Stalls are determined by the SU and the innermost loop of the TU, as these define the required memory BW 🡪 equations in the paper

[Figure: PE array fed by W, I and O memories; are the BWs large enough to provide data every clock cycle?]
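The exact expressions are in the paper; a simplified sketch of the underlying idea, with hypothetical bandwidth numbers, might look like:

```python
def temporal_utilization(required_bw, available_bw):
    """If any operand memory (W, I or O) cannot deliver/accept the words
    the PE array needs per cycle, the array stalls.  Simplified model:
    the slowest operand port sets the stall factor, and utilization is
    ideal cycles divided by stretched cycles."""
    stretch = max(
        1.0,
        *(req / avail for req, avail in zip(required_bw, available_bw)),
    )
    return 1.0 / stretch

# W, I, O bandwidths (bits/cycle); the requirement is set by SU + inner TU loop
print(temporal_utilization((512, 256, 256), (256, 256, 256)))  # 0.5
```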


Problem statement

  • Transformer networks: a wide variety of layer types
  • Each layer type has a different optimal SU & TU schedule
    • Big differences 🡪 need for flexibility: support for more than 1 SU
  • Which SUs/TUs to select?
  • Impact of the HW architecture?



Used framework

  • COAC framework(*) to derive the optimal set of SUs

(*) S. Colleman, M. Shi, and M. Verhelst, "Hyper-flexible single core CNN execution," arXiv.


Two extensions

  • SU pruning: for square feature maps (Ox = Oy), take the Ox and Oy unrolling factors equal
    • Proof in paper
  • Matrix-matrix multiplication for the attention units in transformer networks
    • Translate the matrix-matrix mult into a convolutional layer
    • Matrix-matrix mult: (P×Q)·(Q×R) 🡪 (P×R)
    • Conv layer:
      • C = Q
      • K = R
      • Fx = Fy = 1
      • Ox*Oy = P 🡪 Ox = Oy = sqrt(P)
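The matmul-to-convolution translation can be checked numerically; a minimal sketch in plain NumPy (an illustration, not the COAC implementation):

```python
import numpy as np

def matmul_as_conv(A, B):
    """Execute A (P x Q) @ B (Q x R) as a 1x1 convolution:
    the P output positions become Ox*Oy spatial positions,
    the Q dimension becomes input channels C,
    the R dimension becomes output channels K (Fx = Fy = 1)."""
    P, Q = A.shape
    Qb, R = B.shape
    assert Q == Qb
    ox = int(np.sqrt(P))
    assert ox * ox == P, "P must be a perfect square for Ox = Oy"
    # Input feature map: C x Oy x Ox
    ifmap = A.T.reshape(Q, ox, ox)
    # Weights: K x C x Fy x Fx with Fx = Fy = 1
    w = B.T.reshape(R, Q, 1, 1)
    # 1x1 convolution = per-pixel dot product over the C dimension
    ofmap = np.einsum("kcij,cyx->kyx", w, ifmap)
    return ofmap.reshape(R, P).T

A = np.arange(16, dtype=float).reshape(4, 4)   # P = 4, Q = 4
B = np.ones((4, 3))                            # Q = 4, R = 3
assert np.allclose(matmul_as_conv(A, B), A @ B)
```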



Used hardware architecture

  • 1 memory level

  • Impact of the number of PEs on EDP?
    • Is combining 2 SUs more beneficial for bigger PE arrays?

  • Is this impact true for all BWs?
    • External BW, as a single memory level is assumed here

  • Swept over these two parameters (number of PEs and BW)


Results

  • The more PEs, the more advantage is gained from combining 2 SUs (a flexible PE array), for all BWs

Why? The more PEs, the more difficult it is to keep them all busy

PEs    BW I/O memory [bits]   Optimal individual SU
256    128                    1.557
256    512                    1.409
256    2048                   1.415
8192   2048                   4.124
8192   8192                   3.839
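Why a flexible array supporting 2 SUs wins: per layer it can pick whichever SU is cheaper. A toy sketch with made-up per-layer EDP values (hypothetical numbers, not the measured results above):

```python
def network_edp(edp_per_layer, sus):
    """Total EDP when, for every layer, the best of the given SUs is used.
    edp_per_layer: {su_name: [EDP of layer 0, EDP of layer 1, ...]}."""
    n_layers = len(next(iter(edp_per_layer.values())))
    return sum(
        min(edp_per_layer[su][i] for su in sus)
        for i in range(n_layers)
    )

# Hypothetical per-layer EDPs: SU "A" suits two layers, SU "B" the others
edp = {"A": [1.0, 1.2, 5.0, 6.0], "B": [4.0, 5.0, 1.1, 1.0]}
single = min(network_edp(edp, ["A"]), network_edp(edp, ["B"]))
combined = network_edp(edp, ["A", "B"])
print(single / combined)  # EDP gain of combining the 2 SUs
```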



Used hardware architecture

  • 2 memory levels
  • Swept over the L1 BWs to investigate their impact
  • Is the optimal SU a function of the BWs?

                  BW W   BW I   BW O
Architecture 1    2048   2048   2048
Architecture 2    4096   1024   1024
Architecture 3    1024   4096   1024
Architecture 4    1024   1024   4096


Results & selected SUs

EDP is lowest when all BWs are equal, which makes sense 🡪 more scheduling options

Gain is highest when BW I is highest, but all gains are lower than for 1 shared memory level


Results & selected SUs

For a big PE array size:

1 SU:

  • Function of the architecture

Combination of 2 SUs:

  • Always Oxy unrolling (present in each layer)
  • 1st SU: C/K unrolling
  • 2nd SU: G/Fx/Fy unrolling


Results

  • Implicit impact: architecture & layer dimensions 🡪 SU
  • Most selected inner TU loop for each SU

  • General insights:
    • For non-DW layers: 'for K' when Cu > Ku, and vice versa
    • For DW layers: always 'for Fx/Fy' instead, when already included in the SU

SU (unrolling factors)          non-DW   DW
1 SU:        1, 1, 16, 4, 128   For C    For Fx
2 SUs, nr 1: 1, 64, 2, 1, 64    For K    For Fx
2 SUs, nr 2: 1, 2, 4, 16, 64    For C    For Ox
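One reading of the general insights above, written as a small selection rule (a paraphrase for illustration, not the framework's actual code; the `fx_fy_in_su` flag is an assumption):

```python
def inner_tu_loop(is_dw, Cu, Ku, fx_fy_in_su=True):
    """Pick the innermost temporal loop for a layer/SU pair.
    Cu/Ku: spatial unrolling factors of C and K in the chosen SU;
    fx_fy_in_su: whether Fx/Fy unrolling is part of the SU."""
    if is_dw:
        # depthwise: loop over the filter dims when they sit in the SU,
        # otherwise fall back to an output dimension
        return "for Fx/Fy" if fx_fy_in_su else "for Ox"
    # non-DW: iterate over whichever of C/K is less spatially unrolled
    return "for K" if Cu > Ku else "for C"

print(inner_tu_loop(is_dw=False, Cu=16, Ku=4))  # -> for K
print(inner_tu_loop(is_dw=False, Cu=2, Ku=64))  # -> for C
print(inner_tu_loop(is_dw=True, Cu=1, Ku=1))    # -> for Fx/Fy
```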


Results

  • Big influence of both SU and TU
    • Spatial unrolling depends on the layer dimensions
    • Temporal unrolling depends on the architecture (BWs of the memories)
  • More important for big PE arrays
    • Because it is more difficult to keep all PEs busy
  • Combination of 2 SUs: always C/K with G/Fx/Fy
    • Independent of the architecture
  • SU & memory BWs have an impact on the ideal stationarity

  • Future work: impact of overhead