2022 IEEE International Conference on Artificial Intelligence Circuits and Systems

Virtual & Hybrid Conference

Optimizing Accelerator Configurability for

Mobile Transformer Networks

Steven Colleman*, Peter Zhu**, Wei Sun**, Marian Verhelst*

(*) Dept. of Electrical Engineering, MICAS-ESAT, KU Leuven, Belgium

(**) OPPO Electronics


Outline

  • Introduction
    • What is SU&TU?
    • Background & problem statement
  • Experiments and discussion
    • Impact of number of PEs
    • Impact of memory hierarchy on SUs
    • Impact of memory hierarchy on TUs
  • Conclusion



What is SU&TU?

  • SU (spatial unrolling) = which computations are executed in parallel across the PE array
  • TU (temporal unrolling) = the order in which the computations are executed over time (the for-loop ordering)

[Figure: PE array with input (I) and output (O) memories, controlled by an FSM]


Impact of SU: spatial utilization

[Figure: a layer with C = 64 and K = 64 mapped onto a C/K-unrolled PE array; the work is tiled as (C 1-40, K 1-40), (C 1-40, K 41-64), (C 41-64, K 1-40) and (C 41-64, K 41-64)]


Impact of SU: spatial utilization

  • Spatial utilization: (64/80)*(64/80) = 64%
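The utilization number above can be reproduced with a small sketch (a hypothetical helper, assuming the remainder of each dimension is handled by temporal tiling):

```python
from math import ceil, prod

def spatial_utilization(layer_dims, unroll_dims):
    """Fraction of PEs doing useful work when mapping layer dimensions
    onto the spatially unrolled array dimensions; leftovers are tiled
    temporally, so the last tile of each dimension is partially idle."""
    return prod(
        layer_dims[d] / (ceil(layer_dims[d] / u) * u)
        for d, u in unroll_dims.items()
    )

# A C = K = 64 layer on an array that unrolls C and K by 80 each:
u = spatial_utilization({"C": 64, "K": 64}, {"C": 80, "K": 80})
print(f"{u:.0%}")  # (64/80)*(64/80) = 64%
```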



Impact of SU: spatial utilization, illustration

  • ResNet101 layer (K = 256, C = 1024, Ox = Oy = 14):
    • 96.2% with C/K unrolling (C, K present in layer)
    • 0.7% with G/Fx/Fy unrolling (G, Fx, Fy absent from layer)
  • MobileNetv2 layer (G = 32, Fx = Fy = 3, Ox = Oy = 112):
    • 0.7% with C/K unrolling (C, K absent from layer)
    • 100% with G/Fx/Fy unrolling (G, Fx, Fy present in layer)


Impact of TU: temporal utilization

  • Temporal utilization: depends on the number of stall cycles

  • Stalls are determined by the SU and the innermost loop of the TU, as these define the required memory BW 🡪 equations in the paper

[Figure: PE array fed by W, I and O memories; are the BWs large enough to provide data every clock cycle?]
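The exact expressions are in the paper; a simplified sketch of the underlying idea, with hypothetical bandwidth numbers, might look like:

```python
def temporal_utilization(required_bw, available_bw):
    """If any operand memory (W, I or O) cannot deliver/accept the words
    the PE array needs per cycle, the array stalls.  Simplified model:
    the slowest operand port sets the stall factor, and utilization is
    ideal cycles divided by stretched cycles."""
    stretch = max(
        1.0,
        *(req / avail for req, avail in zip(required_bw, available_bw)),
    )
    return 1.0 / stretch

# W, I, O bandwidths (bits/cycle); the requirement is set by SU + inner TU loop
print(temporal_utilization((512, 256, 256), (256, 256, 256)))  # 0.5
```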


Problem statement

  • Transformer networks: a wide variety of layer types
  • Each layer type has a different optimal SU & TU schedule
    • Big differences 🡪 need for flexibility: support for more than 1 SU
  • Which SUs/TUs to select?
  • Impact of the HW architecture?



Used framework

  • COAC framework(*) to derive the optimal set of SUs

(*) S. Colleman, M. Shi, and M. Verhelst, "Hyper-flexible single core CNN execution," arXiv.


Two extensions

  • SU pruning: for square feature maps (Ox = Oy), take the Ox and Oy unrolling factors equal
    • Proof in paper
  • Matrix-matrix multiplication for the attention units in transformer networks
    • Translate the matrix-matrix mult into a convolutional layer
    • Matrix-matrix mult: (P×Q)·(Q×R) 🡪 (P×R)
    • Conv layer:
      • C = Q
      • K = R
      • Fx = Fy = 1
      • Ox*Oy = P 🡪 Ox = Oy = sqrt(P)
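The matmul-to-convolution translation can be checked numerically; a minimal sketch in plain NumPy (an illustration, not the COAC implementation):

```python
import numpy as np

def matmul_as_conv(A, B):
    """Execute A (P x Q) @ B (Q x R) as a 1x1 convolution:
    the P output positions become Ox*Oy spatial positions,
    the Q dimension becomes input channels C,
    the R dimension becomes output channels K (Fx = Fy = 1)."""
    P, Q = A.shape
    Qb, R = B.shape
    assert Q == Qb
    ox = int(np.sqrt(P))
    assert ox * ox == P, "P must be a perfect square for Ox = Oy"
    # Input feature map: C x Oy x Ox
    ifmap = A.T.reshape(Q, ox, ox)
    # Weights: K x C x Fy x Fx with Fx = Fy = 1
    w = B.T.reshape(R, Q, 1, 1)
    # 1x1 convolution = per-pixel dot product over the C dimension
    ofmap = np.einsum("kcij,cyx->kyx", w, ifmap)
    return ofmap.reshape(R, P).T

A = np.arange(16, dtype=float).reshape(4, 4)   # P = 4, Q = 4
B = np.ones((4, 3))                            # Q = 4, R = 3
assert np.allclose(matmul_as_conv(A, B), A @ B)
```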



Used hardware architecture

  • 1 memory level

  • Impact of the number of PEs on EDP?
    • Is combining 2 SUs more beneficial for bigger PE arrays?

  • Is this impact true for all BWs?
    • External BW, as a single memory level is assumed here

  • Swept over these two parameters (number of PEs and BW)


Results

  • The more PEs, the more advantage is gained from combining 2 SUs (a flexible PE array), for all BWs

Why? The more PEs, the more difficult it is to keep them all busy

PEs    BW I/O memory [bits]   Optimal individual SU
256    128                    1.557
256    512                    1.409
256    2048                   1.415
8192   2048                   4.124
8192   8192                   3.839
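Why a flexible array supporting 2 SUs wins: per layer it can pick whichever SU is cheaper. A toy sketch with made-up per-layer EDP values (hypothetical numbers, not the measured results above):

```python
def network_edp(edp_per_layer, sus):
    """Total EDP when, for every layer, the best of the given SUs is used.
    edp_per_layer: {su_name: [EDP of layer 0, EDP of layer 1, ...]}."""
    n_layers = len(next(iter(edp_per_layer.values())))
    return sum(
        min(edp_per_layer[su][i] for su in sus)
        for i in range(n_layers)
    )

# Hypothetical per-layer EDPs: SU "A" suits two layers, SU "B" the others
edp = {"A": [1.0, 1.2, 5.0, 6.0], "B": [4.0, 5.0, 1.1, 1.0]}
single = min(network_edp(edp, ["A"]), network_edp(edp, ["B"]))
combined = network_edp(edp, ["A", "B"])
print(single / combined)  # EDP gain of combining the 2 SUs
```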



Used hardware architecture

  • 2 memory levels
  • Swept over the L1 BWs to investigate their impact
  • Is the optimal SU a function of the BWs?

                  BW W   BW I   BW O
Architecture 1    2048   2048   2048
Architecture 2    4096   1024   1024
Architecture 3    1024   4096   1024
Architecture 4    1024   1024   4096


Results & selected SUs

EDP is lowest when all BWs are equal, which makes sense 🡪 more scheduling options

Gain is highest when BW I is highest, but all gains are lower than for 1 shared memory level


Results & selected SUs

For a big PE array size:

1 SU:

  • Function of the architecture

Combination of 2 SUs:

  • Always Oxy unrolling (present in each layer)
  • 1st SU: C/K unrolling
  • 2nd SU: G/Fx/Fy unrolling


Results

  • Implicit impact: architecture & layer dimensions 🡪 SU
  • Most selected inner TU loop for each SU

  • General insights:
    • For non-DW layers: 'for K' when Cu > Ku, and vice versa
    • For DW layers: always 'for Fx/Fy' instead, when already included in the SU

SU (unrolling factors)          non-DW   DW
1 SU:        1, 1, 16, 4, 128   For C    For Fx
2 SUs, nr 1: 1, 64, 2, 1, 64    For K    For Fx
2 SUs, nr 2: 1, 2, 4, 16, 64    For C    For Ox
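One reading of the general insights above, written as a small selection rule (a paraphrase for illustration, not the framework's actual code; the `fx_fy_in_su` flag is an assumption):

```python
def inner_tu_loop(is_dw, Cu, Ku, fx_fy_in_su=True):
    """Pick the innermost temporal loop for a layer/SU pair.
    Cu/Ku: spatial unrolling factors of C and K in the chosen SU;
    fx_fy_in_su: whether Fx/Fy unrolling is part of the SU."""
    if is_dw:
        # depthwise: loop over the filter dims when they sit in the SU,
        # otherwise fall back to an output dimension
        return "for Fx/Fy" if fx_fy_in_su else "for Ox"
    # non-DW: iterate over whichever of C/K is less spatially unrolled
    return "for K" if Cu > Ku else "for C"

print(inner_tu_loop(is_dw=False, Cu=16, Ku=4))  # -> for K
print(inner_tu_loop(is_dw=False, Cu=2, Ku=64))  # -> for C
print(inner_tu_loop(is_dw=True, Cu=1, Ku=1))    # -> for Fx/Fy
```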


Results

  • Big influence of both SU and TU
    • Spatial unrolling depends on the layer dimensions
    • Temporal unrolling depends on the architecture (BWs of the memories)
  • More important for big PE arrays
    • Because it is more difficult to keep all PEs busy
  • Combination of 2 SUs: always C/K with G/Fx/Fy
    • Independent of the architecture
  • SU & memory BWs have an impact on the ideal stationarity

  • Future work: impact of overhead