1 of 36

Systolic Arrays & Their Applications

DR.SANGEETA ARORA

(HOD, P.G DEPT OF COMPUTER SCIENCE & IT)

2 of 36

Overview

What it is
N-body problem
Matrix multiplication
Cannon’s method
Other applications
Conclusion

3 of 36

What Is a Systolic Array?

A systolic array is an arrangement of processors in an array

where data flows synchronously across the array between

neighbors, usually with different data flowing in different directions

Each processor at each step takes in data from one or more

neighbors (e.g. North and West), processes it and, in the next

step, outputs results in the opposite direction (South and East).

�

H. T. Kung and Charles Leiserson were the first to publish

a paper on systolic arrays in 1978, and coined the name.

4 of 36

What Is a Systolic Array?

A specialized form of parallel computing.
Multiple processors connected by short wires.
Unlike many forms of parallelism which lose speed through their connection.
Cells(processors), compute data and store it independently of each other.

5 of 36

Systolic Unit(cell)

Each unit is an independent processor.
Every processor has some registers and an ALU.
The cells share information with their neighbors, after performing the needed operations on the data.

6 of 36

Some simple examples of systolic array models.

7 of 36

N-body Problem

Conventional method: N²
Systolic method: N
We will need one processing element for each body, lets call them planets.

8 of 36

B1

B2

B3

B6

B4

B5

Six planets in orbit with different masses, and different forces

acting on each other, so are systolic array will look like this.

9 of 36

Each processor will hold the newtonian force formula:

F_ij = km_im_j/d²_ij , k being the gravitational constant.

Then load the array with values.

B1

B2

B3

B4

B5

B6

Now that the array has the mass and the coordinates of

Each body we can begin are computation.

10 of 36

B1_m

B1_c

B2_m

B2_c

B5_m

B5_c

B4_m

B4_c

B6_m

B6_c

B3_m

B3_c

B6, B5, B4, B3, B2, B1

F_ij = km_im_j/d²_ij

11 of 36

B1_m

B1_c

B2_m

B2_c

B5_m

B5_c

B4_m

B4_c

B6*B1

B3_m

B3_c

B6, B5, B4, B3, B2

F_ij = km_im_j/d²_ij

12 of 36

B1_m

B1_c

B2_m

B2_c

B5*B1

B4_m

B4_c

B6*B2

B3_m

B3_c

B6, B5, B4, B3

F_ij = km_im_j/d²_ij

13 of 36

B1_m

B1_c

B2_m

B2_c

B5*B2

B4*B1

B6*B3

B3_m

B3_c

B6, B5, B4

F_ij = km_im_j/d²_ij

14 of 36

B1_m

B1_c

B2_m

B2_c

B5*B3

B4*B2

B6*B4

B3*B1

B6, B5

F_ij = km_im_j/d²_ij

15 of 36

B1_m

B1_c

B2*B1

B5*B4

B4*B3

B6*B5

B3*B2

B6

F_ij = km_im_j/d²_ij

16 of 36

Done

F_ij = km_im_j/d²_ij

17 of 36

Matrix Multiplication

a11 a12 a13

a21 a22 a23

a31 a32 a33

*

b11 b12 b13

b21 b22 b23

b31 b32 b33

=

c11 c12 c13

c21 c22 c23

c31 c32 c33

Conventional Method: N³

For I = 1 to N

For J = 1 to N

For K = 1 to N

C[I,J] = C[I,J] + A[J,K] * B[K,J];

18 of 36

Systolic Method

This will run in O(n) time!

To run in N time we need N x N processing units, in this case

we need 9.

P9

P8

P7

P6

P5

P4

P1

P2

P3

19 of 36

We need to modify the input data, like so:

a13 a12 a11

a23 a22 a21

a33 a32 a31

b31 b32 b33

b21 b22 b23

b11 b12 b13

Flip columns 1 & 3

Flip rows 1 & 3

and finally stagger the data sets for input.

20 of 36

P9

P8

P7

P6

P5

P4

P1

P2

P3

a13 a12 a11

a23 a22 a21

a33 a32 a31

b31

b21

b11

b32

b22

b12

b33

b23

b13

At every tick of the global system clock data is passed to each

processor from two different directions, then it is multiplied

and the result is saved in a register.

21 of 36

3 4 2

2 5 3

3 2 5

*

=

3 4 2

2 5 3

3 2 5

23 36 28

25 39 34

28 32 37

Lets try this using a systolic array.

P9

P8

P7

P6

P5

P4

P1

P2

P3

2 4 3

3 5 2

3

2

3

5 2 3

5

3

2

5

4

22 of 36

3*3

2 4

3 5 2

3

2

5 2 3

5

3

2

5

4

Clock tick: 1

9	0	0	0	0	0	0	0	0

P1

P2

P3

P4

P6

P5

P7

P8

P9

23 of 36

2*3

4*2

3*4

2

3 5

3

5 2 3

5

3

2

5

Clock tick: 2

17	12	0	6	0	0	0	0	0

P1

P2

P3

P4

P6

P5

P7

P8

P9

24 of 36

3*3

2*4

5*2

2*3

4*5

3*2

3

5 2

5

3

2

Clock tick: 3

23	32	6	16	8	0	9	0	0

P1

P2

P3

P4

P6

P5

P7

P8

P9

25 of 36

3*4

2*2

5*5

3*3

2*2

4*3

5

Clock tick: 4

23	36	18	25	33	4	13	12	0

P1

P2

P3

P4

P6

P5

P7

P8

P9

26 of 36

3*2

5*2

5*3

3*2

2*5

Clock tick: 5

23	36	28	25	39	19	28	22	6

P1

P2

P3

P4

P6

P5

P7

P8

P9

27 of 36

2*3

5*2

3*5

Clock tick: 6

23	36	28	25	39	34	28	32	12

P1

P2

P3

P4

P6

P5

P7

P8

P9

28 of 36

5*5

Clock tick: 7

23	36	28	25	39	34	28	32	37

P1

P2

P3

P4

P6

P5

P7

P8

P9

29 of 36

37

32

28

34

39

25

23

36

28

23	36	28	25	39	34	28	32	37

P1

P2

P3

P4

P6

P5

P7

P8

P9

Same answer! In 2n + 1 time, can we do better?

The answer is yes, there is an optimization.

30 of 36

Cannon’s Technique

No processor is idle.
Instead of feeding the data, it is cycled.
Data is staggered, but slightly different than before.
Not including the loading which can be done in parallel for a time of 1, this technique is effectively N.

31 of 36

Other Applications

Signal processing
Image processing
Solutions of differential equations
Graph algorithms
Biological sequence comparison
Other computationally intensive tasks

32 of 36

Samba: Systolic Accelerator for Molecular Biological Applications

This systolic array contains 128 processors shared into 32 full custom

VLSI chips. One chip houses 4 processors, and one processor performs

10 millions matrix cells per second.

33 of 36

Why Systolic?

Extremely fast.
Easily scalable architecture.
Can do many tasks single processor machines cannot attain.
Turns some exponential problems into linear or polynomial time.

34 of 36

Why Not Systolic?

Expensive.
Not needed on most applications, they are a highly specialized processor type.
Difficult to implement and build.

35 of 36

Summary

What a systolic array is.
Step through of matrix multiplication.
Cannon’s optimization.
Other applications including samba.
Positives and negatives of systolic arrays.

36 of 36

THANK YOU