Systolic Arrays & Their Applications
DR.SANGEETA ARORA
(HOD, P.G DEPT OF COMPUTER SCIENCE & IT)
Overview
What Is a Systolic Array?
A systolic array is an arrangement of processors in an array
where data flows synchronously across the array between
neighbors, usually with different data flowing in different directions
Each processor at each step takes in data from one or more
neighbors (e.g. North and West), processes it and, in the next
step, outputs results in the opposite direction (South and East).
�
�
H. T. Kung and Charles Leiserson were the first to publish
a paper on systolic arrays in 1978, and coined the name.
What Is a Systolic Array?
Systolic Unit(cell)
Some simple examples of systolic array models.
N-body Problem
B1
B2
B3
B6
B4
B5
Six planets in orbit with different masses, and different forces
acting on each other, so are systolic array will look like this.
Each processor will hold the newtonian force formula:
Fij = kmi mj /d2ij , k being the gravitational constant.
Then load the array with values.
B1
B2
B3
B4
B5
B6
Now that the array has the mass and the coordinates of
Each body we can begin are computation.
B1m
B1c
B2m
B2c
B5m
B5c
B4m
B4c
B6m
B6c
B3m
B3c
B6, B5, B4, B3, B2, B1
Fij = kmimj/d2ij
B1m
B1c
B2m
B2c
B5m
B5c
B4m
B4c
B6*B1
B3m
B3c
B6, B5, B4, B3, B2
Fij = kmimj/d2ij
B1m
B1c
B2m
B2c
B5*B1
B4m
B4c
B6*B2
B3m
B3c
B6, B5, B4, B3
Fij = kmimj/d2ij
B1m
B1c
B2m
B2c
B5*B2
B4*B1
B6*B3
B3m
B3c
B6, B5, B4
Fij = kmimj/d2ij
B1m
B1c
B2m
B2c
B5*B3
B4*B2
B6*B4
B3*B1
B6, B5
Fij = kmimj/d2ij
B1m
B1c
B2*B1
B5*B4
B4*B3
B6*B5
B3*B2
B6
Fij = kmimj/d2ij
Done
Done
Done
Done
Done
Done
Fij = kmimj/d2ij
Matrix Multiplication
a11 a12 a13
a21 a22 a23
a31 a32 a33
*
b11 b12 b13
b21 b22 b23
b31 b32 b33
=
c11 c12 c13
c21 c22 c23
c31 c32 c33
Conventional Method: N3
For I = 1 to N
For J = 1 to N
For K = 1 to N
C[I,J] = C[I,J] + A[J,K] * B[K,J];
Systolic Method
This will run in O(n) time!
To run in N time we need N x N processing units, in this case
we need 9.
P9
P8
P7
P6
P5
P4
P1
P2
P3
We need to modify the input data, like so:
a13 a12 a11
a23 a22 a21
a33 a32 a31
b31 b32 b33
b21 b22 b23
b11 b12 b13
Flip columns 1 & 3
Flip rows 1 & 3
and finally stagger the data sets for input.
P9
P8
P7
P6
P5
P4
P1
P2
P3
a13 a12 a11
a23 a22 a21
a33 a32 a31
b31
b21
b11
b32
b22
b12
b33
b23
b13
At every tick of the global system clock data is passed to each
processor from two different directions, then it is multiplied
and the result is saved in a register.
3 4 2
2 5 3
3 2 5
*
=
3 4 2
2 5 3
3 2 5
23 36 28
25 39 34
28 32 37
Lets try this using a systolic array.
P9
P8
P7
P6
P5
P4
P1
P2
P3
2 4 3
3 5 2
3
2
3
5 2 3
5
3
2
2
5
4
3*3
2 4
3 5 2
3
2
5 2 3
5
3
2
2
5
4
Clock tick: 1
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
2*3
4*2
3*4
2
3 5
3
5 2 3
5
3
2
2
5
Clock tick: 2
17 | 12 | 0 | 6 | 0 | 0 | 0 | 0 | 0 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
3*3
2*4
5*2
2*3
4*5
3*2
3
5 2
5
3
2
Clock tick: 3
23 | 32 | 6 | 16 | 8 | 0 | 9 | 0 | 0 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
3*4
2*2
2*2
5*5
3*3
2*2
4*3
5
5
Clock tick: 4
23 | 36 | 18 | 25 | 33 | 4 | 13 | 12 | 0 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
3*2
5*2
5*3
5*3
3*2
2*5
Clock tick: 5
23 | 36 | 28 | 25 | 39 | 19 | 28 | 22 | 6 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
2*3
5*2
3*5
Clock tick: 6
23 | 36 | 28 | 25 | 39 | 34 | 28 | 32 | 12 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
5*5
Clock tick: 7
23 | 36 | 28 | 25 | 39 | 34 | 28 | 32 | 37 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
37
32
28
34
39
25
23
36
28
23 | 36 | 28 | 25 | 39 | 34 | 28 | 32 | 37 |
P1
P2
P3
P4
P6
P5
P7
P8
P9
Same answer! In 2n + 1 time, can we do better?
The answer is yes, there is an optimization.
Cannon’s Technique
Other Applications
Samba: Systolic Accelerator for Molecular Biological Applications
This systolic array contains 128 processors shared into 32 full custom
VLSI chips. One chip houses 4 processors, and one processor performs
10 millions matrix cells per second.
Why Systolic?
Why Not Systolic?
Summary
THANK YOU