Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms
Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki
UCLA
MICRO 2019
Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: efficiency/generality spectrum]
- Domain-specific accelerators: maximum efficiency, with support for application-specific dependencies (reason: control/memory data-dependence).
- Domain-agnostic designs (CPUs): maximum generality, but rely on vectorization and prefetching.
Our goal: a domain-agnostic accelerator that retains support for application-specific dependencies, approaching domain-specific efficiency without giving up generality.
Programmable accelerators (e.g. GPUs) fail to handle arbitrary control/memory dependence

[Figure: two failure modes]
- Control dependence: arbitrary code execution — branches select among Code 1–4.
- Memory dependence: arbitrary access locations — a request vector (a[3], a[0], a[5], a[1]) gathers from memory (a[0]..a[5]).

Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.
Outline
Irregularity is Ubiquitous

- Machine learning: sparsity within the dataset — pruned neural networks, decision tree building, Bayesian networks.
- Databases: the purpose is to reorder data — sorting (e.g. 4 2 3 6 5 1 7 → 1 2 3 4 5 6 7), database join (e.g. Table Z = Inner Join(X, Y)).
- Graphs: data structures representing relationships — e.g. triangle counting.
Irregularity Stems from Data-dependence

Main insight: there are narrow forms of dependence that cover the data-dependent aspects of execution:
- Restricted control flow: stream-join
- Restricted memory access: alias-free indirection
Algorithm Classification

A spectrum from regular to general irregularity:
- Regular: no control/memory dependence
- Stream-join: restricted control dependence
- Alias-free indirect: restricted memory dependence
Regular Example: Dense Matrix Multiply

[Figure: output vector C (N) = input matrix B (N×N) × input vector A (N), computed as dense multiply-and-sum — every element is processed, even though most entries are zero]
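The regular baseline has no data-dependent control or memory access — every address is an affine function of the loop indices. A minimal Python sketch (our illustration, not the paper's code):

```python
def dense_matvec(B, A):
    """C[i] = sum_j B[i][j] * A[j]; every multiply executes,
    even when most operands are zero."""
    N = len(A)
    return [sum(B[i][j] * A[j] for j in range(N)) for i in range(N)]

# Because all accesses are affine, an accelerator can stream B row by
# row and A repeatedly, with no dependence checks at all.
```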
Sparse matrix-multiply can be implemented in two ways

1. Sparse Inner Product Multiply (stream-join)

[Figure: CSR (Compressed Sparse Row) operands —
A:    idx = {2, 3, 4}, val = {2, 3, 5}
B[0]: idx = {0, 1, 3}, val = {1, 4, 1}
The index streams are compared; only where they match (here idx 3) does a multiply fire: total += 3*1. A conditional output of 0 means no multiplication.]
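The inner-product formulation is the classic two-pointer walk over sorted index streams; this sketch (illustrative, not the paper's code) shows the control dependence that stream-join specializes:

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Inner product of two sparse vectors in sorted idx/val (CSR-row)
    form; which pointer advances depends on the data."""
    i = j = total = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:       # match: multiply-accumulate
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:      # mismatch: advance smaller index
            i += 1
        else:
            j += 1
    return total

# Slide's example: only idx 3 matches, so total = 3 * 1.
```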
2. Sparse Outer Product Multiply (Alias-free Indirection)

[Figure: CSC (Compressed Sparse Column) operands —
A: idx/val streams {2, 3, 4} and {1, 3, 5}
B columns col0–col3: idx = {0, 1, 5, 3, 4, 0, 3, 5, 0, 3}, val = {1, 2, 2, 3, 2, 4, 3, 5, 1, 1}
Each nonzero of A scales one column of B, and the partial products are scattered by row index into the accumulated output vector C.]
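A sketch of the outer-product formulation (column layout here is a simplified stand-in for CSC, keyed by column index): the scatter addresses come from data, but entries within one column never collide, which is what makes the indirection alias-free.

```python
def sparse_outer_matvec(a_idx, a_val, b_col_idx, b_col_val, n):
    """Outer-product sparse multiply: each nonzero A[k] scales column k
    of B; partial products are accumulated into C by row index.
    b_col_idx/b_col_val map a column number to its row-index and
    value lists (a simplified CSC)."""
    c = [0] * n
    for k, a in zip(a_idx, a_val):
        for r, v in zip(b_col_idx[k], b_col_val[k]):
            c[r] += a * v    # indirect accumulate: address r comes from data
    return c
```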
Graph Mining (e.g. Triangle Counting)

[Figure: six-node graph a–f with edge list (b,d) (a,c) (e,b) (d,e) (f,a) (c,f) (b,c) (c,d).
Finding common neighbors of an edge's endpoints is a stream-join over sorted neighbor lists; fetching those neighbor lists from the edge array is an alias-free indirect access.]
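Triangle counting combines both forms: a sorted-list intersection (stream-join) per edge, driven by indirect fetches of neighbor lists. A minimal sketch, assuming sorted adjacency lists:

```python
def count_triangles(adj):
    """adj maps each vertex to its sorted neighbor list. For each edge
    (u, v) with v > u, intersect adj[u] and adj[v] (a stream-join) and
    count common neighbors w > v so each triangle is counted once."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue              # visit each undirected edge once
            nu, nv = adj[u], adj[v]   # alias-free indirect fetches
            i = j = 0
            while i < len(nu) and j < len(nv):
                if nu[i] == nv[j]:
                    if nu[i] > v:     # w > v > u: count the triangle once
                        count += 1
                    i += 1
                    j += 1
                elif nu[i] < nv[j]:
                    i += 1
                else:
                    j += 1
    return count
```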
How each workload maps to the two dependence forms:

Domain           | Workload               | Stream Join (irreg. control) | Alias-free Indirection (irregular memory)
Machine Learning | Neural Net (FC + Conv) | Inner Product Mult.          | Outer Product Mult.
                 | Supp. Vector (SVM)     | ""                           | ""
                 | Decision Trees (GBDT)  | Condition on node type       | Sparse data access + Histogramming
                 | Bayesian Networks      |                              | + DAG Access
Databases        | Join (inner)           | Sort-Join                    | Hash-Join
                 | Sort                   | Merge-Sort                   | Radix-Sort
                 | Filter                 | Generate Filtered Col.       | Generate Column Ind.
Graph            | Page Rank & BFS        | Sparse join of active list   | + Indirect acc. for edges
                 | Triangle Counting      | Find common neighbor edges   | + Indirect acc. for edges
Outline
Approach: Start with a Dense Programmable Accelerator

[Figure: stereotypical dense accelerator core — a systolic array fed by a wide scratchpad, plus control logic; e.g. Google TPU (ISCA'17), PuDianNao (ASPLOS'15), Tabla (HPCA'16)]
Specializing for Stream Join
Novel Dataflow for Stream Join

Sparse MM example (CSR operands: A idx = {2, 3, 4}, val = {2, 3, 5}; B[0] idx = {0, 1, 3}, val = {1, 4, 1}).

Traditional dataflow: address generators and ++ increments feed loads of idxA/idxB; a compare node's outcomes (=, <=, >=) decide which pointer advances, which feeds further address generation, the loads of valA/valB, a multiply, and an accumulator. Mapped onto a systolic array of PEs, this requires control-dependent loads, a cyclic dependence, and an unpredictable branch.

Novel stream-join dataflow: idxA/idxB/valA/valB arrive as streams. A single compare node emits >, <, = control bits (c) that tell each input port to either reuse its current head or consume it (discarding the mismatched value); on a match, the product feeds the accumulator (init/reset between rows).

[Animation: comparing idxA = 2 against idxB = 0 and then 1 yields '<' each time, so B's head is consumed and A's reused; at idxA = 3 vs idxB = 3 the match fires, multiplying values 3 and 1 into the accumulator.]
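The reuse/consume/discard control in the figure can be modeled in software. This sketch (our own illustration of the semantics, not the hardware's ISA) drives the four streams through explicit FIFOs, with the compare node deciding per element whether each FIFO pops or reuses its head:

```python
from collections import deque

def stream_join(idx_a, val_a, idx_b, val_b):
    """Model of the stream-join dataflow: the compare node emits
    <, =, > and the control bits select reuse (keep the FIFO head)
    vs consume (pop it, discarding the mismatched value)."""
    fa = deque(zip(idx_a, val_a))
    fb = deque(zip(idx_b, val_b))
    acc = 0                        # 'init'; 'reset' would clear it per row
    while fa and fb:
        (ia, va), (ib, vb) = fa[0], fb[0]
        if ia == ib:               # '=': fire the multiply, consume both
            acc += va * vb
            fa.popleft()
            fb.popleft()
        elif ia < ib:              # '<': consume A (discard va), reuse B
            fa.popleft()
        else:                      # '>': consume B (discard vb), reuse A
            fb.popleft()
    return acc
```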
Other Kernels as Stream Join

[Figure: three more stream-join dataflow graphs —
- Database join: key1/key2 streams feed a compare (>, <, =); on a match, the val1/val2 elements are concatenated (cat) onto the output stream; otherwise one side is reused and the other consumed and discarded.
- Merge (sort): key1/key2 streams feed a compare; a mux forwards the smaller key to the output stream and reuses the larger.
- Resparsify (filter): an input stream passes through ReLU (max); a compare against 0 discards zeros and emits the surviving values and their indices (s-ind, s-val), with an accumulating index counter i.]
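Two of these graphs can be sketched directly (illustrative code; the join assumes unique sorted keys, so no cross products arise):

```python
def stream_merge(key1, key2):
    """Merge step of merge-sort as a stream-join: compare the stream
    heads, mux the smaller key to the output, reuse the larger."""
    out, i, j = [], 0, 0
    while i < len(key1) and j < len(key2):
        if key1[i] <= key2[j]:
            out.append(key1[i]); i += 1   # consume stream 1, reuse 2
        else:
            out.append(key2[j]); j += 1   # consume stream 2, reuse 1
    return out + key1[i:] + key2[j:]      # drain the leftover stream

def merge_join(key1, val1, key2, val2):
    """Database inner join over sorted, unique keys: on '=', the two
    value elements are concatenated onto the output stream."""
    out, i, j = [], 0, 0
    while i < len(key1) and j < len(key2):
        if key1[i] == key2[j]:
            out.append((key1[i], val1[i], val2[j]))
            i += 1; j += 1
        elif key1[i] < key2[j]:
            i += 1                        # discard the unmatched tuple
        else:
            j += 1
    return out
```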
Supporting Stream-Join in Hardware

[Figure: "systolic" CGRA — a grid of PEs with stream-join, flow-control, and dataflow support. Each CGRA processing element contains a functional unit, input FIFOs (FIFO0–FIFO2), an accumulator (ACC), and control (CLT) implementing the reuse/discard/reset primitives; switches route among NE/NW/SE/SW neighbors.]
Specializing for Indirection
Indirection with guaranteed alias-freedom

[Figure: traditional banked memory (typical GPU) vs SPU banked memory. In the GPU, addresses and data pass through an arbitrated crossbar to banks 0..n, and a dependence check over each request vector serializes a vectorized read (req1) behind a vectorized write (req2). In the SPU, address generation feeds per-bank queues through the crossbar, with an indirect reorder buffer on the read path; indirect updates and indirect accesses proceed without a dependence check, and indirect reads are reordered.]

Observation: the dependence check serializes a group of non-conflicting requests, even when it is known a priori that they will not alias — the serialization is unnecessary.
Idea: SPU allows aggressive reordering with a simple reorder buffer.
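The alias-freedom contract can be modeled as follows (a behavioral sketch, not the microarchitecture): because the programmer guarantees that addresses within one request vector are pairwise distinct, per-bank queues may drain in any order and still produce the same memory state, with no dependence check.

```python
import random

def banked_indirect_write(mem, addrs, vals, nbanks=4, seed=1):
    """Model of an alias-free indirect write vector: requests are
    binned into per-bank queues and each bank drains in an arbitrary
    order. Distinct addresses (the alias-free contract) mean every
    service order yields the same final memory state."""
    assert len(set(addrs)) == len(addrs), "contract: alias-free vector"
    queues = [[] for _ in range(nbanks)]
    for a, v in zip(addrs, vals):
        queues[a % nbanks].append((a, v))   # bin by bank
    rng = random.Random(seed)
    for q in queues:
        rng.shuffle(q)                      # banks may reorder freely
        for a, v in q:
            mem[a] = v                      # commit, no dependence check
    return mem
```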
Indirect Access Reordering in Scratchpad

[Figure: timeline comparison — the SPU overlaps a vectorized read (req1) and a vectorized write (req2) that a typical GPU would serialize.]
SPU: Sparse Processing Unit

[Figure: grid of SPU cores connected to main memory through a memory stream engine]
- Network: traditional mesh NoC.
- Scratchpads in a global address space, for nearby-core communication.
- Broadcast: via the memory stream engine.
- Synchronization: dataflow counters (synchronize on scratchpad writes).
Outline
Methodology

Benchmarks (software comparison points):

Workload   | CPU        | GPU
GBDT       | LightGBM   | LightGBM
Kernel-SVM | LibSVM     | Hand-coded
AC         | Hand-coded | Hand-coded
FC         | MKL SPBLAS | cuSparse
Conv layer | MKL-DNN    | cuDNN
Graph Alg. | Graphmat   | -
TPCH       | MonetDB    | -

Datasets, with varying sparsity (density in parentheses):

Wkld  |                  |                  |                |
GBDT  | Cifar10-bin (1)  | Higgs-bin (0.28) | Yahoo-bin (0.05) | Ltrc-bin (0.008)
KSVM  | Connect (0.33)   | Higgs (0.92)     | Yahoo (0.59)     | Ltrc (0.24)
CONV  | VGG-3 (0.34)     | VGG-4 (0.1)      | ALEX-2 (0.14)    | RES-1 (0.05)
FC    | VGG-12 (0.04)    | VGG-13 (0.09)    | ALEX-6 (0.16)    | RES-1 (0.22)
AC    | Pigs             | Munin            | Andes            | Mildew
Graph | Flickr           | Fb-artist        | NY-road          |
Domain-agnostic comparison points

[Figure: overall and core designs of the three comparison points]
- P4000 Pascal GPU: SMs (control, FUs, L1 cache, shared memory) over main memory; 4MB on-chip memory, 3696 FP units, 243 GB/s memory bandwidth.
- SPU-inorder: an array of in-order CPU cores with SIMD units and linear/banked scratchpads; 3MB on-chip memory, 2632 FP units, 256 GB/s.
- SPU: SPU cores built around a systolic CGRA with linear/banked scratchpads; 3MB on-chip memory, 2560 FP units, 256 GB/s.
Domain-specific comparison points

- SCNN (sparse convolution): ISCA'17
- EIE (sparse fully connected): ISCA'16
- Graphicionado (graph analytics): MICRO'16
- Q100 (databases): ASPLOS'14
Overall Results

[Figure: performance of GPU, SPU-inorder, and SPU compared against the domain-specific ASICs (EIE, SCNN, Q100, Graphicionado)]
Cost of adding stream-join in systolic CGRA

Methodology: SPU's DGRA is implemented in Chisel and synthesized using Synopsys DC with a 28nm UMC technology library.
Compared to the whole design, stream-join support adds 6.9% area overhead and 14.2% power overhead.
Conclusion

[Figure: qualitative placement of CPU, GPU, and SPU against domain-specific designs (SCNN for Conv, EIE for FC, Graphicionado for graphs, Intel SpM-SpV, Q100, Outerspace) along two axes: efficiency on stream-join algorithms and efficiency on alias-free indirection algorithms]
EXTRA SLIDES
Programming SPU
Example of gradient boosting decision trees (GBDT)
Approach Overview: Stream-Dataflow ISA

[Figure: sparsity-enabled SPU core]
- Streaming memory: streams of data (e.g. A[0..N], B[0..N]) fetched from memory and stored back to memory; streams also flow from the wide scratchpad.
- Dataflow computation: a dependence graph (DFG) with input/output vector ports, mapped onto the systolic array.
[Backup: the conclusion figure repeated with additional comparison points — vector-thread, Plasticine, triggered instructions. Note: the relative positions are the best to our knowledge.]
Indirect Access Reordering in Scratchpad (backup)

[Figure: IROB (indirect reorder buffer) contents at cycle 2, typical GPU vs SPU]
SPU: Sparse Processing Unit — communication patterns

[Figure: three SPU core-grid configurations (each with main memory and a global stream engine) illustrating the supported communication styles — independent lanes (with/without communication), local spatial communication, and pipelined communication — with examples: fully connected layer (broadcast a row), sparse convolution (communication with neighbors for halos), arithmetic circuits via pipelined DAG traversal (pipeline node updates), and graph processing (core-core communication)]
How much density is exploitable?
Alias-free Indirection Abstractions

1: Indirect memory — d = a[b[i]]

Example C code:
    struct { int f1, f2; } a[N];
    for (i = 0; i < N; i++) {
        ind = b[i];
        .. = a[ind].f1;
        .. = a[ind].f2;
    }

Stream code:
    str1 = load(b[0..n])
    ind_load(addr=str1, offset_list={0,4})

2: Histogram (update stream) — a[hist_bin] += c
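Both abstractions reduce to simple loops; a behavioral Python sketch (illustrative only):

```python
def indirect_gather(a, b):
    """Abstraction 1, indirect memory: d[i] = a[b[i]].
    The index stream b drives the loads from a."""
    return [a[ind] for ind in b]

def histogram(bin_idx, weights, nbins):
    """Abstraction 2, histogram update stream: a[hist_bin] += c.
    Each element performs an indirect read-modify-write."""
    h = [0] * nbins
    for b, c in zip(bin_idx, weights):
        h[b] += c
    return h
```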
Sparsity-Enhanced Memory Micro-architecture

We keep two logical scratchpad memories: banked and linear.

[Figure: the control unit holds linear address generation (linear access stream table) and indirect address generation (indirect rd/wr/atomic stream table); a MUX/Sel feeds an arbiter and crossbar (e.g. 16x32) into the rd-wr bank queues of the composable banked scratchpad. The linear scratchpad handles plain reads/writes; an indirect ROB returns data to the compute fabric; control logic connects to the NoC.]
Benefits of Heterogeneity on CGRA

[Figure: example DFG and its mapping to the DGRA; CGRA datapath width = 64 bits, initial datatypes = 16 bits]
Sparsity-Enhanced Computation Micro-architecture

- DGRA switch: same external interface, but splits inputs and outputs.
- DGRA processing element: decomposed into fine-grained PEs.
Cost of Decomposability & Stream-Join

[Figure: cost of stream-join in the systolic array as the design becomes more decomposable]
Remaining Challenges
Domain-agnostic comparison points (backup)

[Figure: overall and core designs — a 24-core Intel Skylake CPU (OOO cores with L1/L2 caches; 2.5MB on-chip memory, 2048 FP units), the P4000 Pascal GPU (SMs with control, FUs, L1 cache, shared memory; 4MB on-chip memory, 3584 FP units), and SPU-inorder (in-order cores with control, FUs, and linear/banked scratchpads over main memory)]