Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms
Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki
UCLA
MICRO 2019
Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: efficiency/generality spectrum]
- Domain-specific accelerators: maximum efficiency, with support for application-specific dependencies (reason: control/memory data-dependence).
- Domain-agnostic designs (CPUs): maximum generality, but rely on vectorization and prefetching.
Our goal: a domain-agnostic accelerator that retains support for application-specific dependencies, approaching domain-specific efficiency without giving up generality.
Programmable accelerators (e.g. GPUs) fail to handle arbitrary control/memory dependence

[Figure: two failure modes]
- Control dependence: arbitrary code execution — branches select among Code 1–4.
- Memory dependence: arbitrary access locations — a request vector (a[3], a[0], a[5], a[1]) gathers from memory (a[0]..a[5]).

Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.
Outline
Irregularity is Ubiquitous

- Machine learning: sparsity within the dataset — pruned neural networks, decision tree building, Bayesian networks.
- Databases: the purpose is to reorder data — sorting (e.g. 4 2 3 6 5 1 7 → 1 2 3 4 5 6 7), database join (e.g. Table Z = Inner Join(X, Y)).
- Graphs: data structures representing relationships — e.g. triangle counting.
Irregularity Stems from Data-dependence

Main insight: there are narrow forms of dependence that cover the data-dependent aspects of execution:
- Restricted control flow: stream-join
- Restricted memory access: alias-free indirection
Algorithm Classification

A spectrum from regular to general irregularity:
- Regular: no control/memory dependence
- Stream-join: restricted control dependence
- Alias-free indirect: restricted memory dependence
Regular Example: Dense Matrix Multiply

[Figure: output vector C (N) = input matrix B (N×N) × input vector A (N), computed as dense multiply-and-sum — every element is processed, even though most entries are zero]
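The regular baseline has no data-dependent control or memory access — every address is an affine function of the loop indices. A minimal Python sketch (our illustration, not the paper's code):

```python
def dense_matvec(B, A):
    """C[i] = sum_j B[i][j] * A[j]; every multiply executes,
    even when most operands are zero."""
    N = len(A)
    return [sum(B[i][j] * A[j] for j in range(N)) for i in range(N)]

# Because all accesses are affine, an accelerator can stream B row by
# row and A repeatedly, with no dependence checks at all.
```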
Sparse matrix-multiply can be implemented in two ways

1. Sparse Inner Product Multiply (stream-join)

[Figure: CSR (Compressed Sparse Row) operands —
A:    idx = {2, 3, 4}, val = {2, 3, 5}
B[0]: idx = {0, 1, 3}, val = {1, 4, 1}
The index streams are compared; only where they match (here idx 3) does a multiply fire: total += 3*1. A conditional output of 0 means no multiplication.]
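The inner-product formulation is the classic two-pointer walk over sorted index streams; this sketch (illustrative, not the paper's code) shows the control dependence that stream-join specializes:

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Inner product of two sparse vectors in sorted idx/val (CSR-row)
    form; which pointer advances depends on the data."""
    i = j = total = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:       # match: multiply-accumulate
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:      # mismatch: advance smaller index
            i += 1
        else:
            j += 1
    return total

# Slide's example: only idx 3 matches, so total = 3 * 1.
```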
2. Sparse Outer Product Multiply (Alias-free Indirection)

[Figure: CSC (Compressed Sparse Column) operands —
A: idx/val streams {2, 3, 4} and {1, 3, 5}
B columns col0–col3: idx = {0, 1, 5, 3, 4, 0, 3, 5, 0, 3}, val = {1, 2, 2, 3, 2, 4, 3, 5, 1, 1}
Each nonzero of A scales one column of B, and the partial products are scattered by row index into the accumulated output vector C.]
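A sketch of the outer-product formulation (column layout here is a simplified stand-in for CSC, keyed by column index): the scatter addresses come from data, but entries within one column never collide, which is what makes the indirection alias-free.

```python
def sparse_outer_matvec(a_idx, a_val, b_col_idx, b_col_val, n):
    """Outer-product sparse multiply: each nonzero A[k] scales column k
    of B; partial products are accumulated into C by row index.
    b_col_idx/b_col_val map a column number to its row-index and
    value lists (a simplified CSC)."""
    c = [0] * n
    for k, a in zip(a_idx, a_val):
        for r, v in zip(b_col_idx[k], b_col_val[k]):
            c[r] += a * v    # indirect accumulate: address r comes from data
    return c
```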
Graph Mining (e.g. Triangle Counting)

[Figure: six-node graph a–f with edge list (b,d) (a,c) (e,b) (d,e) (f,a) (c,f) (b,c) (c,d).
Finding common neighbors of an edge's endpoints is a stream-join over sorted neighbor lists; fetching those neighbor lists from the edge array is an alias-free indirect access.]
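Triangle counting combines both forms: a sorted-list intersection (stream-join) per edge, driven by indirect fetches of neighbor lists. A minimal sketch, assuming sorted adjacency lists:

```python
def count_triangles(adj):
    """adj maps each vertex to its sorted neighbor list. For each edge
    (u, v) with v > u, intersect adj[u] and adj[v] (a stream-join) and
    count common neighbors w > v so each triangle is counted once."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue              # visit each undirected edge once
            nu, nv = adj[u], adj[v]   # alias-free indirect fetches
            i = j = 0
            while i < len(nu) and j < len(nv):
                if nu[i] == nv[j]:
                    if nu[i] > v:     # w > v > u: count the triangle once
                        count += 1
                    i += 1
                    j += 1
                elif nu[i] < nv[j]:
                    i += 1
                else:
                    j += 1
    return count
```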
How each workload maps to the two dependence forms:

Domain           | Workload               | Stream Join (irreg. control) | Alias-free Indirection (irregular memory)
Machine Learning | Neural Net (FC + Conv) | Inner Product Mult.          | Outer Product Mult.
                 | Supp. Vector (SVM)     | ""                           | ""
                 | Decision Trees (GBDT)  | Condition on node type       | Sparse data access + Histogramming
                 | Bayesian Networks      |                              | + DAG Access
Databases        | Join (inner)           | Sort-Join                    | Hash-Join
                 | Sort                   | Merge-Sort                   | Radix-Sort
                 | Filter                 | Generate Filtered Col.       | Generate Column Ind.
Graph            | Page Rank & BFS        | Sparse join of active list   | + Indirect acc. for edges
                 | Triangle Counting      | Find common neighbor edges   | + Indirect acc. for edges
Outline
Approach: Start with a Dense Programmable Accelerator

[Figure: stereotypical dense accelerator core — a systolic array fed by a wide scratchpad, plus control logic; e.g. Google TPU (ISCA'17), PuDianNao (ASPLOS'15), Tabla (HPCA'16)]
Specializing for Stream Join
Novel Dataflow for Stream Join

Sparse MM example (CSR operands: A idx = {2, 3, 4}, val = {2, 3, 5}; B[0] idx = {0, 1, 3}, val = {1, 4, 1}).

Traditional dataflow: address generators and ++ increments feed loads of idxA/idxB; a compare node's outcomes (=, <=, >=) decide which pointer advances, which feeds further address generation, the loads of valA/valB, a multiply, and an accumulator. Mapped onto a systolic array of PEs, this requires control-dependent loads, a cyclic dependence, and an unpredictable branch.

Novel stream-join dataflow: idxA/idxB/valA/valB arrive as streams. A single compare node emits >, <, = control bits (c) that tell each input port to either reuse its current head or consume it (discarding the mismatched value); on a match, the product feeds the accumulator (init/reset between rows).

[Animation: comparing idxA = 2 against idxB = 0 and then 1 yields '<' each time, so B's head is consumed and A's reused; at idxA = 3 vs idxB = 3 the match fires, multiplying values 3 and 1 into the accumulator.]
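The reuse/consume/discard control in the figure can be modeled in software. This sketch (our own illustration of the semantics, not the hardware's ISA) drives the four streams through explicit FIFOs, with the compare node deciding per element whether each FIFO pops or reuses its head:

```python
from collections import deque

def stream_join(idx_a, val_a, idx_b, val_b):
    """Model of the stream-join dataflow: the compare node emits
    <, =, > and the control bits select reuse (keep the FIFO head)
    vs consume (pop it, discarding the mismatched value)."""
    fa = deque(zip(idx_a, val_a))
    fb = deque(zip(idx_b, val_b))
    acc = 0                        # 'init'; 'reset' would clear it per row
    while fa and fb:
        (ia, va), (ib, vb) = fa[0], fb[0]
        if ia == ib:               # '=': fire the multiply, consume both
            acc += va * vb
            fa.popleft()
            fb.popleft()
        elif ia < ib:              # '<': consume A (discard va), reuse B
            fa.popleft()
        else:                      # '>': consume B (discard vb), reuse A
            fb.popleft()
    return acc
```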
Other Kernels as Stream Join

[Figure: three more stream-join dataflow graphs —
- Database join: key1/key2 streams feed a compare (>, <, =); on a match, the val1/val2 elements are concatenated (cat) onto the output stream; otherwise one side is reused and the other consumed and discarded.
- Merge (sort): key1/key2 streams feed a compare; a mux forwards the smaller key to the output stream and reuses the larger.
- Resparsify (filter): an input stream passes through ReLU (max); a compare against 0 discards zeros and emits the surviving values and their indices (s-ind, s-val), with an accumulating index counter i.]
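Two of these graphs can be sketched directly (illustrative code; the join assumes unique sorted keys, so no cross products arise):

```python
def stream_merge(key1, key2):
    """Merge step of merge-sort as a stream-join: compare the stream
    heads, mux the smaller key to the output, reuse the larger."""
    out, i, j = [], 0, 0
    while i < len(key1) and j < len(key2):
        if key1[i] <= key2[j]:
            out.append(key1[i]); i += 1   # consume stream 1, reuse 2
        else:
            out.append(key2[j]); j += 1   # consume stream 2, reuse 1
    return out + key1[i:] + key2[j:]      # drain the leftover stream

def merge_join(key1, val1, key2, val2):
    """Database inner join over sorted, unique keys: on '=', the two
    value elements are concatenated onto the output stream."""
    out, i, j = [], 0, 0
    while i < len(key1) and j < len(key2):
        if key1[i] == key2[j]:
            out.append((key1[i], val1[i], val2[j]))
            i += 1; j += 1
        elif key1[i] < key2[j]:
            i += 1                        # discard the unmatched tuple
        else:
            j += 1
    return out
```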
Supporting Stream-Join in Hardware

[Figure: "systolic" CGRA — a grid of PEs with stream-join, flow-control, and dataflow support. Each CGRA processing element contains a functional unit, input FIFOs (FIFO0–FIFO2), an accumulator (ACC), and control (CLT) implementing the reuse/discard/reset primitives; switches route among NE/NW/SE/SW neighbors.]
Specializing for Indirection
Indirection with guaranteed alias-freedom

[Figure: traditional banked memory (typical GPU) vs SPU banked memory. In the GPU, addresses and data pass through an arbitrated crossbar to banks 0..n, and a dependence check over each request vector serializes a vectorized read (req1) behind a vectorized write (req2). In the SPU, address generation feeds per-bank queues through the crossbar, with an indirect reorder buffer on the read path; indirect updates and indirect accesses proceed without a dependence check, and indirect reads are reordered.]

Observation: the dependence check serializes a group of non-conflicting requests, even when it is known a priori that they will not alias — the serialization is unnecessary.
Idea: SPU allows aggressive reordering with a simple reorder buffer.
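The alias-freedom contract can be modeled as follows (a behavioral sketch, not the microarchitecture): because the programmer guarantees that addresses within one request vector are pairwise distinct, per-bank queues may drain in any order and still produce the same memory state, with no dependence check.

```python
import random

def banked_indirect_write(mem, addrs, vals, nbanks=4, seed=1):
    """Model of an alias-free indirect write vector: requests are
    binned into per-bank queues and each bank drains in an arbitrary
    order. Distinct addresses (the alias-free contract) mean every
    service order yields the same final memory state."""
    assert len(set(addrs)) == len(addrs), "contract: alias-free vector"
    queues = [[] for _ in range(nbanks)]
    for a, v in zip(addrs, vals):
        queues[a % nbanks].append((a, v))   # bin by bank
    rng = random.Random(seed)
    for q in queues:
        rng.shuffle(q)                      # banks may reorder freely
        for a, v in q:
            mem[a] = v                      # commit, no dependence check
    return mem
```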
Indirect Access Reordering in Scratchpad

[Figure: timeline comparison — the SPU overlaps a vectorized read (req1) and a vectorized write (req2) that a typical GPU would serialize.]
SPU: Sparse Processing Unit

[Figure: grid of SPU cores connected to main memory through a memory stream engine]
- Network: traditional mesh NoC.
- Scratchpads in a global address space, for nearby-core communication.
- Broadcast: via the memory stream engine.
- Synchronization: dataflow counters (synchronize on scratchpad writes).
Outline
Methodology

Benchmarks (software comparison points):

Workload   | CPU        | GPU
GBDT       | LightGBM   | LightGBM
Kernel-SVM | LibSVM     | Hand-coded
AC         | Hand-coded | Hand-coded
FC         | MKL SPBLAS | cuSparse
Conv layer | MKL-DNN    | cuDNN
Graph Alg. | Graphmat   | -
TPCH       | MonetDB    | -

Datasets, with varying sparsity (density in parentheses):

Wkld  |                  |                  |                |
GBDT  | Cifar10-bin (1)  | Higgs-bin (0.28) | Yahoo-bin (0.05) | Ltrc-bin (0.008)
KSVM  | Connect (0.33)   | Higgs (0.92)     | Yahoo (0.59)     | Ltrc (0.24)
CONV  | VGG-3 (0.34)     | VGG-4 (0.1)      | ALEX-2 (0.14)    | RES-1 (0.05)
FC    | VGG-12 (0.04)    | VGG-13 (0.09)    | ALEX-6 (0.16)    | RES-1 (0.22)
AC    | Pigs             | Munin            | Andes            | Mildew
Graph | Flickr           | Fb-artist        | NY-road          |
Domain-agnostic comparison points

[Figure: overall and core designs of the three comparison points]
- P4000 Pascal GPU: SMs (control, FUs, L1 cache, shared memory) over main memory; 4MB on-chip memory, 3696 FP units, 243 GB/s memory bandwidth.
- SPU-inorder: an array of in-order CPU cores with SIMD units and linear/banked scratchpads; 3MB on-chip memory, 2632 FP units, 256 GB/s.
- SPU: SPU cores built around a systolic CGRA with linear/banked scratchpads; 3MB on-chip memory, 2560 FP units, 256 GB/s.
Domain-specific comparison points

- SCNN (sparse convolution): ISCA'17
- EIE (sparse fully connected): ISCA'16
- Graphicionado (graph analytics): MICRO'16
- Q100 (databases): ASPLOS'14
Overall Results

[Figure: performance of GPU, SPU-inorder, and SPU compared against the domain-specific ASICs (EIE, SCNN, Q100, Graphicionado)]
Cost of adding stream-join in systolic CGRA

Methodology: SPU's DGRA is implemented in Chisel and synthesized using Synopsys DC with a 28nm UMC technology library.
Compared to the whole design, stream-join support adds 6.9% area overhead and 14.2% power overhead.
Conclusion

[Figure: qualitative placement of CPU, GPU, and SPU against domain-specific designs (SCNN for Conv, EIE for FC, Graphicionado for graphs, Intel SpM-SpV, Q100, Outerspace) along two axes: efficiency on stream-join algorithms and efficiency on alias-free indirection algorithms]
EXTRA SLIDES
Programming SPU
Example of gradient boosting decision trees (GBDT)
Approach Overview: Stream-Dataflow ISA

[Figure: sparsity-enabled SPU core]
- Streaming memory: streams of data (e.g. A[0..N], B[0..N]) fetched from memory and stored back to memory; streams also flow from the wide scratchpad.
- Dataflow computation: a dependence graph (DFG) with input/output vector ports, mapped onto the systolic array.
[Backup: the conclusion figure repeated with additional comparison points — vector-thread, Plasticine, triggered instructions. Note: the relative positions are the best to our knowledge.]
Indirect Access Reordering in Scratchpad (backup)

[Figure: IROB (indirect reorder buffer) contents at cycle 2, typical GPU vs SPU]
SPU: Sparse Processing Unit — communication patterns

[Figure: three SPU core-grid configurations (each with main memory and a global stream engine) illustrating the supported communication styles — independent lanes (with/without communication), local spatial communication, and pipelined communication — with examples: fully connected layer (broadcast a row), sparse convolution (communication with neighbors for halos), arithmetic circuits via pipelined DAG traversal (pipeline node updates), and graph processing (core-core communication)]
How much density is exploitable?
Alias-free Indirection Abstractions

1: Indirect memory — d = a[b[i]]

Example C code:
    struct { int f1, f2; } a[N];
    for (i = 0; i < N; i++) {
        ind = b[i];
        .. = a[ind].f1;
        .. = a[ind].f2;
    }

Stream code:
    str1 = load(b[0..n])
    ind_load(addr=str1, offset_list={0,4})

2: Histogram (update stream) — a[hist_bin] += c
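Both abstractions reduce to simple loops; a behavioral Python sketch (illustrative only):

```python
def indirect_gather(a, b):
    """Abstraction 1, indirect memory: d[i] = a[b[i]].
    The index stream b drives the loads from a."""
    return [a[ind] for ind in b]

def histogram(bin_idx, weights, nbins):
    """Abstraction 2, histogram update stream: a[hist_bin] += c.
    Each element performs an indirect read-modify-write."""
    h = [0] * nbins
    for b, c in zip(bin_idx, weights):
        h[b] += c
    return h
```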
Sparsity-Enhanced Memory Micro-architecture

We keep two logical scratchpad memories: banked and linear.

[Figure: the control unit holds linear address generation (linear access stream table) and indirect address generation (indirect rd/wr/atomic stream table); a MUX/Sel feeds an arbiter and crossbar (e.g. 16x32) into the rd-wr bank queues of the composable banked scratchpad. The linear scratchpad handles plain reads/writes; an indirect ROB returns data to the compute fabric; control logic connects to the NoC.]
Benefits of Heterogeneity on CGRA

[Figure: example DFG and its mapping to the DGRA; CGRA datapath width = 64 bits, initial datatypes = 16 bits]
Sparsity-Enhanced Computation Micro-architecture

- DGRA switch: same external interface, but splits inputs and outputs.
- DGRA processing element: decomposed into fine-grained PEs.
Cost of Decomposability & Stream-Join

[Figure: cost of stream-join in the systolic array as the design becomes more decomposable]
Remaining Challenges
Domain-agnostic comparison points (backup)

[Figure: overall and core designs — a 24-core Intel Skylake CPU (OOO cores with L1/L2 caches; 2.5MB on-chip memory, 2048 FP units), the P4000 Pascal GPU (SMs with control, FUs, L1 cache, shared memory; 4MB on-chip memory, 3584 FP units), and SPU-inorder (in-order cores with control, FUs, and linear/banked scratchpads over main memory)]