1 of 55

Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms

Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki

UCLA

MICRO 2019

2 of 55

Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: an efficiency/generality spectrum. Domain-specific designs achieve maximum efficiency through support for application-specific dependences; the domain-agnostic CPU achieves maximum generality but relies on vectorization and prefetching. The reason for the gap: control/memory data-dependence.]

3 of 55

Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: the same spectrum, with OUR GOAL marked: a domain-agnostic design that reaches toward domain-specific efficiency by supporting application-specific dependences rather than relying only on vectorization and prefetching.]

4 of 55

Programmable Accelerators (e.g., GPUs) Fail to Handle Arbitrary Control/Memory Dependence

[Figure: control dependence: branches select among Code 1..Code 4 (arbitrary code execution). Memory dependence: a request vector (a[3], a[0], a[5], a[1]) touches arbitrary locations in memory holding a[0]..a[5].]

Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.

5 of 55

Outline

  • Irregularity is ubiquitous
    • Sufficient and Exploitable forms of Control and Memory dependence
    • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
    • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
    • SPU Multicore Design
  • Evaluating SPU
  • Conclusion


6 of 55

Irregularity is Ubiquitous

  • Sparsity within the dataset (machine learning)
  • Data structures representing relationships (graphs)
  • Reordering of data (databases)

[Figure: examples: pruned neural network, decision tree building, Bayesian networks, sorting, database join (Table Z = Inner Join(X, Y)), triangle counting.]

7 of 55

Irregularity Stems from Data-dependence

Main insight: there are narrow forms of dependence which are:

  • Sufficient to express many algorithms (from ML, graph analytics, databases)
  • Exploitable with minimal hardware overhead

Data-dependent aspects of execution:

  1. Control flow, e.g. if(f(a[i])): restricted form is stream-join
  2. Memory access, e.g. b[a[i]]: restricted form is alias-free indirection
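The two data-dependent forms can be made concrete with a minimal Python sketch; the function names are ours, for illustration only:

```python
# Minimal sketch of the two data-dependent execution forms (illustrative only).

def data_dependent_control(a, f):
    """Control flow `if (f(a[i]))`: which code runs depends on the data."""
    total = 0
    for x in a:
        if f(x):              # branch outcome is unknown until x is read
            total += x
    return total

def data_dependent_memory(a, b):
    """Memory access `b[a[i]]`: which address is touched depends on the data."""
    return [b[i] for i in a]  # a holds indices into b

print(data_dependent_control([1, -2, 3], lambda x: x > 0))  # 4
print(data_dependent_memory([2, 0], ['x', 'y', 'z']))       # ['z', 'x']
```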

8 of 55

Algorithm Classification

[Figure: a spectrum of irregularity from regular to general. Regular: no control/memory dependence. Stream-join: restricted control dependence. Alias-free indirect: restricted memory dependence.]

9 of 55

Regular Example: Dense Matrix Multiply

  • No data-dependence; i.e., the dynamic pattern of control and data access is known a priori.

[Figure: Input Vector A (N) × Input Matrix B (N×N) = Output Vector C (N), with mostly-zero entries illustrating sparsity.]

Sparse matrix multiply can be implemented in two ways:

  1. Inner product: data-dependent control
  2. Outer product: data-dependent memory

11 of 55

Sparse Inner Product Multiply (stream-join)

CSR format: Compressed Sparse Row.

[Figure: vector A (idx: 2, 3, 4; val: 2, 3, 5) is joined with B[0] (idx: 0, 1, 3; val: 1, 4, 1). A conditional compares the index streams; an output of 1 (match at idx 3) triggers total += 3*1, and an output of 0 means no multiplication.]

  • Known memory access pattern, but unpredictability in control
  • Stream join:
    • Memory reads can be independent of the data*
    • The order in which the streams of data are consumed is data-dependent
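The inner-product stream join amounts to a two-pointer walk over two sorted index streams. A minimal Python sketch of the idea (our naming, not SPU's ISA):

```python
# Sparse dot product of two CSR-style (idx, val) streams via a stream join:
# both index streams are sorted, so one compare decides which stream advances.

def sparse_dot(idx_a, val_a, idx_b, val_b):
    total, i, j = 0, 0, 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:      # indices match: multiply-accumulate
            total += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:     # consume the smaller index
            i += 1
        else:
            j += 1
    return total

# Streams from the slide: A = (idx 2,3,4 / val 2,3,5), B[0] = (idx 0,1,3 / val 1,4,1).
print(sparse_dot([2, 3, 4], [2, 3, 5], [0, 1, 3], [1, 4, 1]))  # 3 (only idx 3 matches)
```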


13 of 55

Sparse Outer Product Multiply (alias-free indirection)

CSC format: Compressed Sparse Column.

[Figure: each nonzero of vector A (idx/val) selects a column of B (col0..col3, each stored as idx/val pairs); the scaled columns are accumulated into the output vector C by index.]

  • High memory unpredictability, but known control pattern
  • No unknown dependences (only atomic updates: out[i] = out[i] + prod[i])
  • Alias-free indirection:
    • Addresses are produced depending on other data
    • Memory dependences exist, but there are no unknown (data-dependent) aliases
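The outer-product form can be sketched in a few lines of Python; this is an illustrative model of the computation, not SPU's hardware interface, and the data layout (a dict of columns) is our assumption:

```python
# Sparse matrix-vector multiply in outer-product (CSC) form: each nonzero of A
# scales one column of B; products are accumulated into C by data-dependent
# index (alias-free indirect updates: out[i] += prod).

def spmv_outer(a_idx, a_val, b_cols, n):
    c = [0] * n
    for k, ak in zip(a_idx, a_val):
        col_idx, col_val = b_cols[k]      # indirect access: which column depends on data
        for i, v in zip(col_idx, col_val):
            c[i] += ak * v                # update at a data-dependent index
    return c

# Tiny example: A has nonzeros a[0]=2, a[2]=3; columns of B given as (idx, val).
b_cols = {0: ([1], [5]), 2: ([0, 1], [4, 1])}
print(spmv_outer([0, 2], [2, 3], b_cols, 3))  # [12, 13, 0]
```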

14 of 55

Graph Mining (e.g., Triangle Counting)

  • For every pair of connected nodes, find whether they have a common neighbor.

[Figure: an example graph on nodes a..f and its edge list. Checking for common neighbors joins two sorted adjacency lists (stream-join); fetching each node's edge list is an indirect access (alias-free indirect).]
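Combining the two forms, triangle counting joins sorted adjacency lists fetched by indirect access. A minimal Python sketch (illustrative only):

```python
# Triangle counting: for each edge (u, v), intersect the sorted adjacency lists
# of u and v (a stream join); fetching adj[u] and adj[v] is an indirect access.

def count_common(xs, ys):
    i, j, n = 0, 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            n += 1; i += 1; j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return n

def triangles(adj, edges):
    # each triangle is found once per participating edge, hence the division by 3
    return sum(count_common(adj[u], adj[v]) for u, v in edges) // 3

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
edges = [(0, 1), (0, 2), (1, 2)]
print(triangles(adj, edges))  # 1
```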

15 of 55

[Table: each workload's kernels under the two dependence forms.

  • Machine Learning: Neural Net (FC + Conv), Supp. Vector (SVM), Decision Trees (GBDT), Bayesian Networks
  • Databases: Join (inner), Sort, Filter
  • Graph: Page Rank & BFS, Triangle Counting

Stream-join kernels (irregular control): Inner Product Mult. (NN, SVM), Sort-Join, Merge-Sort, Generate Filtered Col., condition on node type, sparse join of active list, find common neighbor edges.

Alias-free indirection kernels (irregular memory): Outer Product Mult. (NN, SVM), Hash-Join, Radix-Sort, Generate Column Ind., sparse data access + histogramming, DAG access, indirect access for edges.]

16 of 55

Outline

  • Irregularity is ubiquitous
    • Sufficient and Exploitable forms of Control and Memory dependence
    • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
    • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
    • SPU Multicore Design
  • Evaluating SPU
  • Conclusion


17 of 55

Approach: Start with a Dense Programmable Accelerator

[Figure: stereotypical dense accelerator core: a systolic array, a wide scratchpad, and control. Examples: Google TPU v2 (ISCA'17), PuDianNao (ASPLOS'15), Tabla (HPCA'16).]

21 of 55

Specializing for Stream Join


22 of 55

Novel Dataflow for Stream Join

[Figure: the sparse MM example mapped as a traditional dataflow onto a systolic array of PEs. Loads of idxA/idxB feed a Cmp (=, <=, >=), which controls address generation (Gen addr, ++) and the loads of valA/valB into × and acc. Problems: control-dependent load, cyclic dependence, unpredictable branch!]

23 of 55

Novel Dataflow for Stream Join

  • Observation: For a stream join, memory is (mostly) separable from computation
  • Idea: Allow Dataflow to conditionally pop/discard/reset values based on control decisions.

[Figure: the same sparse MM example as a stream-join dataflow. Streams idxA/idxB feed a Cmp (>, <, =) whose control outputs c drive reuse/discard on the index and value streams (strm idxA, strm idxB, strm valA, strm valB) and reset/init on the accumulator; valA × valB feeds acc.]
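The conditional pop/reuse semantics can be modeled in software: each input port is a FIFO that pops only when the compare unit says so, and otherwise reuses its head. A minimal sketch with our naming, not SPU's:

```python
from collections import deque

# Model of a stream-join dataflow: the Cmp node's control output decides, per
# firing, whether each input FIFO pops (consume) or keeps its head (reuse).

def stream_join_dot(idx_a, val_a, idx_b, val_b):
    a, b = deque(zip(idx_a, val_a)), deque(zip(idx_b, val_b))
    acc = 0  # 'reset'/'init' would clear this between output groups
    while a and b:
        (ka, va), (kb, vb) = a[0], b[0]   # peek: values stay in the FIFOs
        if ka == kb:
            acc += va * vb
            a.popleft(); b.popleft()      # consume both
        elif ka < kb:
            a.popleft()                   # consume a; b's head is reused
        else:
            b.popleft()                   # consume b; a's head is reused
    return acc

print(stream_join_dot([2, 3, 4], [2, 3, 5], [0, 1, 3], [1, 4, 1]))  # 3
```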


27 of 55

Other Kernels as Stream Join

[Figure: other kernels expressed as stream-join dataflows. Database Join: compare key1/key2 streams (>, <, =), reuse/discard, concatenate matching values to the output stream. Merge (sort): compare key1/key2, reuse, with a Mux selecting the output stream. Resparsify (filter): a Cmp and ReLU (max) over the input stream, with discard of zeros, accumulator reset/init, and sparse index/value output streams (s-ind, s-val).]
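The merge kernel above follows the same pattern; a minimal Python sketch of the merge step as a stream join (illustrative only):

```python
# Merge step of merge-sort as a stream join: compare the two stream heads,
# reuse the larger, consume the smaller, and Mux it onto the output stream.

def merge_streams(xs, ys):
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            out.append(xs[i]); i += 1   # consume xs head; ys head is reused
        else:
            out.append(ys[j]); j += 1   # consume ys head; xs head is reused
    out.extend(xs[i:])                  # drain whichever stream remains
    out.extend(ys[j:])
    return out

print(merge_streams([1, 4, 7], [2, 3, 9]))  # [1, 2, 3, 4, 7, 9]
```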

28 of 55

Supporting Stream-Join in Hardware

[Figure: each CGRA processing element gains stream-join flow control: input FIFOs (FIFO0..FIFO2) with reuse/discard controls, an accumulator (ACC) with reset, and a control lookup table (CLT) beside the functional unit. The PEs are arranged in a "systolic" CGRA with NE/NW/SE/SW neighbor links.]

29 of 55

Specializing for Indirection


30 of 55

Indirection with guaranteed alias-freedom

Traditional banked memory (typical GPU): requests from the compute fabric & network pass a dependence check over each request vector, then an arbitrated crossbar routes them to banks 0..n.

Observation: the dependence check serializes over a group of non-conflicting requests; for alias-free indirection it is known a priori that requests won't alias, so serialization is not required.

Idea: SPU allows aggressive reordering with a simple reorder buffer.

[Figure: SPU banked memory: address generation feeds per-bank queues behind the arbitrated crossbar; indirect accesses proceed without a dependence check, indirect updates complete in place, and an indirect reorder buffer restores order for indirect reads.]
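The reordering idea can be sketched in software: indirect reads are binned into per-bank queues, banks drain independently, and a small buffer restores the original request order. This models the concept only, not the RTL:

```python
# Sketch: alias-free indirect reads through a banked scratchpad. Requests are
# binned per bank with no dependence check (the programming model guarantees
# no aliasing); responses drain bank-by-bank, and a reorder buffer restores
# the original request order before results return to the compute fabric.

NUM_BANKS = 4

def banked_gather(mem, addrs):
    queues = [[] for _ in range(NUM_BANKS)]     # per-bank request queues
    for pos, addr in enumerate(addrs):          # address generation
        queues[addr % NUM_BANKS].append((pos, addr))

    rob = {}                                    # indirect reorder buffer
    for q in queues:                            # banks serve their queues independently
        for pos, addr in q:
            rob[pos] = mem[addr]

    return [rob[p] for p in range(len(addrs))]  # drain in original request order

mem = [10 * i for i in range(16)]
print(banked_gather(mem, [5, 0, 7, 2]))  # [50, 0, 70, 20]
```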

31 of 55

Indirect Access Reordering in Scratchpad

[Figure: handling of a vector read (req1) followed by a vector write (req2): a typical GPU serializes them through the dependence check, while SPU overlaps them.]

32 of 55

SPU: Sparse Processing Unit

  • Network: traditional mesh NoC
  • Scratchpads in a global address space, for nearby-core communication
  • Broadcast: using the memory stream engine
  • Synchronization: dataflow counters (sync on scratchpad write)

[Figure: a 3×3 grid of SPU cores connected to main memory through the memory stream engine.]

33 of 55

Outline

  • Irregularity is ubiquitous
    • Sufficient and Exploitable forms of Control and Memory dependence
    • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
    • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
    • SPU Multicore Design
  • Evaluating SPU
  • Conclusion


34 of 55

Methodology

  • Programming: C + intrinsics + dataflow graphs
  • SPU simulation: gem5 + Ruby (RISC-V in-order control core)

Baseline frameworks per workload (CPU / GPU):

  • GBDT: LightGBM / LightGBM
  • Kernel-SVM: LibSVM / hand-coded
  • AC: hand-coded / hand-coded
  • FC: MKL SPBLAS / cuSparse
  • Conv layer: MKL-DNN / cuDNN
  • Graph Alg.: Graphmat / none
  • TPCH: MonetDB / none

Datasets (with varying sparsity, shown in parentheses):

  • GBDT: Cifar10-bin (1), Higgs-bin (0.28), Yahoo-bin (0.05), Ltrc-bin (0.008)
  • KSVM: Connect (0.33), Higgs (0.92), Yahoo (0.59), Ltrc (0.24)
  • CONV: VGG-3 (0.34), VGG-4 (0.1), ALEX-2 (0.14), RES-1 (0.05)
  • FC: VGG-12 (0.04), VGG-13 (0.09), ALEX-6 (0.16), RES-1 (0.22)
  • AC: Pigs, Munin, Andes, Mildew
  • Graph: Flickr, Fb-artist, NY-road

35 of 55

Domain-agnostic comparison points

[Figure: overall and core designs of the comparison points.

  • P4000 Pascal GPU: SMs (SM-1..SM-N) on main memory; each SM has control, FUs (FU-1..FU-N), an L1 cache, and shared memory. On-chip mem: 4MB; FP units: 3696; mem bw: 243 GB/s.
  • SPU-inorder: an array of in-order cores on main memory; each core has control, FUs, a SIMD unit, a linear scratchpad, and a banked scratchpad. On-chip mem: 3MB; FP units: 2560; mem bw: 256 GB/s.
  • SPU: an array of SPU cores; each core has control, FUs, linear and banked scratchpads, and a systolic CGRA. On-chip mem: 3MB; FP units: 2632; mem bw: 256 GB/s.]

36 of 55

Domain-specific comparison points


SCNN (Sparse convolution): ISCA’17

EIE (Sparse fully connected): ISCA’16

Graphicionado (Graph Analytics): MICRO’16

Q100 (Databases): ASPLOS’14

37 of 55

Overall Results

[Figure: overall performance of GPU, SPU-inorder, and SPU, with ASIC comparison points (EIE, SCNN, Q100, Graphicionado).]

38 of 55

Cost of adding stream-join in systolic CGRA

  • 1.69× area and 1.63× power overhead from adding flow control.
  • Compared to the whole design, this is a 6.9% area and 14.2% power overhead.

Methodology: SPU's DGRA is implemented in Chisel and synthesized using Synopsys DC with a 28nm UMC technology library.

39 of 55

Conclusion

[Figure: efficiency on stream-join algorithms and on alias-free indirection algorithms; comparison points include SCNN (Conv), EIE (FC), Graphicionado (Graph), Intel SpM-SpV, Q100, and Outerspace, alongside CPU, GPU, and SPU.]

40 of 55

EXTRA SLIDES


41 of 55

Programming SPU

Example of gradient boosting decision trees (GBDT)

  • Stream join control expressed in the dataflow graph
  • Alias-Free Indirection expressed as update stream


42 of 55

Approach Overview

Stream-Dataflow ISA:

  • Streaming memory: streams of data fetched from memory and stored back to memory (e.g., A[0..N], B[0..N]).
  • Dataflow computation: a dependence graph (DFG) with input/output vector ports.

Sparsity-Enabled SPU Core:

  • Maps dataflow computation to the systolic array.
  • Streams of data flow from the wide scratchpad.
43 of 55

Note: the relative positions are the best to our knowledge.

[Figure: same efficiency comparison as the conclusion, with additional domain-agnostic points: vector-thread, Plasticine, and triggered instructions.]

44 of 55

Indirect Access Reordering in Scratchpad

[Figure: contents of the indirect reorder buffer (IROB) at cycle 2, typical GPU vs. SPU.]

45 of 55

[Figure: mapping workloads onto the SPU multicore (main memory + global stream engine + 3×3 SPU cores), shown for three communication patterns:

  • Independent lanes (with/without communication): fully connected layer (broadcast row).
  • Local spatial communication: sparse convolution (communication with neighbors for halos).
  • Pipelined communication: arithmetic circuits via pipelined DAG traversal (pipeline node updates), and graph processing (core-core communication).]

46 of 55

How much density is exploitable?

  • Because they are bounded by memory bandwidth, the sparse versions are better below 50% density.
  • SPU-sparse gains grow exponentially with sparsity.

47 of 55

Alias-free Indirection Abstractions

1: Indirect memory: d = a[b[i]]

  • Allows specifying indirect loads or stores using an input stream as the address values.
  • An offset list supports array-of-structs layouts.

Example C code and the corresponding stream code:

  struct { int f1, f2; } a[N];
  for (i = 0; i < N; i++) {
    ind = b[i];          // str1 = load(b[0..n])
    .. = a[ind].f1;      // ind_load(addr=str1, offset_list={0,4})
    .. = a[ind].f2;
  }
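The offset-list semantics can be sketched as follows; this is our software model of the behavior, with the struct array flattened into 4-byte words, not SPU's actual implementation:

```python
# Sketch of ind_load(addr=stream, offset_list={0,4}) for an array-of-structs:
# each index in the address stream yields one load per offset. Byte offsets
# {0, 4} select the two 4-byte fields f1 and f2; we model the struct array as
# a flat list of 4-byte words, so byte offsets become word offsets.

def ind_load(mem, addr_stream, offset_list, struct_words=2):
    out = []
    for ind in addr_stream:          # str1 = load(b[0..n])
        base = ind * struct_words
        for off in offset_list:      # byte offsets 0 and 4 -> words 0 and 1
            out.append(mem[base + off // 4])
    return out

# a[0] = {f1: 1, f2: 2}, a[1] = {f1: 3, f2: 4}
mem = [1, 2, 3, 4]
print(ind_load(mem, [1, 0], [0, 4]))  # [3, 4, 1, 2]
```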

48 of 55

2: Histogram: a[hist_bin] += c

  • Enhance the ISA with compute-enabled semantics for the access stream.
  • Add an update stream for common reduction operations.
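The update stream's behavior can be sketched as a stream of (bin, value) pairs applying a commutative reduction at the memory; this models the semantics only, and the function name is ours:

```python
# Sketch of the update stream: a stream of (bin, value) pairs applies a
# reduction (here +=) directly at the scratchpad, so no value needs to
# round-trip through the compute fabric.

def update_stream(hist, bins, values, op=lambda x, y: x + y):
    for b, v in zip(bins, values):
        hist[b] = op(hist[b], v)   # a[hist_bin] += c
    return hist

print(update_stream([0, 0, 0, 0], [2, 0, 2], [5, 1, 2]))  # [1, 0, 7, 0]
```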

49 of 55

Sparsity-Enhanced Memory Micro-architecture

We keep two logical scratchpad memories: banked and linear.

  • Linear memory: reads/writes.

[Figure: the linear scratchpad is driven by linear address generation and a linear access stream table; the composable banked scratchpad sits behind an arbiter and crossbar (e.g., 16x32) fed by indirect address generation and an indirect rd/wr/atomic stream table, with rd-wr bank queues, control logic, an indirect ROB, and a Sel/MUX path to and from the compute fabric and the NoC.]

50 of 55

Sparsity-Enhanced Memory Micro-architecture

[Figure: same memory micro-architecture as the previous slide.]

We keep two logical scratchpad memories:

  • Linear memory: reads/writes
  • Banked memory: indirect writes/updates; indirect reads

51 of 55

Benefits of Heterogeneity on CGRA

[Figure: example DFG and its mapping to the DGRA.]

  • CGRA datapath width = 64 bits; initial datatypes = 16 bits.
  • Effective vectorization width is increased by 64/data-width (4× here).
  • The DGRA supports Concat and Extract using sub-networks.
52 of 55

Sparsity-Enhanced Computation Micro-architecture

  • DGRA switch: same external interface, but splits inputs and outputs.
  • DGRA processing element: decomposed into fine-grained PEs.

53 of 55

Cost of Decomposability & Stream-Join

[Figure: area/power cost of adding stream-join support to the systolic array, and the cost of making the fabric more decomposable.]

54 of 55

Remaining Challenges

  • Generality
    • What about other forms of irregularity (e.g., task-based)?
  • Programmability
    • Workload balance: (1) the same amount of work in each core; (2) efficient use of available on-chip memory (global addressing helps here)
    • Partitioning of computation/memory (locality & parallelism)
    • Programming in low-level intrinsics (dataflow compute & stream memory)
  • Virtualization/integration with the CPU

55 of 55

Domain-agnostic comparison points

[Figure: overall and core designs of the domain-agnostic comparison points.

  • 24-core Intel Skylake CPU: OOO cores (OOO-1..OOO-4 shown) with control, FUs, and L1/L2 caches on main memory. On-chip mem: 2.5MB; FP units: 2048.
  • P4000 Pascal GPU: SMs (SM-1..SM-N) with control, FUs, an L1 cache, and shared memory on main memory. On-chip mem: 4MB; FP units: 3584.
  • SPU-inorder: SPU cores with control, FUs, and linear and banked scratchpads on main memory.]