1 of 42

TaskStream: Accelerating Task-Parallel Workloads by Recovering Program Structure

Vidushi Dadu, Tony Nowatzki

ASPLOS 2022

2/3/2022

2 of 42

Goal: Broaden the Scope of Accelerators

2

DB+ML, SpMSpM

Matrix decomposition (e.g. Cholesky)

Tree/graph

(e.g. kNN, GCN)

  1. Heterogeneous tasks
  2. Coarse-grained inter-task dependencies

3 of 42

3

Existing Task Parallel Works are Limited by Shared Memory Communication

Task Dependence Graph

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

4 of 42

4

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Existing Task Parallel Works are Limited by Shared Memory Communication

Synchronization point

Task Dependence Graph

5 of 42

5

TaskStream Graph

C

1

2

4

Rfg core

T

S

M

Rfg core

T

S

M

Rfg core

T

S

M

Rfg core

T

S

M

Exposing Inter-Task Dependencies in the Hardware

Mem

6 of 42

6

0

1

2

4

Rfg core

T

S

M

Rfg core

T

S

M

Rfg core

T

S

M

Rfg core

T

S

M

Exposing Inter-Task Dependencies in the Hardware

Mem

TaskStream Graph

7 of 42

7

0

1

2

4

Rfg core

T

S

M

Rfg core

T

S

M

Rfg core

T

S

M

Rfg core

T

S

M

Exposing Inter-Task Dependencies in the Hardware

Mem

TaskStream Graph

Goal: Enable Efficient Task Parallelism for Reconfigurable Accelerators.

8 of 42

8

Overview of TaskStream Program Representation

Core 0

Core 1

Rfg core

Mem

Rfg core

Streaming

(producer-consumer)

Core 0

Core 1

Rfg core

Mem

Rfg core

Creation

(load balance)

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Batching

(spatial locality)

Data item

TaskStream

Graph

Coarse-grained dependency

Scalar dependency

9 of 42

  • Inefficiencies of Shared-memory-based task models
    • Background on existing task parallel models
    • Challenges
  • TaskStream: Recovering Program Structure
    • TaskStream Edge Types
    • Task Protocol
  • TaskStream on Reconfigurable Hardware: Delta
  • Methodology and Evaluation
    • Performance across task parallel workloads
    • Cholesky cycle-level utilization
  • Conclusion

9

Outline

Creation

Streaming

Batching

Sizehint

Creation

10 of 42

inv = 1/a[k,k]

invsqr = 1/sqrt(a[k,k])

10

Example Program: Cholesky

Heterogeneous tasks

O(1)

Point

O(N)

Vector

O(N*N)

Matrix

for (j=k+1; j<n; ++j)

for (i=j; i<n; ++i)

a[j,i] -= a[k,i]*a[k,j]*inv

for (j=k; j<n; ++j)

l[j,i] = a[i,j]*invsqr

for (k=0; k<n; ++k)

11 of 42

11

Step 1: Split task types

Step 2: Determine the size of a task type

Cholesky Implemented with Basic Tasks

Point

Vector

Creation

Matrix

Creation

Point

Vector

Matrix

Task Dependence Graph

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

12 of 42

12

Inter-task Dependencies in Cholesky Decomposition

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

K=0

Inter-task dependencies

Point

Vector

Matrix

Coarse-grained dependency

Scalar dependency

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

K=1

K=1

2

2

2

2

2

2

2

2

2

2

K=2

K=2

13 of 42

13

Step 1: Split task types

Step 2: Determine the size of a task type

Cholesky Implemented with Basic Tasks

Point

Vec-tile

Creation

Mat-tile

Creation

Point

Vector

Matrix

Task Dependence Graph

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Step 3: Decide “barrier” position

Barrier

0

0

0

0

0

14 of 42

14

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Challenges with Basic Tasks

Load imbalance

Synchronization

1

1

1

1

1

1

Task Dependence Graph

Reconfiguration

Point

Vec-tile

Creation

Mat-tile

Creation

Barrier

1

15 of 42

  • Inefficiencies of Shared-memory-based task models
    • Background on existing task parallel models
    • Challenges
  • TaskStream: Recovering Program Structure
    • TaskStream Edge Types
    • Task Protocol
  • TaskStream on Reconfigurable Hardware: Delta
  • Methodology and Evaluation
    • Performance across task parallel workloads
    • Cholesky cycle-level utilization
  • Conclusion

15

Outline

Creation

Streaming

Batching

Sizehint

Creation

16 of 42

16

TaskStream: Task Types and Core Mask

Point

Vec-tile

Creation

CoreMask = 1000

CoreMask = 1000

Mat-tile

Creation

CoreMask = 0111

Barrier

Core 0

Core 1

Core 2

Core 4

Rfg

Mem

Rfg core

Rfg core

Rfg core

Core

1

1

1

1

1

1

1

TaskStream Graph

17 of 42

  • Creation: create a new task for balanced load

  • Stream-recovery:
    • Streaming: stream data during a producer-consumer dependence
    • Batching: batch tasks to exploit spatial reuse

17

TaskStream Edges

18 of 42

18

TaskStream: Creation Edge

Sizehint =

if triangle: 1

If square: 2

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Point

Mat-tile

Creation

Sizehint = 1

TaskStream Graph

19 of 42

19

TaskStream: Creation Edge

Sizehint =

if triangle: 1

If square: 2

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Point

Mat-tile

Creation

Sizehint = 1

TaskStream Graph

20 of 42

20

TaskStream StreamRecovery: Streaming edge

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Mat-tile

1

1

2

2

Streaming

Coarse-grained dependency

TaskStream Graph

2

2

1

1

21 of 42

21

TaskStream StreamRecovery: Streaming edge

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Mat-tile

Streaming

2

1

2

1

Data item

TaskStream Graph

2

2

1

1

Coarse-grained dependency

22 of 42

22

TaskStream StreamRecovery: Batching Edge

Core 0

Core 1

Core 2

Core 4

Rfg core

Mem

Rfg core

Rfg core

Rfg core

Reschedule tasks to exploit spatial locality

v1

v3

v2

v6

v4

v7

v5

v14

v15

v12

v13

v10

v11

v8

v9

dense_search task

chose_child task

v13

v11

K-nearest neighbors

Program-order: v11 (Q1), v13 (Q2), v11 (Q3), v8 (Q4), v11 (Q5), v9(Q6), v11 (Q7), v9 (Q8)

Q1

Q3

Q5

Q7

v11

Schedule these first

23 of 42

23

TaskStream Graphs for Studied Workloads

e. Cholesky Decomposition

23

24 of 42

  • Inefficiencies of Shared-memory-based task models
    • Background on existing task parallel models
    • Challenges
  • TaskStream: Recovering Program Structure
    • TaskStream Edge Types
    • Task Protocol
  • TaskStream on Reconfigurable Hardware: Delta
  • Methodology and Evaluation
    • Performance across task parallel workloads
    • Cholesky cycle-level utilization
  • Conclusion

24

Outline

Creation

Streaming

Batching

Sizehint

Creation

25 of 42

25

Integrating TaskStream with Dataflow

26 of 42

26

CGRA

computation

Output port interface

Address Generation

Control Core

Input port interface

SRAM/Shared-Cache

To other tiles

Ack buffer

Router

FIFO scheduler

Delta: Hardware Support for TaskStream

Task Management Unit

(TaskStream Edges)

27 of 42

27

CGRA

computation

Output port interface

Address Generation

Control Core

Input port interface

SRAM/Shared-Cache

To other tiles

Ack buffer

Router

FIFO scheduler

Delta: Hardware Support for TaskStream

Task Management Unit (TaskStream Edges)

Reconfigurable Core

28 of 42

28

CGRA

computation

(Tid)

(Arg1, Arg2)

Output port interface

Address Generation

Control Core

Input port interface

SRAM/Shared-Cache

To other tiles

Ack buffer

Router

Task argument buffer

Explicit tasks

Dynamic tasks

To memory

Reconfigurable Core

task2

task1

Many concurrently active tasks.

Delta: Hardware Support for Creation

Creation

29 of 42

29

CGRA

computation

(Data

id)

(TID, Bytes)

Output port interface

Address Generation

Control Core

Input port interface

SRAM/Shared-Cache

To other tiles

CAM search (Data id)

Router

Task batching buffer

Delta: Hardware Support for Batching

Reconfigurable Core

task2

task3

Batch + Stream

30 of 42

30

CGRA

computation

(Data

id)

(TID, Bytes)

(0)

(3)

Output port interface

Address Generation

Control Core

Input port interface

SRAM/Shared-Cache

To other tiles

Router

Task batching buffer

Delta: Hardware Support for Batching

Reconfigurable Core

Single DataID serves multiple TID.

Batch + Stream

CAM search (Data id)

task3

task4

task5

task2

task3

task3

task3

task2

31 of 42

31

Task Protocol for Stream Recovery

T3

T2

T1

stream

creation

Legend

Green: Task state changes

TaskStream graph

creation

Producer

Consumer

32 of 42

32

Origin core: T1

Core for T2

T3

T2

1. Task creation

T2

3. Ready Ack

2. Schedule T2

Core (s) for T3

5. Ready Ack

response

7. Streaming

Data

T3

6. Execution

Producer

Consumer

4. Scheduled and ready state

Task Protocol for Stream Recovery

8. Done

T1

T3

T2

T1

stream

creation

Legend

Green: Task state changes

TaskStream graph

creation

33 of 42

  • Inefficiencies of Shared-memory-based task models
    • Background on existing task parallel models
    • Challenges
  • TaskStream: Recovering Program Structure
    • TaskStream Edge Types
    • Task Protocol
  • TaskStream on Reconfigurable Hardware: Delta
  • Methodology and Evaluation
    • Performance across task parallel workloads
    • Cholesky cycle-level utilization
  • Conclusion

33

Outline

Creation

Streaming

Batching

Sizehint

Creation

34 of 42

Programming: C + Intrinsic + DSL for TaskStream and Dataflow Graphs

Delta Simulation: Gem5+Ruby (RISCV inorder control core)

Baselines:

  • 24-core SKL CPU
  • Static-parallel CGRA: no tasks
  • Delta-onlyTasks: only SizeHint is ON
  • Delta-Tasks+StreamRec: all optimizations are ON

34

Methodology

35 of 42

35

Methodology

Characteristics

Value

Cores

16

FP Units

1024

Task buf. entries

16 64-byte

Memory b/w

256 GB/s

Shared Cache

2 MB

Private Scratch

256 kB

Network

64-byte mesh

Workload

Dataset size

Other Params

SpMM

Matrix-size=512, 1k, 4k

Density=0.1

DB-ML

Table-size=512, 1k, 4k

join=0.1

Cholesky

Matrix-size=128, 256

Tile=32

kNN

#queries = 512, 1k, 2k

Tree-depth=8

Leaf-size=2k

GCN

Graph-edges=10k,4k

Feat-len=64

36 of 42

36

Tasks: 2x with dynamic scheduling

Stream recovery: 2x with reuse.

Overall Performance

Delta-onlyTasks

Delta-Tasks+StreamRec

37 of 42

37

Recovering Streaming Reduces Network Traffic

38 of 42

38

Cholesky Cycle-level Utilization

Delta-onlyTasks (unoptimized): util = 6%

39 of 42

39

Cholesky Cycle-level Utilization

Delta-onlyTasks (unoptimized): util = 6%

Delta-StreamRecovery (optimized): util = 32%

40 of 42

  • Task parallelism enables better load balancing, but at the cost of hurting program structure.
  • TaskStream exposes locality and compute structure as first-class abstractions in the execution model.
  • Delta achieves 2.2x performance gain with 3.6% area overhead.
  • Our work broadens the scope of reconfigurable accelerators to coarse-grained irregular parallelism.

40

Conclusion

41 of 42

Comparison to Prior Work

42 of 42

42

Comparison to CPU

Area numbers: buffers cost 3.6% area overhead.