TaskStream: Accelerating Task-Parallel Workloads by Recovering Program Structure
Vidushi Dadu, Tony Nowatzki
ASPLOS 2022
2/3/2022
Goal: Broaden the Scope of Accelerators
2
DB+ML, SpMSpM
Matrix decomposition (e.g. Cholesky)
Tree/graph
(e.g. kNN, GCN)
3
Existing Task Parallel Works are Limited by Shared Memory Communication
Task Dependence Graph
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
4
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Existing Task Parallel Works are Limited by Shared Memory Communication
Synchronization point
Task Dependence Graph
5
TaskStream Graph
C
1
2
4
Rfg core
T
S
M
Rfg core
T
S
M
Rfg core
T
S
M
Rfg core
T
S
M
Exposing Inter-Task Dependencies in the Hardware
Mem
6
0
1
2
4
Rfg core
T
S
M
Rfg core
T
S
M
Rfg core
T
S
M
Rfg core
T
S
M
Exposing Inter-Task Dependencies in the Hardware
Mem
TaskStream Graph
7
0
1
2
4
Rfg core
T
S
M
Rfg core
T
S
M
Rfg core
T
S
M
Rfg core
T
S
M
Exposing Inter-Task Dependencies in the Hardware
Mem
TaskStream Graph
Goal: Enable Efficient Task Parallelism for Reconfigurable Accelerators.
8
Overview of TaskStream Program Representation
Core 0
Core 1
Rfg core
Mem
Rfg core
Streaming
(producer-consumer)
Core 0
Core 1
Rfg core
Mem
Rfg core
Creation
(load balance)
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Batching
(spatial locality)
Data item
TaskStream
Graph
Coarse-grained dependency
Scalar dependency
9
Outline
Creation
Streaming
Batching
Sizehint
Creation
inv = 1/a[k,k]
invsqr = 1/sqrt(a[k,k])
10
Example Program: Cholesky
Heterogeneous tasks
O(1)
Point
O(N)
Vector
O(N*N)
Matrix
for (j=k+1; j<n; ++j)
for (i=j; i<n; ++i)
a[j,i] -= a[k,i]*a[k,j]*inv
for (j=k; j<n; ++j)
l[j,i] = a[i,j]*invsqr
for (k=0; k<n; ++k)
11
Step 1: Split task types
Step 2: Determine the size of a task type
Cholesky Implemented with Basic Tasks
Point
Vector
Creation
Matrix
Creation
Point
Vector
Matrix
Task Dependence Graph
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
12
Inter-task Dependencies in Cholesky Decomposition
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
K=0
Inter-task dependencies
Point
Vector
Matrix
Coarse-grained dependency
Scalar dependency
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
K=1
K=1
2
2
2
2
2
2
2
2
2
2
K=2
…
K=2
13
Step 1: Split task types
Step 2: Determine the size of a task type
Cholesky Implemented with Basic Tasks
Point
Vec-tile
Creation
Mat-tile
Creation
Point
Vector
Matrix
Task Dependence Graph
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Step 3: Decide “barrier” position
Barrier
0
0
0
0
0
14
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Challenges with Basic Tasks
Load imbalance
Synchronization
1
1
1
1
1
1
Task Dependence Graph
Reconfiguration
Point
Vec-tile
Creation
Mat-tile
Creation
Barrier
1
15
Outline
Creation
Streaming
Batching
Sizehint
Creation
16
TaskStream: Task Types and Core Mask
Point
Vec-tile
Creation
CoreMask = 1000
CoreMask = 1000
Mat-tile
Creation
CoreMask = 0111
Barrier
Core 0
Core 1
Core 2
Core 4
Rfg
Mem
Rfg core
Rfg core
Rfg core
Core
1
1
1
1
1
1
1
TaskStream Graph
17
TaskStream Edges
18
TaskStream: Creation Edge
Sizehint =
if triangle: 1
If square: 2
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Point
Mat-tile
Creation
Sizehint = 1
TaskStream Graph
19
TaskStream: Creation Edge
Sizehint =
if triangle: 1
If square: 2
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Point
Mat-tile
Creation
Sizehint = 1
TaskStream Graph
20
TaskStream StreamRecovery: Streaming edge
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Mat-tile
1
1
2
2
Streaming
Coarse-grained dependency
TaskStream Graph
2
2
1
1
21
TaskStream StreamRecovery: Streaming edge
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Mat-tile
Streaming
2
1
2
1
Data item
TaskStream Graph
2
2
1
1
Coarse-grained dependency
22
TaskStream StreamRecovery: Batching Edge
Core 0
Core 1
Core 2
Core 4
Rfg core
Mem
Rfg core
Rfg core
Rfg core
Reschedule tasks to exploit spatial locality
v1
v3
v2
v6
v4
v7
v5
v14
v15
v12
v13
v10
v11
v8
v9
| | | | | | | | |
dense_search task
chose_child task
v13
v11
K-nearest neighbors
Program-order: v11 (Q1), v13 (Q2), v11 (Q3), v8 (Q4), v11 (Q5), v9(Q6), v11 (Q7), v9 (Q8)
Q1
Q3
Q5
Q7
v11
Schedule these first
23
TaskStream Graphs for Studied Workloads
e. Cholesky Decomposition
23
24
Outline
Creation
Streaming
Batching
Sizehint
Creation
25
Integrating TaskStream with Dataflow
26
CGRA
computation
Output port interface
Address Generation
Control Core
Input port interface
SRAM/Shared-Cache
To other tiles
Ack buffer
Router
FIFO scheduler
|
|
Delta: Hardware Support for TaskStream
Task Management Unit
(TaskStream Edges)
27
CGRA
computation
Output port interface
Address Generation
Control Core
Input port interface
SRAM/Shared-Cache
To other tiles
Ack buffer
Router
FIFO scheduler
|
|
Delta: Hardware Support for TaskStream
Task Management Unit (TaskStream Edges)
Reconfigurable Core
28
CGRA
computation
(Tid) | (Arg1, Arg2) |
| |
| |
| |
| |
Output port interface
Address Generation
Control Core
Input port interface
SRAM/Shared-Cache
To other tiles
Ack buffer
Router
Task argument buffer
Explicit tasks
Dynamic tasks
To memory
Reconfigurable Core
task2
task1
Many concurrently active tasks.
Delta: Hardware Support for Creation
Creation
29
CGRA
computation
(Data id) | (TID, Bytes) |
| |
| |
| |
| |
Output port interface
Address Generation
Control Core
Input port interface
SRAM/Shared-Cache
To other tiles
CAM search (Data id)
Router
Task batching buffer
Delta: Hardware Support for Batching
Reconfigurable Core
task2
task3
Batch + Stream
30
CGRA
computation
(Data id) | (TID, Bytes) |
(0) | |
(3) | |
| |
| |
Output port interface
Address Generation
Control Core
Input port interface
SRAM/Shared-Cache
To other tiles
Router
Task batching buffer
Delta: Hardware Support for Batching
Reconfigurable Core
Single DataID serves multiple TID.
Batch + Stream
CAM search (Data id)
task3
task4
task5
task2
task3
task3
task3
task2
31
Task Protocol for Stream Recovery
T3
T2
T1
stream
creation
Legend
Green: Task state changes
TaskStream graph
creation
Producer
Consumer
32
Origin core: T1
Core for T2
T3
T2
1. Task creation
T2
3. Ready Ack
2. Schedule T2
Core (s) for T3
5. Ready Ack
response
7. Streaming
Data
T3
6. Execution
Producer
Consumer
4. Scheduled and ready state
Task Protocol for Stream Recovery
8. Done
T1
T3
T2
T1
stream
creation
Legend
Green: Task state changes
TaskStream graph
creation
33
Outline
Creation
Streaming
Batching
Sizehint
Creation
Programming: C + Intrinsic + DSL for TaskStream and Dataflow Graphs
Delta Simulation: Gem5+Ruby (RISCV inorder control core)
Baselines:
34
Methodology
35
Methodology
Characteristics | Value |
Cores | 16 |
FP Units | 1024 |
Task buf. entries | 16 64-byte |
Memory b/w | 256 GB/s |
Shared Cache | 2 MB |
Private Scratch | 256 kB |
Network | 64-byte mesh |
Workload | Dataset size | Other Params |
SpMM | Matrix-size=512, 1k, 4k | Density=0.1 |
DB-ML | Table-size=512, 1k, 4k | join=0.1 |
Cholesky | Matrix-size=128, 256 | Tile=32 |
kNN | #queries = 512, 1k, 2k | Tree-depth=8 Leaf-size=2k |
GCN | Graph-edges=10k,4k | Feat-len=64 |
36
Tasks: 2x with dynamic scheduling
Stream recovery: 2x with reuse.
Overall Performance
Delta-onlyTasks
Delta-Tasks+StreamRec
37
Recovering Streaming Reduces Network Traffic
38
Cholesky Cycle-level Utilization
Delta-onlyTasks (unoptimized): util = 6%
39
Cholesky Cycle-level Utilization
Delta-onlyTasks (unoptimized): util = 6%
Delta-StreamRecovery (optimized): util = 32%
40
Conclusion
Comparison to Prior Work
42
Comparison to CPU
Area numbers: buffers cost 3.6% area overhead.