FEATHER: A Reconfigurable Accelerator
with Data Reordering Support
for Low-Cost On-Chip Dataflow Switching
Jianming Tong+ Anirudh Itagi+ Prasanth Chatarasi* Tushar Krishna+
ISCA 2024
github
Outlines
Why Do We Need AI Accelerators?
Computation Challenge: Trillions of Computations → Need a lot of parallel compute
Energy Challenge: Heavy Data Movement, Billions of Parameters → Need to reduce data movement
Spatial (or Dataflow) Accelerators
Trillions of Computations → Spread computations across a sea of ALUs
Heavy Data Movement → Reuse Data within Array via Local Storage and Direct Communication
Compute’s Energy Problem [1]
Examples: EIE [2], Eyeriss [3], Simba [4]
[Figure labels: Local Storage, Control]
[1] Horowitz et al., ISSCC 2014 [2] Han et al., EIE, ISCA 2016 [3] Chen et al., Eyeriss, ISSCC 2016 [4] Shao et al., Simba, MICRO 2019
How to leverage Data Reuse? – Dataflow!
[5] Kao et al , SIGMETRICS 2022
Dataflow: fine-grained compute and memory scheduling over space and time [5]
Reuse over space
Reuse over time
Dataflow + Tiling = Mapping
Figure labels: P: MC, O: NMCHWRS, T: (1,16,64,4,4,2,2)
10^26 choices!
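One way to read the labels on this slide is as a loop nest. The sketch below is an interpretation, not something the slide states: it assumes NMCHWRS is the loop order (outermost first), that the tuple (1,16,64,4,4,2,2) gives one tile size per dimension in that order, and that MC marks the dimensions unrolled across PEs. The helper name `describe_mapping` is hypothetical.

```python
# Assumed reading of the slide labels (see the caveats above).
ORDER = "NMCHWRS"                  # loop order, outermost first
TILES = (1, 16, 64, 4, 4, 2, 2)    # one tile size per dimension in ORDER
PARALLEL = {"M", "C"}              # dimensions mapped across space (PEs)

def describe_mapping(order, tiles, parallel):
    """Render a mapping (order + tiling + parallelism) as loop-nest text."""
    lines = []
    for depth, (dim, tile) in enumerate(zip(order, tiles)):
        kind = "parfor" if dim in parallel else "for"
        lines.append("  " * depth + f"{kind} {dim.lower()} in range({tile}):")
    return "\n".join(lines)

print(describe_mapping(ORDER, TILES, PARALLEL))
```

Changing any one of the three inputs yields a different mapping, which is why the space of choices grows so quickly.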
Performance Impact of Dataflows
Fixed Dataflow – Error Bar indicates performance of different choices.
Adaptively switching dataflows → 2x theoretical speedup
[6] Parashar et al, ISPASS 2019 [7] Qin et al, HPCA 2019 [8] Martínez et al, ASPLOS 2020.
No Global Optimal Dataflow!
Many accelerators focus on enabling flexible dataflows (SIGMA [7], Flexagon [8]), but the data layout in the on-chip buffer is being ignored!
Evaluation via Timeloop [6]
Data Layout in On-chip Buffer
Layout: fine-grain organization of data in on-chip buffer.
Loop bounds are written as [Start, End, Step].
# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
# Intra-line dimension order (aka col)
[Figure labels: tensor axes C, W, H; buffer directions Inter line, Intra line]
Stepping through the loops fills one buffer line per iteration, in this (H, W, C) order: (0, 0, 0:15), (0, 0, 16:31), (0, 1, 0:15), and so on through (0, 4, 0:15) and (0, 4, 16:31).
Data Layout in On-chip Buffer
Layout: fine-grain organization of data in on-chip buffer.
Loop bounds are written as [Start, End, Step].
# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
# Intra-line dimension order (aka col)
      for w in [wt,wt+1,1]:        # Wx1
        for h in [ht,ht+1,1]:      # Hx1
          for c in [ct,ct+16,1]:   # Cx16
Layout Terminology: <inter_line_loop>_<intra_line_loop>
HWC_Wx1Hx1Cx16 → HWC_Cx16 (Omit dimensions w/o parallelism)
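The HWC_Cx16 layout can also be read as an address function. This is a minimal sketch, assuming W = 5, C = 32, and 16-element buffer lines; the function name `hwc_cx16_address` and its default sizes are illustrative assumptions, not part of FEATHER.

```python
def hwc_cx16_address(h, w, c, W=5, C=32, line_width=16):
    """Map tensor element (h, w, c) to (line, column) under HWC_Cx16.

    Inter-line order is H, then W, then channel tile; the intra-line
    position is the channel offset inside the 16-wide tile.
    """
    tiles_per_pixel = C // line_width        # channel tiles per (h, w) position
    c_tile, col = divmod(c, line_width)      # which tile, and offset inside it
    line = (h * W + w) * tiles_per_pixel + c_tile
    return line, col

# Channels 0:15 of pixel (0, 0) share line 0; channels 16:31 land on line 1.
print(hwc_cx16_address(0, 0, 3))
print(hwc_cx16_address(0, 0, 16))
print(hwc_cx16_address(0, 1, 0))
```

Elements that share a line can be fetched together, so the layout decides which groups of elements are cheap to read in one cycle.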
Performance Impact of Dataflows considering layout
Fixed (Dataflow, Layout) Pair – No Global Optimal Dataflow (Error Bar)
Enable “Dataflow Switching” assuming ideal Layout – Theoretical 2✕ Speedup
“Dataflow Switching” under real layout → Bank conflict. Up to 128✕ slowdown!
Let’s take an Example!
Evaluation via Timeloop
“Concordant” Dataflow-Layout → Optimal Performance
Layer 1 - Large C
Dataflow (CM Parallelism)
Level-0 Tile
for ct in [0,32,16):
  for mt in [0,32,16):
    for ht in [0,5,1):
      for wt in [0,5,1):
Level-1 Tile
for h in [0,1,1):
  for w in [0,1,1):
    parfor c in [0,16,16):
      parfor m in [0,16,4):
Layout (Channel Last): HWC_Cx16
Inter-line (HWC)
for ht in [0,5,1):
  for wt in [0,5,1):
    for ct in [0,32,16):
Intra-line (Cx16)
for h in [0,1,1):
  for w in [0,1,1):
    for c in [0,16,16):
Concordant Layout achieves theoretical compute utilization of dataflow.
“Discordant” Dataflow-Layout → Performance Slowdown
Layer 2 - Large HW
Dataflow (WM Parallelism)
Level-0 Tile
for ct in [0,8,1):
  for mt in [0,32,16):
    for ht in [0,56,1):
      for wt in [0,56,16):
Level-1 Tile
for h in [0,1,1):
  for c in [0,1,1):
    parfor w in [0,16,16):
      parfor m in [0,16,4):
Layout (Channel Last): HWC_Wx2Cx8
Inter-line (HWC)
for ht in [0,56,1):
  for wt in [0,56,2):
    for ct in [0,8,8):
Intra-line (Wx2Cx8)
for h in [0,1,1):
  for w in [0,2,2):
    for c in [0,8,8):
Discordant: Bank Conflicts! (2 ports for 8 rows)
Dataflow (WM parallelism) gives 100% utilization but suffers from 4✕ bank-conflict slowdown!
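The 4✕ figure follows from counting how many buffer lines one parallel access touches. This is a minimal sketch, assuming all the touched lines sit in a single SRAM bank with 2 ports (as the "2 ports for 8 rows" note suggests); the function names are hypothetical.

```python
import math

def line_of_wx2cx8(h, w, W=56, w_per_line=2):
    """Line index under HWC_Wx2Cx8: each line holds 2 w positions x 8 channels,
    so the channel index does not affect which line an element lives on."""
    return h * (W // w_per_line) + w // w_per_line

def access_cycles(lines, ports=2):
    """Cycles needed when the bank can serve at most `ports` distinct lines per cycle."""
    return math.ceil(len(set(lines)) / ports)

# WM-parallel dataflow: one cycle wants 16 consecutive w positions at a fixed (h, c).
lines = [line_of_wx2cx8(h=0, w=w) for w in range(16)]
print(len(set(lines)), access_cycles(lines))  # 8 distinct lines, 4 cycles
```

Sixteen w positions at two per line span 8 lines, and 8 lines through 2 ports take 4 cycles instead of 1.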
Dataflow Switching with Data Layout Reordering
Dataflow (WM Parallelism)
Level-0 Tile
for ct in [0,8,1):
  for mt in [0,32,16):
    for ht in [0,56,1):
      for wt in [0,56,16):
Level-1 Tile
for h in [0,1,1):
  for c in [0,1,1):
    parfor w in [0,16,16):
      parfor m in [0,16,4):
Before reordering: Layout (Channel Last): HWC_Wx2Cx8 → Bank Conflict
Inter-line (HWC)
for ht in [0,56,1):
  for wt in [0,56,2):
    for ct in [0,8,8):
Intra-line (Wx2Cx8)
for h in [0,1,1):
  for w in [0,2,2):
    for c in [0,8,8):
After reordering: Layout (Row Major): HCW_Wx16 → Bank Conflict Resolved!
Inter-line (HCW)
for ht in [0,56,1):
  for ct in [0,8,1):
    for wt in [0,56,16):
Intra-line (Wx16)
for h in [0,1,1):
  for c in [0,1,1):
    for w in [0,16,16):
Co-Switch (Dataflow, Layout) for optimal performance, requiring On-chip Reordering!
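The effect of the layout change can be checked with a small software model. The sketch below only models where elements end up; it says nothing about how the hardware performs the reordering. H is shrunk to 2 to keep the example small, and all function names are hypothetical.

```python
H, W, C = 2, 56, 8   # W and C follow Layer 2; H is reduced for brevity

def build_hwc_wx2cx8():
    """Lines ordered by (h, w tile); each line holds 2 w positions x 8 channels."""
    return [[(h, w, c) for w in range(wt, wt + 2) for c in range(C)]
            for h in range(H) for wt in range(0, W, 2)]

def reorder_to_hcw_wx16(src):
    """Rebuild the buffer with lines ordered by (h, c, w tile), 16 w positions each."""
    present = {e for line in src for e in line}
    dst = []
    for h in range(H):
        for c in range(C):
            for wt in range(0, W, 16):
                # The last w tile is partial because 56 is not a multiple of 16.
                line = [(h, w, c) for w in range(wt, min(wt + 16, W))]
                assert all(e in present for e in line)
                dst.append(line)
    return dst

def lines_touched(buf, wanted):
    """Number of buffer lines holding at least one wanted element."""
    return sum(1 for line in buf if any(e in wanted for e in line))

src = build_hwc_wx2cx8()
dst = reorder_to_hcw_wx16(src)
wanted = {(0, w, 0) for w in range(16)}   # one WM-parallel access
print(lines_touched(src, wanted), lines_touched(dst, wanted))  # 8 1
```

After reordering, the 16 values needed in one cycle sit on a single line, so the access fits within the bank's ports.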
Performance Impact of Dataflows considering layout
Fixed (Dataflow, Layout) Pair – No Global Optimal Dataflow (Error Bar)
Enable “Dataflow Switching” assuming ideal Layout – Theoretical 2✕ Speedup
“Dataflow Switching” under real layout → Bank conflict. Up to 128✕ slowdown!
Co-switching (dataflow, layout) per layer is crucial to obtain optimal performance.
Evaluation via Timeloop
What Reordering is Needed and How to Implement it?
Co-Switch (Dataflow, Layout) for optimal performance, requiring On-chip Reordering!
What kind of reordering is needed? (Functionality)
How to implement reordering with minimal latency overhead? (Performance)
Functionality: Overview of Available Reordering Techniques
An SRAM bank has 2 ports, enabling access to up to 2 rows.
Reordering transforms the layout, thereby enabling finer-grained data access per cycle.
[Figure labels: Data Provided Per Cycle; Concordant Dataflow Space; Google TPUv4 [9], MTIA [10]; FEATHER]
Function: FEATHER picks Arbitrary Reordering to ensure bank-conflict-free access for all dataflows!
[9] Jouppi et al., ISCA 2023 [10] Firoozshahian et al., ISCA 2023
Performance: Reordering Implementation Overview
Prior art, e.g. TPUv4 [9] and MTIA [10], performs on-chip Reorder After Reduction (RAR).
RAR: Read Layout 1 → Compute & Reduction → Write Layout 1 → Reorder → Write Layout 2 (Extra Reorder Overhead)
RIR: Read Layout 1 → Compute & Reorder in Reduction (RIR) → Write Layout 2
Implementation Insight: RIR hides reordering latency behind reduction.
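The difference between RAR and RIR can be illustrated with a toy functional model. This is a sketch under simplifying assumptions: buffer writes stand in for reorder overhead, the model is not cycle-accurate, and the function names and the example permutation are hypothetical.

```python
def reduce_groups(partials, group):
    """Sum each consecutive run of `group` partial sums into one output."""
    return [sum(partials[i:i + group]) for i in range(0, len(partials), group)]

def rar(partials, group, perm):
    """Reorder After Reduction: write Layout 1, then run a separate reorder pass."""
    layout1 = reduce_groups(partials, group)
    writes = len(layout1)                  # Write Layout 1
    layout2 = [None] * len(layout1)
    for src, dst in enumerate(perm):       # extra pass over the buffer
        layout2[dst] = layout1[src]
        writes += 1                        # Write Layout 2
    return layout2, writes

def rir(partials, group, perm):
    """Reorder In Reduction: each reduced value goes straight to its new slot."""
    layout2 = [None] * len(perm)
    writes = 0
    for src, value in enumerate(reduce_groups(partials, group)):
        layout2[perm[src]] = value
        writes += 1                        # Write Layout 2 only
    return layout2, writes

partials = list(range(16))
perm = [2, 0, 3, 1]                        # output i moves to slot perm[i]
print(rar(partials, 4, perm))
print(rir(partials, 4, perm))
```

Both paths produce the same Layout 2, but the RIR path never materializes Layout 1, which is the overhead the slide highlights.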
Overview of FEATHER Contributions
What (Dataflow, Layout) to switch? LayoutLoop: Co-Search Optimal (Dataflow, Layout) Per Layer
Dataflow Switching: Scalable Reconfigurable 2D Compute Array (NEST)
Layout Switching: Reconfigurable NoC (BIRRD)
Functional: Arbitrary Reordering
Performance: Reorder in Reduction (RIR) with Negligible Reorder Latency
First Accelerator to Co-Switch (Dataflow, Layout) Per Layer!
FEATHER Overview - Co-Switch (Dataflow, Layout) Per Layer
Full Dataflow Flexibility Powered by NEST
NEST Supports Flexible Dataflow via Three-level Arbitrary Reduction over PEs!
More details in the paper!
Full Layout Flexibility Powered by BIRRD
2log(N)
Illustrations of BIRRD “Reordering in Reduction” Functionality
Arbitrary data shuffling among banks
Specifying different addresses per bank
Col-wise Reorder
Row-wise Reorder
Arbitrary Reorder Layout Flexibility
Comparison to SoTA (On FPGA ZCU104 Board)
Takeaway: FEATHER achieves 2.65~4.56× throughput/PE because of flexible dataflows.
Comparison to SoTA (LayoutLoop)
Takeaway: FEATHER achieves 1.18~2.89× for supporting low-cost arbitrary reordering.
Resources Evaluation of BIRRD and FEATHER
Summary
Thank you!
Code
Reorder in Reduction (RIR)