1 of 34

FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

Jianming Tong+ Anirudh Itagi+ Prasanth Chatarasi* Tushar Krishna+

+ *

ISCA 2024

github

2 of 34

Outline

  • Background and Motivation
  • Challenges and FEATHER Contributions
  • FEATHER - Switch Dataflows with Negligible Overhead
  • Evaluation
  • Summary

3 of 34

Why Do We Need AI Accelerators?

Computation Challenge:

Trillions of Computations

→ Need a lot of parallel compute

Energy Challenge:

Billions of Parameters, Heavy Data Movement

→ Need to reduce data movement

4 of 34

Spatial (or Dataflow) Accelerators

Trillions of Computations

Heavy Data Movement

Spread computations across sea of ALUs

Compute’s Energy Problem [1]

Reuse Data within Array via Local Storage and Direct Communication

Examples: EIE [2], Eyeriss [3], Simba [4]

[Figure labels: Local Storage, Control]

[1] Horowitz et al., ISSCC 2014 [2] Han et al., EIE, ISCA 2016 [3] Chen et al., Eyeriss, ISSCC 2016 [4] Shao et al., Simba, MICRO 2019

5 of 34

How to leverage Data Reuse? – Dataflow!

[5] Kao et al., SIGMETRICS 2022

Dataflow: fine-grained compute and memory scheduling over space and time [5]

Reuse over space

Reuse over time

Mapping = Dataflow + Tiling

[Figure: example mapping with labels P, O, T; values shown: MC, NMCHWRS, (1,16,64,4,4,2,2)]

10^26 choices!

6 of 34

Performance Impact of Dataflows

Fixed Dataflow – Error bar indicates the performance of different choices.

No Global Optimal Dataflow!

Adaptively switching dataflows → 2x theoretical speedup

Many accelerators focus on enabling flexible dataflows (SIGMA [7], Flexagon [8]), but the data layout in the on-chip buffer is being ignored!

Evaluation via Timeloop [6]

[6] Parashar et al., ISPASS 2019 [7] Qin et al., HPCA 2019 [8] Martínez et al., ASPLOS 2020

7 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,31,16]:

C

W

H

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)

Inter line

Intra line

8 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,32,16]:

C

W

H

0

0

0:15

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)

9 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,32,16]:

C

W

H

0

0

16:31

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)

10 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,32,16]:

C

W

H

0

1

0:15

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)

11 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,32,16]:

C

W

H

0

4

0:15

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)

12 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,32,16]:

C

W

H

0

4

16:31

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)

13 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

for ht in [0,4,1]:

for wt in [0,4,1]:

for ct in [0,32,16]:

for w in [wt,wt+1,1]:

for h in [ht,ht+1,1]:

for c in [ct,ct+16,1]:

Hx1

Wx1

Cx16

3

HWC_Wx1Hx1Cx16 → HWC_Cx16 (omit dimensions without parallelism)

H

C

W

Layout Terminology: <inter_line_loop>_<intra_line_loop>

# Inter-line dimension order (aka row)

# Intra-line dimension order (aka col)
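
The layout terminology above can be read as an address computation. The following is a minimal Python sketch, not the authors' implementation. It assumes a 5×5×32 (H×W×C) tensor (reading the [Start, End, Step] bounds as inclusive) and a buffer line of 16 elements; the function name `hwc_cx16_address` and its parameters are hypothetical.

```python
def hwc_cx16_address(h, w, c, W=5, C=32, line_width=16):
    """Map tensor coordinate (h, w, c) to (line, offset) under HWC_Cx16.

    Inter-line order is H -> W -> C-tile (outermost to innermost);
    intra-line order places 16 consecutive channels in one line.
    Shapes and line width are assumptions made for this sketch.
    """
    c_tiles = C // line_width              # channel tiles per (h, w) position
    line = (h * W + w) * c_tiles + c // line_width
    offset = c % line_width
    return line, offset


# Channels 0:15 at (h=0, w=0) share line 0; channels 16:31 land in line 1.
print(hwc_cx16_address(0, 0, 0))    # (0, 0)
print(hwc_cx16_address(0, 0, 16))   # (1, 0)
print(hwc_cx16_address(0, 1, 0))    # (2, 0)
```

Under this reading, the 16 channels that one line holds are exactly what a channel-parallel access needs in one cycle.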

14 of 34

Performance Impact of Dataflows considering layout

Fixed (Dataflow, Layout) Pair – No Global Optimal Dataflow (Error Bar)

Enable “Dataflow Switching” assuming ideal Layout – Theoretical 2✕ Speedup

“Dataflow Switching” under real layout → Bank conflict. Up to 128✕ slowdown!

Let’s take an Example!

Evaluation via Timeloop

15 of 34

“Concordant” Dataflow-Layout → Optimal Performance

Layout: HWC_Cx16

Layer 1 - Large C

Level-0 Tile

for ct in [0,32,16):

for mt in [0,32,16):

for ht in [0,5,1):

for wt in [0,5,1):

Level-1 Tile

for h in [0,1,1):

for w in [0,1,1):

parfor c in [0,16,16):

parfor m in [0,16,4):

Dataflow (CM Parallelism)

Layout (Channel Last)

Inter-line (HWC)

for ht in [0,5,1):

for wt in [0,5,1):

for ct in [0,32,16):

Intra-line (Cx16)

for h in [0,1,1):

for w in [0,1,1):

for c in [0,16,16):

Concordant Layout achieves theoretical compute utilization of dataflow.

16 of 34

“Discordant” Dataflow-Layout → Performance Slowdown

Level-0 Tile

for ct in [0,8,1):

for mt in [0,32,16):

for ht in [0,56,1):

for wt in [0,56,16):

Level-1 Tile

for h in [0,1,1):

for c in [0,1,1):

parfor w in [0,16,16):

parfor m in [0,16,4):

Dataflow (WM Parallelism)

Layer 2 - Large HW

Layout: HWC_Wx2Cx8

Bank Conflicts!

(2 ports for 8 rows)

Inter-line (HWC)

for ht in [0,56,1):

for wt in [0,56,2):

for ct in [0,8,8):

Intra-line (Wx2Cx8)

for h in [0,1,1):

for w in [0,2,2):

for c in [0,8,8):

Layout (Channel Last)

Dataflow (WM parallelism) gives 100% utilization but suffers from 4✕ bank-conflict slowdown!

Discordant
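
The 4✕ figure on this slide can be reproduced by counting how many buffer lines one WM-parallel access touches. This is a rough sketch under stated assumptions: all lines sit in a single 2-port bank, Layer 2 has W=56 and C=8 as in the loop bounds above, and the helper names `line_of` and `access_cycles` are hypothetical.

```python
import math


def line_of(h, w, c, W=56, C=8, w_per_line=2, c_per_line=8):
    """Line index under HWC_Wx2Cx8: inter-line order H -> W-tile -> C-tile."""
    w_tiles = W // w_per_line
    c_tiles = C // c_per_line
    return (h * w_tiles + w // w_per_line) * c_tiles + c // c_per_line


def access_cycles(h, c, w_start, parallel_w=16, ports=2):
    """Lines touched and cycles needed to fetch 16 w-values at a fixed (h, c)."""
    lines = {line_of(h, w, c) for w in range(w_start, w_start + parallel_w)}
    return len(lines), math.ceil(len(lines) / ports)


lines_touched, cycles = access_cycles(h=0, c=0, w_start=0)
print(lines_touched, cycles)  # 8 4
```

Each line holds only 2 of the 16 required w-values, so 8 lines are needed; with 2 ports that takes 4 cycles instead of 1.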

17 of 34

Dataflow Switching with Data Layout Reordering

Inter-line (HWC)

for ht in [0,56,1):

for wt in [0,56,2):

for ct in [0,8,8):

Intra-line (Wx2Cx8)

for h in [0,1,1):

for w in [0,2,2):

for c in [0,8,8):

Layout (Channel Last)

Inter-line (HCW)

for ht in [0,56,1):

for ct in [0,8,1):

for wt in [0,56,16):

Intra-line (Wx16)

for h in [0,1,1):

for c in [0,1,1):

for w in [0,16,16):

Layout (Row Major)

Change!

Co-Switch (Dataflow, Layout) for optimal performance, requiring On-chip Reordering!

Level-0 Tile

for ct in [0,8,1):

for mt in [0,32,16):

for ht in [0,56,1):

for wt in [0,56,16):

Level-1 Tile

for h in [0,1,1):

for c in [0,1,1):

parfor w in [0,16,16):

parfor m in [0,16,4):

Dataflow (WM Parallelism)

Layout: HCW_Wx16

Reordering

Bank Conflict

Resolve!
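
The layout change on this slide amounts to a permutation of buffer addresses. The sketch below is illustrative only. It assumes the Layer 2 shape H=56, W=56, C=8 from the loop bounds, a W-then-C element order inside each Wx2Cx8 line, and a partially filled last W-tile because 56 is not a multiple of 16; `old_addr` and `new_addr` are hypothetical names.

```python
import math

H, W, C = 56, 56, 8  # Layer 2 shape, taken from the loop bounds on this slide


def old_addr(h, w, c):
    """(line, offset) under HWC_Wx2Cx8: a line holds 2 w-values x 8 channels."""
    line = h * (W // 2) + w // 2
    offset = (w % 2) * 8 + c          # assumed intra-line order: W outer, C inner
    return line, offset


def new_addr(h, w, c):
    """(line, offset) under HCW_Wx16: a line holds 16 consecutive w-values."""
    w_tiles = math.ceil(W / 16)       # last tile is only partly used
    line = (h * C + c) * w_tiles + w // 16
    offset = w % 16
    return line, offset


# The reordering maps every old address to exactly one new address.
reorder = {old_addr(h, w, c): new_addr(h, w, c)
           for h in range(H) for w in range(W) for c in range(C)}

# After reordering, the 16 w-values of one WM-parallel step share one line.
lines_needed = {new_addr(0, w, 0)[0] for w in range(16)}
print(len(lines_needed))  # 1
```

With the new layout a single line read serves the whole parallel access, which is why the bank conflict disappears.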

18 of 34

Performance Impact of Dataflows considering layout

Fixed (Dataflow, Layout) Pair – No Global Optimal Dataflow (Error Bar)

Enable “Dataflow Switching” assuming ideal Layout – Theoretical 2✕ Speedup

“Dataflow Switching” under real layout → Bank conflict. Up to 128✕ slowdown!

Co-switching (dataflow, layout) per layer is crucial to obtain optimal performance.

Evaluation via Timeloop

20 of 34

What Reordering is Needed and How to Implement it?

Co-Switch (Dataflow, Layout) for optimal performance, requiring On-chip Reordering!

What kind of reordering is needed? (Functionality)

How to implement reordering with minimal latency overhead? (Performance)

21 of 34

Functionality: Overview of Available Reordering Techniques

Data provided per cycle

Functionality: FEATHER adopts Arbitrary Reordering to guarantee bank-conflict-free access for all dataflows!

Reordering transforms the layout, thereby enabling finer-grained data access per cycle

Google TPUv4 [9], MTIA [10]

[9] Jouppi et al, ISCA 2023 [10] Firoozshahian et al, ISCA 2023

Concordant Dataflow Space

An SRAM bank has 2 ports, enabling access to up to 2 rows per cycle.

FEATHER

22 of 34

Performance: Reordering Implementation Overview

Implementation Insight: RIR hides reordering latency behind reduction.

Prior art, e.g., TPUv4 [9] and MTIA [10], performs on-chip Reorder After Reduction (RAR).

[9] Jouppi et al, ISCA 2023 [10] Firoozshahian et al, ISCA 2023

RAR: Read Layout 1 → Compute & Reduction → Write Layout 1 → Reorder → Write Layout 2 (Extra Reorder Overhead)

RIR: Read Layout 1 → Compute & Reorder in Reduction (RIR) → Write Layout 2
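
The difference between the two pipelines on this slide can be contrasted with a toy latency model. This is only an accounting sketch: the stages are assumed to run back to back, and every cycle count below is a hypothetical placeholder, not a measured value.

```python
def rar_cycles(compute, write, reorder):
    """Reorder After Reduction: write Layout 1, run a separate reorder pass, write Layout 2."""
    return compute + write + reorder + write


def rir_cycles(compute, write):
    """Reorder In Reduction: the reorder overlaps with reduction, so one write remains."""
    return compute + write


# Hypothetical cycle counts, chosen only to illustrate the accounting.
compute, write, reorder = 1000, 64, 64
print(rar_cycles(compute, write, reorder))  # 1192
print(rir_cycles(compute, write))           # 1064
```

In this model the RAR overhead is the extra write plus the standalone reorder pass, which RIR folds into the reduction stage.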

23 of 34

Overview of FEATHER Contributions

Functional: Arbitrary Reordering

Performance: Reorder in Reduction (RIR) with Negligible Reorder Latency

Layout Switching – Reconfigurable NoC (BIRRD)

What (Dataflow, Layout) to switch?

Dataflow Switching

Scalable Reconfigurable 2D Compute Array (NEST)

LayoutLoop: Co-Search Optimal (Dataflow, Layout) Per Layer

First Accelerator to Co-Switch (Dataflow, Layout) Per Layer!

25 of 34

FEATHER Overview - Co-Switch (Dataflow, Layout) Per Layer

  • Read from StaB → Change Layout in Compute → Write to StaB (New Layout)
  • Co-Switch (Dataflow, Layout) per iteration of compute pipeline

Iteration

    • NEST – Dataflow Flexibility; BIRRD – Layout Flexibility.

26 of 34

Full Dataflow Flexibility Powered by NEST

NEST Supports Flexible Dataflow via Three-level Arbitrary Reduction over PEs!

More details in the paper!

  • Level 1: Local Temporal Reduction
  • Level 2: Column-wise Spatial Reduction
  • Level 3: Global Temporal Reduction

27 of 34

Full Layout Flexibility Powered by BIRRD

  • An N-input BIRRD has 2log(N) pipeline stages.
  • Each stage has N/2 2×2 switches (Egg).
  • Arbitrary Reduction: accumulation of any selected input data.
  • Arbitrary Reorder: reorder accumulated data to any output ports.
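
The sizing rules on this slide translate directly into switch counts. The snippet below is a small sketch that assumes the logarithm is base 2 and that N is a power of two; `birrd_size` is a hypothetical helper name.

```python
import math


def birrd_size(n_inputs):
    """Stage and switch counts for an N-input BIRRD, using the figures on this slide."""
    stages = 2 * int(math.log2(n_inputs))    # 2log(N) pipeline stages (base 2 assumed)
    switches_per_stage = n_inputs // 2       # N/2 two-by-two switches per stage
    return stages, switches_per_stage, stages * switches_per_stage


print(birrd_size(16))  # (8, 8, 64)
```

For a 16-input network this gives 8 stages of 8 switches, 64 switches in total.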

28 of 34

Illustrations of BIRRD “Reordering in Reduction” Functionality

Arbitrary data shuffling among banks

Specifying different addresses per bank

Col-wise Reorder

Row-wise Reorder

Arbitrary Reorder Layout Flexibility
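
What "reordering in reduction" computes can be described with a purely functional model. The sketch below captures only the input/output behavior (sum selected inputs, then steer each sum to a chosen output port); it does not model the switch network or its routing, and all names in it are hypothetical.

```python
def reorder_in_reduction(inputs, group_of, port_of_group, n_ports):
    """Sum inputs by reduction group, then place each sum on its chosen output port."""
    sums = {}
    for value, group in zip(inputs, group_of):
        sums[group] = sums.get(group, 0) + value
    outputs = [None] * n_ports                # None marks an unused port
    for group, total in sums.items():
        outputs[port_of_group[group]] = total
    return outputs


out = reorder_in_reduction(
    inputs=[1, 2, 3, 4, 5, 6, 7, 8],
    group_of=[0, 0, 1, 1, 2, 2, 3, 3],        # which inputs accumulate together
    port_of_group={0: 6, 1: 0, 2: 3, 3: 5},   # destination port (bank) of each sum
    n_ports=8,
)
print(out)  # [7, None, None, 11, None, 15, 3, None]
```

Choosing the destination port per sum corresponds to shuffling data among banks; choosing a write address per bank would add the row-wise freedom shown on the slide.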

30 of 34

Comparison to SoTA (On FPGA ZCU104 Board)

  • Results are normalized to FEATHER’s value (normalized throughput/PE).
  • 2.65× higher than Xilinx DPU running on the ZCU104 FPGA for ResNet50.
  • 3.91× higher than Gemmini running on AWS f1.2xlarge using FireSim.
  • 4.56× higher than Edge TPU running on a USB accelerator attached to a Raspberry Pi 4B.

Takeaway: FEATHER achieves 2.65~4.56× higher throughput/PE because of flexible dataflows.

31 of 34

Comparison to SoTA (LayoutLoop)

Takeaway: FEATHER achieves 1.18~2.89× speedup by supporting low-cost arbitrary reordering.

  • 2.89× speedup over NVDLA – Without reordering, it uses a suboptimal dataflow.
  • 1.7× speedup over SIGMA-like – Off-chip reordering introduces extra latency.
  • 1.18× speedup over Medusa – Line rotation reorders whole lines, leading to bank conflicts.
  • 1.36× speedup over TPU-like – Column-wise access via transpose is NOT sufficient.

32 of 34

Resources Evaluation of BIRRD and FEATHER

  • 16×16 FEATHER (@1 GHz) uses only 44% of the area of SIGMA-256
    • A 16-input BIRRD can do the job of a 256-input FAN → √N Area Saving!
    • Simplifies the Distribution NoC into Point-to-Point connections.

34 of 34

Summary

  • (Dataflow, layout) mismatch → bank conflict slowdown!
  • FEATHER: first (dataflow, layout) co-switching architecture.
    • Functionality: Arbitrary Reordering to ensure zero bank conflict
    • Performance: Reorder in Reduction (RIR) to hide latency
    • Implementation:
      • BIRRD NoC for layout switching
      • NEST array for dataflow switching
  • Results:
    • 2.65✕/3.91✕/4.56✕ higher throughput/PE than Xilinx DPU/Gemmini/Edge TPU
    • 1.18~2.89✕ speedup and 1.3~6.43✕ higher energy efficiency than SotAs.
    • Area efficient: using only 44% of the area of SIGMA-256.
    • Peak clock frequency of 1.5 GHz (TSMC 28 nm)

Thank you!

Code