1 of 34

FEATHER: A Reconfigurable Accelerator

with Data Reordering Support

for Low-Cost On-Chip Dataflow Switching

Jianming Tong⁺, Anirudh Itagi⁺, Prasanth Chatarasi*, Tushar Krishna⁺

Contact: jianming.tong@gatech.edu tushar@ece.gatech.edu

ISCA 2024

github

2 of 34

Outline

Background and Motivation
Challenges and FEATHER Contributions
FEATHER - Switch Dataflows with Negligible Overhead
Evaluation
Summary

3 of 34

Why Do We Need AI Accelerators?

Computation Challenge: Trillions of Computations
→ Need a lot of parallel compute

Energy Challenge: Billions of Parameters, Heavy Data Movement
→ Need to reduce data movement

4 of 34

Spatial (or Dataflow) Accelerators

Trillions of Computations → Spread computations across a sea of ALUs

Heavy Data Movement → Reuse data within the array via local storage and direct communication

Compute’s Energy Problem [1]

Examples: EIE [2], Eyeriss [3], Simba [4]

(Figure labels: Local Storage, Control)

[1] Horowitz et al., ISSCC 2014 [2] Han et al., EIE, ISCA 2016 [3] Chen et al., Eyeriss, ISSCC 2016 [4] Shao et al., Simba, MICRO 2019

5 of 34

How to leverage Data Reuse? – Dataflow!

[5] Kao et al., SIGMETRICS 2022

Dataflow: fine-grained compute and memory scheduling over space and time [5]

Order (O): NMCHWRS → Reuse over time

Parallelism (P): MC → Reuse over space

Tiling (T): (1,16,64,4,4,2,2)

Dataflow + Tiling = Mapping

10²⁶ choices!

To achieve maximum data reuse when mapping workloads onto a sea of ALUs, the key is to explore the dataflow.

[click] The dataflow is defined as the fine-grained compute and memory scheduling over space and time.

[click] For example, the dataflow of a convolution can be represented as a nested loop.

[click] The order of the dimensions defines the data reuse over time,

[click] and the parallelism of the dimensions defines the data reuse over space.

[click] When each dimension is filled in with a specific tiling value, the nested loop becomes an actual mapping [click] that can be executed on a real AI accelerator. We use mapping and dataflow interchangeably in the rest of the talk.

However, the design space for dataflows is extremely large. Even a single convolution layer comes with 10²⁶ [click] different dataflow choices!
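As a rough illustration of the nested-loop view described in these notes, the sketch below writes a convolution as a seven-deep loop in NMCHWRS order and then as a tiled "mapping" in which M and C are split into tiles. The tensor sizes, the tile sizes, and the helper names (`make_tensor`, `conv_reference`, `conv_mapped`) are all assumptions chosen for the example; they are not taken from the talk. The inner `m`/`c` loops stand in for what hardware would run in parallel.

```python
from itertools import product


def make_tensor(shape, seed):
    """Build a nested list of small deterministic integers (illustrative data only)."""
    if len(shape) == 1:
        return [(seed + i) % 7 for i in range(shape[0])]
    return [make_tensor(shape[1:], seed * 3 + i + 1) for i in range(shape[0])]


def zeros(N, M, H, W):
    return [[[[0] * W for _ in range(H)] for _ in range(M)] for _ in range(N)]


def conv_reference(x, w, dims):
    """Stride-1, unpadded convolution as one loop nest in NMCHWRS order."""
    N, M, C, H, W, R, S = dims
    out = zeros(N, M, H, W)
    for n, m, c, h, wi, r, s in product(*(range(d) for d in dims)):
        out[n][m][h][wi] += x[n][c][h + r][wi + s] * w[m][c][r][s]
    return out


def conv_mapped(x, w, dims, tile_m, tile_c):
    """Same computation with M and C tiled.

    The outer mt/ct loops advance over time; the innermost m/c loops model
    the spatial (parallel) part of the mapping, run sequentially here.
    """
    N, M, C, H, W, R, S = dims
    out = zeros(N, M, H, W)
    for mt in range(0, M, tile_m):
        for ct in range(0, C, tile_c):
            for n, h, wi, r, s in product(range(N), range(H), range(W), range(R), range(S)):
                for m in range(mt, min(mt + tile_m, M)):      # "parfor m" in hardware
                    for c in range(ct, min(ct + tile_c, C)):  # "parfor c" in hardware
                        out[n][m][h][wi] += x[n][c][h + r][wi + s] * w[m][c][r][s]
    return out


DIMS = (1, 4, 4, 3, 3, 2, 2)  # N, M, C, H, W, R, S (assumed toy sizes)
N, M, C, H, W, R, S = DIMS
x = make_tensor((N, C, H + R - 1, W + S - 1), seed=1)
w = make_tensor((M, C, R, S), seed=2)
same = conv_mapped(x, w, DIMS, tile_m=2, tile_c=2) == conv_reference(x, w, DIMS)
```

Every choice of loop order, parallel dimensions, and tile sizes computes the same result; what changes is which operands are reused over time and space, which is why the design space is so large.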

6 of 34

Performance Impact of Dataflows

No Global Optimal Dataflow!

Fixed Dataflow – Error bar indicates performance of different choices.

Adaptively switching dataflows → 2× theoretical speedup

Lots of accelerators focus on enabling flexible dataflows (SIGMA [7], Flexagon [8]), but data layout in the on-chip buffer is being ignored!

Evaluation via Timeloop [6]

[6] Parashar et al., ISPASS 2019 [7] Qin et al., HPCA 2019 [8] Martínez et al., ASPLOS 2020

7 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,31,16]:
      # Intra-line dimension order (aka col)

(Figure: tensor axes C, W, H; buffer labels "Inter line" and "Intra line".)

8 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
      # Intra-line dimension order (aka col)

(Figure: tensor axes C, W, H; highlighted indices 0, 0:15.)

9 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
      # Intra-line dimension order (aka col)

(Figure: tensor axes C, W, H; highlighted indices 0, 16:31.)

10 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
      # Intra-line dimension order (aka col)

(Figure: tensor axes C, W, H; highlighted indices 0, 1, 0:15.)

11 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
      # Intra-line dimension order (aka col)

(Figure: tensor axes C, W, H; highlighted indices 0, 4, 0:15.)

12 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
      # Intra-line dimension order (aka col)

(Figure: tensor axes C, W, H; highlighted indices 0, 4, 16:31.)

13 of 34

Data Layout in On-chip Buffer

Layout: fine-grain organization of data in on-chip buffer.

Layout Terminology: <inter_line_loop>_<intra_line_loop>

[Start, End, Step]

# Inter-line dimension order (aka row)
for ht in [0,4,1]:
  for wt in [0,4,1]:
    for ct in [0,32,16]:
      # Intra-line dimension order (aka col)
      for w in [wt,wt+1,1]:          # Wx1
        for h in [ht,ht+1,1]:        # Hx1
          for c in [ct,ct+16,1]:     # Cx16

HWC_Wx1Hx1Cx16 → HWC_Cx16 (Omit dimensions w/o parallelism)

(Figure: tensor axes H, C, W.)
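The layout terminology can be read as an address calculation. The sketch below is one possible reading of HWC_Cx16: it maps a tensor coordinate (h, w, c) to a buffer line and an offset within that line. The extents (W = 5, C = 32) and the function name `hwc_cx16_address` are assumptions made for illustration, not values or code from the talk.

```python
def hwc_cx16_address(h, w, c, width=5, channels=32, per_line=16):
    """Return (line, offset) for element (h, w, c) under an HWC_Cx16 layout.

    Inter-line order is H, then W, then C-tile; each line holds `per_line`
    consecutive channels of one (h, w) position. All extents are assumed.
    """
    lines_per_pixel = channels // per_line
    line = (h * width + w) * lines_per_pixel + c // per_line
    offset = c % per_line
    return line, offset


# Walk the first few lines in buffer order, mirroring the animation frames:
# (h=0, w=0, c=0:15), (h=0, w=0, c=16:31), (h=0, w=1, c=0:15), ...
first_lines = [hwc_cx16_address(0, 0, 0), hwc_cx16_address(0, 0, 16), hwc_cx16_address(0, 1, 0)]
```

Under this reading, channels 0 to 15 of one pixel share a line, so a dataflow that consumes 16 channels of one pixel per cycle touches exactly one line.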

14 of 34

Performance Impact of Dataflows considering layout

Fixed (Dataflow, Layout) Pair – No Global Optimal Dataflow (Error Bar)

Enable “Dataflow Switching” assuming ideal Layout – Theoretical 2✕ Speedup

“Dataflow Switching” under real layout → Bank conflict. Up to 128✕ slowdown!

Let’s take an Example!

Evaluation via Timeloop

15 of 34

“Concordant” Dataflow-Layout → Optimal Performance

Layer 1 - Large C

Layout: HWC_Cx16

Dataflow (CM Parallelism)

# Level-0 Tile
for ct in [0,32,16):
  for mt in [0,32,16):
    for ht in [0,5,1):
      for wt in [0,5,1):
        # Level-1 Tile
        for h in [0,1,1):
          for w in [0,1,1):
            parfor c in [0,16,16):
              parfor m in [0,16,4):

Layout (Channel Last)

# Inter-line (HWC)
for ht in [0,5,1):
  for wt in [0,5,1):
    for ct in [0,32,16):
      # Intra-line (Cx16)
      for h in [0,1,1):
        for w in [0,1,1):
          for c in [0,16,16):

A concordant layout achieves the theoretical compute utilization of the dataflow.

16 of 34

“Discordant” Dataflow-Layout → Performance Slowdown

Layer 2 - Large HW

Layout: HWC_Wx2Cx8

Dataflow (WM Parallelism)

# Level-0 Tile
for ct in [0,8,1):
  for mt in [0,32,16):
    for ht in [0,56,1):
      for wt in [0,56,16):
        # Level-1 Tile
        for h in [0,1,1):
          for c in [0,1,1):
            parfor w in [0,16,16):
              parfor m in [0,16,4):

Layout (Channel Last)

# Inter-line (HWC)
for ht in [0,56,1):
  for wt in [0,56,2):
    for ct in [0,8,8):
      # Intra-line (Wx2Cx8)
      for h in [0,1,1):
        for w in [0,2,2):
          for c in [0,8,8):

Discordant: Bank Conflicts! (2 ports for 8 rows)

Dataflow (WM parallelism) gives 100% utilization but suffers from a 4✕ bank-conflict slowdown!
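The 4✕ figure follows from simple counting, sketched below under stated assumptions: all needed lines sit in a single 2-port bank, W = 56 and C = 8 as in the loop bounds, and one cycle serves at most two lines. The helper names (`line_hwc_wx2cx8`, `access_cycles`) are hypothetical.

```python
from math import ceil

W, C = 56, 8                    # Layer 2 extents, read off the loop bounds
W_PER_LINE, C_PER_LINE = 2, 8   # intra-line shape Wx2Cx8
PORTS = 2                       # assumed: one bank, two ports


def line_hwc_wx2cx8(h, w, c):
    """Buffer line holding element (h, w, c) under HWC_Wx2Cx8."""
    c_tiles = C // C_PER_LINE
    lines_per_row = (W // W_PER_LINE) * c_tiles
    return h * lines_per_row + (w // W_PER_LINE) * c_tiles + c // C_PER_LINE


def access_cycles(elements, line_fn, ports=PORTS):
    """Count distinct lines touched and the cycles needed to read them."""
    lines = {line_fn(*e) for e in elements}
    return len(lines), ceil(len(lines) / ports)


# WM-parallel dataflow: 16 consecutive w values for one (h, c) per cycle.
wm_access = [(0, w, 0) for w in range(16)]
lines_touched, cycles = access_cycles(wm_access, line_hwc_wx2cx8)
```

With these assumptions the 16 elements are spread over 8 lines, and 8 lines through 2 ports take 4 cycles instead of 1.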

17 of 34

Dataflow Switching with Data Layout Reordering

Dataflow (WM Parallelism)

# Level-0 Tile
for ct in [0,8,1):
  for mt in [0,32,16):
    for ht in [0,56,1):
      for wt in [0,56,16):
        # Level-1 Tile
        for h in [0,1,1):
          for c in [0,1,1):
            parfor w in [0,16,16):
              parfor m in [0,16,4):

Layout before reordering (Channel Last): HWC_Wx2Cx8 → Bank Conflict

# Inter-line (HWC)
for ht in [0,56,1):
  for wt in [0,56,2):
    for ct in [0,8,8):
      # Intra-line (Wx2Cx8)
      for h in [0,1,1):
        for w in [0,2,2):
          for c in [0,8,8):

Reordering → Change!

Layout after reordering (Row Major): HCW_Wx16 → Resolve!

# Inter-line (HCW)
for ht in [0,56,1):
  for ct in [0,8,1):
    for wt in [0,56,16):
      # Intra-line (Wx16)
      for h in [0,1,1):
        for c in [0,1,1):
          for w in [0,16,16):

Co-Switch (Dataflow, Layout) for optimal performance, requiring On-chip Reordering!
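A software-only sketch of what the reordering accomplishes is given below. It copies a small buffer from HWC_Wx2Cx8 to HCW_Wx16 and compares how many lines the WM-parallel access touches before and after. It is a functional model under assumed extents (W = 56, C = 8, two rows of H), not a description of how FEATHER performs the reorder in hardware; all function names are hypothetical.

```python
from math import ceil

W, C = 56, 8  # extents read off the loop bounds


def addr_hwc_wx2cx8(h, w, c):
    """(line, offset) under HWC_Wx2Cx8: each line holds 2 w values x 8 channels."""
    return h * (W // 2) + w // 2, (w % 2) * 8 + c


def addr_hcw_wx16(h, w, c):
    """(line, offset) under HCW_Wx16: each line holds 16 w values of one (h, c)."""
    lines_per_hc = ceil(W / 16)  # the last line of each (h, c) is partly empty
    return (h * C + c) * lines_per_hc + w // 16, w % 16


def reorder(src, rows):
    """Copy every element from the old layout to its slot in the new layout."""
    dst = {}
    for h in range(rows):
        for w in range(W):
            for c in range(C):
                s_line, s_off = addr_hwc_wx2cx8(h, w, c)
                d_line, d_off = addr_hcw_wx16(h, w, c)
                dst.setdefault(d_line, [None] * 16)[d_off] = src[s_line][s_off]
    return dst


ROWS = 2
src = {}
for h in range(ROWS):
    for w in range(W):
        for c in range(C):
            line, off = addr_hwc_wx2cx8(h, w, c)
            src.setdefault(line, [None] * 16)[off] = (h, w, c)  # tag data by coordinate

dst = reorder(src, ROWS)
lines_before = {addr_hwc_wx2cx8(0, w, 0)[0] for w in range(16)}
lines_after = {addr_hcw_wx16(0, w, 0)[0] for w in range(16)}
```

In this model the 16 elements needed per cycle move from 8 lines to a single line, which is why the bank conflict disappears once the layout follows the dataflow.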

18 of 34

Performance Impact of Dataflows considering layout

Fixed (Dataflow, Layout) Pair – No Global Optimal Dataflow (Error Bar)

Enable “Dataflow Switching” assuming ideal Layout – Theoretical 2✕ Speedup

“Dataflow Switching” under real layout → Bank conflict. Up to 128✕ slowdown!

Co-switching (dataflow, layout) per layer is crucial to obtain optimal performance.

Evaluation via Timeloop

19 of 34

Outline

Background and Motivation
Challenges and FEATHER Contributions
FEATHER - Switch Dataflows with Negligible Overhead
Evaluation
Summary

20 of 34

What Reordering is Needed and How to Implement it?

Co-Switch (Dataflow, Layout) for optimal performance, requiring On-chip Reordering!

What kind of reordering is needed? (Functionality)

How to implement reordering with minimal latency overhead? (Performance)

21 of 34

Functionality: Overview of Available Reordering Techniques

An SRAM bank has 2 ports, enabling access of up to 2 rows.

Reordering transforms the layout, thereby enabling finer-grained data access per cycle.

Prior reordering support: Google TPUv4 [9], MTIA [10]

(Figure labels: Data Provided Per Cycle; Concordant Dataflow Space; FEATHER)

Function: FEATHER picks Arbitrary Reordering to ensure bank-conflict-free access for all dataflows!

[9] Jouppi et al., ISCA 2023 [10] Firoozshahian et al., ISCA 2023

To understand the functionality requirement, let’s take an overview of all available reordering techniques.

[click] An SRAM bank has 2 ports, [click] so it can provide 2 rows of data per cycle.

[click] Reordering transforms the layout, thereby enabling finer-grained data access per cycle.

For example, [click] layout transpose [click] enables accessing 2 rows or 2 columns of data per cycle.

[click] Line rotate moves one row from one bank to another bank to [click] enable accessing 3 rows per cycle.

[click] In-row reorder changes the order of data within a single row.

[click] Arbitrary reorder enables arbitrary access to 8 data elements from a bank.

[click] However, the reordering techniques proposed by prior work support only a subset of dataflows without bank conflicts.

[click] Among all reordering techniques, arbitrary reordering is the only one that ensures zero bank conflicts for all dataflow choices!

[click] Therefore, FEATHER chooses arbitrary reordering.

22 of 34

Performance: Reordering Implementation Overview

Implementation Insight: RIR hides reordering latency behind reduction.

Prior art, e.g. TPUv4 [9] and MTIA [10], performs on-chip Reorder After Reduction (RAR).

RAR: Read Layout 1 → Compute & Reduction → Write Layout 1 → Reorder → Write Layout 2 (Extra Reorder Overhead)

RIR: Read Layout 1 → Compute & Reorder in Reduction (RIR) → Write Layout 2

[9] Jouppi et al., ISCA 2023 [10] Firoozshahian et al., ISCA 2023
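The difference between the two schemes can be illustrated with a toy functional model. In the sketch below, each output is the sum of a group of partial sums and `perm` gives the destination slot of each output in the new layout. The data, the permutation, and the write counter are invented for illustration, and counting buffer writes is only a crude stand-in for latency.

```python
def reorder_after_reduction(partial_sums, perm):
    """RAR: reduce into Layout 1, then run a separate pass that writes Layout 2."""
    writes = 0
    layout1 = []
    for group in partial_sums:            # compute & reduction
        layout1.append(sum(group))
        writes += 1                       # write Layout 1
    layout2 = [None] * len(layout1)
    for src, dst in enumerate(perm):      # extra reorder pass
        layout2[dst] = layout1[src]
        writes += 1                       # write Layout 2
    return layout2, writes


def reorder_in_reduction(partial_sums, perm):
    """RIR: each reduced value is steered to its final slot as it is produced."""
    writes = 0
    layout2 = [None] * len(partial_sums)
    for src, group in enumerate(partial_sums):
        layout2[perm[src]] = sum(group)   # reduce and place in one step
        writes += 1                       # write Layout 2 only
    return layout2, writes


partial_sums = [[1, 2], [3, 4], [5, 6], [7, 8]]
perm = [2, 0, 3, 1]                       # hypothetical target positions
rar_result, rar_writes = reorder_after_reduction(partial_sums, perm)
rir_result, rir_writes = reorder_in_reduction(partial_sums, perm)
```

Both functions yield the same final buffer; the RIR variant simply never materializes the intermediate Layout 1 copy.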

23 of 34

Overview of FEATHER Contributions

Functional: Arbitrary Reordering

Performance: Reorder in Reduction (RIR) with Negligible Reorder Latency

Layout Switching – Reconfigurable NoC (BIRRD)

Dataflow Switching – Scalable Reconfigurable 2D Compute Array (NEST)

What (Dataflow, Layout) to switch? – LayoutLoop: Co-Search Optimal (Dataflow, Layout) Per Layer

First Accelerator to Co-Switch (Dataflow, Layout) Per Layer!

24 of 34

Outline

Background and Motivation
Challenges and FEATHER Contributions
FEATHER - Switch Dataflows with Negligible Overhead
Evaluation
Summary

25 of 34

FEATHER Overview - Co-Switch (Dataflow, Layout) Per Layer

Read from StaB → Change Layout in Compute → Write to StaB (New Layout)
Co-Switch (Dataflow, Layout) per iteration of compute pipeline

NEST – Dataflow Flexibility; BIRRD – Layout Flexibility.

26 of 34

Full Dataflow Flexibility Powered by NEST

NEST Supports Flexible Dataflow via Three-level Arbitrary Reduction over PEs!

Level 1: Local Temporal Reduction

Level 2: Column-wise Spatial Reduction

Level 3: Global Temporal Reduction

More details in the paper!

27 of 34

Full Layout Flexibility Powered by BIRRD

N-in BIRRD has 2log(N) pipeline stages.
Each stage has N/2 2✕2 switches (Egg).

Arbitrary Reduction: accumulation of any selected input data.
Arbitrary Reorder: reorder accumulated data to any output ports.
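The sizing formulas and the two capabilities can be written down as a small behavioural model. The sketch below assumes the logarithm is base 2 and that N is a power of two. `reduce_and_reorder` is a hypothetical reference function that states what the network should output for a given routing; it does not model the individual 2✕2 switches or how they are configured.

```python
def birrd_size(n_inputs):
    """Return (stages, switches_per_stage, total_switches) from the slide's formulas.

    Assumes log base 2 and a power-of-two input count.
    """
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("n_inputs must be a power of two >= 2")
    stages = 2 * (n_inputs.bit_length() - 1)   # 2 * log2(N)
    switches_per_stage = n_inputs // 2
    return stages, switches_per_stage, stages * switches_per_stage


def reduce_and_reorder(inputs, routing):
    """Behavioural spec: routing maps an output port to the input indices it sums."""
    outputs = [None] * len(inputs)
    for port, sources in routing.items():
        outputs[port] = sum(inputs[i] for i in sources)   # arbitrary reduction
    return outputs                                        # placed on arbitrary ports


size_16 = birrd_size(16)
example = reduce_and_reorder(
    [1, 2, 3, 4, 5, 6, 7, 8],
    {5: [0, 1, 2, 3], 0: [4, 5], 3: [6], 6: [7]},
)
```

For N = 16 the formulas give 8 stages of 8 switches each; the example routing sums inputs 0 to 3 onto port 5 and inputs 4 and 5 onto port 0, and passes inputs 6 and 7 through to ports 3 and 6.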

28 of 34

Illustrations of BIRRD “Reordering in Reduction” Functionality

Col-wise Reorder: Arbitrary data shuffling among banks

Row-wise Reorder: Specifying different addresses per bank

→ Arbitrary Reorder Layout Flexibility

29 of 34

Outline

Background and Motivation
Challenges and FEATHER Contributions
FEATHER - Switch Dataflows with Negligible Overhead
Evaluation
Summary

30 of 34

Comparison to SoTA (On FPGA ZCU104 Board)

Results normalized to FEATHER’s value (normalized Throughput/PE):
2.65× higher than Xilinx DPU running on the ZCU104 FPGA for ResNet50.
3.91× higher than Gemmini running on AWS f1.2xlarge using FireSim.
4.56× higher than Edge TPU running on a USB accelerator attached to a Raspberry Pi 4B.

Takeaway: FEATHER achieves 2.65~4.56× higher throughput/PE because of flexible dataflows.

31 of 34

Comparison to SoTA (LayoutLoop)

Takeaway: FEATHER achieves 1.18~2.89× speedup by supporting low-cost arbitrary reordering.

2.89× speedup over NVDLA – No reordering, so it uses a suboptimal dataflow.
1.7× speedup over SIGMA-like – Off-chip reorder introduces extra latency.
1.18× speedup over Medusa – Line rotation reorders whole lines, leading to bank conflicts.
1.36× speedup over TPU-like – Column-wise access from Transpose is NOT sufficient.

We also model other candidates that lack an end-to-end deployment flow in LayoutLoop for comparison.

[click] FEATHER achieves 2.89× speedup over NVDLA,

[click] 1.7× speedup over SIGMA-like,

[click] 1.18× speedup over Medusa, and

[click] 1.36× speedup over TPU-like

because it supports low-cost arbitrary reordering.

[click] FEATHER achieves 2.89× speedup over NVDLA because NVDLA has no reordering support and therefore falls back to a suboptimal dataflow.

[click] 1.7× speedup over SIGMA-like because off-chip reordering introduces extra latency.

[click] 1.18× speedup over Medusa because line rotation reorders whole lines, which still leads to bank conflicts.

[click] 1.36× speedup over TPU-like because the column-wise access provided by Transpose is NOT sufficient.

Therefore, FEATHER achieves 1.18~2.89× speedup by supporting low-cost arbitrary reordering.

32 of 34

Resources Evaluation of BIRRD and FEATHER

16 ✕ 16 FEATHER (@1 GHz) uses only 44% of the area of SIGMA-256.

A 16-input BIRRD can do the job of a 256-input FAN → √N area saving!

The distribution NoC is simplified into point-to-point connections.

33 of 34

Outline

Background and Motivation
Challenges and FEATHER Contributions
FEATHER - Switch Dataflows with Negligible Overhead
Evaluation
Summary

34 of 34

Summary

FEATHER: first (dataflow, layout) co-switching architecture.

(Dataflow, layout) mismatch → bank conflict slowdown!

Implementation:

BIRRD NoC for layout switching
Functionality: Arbitrary Reordering to ensure zero bank conflict
Performance: Reorder in Reduction (RIR) to hide latency
NEST array for dataflow switching

Results:

2.65✕/3.91✕/4.56✕ higher throughput/PE than Xilinx DPU/Gemmini/Edge TPU
1.18~2.89✕ speedup and 1.3~6.43✕ higher energy efficiency than SotAs.
Area efficient: using only 44% area of SIGMA-256.
Peak clock frequency of 1.5 GHz (TSMC 28 nm)

Code

Thank you!

In summary, in this work, we find that

[click] a (dataflow, layout) mismatch can lead to significant slowdowns due to bank conflicts!

[click] We propose FEATHER, the first ASIC architecture to enable (dataflow, layout) co-switching.

[click] Layout flexibility is powered by BIRRD.

[click] Functionality-wise: BIRRD supports arbitrary reordering to ensure zero bank conflicts for all dataflows.

[click] Implementation-wise: BIRRD implements reordering in reduction to hide the reordering latency.

[click] To achieve dataflow flexibility, FEATHER proposes NEST.

[click] Our evaluations show that FEATHER delivers higher throughput,

[click] runs faster,

[click] is more resource efficient,

[click] and is scalable.

Our code is released.

With that, I’m happy to answer any questions.

[click] Thank you!

In summary, in this work, we find that

[click] a (dataflow, layout) mismatch can lead to a bank-conflict slowdown of up to 128×!

[click] We propose FEATHER, the first ASIC architecture to enable (dataflow, layout) co-switching.

[click] To achieve layout flexibility, FEATHER proposes the Butterfly Interconnect for Reorder in Reduction in Dataflow (BIRRD). BIRRD provides arbitrary reordering and hides the reordering latency behind reduction by reordering in reduction.

[click] To achieve dataflow flexibility, FEATHER proposes the Neural Engine with Spatial forwarding and Temporal reduction (NEST).

[click] Our evaluations show that FEATHER achieves 2.65×/3.91×/4.56× higher end-to-end throughput per PE than Xilinx DPU, Gemmini, and Edge TPU, respectively.

[click] Further, it achieves 1.18~2.89× latency speedup and 1.3~6.43× higher energy efficiency than SotAs.

[click] FEATHER is area/power efficient, using only 44% of the area of a SIGMA with the same 256 PEs.

[click] FEATHER achieves a 1.5 GHz peak clock under TSMC 28 nm without long-wire congestion.

Our code is released at the link in the QR code.

With that, I’m happy to answer any questions.