1 of 47

Toy CGRA Final Presentation

Charles Tsao and Po-Han Chen

2 of 47

Outline

  • Background
    • AHA Project Introduction
    • CGRA Architecture
  • Chip Bring-up
    • Top-level Chip Interface
    • SoC Test
    • Configuration Test
    • IO Test / Full Test
  • Measurement

3 of 47

Why Do We Need CGRA?

4 of 47

AHA: Agile Hardware Design Flow Using CGRA

  • CGRA generation: quickly creates a CGRA
  • Application mapping: accelerates dense linear algebra applications on the CGRA

5 of 47

CGRA Architecture

[Figure: CGRA tile array of processing element (P) and memory (M) tiles. Legend: Switch Box, Connection Box, Processing Element Core, Memory Core]


6 of 47

Processing Element Core


7 of 47

Memory Core

Dual-Port Wide-Fetch (2) SRAM

With Static Scheduling Controllers


8 of 47

Toy CGRA Architecture

configuration bitstream



10 of 47

Physical Design Results

Gate Count: 848k (NAND-2)
Memory: 8 KB dp-SRAM
Frequency: 50 MHz
Power (Leakage): 0.08 mW

11 of 47

Top-Level Block Diagram

User Project Area

  • 34 GPIO pins can be configured as input/output in mgmt C code

  • 32 bits of config data are driven by wishbone
  • 32 bits of config address are driven by wishbone
  • CGRA stall is driven by wishbone

  • Input clk and rst can be configured to come from IO or WB using the logic analyzer port
  • Readback of config data or debug messages is driven by wishbone

12 of 47

Top-Level WB, LA Connections

wbs_req = wbs_cyc_i && wbs_stb_i

wbs_wen = !wbs_ack_o && wbs_we_i && wbs_req

wbs_ren = !wbs_ack_o && !wbs_we_i && wbs_req

is_addr = wbs_adr_i == ADDR

is_wdata = wbs_adr_i == WRITE_DATA

is_rdata = wbs_adr_i == READ_DATA

stall = wbs_adr_i == STALL
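
The decode equations above can be modeled in a few lines of Python to check which register strobe fires for a given bus cycle. This is only a behavioral sketch: the register offsets are hypothetical (the slides name ADDR, WRITE_DATA, READ_DATA and STALL but do not give their values), and the strobe names `addr_we`, `wdata_we` and `rdata_re` are illustrative labels for the conditions `wbs_wen && is_addr`, `wbs_wen && is_wdata` and `wbs_ren && is_rdata`.

```python
# Hypothetical register offsets; the slides do not give the real values.
ADDR, WRITE_DATA, READ_DATA, STALL = 0x0, 0x4, 0x8, 0xC


def wb_decode(wbs_cyc_i, wbs_stb_i, wbs_we_i, wbs_ack_o, wbs_adr_i):
    """Evaluate the slide's combinational equations for one bus cycle."""
    wbs_req = bool(wbs_cyc_i and wbs_stb_i)
    wbs_wen = (not wbs_ack_o) and bool(wbs_we_i) and wbs_req
    wbs_ren = (not wbs_ack_o) and (not wbs_we_i) and wbs_req
    is_addr = wbs_adr_i == ADDR
    is_wdata = wbs_adr_i == WRITE_DATA
    is_rdata = wbs_adr_i == READ_DATA
    stall = wbs_adr_i == STALL
    return {
        "wbs_req": wbs_req,
        "wbs_wen": wbs_wen,
        "wbs_ren": wbs_ren,
        "stall": stall,
        # Illustrative strobe names for the conditions drawn on the diagram.
        "addr_we": wbs_wen and is_addr,
        "wdata_we": wbs_wen and is_wdata,
        "rdata_re": wbs_ren and is_rdata,
    }


# A write to WRITE_DATA before the acknowledge has been raised.
print(wb_decode(True, True, True, False, WRITE_DATA)["wdata_we"])
```

Because `wbs_wen` and `wbs_ren` are gated by `!wbs_ack_o`, each strobe is asserted only until the acknowledge goes high, so a single bus transaction does not write a register twice.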

User Project Area

[Block diagram: Wishbone register interface to the CGRA. Labeled signals:]

  • Registers: addr_reg, wdata_reg, rdata_reg, stall_reg, wen_reg, ren_reg
  • Conditions: wbs_wen && is_addr, wbs_wen && is_wdata, wbs_ren && is_rdata, wbs_wen && is_wen, wbs_ren && is_ren
  • Wishbone side: wbs_dat_i, wbs_dat_i[3:0], wbs_dat_i[0], wbs_dat_o, wbs_rst_i, wbs_ack_o (Output Acknowledge FSM)
  • CGRA side: config_addr, config_data, config_wen, config_ren, read_config_data, stall
  • Reset: rst, io_in[35]


13 of 47

PicoRV32 Software Interface

  • Using caravel board firmware from efabless to control mgmt area with C code
  • Cross-compiled and flashed onto PicoRV32


14 of 47

Testing Setup

  • RISC-V GNU toolchain, PicoRV32 firmware, and caravel_board firmware installed on Ubuntu Linux

  • Board is powered and flashed from micro-USB

  • io[6] used for outputting to serial display

  • Chip frequently stopped working
    • Mgmt SoC would hang when trying to communicate with CGRA
    • Revived by brushing with a paintbrush and heating with a hairdryer
    • Suspected cause: a hold time violation (unconfirmed)


15 of 47

SoC Bringup Test

  • Simple C test to gain familiarity with system
  • Prints a message to a serial display in a while loop


16 of 47

Configuration Test

  • Successfully tested write to all configuration registers and write to SRAM macro

  • Test spans 6 different configurations because the PicoRV32 has only 2kb of memory
    • 1000 registers each

  • Not all 32 bits of a register are always used
  • The used bit range of each register is given by hi and lo
    • Encode on write
    • Decode on read
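
The hi/lo encode and decode steps can be sketched as plain bit-field arithmetic. This is a minimal illustration, not the test firmware itself (which ran as C on the PicoRV32); the function names and the example field positions are assumptions.

```python
def encode_field(value, hi, lo):
    """Place value into bits [hi:lo] of a 32-bit config word (write path)."""
    width = hi - lo + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in bits [hi:lo]")
    return (value << lo) & 0xFFFFFFFF


def decode_field(word, hi, lo):
    """Extract bits [hi:lo] from a 32-bit word read back from the CGRA."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def readback_matches(written, read_word, hi, lo):
    """Compare only the used bits, ignoring whatever the unused bits return."""
    return decode_field(read_word, hi, lo) == written


# Hypothetical 3-bit field occupying bits [6:4].
word = encode_field(0x5, hi=6, lo=4)
print(hex(word), readback_matches(0x5, word | 0xF0000000, hi=6, lo=4))
```

Masking on read is what lets a read-back test pass even when the unused bits of a register do not return what was written.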


17 of 47

IO Test + Full Test

  • FPGA used to toggle GPIO
  • 2 PMODs used for toggling 10 IO signals for initial test

  • LPC FMC breakout board used to toggle all 34 IO signals

  • Currently mgmt area hangs when trying to communicate with CGRA

  • Full test: configured the conv_3_3 bitstream, but could not verify the output because of the hang above


18 of 47

Future Work

Test Name

  • Basic Board Tests
    • Sanity check power rails
    • Sanity check clocks
    • PicoRV32 operational tests
    • Test Flash
  • Wishbone Tests
    • R/W wishbone address space
    • Test loading bitstream using wishbone
  • FPGA
    • Figure out how to configure FPGA to send IO to CGRA
    • Test R/W of IO through FPGA; short input to output to ensure data is written correctly
    • Configure output IO pins to come from a single tile; ensure each tile processes its data in -> out as expected
  • Application tests
    • Test addone application
    • Test small gaussian application
    • Test [more complex applications]
    • Measure power consumption
    • Measure performance


19 of 47

Limitations

  • Array is small, can only run tiny applications
  • Maximum clock frequency of 50 MHz limited by GPIO
    • should have implemented a global buffer
  • Configuration depends on harness SoC
    • should have created a bypass path
  • Testing requires significant effort in setting up FPGA and application toolchain
    • should have built in a pixel-generating circuit to enable self-testing


20 of 47

Key Contributions and Takeaways

  • First chip that validates the technology portability of agile hardware design flow

  • Can accelerate more than one application on this chip

  • Helped improve open-source toolchains by debugging with efabless

  • Learned a lot, but slow iteration due to lack of experience and limited time

  • Should’ve created backup methods for testing

Enabled following year’s EE372 students to suffer less :’)


21 of 47

Toy CGRA Backup Slides

Charles Tsao and Po-Han Chen

22 of 47

Application Flow

[Flow diagram:]

  • Software: tbg.py / fault → Interconnect_tb.sv → garnet.v → xcelium → Output comparison
  • Hardware: Caravel SoC (FPGA, C mgmt) → CGRA → Output comparison


23 of 47

Physical Design Results (PE)

Gate Count: 14k (NAND-2)
Memory: -
Frequency: 50 MHz
Power (Leakage): 21.5 nW

Layout dimensions: 650 µm, 180 µm

24 of 47

Physical Design Results (MEM)

Gate Count: 50k (NAND-2)
Memory: 1 KB dp-SRAM
Frequency: 50 MHz
Power (Leakage): 80 + 9615 nW

Layout dimensions: 650 µm, 800 µm
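
As a sanity check, the per-tile PE and MEM results can be rolled up and compared with the chip-level Physical Design Results (848k gates, 8 KB dp-SRAM, 0.08 mW leakage). The sketch below assumes the 32 PE + 8 MEM array given on the chip spec slide, and assumes the two MEM leakage terms simply add; neither assumption is stated on these two slides.

```python
# Assumed tile counts, taken from the "4 x 10 (32 PE + 8 MEM)" spec slide.
PE_TILES, MEM_TILES = 32, 8

# Per-tile figures from the PE and MEM physical design slides.
pe_gates_k, mem_gates_k = 14, 50
pe_leak_nw = 21.5
mem_leak_nw = 80 + 9615  # assumed to be two additive leakage terms
mem_sram_kb = 1

total_gates_k = PE_TILES * pe_gates_k + MEM_TILES * mem_gates_k
total_sram_kb = MEM_TILES * mem_sram_kb
total_leak_mw = (PE_TILES * pe_leak_nw + MEM_TILES * mem_leak_nw) / 1e6

print(total_gates_k, "k gates")
print(total_sram_kb, "KB SRAM")
print(round(total_leak_mw, 3), "mW leakage")
```

Under these assumptions the gate count and SRAM size match the chip-level numbers exactly, and the leakage rounds to the reported 0.08 mW, with almost all of it coming from the MEM tiles.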

25 of 47

Physical Design Results (Whole array layout)

SB in

SB out

passthrough

26 of 47

Physical Design Results (Whole array)

27 of 47

Outline

  • Motivation, CGRA Introduction
  • Block Diagrams
  • High-level DSLs
  • Interconnects
  • PE Tile Architecture
    • Area and performance estimates
  • Memory Tile Architecture
    • Area and performance estimates
  • Application Dataflow
  • Verification and Evaluation
  • Future Tasks

28 of 47

Motivation

General purpose processor versus Specialized accelerator

[Figure: spectrum from General Purpose Processor to Specialized Accelerator, axis: Energy Efficiency. Points: CPU (AMD Ryzen 5800X), FPGA (Xilinx Virtex-7), CGRA (Stanford AHA Garnet), ASIC (EE272A Resnet-18 Accelerator)]


29 of 47

Motivation

  • CGRA that can accelerate a variety of image processing applications
  • Specifically, Convolutional Neural Networks

[Figure: spectrum from General Purpose Processor to Specialized Accelerator, axis: Energy Efficiency. Points: CPU (AMD Ryzen 5800X), FPGA (Xilinx Virtex-7), CGRA (Stanford AHA Garnet; EE272B Toy CGRA), ASIC (EE272A Resnet-18 Accelerator)]


30 of 47

General CGRA Architecture

  • Tile_PE
    • PE Unit
    • Interconnects
      • Connection Boxes
      • Switch Boxes
  • Tile_MemCore
    • Memory Units (SRAM, reg)
    • Interconnects
      • Connection Boxes
      • Switch Boxes
  • Tile_IO_core
    • Connects 16 bit and 1 bit inputs to leftmost tiles

[Figure: CGRA array of X Tiles by Y Tiles, built from Tile_PE (PE) and Tile_MemCore (MEM) tiles, with IO tiles]


31 of 47

High Level DSLs

  • Our CGRA hardware is implemented using 3 pythonic hardware DSLs:
    • PEak for PEs
    • Lake for Memories
    • Canal for Interconnects

  • Each specification serves as a single source of truth
    • Verify design functionally
    • Verify design hardware
    • Generate rewrite rules from CoreIR for mapper

[Figure: DSL stack]

  • High-Level DSLs: PEak (PE Generator), Lake (Memory Generator), Canal (Interconnect Generator)
  • Low-Level DSLs: Fault (HVL), Magma (HDL), CoreIR
  • Output: CGRA Verilog


32 of 47

High Level DSLs

  • Our CGRA hardware is implemented using 3 hardware DSLs:
    • PEak for PEs
    • Lake for Memories
    • Canal for Interconnects

  • Each specification serves as a single source of truth
    • Verify design functionally
    • Verify design hardware
    • Generate rewrite rules from CoreIR for mapper

  • Halide: high-level language for applications to map and run on CGRA


33 of 47

Toy CGRA Architecture

IO

IO


34 of 47

Top-Level Block Diagram

User Project Area

  • 34 input and output IO tiles are driven by the 38 GPIO pads
  • 32 bits of config data are driven by wishbone
  • 32 bits of config address are driven by wishbone
  • Stall is driven by remaining IO or logic analyzer ports
  • Readback of config data for debugging is driven by wishbone

35 of 47

Interconnects

  • Connection Boxes feed inputs into the PE
    • 5 connection box modules in Tile_PE and Tile_MemCore
      • Three 1 bit inputs
      • Two 16 bit inputs
  • Switch Boxes route outputs to CB or out of accelerator
    • 2 switch box modules in Tile_PE and Tile_MemCore
      • 1 bit
      • 16 bit

  • Routes are configured through bitstream (which is input into configuration ports)
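
A connection box can be thought of as a configurable multiplexer: the bitstream sets a select value, and the box forwards one routing track to a core input. The sketch below is a behavioral illustration only; the class name, the number of tracks, and the way the select is stored are assumptions, not the Canal-generated hardware.

```python
class ConnectionBox:
    """Behavioral model: a mux whose select comes from a config register."""

    def __init__(self, width, num_tracks):
        self.mask = (1 << width) - 1   # 1-bit or 16-bit data path
        self.num_tracks = num_tracks   # hypothetical track count
        self.select = 0                # reset value assumed to be 0

    def configure(self, select):
        """Store the select value, as a bitstream write would."""
        if not 0 <= select < self.num_tracks:
            raise ValueError("select out of range")
        self.select = select

    def route(self, tracks):
        """Forward the selected track to the core input."""
        return tracks[self.select] & self.mask


# A 16-bit connection box choosing track 2 of 4.
cb = ConnectionBox(width=16, num_tracks=4)
cb.configure(2)
print(hex(cb.route([0x1111, 0x2222, 0x3333, 0x4444])))
```

A switch box can be modeled the same way, with one such mux per outgoing track.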


36 of 47

Baseline PE Tile

4 tiles

10 tiles


37 of 47

PE Architecture (with LUT)

  • Input data comes from two 16b CBs
  • Bit input data comes from three 1b CBs
  • Muxes in green can be configured for direct connection or pipelined input
  • Cond used for condition codes
    • (e.g. zero, overflow, negative, etc.)
  • Lookup table (LUT), as in an FPGA

  • Output goes to switch box
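
With three 1-bit inputs, an FPGA-style lookup table needs 2^3 = 8 configuration bits, one per input combination. The sketch below assumes that layout and an input ordering with the first bit input as the least significant index bit; the actual PEak LUT encoding may differ.

```python
def lut3(table, b0, b1, b2):
    """Return the configured output bit for inputs (b0, b1, b2).

    table is an 8-bit truth table; bit i holds the output for index i,
    where i = b2*4 + b1*2 + b0 (an assumed ordering).
    """
    index = (b2 << 2) | (b1 << 1) | b0
    return (table >> index) & 1


AND3 = 0x80  # only index 7 (all inputs high) outputs 1
XOR3 = 0x96  # outputs 1 when an odd number of inputs are high

print(lut3(AND3, 1, 1, 1), lut3(XOR3, 1, 1, 0))
```

Changing the 8-bit table reprograms the function without changing any wiring, which is what lets the same PE implement different bit-level operations per application.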


38 of 47

PE Architecture (optimized)

  • Input data comes from two 16b CBs
  • Bit input data comes from three 1b CBs
  • Muxes in green can be configured for direct connection or pipelined input
  • Cond used for condition codes
    • (e.g. zero, overflow, negative, etc.)

  • Lookup table (LUT), as in an FPGA

  • Output goes to switch box


39 of 47

PE Tile - Area Distribution

PE Tile
  Speed: 83 MHz (12 ns)
  Area: 0.13 mm²

Main components:

  • PE
  • Switch Boxes


40 of 47

Memory Tile

  • Programmable access patterns
  • High-level for-loop description → low-level access address/schedule
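
The step from a for-loop description to a stream of addresses can be illustrated with an affine address generator: each loop level contributes its index times a stride. This is a generic sketch of the idea, not the Lake controller's actual configuration format; the function name and the extents/strides parameters are assumptions.

```python
from itertools import product


def affine_addresses(extents, strides, offset=0):
    """Yield the addresses visited by a loop nest.

    extents and strides are listed outermost loop first; each address is
    offset + sum(index[k] * strides[k]).
    """
    for index in product(*(range(e) for e in extents)):
        yield offset + sum(i * s for i, s in zip(index, strides))


# A 2-row by 3-column window read from a row-major image that is 8 wide:
#   for y in range(2):
#       for x in range(3):
#           read(mem[y * 8 + x])
print(list(affine_addresses(extents=(2, 3), strides=(8, 1))))
```

Because the whole pattern is determined by a handful of integers, it can be fixed at compile time and loaded through the configuration bitstream, which fits the static scheduling described for the memory core.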

Compiler

Tool Chain

Configuration

Bitstream

MEM



42 of 47

Memory Tile - Architecture


43 of 47

Memory Tile - Wide Memory

  • Wide memory → reduce memory access energy (nJ/bit)


44 of 47

Memory Tile - Area Distribution

Memory Tile
  Speed: 83 MHz (12 ns)
  Area: 0.6 mm²

Main components:

  • Address/Scheduling Controls
  • Config Registers
  • 2 Dual-port Memory Macros


45 of 47

Chip Physical Layout and Spec

Technology: SkyWater 130 nm
Frequency: 80 MHz
Memory Size: 16 kB
Area available: 10.27 mm²
Area estimated: 9.2 mm²
Array Size: 4 x 10 (32 PE + 8 MEM)
PE area: 0.13 x 32 = 4.27 mm²
Mem Area: 0.6 x 8 = 4.8 mm²
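
The area budget can be re-derived from the per-tile figures as a quick check. Using the rounded 0.13 mm² PE tile area gives 4.16 mm² for 32 PEs, slightly below the 4.27 mm² listed; a plausible explanation is that the listed total used an unrounded per-tile area, but that is a guess. The variable names below are illustrative.

```python
# Per-tile areas (mm^2) and tile counts from the spec.
pe_area, mem_area = 0.13, 0.6
n_pe, n_mem = 32, 8
area_available = 10.27
area_estimated = 9.2

pe_total = pe_area * n_pe        # from the rounded per-tile figure
mem_total = mem_area * n_mem
tile_total = pe_total + mem_total
utilization = tile_total / area_available

print(f"PE total:   {pe_total:.2f} mm^2")
print(f"MEM total:  {mem_total:.2f} mm^2")
print(f"Tile total: {tile_total:.2f} mm^2")
print(f"Share of available area: {utilization:.0%}")
```

Either way, the tile area stays below both the 9.2 mm² estimate and the 10.27 mm² available, leaving the remainder for routing and the wishbone interface logic.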


46 of 47

Application Flow

[Flow diagram:]

  • Software: tbg.py / fault → Interconnect_tb.sv → garnet.v → xcelium → Output comparison
  • Hardware: Wishbone Configuration through PicoRV32 → CGRA (Toggle IO w/ FPGA) → Output comparison


47 of 47

Verification

  • Unit Tests
    • PE Tile
    • Memory Tile
  • Applications (Verified)
    • Pointwise Brightening
    • Gaussian Blurring
    • 2D convolution of various kernel sizes
      • 1x1, 2x2, 3x3...
  • Applications (Under Construction)
    • Resnet
    • Mobilenet
    • VGG
  • Testability
    • Config circuitry supports read-back check
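
For the output-comparison step, application results are checked against a software reference. A minimal golden model for the 2D convolution tests might look like the sketch below. It is an illustration in plain Python, not the project's actual test bench, and it does not model the 16-bit data path (no wrap-around or saturation).

```python
def conv2d(image, kernel):
    """Valid-mode 2D sliding-window sum of products (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def outputs_match(expected, actual):
    """Element-wise comparison of reference and hardware output."""
    return expected == actual


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(conv2d(image, [[1, 1], [1, 1]]))
```

The same function covers the 1x1, 2x2 and 3x3 kernel sizes by changing only the kernel argument.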
