Toy CGRA Final Presentation
Charles Tsao and Po-Han Chen
Outline
Why Do We Need CGRA?
AHA: Agile Hardware Design Flow Using CGRA
CGRA Architecture
[Figure: CGRA tile array made of PE tiles (P) and memory tiles (M)]
Switch Box
Connection Box
Processing Element Core
Memory Core
Processing Element Core
Memory Core
Dual-Port Wide-Fetch (2) SRAM
With Static Scheduling Controllers
Toy CGRA Architecture
configuration bitstream
Physical Design Results
Gate Count | 848k (NAND-2) |
Memory | 8 KB dp-SRAM |
Frequency | 50 MHz |
Power (Leakage) | 0.08 mW |
Top-Level Block Diagram
reference: https://github.com/efabless/caravel
User Project Area
Top-Level WB, LA Connections
wbs_req = wbs_cyc_i && wbs_stb_i
wbs_wen = !wbs_ack_o && wbs_we_i && wbs_req
wbs_ren = !wbs_ack_o && !wbs_we_i && wbs_req
is_addr = wbs_adr_i == ADDR
is_wdata = wbs_adr_i == WRITE_DATA
is_rdata = wbs_adr_i == READ_DATA
stall = wbs_adr_i == STALL
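The strobe and address-match equations above can be sketched as a small behavioral model. This is only an illustration: the numeric register addresses are hypothetical (the slide names ADDR, WRITE_DATA, READ_DATA, and STALL but gives no values), and the `WbInputs` container and `decode` function are names introduced here.

```python
from dataclasses import dataclass

# Hypothetical addresses: the slide names these registers but does not give values.
ADDR, WRITE_DATA, READ_DATA, STALL = 0x3000_0000, 0x3000_0004, 0x3000_0008, 0x3000_000C


@dataclass
class WbInputs:
    """Wishbone slave signals sampled in one cycle (illustrative container)."""
    wbs_cyc_i: bool
    wbs_stb_i: bool
    wbs_we_i: bool
    wbs_ack_o: bool
    wbs_adr_i: int


def decode(wb: WbInputs) -> dict:
    """Evaluate the combinational equations from the slide for one cycle."""
    wbs_req = wb.wbs_cyc_i and wb.wbs_stb_i
    # Writes and reads are only issued while the transfer is not yet acknowledged.
    wbs_wen = (not wb.wbs_ack_o) and wb.wbs_we_i and wbs_req
    wbs_ren = (not wb.wbs_ack_o) and (not wb.wbs_we_i) and wbs_req
    return {
        "wbs_req": wbs_req,
        "wbs_wen": wbs_wen,
        "wbs_ren": wbs_ren,
        "is_addr": wb.wbs_adr_i == ADDR,
        "is_wdata": wb.wbs_adr_i == WRITE_DATA,
        "is_rdata": wb.wbs_adr_i == READ_DATA,
        "stall": wb.wbs_adr_i == STALL,
    }


# Example: an unacknowledged write to ADDR asserts wbs_wen and is_addr.
signals = decode(WbInputs(True, True, True, False, ADDR))
print(signals["wbs_wen"], signals["is_addr"])
```

Gating on `!wbs_ack_o` means each Wishbone transfer produces a single write or read strobe, even if the master holds cyc/stb for more than one cycle.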
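User Project Area
[Block diagram: Wishbone registers in front of the CGRA configuration port. Labels recovered from the figure:]
Registers: addr_reg, wdata_reg, rdata_reg, stall_reg, wen_reg, ren_reg
Register inputs: wbs_dat_i (addr_reg, wdata_reg), wbs_dat_i[3:0] (stall_reg), wbs_dat_i[0] (wen_reg, ren_reg)
Enables: wbs_wen && is_addr, wbs_wen && is_wdata, wbs_wen && is_wen, wbs_ren && is_rdata, wbs_ren && is_ren, stall
CGRA ports: config_addr, config_data, config_wen, config_ren, read_config_data, stall, rst
Other: wbs_dat_o, wbs_ack_o (Output Acknowledge FSM), rst (io_in[35], wbs_rst_i)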
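One plausible reading of the register diagram is that software loads a bitstream by writing the address register, then the data register, then pulsing the write-enable register. The sketch below models that sequence under stated assumptions: the register offsets, the `ConfigPortModel` class, the `load_bitstream` helper, and the choice to stall the array during configuration are all hypothetical and not taken from the design.

```python
# Hypothetical offsets: the slides name the registers but give no addresses.
ADDR, WRITE_DATA, STALL, WEN = 0x0, 0x4, 0x8, 0xC


class ConfigPortModel:
    """Behavioral stand-in for the Wishbone-facing configuration registers."""

    def __init__(self):
        self.addr_reg = 0
        self.wdata_reg = 0
        self.stall_reg = 0
        self.wen_reg = 0
        self.cgra_config = {}  # stand-in for the CGRA's configuration state

    def wb_write(self, adr, dat):
        if adr == ADDR:
            self.addr_reg = dat & 0xFFFF_FFFF
        elif adr == WRITE_DATA:
            self.wdata_reg = dat & 0xFFFF_FFFF
        elif adr == STALL:
            self.stall_reg = dat & 0xF  # wbs_dat_i[3:0]
        elif adr == WEN:
            self.wen_reg = dat & 0x1  # wbs_dat_i[0]
            if self.wen_reg:
                # Assumption: config_wen commits config_data at config_addr.
                self.cgra_config[self.addr_reg] = self.wdata_reg


def load_bitstream(port, bitstream):
    """Write each (address, data) pair through the register interface."""
    port.wb_write(STALL, 0xF)  # assumption: hold the array stalled while configuring
    for addr, data in bitstream:
        port.wb_write(ADDR, addr)
        port.wb_write(WRITE_DATA, data)
        port.wb_write(WEN, 1)
        port.wb_write(WEN, 0)
    port.wb_write(STALL, 0x0)


port = ConfigPortModel()
load_bitstream(port, [(0x10, 0xAB), (0x20, 0xCD)])
print(port.cgra_config)
```

The read path (ren_reg, rdata_reg, config_ren) is left out because the figure labels alone do not pin down its timing.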
PicoRV32 Software Interface
Testing Setup
SoC Bringup Test
Configuration Test
IO Test + Full Test
Future Work
Category | Test Name |
Basic Board Tests | Sanity check power rails |
Basic Board Tests | Sanity check clocks |
Basic Board Tests | PicoRV32 operational tests |
Basic Board Tests | Test flash |
Wishbone Tests | Read/write the Wishbone address space |
Wishbone Tests | Test loading a bitstream over Wishbone |
FPGA | Figure out how to configure the FPGA to send IO to the CGRA |
FPGA | Test R/W of IO through the FPGA; short input to output to ensure data is written correctly |
FPGA | Configure the output IO pins to come from a single tile; ensure each tile processes its data in -> out as expected |
Application Tests | Test addone application |
Application Tests | Test small Gaussian application |
Application Tests | Test [more complex applications] |
Application Tests | Measure power consumption |
Application Tests | Measure performance |
Limitations
Key Contributions and Takeaways
Enabled following year’s EE372 students to suffer less :’)
Toy CGRA Backup Slides
Charles Tsao and Po-Han Chen
Application Flow
Software
Hardware
tbg.py / fault
Caravel SoC (FPGA, C mgmt)
Interconnect_tb.sv
garnet.v
xcelium
Output comparison
CGRA
Output comparison
Physical Design Results (PE)
Gate Count | 14k (NAND-2) |
Memory | - |
Frequency | 50 MHz |
Power (Leakage) | 21.5 nW |
Tile dimensions: 650 µm x 180 µm
Physical Design Results (MEM)
Gate Count | 50k (NAND-2) |
Memory | 1 KB dp-SRAM |
Frequency | 50 MHz |
Power (Leakage) | 80 + 9615 nW |
Tile dimensions: 650 µm x 800 µm
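As a cross-check, the per-tile PE and MEM figures can be rolled up to the whole-array numbers. The sketch assumes the 4 x 10 array of 32 PE and 8 MEM tiles listed on the chip spec slide, and assumes the two MEM leakage terms (80 + 9615 nW) simply add; the variable names are illustrative.

```python
# Per-tile figures from the PE and MEM physical design tables.
PE = {"gates_k": 14, "leak_nw": 21.5, "sram_kb": 0}
MEM = {"gates_k": 50, "leak_nw": 80 + 9615, "sram_kb": 1}

# Assumed tile counts (32 PE + 8 MEM), from the array size on the chip spec slide.
N_PE, N_MEM = 32, 8


def rollup(key):
    """Sum a per-tile quantity over all PE and MEM tiles."""
    return N_PE * PE[key] + N_MEM * MEM[key]


gates_k = rollup("gates_k")        # thousands of NAND-2 equivalents
sram_kb = rollup("sram_kb")        # KB of dual-port SRAM
leak_mw = rollup("leak_nw") / 1e6  # nW -> mW

print(f"gates: {gates_k}k, SRAM: {sram_kb} KB, leakage: {leak_mw:.2f} mW")
```

Under these assumptions the totals agree with the top-level table: 848k gates, 8 KB of dp-SRAM, and about 0.08 mW of leakage.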
Physical Design Results (Whole array layout)
SB in
SB out
passthrough
Physical Design Results (Whole array)
Outline
Motivation
General purpose processor versus specialized accelerator
[Figure: energy-efficiency spectrum between general-purpose processors and specialized accelerators. Examples shown: CPU (AMD Ryzen 5800X), FPGA (Xilinx Virtex-7), CGRA (Stanford AHA Garnet; EE272B Toy CGRA), ASIC (EE272A Resnet-18 Accelerator).]
General CGRA Architecture
[Figure: CGRA array of Tile_PE and Tile_MemCore tiles, X Tiles wide by Y Tiles tall, with IO tiles]
High-Level DSLs
High-level DSLs:
PEak: PE Generator
Lake: Memory Generator
Canal: Interconnect Generator
Low-level DSLs:
Fault (HVL)
Magma (HDL)
CoreIR
Generated output: CGRA Verilog
Toy CGRA Architecture
IO
IO
Interconnects
Baseline PE Tile
4 tiles
10 tiles
PE Architecture (with LUT)
PE Architecture (optimized)
PE Tile - Area Distribution
PE Tile
Speed | 83 MHz (12 ns) |
Area | 0.13 mm² |
Main components:
Memory Tile
Compiler
Tool Chain
Configuration
Bitstream
MEM
Memory Tile - Architecture
Memory Tile - Wide Memory
Memory Tile - Area Distribution
Memory Tile
Speed | 83 MHz (12 ns) |
Area | 0.6 mm² |
Main components:
Chip Physical Layout and Spec
Technology | SkyWater 130 nm |
Frequency | 80 MHz |
Memory Size | 16 kB |
Area available | 10.27 mm² |
Area estimated | 9.2 mm² |
Array Size | 4 x 10 (32 PE + 8 MEM) |
PE area | ≈0.13 x 32 = 4.27 mm² |
MEM area | 0.6 x 8 = 4.8 mm² |
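The area budget can be checked with a few lines of arithmetic. This is a sketch using only the rounded per-tile areas quoted on the slides; with 0.13 mm² per PE tile the product comes to 4.16 mm², so the 4.27 mm² figure presumably comes from an unrounded per-tile area (an assumption, since the slides do not say). The constant names are illustrative.

```python
# Rounded per-tile areas and tile counts as quoted on the slides.
PE_TILE_MM2 = 0.13
MEM_TILE_MM2 = 0.6
N_PE, N_MEM = 32, 8
AVAILABLE_MM2 = 10.27

pe_area = PE_TILE_MM2 * N_PE      # total PE tile area
mem_area = MEM_TILE_MM2 * N_MEM   # total MEM tile area
total = pe_area + mem_area        # tile area only, no routing or IO overhead
utilization = total / AVAILABLE_MM2

print(f"PE: {pe_area:.2f} mm2, MEM: {mem_area:.2f} mm2, "
      f"total: {total:.2f} mm2 ({utilization:.0%} of available)")
```

Either way, the tile area stays below the 10.27 mm² available in the user project area.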
Application Flow
Software
tbg.py / fault
Hardware
Wishbone Configuration through PicoRV32
Interconnect_tb.sv
garnet.v
xcelium
Output comparison
CGRA
Output comparison
Toggle IO w/ FPGA
Verification