1 of 26

Chiplets for HPC

Open Chiplet Economy

OCP Sponsored Tutorial

Chiplet Summit Feb 6th 2024

1:00 pm to 5:00 pm

Santa Clara, California, USA

George Michelogiannakis, LBNL

Material credit: John Shalf, LBNL

2 of 26

HPC’s Future if we Don’t Change Course

Connect. Collaborate. Accelerate.

3 of 26

Specialization is Nature’s Way

Powerful General Purpose: Xeon, Power

Many Lighter Weight (post-Dennard scarcity): Intel KNL, AMD, Cavium/Marvell, GPUs

Many Different Specialized (post-Moore scarcity): Apple, Google, Amazon, AWS

4 of 26

We Have to Understand The Market

Follow the money

5 of 26

Domain-Specific Compute Driven by Hyperscalers

Neil Thompson

6 of 26

Opportunity for HPC: New Economic Model

An open chiplets marketplace is forming (ODSA and UCIe)

    • Licensable IP and third-party assembly lower the barrier to entry
    • Leverage the economic model being created by hyperscalers

Leverage this baseline and extend to support HPC

    • Smaller incremental cost for HPC to “play”
    • HPC has become “too small to attack the city”

80:20 Rule: Focus open efforts on what uniquely benefits HPC

    • Build up a library of reusable accelerators for HPC.
    • Interoperability for sustainability: Interoperate with commercial IP where it exists and focus open efforts on the 20% that doesn’t make commercial sense to license

Mark Seager 2010

7 of 26

Architecture Specialization for Science

Materials

Density Functional Theory (DFT)

Use O(n) algorithm

Dominated by FFTs

FPGA or ASIC

CryoEM Accelerator

LBNL detector

750 GB / sec

Custom ASIC near detector

Genomics Accelerator

String matching

Hashing

2-8 bit (ACTG)

FPGA
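
The string-matching, hashing, and narrow-datatype points above can be illustrated with a minimal Python sketch. It assumes a 2-bit code per base (the specific A/C/G/T bit assignment and all function names are hypothetical, chosen for illustration) and uses the packed integer as a hash key for exact k-mer lookup. It is a software illustration of the data representation, not the accelerator design itself.

```python
# Hypothetical 2-bit code per base; the bit assignment is an assumption.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_kmer(kmer):
    """Pack a string of A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack_kmer(value, k):
    """Recover the k-base string from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def build_kmer_index(sequence, k):
    """Map each packed k-mer to the list of positions where it starts."""
    index = {}
    for pos in range(len(sequence) - k + 1):
        key = pack_kmer(sequence[pos:pos + k])
        index.setdefault(key, []).append(pos)
    return index


def find_exact(index, pattern):
    """Exact string matching by hashing the packed pattern."""
    return index.get(pack_kmer(pattern), [])


reference = "ACGTACGT"
index = build_kmer_index(reference, k=4)
matches = find_exact(index, "ACGT")
```

With 2 bits per base, a 32-base k-mer fits in a single 64-bit word, which is one reason narrow fixed-width datapaths on an FPGA are a natural fit for this workload.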

Digital Fluid Accelerator

3D integration

Petascale chip

1024-layers

General / special HPC solution

8 of 26

Algorithm-Driven Design of Programmable Hardware Accelerators

25%+ of DOE workload is Density Functional Theory (DFT)

  • What: Design the hardware acceleration around the target algorithm/application

  • Why: Huge opportunities to improve performance density and efficiency
    • FFT hardware accelerator 50x-100x faster than GPU (using SPIRAL generator)

  • How: Target Density Functional Theory
    1. Large fraction of the DOE workload
    2. Mature code base and algorithm
    3. LS3DF formulation minimizes off-chip communication and scales O(N)

Example: LS3DF/Density Functional Theory (DFT)
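
To make the O(N) claim concrete, here is a minimal Python sketch comparing a conventional cubic-scaling solve with a fragment-based one. It assumes fixed-size fragments that are each solved with a cubic-cost method, and it ignores fragment overlap and the patching step of the real LS3DF method. The function names and the unit cost are hypothetical, so only the ratios are meaningful.

```python
def cubic_cost(n_atoms):
    """Relative cost of a conventional O(N^3) solve over the whole system."""
    return float(n_atoms) ** 3


def fragment_cost(n_atoms, atoms_per_fragment):
    """Relative cost when the system is split into fixed-size fragments.

    Each fragment still costs O(m^3) for m atoms, but the number of
    fragments grows linearly with N, so the total is O(N) for fixed m.
    """
    n_fragments = n_atoms / atoms_per_fragment
    return n_fragments * float(atoms_per_fragment) ** 3


# Doubling the system size: the cubic method costs 8x more,
# the fragment method only 2x more.
cubic_growth = cubic_cost(2000) / cubic_cost(1000)
fragment_growth = fragment_cost(2000, 100) / fragment_cost(1000, 100)
```

Because each fragment is an independent, compute-intensive kernel, the same structure also keeps most traffic local, which is the property the slide highlights for hardware specialization.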

9 of 26

The DFT kernel for each fragment

Communication-Avoiding LS3DF Formulation – Scales O(N)

O(N^2 log N)

Communication-bound if non-local

O(N^3)

Compute-bound

TSQR & Cholesky

LS3DF O(N) Algorithm Formulation

Minimizes off-chip Communication

Compute Intensive Kernels

Targeted for HW Specialization

CGRA, FPGA, or Chiplet

We just designed hardware

How do we integrate in a system?

10 of 26

Chiplets Make Specialization Accessible for HPC

From DARPA CHIPS

See the multi-agency chiplets workshop at https://sites.google.com/lbl.gov/chiplets-workshop-2023/home

11 of 26

More Flexible and Lower Cost

12 of 26

13 of 26

Standardized die-to-die (D2D) Physical Layer Interfaces (ODSA)

Blue Cheetah supplies the IP for the Die-to-Die (D2D) Phy.

14 of 26

A protocol: UCIe

Uses CXL or PCIe

• I/O attach: PCIe/CXL.io

• Memory use cases: CXL.mem

• Accelerator use cases: CXL.cache

https://www.nextplatform.com/2022/03/02/industry-behemoths-back-intels-universal-chiplet-interconnect/

https://www.snia.org/sites/default/files/PM-Summit/2022/PMCS22-Park-CXL-and-UCIe.pdf

15 of 26

ODSA: Open Domain-Specific Architecture

Creating an Open Chiplet Marketplace

16 of 26

Photonic MCM for High Escape Bandwidth for Remote Memory

Comb laser source with DWDM silicon photonics

Wide-and-slow for high-speed links

Photonic SiP

[Figure: photonic SiP link schematic; labels: clk, data, TIA, clk gen, silicon waveguide]

MCM: multi-chip module

17 of 26

Project 38: HPC Improvements Through Innovative Architecture

Cross-agency architectural exploration

Project 38 (P38) is a set of vendor-agnostic architectural explorations involving DOD, the DOE Office of Science, and NNSA

Mission

  • Demonstrate high performance IUSG node -- codesigned to accelerate GraphBLAS
  • Demonstrate modular integration of LBNL/ANL IUSG + commercial IP using Open Chiplets
  • Create new capability for the USG to rapidly assemble/prototype server-class chip designs

Accomplishments thus far

  • Released integration platform (MoSAIC)
    • Abstract model to RTL to chiplets or FPGAs
  • Created end-to-end cost model for chiplet integration
  • Chisel FFT, sparse matrix multiply, and TSQR generators
  • GraphBLAS Accelerator ISA for RISC-V (GISA)
  • AMD collaboration showed the benefit of sparse matrix/tensor acceleration
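
As an illustration of what a chiplet cost model has to capture, the following Python sketch compares one large monolithic die against the same silicon area split into smaller chiplets. It is not the Project 38 cost model: it assumes a simple Poisson yield model, known-good-die testing, a flat per-chiplet assembly cost, and no area overhead for die-to-die interfaces. All function names and input values are hypothetical.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of good dies under a Poisson defect model (an assumption)."""
    return math.exp(-area_cm2 * defects_per_cm2)


def cost_per_good_die(area_cm2, defects_per_cm2, cost_per_cm2):
    """Silicon cost of one working die, amortizing the dies lost to defects."""
    raw_cost = cost_per_cm2 * area_cm2
    return raw_cost / poisson_yield(area_cm2, defects_per_cm2)


def chiplet_package_cost(total_area_cm2, n_chiplets, defects_per_cm2,
                         cost_per_cm2, assembly_cost_per_chiplet):
    """Cost of building total_area_cm2 of logic from n equal chiplets.

    Assumes every chiplet is tested before assembly and that assembly
    itself never fails; both are simplifications.
    """
    chiplet_area = total_area_cm2 / n_chiplets
    silicon = n_chiplets * cost_per_good_die(
        chiplet_area, defects_per_cm2, cost_per_cm2)
    return silicon + n_chiplets * assembly_cost_per_chiplet


# Hypothetical inputs: 8 cm^2 of logic, 0.2 defects/cm^2, 10 cost units/cm^2.
monolithic = cost_per_good_die(8.0, 0.2, 10.0)
four_chiplets = chiplet_package_cost(8.0, 4, 0.2, 10.0,
                                     assembly_cost_per_chiplet=1.0)
```

Under these assumptions the chiplet version is cheaper because yield falls off exponentially with die area. A fuller model would add interface area, assembly yield, and test cost, any of which can shift the break-even point.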

Look for the Project 38 poster!

Affordable heterogeneous co-integration using chiplets

Chiplet Integration

HBM-DRAM (Micron)

Server-Class RISC-V Processor (Ventana)

Photonic Links (Columbia)

Recoding Engine (UC/ANL)

GraphBLAS Accelerator (LBNL)

18 of 26

Questions?

19 of 26

One Challenge is Escape Bandwidth

  • Good news: higher bandwidth density and lower power per bit
  • Bad news: limited (~2 cm) reach
    • Cannot get outside of the package (but we need to)

20 of 26

Chiplet Bandwidth Roadmap (5 generations of BW doubling)
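
The cumulative effect of five doublings is easy to underestimate, so here is a small Python sketch of the arithmetic. The baseline is normalized to 1.0 because the slide text gives no absolute figure, and the function name is hypothetical.

```python
def bandwidth_after(base_bandwidth, generations):
    """Bandwidth after a number of generations, doubling each generation."""
    return base_bandwidth * 2 ** generations


# Generation 0 is the normalized baseline; generations 1-5 are the doublings.
roadmap = [bandwidth_after(1.0, g) for g in range(6)]
total_gain = roadmap[-1] / roadmap[0]
```

Five doublings compound to a 32x increase over the baseline.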

21 of 26

Package Limited Bandwidth

Source: J. Poulton, Nvidia

It's been a problem for years, but we need to claw this back for disaggregation to work

22 of 26

Rapid Prototyping of HPC Data Analytics Engine using Open/Modular Chiplets

Motivation

Our Team

Our Vision: Leveraging the ODSA Open Chiplets Ecosystem for Rapid Prototyping using Mixed IUSG + Commercial Chiplets

  • MPW prototyping of chip designs necessarily yields small chips, and therefore lower performance
  • Many necessary subsystems (memory controllers, PCIe) are better supplied by commercial IP
  • Need in-package integration (2.5D co-packaging) to meet bandwidth requirements

Our Mission

  • Demonstrate high performance IUSG node -- codesigned to accelerate GraphBLAS
  • Demonstrate modular integration of LBNL/ANL IUSG + commercial IP using Open Chiplets
  • Create new capability for the USG to rapidly assemble/prototype server-class chip designs

Berkeley Laboratory

John Shalf, Thom Popovici, Anastasiia Butko, Cy Chan, Patricia Gonzalez, George Michelogiannakis, Nirmal Patra

Argonne National Laboratory

Valerie Taylor, Ray Bair, Jose Monsalve Diaz, Dawson Fox

University of Chicago

Andrew Chien

Columbia University

Keren Bergman

PNNL

Antonino Tumeo, Roberto Gioiosa, Jim Ang

High Performance Modular Components

DRAM Layers

Photonic Layer

Logic Layer

Chiplet Integration

HBM-DRAM (Micron)

Server-Class RISC-V Processor (Ventana)

Photonic Links (Columbia)

Chiplet Integration for Modularity and Scalability

Scalable IUSG computing systems composed of small chiplet building blocks: sustained scalability!

High-bandwidth, energy-efficient silicon photonic building blocks…

Compatible with CMOS microelectronics!

Recoding Engine (UC/ANL)

GraphBLAS Accelerator (LBNL)

Enabling Technologies

23 of 26

Package Performance is Pin Limited

Source: J. Poulton, Nvidia

High SerDes rates run counter to the end of Dennard scaling

24 of 26

Datacenters: Worsen Climate Change Without Ultra-Energy-Efficiency

And data movement dominates that power consumption

  • A January 2021 SRC report projects that datacenter energy growth rates will lead to datacenters consuming ~25% of planetary energy by 2040.
  • Data movement is a dominant contributor to that power consumption

Source: Gordon Keeler (DARPA)

25 of 26

What is a Chiplet?

Solder Microbumps

26 of 26

Different Options
