Chiplets for HPC
Open Chiplet Economy
OCP Sponsored Tutorial
Chiplet Summit Feb 6th 2024
1:00 pm to 5:00 pm
Santa Clara California, USA
George Michelogiannakis, LBNL
Material credit: John Shalf, LBNL
HPC’s Future if we Don’t Change Course
Connect. Collaborate. Accelerate.
Specialization is Nature’s Way
Powerful General Purpose
Many Lighter Weight
(post-Dennard scarcity)
Many Different Specialized
(Post-Moore Scarcity)
Xeon, Power
Intel KNL, AMD, Cavium/Marvell, GPUs
Apple, Google, Amazon, AWS
We Have to Understand The Market
Follow the money
Domain Specific Compute Driven by Hyperscalers
Neil Thompson
Opportunity for HPC: New Economic Model
Open Chiplets Marketplace is forming (ODSA and UCIe)
Leverage this baseline and extend to support HPC
80:20 Rule: Focus open efforts on what uniquely benefits HPC
Mark Seager 2010
Architecture Specialization for Science
Materials
Density Functional Theory (DFT)
Use O(n) algorithm
Dominated by FFTs
FPGA or ASIC
CryoEM Accelerator
LBNL detector
750 GB / sec
Custom ASIC near detector
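A rough sense of why the data is reduced near the detector comes from simple arithmetic on the 750 GB/sec figure. The sketch below assumes the rate is sustained and uses decimal units; the function name `data_volume_pb` is illustrative and not from the slides.

```python
DETECTOR_RATE_GB_PER_S = 750  # rate quoted on the slide

def data_volume_pb(seconds: float, rate_gb_per_s: float = DETECTOR_RATE_GB_PER_S) -> float:
    """Data produced in petabytes, assuming a sustained rate and 1 PB = 1e6 GB."""
    return rate_gb_per_s * seconds / 1e6

# Volume generated in one hour of continuous operation (an assumed duty cycle).
per_hour = data_volume_pb(3600)
print(f"{per_hour:.1f} PB per hour")
```

Under these assumptions an hour of operation yields a few petabytes, which is the scale that makes moving raw data off the detector impractical.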
Genomics Accelerator
String matching
Hashing
2-8 bit (ACTG)
FPGA
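The low end of the 2-8 bit range can be illustrated with a minimal sketch that packs each nucleotide into 2 bits, so that string matching reduces to integer comparison. The code assignment (A=0, C=1, G=2, T=3) and the names `pack` and `unpack` are assumptions for illustration; the slides do not specify the accelerator's actual encoding.

```python
# Hypothetical 2-bit code per base; not taken from the accelerator design.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq: str) -> int:
    """Pack a nucleotide string into an integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for ch in seq:
        value = (value << 2) | CODE[ch]
    return value

def unpack(value: int, length: int) -> str:
    """Recover a string of `length` bases from a packed integer."""
    out = []
    for shift in range(2 * (length - 1), -1, -2):
        out.append(BASE[(value >> shift) & 0b11])
    return "".join(out)

packed = pack("GATTACA")
print(bin(packed), unpack(packed, 7))
```

Narrow fixed-width symbols like this are one reason such workloads map well to FPGAs, where datapaths can be sized to the symbol width rather than to a byte or word.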
Digital fluid Accelerator
3D integration
Petascale chip
1024-layers
General / special HPC solution
Algorithm-Driven Design of Programmable Hardware Accelerators
25%+ of DOE workload is Density Functional Theory (DFT)
Example: LS3DF/Density Functional Theory (DFT)
The DFT kernel for each fragment
Communication Avoiding LS3DF Formulation – Scales O(N)
O(N^2 log N)
Comm bound if non-local
O(N^3)
Compute-bound
TSQR & Cholesky
LS3DF O(N) Algorithm Formulation
Minimizes off-chip Communication
Compute Intensive Kernels
Targeted for HW Specialization
CGRA, FPGA, or Chiplet
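The gap between the complexity classes on this slide can be made concrete with a small sketch that compares how cost grows when the problem size doubles. Constant factors are ignored and the problem sizes are arbitrary; `relative_cost` is a hypothetical helper, not part of LS3DF.

```python
import math

def relative_cost(n: int, n0: int) -> dict:
    """Growth in cost when the problem size goes from n0 to n, ignoring constant factors."""
    return {
        "O(N)": n / n0,
        "O(N^2 log N)": (n**2 * math.log2(n)) / (n0**2 * math.log2(n0)),
        "O(N^3)": (n / n0) ** 3,
    }

# Doubling the problem size from an arbitrary starting point.
growth = relative_cost(n=2048, n0=1024)
for label, factor in growth.items():
    print(f"{label}: {factor:.1f}x")
```

For this doubling, the linear formulation grows 2x while the cubic kernels grow 8x, which is why the compute-intensive kernels are the ones targeted for hardware specialization.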
We just designed hardware
How do we integrate in a system?
Chiplets Make Specialization Accessible for HPC
From DARPA CHIPS
See the multi-agency chiplets workshop at https://sites.google.com/lbl.gov/chiplets-workshop-2023/home
More Flexible and Lower Cost
Standardized die-to-die (D2D) Physical Layer Interfaces (ODSA)
Blue Cheetah supplies the IP for the Die-to-Die (D2D) Phy.
A protocol: UCIe
Uses CXL or PCIe
• I/O attach with PCIe/CXL.io
• Memory use cases: CXL.mem
• Accelerator use cases: CXL.cache
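The use-case-to-protocol pairing listed above can be written down as a small lookup table. This is only a restatement of the bullets in code form; the dictionary keys and the function name `protocol_for` are illustrative and do not correspond to any UCIe or CXL software API.

```python
# Pairings copied from the bullets above; keys and names are illustrative only.
UCIE_PROTOCOL_BY_USE_CASE = {
    "io_attach": "PCIe/CXL.io",
    "memory": "CXL.mem",
    "accelerator": "CXL.cache",
}

def protocol_for(use_case: str) -> str:
    """Return the protocol the slide lists for a use case, or raise if none is listed."""
    try:
        return UCIE_PROTOCOL_BY_USE_CASE[use_case]
    except KeyError:
        raise ValueError(f"no protocol listed for use case {use_case!r}") from None

print(protocol_for("memory"))
```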
https://www.nextplatform.com/2022/03/02/industry-behemoths-back-intels-universal-chiplet-interconnect/
https://www.snia.org/sites/default/files/PM-Summit/2022/PMCS22-Park-CXL-and-UCIe.pdf
ODSA: Open Domain Specific Architecture
Creating an Open Chiplet Marketplace
Photonic MCM for High Escape Bandwidth for Remote Memory
Comb Laser Source with
DWDM Silicon Photonics
Wide-and-slow for high-speed links
Photonic SiP
[Figure: photonic link diagram with clk and data lanes, TIA receivers, clock generation, and silicon waveguides]
MCM: Multi chip module
Project 38: HPC Improvements Through Innovative Architecture
Cross-agency architectural exploration
Project 38 (P38) is a set of vendor-agnostic architectural explorations involving DOD, the DOE Office of Science, and NNSA
Accomplishments thus far
Look for the Project 38 poster!
Affordable heterogeneous
co-integration using chiplets
Chiplet Integration
HBM-DRAM
(Micron)
Server-Class RISC-V
Processor (Ventana)
Photonic Links
(Columbia)
Recoding Engine (UC/ANL)
GraphBLAS Accelerator
(LBNL)
Questions?
One Challenge is Escape Bandwidth
Chiplet Bandwidth Roadmap (5 generations of BW doubling)
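Five generations of doubling compound to a 32x increase over the starting point. The sketch below works this out with a normalized baseline of 1.0, since the slide text does not give absolute bandwidth figures; `bandwidth_roadmap` is a hypothetical helper.

```python
GENERATIONS = 5  # from the slide title

def bandwidth_roadmap(baseline: float, generations: int = GENERATIONS) -> list:
    """Bandwidth at generation 0..generations, doubling each generation."""
    return [baseline * 2**g for g in range(generations + 1)]

# Normalized baseline; absolute numbers are not given in the slide text.
roadmap = bandwidth_roadmap(1.0)
print(roadmap)
```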
Package Limited Bandwidth
Source: J. Poulton, Nvidia
It's been a problem for years,
but we need to claw this back
for disaggregation to work
Rapid Prototyping of HPC Data Analytics Engine using Open/Modular Chiplets
Motivation
Our Team
Our Vision: Leveraging the ODSA Open Chiplets Ecosystem for Rapid Prototyping using Mixed IUSG + Commercial Chiplets
Our Mission
Berkeley Laboratory
John Shalf, Thom Popovici, Anastasiia Butko, Cy Chan, Patricia Gonzalez, George Michelogiannakis, Nirmal Patra
Argonne National Laboratory
Valerie Taylor, Ray Bair, Jose Monsalve Diaz, Dawson Fox
University of Chicago
Andrew Chien
Columbia University
Keren Bergman
PNNL
Antonino Tumeo, Roberto Gioiosa, Jim Ang
High Performance Modular Components
DRAM Layers
Photonic Layer
Logic Layer
Chiplet Integration
HBM-DRAM
(Micron)
Server-Class RISC-V
Processor (Ventana)
Photonic Links
(Columbia)
Chiplet Integration for Modularity and Scalability
Scalable IUSG computing systems composed of small chiplet building blocks
Sustained scalability!
High-bandwidth, energy-efficient silicon photonic building blocks…
Compatible with CMOS microelectronics!
Recoding Engine (UC/ANL)
GraphBLAS Accelerator
(LBNL)
Enabling Technologies
Package Performance is Pin Limited
Source: J. Poulton, Nvidia
High SERDES rates run
counter to end of
Dennard Scaling
Datacenters: Worsen climate change without ultra-energy-efficiency
And data movement dominates that power consumption
Source: Gordon Keeler (DARPA)
What is a Chiplet?
Solder Microbumps
Different Options