1 of 17

Requirements for System Architecture and Operational Software: Pre-summit Intensive

John Shalf, Dan Reed, Barb Helland, John Towns

September 30, 2024

University of Illinois Urbana-Champaign

X-SHiELD

Max Planck, DYAMOND

SCREAM

MPAS

NeuralGCM

DALL·E (OpenAI) visualization of a supercomputer capable of running a km-scale simulation of the coupled Earth system

🡨 And DALL·E's vision of the motherboard of said system.

2 of 17

Outline

  • What we know and “agree” on
  • ESM Model Requirements (HW Model)
  • Next Steps & open research questions

3 of 17

What we Know and Agree On


4 of 17

What we know and agree on

  • We are translating ESM model requirements into hardware requirements for an HPC system that can satisfy them
    • Gathering the engineering requirements for such a system
    • Not pre-deciding CPU vs. GPU, custom vs. COTS… it's just requirements
  • The atmospheric component is the dominant consumer of computational resources
    • This is a coupled model (atmosphere, land, ocean, ice)
    • But for simplicity we will use the dominant consumer of resources (atmosphere) to set requirements
  • We are not going to weigh in on AI climate model vs. physics-based climate model
    • That is a discussion best left to the AI intensive and the climate model experts
    • This activity is just engineering requirements derived from EXISTING 1km-capable physics-based climate models
  • And we have a strawman for the model requirements and an analytic (Excel) model to translate them into HPC system requirements
    • Outlined on the following slides

5 of 17

ESM Model Requirements


6 of 17

Building a Spreadsheet Hardware Model from ESM Metrics

  • Domain Decomposition

  • #layers in model

  • #atmospheric columns as a function of resolution

  • Step size as a function of resolution

  • Simulated Years per Day (SYD) target(s)
    • And/or ensemble size (concurrent ensembles)

  • #variables per gridpoint (prognostic & diagnostic variables)
  • PDE Stencil
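A minimal sketch, in Python (my choice; the deck's actual model is an Excel spreadsheet), of how these inputs could be captured. The field names and default values are illustrative, drawn from the strawman requirements on the following slides; the 5-point stencil is an assumption.

```python
from dataclasses import dataclass

# Illustrative container for the ESM metrics that feed the spreadsheet hardware
# model. Defaults follow the strawman on the next slides; nothing here is a
# fixed schema.

@dataclass
class ESMMetrics:
    resolution_km: float = 1.25   # horizontal resolution
    levels: int = 256             # vertical layers (high-top model)
    variables: int = 10           # prognostic + diagnostic variables per gridpoint
    dt_sim_s: float = 10.0        # simulated seconds per timestep (a function of resolution)
    sypd_target: float = 10.0     # simulated years per wall-clock day
    ensemble_size: int = 100      # concurrent realizations per scenario
    stencil_points: int = 5       # PDE stencil footprint (assumed)

    @property
    def columns(self) -> float:
        """Atmospheric columns as a function of resolution (~510M km^2 of surface)."""
        return 510e6 / self.resolution_km ** 2

print(f"{ESMMetrics().columns:.3g} columns at 1.25 km")
```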

7 of 17

Strawman ESM Model Requirements from the Intensive #1 ESM Survey Results, Intensive #2, and SCREAM

  • #prognostic and diagnostic variables
    • u,v,w,q,t,z,o3,aerosols: for simplicity 10 variables (E3SM has 16)
  • How many levels
    • 256 vertical layers (high-top model up to lower mesosphere)
  • Horizontal resolution
    • ~1.25km resolution
  • Ensemble size
    • 100 realizations per model per scenario (note survey was ~10)
  • Production Rate
    • ~10-30 SYPD

8 of 17

Strawman Machine Analytic Model

  • Model resolution: for simplicity assume 10 variables, 100 levels, 1 km resolution
    • Results in ~500M columns and a ~2 TB state vector (note SCREAM has 16 variables)
  • Rate assumption: 1 SYPD (365x time compression) and 10 s of simulated time per step (note the SCREAM target is 20 s per tracer step)
    • Result is ~25 ms per timestep
  • IO assumption: read/write the entire state once per timestep (4 TB total)
    • ~160 TB/s system IO bandwidth
  • Observations regarding present-day systems
    • O(500) H100s would be needed per ensemble member
    • O(500) H100s should be able to achieve 10 SYPD on present-day systems
  • Reality check:
    • A system with 160 GH200s running a 5 km model with 90 layers and a 40 s timestep achieves 215 SDPD (simulated days per day)
    • For a 1.25 km / 90-layer / 20 s configuration, that is a production rate of 1925 SDPD per MW, or 450 SDPD
    • So we are a bit off the mark, but not completely crazy (the simplified arithmetic is sketched below)
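A minimal sketch of the analytic-model arithmetic above, under the same simplified assumptions (10 variables, 100 levels, 1 km, 1 SYPD, 10 s per step); the 4-byte (single-precision) value size is my assumption. It reproduces the slide's figures to within rounding (~27 ms per step and ~150 TB/s versus the ~25 ms and ~160 TB/s quoted).

```python
# Strawman machine analytic model, simplified assumptions from this slide.
EARTH_SURFACE_KM2 = 510e6        # global surface area, km^2
RES_KM            = 1.0          # horizontal resolution
VARIABLES         = 10           # prognostic + diagnostic variables
LEVELS            = 100          # vertical levels
BYTES_PER_VALUE   = 4            # assumed single precision

SECONDS_PER_YEAR  = 365 * 86400
SECONDS_PER_DAY   = 86400
SYPD              = 1.0          # simulated years per wall-clock day
DT_SIM_S          = 10.0         # simulated seconds per timestep

columns     = EARTH_SURFACE_KM2 / RES_KM**2                   # ~510M columns
state_bytes = columns * LEVELS * VARIABLES * BYTES_PER_VALUE  # ~2 TB state vector

steps_per_year  = SECONDS_PER_YEAR / DT_SIM_S
wall_per_step_s = SECONDS_PER_DAY / (SYPD * steps_per_year)   # ~27 ms per step

io_bytes_per_step = 2 * state_bytes                 # read + write the full state
io_bw = io_bytes_per_step / wall_per_step_s         # ~150 TB/s

print(f"columns: {columns:.3g}, state: {state_bytes/1e12:.1f} TB")
print(f"timestep: {wall_per_step_s*1e3:.1f} ms, IO BW: {io_bw/1e12:.0f} TB/s")
```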

9 of 17

Memory Requirements Model

  • Step size for the 1.25 km model (step size scales with the square of resolution)
    • SCREAM: 20 seconds of simulated time per step
    • CU Red Code: 7 seconds of simulated time per step
    • Simplified assumption: 10 seconds of simulated time per step
  • Step rate (@ 10 seconds of simulated time per step)
    • For 10 SYPD 🡪 2.7 ms/step 🡪 ~40 PB/s global BW 🡪 ~10k MPI ranks (7 MW)
    • For 100 SYPD 🡪 100k MPI ranks (to meet the time target) 🡪 70 MW
  • Memory capacity assumptions
    • 10 variables (SCREAM: 16 variables) x 125 levels (100-256 levels in practice)
    • Global surface area is 510 million square kilometers 🡪 ~325M columns
    • 4-8 kB/column 🡪 ~3 TB global state
    • For the 10 SYPD rate: ~275 MB per MPI rank (not a lot of work for each socket to do!); the arithmetic is sketched below
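A minimal sketch of the memory-model arithmetic above. The 8 kB/column figure is my choice of the upper end of the slide's 4-8 kB range, and the 10k-rank count is taken from the slide; with those assumptions the sketch lands near the quoted ~3 TB global state and ~275 MB per rank.

```python
# Memory requirements model, simplified assumptions from this slide.
EARTH_SURFACE_KM2 = 510e6
RES_KM            = 1.25
DT_SIM_S          = 10.0          # simulated seconds per step
SECONDS_PER_YEAR  = 365 * 86400
SECONDS_PER_DAY   = 86400

columns          = EARTH_SURFACE_KM2 / RES_KM**2   # ~325M columns
bytes_per_column = 8_000                           # assumed (slide: 4-8 kB/column)
global_state     = columns * bytes_per_column      # ~2.6 TB (slide: ~3 TB)

def wall_per_step(sypd: float) -> float:
    """Wall-clock seconds available per timestep at a given SYPD target."""
    steps_per_year = SECONDS_PER_YEAR / DT_SIM_S
    return SECONDS_PER_DAY / (sypd * steps_per_year)

RANKS_10SYPD = 10_000                              # rank count from the slide
print(f"10 SYPD:  {wall_per_step(10)*1e3:.1f} ms/step, "
      f"{global_state/RANKS_10SYPD/1e6:.0f} MB per MPI rank")
print(f"100 SYPD: {wall_per_step(100)*1e3:.2f} ms/step")
```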

10 of 17

Telemetry from the SCREAM code

  • Aggressive loop fusion can push up computational intensity (CI); a toy illustration follows below
    • Requires a much larger on-chip cache
    • 3D-integrated caches are on the market with more to come (so it's an option that was not available until recently)
  • Run chemistry concurrently with the dycore
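A toy Python illustration (not SCREAM code) of why fusing loops raises computational intensity: the unfused version streams the column arrays through memory twice, once per kernel, while the fused version makes a single pass and keeps the intermediate in registers or cache, so flops per byte of memory traffic roughly double.

```python
import numpy as np

def unfused(t, q):
    # Two separate kernels: the temporary t2 is written to and re-read from memory.
    t2 = t + 0.5 * q          # kernel 1
    return t2 * t2 - q        # kernel 2

def fused(t, q):
    # One fused kernel: each element of t and q is loaded once; the intermediate
    # stays in registers, so the same flops are done with ~half the memory traffic.
    out = np.empty_like(t)
    for i in range(t.size):
        a = t[i] + 0.5 * q[i]
        out[i] = a * a - q[i]
    return out

t, q = np.random.rand(100_000), np.random.rand(100_000)
assert np.allclose(unfused(t, q), fused(t, q))   # same answer, different traffic
```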

11 of 17

Communication Requirements Model

  • Model assumptions
    • 20 halo exchanges of 10 variables per step (corroborated by the SCREAM code)
    • 1 ms/step to achieve 10 SYPD
    • 1 µs overhead per message sent
  • Bandwidth requirements under different scenarios
    • 20% of the step spent in un-overlapped communication 🡪 ~48 GB/s injection BW per node
    • 100% overlap with computation requires only ~10 GB/s injection BW per node
    • Message size for 10 SYPD with a 10k-way domain decomposition 🡪 ~18 MB/message (safely in the bandwidth-bound zone)
  • ~50 GB/s injection bandwidth is very achievable (inter-node communication is not a core challenge); the bandwidth arithmetic is sketched below
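A minimal sketch of the injection-bandwidth arithmetic above. The ~9.6 MB of halo traffic per node per step is a back-solved assumption chosen so that the two overlap scenarios reproduce the slide's ~48 GB/s and ~10 GB/s figures; the 1 ms step budget, 20 exchanges, and 1 µs per-message overhead are the slide's assumptions.

```python
STEP_TIME_S         = 1e-3     # wall-clock budget per step (10 SYPD, per slide)
HALO_EXCHANGES      = 20       # halo exchanges per step (slide assumption)
MSG_OVERHEAD_S      = 1e-6     # per-message software/NIC overhead
HALO_BYTES_PER_NODE = 9.6e6    # assumed per-step halo traffic per node (back-solved)

def injection_bw_gbs(unoverlapped_fraction: float) -> float:
    """Injection bandwidth (GB/s) needed per node if only this fraction of the
    step is available for non-overlapped communication."""
    comm_window_s = unoverlapped_fraction * STEP_TIME_S
    return HALO_BYTES_PER_NODE / comm_window_s / 1e9

overhead_s = HALO_EXCHANGES * MSG_OVERHEAD_S   # 20 us, i.e. 2% of the 1 ms step
print(f"20% un-overlapped: ~{injection_bw_gbs(0.20):.0f} GB/s per node")
print(f"fully overlapped:  ~{injection_bw_gbs(1.00):.0f} GB/s per node")
print(f"per-message overhead per step: {overhead_s*1e6:.0f} us "
      f"({overhead_s/STEP_TIME_S:.0%} of the step)")
```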

12 of 17

Scaling Data from the SCREAM Code

Messages are pretty large

256 halo exchanges per tracer step

13 of 17

IO Requirements Model

  • Hourly: surface air temperature, precipitable water, integrated water vapor transport, surface specific humidity
  • Every 15 minutes: precipitation, surface wind components
  • Full set of daily and monthly variables per the CMIP6 DECK protocols
  • Parallel IO subsystem usable ingest rate of ~160 TB/s

This is demanding, but very feasible with existing technology

Likely challenges are in the parallel IO software
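A rough, hedged sizing of the diagnostic output stream listed above, at 1.25 km resolution and 10 SYPD. The field counts (5 hourly 2D fields, 3 quarter-hourly 2D fields) and 4-byte values are my assumptions; the ~160 TB/s figure on this slide instead comes from reading and writing the full model state every timestep (see the earlier analytic-model slide).

```python
COLUMNS         = 510e6 / 1.25**2    # ~325M columns at 1.25 km
BYTES_PER_VALUE = 4                  # assumed single precision
SYPD            = 10                 # production-rate target

bytes_per_2d_field  = COLUMNS * BYTES_PER_VALUE   # ~1.3 GB per surface field
hourly_fields       = 5                           # assumed count of hourly 2D fields
quarter_hour_fields = 3                           # assumed count of 15-minute 2D fields

sim_hours_per_wall_s    = SYPD * 365 * 24 / 86400                 # ~1 sim-hour per second
diag_bytes_per_sim_hour = (hourly_fields + 4 * quarter_hour_fields) * bytes_per_2d_field
diag_bw = diag_bytes_per_sim_hour * sim_hours_per_wall_s

print(f"diagnostic output stream: ~{diag_bw/1e9:.0f} GB/s at {SYPD} SYPD")
```

Under these assumptions the diagnostic stream is only tens of GB/s, orders of magnitude below the full-state figure, which supports the slide's point that the hardware side is feasible and the likely challenges sit in the parallel IO software.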

14 of 17

Challenges of Load Imbalance for Coupled Codes

15 of 17

Next Steps and Open Research Questions


16 of 17

Next Steps and Open Questions

  • Hardware Metrics
    • We currently have SYPD, memory BW/size, IO requirements, and communication models
    • Need to collect memory bandwidth/latency (roofline) and communication measurements to corroborate the model
    • Anything else?
  • Code Analysis
    • Compare against projections that others have made by instrumenting the ICON code (can they share the raw data?)
    • Instrument the SCREAM code to get another sample/example km-capable code to estimate the roofline requirements & identify where we are not hitting the roofline
    • Measure load imbalance at sync points
    • What additional code analysis should be considered (load imbalance, for example)?
  • Code Optimizations and their effect on hardware requirements (w/Topic#1 ESM)
    • Aggressive loop fusion (10x reduction in memory BW, but requires large on-chip caches)
    • Concurrent physics and atmospheric chemistry

17 of 17

Questions?