1 of 16

National Research Platform - A Status Update

Frank Würthwein
Director, San Diego Supercomputer Center

April 11th, 2023

2 of 16

Long Term Vision

  • Create an Open National Cyberinfrastructure that allows the federation of CI at all ~4,000 accredited, degree-granting higher education institutions, non-profit research institutions, and national laboratories.
      • Open Science
      • Open Data
      • Open Source
      • Open Infrastructure


Openness for an Open Society

Open Compute

Open Storage & CDN

Open devices/instruments/IoT, …?

3 of 16

Community vs Funded Projects


A community with a shared vision.

Lots of funded projects contribute to this shared vision in different ways.

We want you to …

… grow NRP.

… build on NRP.

NRP is “owned” and “built” by the community, for the community.

4 of 16

A single Kubernetes Cluster Across the World

[World map of the NRP Kubernetes cluster; rotating storage: 4,000 TB; Feb 9, 2023]

NRP passed its NSF acceptance review in February 2023

5 of 16

Cyberinfrastructure Stack


[Stack diagram: Hardware; IPMI, Firmware, BIOS; Kubernetes; Admiralty; SLURM; HTCondor/OSG]

NRP operates at all layers of the stack, from IPMI up

    • IPMI reduces TCO and lowers the threshold to entry
    • Kubernetes allows service deployments
      • Also the natural layer for application container deployment (see the sketch below)
    • Admiralty allows K8S federation with folks who want control
      • Including cloud integration to access TPUs & other cloud-only architectures
    • HTCondor allows NRP to show up as a “site” in OSG

The layer you integrate at depends on:

    • The control you want
    • The effort you can afford
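Not taken from the slide, but as a minimal sketch of what container deployment at the Kubernetes layer can look like, using the official Kubernetes Python client; the namespace, image, and the Admiralty annotation mentioned in the comment are illustrative assumptions, not NRP-specific values:

```python
# Minimal sketch: launch a GPU pod on a Kubernetes cluster (e.g., Nautilus)
# via the official Python client. Namespace, image, and resource amounts
# are hypothetical placeholders, not taken from the slides.
from kubernetes import client, config

config.load_kube_config()  # read cluster credentials from ~/.kube/config

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="gpu-demo",
        # Assumption: with Admiralty installed, an annotation along the lines
        # of "multicluster.admiralty.io/elect": "" lets the pod be proxy-scheduled
        # into a federated member cluster.
    ),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="worker",
                image="nvidia/cuda:12.0.0-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1", "cpu": "2", "memory": "8Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="my-lab-namespace", body=pod)
```

Requesting nvidia.com/gpu in the resource limits is what lets the scheduler place the pod on a GPU node.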

6 of 16

Cyberinfrastructure Stack


  • Under-resourced institutions
  • Network providers and their POPs
  • CS & ECE faculty specialized in:
    • AI/ML => gaming GPUs
    • systems R&D

All of these find it difficult to justify staff to support all layers.

7 of 16

Hardware on NRP is Global


NRP integrates hardware in USA, EU, and Asia

8 of 16

Grafana Graphs of Nautilus Namespace Usage
Calendar 2022 GPUs


AI/ML is largest “domain” both in # of namespaces & # of GPU-hours
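The chart itself does not survive the export; as a hedged sketch of how per-namespace GPU usage like this can be pulled from the Prometheus server behind such a Grafana dashboard (the endpoint URL and the kube-state-metrics metric/label names are assumptions, not details from the slide):

```python
# Hedged sketch: query a Prometheus server that backs a Grafana dashboard
# for current GPU requests per Kubernetes namespace. The endpoint URL and
# the exact metric/label names assume a typical kube-state-metrics setup.
import requests

PROM_URL = "https://prometheus.example.org/api/v1/query"  # hypothetical endpoint
QUERY = 'sum by (namespace) (kube_pod_container_resource_requests{resource="nvidia_com_gpu"})'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    namespace = series["metric"].get("namespace", "<none>")
    gpus = float(series["value"][1])  # result value is a [timestamp, value] pair
    print(f"{namespace}: {gpus:.0f} GPUs currently requested")
```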

9 of 16

Usage by K8S Namespace


[Chart of GPU usage by namespace; top namespaces include osg-opportunistic, ucsd-haosulab, osg-icecube, ucsd-ravigroup, cms-ml, braingeneers]

Let’s look at some example science.

10 of 16

ML Inference as a Service on NRP

Raghav Kansal (graduate student, UCSD) runs ~1,000 CPU jobs calling out to ~10 GPUs on NRP for inference with his ML model in his thesis analysis.

80M events inferenced, sending 1.3 TB of data from the CPUs to the GPUs in 3 hours.

The ML model is too large to fit into the DRAM of the CPUs.

The fastest way to get the job done is “ML inference as a service” on NRP:
~200 MB/s input to the GPUs, ~4 MB/s output from the GPUs.

Raghav & colleagues were the 4th largest GPU users in 2022: 157,571 GPU-hours, peaking at 130 GPUs.

Experimental Particle Physics (cms-ml namespace)
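The slide does not name the serving software, so this is only a rough sketch of the CPU-side pattern: each CPU job batches its events and ships them to a remote GPU inference endpoint, so the large model never has to fit in the CPU job's own memory. The endpoint URL, payload format, and batch size are illustrative assumptions:

```python
# Hypothetical sketch of the CPU side of "ML inference as a service":
# batch locally produced events and ship them to a remote GPU inference
# server for scoring. Endpoint, payload shape, and batch size are
# illustrative assumptions, not details from the slide.
import json
import urllib.request

INFERENCE_URL = "https://gpu-inference.example.org/predict"  # hypothetical
BATCH_SIZE = 1024


def run_inference(batch):
    """POST one batch of event features and return the model's scores."""
    payload = json.dumps({"events": batch}).encode()
    req = urllib.request.Request(
        INFERENCE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["scores"]


def process(events):
    """Stream events through the remote GPU service in fixed-size batches."""
    scores = []
    for i in range(0, len(events), BATCH_SIZE):
        scores.extend(run_inference(events[i : i + BATCH_SIZE]))
    return scores
```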

11 of 16

NRP Bringing Machine Learning to Building Virtual Worlds, Including Robotics and Autonomous Vehicles

  • Goal: Train Robots That Can Manipulate Arbitrary Objects
    • Open Drawer, Turn Faucet, Stack Cube, Pull Chair, Pour Water, Pick And Place, Hang Ropes, Make Dough, …

(video)

12 of 16

A Major Project in UCSD’s Hao Su Lab is Large-Scale Robot Learning

  • We Build A Digital Twin of The Real World in Virtual Reality (VR) For Object Manipulation

  • Agents Evolve In VR
    • Specialists (Neural Nets) Learn Specific Skills by Trial and Error
    • Generalists (Neural Nets) Distill Knowledge to Solve Arbitrary Tasks

  • On Nautilus:
    • Hundreds of specialists have been trained

    • Each specialist is trained in millions of environment variants
    • ~10,000 GPU hours per run

585,170 GPU-hours, peaking at 150 GPUs: the 2nd largest consumer of GPU power in 2022.
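A hedged sketch of the specialist-to-generalist distillation step described above: the generalist network is trained to imitate the actions of task-specific specialist policies. Network sizes, the task names, and the observation source are placeholders, not details of the Hao Su Lab setup:

```python
# Hedged sketch of specialist-to-generalist distillation: the generalist
# network learns to imitate the actions that task-specific specialist
# policies produce on their own tasks. Sizes, task list, and observation
# source are placeholders, not details from the slide.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 64, 8  # placeholder observation/action sizes


def mlp():
    return nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))


generalist = mlp()
specialists = {task: mlp() for task in ["open_drawer", "turn_faucet", "stack_cube"]}
optimizer = torch.optim.Adam(generalist.parameters(), lr=1e-4)

for step in range(1000):
    task = list(specialists)[step % len(specialists)]
    obs = torch.randn(128, OBS_DIM)  # stand-in for observations from that task's environments
    with torch.no_grad():
        target_actions = specialists[task](obs)  # "teacher" actions from the specialist
    loss = nn.functional.mse_loss(generalist(obs), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```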

13 of 16

UCSD’s Ravi Group: How to Create Visually Realistic 3D Objects or Dynamic Scenes in VR or the Metaverse

Source: Prof. Ravi Ramamoorthi, UCSD

ML Computing Transforms a Series of 2D Images Into a 3D View Synthesis

200,000 GPU-hours, peaking at 122 GPUs: the 4th largest GPU consumer in 2022.

14 of 16

Machine Learning-Based Neural Radiance Fields for View Synthesis (NeRFs) Are Transformational!

By Jared Lindzon, November 10, 2022

A neural radiance field (NeRF) is a fully-connected neural network that can generate novel views of complex 3D scenes, based on a partial set of 2D images.

https://datagen.tech/guides/synthetic-data/neural-radiance-field-nerf/

Source: Prof. Ravi Ramamoorthi, UCSD

https://youtu.be/hvfV-iGwYX8
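To make the quoted definition concrete, a minimal sketch of the core NeRF network: a positional encoding of a 3D sample point fed through a fully-connected network that outputs color and density, which are then composited along camera rays to render novel views. Layer sizes follow common NeRF practice, not the specific models used by the Ravi group:

```python
# Minimal sketch of the core NeRF idea: a fully-connected network maps an
# encoded 3D position (and, in full NeRF, a view direction) to an RGB color
# and a volume density; rendering then composites these along camera rays.
# Layer sizes are typical choices, not taken from the slides.
import torch
import torch.nn as nn


def positional_encoding(x, num_freqs=10):
    """Map coordinates to sin/cos features so the MLP can fit high-frequency detail."""
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin((2.0**k) * x), torch.cos((2.0**k) * x)]
    return torch.cat(feats, dim=-1)


class TinyNeRF(nn.Module):
    def __init__(self, num_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 + 3 * 2 * num_freqs  # raw xyz plus sin/cos features
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB (3) + density (1)
        )

    def forward(self, xyz):
        out = self.net(positional_encoding(xyz))
        rgb = torch.sigmoid(out[..., :3])
        density = torch.relu(out[..., 3:])
        return rgb, density


rgb, density = TinyNeRF()(torch.rand(1024, 3))  # query 1,024 sample points
```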

15 of 16

Summary & Conclusions

  • PRP ended, and was replaced by NRP
    • Passed NSF Acceptance Review for PNRP in February 2023
    • Significant new capabilities via Cat-II system “PNRP”
      • PNRP provides ops effort for Nautilus for the future
    • # of GPUs available doubled in 2022.
      • New GPUs (A10, 3080, 3090, A100) are much more powerful than the older GPUs
    • # of FPGAs increased from a few to a few dozen in 2022.
    • # of data caches grows by 50% in 2022/23
      => more consistent coverage across the USA

    • Data volume served is expected to grow substantially in 2023/24/25.
      • How much? As yet too hard to predict.
  • Hoping to recruit new partners to build FAIR capabilities on top of Data Federation within the next 5 years.
  • Hoping to expand NRP into sensor networks using 5G & 6G in the next 10 years.


16 of 16

Acknowledgements

  • This work was partially supported by the NSF grants OAC-1541349, OAC-1826967, OAC-2030508, OAC-1841530, OAC-2005369, CISE-1713149, CISE-2100237, CISE-2120019, OAC-2112167
