1 of 34

What’s Up in the Cloud?

A Cloud of Resources Overview &

Project Eureka, Adaptable Cyber-Infrastructure at Scale

Connect!

Boyd Wilson

Boyd at omnibond.com

linkedin.com/in/boydwilson

x.com/boydwilson

instagram.com/boydfryguy

2 of 34

Leadership

Team

  • Over 40 years combined experience in facilitating and supporting academic and corporate research using the tools and technologies of advanced computing.
  • Experience at the working/technical, project, and executive management levels at Clemson, Purdue, and Miami Universities and the National Center for Supercomputing Applications (NCSA) at UIUC.
  • Expertise in software development, systems integration, operations, applications support, data transmission, identity and access management, customer relations, and research facilitation and engagement.
  • Co-founded ACI-REF (http://www.aciref.org) and CaRCC (http://carcc.org).
  • Over three decades of funded projects from NSF, DoD, DoE, NSA, NIST, and DARPA.
  • Presidential Fellow & CSTAAC Committee Member at

Omnibond, a customer-focused software engineering and support company

3 of 34

Software Products

  • Identity & Security Management
    • Passwordless MFA with OmniPasskey for Federations using Shibboleth
    • Catalyst
    • NetIQ Identity Manager Connectors
    • Thousands of customers, sold through Novell/Micro Focus/OpenText, since the early 2000s
  • Computer Vision & AI (real-time HPC)
    • TrafficVision - AI-based Automated Incident Detection (AID) & Data from existing cameras on roadways
    • BayTracker - Retail Vehicle Tracking and Timing
    • Port Observer - Drayage Queuing, AIS, Dashboard for Ports
  • Cloud HPC and Storage Orchestration
    • CloudyCluster
    • OrangeFS maintainers, including the Linux kernel client
    • projectEureka
    • Custom Cloud <-> On-Prem Integration


4 of 34

Data - The Oil of the AI Generation

5 of 34

NSF Global Instrument Data (100s of PB / yr)

6 of 34

Weather Station Data

7 of 34

A Cloud of CI Resources

8 of 34

Network Foundation

Image:Internet2

9 of 34

Campus Locations

Image: Andrew C. Comrie + IPEDS (2020)

10 of 34

Sampling of University CI Capabilities

               Clemson    Princeton    Oklahoma
Nodes            1,786        1,492         919
Cores           34,916       90,000      29,428
GPUs               850          423          89
Storage (PB)       2.2           18        12.8

11 of 34

NSF CI Resources

Image:NSF

12 of 34

Sampling of NSF Resource Capabilities

               JetStream 2            Stampede 3    Bridges-2
               (IU, ASU, Cornell,     (TACC)        (PSC)
                UH, TACC)
Nodes              506                  1,858           576
Cores           49,152                140,000        65,344
GPUs               360                     40           280
Storage (PB)        14                     13            15

13 of 34

Public Cloud Provider Locations

Image: Peter Alguacil

+ Atomia

14 of 34

Example: AWS Scaling

https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/

15 of 34

Example: Google Cloud Scaling

Google HPC Blog Post

Kevin Kissell

Technical Director,

Office of the CTO

Nov 18, 2019


Urgent HPC can Burst Affordably to the Cloud

  • 133,573 GCP Instances at peak
  • 2,138,000 vCPUs at peak
  • 6,022,964 vCPU hours

Processed 2,479,396 hours (~256TB) of video data

  • ~4 hours of runtime
  • ~1M vCPU within an hour
  • ~1.5M vCPU within 1.5 hours
  • 2.13M vCPU within 3 hours

Total Cost: $52,598.64 USD

Average cost of $0.008 USD per vCPU hour
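The per-vCPU-hour figure follows directly from the totals above; a quick sanity check, using only the numbers on this slide:

```python
# Sanity-check the average vCPU-hour cost from the run totals above.
total_cost_usd = 52_598.64   # total cloud bill for the run
vcpu_hours = 6_022_964       # total vCPU hours consumed

avg_cost = total_cost_usd / vcpu_hours
print(f"${avg_cost:.4f} per vCPU-hour")  # prints "$0.0087 per vCPU-hour"
```

So the "$0.008" on the slide is a slight round-down; the exact average is just under nine-tenths of a cent per vCPU-hour.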

16 of 34

Accessing Resources

17 of 34

NSF Resource Access

ACCESS (Traditional HPC)

OSPool (High Throughput)

  • portal.osg-htc.org

NRP (K8s Hypercluster for Containers)

18 of 34

AWS Cloud HPC Resource Options

Amazon FSx for Lustre

  • AWS Console Based Setup

AWS ParallelCluster

  • CLI Setup & UI Setup
  • Slurm or AWS Batch
  • CloudFormation

Partner/Other HPC Solutions

  • CloudyCluster HPC with Open OnDemand by Omnibond
  • AlcesFlight
  • Ronin (User UI for ParallelCluster)
  • Research and Engineering Studio by AWS

AWS Supports:

  • Variety of instance types
    • x64, GPU, ARM
  • Placement Policies
  • Up to 3,200 Gbps network (EFA; no InfiniBand)
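To give a feel for the ParallelCluster setup mentioned above, a minimal cluster configuration might look like the sketch below. This is a hedged illustration, not a complete reference: the subnet ID, key name, and instance types are placeholder assumptions, and the full schema is in the AWS ParallelCluster documentation.

```yaml
# Hypothetical AWS ParallelCluster v3 config sketch; IDs and names are placeholders.
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-01234567          # placeholder
  Ssh:
    KeyName: my-keypair                # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5n
          InstanceType: c5n.18xlarge
          MinCount: 0                  # scale to zero when idle
          MaxCount: 32
      Networking:
        SubnetIds:
          - subnet-01234567            # placeholder
```

A config along these lines is typically deployed from the CLI with `pcluster create-cluster --cluster-name demo --cluster-configuration config.yaml`.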

19 of 34

Google Cloud HPC Resource Options

Partner HPC Options

  • From: cloud.google.com/hpc

Cloud HPC Toolkit

  • From: cloud.google.com/hpc-toolkit

High-level steps:

  • Clone the Cloud HPC Toolkit GitHub repository
  • Build the Cloud HPC Toolkit binary
  • Create the HPC deployment folder
  • Deploy the HPC cluster using Terraform
  • Slurm's power save (suspend/resume) feature launches the instance types configured per partition on demand
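The steps above are driven by a YAML "blueprint" that the toolkit turns into Terraform. A heavily simplified sketch is below; the project ID is a placeholder, and the module paths are assumptions based on the toolkit's published examples, so check the repository for current names.

```yaml
# Hypothetical Cloud HPC Toolkit blueprint sketch; values are placeholders.
blueprint_name: hpc-slurm-demo
vars:
  project_id: my-gcp-project     # placeholder
  deployment_name: hpc-demo
  region: us-central1
  zone: us-central1-a
deployment_groups:
  - group: primary
    modules:
      - id: network
        source: modules/network/vpc
      - id: compute_partition
        source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        use: [network]
        settings:
          partition_name: compute
```

The toolkit binary expands a blueprint like this into a deployment folder of Terraform, which is then applied to stand up the cluster.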

Google Cloud Supports:

  • Variety of Instance types
    • X64, GPU, ARM, TPU
  • Placement Policies
  • Up to 1000 Gbps network (no IB)

20 of 34

Azure Cloud HPC Resource Options

Partner HPC Options

  • Partner solutions are not prominently listed on Azure's website

Azure CycleCloud

  • Doc link
  • Create a VM from the CycleCloud marketplace image

Azure Cloud Supports:

  • Variety of Instance types (X64, GPU, ARM)
  • Placement Policies
  • Up to 200 Gbps network over IB

21 of 34

Help from CaRCC (Campus Research Computing Consortium)

carcc.org

Join the People Network

22 of 34

A work-in-progress update

Vision & Use Cases

23 of 34

Project Eureka Vision

  • Interactive Applications
    • Applications & Launchers
    • API Applets & SaaS Apps
    • Project Focused
  • Computational Apps
    • Compute Anywhere (HPC, AI, & Beyond)
    • Enable Cloud Specialties
    • Simplify Compute and Storage Interactions
  • Storage Integration
    • Integrate Diverse Storage Resources
    • Collaborate First
    • Project Level Data Lifecycle

Storage

Interactive

Compute

Enabling

Moments

24 of 34

Multi-Cloud/Edge Architecture

[Architecture diagram: multiple K8s-based cloud and edge sites, plus OSPool, each providing Job Routing, Elastic Scratch, Data Staging, and Elastic Compute]

25 of 34

Interactive Application Integration Use Case

[Diagram: interactive, HPC, and HTC applications running on-prem, at the edge, or in the cloud; OmniSched handles instance/storage provisioning and job execution on K8s]
26 of 34

Eureka User Experience


Open OnDemand Deployments

27 of 34

User-Level Security Architecture (Based on Open OnDemand)

[Architecture diagram:]

  • Server Frontend (runs as the Apache user) - user authentication and reverse proxy
  • Per-User NGINX (runs as each individual user) - hosts the Eureka-UI via Passenger
  • Traffic flows over HTTPS/WSS from the client to the frontend, then over IPC sockets to each per-user NGINX
  • A per-user JWT is passed to OmniSched, and jobs run as the user
28 of 34

Identity Architecture (Using AWS as an Example)

Shibboleth

Entra

29 of 34

Interactive Apps Demo

30 of 34

Bursting &

Data Staging

31 of 34

Elastic On-Prem -> Cloud Use Case

[Diagram: jobs submitted to OmniSched on-prem; standard jobs run locally, while burst jobs trigger elastic compute and elastic scratch in the cloud (K8s), with data staging driven by job directives]

32 of 34

Job Directives

CloudyCluster

projectEureka
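By way of illustration, cloud-bursting directives of this kind typically ride along in the batch script as scheduler-style comments next to the normal Slurm lines. The `#CC` lines below are hypothetical placeholders sketching the idea, not documented CloudyCluster or projectEureka syntax:

```
#!/bin/bash
#SBATCH --job-name=burst-demo
#SBATCH --ntasks=128
#CC -it c5n.18xlarge        # hypothetical: instance type to burst onto
#CC -vt spot                # hypothetical: use spot/preemptible capacity
#CC -st 500GB               # hypothetical: elastic scratch size

srun ./my_mpi_app           # placeholder application
```

The scheduler ignores the comment lines, while the orchestration layer reads them to decide where the job runs and what storage to provision.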

33 of 34

Storage Manager

Demo

34 of 34

What’s Up in the Cloud?

Thank You!

Questions?

A Cloud of Resources Overview &

Project Eureka, Adaptable Cyber-Infrastructure at Scale

Please Connect!

Boyd Wilson

Boyd at omnibond.com

linkedin.com/in/boydwilson

x.com/boydwilson

instagram.com/boydfryguy