1 of 23

Exploring Clouds for Acceleration of Science

NSF Award #1904444

Cornell Cloud Forum 2021

Presented by: Ananya Ravipati

2 of 23

Agenda

  • Project outline
  • 6 Projects
  • Learnings
  • Wrap-up & Questions


3 of 23

Leadership

  • Howard Pfeffer – President and CEO of Internet2 – PI on the project
  • Jamie Sunderland – Former Director of Service Development at Internet2 – Project manager of E-CAS

NSF approached Internet2 to lead this as a cooperative effort


4 of 23

Why E-CAS?

  • Illustrate the viability of cloud platforms for academic research
  • Identify gaps
  • Document the findings for the research community


5 of 23

Exploring Clouds for Acceleration of Science. NSF Award #1904444


Scope of the project:

Acceleration of Science:

  • End-to-End Performance (e.g., wall clock and data-movement)
  • Number of concurrent simulations or workflows
  • The ability to process near real-time streaming data

Innovation:

  • Explore the use of heterogeneous hardware resources such as CPUs, GPUs, and FPGAs to support and extend application workflows
  • Integrate and optimize the cloud services for research

6 of 23

E-CAS: Exploring Clouds for Acceleration of Science

NSF Award #1904444: $3.2M


Purdue

Building Clouds – Urban Climate Modeling

UW Madison

IceCube Astronomical Neutrino Detector

SDSC

Bursting CIPRES Phylogenetics Science Gateway

MIT

Accelerated Machine Learning

George Washington

BioCompute Objects in Galaxy

SUNY Downstate

Deciphering the Brain’s Neural Code

https://internet2.edu/ecas

7 of 23

Timeline

Dec 2018 – Jan 2019: Call for proposals

Feb 2019: Academic review of proposals

Mar 2019: Phase 1 subaward contracts

Apr 2019 – Apr 2020: 6x Phase 1 projects

Apr 2020: Reports & presentations on Phase 1 projects

May 2020: Academic review of Phase 1 projects

July 2020: Phase 2 subaward contracts

Aug 2020 – Aug 2021: 2x Phase 2 projects

Sep 2021: Final reports, project wrap-up (NCE)

Phase 1 video presentations and reports are available on the E-CAS project site, https://internet2.edu/ecas

8 of 23

SDSC: CIPRES Phylogenetic Analysis Gateway

CIPRES Science Gateway

  • Highly accessed across all fields of biology
  • Provides phylogenetic tree codes

XSEDE – Comet Supercomputer

Cloud Burst to AWS

  1. Eliminate long queue wait times for 4-GPU nodes (41 hours)
  2. A V100 GPU (on AWS) is 1.4x faster than the local P100 GPU
  3. Supported runtimes as long as 22 days
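
As a rough illustration of the cloud-burst step, the minimal boto3 sketch below requests a single 4-GPU spot instance on AWS. The AMI ID, region, and tag are placeholder assumptions, and CIPRES's actual job-submission pipeline is not shown; spot pricing trades interruption risk for lower cost, which suits restartable tree-search jobs.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

    # Request one 4x V100 node (p3.8xlarge) as a one-time spot instance.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # placeholder AMI with the tool stack
        InstanceType="p3.8xlarge",            # 4x V100 GPUs
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Project", "Value": "cipres-burst"}],  # hypothetical tag
        }],
    )
    print(resp["Instances"][0]["InstanceId"])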

9 of 23

UW Madison: Cloud Computing for the IceCube Neutrino Observatory

The IceCube Neutrino Observatory

Serves science from disciplines including astrophysics, particle physics, and the geophysical sciences; it operates continuously and is simultaneously sensitive to the whole sky

Open Science Grid

  1. Provisioned more than 50,000 GPUs from 3 providers across 24 regions – SC’19
  2. Identified that using spot/preemptible instances can reduce the cost per scan (reconstruction) from about $1,000 to a few hundred dollars
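
A back-of-the-envelope sketch of the cost claim above, with purely assumed hourly rates (real spot discounts vary by GPU type, region, and time):

    # Assumed figures for illustration only; not the project's measured numbers.
    gpu_hours_per_scan = 1_000   # assumed total GPU-hours for one reconstruction scan
    on_demand_rate = 1.00        # $/GPU-hour, assumed
    spot_rate = 0.30             # $/GPU-hour, assumed spot/preemptible rate

    print(f"on-demand:        ${gpu_hours_per_scan * on_demand_rate:,.0f} per scan")
    print(f"spot/preemptible: ${gpu_hours_per_scan * spot_rate:,.0f} per scan")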

10 of 23

GWU: BioCompute Objects in Galaxy Science Gateway

Galaxy

  • Bioinformatics platform

BCO (BioCompute Object)

  • Tools and actions built into a workflow that can be executed on platforms supporting high-throughput sequencing (HTS) workflows

  1. Public instance of Galaxy with BCOs on AWS:

https://galaxy.aws.biochemistry.gwu.edu/static/bco_tour.html

  2. Prototype for cost
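
For orientation, a heavily simplified sketch of what a BioCompute Object's JSON might contain is shown below. The domain names follow the general shape of the IEEE 2791 standard from memory and should be checked against the specification and the GWU Galaxy tooling; all values are hypothetical.

    import json

    # Hypothetical, minimal BCO-like record; not an output of the GWU Galaxy instance.
    bco = {
        "object_id": "https://example.org/BCO_000001",   # placeholder identifier
        "provenance_domain": {"name": "Example HTS workflow", "version": "1.0"},
        "usability_domain": ["Illustrative variant-calling workflow description"],
        "description_domain": {
            "pipeline_steps": [
                {"step_number": 1, "name": "align reads"},
                {"step_number": 2, "name": "call variants"},
            ]
        },
        "io_domain": {
            "input_subdomain": [{"uri": {"uri": "s3://example-bucket/reads.fastq"}}],
            "output_subdomain": [{"uri": {"uri": "s3://example-bucket/variants.vcf"}}],
        },
    }
    print(json.dumps(bco, indent=2))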

11 of 23


Purdue: Building Typology for Urban Climate Modeling

  • Modeling, simulating, and developing scenarios related to future cities
  • Develop urban canopy parameters
  • Combines computer vision, procedural modeling, and machine learning techniques (e.g., convolutional neural networks)

  • Photo2Building – containerized for reproducibility
  • WRF (Weather Research and Forecasting) model limitations

12 of 23

SUNY Downstate: Deciphering the Brain’s Neural Code

  • Brain simulations
  • Understand the circuits that make the brain function

Each simulation requires 100 compute cores; on GCP they ran on ~100K cores simultaneously

Algorithm optimization using 1.8 million core-hours over 2 weeks

Run a single job for ~10 days

Slurm-GCP multi-user clusters

Containerization
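
A quick scale check of the figures on this slide, taking them at face value:

    # Back-of-the-envelope check; all inputs come from the slide text.
    cores_per_sim = 100               # cores required per simulation
    peak_cores = 100_000              # cores run simultaneously on GCP
    concurrent_sims = peak_cores // cores_per_sim        # -> 1,000 simulations at once

    core_hours = 1_800_000            # core-hours for the optimization campaign
    wall_hours = 14 * 24              # two weeks of wall-clock time
    avg_cores = core_hours / wall_hours                  # ~5,400 cores sustained on average

    print(concurrent_sims, round(avg_cores))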

13 of 23

MIT: Heterogeneous Computing of LHC Data

The Large Hadron Collider produces data at rates of about 1 petabit per second

Only data from roughly 1 in 100,000 collisions is analyzed

Use machine learning as a service (MLaaS) to determine the CPU/GPU density required for the LHC high-level trigger (HLT)

Train graph neural networks

Extended similar approaches to other areas of high-energy physics, including gravitational wave analysis
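
To make "graph neural network" concrete, here is a minimal, self-contained message-passing layer (GCN-style) on a toy graph in NumPy. It is only an illustration of the technique, not the project's actual model or its MLaaS deployment.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 4-node graph: adjacency matrix with self-loops, symmetrically normalized.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    A_hat = A + np.eye(4)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt

    H = rng.normal(size=(4, 8))    # node features (e.g., detector hit properties)
    W = rng.normal(size=(8, 16))   # learnable weight matrix

    # One propagation step: aggregate neighbour features, transform, apply ReLU.
    H_next = np.maximum(A_norm @ H @ W, 0.0)
    print(H_next.shape)            # (4, 16)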

14 of 23

Project Learnings

    • Technical - Performance, support, skills
    • Business - Value, contracting, financial tracking
    • Programmatic - Staffing, timing, funding


15 of 23

Technical

  • Complex, large-scale simulations see great advantages in cloud computing environments, provided the configurations are optimized
    • No queue times
    • Scalable compute resources
    • Placement groups and the Bulk API reduce latency, but can also generate very large bills
    • Spot instances
  • Be mindful of latency
    • Checkpoints
    • Fault-tolerant code
  • Large-scale runs require proper elasticity and cost management (see the tagging sketch after this list)
    • Tagging
    • Resource management
  • Large-scale runs also require institutional backing and negotiation on provider restrictions
    • No tweaking of hyperthreading settings
  • New features/hardware offer performance improvements, but often with little documentation or support
  • Data movement costs need to be considered (e.g., using a load balancer for ingesting data)
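
A minimal sketch of the tagging point above, assuming boto3 and placeholder instance IDs; consistent tags let the cloud bill be broken down per project or grant.

    import boto3

    ec2 = boto3.client("ec2")

    # Attach cost-allocation tags to already-running instances (IDs are placeholders).
    ec2.create_tags(
        Resources=["i-0123456789abcdef0"],
        Tags=[
            {"Key": "Project", "Value": "E-CAS"},
            {"Key": "Grant", "Value": "NSF-1904444"},
            {"Key": "Owner", "Value": "research-group"},   # hypothetical owner tag
        ],
    )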


16 of 23

Technical

  • Well suited to complex “microservices”, for example MLaaS, including high-speed training
  • Elastic orchestration and clean-up of resources can be automated (Terraform); a minimal clean-up sketch follows this list

  • Good for community contributions and access

  • Trade-offs:
    • Complexity of choice
    • Skills and experience
    • Abstraction from hardware
    • Cost prediction and management
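
The teams used Terraform for orchestration and clean-up; as a minimal alternative sketch of the same clean-up idea, the boto3 snippet below terminates every running instance carrying an assumed Project=E-CAS tag.

    import boto3

    ec2 = boto3.client("ec2")

    # Find running instances tagged for this project and terminate them after the run.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Project", "Values": ["E-CAS"]},           # assumed tag scheme
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [inst["InstanceId"]
           for reservation in resp["Reservations"]
           for inst in reservation["Instances"]]
    if ids:
        ec2.terminate_instances(InstanceIds=ids)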


17 of 23

Business and related processes

  • Commercial cloud is a commercial relationship: a complex ecosystem of individual stakeholders from funding agencies, legal, business offices, etc.
  • But committed spend can drive significant bonus credits
  • Institutions will have their own preferred processes and money flows, with local legal liability, compliance, and financial incentives
  • Product complexity and micro-charging increase pricing complexity, which in turn makes cost estimation and value calculation more difficult


18 of 23

Programmatic

  • All teams would have preferred more funding for staff/postgrad support

  • Overlap between phases, staff continuity, continuation of resources

  • “Shovel ready” – mostly only R1 institutions applied

  • Would a paid “research support” technical team from providers help?


19 of 23

Survey on ‘Research Computing Use of Cloud Platforms’

  • As part of this project, we conducted a survey
    • Aimed at CIOs and VPRs


20 of 23

What are the main opportunities/benefits of using cloud for research?

  • Can scale and provision resources quickly as needed
  • Ability to run very large simulations that use many thousands of cores
  • Can use pre-built models and functions without technical knowledge


21 of 23

What are the main difficulties when using cloud for research?

  • Lack of research specific training (most training is enterprise focused)
  • Spend control and fear of over-spend
  • Not enough budget/credits to complete everything


22 of 23

Conclusions

  • The range of hardware and services available is both a benefit and a difficulty

  • Documentation and support can be lacking for new services but are often very good for the basics

  • Technical skills and teamwork matter: fostering a “Continuous DevOps” and “Infrastructure as Code” culture and making cost accounting second nature will aid success

  • Skilled staff are required to make use of cloud funds/resources


23 of 23

Thank you


Contact info: aravipati@internet2.edu
