1 of 18

NCAR HPC Users Group (NHUG)

January 2023

January 3, 2023

Hosted by

NCAR’s Computational & Information Systems Lab

High Performance Computing Division

Consulting Services Group

This material is based upon work supported by the National Center for Atmospheric Research, which is a major facility sponsored by the National Science Foundation under Cooperative Agreement No. 1852977.

2 of 18

Agenda


NCAR HPC Users Group - Agenda

  • Welcome to 2023!!

  • Upcoming Events
    • January 17-20: HPC System Maintenance

  • Derecho Status

  • Introducing Virtual Consulting Services

  • Upcoming Changes: GPU Accounting

  • Round Table

3 of 18

NHUG Communication Channels


  • Wiki: https://bit.ly/3Bc04Wh

  • Slack: http://ncarhpcusergroup.slack.com - join and check out #cheyenne-users and #casper-users

  • Slido: http://www.slido.com/nhug (coming back soon)

4 of 18

Upcoming Events

  • HPC System Maintenance: January 17-20

NCAR’s HPC resources will be unavailable to users on January 17-20 while CISL staff perform network reconfiguration for the Derecho supercomputer and urgent maintenance on Cheyenne’s cooling infrastructure. Progress will be communicated through Notifier emails during the week.

CISL’s high-performance network must be reconfigured to accommodate the Derecho installation. The network maintenance will make all HPC resources - Cheyenne, Casper, Campaign Storage, GLADE, Quasar, Stratus, and JupyterHub - temporarily unavailable at the beginning of the maintenance window. The reconfiguration is expected to take the first day of the outage, after which most services will be restored, with the exception of the Cheyenne compute nodes.

The remainder of the outage will focus on replacing the working fluid in the cooling infrastructure of the Cheyenne compute node racks. This work is expected to have minimal user-level impact (software changes are being intentionally minimized), and CISL staff will verify resource functionality before releasing the systems to the user community.

https://arc.ucar.edu/articles/362


NCAR HPC Users Group - Upcoming Events

5 of 18

Upcoming Events

  • Dask ½-day tutorial February 6th

Planning is underway for a joint CISL/ESDS-sponsored Dask workshop covering basic and advanced usage, troubleshooting, and best practices.

  • NCAR/NOAA Open Hackathon

NCAR, NOAA, NVIDIA and OpenACC.org are hosting an Open Hackathon on February 21 (an online-only introduction) and February 28 through March 2 at the NCAR Mesa Lab in Boulder (hybrid format).

Details: https://arc.ucar.edu/articles/359


6 of 18

NWSC-3 Project Status

  • Derecho final components being delivered to NWSC this week!!
    • Hardware manufacturing complete:
      • Split into 3 shipments of multiple cabinets each
      • Both shipments of CPU racks arrived at NWSC in early December
      • Factory integration of the filesystem, all GPU racks, and a subset of CPU racks occurred in mid-December
    • NCAR HPC Network reconfiguration required to support Derecho integration, planned during January 17-20 outage

  • Derecho user environment and documentation in development, heavily leveraging Gust experience to date
    • Gust Documentation: https://bit.ly/Gust-test-system
    • Spack-based user software environment has been redeployed (22.12) using Spack v0.19

  • NCAR staff continue to work closely with the HPE team to resolve application porting issues as they arise:
    • resolving known issues with compilers for specific applications
    • maturing the Cray MPI environment for tighter integration with PBS
    • establishing best practices for GPU applications with and without managed memory


NCAR HPC Users Group - NWSC-3 Project Status

7 of 18

Derecho CPU Rack Installation - December 2022


8 of 18

Derecho - Production & Deployment Schedule


Delivery/Task Item                        Ship Date/Start Date
Four Compute Rack Hardware                11/18/2022
Remaining Production System Hardware      12/16/2022
Production System - Factory Trial         12/20/2022
Production System Installation            01/03/2023
Production System - Acceptance Testing    Jan & Feb 2023
Solution Acceptance                       04/27/2023
ASD Project                               05/01/2023
Open Derecho for Production               07/01/2023


9 of 18

Virtual Consulting: New for 2023

  • CISL’s Consulting Services Group (CSG) is pleased to announce the availability of virtual consulting services by appointment:

  • The intent is to augment the primary ticket support mechanism with a scheduled, one-on-one support model:
    • “Next availability” scheduling for general queries, or
    • scheduling with a particular consultant, intended as a follow-on to an existing ticket

  • Integrated with Google Calendar & Meet, ideal for problems that benefit from screen sharing.


Many thanks to ESDS for sharing their virtual office hours experience & infrastructure

10 of 18

GPU Accounting

  • Since August 2022, sam.ucar.edu has tracked both CPU and GPU hours used by PBS jobs

  • CISL is planning to enforce separate GPU hours charging on Casper beginning February 13, 2023:
    • This will allow us to ensure functionality of accounting infrastructure prior to Derecho deployment.
    • This change will also help better protect GPU resources from misuse, and better engage groups with an interest in GPU resource utilization.

  • In 2022, 159 unique users from 103 projects used the ‘gpgpu’ queue on Casper.
    • Casper GPU hours will be readily available, much as CPU hours are currently.
    • The intent is to grant initial GPU hours to major projects consistent with 2022 usage, with additional hours granted upon request.
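As a rough illustration of how per-job GPU-hour accounting can work, the sketch below derives GPU-hours from a PBS `select` statement and elapsed walltime. The helper names and the charging rule (GPU-hours = GPUs requested × walltime) are illustrative assumptions for this sketch, not SAM’s actual formula:

```python
import re

def parse_ngpus(select: str) -> int:
    """Sum ngpus across all chunks of a PBS select statement,
    e.g. 'select=2:ncpus=36:ngpus=4' -> 8 (2 chunks x 4 GPUs each)."""
    total = 0
    for chunk in select.removeprefix("select=").split("+"):
        fields = chunk.split(":")
        count = int(fields[0])          # leading chunk multiplier
        ngpus = 0
        for field in fields[1:]:
            match = re.fullmatch(r"ngpus=(\d+)", field)
            if match:
                ngpus = int(match.group(1))
        total += count * ngpus
    return total

def gpu_hours(select: str, walltime: str) -> float:
    """Charge GPU-hours as total GPUs requested x walltime (HH:MM:SS)."""
    h, m, s = (int(x) for x in walltime.split(":"))
    return parse_ngpus(select) * (h + m / 60 + s / 3600)

# A 4-GPU job that ran for 6 hours accrues 24 GPU-hours;
# a CPU-only job accrues none.
print(gpu_hours("select=1:ncpus=36:ngpus=4", "06:00:00"))  # -> 24.0
print(gpu_hours("select=1:ncpus=36", "06:00:00"))          # -> 0.0
```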


11 of 18

Casper gpgpu 2022 usage


12 of 18

SAM: GPU Tables

  • See sam.ucar.edu for updated accounting information including GPU usage

  • 2023 allocations will be estimated from 2022 usage and updated upon request


13 of 18

Casper V100 Usage: % of All GPUs (2022)


14 of 18

Casper V100 Usage: % of Each Node’s GPUs (2022)


15 of 18

Questions, Comments, Feedback? Thank You!


NCAR HPC Users Group - Wrap Up

16 of 18

Backup


17 of 18

Casper V100 Usage: % of Each GPU (2022)


18 of 18

NCAR’s High-Performance Computing, Data, & Analysis Resources: 2023

HPC Systems

  • Cheyenne (2017, SGI/HPE): 4,032 nodes, 145,152 cores, 313 TB total memory, 4.79 PFlop/s
    • #21 supercomputer in the world at debut, #109 presently
  • Derecho (mid-2023, Cray/HPE): 2,570 nodes, 323,712 CPU cores, 680 TB total memory, 3.5X the performance of Cheyenne
    • 328 NVIDIA A100 GPUs providing 20% of overall performance, 19.87 PFlop/s (projected)

Data Analysis & Visualization

  • Casper: heterogeneous system for data analysis & visualization
    • 75 High-Throughput Computing nodes
    • 9 visualization nodes with accelerated graphics
    • 10 dense GPU nodes for AI/ML and code development
    • 4 nodes for Research Data processing
    • 2 large-memory nodes (1.5 TB each)
  • CISL develops specialized visualization software & services for Earth Science applications:
    • https://geocat.ucar.edu
    • http://projectpythia.org

High Performance Storage

  • GLADE & Campaign Storage: 132 PB long-term, online storage (17,464 hard drives, 56 servers)
  • Derecho ‘scratch’ storage: 60 PB short-term storage, principally supporting HPC jobs (5,088 hard drives, 24 servers)
  • Stratus: 5 PB object storage system (588 hard drives, 6 servers)
  • Quasar tape library: 35 PB long-term archival storage (22 IBM TS1160 tape drives, 1,774 20 TB tape cartridges, 216 hard drives, 2 PB disk cache, 5 data mover servers)

NCAR HPC Users Group - 2023 HPC Resource Overview