# | Session Date | Session Time | Session Title | Presentation Time | Live | Submission ID | Submission Title | Contributors | Presenters | Abstract/Description |
---|---|---|---|---|---|---|---|---|---|---|
2 | 2021-05-03 | 08:15-08:45 | Acceptance and Testing | 08:15-08:30 | Yes | pap105 | Acceptance Testing the Chicoma HPE Cray EX Supercomputer | Kody Everson, Paul Ferrell, Jennifer Green, Francine Lapid, Daniel Magee, Jordan Ogas, Calvin Seamons, Nicholas Sly | Jennifer Green | Since the installation of MANIAC I in 1952, Los Alamos National Laboratory (LANL) has been at the forefront of addressing global crises using state-of-the-art computational resources to accelerate scientific innovation and discovery. This generation faces a new crisis in the global COVID-19 pandemic that continues to damage economies, health, and wellbeing; LANL is supplying high-performance computing (HPC) resources to contribute to the recovery from the impacts of this virus. Every system that LANL's HPC Division installs requires an understanding of the intended workloads for the system, the specifications and expectations of performance and reliability for supporting the science, and a testing plan to ensure that the final installation has met those requirements. Chicoma, named for a mountain peak close to Los Alamos, NM, USA, is a new HPC system at LANL purchased for supporting the Department of Energy's Office of Science Advanced Scientific Computing Research (ASCR) program. It is intended to serve as a platform to supply molecular dynamics simulation computing cycles for epidemiological modeling, bioinformatics, and chromosome/RNA simulations as part of the 2020 Coronavirus Aid, Relief, and Economic Security (CARES) Act. Chicoma is among the earliest installations of the HPE Cray EX supercomputer running the HPE-Cray Programming Environment (PE) for Shasta Architecture. This paper documents the Chicoma acceptance test-suite, configurations and tuning of the tests under Pavilion, presents the results of acceptance testing, and concludes with a discussion of the outcomes of the acceptance testing effort on Chicoma and future work. |
3 | 2021-05-03 | 08:15-08:45 | Acceptance and Testing | 08:30-08:45 | Yes | pap111 | A Step Towards the Final Frontier: Lessons Learned from Acceptance Testing of the First HPE/Cray EX 3000 System at ORNL | Veronica G. Vergara Larrea, Reuben Budiardja, Paul Peltz, Jeffery Niles, Christopher Zimmer, Daniel Dietz, Christopher Fuson, Hong Liu, Paul Newman, James Simmons | Veronica G. Vergara Larrea, Reuben Budiardja | In this paper, we summarize the deployment of the Air Force Weather (AFW) HPC11 system at Oak Ridge National Laboratory (ORNL) including the process followed to successfully complete acceptance testing of the system. HPC11 is the first HPE/Cray EX 3000 system that has been successfully deployed to production in a federal facility. HPC11 consists of two identical 800-node supercomputers, Fawbush and Miller, with access to two independent and identical Lustre parallel file systems. HPC11 is equipped with Slingshot 10 interconnect technology and relies on the HPE Performance Cluster Manager (HPCM) software for system configuration. ORNL has a clearly defined acceptance testing process used to ensure that every new system deployed can provide the necessary capabilities to support user workloads. We worked closely with HPE and AFW to develop a set of tests for the United Kingdom’s Meteorological Office’s Unified Model (UM) and 4DVAR. We also included benchmarks and applications from the Oak Ridge Leadership Computing Facility (OLCF) portfolio to fully exercise the HPE/Cray programming environment and evaluate the functionality and performance of the system. Acceptance testing of HPC11 required parallel execution of each element on Fawbush and Miller. In addition, careful coordination was needed to ensure successful acceptance of the newly deployed Lustre file systems alongside the compute resources. In this work, we present test results from specific system components and provide an overview of the issues identified, challenges encountered, and the lessons learned along the way. |
4 | 2021-05-03 | 09:00-09:30 | Storage and I/O 1 | 09:00-09:15 | Yes | pres109 | New data path solutions from HPE for HPC simulation, AI, and high performance workloads | Lance Evans, Marc Roskow | Lance Evans, Marc Roskow | HPE is extending its HPC storage portfolio to include an IBM Spectrum Scale based solution. The HPE solution will leverage HPE servers and the robustness of Spectrum Scale to address the increasing demand for “enterprise” HPC systems. IBM Spectrum Scale is an enterprise-grade parallel file system that provides superior resiliency, scalability and control. IBM Spectrum Scale delivers scalable capacity and performance to handle demanding data analytics, content repositories and technical computing workloads. Storage administrators can combine flash, disk, cloud, and tape storage into a unified system with higher performance and lower cost than traditional approaches. Leveraging HPE ProLiant servers, we will deliver a range of storage (NSD), protocol, and data mover servers with a granularity that addresses small AI systems to large HPC scratch spaces with exceptional cost and flexibility. HPE is working with an emerging HPC data path software entrant known as DAOS. Designed from the ground up through a collaboration between Argonne National Laboratory and Intel, DAOS reduces I/O friction arising from increasingly complex and competing HPC I/O workloads, and exposes the full performance potential of next-generation fabrics, persistent memory, and flash media. DAOS improves both latency and concurrency via an efficient scale-out software-defined storage architecture, without relying on centralized metadata or locking services. Applications can interact with DAOS either through its native object interface or through application-specific middleware, keeping core functionality lean. HPE’s initial solution bundle combines DAOS with ProLiant servers, InfiniBand or Slingshot networking, and HPE system management software to deploy DAOS for maximum productivity. |
5 | 2021-05-03 | 09:00-09:30 | Storage and I/O 1 | 09:15-09:30 | Yes | pres103 | Lustre and Spectrum Scale: Simplify parallel file system workflows with HPE Data Management Framework | Mark Wiertalla | Mark Wiertalla | The expanded use of Lustre and Spectrum Scale across supercomputing environments is creating a new set of data management challenges at scale – indexing, finding, moving and protecting data. Additionally, the tool kits have evolved independently around each of them – marginally supported open-source tools for Lustre, as well as licensed products and services from IBM for Spectrum Scale. Cray users who have been using RobinHood, HPSS, and other tools to supplement their chosen PFS – and sometimes both – will want to attend this session to learn how HPE Data Management Framework can solve these old and new problems using a single, scalable software stack. In this session, co-hosted by a hands-on development director and a seasoned solution architect, we will explore the revolutionary architecture behind HPE Data Management Framework that enables exascale data management solutions for high-performance computing. |
6 | 2021-05-03 | 09:45-10:15 | Storage and I/O 2 | 09:45-10:00 | Yes | pap103 | h5bench: HDF5 I/O Kernel Suite for Exercising HPC I/O Patterns | Tonglin Li, Houjun Tang, Qiao Kang, John Ravi, Quincey Koziol, Suren Byna | Suren Byna | Parallel I/O is a critical technique for moving data between the compute and storage subsystems of supercomputing systems. With massive amounts of data being produced or consumed by compute nodes, efficient parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of good I/O benchmarks that are representative of current workloads on HPC systems. Towards creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. These include I/O operations (read, write, metadata), data locality (contiguous or strided in memory or in storage), array dimensionality (1D arrays, 2D meshes, 3D cubes), I/O modes (synchronous and asynchronous), and processor type (CPUs and GPUs). In this paper, we present the observed performance of h5bench executed along all these dimensions on two Cray systems: Cori at NERSC, using both the DataWarp burst buffer and a Lustre file system, and Theta at the Argonne Leadership Computing Facility (ALCF), using a Lustre file system. These performance measurements help to find performance bottlenecks, identify root causes of any poor performance, and optimize I/O performance. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful not only to the CUG community but also to the broader supercomputing community. |
7 | 2021-05-03 | 09:45-10:15 | Storage and I/O 2 | 10:00-10:15 | Yes | pap120 | Architecture and Performance of Perlmutter's 35 PB ClusterStor E1000 All-Flash File System | Glenn K. Lockwood, Nicholas J. Wright | Glenn K. Lockwood | NERSC's newest system, Perlmutter, features a 35 PB all-NVMe Lustre file system built on HPE Cray ClusterStor E1000. In this paper, we will present the architecture of the Perlmutter file system, starting with its node-level design that balances SSD, PCIe, and Slingshot performance, and then discussing the high-level network integration. We also demonstrate early Lustre performance measurements on ClusterStor E1000 for both traditional dimensions of I/O performance (peak bulk-synchronous bandwidth and metadata rates) and non-optimal workloads endemic to production HPC (low-concurrency, misaligned, and incoherent I/O). These results are compared to the performance of the disk-based Lustre and NVMe burst buffer of NERSC's previous-generation system, Cori, to illustrate where all-NVMe provides unique new capabilities for parallel I/O. |
8 | 2021-05-04 | 08:05-08:52 | System Analytics and Monitoring | 08:05-08:20 | Yes | pap107 | Integrating System State and Application Performance Monitoring: Network Contention Impact | Jim Brandt, Tom Tucker, Simon Hammond, Ben Schwaller, Ann Gentile, Kevin Stroup, Jeanine Cook | Jim Brandt | Discovering and attributing application performance variation in production HPC systems requires continuous concurrent information on the state of the system and applications, and of applications’ progress. Even with such information, there is a continued lack of understanding of how time-varying system conditions relate to a quantifiable impact on application performance. We have developed a unified framework to obtain and integrate, at run time, both system and application information to enable insight into application performance in the context of system conditions. The Lightweight Distributed Metric Service (LDMS) is used on several significant large-scale Cray platforms for the collection of system data and is planned for inclusion on several upcoming HPE systems. We have developed a new capability to inject application progress information into the LDMS data stream. The consistent handling of both system and application data eases the development of storage, performance analytics, and dashboards. We illustrate the utility of our framework by providing runtime insight into application performance in conjunction with network congestion assessments on a Cray XC40 system with a beta Programming Environment being used to prepare for the upcoming ACES Crossroads system. We describe possibilities for application to the Slingshot network. The complete system is generic and can be applied to any *-nix system; the system data can be obtained by both generic and system-specific data collection plugins (e.g., Aries vs Slingshot counters); and no application changes are required when the injection is performed by a portability abstraction layer, such as that employed by kokkos. |
9 | 2021-05-04 | 08:05-08:45 | System Analytics and Monitoring | 08:20-08:35 | Yes | pap115 | trellis — An Analytics Framework for Understanding Slingshot Performance | Madhu Srinivasan, Dipanwita Mallick, Kristyn Maschhoff | Madhu Srinivasan, Dipanwita Mallick, Kristyn Maschhoff | The next generation HPE Cray EX and HPE Apollo supercomputers with Slingshot interconnect are breaking new ground in the collection and analysis of system performance data. The monitoring frameworks on these systems provide visibility into Slingshot's operational characteristics through advanced instrumentation and transparency into real-time network performance. There still exists, however, a wide gap between the volume of telemetry generated by Slingshot and a user's ability to assimilate and explore this data to derive critical, timely, and actionable insights about fabric health, application performance, and potential congestion scenarios. In this work, we present trellis, an analytical framework built on top of the Slingshot monitoring APIs. The goal of trellis is to provide system administrators and researchers insight into network performance and its impact on complex workflows that include both AI and traditional simulation workloads. We also present a visualization interface, built on trellis, that allows users to interactively explore various levels of the network topology over specified time windows and gain key insights into job performance and communication patterns. We demonstrate these capabilities on an internal Shasta development system and visualize Slingshot's innovative congestion control and adaptive routing in action. |
10 | 2021-05-04 | 08:05-08:45 | System Analytics and Monitoring | 08:35-08:50 | Yes | pap121 | AIOps: Leveraging AI/ML for Anomaly Detection in System Management | Sergey Serebryakov, Jeff Hanson, Tahir Cader, Deepak Nanjundaiah, Joshi Subrahmanya | Sergey Serebryakov | HPC datacenters rely on set-points and dashboards for system management, which leads to thousands of false alarms. Exascale systems will deploy thousands of servers and sensors, produce millions of data points per second, and be more prone to management errors and equipment failures. HPE and the National Renewable Energy Laboratory (NREL) are using AI/ML to improve data center resiliency and energy efficiency. HPE has developed and deployed in NREL’s production environment (since June 2020) an end-to-end anomaly detection pipeline that operates in real time, automatically, and at massive scale. In the paper, we will provide detailed results from several end-to-end anomaly detection workflows either already deployed at NREL or to be deployed soon. We will describe the upcoming AIOps release as a technology preview with HPCM 1.5, plans for future deployment with Cray System Manager, and potential use as an Edge processor (inferencing engine) for HPE’s InfoSight analytics platform. |
11 | 2021-05-04 | 08:05-08:45 | System Analytics and Monitoring | pre-recorded | No | pap119 | Real-time Slingshot Monitoring in HPCM | Priya K, Prasanth Kurian, Jyothsna Deshpande | Priya K, Prasanth Kurian, Jyothsna Deshpande | HPE Performance Cluster Manager (HPCM) software is used to provision, monitor, and manage HPC cluster hardware and software components. HPCM has a centralized monitoring infrastructure for persistent storage of telemetry and for alerting on these metrics based on thresholds. Slingshot fabric management and monitoring is a new feature of the HPCM monitoring infrastructure. The Slingshot Telemetry (SST) monitoring framework in HPCM is used for collecting and storing Slingshot fabric health and performance telemetry. Real-time telemetry information gathered by SST is used for fabric health monitoring, real-time analytics, visualization, and alerting. The solution scales both vertically and horizontally, handling huge volumes of telemetry data. The flexible and extensible model of the SST collection agent makes it easy to collect metrics at different granularities and intervals. Visualization dashboards are designed to suit different use cases, giving a complete view of fabric health. |
12 | 2021-05-04 | 08:05-08:45 | System Analytics and Monitoring | pre-recorded | No | pap116 | Analytic Models to Improve Quality of Service of HPC Jobs | Saba Naureen, Prasanth Kurian, Amarnath Chilumukuru | Saba Naureen, Prasanth Kurian, Amarnath Chilumukuru | A typical High Performance Computing (HPC) cluster comprises components such as CPUs, memory, GPUs, Ethernet, fabric, storage, racks, cooling devices, and switches. A cluster usually consists of thousands of compute nodes interconnected using an Ethernet network for management tasks and a fabric network for data traffic. Job schedulers need to be aware of the health and availability of the cluster components in order to deliver high performance results. Since the failure of any component will adversely impact the overall performance of a job, identifying issues or outages is critical for ensuring the desired Quality of Service (QoS) is achieved. We showcase an analytics-based model, implemented as part of HPE Performance Cluster Manager (HPCM), that gathers and analyzes telemetry data pertaining to the various cluster components, such as racks, enclosures, cluster nodes, storage devices, fabric switches, Cooling Distribution Units (CDUs), ARC (Adaptive Rack Cooling), Chassis Management Controllers (CMCs), fabric, power supplies, and system logs. This real-time status information, based on the telemetry data, is utilized by job schedulers to perform scheduling tasks effectively. It enables schedulers to make smart decisions and schedule jobs only on healthy nodes, thus preventing job failures and wastage of computational resources. Our solution enables HPC job schedulers to be health-aware, improving cluster reliability and the overall customer experience. |
13 | 2021-05-04 | 09:05-09:50 | Systems Support | 09:05-09:15 | Yes | pap102 | Blue Waters System and Component Reliability | Brett Bode, David King, Celso Mendes, William Kramer, Saurabh Jha, Roger Ford, Justin Davis, Mark Dalton, Steven Dramstad | Brett Bode | The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual-socket CPU (XE) and single-socket CPU, single-GPU (XK) nodes. The primary storage is provided by Cray’s Sonexion/ClusterStor Lustre storage system, delivering 35 PB (raw) of storage at 1 TB/sec. The statistical failure rates over time for each component, including CPUs, DIMMs, GPUs, disk drives, power supplies, and blowers, and their impact on higher-level failure rates for individual nodes and the system as a whole are presented in detail, with a particular emphasis on identifying any increases in rate that might indicate the right side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented. |
14 | 2021-05-04 | 09:05-09:50 | Systems Support | 09:15-09:30 | Yes | pap110 | Configuring and Managing Multiple Shasta Systems: Best Practices Developed During the Perlmutter Deployment | James Botts, Zachary Crisler, Aditi Gaur, Douglas Jacobsen, Harold Longley, Alex Lovell-Troy, Dave Poulsen, Eric Roman, Chris Samuel | Douglas Jacobsen | The Perlmutter supercomputer and related test systems provide an early look at Shasta system management and our ideas on best practices for managing Shasta systems. The cloud-native software and Ethernet-based networking on the system enable tremendous flexibility in management policies and methods. Based on work performed using the Shasta 1.3 and preview 1.4 releases, NERSC has developed, in close collaboration with HPE through the Perlmutter System Software COE, methodologies for efficiently managing multiple Shasta systems. We describe how we template and synchronize configurations and software between systems and orchestrate manipulations of the configuration of the managed system. Key to this is a secured external management system that provides both a configuration origin for the system and an interactive management space. Leveraging this external management system, we simultaneously create a systems-development environment and secure key aspects of the Shasta system, enabling NERSC to rapidly deploy the Perlmutter system. |
15 | 2021-05-04 | 09:05-09:50 | Systems Support | 09:30-09:45 | Yes | pap109 | Slurm on Shasta at NERSC: adapting to a new way of life | Christopher Samuel, Douglas M Jacobsen, Aditi Gaur | Christopher Samuel | Shasta, with its heady mix of Kubernetes, containers, software-defined networking, and 1970s batch computing, provides a vast array of new concepts, strategies, and acronyms for traditional HPC administrators to adapt to. NERSC has been working through this maze to take advantage of the new capabilities that Shasta brings in order to provide a stable and expected interface for traditional HPC workloads on Perlmutter, whilst also taking advantage of Shasta and new abilities in Slurm to provide more modern interfaces and capabilities for production use. This paper discusses the decisions that have been made regarding the deployment of Slurm on Perlmutter at NERSC, how we are faring, and what is still in development, as well as how this is all tied up with Shasta’s own development over the months. |
16 | 2021-05-04 | 09:05-09:50 | Systems Support | pre-recorded | No | pap106 | Declarative automation of compute node lifecycle through Shasta API integration | J. Lowell Wofford | J. Lowell Wofford | Using the Cray Shasta system available at Los Alamos National Laboratory, we have experimented with integrating with various components of the HPE Cray Shasta software stack through the provided APIs. We have integrated with a LANL open-source software project, Kraken, which provides distributed state-based automation to provide new automation and management features to the Shasta system. We have focused on managing Shasta compute node lifecycle with Kraken, providing automation to node operations such as node image, kernel and configuration management. We examine the strengths and challenges of integrating with the Shasta APIs and discuss possibilities for further API integrations. |
17 | 2021-05-04 | 09:05-09:50 | Systems Support | pre-recorded | No | pres107 | Cray EX Shasta v1.4 System Management Overview | Harold Longley | Harold Longley | How do you manage a Cray EX (Shasta) system? This overview describes the Cray System Management software in the Shasta v1.4 release. This release has introduced new features such as booting management nodes from images, product streams, and configuration layers. The foundation of containerized microservices orchestrated by Kubernetes on the management nodes provides a highly available and resilient set of services to manage the compute and application nodes. Lower level hardware control is based on the DMTF Redfish standard enabling higher level hardware management services to control and monitor components and manage firmware updates. The network management services enable control of the high speed network fabric. The booting process relies upon preparation of images and configuration as well as run-time interaction between the nodes and services while nodes boot and configure. All microservices have published RESTful APIs for those who want to integrate management functions into their existing DevOps environment. The v1.4 software includes the cray CLI and the SAT (System Administration Toolkit) CLI which are clients that use these services. Identity and access management protect critical resources, such as the API gateway. Non-administrative users access the system either through a multi-user Linux node (User Access Node) or a single-user container (User Access Instance) managed by Kubernetes. Logging and telemetry data can be sent from the system to other site infrastructure. The tools for collection, monitoring, and analysis of telemetry and log data have been improved with new alerts and notifications. |
18 | 2021-05-04 | 09:05-09:50 | Systems Support | pre-recorded | No | pres111 | Managing User Access with UAN and UAI | Harold Longley, Alex Lovell-Troy, Gregory Baker | Harold Longley | User Access Nodes (UANs) and User Access Instances (UAIs) represent the primary entry point for users on a Cray EX system to develop, build, and execute their applications on the Cray EX compute nodes. The UAN is a traditional, multi-user Linux node. The UAI is a dynamically provisioned, single user container which can be customized to the user’s needs. This presentation will describe the state of the Shasta v1.4 software for user access with UAN and UAI, provisioning software products for users, providing access to shared filesystems, granting and revoking authentication and authorization, logging of access, and monitoring of resource utilization. |
19 | 2021-05-04 | 09:05-09:50 | Systems Support | pre-recorded | No | pap112 | User and Administrative Access Options for CSM-Based Shasta Systems | Alex Lovell-Troy, Sean Lynn, Harold Longley | Alex Lovell-Troy, Harold Longley | Cray System Management (CSM) from HPE is a cloud-like control system for High Performance Computing. CSM is designed to integrate the Supercomputer with multiple datacenter networks and provide secure administrative access via authenticated REST APIs. Access to the compute nodes and to the REST APIs may need to follow different network paths which has network routing implications. This paper outlines the flexible network configurations and guides administrators planning their Shasta/CSM systems. Site Administrators have configuration options for allowing users and administrators to access the REST APIs from outside. They also have options for allowing applications running on the compute nodes to access these same APIs. This paper is structured around three themes. The first theme defines a layer2/layer3 perimeter around the system and addresses upstream connections to the site network. The second theme deals primarily with layer 3 subnet routing from the network perimeter inward. The third theme deals with administrative access control at various levels of the network as well as user-based access controls to the APIs themselves. Finally, this paper will combine the themes to describe specific use cases and how to support them with available administrative controls. |
20 | 2021-05-04 | 09:05-09:50 | Systems Support | pre-recorded | No | pres105 | HPE Ezmeral Container Platform: Current And Future | Thomas Phelan | Thomas Phelan | The HPE Ezmeral Container Platform is the industry's first enterprise-grade container platform for both cloud-native and distributed non-cloud-native applications using the open-source Kubernetes container orchestrator. Ezmeral enables true hybrid cloud operations across any location: on-premises, public cloud, and edge. Today, the HPE Ezmeral Container Platform is largely used for enterprise AI/ML/DL applications. However, the industry is starting to see a convergence of AI/ML/DL and High Performance Computing (HPC) workloads. This session will present an overview of the HPE Ezmeral Container Platform - its architecture, features, and use cases. It will also provide a look into the future product roadmap, where the platform will support HPC workloads as well. |
21 | 2021-05-05 | 08:00-08:45 | Applications and Performance (ARM) | 08:00-08:15 | Yes | pap122 | An Evaluation of the A64FX Architecture for HPC Applications | Andrei Poenaru, Tom Deakin, Simon McIntosh-Smith, Si Hammond, Andrew Younge | Andrei Poenaru | In this paper, we present some of the first in-depth, rigorous, independent benchmark results for the A64FX, the processor at the heart of Fugaku, the current #1 supercomputer in the world, and now available in Apollo 80 guise. The Isambard and Astra research teams have combined to perform this study, using a combination of mini-apps and application benchmarks to evaluate A64FX's performance for both compute- and bandwidth-bound scenarios. The study uniquely had access to all four major compilers for A64FX: Cray, Arm, GNU and Fujitsu. The results showed that the A64FX is extremely competitive, matching or exceeding contemporary dual-socket x86 servers. We also report tuning and optimisation techniques which proved essential for achieving good performance on this new architecture. |
22 | 2021-05-05 | 08:00-08:45 | Applications and Performance (ARM) | 08:15-08:30 | Yes | pap117 | Vectorising and distributing NTTs to count Goldbach partitions on Arm-based supercomputers | Ricardo Jesus, Tomás Oliveira e Silva, Michèle Weiland | Ricardo Jesus | In this paper we explore the usage of SVE to vectorise number-theoretic transforms (NTTs). In particular, we show that 64-bit modular arithmetic operations, including modular multiplication, can now be efficiently implemented with SVE instructions. The vectorisation of NTT loops and similar code structures involving 64-bit modular operations was not possible in previous Arm-based SIMD architectures, since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on an A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. We evaluate this transform on HPE Apollo 70 and Cray XC50 systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilised to count the number of Goldbach partitions of the even numbers to large limits. We present some preliminary results concerning this last problem, in particular the curve of the even numbers up to 2^40 whose number of partitions is larger than the number of partitions of all previous integers. |
23 | 2021-05-05 | 08:00-08:45 | Applications and Performance (ARM) | 08:30-08:45 | Yes | pap118 | Optimizing a 3D multi-physics continuum mechanics code for the HPE Apollo 80 System | Vince Graziano, David Nystrom, Howard Pritchard, Brandon Smith, Brian Gravelle | Vince Graziano | We present results of a performance evaluation of a LANL 3D multi-physics continuum mechanics code - Pagosa - on an HPE Apollo 80 system. The Apollo 80 features the Fujitsu A64FX ARM processor with Scalable Vector Extension (SVE) support and high bandwidth memory. This combination of SIMD vector units and high memory bandwidth offers the promise of realizing a significant fraction of the theoretical peak performance for applications like Pagosa. In this paper we present performance results of the code using the GNU, ARM, and CCE compilers, analyze these compilers’ ability to vectorize performance critical loops when targeting the SVE instruction set, and describe code modifications to improve the performance of the application on the A64FX processor. |
24 | 2021-05-05 | 09:00-09:45 | Applications and Performance | 09:00-09:15 | Yes | pap114 | Optimizing the Cray Graph Engine for Performant Analytics on Cluster, SuperDome Flex, Shasta Systems and Cloud Deployment | Christopher Rickett, Kristyn Maschhoff, Sreenivas Sukumar | Christopher Rickett, Kristyn Maschhoff | We present updates to the Cray Graph Engine, a high-performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependencies on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, InfiniBand clusters, and Slingshot interconnect-based Shasta systems. |
25 | 2021-05-05 | 09:00-09:45 | Applications and Performance | 09:15-09:30 | Yes | pap104 | Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis | Johannes P. Blaschke, Aaron S. Brewster, Daniel Paley, Derek Mendez, Nicholas K. Sauter, Deborah Bard | Johannes P. Blaschke | X-ray scattering experiments using free electron lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences from two experiments at LCLS during September 2020. Raw data was analyzed on NERSC’s Cori system using the super-facility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed (using CCTBX). We achieved real-time data analysis with a 20-minute turnaround time from data acquisition to full molecular reconstruction -- sufficient time for the experiment’s operators to make informed decisions between shots. |
26 | 2021-05-05 | 09:00-09:45 | Applications and Performance | 09:30-09:45 | Yes | pap108 | Early Experiences Evaluating the HPE/Cray Ecosystem for AMD GPUs | Veronica G. Vergara Larrea, Reuben Budiardja, Wayne Joubert | Veronica G. Vergara Larrea, Reuben Budiardja, Wayne Joubert | Since deploying the Titan supercomputer in 2012, the Oak Ridge Leadership Computing Facility (OLCF) has continued to support and promote GPU-accelerated computing among its user community. Summit, the flagship system at the OLCF --- currently number 2 in the most recent TOP500 list --- has a theoretical peak performance of approximately 200 petaflops. Because the majority of Summit’s computational power comes from its 27,648 GPUs, users must port their applications to one of the supported programming models in order to make efficient use of the system. Looking ahead to Frontier, the OLCF’s exascale supercomputer, users will need to adapt to an entirely new ecosystem which will include new hardware and software technologies. First, users will need to familiarize themselves with the AMD Radeon GPU architecture. Furthermore, users who have been previously relying on CUDA will need to transition to the Heterogeneous-Computing Interface for Portability (HIP) or one of the other supported programming models (e.g., OpenMP, OpenACC). In this work, we describe our initial experiences in porting three applications or proxy apps currently running on Summit to the HPE/Cray ecosystem to leverage the compute power from AMD GPUs: minisweep, GenASiS, and Sparkler. Each one is representative of current production workloads utilized at the OLCF, different programming languages, and different programming models. We also share lessons learned from challenges encountered during the porting process and provide preliminary results from our evaluation of the HPE/Cray Programming Environment and the AMD software stack using these key OLCF applications. |
27 | 2021-05-06 | 07:00-09:00 | PEAD | pre-recorded | No | pap113 | Update of Cray Programming Environment | John Levesque | John Levesque | Over the past year, the Cray Programming Environment (CPE) engineers have been hard at work on numerous projects to make the compiler and tools easier to use and to interact well with the new GPU systems. This talk will cover those facets of development and will give a futures perspective on where CPE is going. We recognize that CPE is the only programming environment that gives application developers a portable development interface across all the popular nodes and GPU options. The one major complaint is that CPE is a strictly standard-enforcing compiler, which makes it incompatible with Intel and GNU, which allow non-standard extensions. This complaint is being addressed. We are also modifying the software to be usable with newer software components like containers and Spack. Additionally, CPE will be supported on HPE systems beyond the traditional Cray systems. Finally, there are numerous new products being developed for CORAL-2 systems, which will be beneficial to the entire HPE community. |
28 | 2021-05-05 | 09:00-09:45 | Applications and Performance | pre-recorded | No | pres101 | Convergence of AI and HPC at HLRS. Our Roadmap. | Dennis Hoppe | Dennis Hoppe | The growth of artificial intelligence (AI) is accelerating. AI has left research and innovation labs and nowadays plays a significant role in everyday lives. The impact on society is graspable: autonomous cars produced by Tesla, voice assistants such as Siri, and AI systems that beat renowned champions in board games like Go. All these advancements are facilitated by powerful computing infrastructures based on HPC and advanced AI-specific hardware, as well as highly optimized AI codes. For several years, HLRS has been engaged in big data and AI-specific activities around HPC. The road towards AI at HLRS began several years ago with the installation of a Cray Urika-GX for processing large volumes of data. But because the platform was isolated and, for HPC users, presented an unfamiliar usage concept, uptake of this system was lower than expected. This changed drastically with the recent installation of a CS-Storm equipped with powerful GPUs. Since then, we have also been extending our HPC system with GPUs due to high customer demand. We foresee that the duality of using AI and HPC on different systems will soon be overcome, and hybrid AI/HPC workflows will eventually be possible. In this talk, I will give a brief overview of our research project CATALYST, which engages with researchers and SMEs, and present exciting case studies from some of our customers that leverage AI. This will be placed in the context of the overall AI strategy of HLRS, including lessons learned over the years on different Cray/HPE systems such as the Urika-GX. |
29 | 2021-05-05 | 09:00-09:45 | Applications and Performance | pre-recorded | No | pres102 | Porting Codes to LUMI | Georgios Markomanolis | Georgios Markomanolis | LUMI is an upcoming EuroHPC pre-exascale supercomputer built by HPE Cray, with a peak performance of a bit over 550 petaflop/s. Many countries of the LUMI consortium will have access to this system, among other users. The system will be based on the next generation of AMD Instinct GPUs, which is a new environment for all of us. In this presentation, we discuss the AMD ecosystem and present, with examples, the procedure for converting CUDA codes to HIP, as well as how to port Fortran codes with hipfort. We discuss the utilization of other HIP libraries and demonstrate a performance comparison between CUDA and HIP. We explore the challenges that scientists will have to handle while porting their applications, and we provide step-by-step guidance. Finally, we discuss the potential of other programming models and the workflow that we follow to port codes depending on their readiness for GPUs and the programming language used. |
30 | 2021-05-05 | 07:45-07:50 | Sponsor Talk: NVIDIA | 07:45-07:50 | Yes | | The Unreasonable Effectiveness of Fortran Standard Parallelism in NWChem | Jeff Hammond | Jeff Hammond | NWChem is a widely used quantum chemistry code that supports essentially all computing platforms used by scientists, from PCs to the biggest supercomputers. Distributed memory parallelism in NWChem is addressed by Global Arrays, which usually sits on top of MPI. While NWChem has adopted CUDA and OpenMP for GPUs in some modules, it lacks holistic support for GPUs, and suffers from the inconsistent quality of OpenMP 5 compilers. This talk will describe our positive experience with Fortran standard parallelism for GPUs, as found in the NVIDIA HPC compilers. |
31 | 2021-05-04 | 10:00-10:15 | Sponsor Talk: Intel | 10:00-10:15 | Yes | | Performance Made Flexible | Bob Burroughs | Bob Burroughs | Performance made flexible is a core attribute enabled by 3rd Gen Intel® Xeon® Scalable Processors. Today's discussion will focus on the unique value that the included technologies unlock for HPC customers. |