
Scaling Millions of Quantum Chemical Calculations

Scientific Achievement

We present a high-performance, scalable ensemble management framework for running data-intensive quantum chemical electronic structure calculations on organic molecules. The framework provides abstractions for plugging in different first-principles-based semi-empirical methods and executes them efficiently at large scale on HPC systems.

Significance and Impact

The ensemble framework was used to generate four publicly available molecular datasets that contain first-principles calculations for up to 10 million organic molecules. These datasets are used to train neural networks on the Frontier and Perlmutter supercomputers for predicting molecular properties.

The figure shows the architecture of the workflow management framework for running large ensembles of molecular calculations that generate a large number of files. First-principles (FP) calculations are dynamically distributed among the CPU cores to use compute resources efficiently. The framework's ‘Data Plane’ transparently creates a separate staging area on the compute node for every FP calculation. All output files are automatically redirected to this faster staging area, and only the final files are copied to the slower, shared parallel file system (PFS) upon completion. This greatly reduces the overhead on the PFS, allowing thousands of molecules to be processed concurrently. The framework required no modifications to the FP applications used in this work.
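The staging pattern described above can be illustrated with a minimal Python sketch. It is not the framework's implementation: the function names, the `*.out` convention for final files, and the stand-in `run_fp_calculation` are all hypothetical, and a thread pool stands in for the CPU cores of a compute node. It only shows the two ideas together: workers pull the next molecule as soon as they are free, and each calculation writes into its own staging directory, from which only final files reach the shared location.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def run_fp_calculation(molecule_id: str, workdir: Path) -> None:
    """Hypothetical stand-in for an unmodified FP application writing into workdir."""
    (workdir / "scratch.tmp").write_text("intermediate data\n")  # temporary file
    (workdir / "result.out").write_text(f"energy for {molecule_id}\n")  # final file


def run_in_staging(molecule_id: str, shared_dir: Path, staging_root: Path) -> Path:
    """Run one calculation in its own staging area, then copy final files out."""
    staging = Path(tempfile.mkdtemp(prefix=f"{molecule_id}_", dir=staging_root))
    try:
        run_fp_calculation(molecule_id, staging)
        dest = shared_dir / molecule_id
        dest.mkdir(parents=True, exist_ok=True)
        # Assumption: final files are identified by a ".out" suffix.
        for final_file in staging.glob("*.out"):
            shutil.copy2(final_file, dest / final_file.name)
        return dest
    finally:
        # Temporary files never touch the shared file system.
        shutil.rmtree(staging, ignore_errors=True)


def run_ensemble(molecule_ids, shared_dir: Path, staging_root: Path, workers: int = 4):
    """Dynamic distribution: an idle worker picks up the next pending molecule."""
    completed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(run_in_staging, m, shared_dir, staging_root): m
            for m in molecule_ids
        }
        for future in as_completed(futures):
            future.result()  # re-raise any failure from the worker
            completed.append(futures[future])
    return sorted(completed)
```

In this sketch, `staging_root` would point at node-local storage and `shared_dir` at a directory on the PFS; both are ordinary paths supplied by the caller.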

Technical Approach

The framework combines dynamic task distribution with efficient data management techniques to manage large data
Scientists can easily plug in new methods and run the ensemble workflow at scale
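One way such a pluggable abstraction could look is sketched below in Python. This is an assumption for illustration only, not the framework's actual interface: the `Method` base class, the `REGISTRY` dictionary, the `register` decorator, and the `example_fp_app` executable name are all hypothetical. The idea is that each method only describes how to launch its application, so the ensemble driver stays unchanged when a new method is added.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Method(ABC):
    """Hypothetical interface for one electronic structure method."""

    name: str

    @abstractmethod
    def command(self, input_file: str) -> List[str]:
        """Return the command line that runs this method on one molecule."""


# Hypothetical registry mapping a method name to its class.
REGISTRY: Dict[str, Type[Method]] = {}


def register(cls: Type[Method]) -> Type[Method]:
    """Class decorator that makes a method selectable by name."""
    REGISTRY[cls.name] = cls
    return cls


@register
class ExampleSemiEmpirical(Method):
    """Placeholder method; 'example_fp_app' is not a real executable."""

    name = "example-semi-empirical"

    def command(self, input_file: str) -> List[str]:
        return ["example_fp_app", "--input", input_file]


def build_commands(method_name: str, input_files: List[str]) -> List[List[str]]:
    """Build one command per molecule for the selected method."""
    method = REGISTRY[method_name]()
    return [method.command(f) for f in input_files]
```

Adding another method would then mean writing one more decorated subclass, with no change to `build_commands` or to the code that schedules the resulting commands.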

PI(s)/Facility Lead(s): Scott Klasky, ORNL

Collaborating Institutions: Oak Ridge National Laboratory

ASCR Program: SciDAC RAPIDS2

ASCR PM: Kalyan Perumalla

Publication(s) for this work: [Accepted for publication] Kshitij Mehta et al., “Scaling Ensembles of Data-Intensive Quantum Chemical Calculations for Millions of Molecules,” IPDPSW 2024

Datasets: [1] 10.13139/OLCF/1890227, [2] 10.13139/OLCF/1907919, [3] 10.13139/OLCF/2318314, [4] 10.13139/OLCF/2318313

LOCAL LAB POC:

Kshitij Mehta and Scott Klasky, ORNL

TALKING POINTS:

A software framework for scalable molecular ensemble calculations for large systems has been developed
The framework employs dynamic task distribution for efficient resource utilization and a data plane for efficiently managing the overhead of the large number of files created during execution
The framework has been used to generate publicly available datasets that contain first-principles quantum chemical calculations for between 100,000 and over 10 million molecules
These datasets are being used to train HydraGNN, a graph convolutional neural network developed at ORNL, for predicting molecular properties

METADATA:

Name of the associated awarded project: SciDAC RAPIDS2

PI name(s): Scott Klasky (ORNL) and Robert Ross (ANL)

Name of the program manager: Kalyan Perumalla

CITATIONS:

(Accepted for publication) Kshitij Mehta et al., “Scaling Ensembles of Data-Intensive Quantum Chemical Calculations for Millions of Molecules,” IPDPSW 2024
Publicly available datasets: [1] 10.13139/OLCF/1890227, [2] 10.13139/OLCF/1907919, [3] 10.13139/OLCF/2318314, [4] 10.13139/OLCF/2318313