MDLoader: A Hybrid Model-driven Data Loader for Distributed Deep Neural Networks Training
Scientific Achievement
MDLoader is a hybrid in-memory data loader for distributed deep neural network training that improves performance at scale. We developed a framework that orchestrates multiple communication methods through an accurate performance estimator.

Technical Approach
Characterization of data loading performance under various communication patterns, across mini-batch sizes, dataset characteristics, and run scales, for GNN-based models.
Development of multi-backend shuffle methods based on one-sided and collective communication mechanisms (see the first sketch below).
Development of a performance estimator that automatically selects the best communication backend (see the second sketch below).
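To make the two shuffle backends concrete, the minimal sketch below contrasts a one-sided (MPI RMA) shuffle with a collective (allgather-based) shuffle using mpi4py and NumPy. This is an illustration under assumptions, not MDLoader's actual code: the block-distributed shard layout, the shared-seed permutation, and the names shuffle_one_sided and shuffle_collective are invented for the example.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

PER_RANK, DIM = 4, 8                           # samples per rank, feature width
local = np.full((PER_RANK, DIM), float(rank))  # this rank's block of the dataset

# Every rank derives the same epoch permutation from a shared seed, then keeps
# the slice of global sample ids whose rows it must assemble for its mini-batches.
perm = np.random.default_rng(seed=0).permutation(nprocs * PER_RANK)
mine = perm[rank * PER_RANK:(rank + 1) * PER_RANK]

def shuffle_one_sided():
    """One-sided backend: expose the local shard in an RMA window and
    Get the needed rows directly from their owners."""
    out = np.empty_like(local)
    win = MPI.Win.Create(local, disp_unit=local.itemsize, comm=comm)
    win.Fence()
    for i, gid in enumerate(mine):
        owner, row = divmod(int(gid), PER_RANK)
        # Read DIM doubles starting at element offset row*DIM on rank `owner`.
        win.Get(out[i], owner, target=(row * DIM, DIM, MPI.DOUBLE))
    win.Fence()
    win.Free()
    return out

def shuffle_collective():
    """Collective backend: allgather every shard, then slice rows locally."""
    gathered = np.empty((nprocs * PER_RANK, DIM))
    comm.Allgather(local, gathered)
    return gathered[mine]
```

The backend choice itself can be sketched as follows. MDLoader's estimator is model-driven; this stand-in instead times one trial shuffle per backend and reduces with MAX so all ranks agree on the same winner. It conveys only the selection step, not the paper's performance model.

```python
import time

def pick_backend():
    """Time one warm-up shuffle per backend; every rank must call this."""
    timings = {}
    for name, fn in (("one-sided", shuffle_one_sided),
                     ("collective", shuffle_collective)):
        comm.Barrier()                       # start all ranks together
        start = time.perf_counter()
        fn()
        # Training advances at the pace of the slowest rank, so reduce with MAX.
        timings[name] = comm.allreduce(time.perf_counter() - start, op=MPI.MAX)
    return min(timings, key=timings.get)

backend = pick_backend()                     # collective call on all ranks
if rank == 0:
    print("selected backend:", backend)
```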
Significance and Impact
The best data loading method depends on numerous factors, including the mini-batch size and the number of processes. MDLoader provides a performance-portable, multi-backend implementation that adapts to the training requirements of different datasets. We demonstrated its effectiveness in accelerating Graph Neural Network (GNN)-based materials science AI workloads such as HydraGNN and Open Catalyst.
PI(s)/Facility Lead(s): Khaled Ibrahim (LBNL), Lenny Oliker (LBNL)
Collaborating Institutions: ORNL
ASCR Program: SciDAC
ASCR PM: Hal Finkel
Publication(s) for this work: J. Bae et al., “MDLoader: A Hybrid Model-driven Data Loader for Distributed Deep Neural Networks Training,” IPDPSW, May 2024.
These graphics show the advantage of various communication backends for improving the training performance of the HydraGNN and Open Catalyst GNN-based models on NERSC’s Perlmutter supercomputer. End-to-end training performance is improved by up to 2.8x compared to single-backend loaders.