MDLoader: A Hybrid Model-driven Data Loader for Distributed Deep Neural Networks Training
Scientific Achievement
MDLoader is a hybrid in-memory data loader for distributed deep neural network training that improves performance at scale. We developed a framework that orchestrates multiple communication methods through an accurate performance estimator.

Technical Approach
Characterization of data loading performance under various communication patterns, across mini-batch sizes, dataset characteristics, and run scales, for GNN-based models.
Development of multi-backend shuffle methods based on one-sided and collective communication mechanisms (see the first sketch below).
Development of a performance estimator that automatically selects the best communication backend (see the second sketch below).
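To make the two shuffle backends concrete, the minimal sketch below contrasts a one-sided (MPI RMA) shuffle with a collective (allgather-based) shuffle using mpi4py and NumPy. This is an illustration under assumptions, not MDLoader's actual code: the block-distributed shard layout, the shared-seed permutation, and the names shuffle_one_sided and shuffle_collective are invented for the example.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

PER_RANK, DIM = 4, 8                           # samples per rank, feature width
local = np.full((PER_RANK, DIM), float(rank))  # this rank's block of the dataset

# Every rank derives the same epoch permutation from a shared seed, then keeps
# the slice of global sample ids whose rows it must assemble for its mini-batches.
perm = np.random.default_rng(seed=0).permutation(nprocs * PER_RANK)
mine = perm[rank * PER_RANK:(rank + 1) * PER_RANK]

def shuffle_one_sided():
    """One-sided backend: expose the local shard in an RMA window and
    Get the needed rows directly from their owners."""
    out = np.empty_like(local)
    win = MPI.Win.Create(local, disp_unit=local.itemsize, comm=comm)
    win.Fence()
    for i, gid in enumerate(mine):
        owner, row = divmod(int(gid), PER_RANK)
        # Read DIM doubles starting at element offset row*DIM on rank `owner`.
        win.Get(out[i], owner, target=(row * DIM, DIM, MPI.DOUBLE))
    win.Fence()
    win.Free()
    return out

def shuffle_collective():
    """Collective backend: allgather every shard, then slice rows locally."""
    gathered = np.empty((nprocs * PER_RANK, DIM))
    comm.Allgather(local, gathered)
    return gathered[mine]
```

The backend choice itself can be sketched as follows. MDLoader's estimator is model-driven; this stand-in instead times one trial shuffle per backend and reduces with MAX so all ranks agree on the same winner. It conveys only the selection step, not the paper's performance model.

```python
import time

def pick_backend():
    """Time one warm-up shuffle per backend; every rank must call this."""
    timings = {}
    for name, fn in (("one-sided", shuffle_one_sided),
                     ("collective", shuffle_collective)):
        comm.Barrier()                       # start all ranks together
        start = time.perf_counter()
        fn()
        # Training advances at the pace of the slowest rank, so reduce with MAX.
        timings[name] = comm.allreduce(time.perf_counter() - start, op=MPI.MAX)
    return min(timings, key=timings.get)

backend = pick_backend()                     # collective call on all ranks
if rank == 0:
    print("selected backend:", backend)
```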
Significance and Impact
The best data loading method depends on numerous factors, including the mini-batch size and the number of processes. MDLoader provides a performance-portable, multi-backend implementation that adapts to the training requirements of different datasets. We demonstrated its effectiveness in accelerating Graph Neural Network (GNN)-based materials science AI workloads such as HydraGNN and Open Catalyst.
PI(s)/Facility Lead(s): Khaled Ibrahim (LBNL), Lenny Oliker (LBNL)
Collaborating Institutions: ORNL
ASCR Program: SciDAC
ASCR PM: Hal Finkel
Publication(s) for this work: J. Bae et al., “MDLoader: A Hybrid Model-driven Data Loader for Distributed Deep Neural Networks Training,” IPDPSW, May 2024.
These graphics show the advantage of various communication backends for improving the training performance of the HydraGNN and Open Catalyst GNN-based models on NERSC’s Perlmutter supercomputer. End-to-end training performance is improved by up to 2.8x compared to single-backend loaders.