Adaptive Optimization of Model Training for Multiple GPUs
in collaboration with ECP ExaLearn
Scientific Achievements
Developed an adaptive stochastic gradient descent (SGD) algorithm that dynamically schedules work among multiple GPUs to minimize imbalance and reduce model replica staleness.
Significance and Impacts
Through careful handling of heterogeneity, our adaptive SGD outperforms state-of-the-art methods for extreme-scale multi-label classification problems in extensive tests.
Research details
- Dynamically allocate batch sizes based on recently observed per-GPU performance (see the sketch after this list).
- Adaptively schedule batches to guarantee GPU utilization while minimizing the lag among the different local models.
- Normalize the local models in the model merging process to best utilize the contributions from all GPUs.
(Right) In several different tests, Adaptive SGD took less time to reach the same accuracy than other state-of-the-art methods. The figure below shows two examples with a large set of Amazon test data.
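As an illustration of the batch allocation step, the minimal Python sketch below splits a fixed per-round sample budget across GPUs in proportion to their recently observed throughput. This is a hedged sketch, not the framework's actual API; the names allocate_batch_sizes, gpu_throughput, and samples_per_round are illustrative assumptions.

# Minimal sketch (not the authors' code): split a fixed per-round sample budget
# across GPUs in proportion to recently observed throughput, so faster GPUs
# receive proportionally more work. All names are illustrative.

def allocate_batch_sizes(gpu_throughput, samples_per_round, min_batch=32):
    """Allocate the next round's batch sizes by relative GPU speed."""
    total_speed = sum(gpu_throughput)
    return [max(min_batch, round(samples_per_round * speed / total_speed))
            for speed in gpu_throughput]

# Example: GPU 0 is observed to be twice as fast as GPU 1.
print(allocate_batch_sizes([2000.0, 1000.0], samples_per_round=3000))
# -> [2000, 1000]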
TALKING POINTS:
We introduce Adaptive SGD - an adaptive elastic model averaging stochastic gradient descent algorithm for heterogeneous multi-GPUs - characterized by dynamic scheduling, adaptive batch size scaling, and normalized model merging:
Instead of statically assigning batches to GPUs, dynamic scheduling allocates batches based on the relative GPU processing speed. This process is controlled by fixing the number of training samples processed between two model merging stages. Since execution-driven allocation can lead to a different number of model updates across GPUs, which is problematic for accuracy, batch size scaling assigns larger batches to the faster GPUs and smaller batches to the slower ones, with the goal of arriving at a steady state in which all GPUs perform the same number of model updates. The batch sizes are continuously updated following a linear function that quantifies the deviation from the expected number of updates, while guaranteeing a minimum degree of GPU utilization and imposing strict bounds on model replica staleness.
Normalized model merging computes optimal weights for every GPU based on the assigned batches. The underlying principle is to prioritize the replicas that are updated more frequently and, secondarily, those whose gradients are derived from larger batch sizes. To increase the importance of the relevant model replicas, perturbation is added to the normalized weights when the replicas are well-regularized.
We implement Adaptive SGD in a new framework for sparse deep learning on multiple GPUs and compare its performance against four alternative methods. Due to its careful handling of heterogeneity, which allows a more thorough exploration of the optimization space, Adaptive SGD outperforms all competitors in time-to-accuracy. In fact, Adaptive SGD always achieves the highest accuracy among all the algorithms. Moreover, as we increase the number of GPUs, Adaptive SGD exhibits both faster time-to-accuracy and fewer epochs to reach an accuracy target, confirming its superior scalability.
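The Python sketch below illustrates, under simplifying assumptions, the two adaptive steps described above: a linear batch size correction that pushes every GPU toward the same number of updates per merging stage, and a normalized merge that weights each replica primarily by its update count and secondarily by its batch size. The update rule and weighting scheme are simplified stand-ins for the paper's formulas, and the function names scale_batch_size and merge_models are hypothetical.

import numpy as np

def scale_batch_size(batch, updates_done, updates_expected,
                     min_batch=32, max_batch=4096, alpha=0.1):
    """Linear correction: a GPU that updates more often than expected gets a
    larger batch (more work per update); a lagging GPU gets a smaller one.
    The clip enforces minimum utilization and bounds on staleness."""
    deviation = (updates_done - updates_expected) / max(updates_expected, 1)
    new_batch = batch * (1.0 + alpha * deviation)
    return int(np.clip(new_batch, min_batch, max_batch))

def merge_models(replicas, update_counts, batch_sizes):
    """Normalized merging: weight each replica by update count first and
    batch size second (sub-linearly), then average."""
    scores = np.asarray(update_counts, dtype=float) * np.sqrt(batch_sizes)
    weights = scores / scores.sum()
    merged = sum(w * r for w, r in zip(weights, replicas))
    return merged, weights

# Example usage with two toy replicas of a 3-parameter model.
print(scale_batch_size(512, updates_done=12, updates_expected=10))  # -> 522
replicas = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 2.0])]
merged, weights = merge_models(replicas, update_counts=[10, 6], batch_sizes=[512, 256])
print(weights.round(3), merged.round(3))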
METADATA:
PI name(s): R. Ross (ANL), L. Oliker (LBNL)
Project: SciDAC RAPIDS2
Name of the program manager: Lali Chatterjee
CITATIONS:
Y. Ma, F. Rusu, K. Wu, and A. Sim, "Adaptive Optimization for Sparse Data on Heterogeneous GPUs," 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lyon, France, 2022, pp. 1088-1097, doi: 10.1109/IPDPSW55747.2022.00177.
AWARDS:
None
REPRODUCIBILITY:
No
BACKGROUND:
Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data in multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU heterogeneity combine to limit accuracy and increase the time to convergence. We address these challenges with Adaptive SGD, an adaptive elastic model averaging stochastic gradient descent algorithm for heterogeneous multi-GPUs that is characterized by dynamic scheduling, adaptive batch size scaling, and normalized model merging. Instead of statically assigning batches to GPUs, batches are routed based on the relative processing speed. Batch size scaling assigns larger batches to the faster GPUs and smaller batches to the slower ones, with the goal of arriving at a steady state in which all GPUs perform the same number of model updates. Normalized model merging computes optimal weights for every GPU based on the assigned batches such that the combined model achieves better accuracy. We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy and is scalable with the number of GPUs.
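To show how the three components fit together, here is a hedged, self-contained Python skeleton of the per-round training cycle: allocate batch sizes from assumed relative speeds, train local replicas, and merge them with normalized weights. The function train_local is a toy stand-in (it merely perturbs a parameter vector and reports an update count); real code would run SGD over sparse batches on each GPU, and all names here are assumptions rather than the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)

def train_local(model, batch_size, speed, time_slice=1000):
    """Toy stand-in for per-GPU SGD: perturb the model and report how many
    updates a GPU of the given relative speed fits into a fixed time slice."""
    updates = max(1, int(speed * time_slice) // batch_size)
    return model + 0.01 * rng.standard_normal(model.shape), updates

model = np.zeros(4)
speeds = [2.0, 1.0]              # assumed relative GPU throughputs
batch_sizes = [256, 128]         # allocated in proportion to speed

for _ in range(3):               # three merging rounds
    results = [train_local(model, b, s) for b, s in zip(batch_sizes, speeds)]
    replicas, counts = zip(*results)
    # Normalized merging: weight by update count first, batch size second.
    scores = np.asarray(counts, dtype=float) * np.sqrt(batch_sizes)
    weights = scores / scores.sum()
    model = sum(w * r for w, r in zip(weights, replicas))

print(model.round(4))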