
Adaptive Optimization of Model Training for Multiple GPUs

in collaboration with ECP ExaLearn

Ma, Rusu, Wu, and Sim. IPDPSW, 2022, pp. 1088–1097.

DOI: 10.1109/IPDPSW55747.2022.00177

Scientific Achievements

  • Developed an adaptive stochastic gradient descent (SGD) algorithm that dynamically schedules work among multiple GPUs to minimize load imbalance and reduce model-replica staleness
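To give a concrete sense of the dynamic scheduling idea, the sketch below allocates a global batch across GPUs in proportion to their recently observed throughput, so faster devices receive more work. The function name, the throughput numbers, and the proportional rule are illustrative assumptions, not the paper's exact algorithm.

```python
def allocate_batches(global_batch, throughputs):
    """Split `global_batch` samples among workers in proportion to
    their recently measured throughput (samples/sec), so faster GPUs
    get larger local batches and all GPUs finish at about the same time."""
    total = sum(throughputs)
    sizes = [int(global_batch * t / total) for t in throughputs]
    # Hand out any remainder from rounding to the fastest GPUs first.
    remainder = global_batch - sum(sizes)
    fastest_first = sorted(range(len(sizes)),
                           key=lambda i: throughputs[i], reverse=True)
    for i in fastest_first[:remainder]:
        sizes[i] += 1
    return sizes

# Example: 4 GPUs, one about half as fast as the others.
print(allocate_batches(1024, [900.0, 880.0, 910.0, 450.0]))
# -> [294, 287, 297, 146]  (sums to 1024; the slow GPU gets a smaller batch)
```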

Significance and Impacts

  • By carefully handling heterogeneity among GPUs, the adaptive SGD outperforms state-of-the-art methods on extreme-scale multi-label classification problems in extensive tests

Research Details

  • Dynamically allocate batch sizes based on recently observed per-GPU performance
  • Adaptively schedule batches to keep all GPUs utilized while minimizing the lag among different local model replicas
  • Normalize the local models during model merging to best utilize the contributions from all GPUs (see the sketch after this list)
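A minimal sketch of the normalized-merging idea follows: each local replica's contribution is weighted by the number of samples it processed since the last merge, so GPUs that did more work contribute proportionally. The function name, the weighting scheme, and the toy model are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def merge_local_models(local_params, sample_counts):
    """Average local model replicas, weighting each replica by how many
    samples it processed since the last merge, so no GPU's contribution
    is over- or under-counted when local batch sizes differ."""
    weights = np.asarray(sample_counts, dtype=float)
    weights /= weights.sum()          # normalize contributions to sum to 1
    merged = {}
    for name in local_params[0]:
        stacked = np.stack([p[name] for p in local_params])
        merged[name] = np.tensordot(weights, stacked, axes=1)
    return merged

# Example: three local replicas of a tiny one-tensor model, unequal work done.
replicas = [{"w": np.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
print(merge_local_models(replicas, sample_counts=[100, 200, 100])["w"])
# Weighted average: 0.25*1 + 0.5*2 + 0.25*3 = 2.0 in every entry.
```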

(Right) In several different tests, adaptive SGD reached the same accuracy in less time than other state-of-the-art methods. The figure shows two examples on a large Amazon test dataset.