Background and Motivation
Common practice is to randomly reshuffle the training examples at each epoch of training (random reshuffling).
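This baseline can be sketched in a few lines; the function and argument names here are illustrative, not from the paper:

```python
import random

def train_with_reshuffling(examples, num_epochs, step):
    """Random reshuffling: visit every example exactly once per epoch,
    in a freshly drawn random order each time."""
    for _ in range(num_epochs):
        order = list(range(len(examples)))
        random.shuffle(order)          # new permutation every epoch
        for i in order:
            step(examples[i])          # one optimizer step per example
```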
The CD-GraB Algorithm
Running on one order server and two workers; the workers do not share data examples.
Parallel Herding and Balancing
Yes! By distributing GraB across parallel workers, with an order server that coordinates the workers' gradients to determine the next epoch's permutation.
Coordinated and Distributed GraB (CD-GraB)
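A minimal sketch of the server-side coordination, under simplifying assumptions: each worker reports one mean-centered gradient per local example, and the server runs a greedy signed-herding pass over all workers' gradients to build each worker's next-epoch order. CD-GraB itself balances gradient *pairs* using the self-balancing walk of [6]; all names below are illustrative, not the paper's implementation.

```python
import numpy as np

def order_server_epoch(per_worker_grads):
    """Sketch of the order server: given the mean-centered gradients each
    of m workers produced this epoch (m lists of n vectors), run one
    greedy balancing pass over all of them and return each worker's
    next-epoch permutation of its local example indices."""
    m = len(per_worker_grads)
    n = len(per_worker_grads[0])
    orders = [[None] * n for _ in range(m)]
    left = [0] * m                      # next free slot at the front
    right = [n - 1] * m                 # next free slot at the back
    running = np.zeros_like(per_worker_grads[0][0])
    for j in range(n):                  # one step per local example
        for w in range(m):              # gradients arrive from every worker
            c = per_worker_grads[w][j]
            # greedy sign choice: keep the running prefix sum small
            if np.linalg.norm(running + c) <= np.linalg.norm(running - c):
                running += c
                orders[w][left[w]] = j  # +1 examples go to the front
                left[w] += 1
            else:
                running -= c
                orders[w][right[w]] = j  # -1 examples go to the back
                right[w] -= 1
    return orders
```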
Experiments
CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training
A. Feder Cooper* Wentao Guo* Khiem Pham* Tiancheng Yuan Charlie Ruan Yucheng Lu Chris De Sa
Cornell University
Theoretical Guarantees
Two results: both assume (1) bounded gradient variance, (2) bounded gradient heterogeneity, and (3) smoothness; the second additionally assumes (4) the Polyak-Łojasiewicz (P.L.) condition.
We achieve a linear speedup in convergence rate in the number of workers m.
Simulated ablations on LeNet and CIFAR-10
We compare our results in the distributed setting against prior results in the centralized setting under two sets of assumptions: smooth, non-convex (1, 2, & 3); and additionally Polyak-Łojasiewicz (P.L.) (1, 2, 3, & 4).
The Gradient Balancing algorithm (GraB) [1] shows that there are provably better (indeed, optimal [2]!) permutation orders, which lead to greater scalability.
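GraB's core reordering step can be sketched as follows, under stated assumptions: mean-centered per-example gradients are greedily assigned signs to keep a running prefix sum small (a simplification of the self-balancing walk of [6]), and each sign decides whether the example moves to the front or the back of the next epoch's order. Function names are ours, not the paper's.

```python
import numpy as np

def grab_reorder(centered_grads):
    """One epoch of GraB-style balancing/reordering (sketch).

    centered_grads: per-example gradients with the epoch mean subtracted,
    in the order they were visited this epoch.
    Returns the permutation of example indices for the next epoch."""
    n = len(centered_grads)
    next_order = [None] * n
    left, right = 0, n - 1              # fill the next order from both ends
    running = np.zeros_like(centered_grads[0])
    for i, c in enumerate(centered_grads):
        # greedy sign choice: keep the running prefix sum small
        if np.linalg.norm(running + c) <= np.linalg.norm(running - c):
            running += c
            next_order[left] = i        # +1 examples go to the front
            left += 1
        else:
            running -= c
            next_order[right] = i       # -1 examples go to the back
            right -= 1
    return next_order
```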
GraB
Can we get both the scalability of distributed training and provably faster permutation-based example orders?
References
[1] Lu et al. (2022). GraB: Finding Provably Better Data Permutations than Random Reshuffling.
[2] Cha et al. (2023). Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond.
[3] Welling (2009). Herding Dynamical Weights to Learn.
[4] Harvey and Samadi (2014). Near-Optimal Herding.
[5] Dwivedi and Mackey (2021). Kernel Thinning.
[6] Alweiss et al. (2021). Discrepancy Minimization via a Self-Balancing Walk.
# total examples | # workers | # examples per worker | # epochs |
N | m | n | T |
* Prior (centralized) results correspond to m = 1.
Convergence rate experiments:
Logistic regression on mortgage data (HMDA 2017 NY subset)
LSTM on WikiText-2
Autoregressive MLP on M4 Weekly