
Background and Motivation

Common practice is to randomly reshuffle the training examples at each epoch (random reshuffling)

The CD-GraB Algorithm

(Figure: CD-GraB running with one order server and two workers; workers do not share data examples.)

Parallel Herding and Balancing

Yes! By distributing GraB across parallel workers, with an order server that coordinates worker gradients to determine the next epoch’s permutation

Coordinated and Distributed GraB (CD-GraB):

  1. formulates a parallel herding objective

  2. leverages insights from kernel thinning [5] to solve the objective with an online pair-balancing algorithm [6] (doesn’t use a stale mean)

  3. lifts GraB [1] to distributed settings
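The online pair balancing of step 2 can be sketched as follows. This is a minimal single-node NumPy simulation, not the authors' implementation; the function name and the constant `c` are hypothetical. Following the self-balancing walk of [6], a random sign is assigned to the difference of each adjacent gradient pair, so the mean gradient cancels and no stale mean is needed; as in GraB, the +-signed member of each pair goes to the front of the next epoch's order and the other to the back.

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_balance_order(grads, c=1.0):
    """Sketch of online pair balancing (hypothetical name, not the paper's code).

    grads: (n, d) array of per-example gradients, n even.
    Returns a permutation of example indices for the next epoch.
    """
    n, d = grads.shape
    w = np.zeros(d)                          # running signed sum to keep small
    front, back = [], []
    for i in range(0, n, 2):
        v = grads[i] - grads[i + 1]          # pairing cancels the (stale) mean
        # self-balancing walk [6]: pick +1 with probability 1/2 - <w, v>/(2c)
        p = np.clip(0.5 - (w @ v) / (2 * c), 0.0, 1.0)
        s = 1 if rng.random() < p else -1
        w += s * v
        # the +-signed example of the pair is visited early, the other late
        if s == 1:
            front.append(i); back.append(i + 1)
        else:
            front.append(i + 1); back.append(i)
    return front + back[::-1]
```

In CD-GraB, each worker runs this pairing over its own examples while the order server coordinates the sign choices across workers; the sketch above collapses that into one process.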

Experiments

CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training

A. Feder Cooper* Wentao Guo* Khiem Pham* Tiancheng Yuan Charlie Ruan Yucheng Lu Chris De Sa

Cornell University

Theoretical Guarantees

Two results. Both assume (1) bounded gradient variance, (2) bounded gradient heterogeneity, and (3) smoothness; the second additionally assumes (4) the Polyak-Łojasiewicz (P.L.) condition

CD-GraB achieves a linear speedup in convergence rate in the number of workers m

Simulated ablations with LeNet on CIFAR-10

(Table: convergence-rate comparison. Rows: smooth, non-convex (assumptions 1, 2, & 3); + Polyak-Łojasiewicz (P.L.) (assumptions 1, 2, 3, & 4). Columns: our results (distributed setting) vs. prior results (centralized setting).)

The Gradient Balancing algorithm (GraB) [1] shows that there are provably better [1] (in fact, optimal [2]!) permutation orders than random reshuffling, which lead to faster training

GraB

  1. leverages stale stochastic gradients from previous epochs to guide ordering in the next epoch

  2. connects permutation-based example ordering to vector herding [3] and balancing [4] to prove its improved convergence rate

  3. only applies to centralized settings
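The stale-gradient reordering in steps 1 and 2 can be sketched as follows; a minimal NumPy sketch of greedy balancing in the spirit of [4], not the authors' implementation. Centered stale gradients are assigned signs that keep a running sum small; +1-signed examples are visited first next epoch, and -1-signed examples last, in reverse.

```python
import numpy as np

def grab_reorder(stale_grads):
    """Sketch of one epoch of GraB-style reordering (hypothetical name).

    stale_grads: (n, d) array of per-example gradients saved last epoch.
    Returns a permutation of example indices for the next epoch.
    """
    n, d = stale_grads.shape
    centered = stale_grads - stale_grads.mean(axis=0)  # herding needs zero-mean vectors
    h = np.zeros(d)               # running signed prefix sum to keep small
    front, back = [], []
    for i in range(n):
        v = centered[i]
        # greedy balancing: pick the sign s that makes ||h + s*v|| smaller
        if np.linalg.norm(h + v) <= np.linalg.norm(h - v):
            front.append(i)       # +1-signed examples go to the front
            h += v
        else:
            back.append(i)        # -1-signed examples go to the back, reversed
            h -= v
    return front + back[::-1]
```

Because it orders one global list of examples with one running sum, this procedure is inherently centralized (step 3), which is what CD-GraB's parallel herding objective addresses.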

Can we get both the scalability of distributed training and provably faster permutation-based example orders?

References

[1] Lu et al. (2022). GraB: Finding Provably Better Data Permutations than Random Reshuffling.

[2] Cha et al. (2023). Tighter Lower Bounds for Shuffling SGD: Random Permutations and Beyond.

[3] Welling (2009). Herding Dynamical Weights to Learn.

[4] Harvey and Samadi (2014). Near-Optimal Herding.

[5] Dwivedi and Mackey (2021). Kernel Thinning.

[6] Alweiss et al. (2021). Discrepancy Minimization via a Self-Balancing Walk.

Notation: N = # total examples; m = # workers; n = # examples per worker; T = # epochs.
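With this notation, N = m · n: each worker holds a disjoint shard of n examples. A toy partition (values illustrative only):

```python
# Illustrative values only: N total examples split across m workers.
N, m, T = 12, 3, 5
n = N // m  # examples per worker; assumes N divisible by m
# Worker j holds example indices [j*n, (j+1)*n); workers do not share examples.
workers = [list(range(j * n, (j + 1) * n)) for j in range(m)]
```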

* In the centralized setting, m = 1.

Convergence rate experiments

Logistic regression on mortgage data (HMDA 2017 NY subset)

LSTM on Wikitext-2

Autoregressive MLP on M4 Weekly