EF21 with Bells & Whistles:
Practical Algorithmic Extensions of Modern Error Feedback
Federated Learning One World Seminar (FLOW)
October 13, 2021
Ilyas Fatkhullin
Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, Peter Richtárik
EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback
arXiv:2110.03294, 2021
Co-authors

Igor Sokolov (KAUST)
Eduard Gorbunov (MIPT & Yandex)
Zhize Li (KAUST)
Peter Richtárik (KAUST)
Training a Federated Learning Model

\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x)

n — # devices
d — # model parameters / features
f_i — loss on data stored on device i

Heterogeneous data regime: the datasets stored on the devices are allowed to be different.
DGD: Distributed Gradient Descent

x^{t+1} = x^t - \gamma \frac{1}{n} \sum_{i=1}^n \nabla f_i(x^t)

\gamma — stepsize
\nabla f_i(x^t) — d-dimensional gradient computed by device i

Each of the 3 devices (n = 3) sends its full gradient to the server; the server averages them and takes a step. These d-dimensional gradients are not sparse, so every round is communication-heavy! A toy sketch of this loop follows.
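To make the loop concrete, here is a minimal numpy sketch of DGD on toy per-device least-squares losses; the problem instance and all names are illustrative assumptions, not part of the talk.

```python
# A toy DGD loop: n devices each hold a least-squares loss
# f_i(x) = (1/2m)||A_i x - b_i||^2; every round each device sends its
# full (dense) d-dimensional gradient to the server.
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 3, 20, 10                   # devices, samples per device, dimension
A = [rng.standard_normal((m, d)) for _ in range(n)]
b = [rng.standard_normal(m) for _ in range(n)]

def grad(i, x):
    """d-dimensional gradient computed by device i."""
    return A[i].T @ (A[i] @ x - b[i]) / m

gamma = 0.1                           # stepsize
x = np.zeros(d)
for t in range(200):
    g = sum(grad(i, x) for i in range(n)) / n   # server averages
    x = x - gamma * g                           # server step; broadcast x
```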
DCGD: Distributed Compressed Gradient Descent

x^{t+1} = x^t - \gamma \frac{1}{n} \sum_{i=1}^n \mathcal{C}(\nabla f_i(x^t))

Contractive compression operator: a (possibly randomized) map \mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d such that, for some \alpha \in (0, 1],
\mathbb{E}\|\mathcal{C}(x) - x\|^2 \le (1 - \alpha) \|x\|^2 for all x \in \mathbb{R}^d.
Popular examples include Top-K sparsification and scaled sign-based compressors. Each of the 3 devices sends a compressed gradient \mathcal{C}(\nabla f_i(x^t)) to the server; a sketch of the Top-K example follows below.

DCGD Does Not Work in General!

Counterexamples:
Alexander Beznosikov, Samuel Horváth, Mher Safaryan, and Peter Richtárik
On biased compression for distributed learning
SpicyFL 2020: NeurIPS Workshop on Scalability, Privacy, and Security in Federated Learning, arXiv:2002.12410, 2020
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
Error Feedback Fixes SignSGD and other Gradient Compression Schemes, ICML 2019, arXiv:1901.09847, 2019

The key issue is with biased compressors \mathcal{C}!
However, we still want to use biased compressors \mathcal{C}.
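Below is a minimal sketch of Top-K, a standard example of a biased contractive compressor; for Top-K the contraction inequality holds deterministically with \alpha = k/d.

```python
# Top-K compressor: keep the k largest-magnitude coordinates, zero the
# rest. It satisfies ||C(x) - x||^2 <= (1 - k/d)||x||^2 for every x.
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest entries
    out[idx] = x[idx]
    return out

x = np.array([3.0, -1.0, 0.5, 2.0])
c = top_k(x, k=2)                      # -> [3., 0., 0., 2.]
d, k = x.size, 2
assert np.sum((c - x) ** 2) <= (1 - k / d) * np.sum(x ** 2)
```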
A Popular Fix for DCGD is Error Feedback

Original EF, first empirical study:
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu
1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech, 2014

Prior works in the non-convex regime — first analysis of EF:
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
Error Feedback Fixes SignSGD and other Gradient Compression Schemes, oral ICML 2019, arXiv:1901.09847, 2019
Sebastian U. Stich, Sai Praneeth Karimireddy
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication, JMLR, 21(237):1-36, 2020, arXiv:1909.05350, 2019
… more works in between …

Redesigned EF and improved analysis:
Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021
EF and EF21 in the Single Node Setting

EF (Seide et al., 2014):
g^t = \mathcal{C}(e^t + \gamma \nabla f(x^t)), \quad x^{t+1} = x^t - g^t, \quad e^{t+1} = e^t + \gamma \nabla f(x^t) - g^t

EF21 (Richtárik et al., 2021):
x^{t+1} = x^t - \gamma g^t, \quad g^{t+1} = g^t + \mathcal{C}(\nabla f(x^{t+1}) - g^t)

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

If \mathcal{C} is deterministic, positively homogeneous and additive, then EF = EF21. The same holds for the distributed variants.
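The two recursions can be compared directly in code. Here is a minimal single-node sketch, assuming a Top-K compressor and a toy quadratic f(x) = ||x||^2/2; it is illustrative only.

```python
# Single-node EF vs EF21 with a Top-K compressor on f(x) = 0.5*||x||^2.
import numpy as np

def C(x, k=1):                        # Top-K compressor
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

grad = lambda x: x                    # gradient of 0.5*||x||^2
gamma, d, T = 0.5, 5, 50

# EF: compress the error-corrected step, memorize the residual e.
x, e = np.ones(d), np.zeros(d)
for t in range(T):
    p = e + gamma * grad(x)           # error-corrected proposal
    g = C(p)                          # send only the compressed part
    x = x - g
    e = p - g                         # keep what was not transmitted

# EF21: compress gradient *differences*, maintain an estimate g of grad f.
x = np.ones(d)
g = C(grad(x))                        # initialize the estimate
for t in range(T):
    x = x - gamma * g
    g = g + C(grad(x) - g)            # shift g toward the new gradient

print(np.linalg.norm(x))              # both drive x toward the minimizer 0
```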
Contributions: Extensions of EF21

(New) = was never analyzed before with error feedback/contractive compressors
(Better) = the analysis is superior to the best known analysis with original EF

EF21 + six extensions (three marked (New), three marked (Better)):
1. Stochastic approximation
2. Variance reduction
3. Partial participation
4. Bidirectional compression
5. Momentum
6. Proximal setting
Extension 1
Stochastic approximation
EF21-SGD: EF21 with Stochastic Gradients

Devices: each device i forms a stochastic gradient \hat{\nabla} f_i(x^{t+1}) from a mini-batch of independent samples on device i, then compresses the shift and updates its state:
g_i^{t+1} = g_i^t + \mathcal{C}(\hat{\nabla} f_i(x^{t+1}) - g_i^t)
Server: x^{t+1} = x^t - \gamma \frac{1}{n} \sum_{i=1}^n g_i^t.

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

The analysis works under the expected smoothness assumption of Khaled & Richtárik; the standard bounded-variance condition is a special case, and the assumption provably holds for popular samplings.

Ahmed Khaled and Peter Richtárik
Better theory for SGD in the nonconvex world, arXiv:2002.03329, 2020

Theorem (EF21-SGD): convergence is guaranteed even with a biased compressor, but large batches are needed to control the stochastic noise. A sketch of the loop follows.
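A minimal sketch of the EF21-SGD round described above, on toy per-device least-squares losses; the mini-batch size tau, the Top-K compressor, and the problem instance are illustrative assumptions.

```python
# EF21-SGD sketch: devices compress the shift between a fresh mini-batch
# gradient and their state g_i; the server steps along the average state.
import numpy as np

rng = np.random.default_rng(1)
n, m, d, tau = 4, 50, 10, 8           # devices, local data, dim, batch
A = [rng.standard_normal((m, d)) for _ in range(n)]
b = [rng.standard_normal(m) for _ in range(n)]

def C(x, k=2):                        # Top-K contractive compressor
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def stoch_grad(i, x):
    """Mini-batch of independent samples on device i."""
    S = rng.choice(m, size=tau, replace=False)
    return A[i][S].T @ (A[i][S] @ x - b[i][S]) / tau

gamma = 0.02
x = np.zeros(d)
g_i = [np.zeros(d) for _ in range(n)]
for t in range(300):
    g = sum(g_i) / n                  # server's average of device states
    x = x - gamma * g                 # server step; broadcast x
    for i in range(n):
        g_i[i] = g_i[i] + C(stoch_grad(i, x) - g_i[i])  # compressed update
```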
Extension 2
Variance reduction
EF21-PAGE: EF21 with Variance Reduction

Finite-sum structure on each device: f_i(x) = \frac{1}{m} \sum_{j=1}^m f_{ij}(x)
n — # devices, m — # local datapoints
Variance-reduced methods: SAG, SVRG, SAGA, PAGE.

PAGE method — optimal variance reduction for smooth, non-convex problems:
Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richtárik
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, ICML, 2021, arXiv:2008.10898, 2020

On each device, the PAGE estimator v_i^{t+1} either recomputes the full local gradient (with a small probability p) or corrects the previous estimate with a cheap mini-batch of gradient differences. The EF21 estimator on device i then compresses the shift v_i^{t+1} - g_i^t, exactly as EF21 compresses gradient shifts. A sketch of the estimator follows after the theorem.

Theorem (EF21-PAGE): the rate depends on the number of local datapoints m; compare with MARINA:
Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, Peter Richtárik
MARINA: Faster Non-Convex Distributed Learning with Compression, ICML, 2021, arXiv:2102.07845, 2021
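A minimal single-device sketch of the PAGE estimator, assuming least-squares component losses; the probability p, batch size tau, and the instance are illustrative. In EF21-PAGE, each device would feed v into the EF21 compression step in place of the true gradient.

```python
# PAGE estimator sketch: with probability p recompute the full local
# gradient; otherwise correct the previous estimate with a mini-batch
# of gradient differences (cheap, variance-reduced).
import numpy as np

rng = np.random.default_rng(2)
m, d, tau, p = 100, 10, 10, 0.1       # datapoints, dim, batch, probability
A, b = rng.standard_normal((m, d)), rng.standard_normal(m)

def full_grad(x):
    return A.T @ (A @ x - b) / m

def page_step(v, x_new, x_old):
    if rng.random() < p:              # occasionally restart from scratch
        return full_grad(x_new)
    S = rng.choice(m, size=tau, replace=False)
    # For least squares: grad f_j(x_new) - grad f_j(x_old) = a_j a_j^T (x_new - x_old).
    return v + A[S].T @ (A[S] @ (x_new - x_old)) / tau

gamma = 0.05
x_old = np.zeros(d)
v = full_grad(x_old)                  # initialize with a full gradient
for t in range(200):
    x_new = x_old - gamma * v
    v = page_step(v, x_new, x_old)
    x_old = x_new
```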
Extension 3
Partial participation
EF21-PP: EF21 with Partial Participation of Devices

In each round, only a random subset S^t of devices participates. S^t is drawn from an arbitrary proper sampling (every device has a positive probability of being selected); participating devices send their compressed updates, and the server keeps g_i^{t+1} = g_i^t for the rest.

More about samplings:
Zheng Qu and Peter Richtárik
Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014

Theorem (EF21-PP): the rate depends on p_i, the probability of sampling device i in EF21-PP. A sketch of one round follows.
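A minimal sketch of one EF21-PP round, assuming the simplest proper sampling (each device participates independently with probability p_i); the toy local gradients are illustrative.

```python
# EF21-PP sketch: only a sampled subset of devices updates its state;
# the server keeps g_i unchanged for everyone else.
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 5
p = np.full(n, 0.3)                   # participation probabilities p_i
targets = [rng.standard_normal(d) for _ in range(n)]
local_grad = lambda i, x: x - targets[i]   # toy quadratic losses

def C(x, k=1):                        # Top-K compressor
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

gamma = 0.1
x = np.zeros(d)
g_i = [np.zeros(d) for _ in range(n)]
for t in range(100):
    x = x - gamma * (sum(g_i) / n)    # server step; broadcast x
    active = np.nonzero(rng.random(n) < p)[0]   # the sampled subset S^t
    for i in active:                  # only participants communicate
        g_i[i] = g_i[i] + C(local_grad(i, x) - g_i[i])
```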
Extension 4
Bidirectional compression
EF21-BC: EF21 with Bidirectional Compression

Motivation: with 3 devices, uplink compression alone is not enough — in every round the server broadcasts the d-dimensional parameter vector to each device, and this vector is not sparse!

Prior work — the DoubleSqueeze method:
Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, and Ji Liu
DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression, ICML 2019, arXiv:1905.05957, 2019
Its analysis assumes a bounded magnitude of error, and the resulting rate is suboptimal.

EF21-BC applies error feedback on both links: \mathcal{C}^M — compressor on the server/master, \mathcal{C}^W — compressor on each device/worker. The devices send compressed gradient updates to the server, and the server sends compressed model updates back to the devices.

Theorem (EF21-BC): the guarantees improve on DoubleSqueeze. A structural sketch follows.
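Here is a structural sketch of the bidirectional idea under simplifying assumptions (Top-K on both links, toy quadratic losses); it illustrates the two compressed channels rather than reproducing the paper's exact EF21-BC recursion.

```python
# Bidirectional compression in the spirit of EF21-BC: workers compress
# gradient differences (uplink, C_W) and the server compresses model
# differences (downlink, C_M), so neither direction ever transmits a
# dense d-vector. Structural illustration only.
import numpy as np

rng = np.random.default_rng(4)
n, d, gamma = 3, 8, 0.1

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

C_W = lambda x: top_k(x, 2)           # compressor on each device/worker
C_M = lambda x: top_k(x, 2)           # compressor on the server/master

targets = [rng.standard_normal(d) for _ in range(n)]
local_grad = lambda i, w: w - targets[i]   # toy quadratic losses

x = np.zeros(d)                       # server's model
w = np.zeros(d)                       # devices' synchronized model copy
g_i = [np.zeros(d) for _ in range(n)]
for t in range(100):
    x = x - gamma * (sum(g_i) / n)    # server step on the average state
    w = w + C_M(x - w)                # compressed downlink broadcast
    for i in range(n):                # compressed uplink messages
        g_i[i] = g_i[i] + C_W(local_grad(i, w) - g_i[i])
```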
Extension 5
Momentum
EF21-HB: EF21 with Heavy Ball Momentum

Heavy ball momentum:
Boris T. Polyak
Some methods of speeding up the convergence of iteration methods,
USSR Computational Mathematics and Mathematical Physics 4(5):1-17, 1964
Momentum is also a key ingredient of Adam:
Diederik P Kingma and Jimmy Ba
Adam: A method for stochastic optimization, arXiv:1412.6980, 2014

In EF21-HB, the devices run the usual EF21 updates, while the server steps along a momentum average of the EF21 gradient estimate, with momentum parameter \eta \in [0, 1).

Theorem (EF21-HB): convergence is established for the momentum variant; the rate depends on the momentum parameter \eta. A sketch follows.
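A minimal sketch of heavy ball momentum layered on the EF21 estimate, assuming the classical Polyak averaging convention; the paper's exact parametrization may differ.

```python
# Heavy ball on top of EF21: the server steps along a momentum-averaged
# direction v instead of the raw estimate g. eta is the momentum parameter.
import numpy as np

rng = np.random.default_rng(5)
n, d, gamma, eta = 3, 8, 0.05, 0.9

def C(x, k=2):                        # Top-K compressor
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

targets = [rng.standard_normal(d) for _ in range(n)]
local_grad = lambda i, x: x - targets[i]   # toy quadratic losses

x = np.zeros(d)
v = np.zeros(d)                       # momentum buffer
g_i = [np.zeros(d) for _ in range(n)]
for t in range(200):
    g = sum(g_i) / n                  # EF21 gradient estimate
    v = eta * v + g                   # heavy ball momentum
    x = x - gamma * v
    for i in range(n):
        g_i[i] = g_i[i] + C(local_grad(i, x) - g_i[i])
```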
Extension 6
Proximal setting
EF21-Prox: EF21 with Proximal Step

Composite problem: \min_x f(x) + R(x), where R is a proper, closed, convex regularizer.
Proximal update/subproblem: x^{t+1} = \mathrm{prox}_{\gamma R}(x^t - \gamma g^t), where \mathrm{prox}_{\gamma R}(y) := \arg\min_u \{ R(u) + \frac{1}{2\gamma}\|u - y\|^2 \}.
Evaluation metric — gradient mapping: G_\gamma(x) := \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma R}(x - \gamma \nabla f(x))\right).

The devices run the usual EF21 updates; the server replaces the plain gradient step by the proximal update above. A sketch of the proximal machinery follows.
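A minimal sketch of the proximal machinery, assuming R(x) = lam*||x||_1 (whose prox is soft-thresholding) and a toy smooth part; it shows the proximal update and the gradient-mapping metric. The regularizer choice is an illustrative assumption.

```python
# Proximal step and gradient mapping for R = lam*||.||_1.
import numpy as np

lam, gamma = 0.1, 0.5

def prox(y, gamma):
    """prox_{gamma*R}(y) for R = lam*||.||_1: soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)

def grad_f(x):                        # toy smooth part f(x) = 0.5*||x - 1||^2
    return x - 1.0

def gradient_mapping(x, gamma):
    """G_gamma(x) = (x - prox_{gamma*R}(x - gamma*grad f(x))) / gamma."""
    return (x - prox(x - gamma * grad_f(x), gamma)) / gamma

x = np.zeros(5)
for t in range(50):                   # proximal gradient iterations
    x = prox(x - gamma * grad_f(x), gamma)
print(np.linalg.norm(gradient_mapping(x, gamma)))  # ~0 at a stationary point
```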
Complexity: General Nonconvex Case

Notation in the rates: \alpha — compression parameter on devices, \alpha_M — compression parameter on the server, \sigma^2 — variance of the stochastic gradient. (See arXiv:2110.03294 for the full comparison table.)
Experiments
Problem and Data
Problem: logistic regression with a non-convex regularization term, making the overall objective non-convex.
Experiment 1: Fast Convergence with Variance Reduction
(a) Convergence in epochs.
(b) Convergence in terms of the total number of bits sent from clients to the server, divided by the number of devices n.
Experiment 2: Effect of Partial Participation of Devices
(a) Convergence in communication rounds.
(b) Convergence in terms of the total number of bits sent from clients to the server, divided by the number of devices n.
Deep Learning Experiments
Figure 3: Comparison of EF-SGD and EF21-SGD with EF-SGD-HB, EF21-SGD-HB, and EF21+-SGD-HB, with tuned stepsizes, applied to train ResNet-18 on CIFAR-10.
The END