
EF21 with Bells & Whistles:

Practical Algorithmic Extensions of Modern Error Feedback


Federated Learning One World Seminar (FLOW)

October 13, 2021

Ilyas Fatkhullin

Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, Peter Richtárik

EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

arXiv:2110.03294, 2021

Co-authors

  • Igor Sokolov (KAUST)
  • Eduard Gorbunov (MIPT & Yandex)
  • Zhize Li (KAUST)
  • Peter Richtárik (KAUST)

Training a Federated Learning Model

min_{x ∈ R^d} f(x),  where  f(x) = (1/n) Σ_{i=1}^n f_i(x)

  • n — # devices
  • d — # model parameters / features
  • f_i — loss on data stored on device i

Heterogeneous data regime: the datasets are allowed to be different across devices.

DGD: Distributed Gradient Descent

x^{k+1} = x^k − (γ/n) Σ_{i=1}^n ∇f_i(x^k)

  • γ — stepsize
  • ∇f_i(x^k) — d-dimensional gradient computed by device i
  • Each device (illustrated with 3 devices, n = 3) sends its gradient to the server, which averages them.
  • Problem: the gradients are not sparse! Every round communicates dense d-dimensional vectors.
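The DGD round above can be sketched in a few lines. This is a minimal single-process simulation, assuming n devices each hold a simple quadratic loss f_i(x) = ½‖x − b_i‖² as an illustrative stand-in for the per-device losses; all names are hypothetical.

```python
import numpy as np

def dgd(grads, x0, stepsize, num_steps):
    """Run DGD: each device computes its full gradient; the server averages."""
    x = x0.copy()
    for _ in range(num_steps):
        g = np.mean([grad(x) for grad in grads], axis=0)  # server averages device gradients
        x = x - stepsize * g                              # gradient step with stepsize γ
    return x

rng = np.random.default_rng(0)
d, n = 5, 3
targets = [rng.standard_normal(d) for _ in range(n)]      # heterogeneous data across devices
grads = [(lambda x, b=b: x - b) for b in targets]         # ∇f_i(x) = x - b_i

x_star = np.mean(targets, axis=0)                         # minimizer of the average loss
x_out = dgd(grads, np.zeros(d), stepsize=0.5, num_steps=100)
print(np.allclose(x_out, x_star, atol=1e-6))              # → True
```

Note that every round ships n dense d-dimensional gradients to the server, which is exactly the communication bottleneck compression targets.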

DCGD: Distributed Compressed Gradient Descent

x^{k+1} = x^k − (γ/n) Σ_{i=1}^n C(∇f_i(x^k))

Contractive compression operator C: E‖C(x) − x‖² ≤ (1 − α)‖x‖² for all x, with α ∈ (0, 1].

Popular examples:

  • Top-K
  • Rand-K
  • Sign

DCGD Does Not Work in General!

Counterexamples:

Alexander Beznosikov, Samuel Horváth, Mher Safaryan, and Peter Richtárik
On biased compression for distributed learning
SpicyFL 2020: NeurIPS Workshop on Scalability, Privacy, and Security in Federated Learning, arXiv:2002.12410, 2020

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
Error Feedback Fixes SignSGD and other Gradient Compression Schemes, ICML 2019, arXiv:1901.09847, 2019

The key issue is with biased compressors C! However, we still want to use biased C.
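The canonical contractive compressor is Top-K, sketched below: keep the K largest-magnitude coordinates and zero the rest. For Top-K on R^d it is known that ‖C(x) − x‖² ≤ (1 − K/d)‖x‖² holds deterministically, i.e. α = K/d; the code checks this on a random vector.

```python
import numpy as np

def top_k(x, k):
    """Return a copy of x with all but the k largest-magnitude entries zeroed."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest magnitudes
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(10)
cx = top_k(x, 3)
alpha = 3 / 10                          # contraction parameter of Top-3 on R^10

# The contraction inequality holds deterministically for Top-K:
print(np.sum((cx - x) ** 2) <= (1 - alpha) * np.sum(x ** 2))  # → True
```

Note that Top-K is biased: E[C(x)] ≠ x in general, which is precisely why plain DCGD can fail with it.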

A Popular Fix for DCGD is Error Feedback

Prior works in the non-convex regime:

  • Original EF, first empirical study:

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu
1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech, 2014

  • First analysis of EF (strong assumptions, weak rates):

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
Error Feedback Fixes SignSGD and other Gradient Compression Schemes, oral ICML 2019, arXiv:1901.09847, 2019

Sebastian U. Stich and Sai Praneeth Karimireddy
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication, JMLR, 21(237):1-36, 2020, arXiv:1909.05350, 2019

  • … more works in between …

  • Redesigned EF and improved analysis:

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

EF and EF21 in the Single Node Setting

EF:

g^k = C(e^k + γ ∇f(x^k))
x^{k+1} = x^k − g^k
e^{k+1} = e^k + γ ∇f(x^k) − g^k

EF21:

x^{k+1} = x^k − γ g^k
g^{k+1} = g^k + C(∇f(x^{k+1}) − g^k)

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

If C is deterministic, positively homogeneous, and additive, then EF = EF21. The same holds for the distributed variants.
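A minimal sketch of the single-node EF21 recursion with a Top-K compressor, run on a toy quadratic f(x) = ½‖x − b‖². The objective, the stepsize, and the initialization g⁰ = 0 are illustrative choices, not prescriptions from the paper; the state g^k tracks the gradient and only the compressed difference is "sent".

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef21(grad, x0, stepsize, k, num_steps):
    """Single-node EF21 with a Top-k compressor."""
    x, g = x0.copy(), np.zeros_like(x0)   # g^0 = 0 (a common initialization choice)
    for _ in range(num_steps):
        x = x - stepsize * g              # x^{k+1} = x^k - γ g^k
        g = g + top_k(grad(x) - g, k)     # g^{k+1} = g^k + C(∇f(x^{k+1}) - g^k)
    return x

b = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
x_out = ef21(lambda x: x - b, np.zeros(5), stepsize=0.1, k=2, num_steps=5000)
print(np.linalg.norm(x_out - b))   # approaches 0: the iterates reach the minimizer
```

Even though only 2 of the 5 coordinates of g are refreshed per step, the residual ∇f(x) − g is driven to zero, which is the mechanism behind EF21's analysis.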

Contributions: Extensions of EF21

(New) = was never analyzed before with error feedback/contractive compressors
(Better) = the analysis is superior to the best known analysis with original EF

EF21 +

  1. stochastic approximation (Better)
  2. variance reduction (New)
  3. partial participation (New)
  4. bidirectional compression (Better)
  5. momentum (Better)
  6. proximal setting (New)

Extension 1: Stochastic Approximation

EF21-SGD: EF21 with Stochastic Gradients

Server: x^{k+1} = x^k − (γ/n) Σ_{i=1}^n g_i^k

Devices: g_i^{k+1} = g_i^k + C(∇f_i(x^{k+1}; ξ_i^{k+1}) − g_i^k), where ∇f_i(x^{k+1}; ξ_i^{k+1}) is a stochastic gradient built from a mini-batch of independent samples on device i.

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

Assumption on the stochastic gradients:

  • contains bounded variance as a special case
  • provably holds for popular samplings

Ahmed Khaled and Peter Richtárik
Better theory for SGD in the nonconvex world, arXiv:2002.03329, 2020

Theorem (EF21-SGD): convergence is established with a biased compressor, at the price of large batches.
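The role of the mini-batch can be seen in a small numeric illustration: averaging per-sample gradients drawn independently on a device gives an unbiased estimator whose variance shrinks like 1/batch_size, which is why the analysis trades large batches for the bias of the compressor. The quadratic per-sample gradients below are an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 1000
A = rng.standard_normal((m, d))            # one row per local datapoint
x = rng.standard_normal(d)
per_sample_grads = A - x                   # pretend gradient of sample j is a_j - x (illustrative)
full_grad = per_sample_grads.mean(axis=0)  # the exact local gradient

def minibatch_grad(batch_size):
    """Unbiased mini-batch gradient: average over a batch sampled with replacement."""
    idx = rng.integers(0, m, size=batch_size)
    return per_sample_grads[idx].mean(axis=0)

# Larger batches concentrate around the full gradient (variance ~ 1/batch_size):
err_small = np.mean([np.sum((minibatch_grad(5) - full_grad) ** 2) for _ in range(500)])
err_large = np.mean([np.sum((minibatch_grad(100) - full_grad) ** 2) for _ in range(500)])
print(err_large < err_small)
```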

Extension 2: Variance Reduction

EF21-PAGE: EF21 with Variance Reduction

Each of the n devices holds a finite sum over its m local datapoints:

f_i(x) = (1/m) Σ_{j=1}^m f_{ij}(x)

Variance-reduced methods: SAG, SVRG, SAGA, PAGE.

The PAGE method — optimal variance reduction for smooth, non-convex problems:

Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richtárik
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, ICML, 2021, arXiv:2008.10898, 2020

EF21-PAGE combines two estimators on each device i:

  • PAGE estimator on device i: with a small probability, recompute the full local gradient; otherwise, update the previous estimate with a mini-batch of gradient differences.
  • EF21 estimator on device i: g_i^{k+1} = g_i^k + C(v_i^{k+1} − g_i^k), where v_i^{k+1} is the PAGE estimate.

Theorem (EF21-PAGE) — compare with the uncompressed-error-feedback baseline MARINA:

Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, Peter Richtárik
MARINA: Faster Non-Convex Distributed Learning with Compression,
ICML, 2021, arXiv:2102.07845, 2021
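One PAGE step can be sketched as follows: with probability p take a full pass over the local data, otherwise correct the previous estimate with a cheap mini-batch of gradient differences. The per-sample gradients grad_j(x) = x − a_j are illustrative; with such quadratics the gradient differences are sample-independent, so the updated estimate is exact, which makes the sketch easy to check.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 200
A = rng.standard_normal((m, d))                # local dataset: one row per datapoint

def grad_j(x, j):
    return x - A[j]                            # illustrative per-sample gradient

def full_grad(x):
    return x - A.mean(axis=0)

def page_update(v_prev, x_new, x_old, p=0.1, batch=8):
    """One PAGE step: v^{k+1} from v^k and the two latest iterates."""
    if rng.random() < p:
        return full_grad(x_new)                # occasional full gradient pass
    idx = rng.integers(0, m, size=batch)
    diff = np.mean([grad_j(x_new, j) - grad_j(x_old, j) for j in idx], axis=0)
    return v_prev + diff                       # cheap incremental update

x_old = np.zeros(d)
x_new = 0.01 * rng.standard_normal(d)
v = page_update(full_grad(x_old), x_new, x_old)
print(np.linalg.norm(v - full_grad(x_new)) < 1e-12)  # → True (exact for these quadratics)
```

In EF21-PAGE this estimate v would then be fed through the EF21 compression recursion before being sent to the server.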

Extension 3: Partial Participation

EF21-PP: EF21 with Partial Participation of Devices

In each round, only a subset of the n devices communicates with the server. The subset is drawn via an arbitrary proper sampling; p_i is the probability of sampling device i in EF21-PP.

More about samplings:

Zheng Qu and Peter Richtárik
Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014
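A concrete instance of an arbitrary proper sampling is independent sampling: device i joins a round with its own probability p_i > 0 ("proper" = every device has a nonzero chance of participating). With the usual 1/p_i importance weighting, round averages stay unbiased; this weighting is a generic illustration, not the exact EF21-PP aggregation rule.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
p = np.array([0.2, 0.5, 0.9, 0.4, 0.7])     # per-device participation probabilities (all > 0)
values = np.arange(1.0, n + 1)               # quantity held by each device

def weighted_round_mean():
    """Average over participating devices with 1/p_i importance correction."""
    mask = rng.random(n) < p                 # which devices participate this round
    return np.sum(mask * values / p) / n

# Across many rounds, the importance-weighted estimate matches the full mean:
est = np.mean([weighted_round_mean() for _ in range(200000)])
print(abs(est - values.mean()) < 0.05)
```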

Extension 4: Bidirectional Compression

EF21-BC: EF21 with Bidirectional Compression

Motivation: the d-dimensional parameter vector sent from the server to each device (illustrated with 3 devices) is not sparse! The downlink should be compressed too.

Prior approach — the DoubleSqueeze method:

Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, and Ji Liu
DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression, ICML 2019, arXiv:1905.05957, 2019

  • assumes a bounded magnitude of error (suboptimal)

EF21-BC uses two compressors:

  • C_M — compressor on the server/master
  • C_W — compressor on each device/worker

Theorem (EF21-BC): the resulting guarantees improve on DoubleSqueeze.
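A schematic round with bidirectional compression: workers compress their uplink messages with C_W (Top-K here) and the server compresses the broadcast model update with C_M, so no dense d-dimensional vector ever crosses the network. This is an illustration of the communication pattern, not the exact EF21-BC algorithm.

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(5)
d, n = 10, 3
x = rng.standard_normal(d)                             # current model on the server
worker_grads = [rng.standard_normal(d) for _ in range(n)]

uplink = [top_k(g, 3) for g in worker_grads]           # C_W on each worker: sparse uplink
avg = np.mean(uplink, axis=0)                          # server aggregates sparse messages
update = top_k(-0.1 * avg, 3)                          # C_M compresses the broadcast update
x_new = x + update                                     # every device applies the same sparse update

print(np.count_nonzero(update) <= 3)  # → True: the downlink message is sparse too
```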

Extension 5: Momentum

EF21-HB: EF21 with Heavy Ball Momentum

Heavy ball momentum (with a momentum parameter):

Boris T. Polyak
Some methods of speeding up the convergence of iteration methods,
USSR Computational Mathematics and Mathematical Physics 4(5):1-17, 1964

Diederik P. Kingma and Jimmy Ba
Adam: A method for stochastic optimization,
arXiv:1412.6980, 2014

In EF21-HB, the devices run the usual EF21 updates, and the server applies a heavy ball momentum step on top of the aggregated gradient estimate.

Theorem (EF21-HB): stated for a suitable choice of the momentum parameter.
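For reference, the classical heavy ball update (Polyak, 1964) is x^{k+1} = x^k − γ ∇f(x^k) + β (x^k − x^{k−1}), with momentum parameter β. The sketch below runs it on a 1-D quadratic f(x) = ½x²; the stepsize and momentum values are illustrative.

```python
def heavy_ball(grad, x0, stepsize, beta, num_steps):
    """Polyak's heavy ball: gradient step plus momentum from the previous iterate."""
    x_prev, x = x0, x0
    for _ in range(num_steps):
        # x^{k+1} = x^k - γ ∇f(x^k) + β (x^k - x^{k-1})
        x, x_prev = x - stepsize * grad(x) + beta * (x - x_prev), x
    return x

x_out = heavy_ball(lambda x: x, x0=5.0, stepsize=0.1, beta=0.5, num_steps=300)
print(abs(x_out) < 1e-6)  # → True: iterates spiral into the minimizer at 0
```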

Extension 6: Proximal Setting

EF21-Prox: EF21 with Proximal Step

Proximal update/subproblem:

x^{k+1} = prox_{γR}(x^k − γ g^k),  where  prox_{γR}(y) = argmin_u { R(u) + (1/(2γ)) ‖u − y‖² }

Evaluation metric — gradient mapping:

G_γ(x) = (1/γ) (x − prox_{γR}(x − γ ∇f(x)))

The devices run the usual EF21 compressed-gradient updates; the server applies the proximal step.
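The proximal machinery is easy to instantiate. For R(x) = λ‖x‖₁ the proximal operator has the closed form prox_{γR}(y) = sign(y)·max(|y| − γλ, 0) (soft-thresholding), and the gradient mapping G_γ replaces ‖∇f(x)‖ as the stationarity measure: it vanishes exactly at a composite minimizer. The quadratic f and all parameter values below are illustrative.

```python
import numpy as np

lam, gamma = 0.5, 0.1

def prox_l1(y, t):
    """Soft-thresholding: the prox of t * lam * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - t * lam, 0.0)

def grad_map(x, grad_f):
    """Gradient mapping G_gamma(x) = (x - prox(x - gamma * grad f(x))) / gamma."""
    return (x - prox_l1(x - gamma * grad_f(x), gamma)) / gamma

b = np.array([2.0, 0.1, -1.0])
grad_f = lambda x: x - b                  # f(x) = 0.5 ||x - b||^2

# The minimizer of f + lam*||.||_1 is the soft-thresholded b:
x_star = prox_l1(b, 1.0)                  # threshold 1.0 * lam = 0.5 → [1.5, 0.0, -0.5]
print(np.allclose(grad_map(x_star, grad_f), 0.0))  # → True: G vanishes at the minimizer
```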

Complexity: General Nonconvex Case

Notation in the complexity results:

  • α — compression parameter on the devices; α_M — compression parameter on the server
  • σ² — variance of the stochastic gradient


Experiments

Problem and Data

The training objective uses non-convex regularization.

Experiment 1: Fast Convergence with Variance Reduction

(a) Convergence in epochs.
(b) Convergence in terms of the total number of bits sent from Clients to the Server divided by …

Experiment 2: Effect of Partial Participation of Devices

(a) Convergence in communication rounds.
(b) Convergence in terms of the total number of bits sent from Clients to the Server divided by …

Deep Learning Experiments

Figure 3: Comparison of EF-SGD and EF21-SGD with EF-SGD-HB, EF21-SGD-HB, and EF21+-SGD-HB with tuned stepsizes, applied to train ResNet-18 on CIFAR-10.


The END