
EF21 with Bells & Whistles:

Practical Algorithmic Extensions of Modern Error Feedback


Federated Learning One World Seminar (FLOW)

October 13, 2021

Ilyas Fatkhullin

Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, Peter Richtárik

EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

arXiv:2110.03294, 2021

Co-authors

  • Igor Sokolov (KAUST)
  • Eduard Gorbunov (MIPT & Yandex)
  • Zhize Li (KAUST)
  • Peter Richtárik (KAUST)

Training a Federated Learning Model

min_{x ∈ R^d} f(x),  where  f(x) = (1/n) Σ_{i=1}^n f_i(x)

  • n — # devices
  • d — # model parameters / features
  • f_i — loss on data stored on device i

Heterogeneous data regime: the datasets are allowed to be different across devices.

DGD: Distributed Gradient Descent

x^{k+1} = x^k − (γ/n) Σ_{i=1}^n ∇f_i(x^k)

  • γ — stepsize
  • ∇f_i(x^k) — d-dimensional gradient computed by device i
  • Each device (illustrated with 3 devices, n = 3) sends its gradient to the server, which averages them.
  • Problem: the gradients are not sparse! Every round communicates dense d-dimensional vectors.
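The DGD round above can be sketched in a few lines. This is a minimal single-process simulation, assuming n devices each hold a simple quadratic loss f_i(x) = ½‖x − b_i‖² as an illustrative stand-in for the per-device losses; all names are hypothetical.

```python
import numpy as np

def dgd(grads, x0, stepsize, num_steps):
    """Run DGD: each device computes its full gradient; the server averages."""
    x = x0.copy()
    for _ in range(num_steps):
        g = np.mean([grad(x) for grad in grads], axis=0)  # server averages device gradients
        x = x - stepsize * g                              # gradient step with stepsize γ
    return x

rng = np.random.default_rng(0)
d, n = 5, 3
targets = [rng.standard_normal(d) for _ in range(n)]      # heterogeneous data across devices
grads = [(lambda x, b=b: x - b) for b in targets]         # ∇f_i(x) = x - b_i

x_star = np.mean(targets, axis=0)                         # minimizer of the average loss
x_out = dgd(grads, np.zeros(d), stepsize=0.5, num_steps=100)
print(np.allclose(x_out, x_star, atol=1e-6))              # → True
```

Note that every round ships n dense d-dimensional gradients to the server, which is exactly the communication bottleneck compression targets.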

DCGD: Distributed Compressed Gradient Descent

x^{k+1} = x^k − (γ/n) Σ_{i=1}^n C(∇f_i(x^k))

Contractive compression operator C: E‖C(x) − x‖² ≤ (1 − α)‖x‖² for all x, with α ∈ (0, 1].

Popular examples:

  • Top-K
  • Rand-K
  • Sign

DCGD Does Not Work in General!

Counterexamples:

Alexander Beznosikov, Samuel Horváth, Mher Safaryan, and Peter Richtárik
On biased compression for distributed learning
SpicyFL 2020: NeurIPS Workshop on Scalability, Privacy, and Security in Federated Learning, arXiv:2002.12410, 2020

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
Error Feedback Fixes SignSGD and other Gradient Compression Schemes, ICML 2019, arXiv:1901.09847, 2019

The key issue is with biased compressors C! However, we still want to use biased C.
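The canonical contractive compressor is Top-K, sketched below: keep the K largest-magnitude coordinates and zero the rest. For Top-K on R^d it is known that ‖C(x) − x‖² ≤ (1 − K/d)‖x‖² holds deterministically, i.e. α = K/d; the code checks this on a random vector.

```python
import numpy as np

def top_k(x, k):
    """Return a copy of x with all but the k largest-magnitude entries zeroed."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest magnitudes
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(10)
cx = top_k(x, 3)
alpha = 3 / 10                          # contraction parameter of Top-3 on R^10

# The contraction inequality holds deterministically for Top-K:
print(np.sum((cx - x) ** 2) <= (1 - alpha) * np.sum(x ** 2))  # → True
```

Note that Top-K is biased: E[C(x)] ≠ x in general, which is precisely why plain DCGD can fail with it.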

A Popular Fix for DCGD is Error Feedback

Prior works in the non-convex regime:

  • Original EF, first empirical study:

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu
1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech, 2014

  • First analysis of EF (strong assumptions, weak rates):

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi
Error Feedback Fixes SignSGD and other Gradient Compression Schemes, oral ICML 2019, arXiv:1901.09847, 2019

Sebastian U. Stich and Sai Praneeth Karimireddy
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication, JMLR, 21(237):1-36, 2020, arXiv:1909.05350, 2019

  • … more works in between …

  • Redesigned EF and improved analysis:

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

EF and EF21 in the Single Node Setting

EF:

g^k = C(e^k + γ ∇f(x^k))
x^{k+1} = x^k − g^k
e^{k+1} = e^k + γ ∇f(x^k) − g^k

EF21:

x^{k+1} = x^k − γ g^k
g^{k+1} = g^k + C(∇f(x^{k+1}) − g^k)

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

If C is deterministic, positively homogeneous, and additive, then EF = EF21. The same holds for the distributed variants.
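A minimal sketch of the single-node EF21 recursion with a Top-K compressor, run on a toy quadratic f(x) = ½‖x − b‖². The objective, the stepsize, and the initialization g⁰ = 0 are illustrative choices, not prescriptions from the paper; the state g^k tracks the gradient and only the compressed difference is "sent".

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef21(grad, x0, stepsize, k, num_steps):
    """Single-node EF21 with a Top-k compressor."""
    x, g = x0.copy(), np.zeros_like(x0)   # g^0 = 0 (a common initialization choice)
    for _ in range(num_steps):
        x = x - stepsize * g              # x^{k+1} = x^k - γ g^k
        g = g + top_k(grad(x) - g, k)     # g^{k+1} = g^k + C(∇f(x^{k+1}) - g^k)
    return x

b = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
x_out = ef21(lambda x: x - b, np.zeros(5), stepsize=0.1, k=2, num_steps=5000)
print(np.linalg.norm(x_out - b))   # approaches 0: the iterates reach the minimizer
```

Even though only 2 of the 5 coordinates of g are refreshed per step, the residual ∇f(x) − g is driven to zero, which is the mechanism behind EF21's analysis.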

Contributions: Extensions of EF21

(New) = was never analyzed before with error feedback/contractive compressors
(Better) = the analysis is superior to the best known analysis with original EF

EF21 +

  1. stochastic approximation (Better)
  2. variance reduction (New)
  3. partial participation (New)
  4. bidirectional compression (Better)
  5. momentum (Better)
  6. proximal setting (New)

Extension 1: Stochastic Approximation

EF21-SGD: EF21 with Stochastic Gradients

Server: x^{k+1} = x^k − (γ/n) Σ_{i=1}^n g_i^k

Devices: g_i^{k+1} = g_i^k + C(∇f_i(x^{k+1}; ξ_i^{k+1}) − g_i^k), where ∇f_i(x^{k+1}; ξ_i^{k+1}) is a stochastic gradient built from a mini-batch of independent samples on device i.

Peter Richtárik, Igor Sokolov and Ilyas Fatkhullin
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback, oral NeurIPS 2021, arXiv:2106.05203, 2021

Assumption on the stochastic gradients:

  • contains bounded variance as a special case
  • provably holds for popular samplings

Ahmed Khaled and Peter Richtárik
Better theory for SGD in the nonconvex world, arXiv:2002.03329, 2020

Theorem (EF21-SGD): convergence is established with a biased compressor, at the price of large batches.
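The role of the mini-batch can be seen in a small numeric illustration: averaging per-sample gradients drawn independently on a device gives an unbiased estimator whose variance shrinks like 1/batch_size, which is why the analysis trades large batches for the bias of the compressor. The quadratic per-sample gradients below are an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 1000
A = rng.standard_normal((m, d))            # one row per local datapoint
x = rng.standard_normal(d)
per_sample_grads = A - x                   # pretend gradient of sample j is a_j - x (illustrative)
full_grad = per_sample_grads.mean(axis=0)  # the exact local gradient

def minibatch_grad(batch_size):
    """Unbiased mini-batch gradient: average over a batch sampled with replacement."""
    idx = rng.integers(0, m, size=batch_size)
    return per_sample_grads[idx].mean(axis=0)

# Larger batches concentrate around the full gradient (variance ~ 1/batch_size):
err_small = np.mean([np.sum((minibatch_grad(5) - full_grad) ** 2) for _ in range(500)])
err_large = np.mean([np.sum((minibatch_grad(100) - full_grad) ** 2) for _ in range(500)])
print(err_large < err_small)
```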

Extension 2: Variance Reduction

EF21-PAGE: EF21 with Variance Reduction

Each of the n devices holds a finite sum over its m local datapoints:

f_i(x) = (1/m) Σ_{j=1}^m f_{ij}(x)

Variance-reduced methods: SAG, SVRG, SAGA, PAGE.

The PAGE method — optimal variance reduction for smooth, non-convex problems:

Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richtárik
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, ICML, 2021, arXiv:2008.10898, 2020

EF21-PAGE combines two estimators on each device i:

  • PAGE estimator on device i: with a small probability, recompute the full local gradient; otherwise, update the previous estimate with a mini-batch of gradient differences.
  • EF21 estimator on device i: g_i^{k+1} = g_i^k + C(v_i^{k+1} − g_i^k), where v_i^{k+1} is the PAGE estimate.

Theorem (EF21-PAGE) — compare with the uncompressed-error-feedback baseline MARINA:

Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, Peter Richtárik
MARINA: Faster Non-Convex Distributed Learning with Compression,
ICML, 2021, arXiv:2102.07845, 2021
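One PAGE step can be sketched as follows: with probability p take a full pass over the local data, otherwise correct the previous estimate with a cheap mini-batch of gradient differences. The per-sample gradients grad_j(x) = x − a_j are illustrative; with such quadratics the gradient differences are sample-independent, so the updated estimate is exact, which makes the sketch easy to check.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 200
A = rng.standard_normal((m, d))                # local dataset: one row per datapoint

def grad_j(x, j):
    return x - A[j]                            # illustrative per-sample gradient

def full_grad(x):
    return x - A.mean(axis=0)

def page_update(v_prev, x_new, x_old, p=0.1, batch=8):
    """One PAGE step: v^{k+1} from v^k and the two latest iterates."""
    if rng.random() < p:
        return full_grad(x_new)                # occasional full gradient pass
    idx = rng.integers(0, m, size=batch)
    diff = np.mean([grad_j(x_new, j) - grad_j(x_old, j) for j in idx], axis=0)
    return v_prev + diff                       # cheap incremental update

x_old = np.zeros(d)
x_new = 0.01 * rng.standard_normal(d)
v = page_update(full_grad(x_old), x_new, x_old)
print(np.linalg.norm(v - full_grad(x_new)) < 1e-12)  # → True (exact for these quadratics)
```

In EF21-PAGE this estimate v would then be fed through the EF21 compression recursion before being sent to the server.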

Extension 3: Partial Participation

EF21-PP: EF21 with Partial Participation of Devices

In each round, only a subset of the n devices communicates with the server. The subset is drawn via an arbitrary proper sampling; p_i is the probability of sampling device i in EF21-PP.

More about samplings:

Zheng Qu and Peter Richtárik
Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014
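A concrete instance of an arbitrary proper sampling is independent sampling: device i joins a round with its own probability p_i > 0 ("proper" = every device has a nonzero chance of participating). With the usual 1/p_i importance weighting, round averages stay unbiased; this weighting is a generic illustration, not the exact EF21-PP aggregation rule.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
p = np.array([0.2, 0.5, 0.9, 0.4, 0.7])     # per-device participation probabilities (all > 0)
values = np.arange(1.0, n + 1)               # quantity held by each device

def weighted_round_mean():
    """Average over participating devices with 1/p_i importance correction."""
    mask = rng.random(n) < p                 # which devices participate this round
    return np.sum(mask * values / p) / n

# Across many rounds, the importance-weighted estimate matches the full mean:
est = np.mean([weighted_round_mean() for _ in range(200000)])
print(abs(est - values.mean()) < 0.05)
```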

Extension 4: Bidirectional Compression

EF21-BC: EF21 with Bidirectional Compression

Motivation: the d-dimensional parameter vector sent from the server to each device (illustrated with 3 devices) is not sparse! The downlink should be compressed too.

Prior approach — the DoubleSqueeze method:

Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, and Ji Liu
DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression, ICML 2019, arXiv:1905.05957, 2019

  • assumes a bounded magnitude of error (suboptimal)

EF21-BC uses two compressors:

  • C_M — compressor on the server/master
  • C_W — compressor on each device/worker

Theorem (EF21-BC): the resulting guarantees improve on DoubleSqueeze.
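A schematic round with bidirectional compression: workers compress their uplink messages with C_W (Top-K here) and the server compresses the broadcast model update with C_M, so no dense d-dimensional vector ever crosses the network. This is an illustration of the communication pattern, not the exact EF21-BC algorithm.

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(5)
d, n = 10, 3
x = rng.standard_normal(d)                             # current model on the server
worker_grads = [rng.standard_normal(d) for _ in range(n)]

uplink = [top_k(g, 3) for g in worker_grads]           # C_W on each worker: sparse uplink
avg = np.mean(uplink, axis=0)                          # server aggregates sparse messages
update = top_k(-0.1 * avg, 3)                          # C_M compresses the broadcast update
x_new = x + update                                     # every device applies the same sparse update

print(np.count_nonzero(update) <= 3)  # → True: the downlink message is sparse too
```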

Extension 5: Momentum

EF21-HB: EF21 with Heavy Ball Momentum

Heavy ball momentum (with a momentum parameter):

Boris T. Polyak
Some methods of speeding up the convergence of iteration methods,
USSR Computational Mathematics and Mathematical Physics 4(5):1-17, 1964

Diederik P. Kingma and Jimmy Ba
Adam: A method for stochastic optimization,
arXiv:1412.6980, 2014

In EF21-HB, the devices run the usual EF21 updates, and the server applies a heavy ball momentum step on top of the aggregated gradient estimate.

Theorem (EF21-HB): stated for a suitable choice of the momentum parameter.
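For reference, the classical heavy ball update (Polyak, 1964) is x^{k+1} = x^k − γ ∇f(x^k) + β (x^k − x^{k−1}), with momentum parameter β. The sketch below runs it on a 1-D quadratic f(x) = ½x²; the stepsize and momentum values are illustrative.

```python
def heavy_ball(grad, x0, stepsize, beta, num_steps):
    """Polyak's heavy ball: gradient step plus momentum from the previous iterate."""
    x_prev, x = x0, x0
    for _ in range(num_steps):
        # x^{k+1} = x^k - γ ∇f(x^k) + β (x^k - x^{k-1})
        x, x_prev = x - stepsize * grad(x) + beta * (x - x_prev), x
    return x

x_out = heavy_ball(lambda x: x, x0=5.0, stepsize=0.1, beta=0.5, num_steps=300)
print(abs(x_out) < 1e-6)  # → True: iterates spiral into the minimizer at 0
```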

Extension 6: Proximal Setting

EF21-Prox: EF21 with Proximal Step

Proximal update/subproblem:

x^{k+1} = prox_{γR}(x^k − γ g^k),  where  prox_{γR}(y) = argmin_u { R(u) + (1/(2γ)) ‖u − y‖² }

Evaluation metric — gradient mapping:

G_γ(x) = (1/γ) (x − prox_{γR}(x − γ ∇f(x)))

The devices run the usual EF21 compressed-gradient updates; the server applies the proximal step.
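The proximal machinery is easy to instantiate. For R(x) = λ‖x‖₁ the proximal operator has the closed form prox_{γR}(y) = sign(y)·max(|y| − γλ, 0) (soft-thresholding), and the gradient mapping G_γ replaces ‖∇f(x)‖ as the stationarity measure: it vanishes exactly at a composite minimizer. The quadratic f and all parameter values below are illustrative.

```python
import numpy as np

lam, gamma = 0.5, 0.1

def prox_l1(y, t):
    """Soft-thresholding: the prox of t * lam * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - t * lam, 0.0)

def grad_map(x, grad_f):
    """Gradient mapping G_gamma(x) = (x - prox(x - gamma * grad f(x))) / gamma."""
    return (x - prox_l1(x - gamma * grad_f(x), gamma)) / gamma

b = np.array([2.0, 0.1, -1.0])
grad_f = lambda x: x - b                  # f(x) = 0.5 ||x - b||^2

# The minimizer of f + lam*||.||_1 is the soft-thresholded b:
x_star = prox_l1(b, 1.0)                  # threshold 1.0 * lam = 0.5 → [1.5, 0.0, -0.5]
print(np.allclose(grad_map(x_star, grad_f), 0.0))  # → True: G vanishes at the minimizer
```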

Complexity: General Nonconvex Case

Notation in the complexity results:

  • α — compression parameter on the devices; α_M — compression parameter on the server
  • σ² — variance of the stochastic gradient


Experiments

Problem and Data

The training objective uses non-convex regularization.

Experiment 1: Fast Convergence with Variance Reduction

(a) Convergence in epochs.
(b) Convergence in terms of the total number of bits sent from Clients to the Server divided by …

Experiment 2: Effect of Partial Participation of Devices

(a) Convergence in communication rounds.
(b) Convergence in terms of the total number of bits sent from Clients to the Server divided by …

Deep Learning Experiments

Figure 3: Comparison of EF-SGD and EF21-SGD with EF-SGD-HB, EF21-SGD-HB, and EF21+-SGD-HB with tuned stepsizes, applied to train ResNet-18 on CIFAR-10.


The END