
Byzantine Robust Collaborative Learning

Sai Praneeth Karimireddy

Lie He

Martin Jaggi

Byzantine robust learning

Collaborative learning

  • Multiple workers with their own data

  • Work together to learn a model over combined data

  • Better privacy, security, and scalability

Federated learning

  • Server sends model x to everyone

  • Each worker i computes stochastic gradient at x and sends to server

  • Server accumulates the gradients and computes the new parameters (sketched below)
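A minimal sketch of this round structure (plain, non-robust averaging; the worker callables and learning rate are placeholders standing in for local data and models):

```python
import numpy as np

def federated_sgd_round(x, workers, lr=0.1):
    """One round of plain (non-robust) federated SGD.

    x       -- current model parameters (numpy array)
    workers -- list of callables; workers[i](x) returns worker i's
               stochastic gradient at x (stands in for local data + model)
    """
    # 1. The server sends the model x to every worker.
    # 2. Each worker i computes a stochastic gradient at x and sends it back.
    grads = [g(x) for g in workers]
    # 3. The server accumulates the gradients (a plain average here)
    #    and computes the new parameters.
    return x - lr * np.mean(grads, axis=0)
```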

Byzantine robust learning

  • But workers might have hardware failures


  • Or they may even be malicious attackers


We protect against worst-case:

  • Small (𝛿) fraction of workers may send arbitrary updates

  • They can collude and are omniscient

  • They want to derail convergence

Can we ensure no one can tamper with our training?

Part I

Robust Aggregators

Byzantine robustness: Robust Aggregator

What if one worker sends an arbitrarily large value?

Avg is very sensitive.

Idea: replace the average with a robust (aka “insensitive”) aggregator


Median is very robust.

Generalizations:

  • Coordinate-wise median [Yin et al. 2017]

  • Krum [Blanchard et al. 2018]

  • Geometric median / RFA [Pillutla et al. 2019]

Median based aggregators: theoretical failure

  • Consider the case where the correct outputs are (+1, -1, +1, -1, +1, …, -1)

  • The correct average is 0

  • The median outputs ±1

Robust but inaccurate (see the numeric sketch below).

  • CM, Krum, RFA all have bad accuracy.
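A tiny numpy illustration of this failure mode for the coordinate-wise median, with no attackers (the ±1 "gradients" are the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 101                                    # an odd number of honest workers
grads = rng.choice([-1.0, +1.0], size=n)   # each gradient is +1 or -1, true mean 0

print(grads.mean())        # close to 0 (-> 0 as n grows): the value SGD needs
print(np.median(grads))    # exactly +1 or -1: robust, but badly biased
```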


[Karimireddy et al. ICML 2021]

Theorem. There exist 1D strongly convex stochastic optimization problems where SGD converges but CM, RFA, and Krum do not.

Even without any Byzantine attackers.

Median based aggregators: experimental failure

  • We construct a long-tailed MNIST dataset

  • 75% accuracy corresponds to learning only classes 1 & 2 and ignoring all others.

No attackers!

Robust aggregator: Definition

(𝛿max, c)-robust aggregator.

For any 𝛿 ≤ 𝛿max, given {x1, …, xn} such that a (1 - 𝛿) fraction (the “good” inputs) satisfy:

E|| xi - xj ||² ≤ ρ² for every pair of good i, j.

Aggregator(x1, …, xn) outputs xout with error:

E|| xout - xmean ||² ≤ c 𝛿 ρ², where xmean is the average of the good inputs.

𝛿max controls robustness, c controls accuracy.


We expect exact mean if

𝛿 = 0: use the average.  ρ = 0: use the majority.

Best we can do.
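The same definition in symbols (my rendering of the statement above; \(\bar{x}\) denotes the average of the good inputs):

```latex
\mathbb{E}\,\|x_i - x_j\|^2 \le \rho^2 \;\;\text{for all good } i, j
\quad\Longrightarrow\quad
\mathbb{E}\,\big\|\mathrm{Agg}(x_1,\dots,x_n) - \bar{x}\big\|^2 \le c\,\delta\,\rho^2,
\qquad \text{for any } \delta \le \delta_{\max}.
```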

Constructing a robust aggregator: centered clipping

Suppose we are given some inputs,

a “guess” v,

and a clipping threshold 𝝉.

Clip every input to within distance 𝝉 of the guess v, then average the clipped values.
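A minimal numpy sketch of this clip-and-average step (the helper name, defaults, and the multi-iteration loop are my own; iterating with v set to the previous output gives the multi-step variant):

```python
import numpy as np

def centered_clip(xs, v, tau, iters=1):
    """Centered clipping aggregator (sketch).

    xs   -- worker vectors, shape (n, d)
    v    -- starting guess for the aggregate, shape (d,)
    tau  -- clipping radius
    """
    xs = np.asarray(xs, dtype=float)
    for _ in range(iters):
        diffs = xs - v                                        # deviations from the guess
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + np.mean(scale * diffs, axis=0)                # average of clipped deviations
    return v
```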


Theorem. Centered clip is a robust aggregator for 𝛿max=0.15 and c = O(1) for arbitrary starting point v.

If v is good, 1 iteration suffices.

  • Cheap, only O(n) cost
  • Compatible with secure aggregation
  • Compatible with all reduce and asynchrony

[Karimireddy et al. ICML 2021]


  • Experiment: long-tailed MNIST dataset

  • Centered clip beats all other methods

  • For guess v, use aggregate output of previous round

Part II:

Necessity of history in i.i.d. federated learning

Necessity of history: ideal update

Assume for this part that the workers' data is i.i.d.

Necessity of history: approximate update

  • Suppose we successfully defend against Byzantine attacks, up to some small per-round error e



  • Attacks can be coupled across time

  • Error e adds up over time

  • Eventually, leads to large divergence


Necessity of history: theorem

This is impossible to avoid for any algorithm that is ‘memory-less’, i.e. whose output does not change if we permute its inputs.

Theorem: The output x of any permutation invariant algorithm necessarily has an error:

[Karimireddy et al. ICML 2021]

Necessity of history: experiment

  • “A little is enough” (ALIE) attacks on standard MNIST.

  • Dotted line is ideal accuracy

  • All aggregators (solid lines) suffer a 20–60% drop in accuracy

Using history

  • Noise / variance is independent across time

  • Attacks are coupled

Using history: idea

Noise is independent across rounds => averaging reduces its magnitude.

Time-coupled attacks don’t get smaller => easier to identify.
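A toy numeric check of this intuition, contrasting i.i.d. noise with a constant, time-coupled shift averaged over rounds (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                  # number of rounds averaged over

noise = rng.normal(size=T)                # honest noise: independent across rounds
attack = np.ones(T)                       # time-coupled attack: same shift every round

print(abs(noise.mean()))     # ~0.03, shrinks like 1/sqrt(T)
print(abs(attack.mean()))    # 1.0, does not shrink -> stands out after averaging
```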


Using history: momentum

  • Use momentum locally

Momentum effectively averages over history.


  • Aggregate worker momentums instead of gradients

Momentum reduces variance but introduces bias. Convergence?
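A sketch of the worker-side momentum update that gets sent to the server (β is the momentum parameter, e.g. 0.9 or 0.99; the server-side lines in the comment reuse the centered_clip sketch from Part I and are placeholders, not the exact algorithm):

```python
def worker_momentum(m_prev, grad, beta=0.99):
    """Worker-side momentum: an exponential average of this worker's own gradients.

    The worker sends m (not the raw gradient); independent gradient noise
    is averaged down over rounds, while time-coupled attacks are not,
    which makes them easier for the robust aggregator to reject.
    """
    return beta * m_prev + (1.0 - beta) * grad

# Server side, each round (sketch):
#   momenta = [worker_momentum(m[i], stochastic_grad(i, x), beta) for i in range(n)]
#   x = x - lr * centered_clip(momenta, v=prev_agg, tau=tau)
```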

Using history: convergence theory

Theorem: Given any (𝛅max, c)-robust aggregator, and a Byzantine robust problem with a 𝛅 fraction of attackers and σ² variance, our algorithm outputs xout s.t.

[Karimireddy et al. ICML 2021]

  • Converges to optimum.

  • When 𝛅 = O(1/n), recovers the SGD rate.

  • Works with any robust aggregator + momentum.

Using history: experiment

  • “A little is enough” (ALIE) attacks on standard MNIST, with 0.99 momentum

  • Centered Clip + momentum = 0.99 matches the ideal performance

Summary

  • A new definition of robust aggregator

Centered clipping satisfies it.

  • Need to use history for Byzantine robustness. Using worker momentum suffices.

  • Centered clipping with worker momentum provably and practically defends against Byzantine attacks

Part III:

Over-parameterization for non-i.i.d. FL

Heterogeneity in FL: definition

In heterogeneous FL,

Bounded heterogeneity:
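The formulas on this slide did not survive the export; a standard way to write the heterogeneous objective and the bounded-heterogeneity assumption, in my own notation, is:

```latex
f(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\qquad
\frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(x) - \nabla f(x)\big\|^2 \;\le\; \zeta^2
\quad\text{for all } x .
```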

Heterogeneity in FL: challenge

  • Say worker 1 consistently sends a different update

  • Is worker 1 an attacker, or do they have different data?

Impossible to tell.

Heterogeneity in FL: impossibility

Theorem: For any algorithm, there exists a strongly convex problem for which a (1-𝛅) fraction of users satisfy

but the algorithm’s output necessarily has an error

Unlike in the i.i.d. case, convergence to the optimum in the non-i.i.d. setting seems impossible.

Possible workaround?

[Karimireddy et al. ICLR 2022]

Overparameterization in FL

Old assumption:

New assumption:
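The two conditions were lost in export; a plausible rendering in my own notation (the new one is a heterogeneity analogue of the strong-growth condition, and B is the constant the later theorem refers to):

```latex
\text{Old: } \frac{1}{n}\sum_{i}\big\|\nabla f_i(x)-\nabla f(x)\big\|^2 \le \zeta^2,
\qquad
\text{New: } \frac{1}{n}\sum_{i}\big\|\nabla f_i(x)-\nabla f(x)\big\|^2 \le B^2\,\big\|\nabla f(x)\big\|^2 .
```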

  • At optimum, RHS is 0 => LHS is 0 => one model fits all clients

  • “Strong growth” condition [Vaswani et al. 2019]

Using overparameterization: convergence theory

Theorem: Given any (𝛅max, c)-robust aggregator, and a Byzantine robust overparameterized problem with a 𝛅 fraction of attackers and σ² variance, our algorithm outputs xout s.t.

[Karimireddy et al. ICLR 2022]

  • Converges to optimum for small enough B (sufficient overparameterization)

Using overparameterization: experiment

(Plot: test loss vs. rounds.)

Experiment: centered clipping, under attack, for different model sizes.

Convergence improves by increasing model size.

Using overparameterization: intuition

  • We know the model can fit all users.

  • If we consistently get an update from one user that makes another user worse => it must be an attack => we can remove it.

  • Only attacks which do not conflict are incorporated => leaves train / test accuracy intact => “benign overfitting” [Bartlett et al. 2019]

  • Limitation: backdoor attacks

47 of 60

Part III:

Decentralized robustness

47

Decentralized learning

  • No server; users directly talk to each other

  • Communication is restricted to the edges of a graph

Decentralized SGD

  • Everyone maintains their own parameters xi

In each step,

  • Take an SGD step locally


  • Receive parameters xj from neighbors N(i)


  • Update with the averaged parameters (see the sketch below)
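A sketch of one synchronous round of these steps, using uniform gossip weights over each node's neighborhood (the dict layout and weighting are placeholders; in general one uses a doubly-stochastic mixing matrix):

```python
import numpy as np

def decentralized_sgd_round(x, grads, neighbors, lr=0.1):
    """One synchronous round of decentralized SGD (sketch).

    x         -- dict: node id -> that node's current parameters
    grads     -- dict: node id -> callable returning a stochastic gradient
    neighbors -- dict: node id -> list of neighbor ids N(i)
    """
    # 1. Every node takes an SGD step on its own data.
    half = {i: x[i] - lr * grads[i](x[i]) for i in x}
    # 2. Every node receives its neighbors' parameters and replaces its own
    #    with the average (uniform weights over N(i) and itself).
    return {i: np.mean([half[j] for j in neighbors[i]] + [half[i]], axis=0)
            for i in x}
```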

Byzantine Robust Decentralized learning

Can we do decentralized training in the presence of Byzantine agents?


New Properties:

  • You only trust yourself, no one else => everyone maintains their own, different xi.

  • All communication is filtered through your neighbors => topology is important.

  • Everyone is an aggregator and has some data.

Self-centered clipping

Clip all received values to your own parameters, i.e. use v = xi as the center.
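A sketch of one aggregation step under this rule, reusing the clip-and-average idea from Part I with the node's own parameters as the center (the helper and the uniform averaging are my own simplifications, not the exact method):

```python
import numpy as np

def self_centered_clip_step(x_i, neighbor_params, tau):
    """Aggregate neighbors' parameters, clipped around your own x_i (sketch)."""
    clipped = []
    for x_j in neighbor_params:                 # parameters received from N(i)
        diff = x_j - x_i
        norm = np.linalg.norm(diff)
        clipped.append(x_i + diff * min(1.0, tau / max(norm, 1e-12)))
    # Average your own parameters with the clipped neighbor parameters.
    return np.mean([x_i] + clipped, axis=0)
```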


Theorem. Self-centered clipping converges to a radius around the optimum of

where γ is the spectral gap of the communication graph.

[He et al. arXiv 2022]


Challenge of Information bottlenecks

2 cliques with 1 bad actor.


The bad node can pretend to be an entire additional clique.

From clique 1's perspective, it is impossible to tell which side is real / telling the truth.


[He et al. arXiv 2022]

Theorem. For any algorithm, there exists a decentralized problem with an O(1/n) fraction of attackers who control an O(1/n²) fraction of edges such that the error is

  • Not circumventable within the optimization framework: robust communication networks are very important.

  • Possibly preventable using proof-of-work / crypto techniques?

Conclusion


Byzantine -> Strategic

  • Why should agents join? Individual rationality.

  • Why should they tell you the truth? Incentive compatibility.


Thank you!

I sit at SDH 7, reach out!
