
Byzantine Robust Collaborative Learning

Sai Praneeth Karimireddy

Lie He

Martin Jaggi

Byzantine robust learning

Collaborative learning

  • Multiple workers with their own data

  • Work together to learn a model over combined data

  • Better privacy, security, and scalability

Federated learning

  • Server sends model x to everyone

  • Each worker i computes stochastic gradient at x and sends to server

  • Server accumulates the gradients and computes the new parameters (sketched below)
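A minimal sketch of this round structure (plain, non-robust averaging; the worker callables and learning rate are placeholders standing in for local data and models):

```python
import numpy as np

def federated_sgd_round(x, workers, lr=0.1):
    """One round of plain (non-robust) federated SGD.

    x       -- current model parameters (numpy array)
    workers -- list of callables; workers[i](x) returns worker i's
               stochastic gradient at x (stands in for local data + model)
    """
    # 1. The server sends the model x to every worker.
    # 2. Each worker i computes a stochastic gradient at x and sends it back.
    grads = [g(x) for g in workers]
    # 3. The server accumulates the gradients (a plain average here)
    #    and computes the new parameters.
    return x - lr * np.mean(grads, axis=0)
```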

Byzantine robust learning

  • But workers might have hardware failures


  • Or they may even be malicious attackers


We protect against worst-case:

  • Small (𝛿) fraction of workers may send arbitrary updates

  • They can collude and are omniscient

  • They want to derail convergence

Can we ensure no one can tamper with our training?

Part I

Robust Aggregators

Byzantine robustness: Robust Aggregator

What if one worker sends an arbitrarily large value?

Avg is very sensitive.

Idea: replace the average with a robust (aka “insensitive”) aggregator


Median is very robust.

Generalizations:

  • Coordinate-wise median [Yin et al. 2017]

  • Krum [Blanchard et al. 2018]

  • Geometric median / RFA [Pillutla et al. 2019]

Median based aggregators: theoretical failure

  • Consider the case where the correct outputs are (+1, -1, +1, -1, +1, …, -1)

  • The correct average is 0

  • The median outputs ±1

Robust but inaccurate (see the numeric sketch below).

  • CM, Krum, RFA all have bad accuracy.
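A tiny numpy illustration of this failure mode for the coordinate-wise median, with no attackers (the ±1 "gradients" are the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 101                                    # an odd number of honest workers
grads = rng.choice([-1.0, +1.0], size=n)   # each gradient is +1 or -1, true mean 0

print(grads.mean())        # close to 0 (-> 0 as n grows): the value SGD needs
print(np.median(grads))    # exactly +1 or -1: robust, but badly biased
```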


[Karimireddy et al. ICML 2021]

Theorem. There exist 1D strongly convex stochastic optimization problems where SGD converges but CM, RFA, and Krum do not.

Even without any Byzantine attackers.

Median based aggregators: experimental failure

  • We construct a long-tailed MNIST dataset

  • 75% accuracy corresponds to learning only classes 1 & 2 and ignoring all others.

No attackers!

Robust aggregator: Definition

(𝛿max, c)-robust aggregator.

For any 𝛿 ≤ 𝛿max, given {x1, …, xn} such that a (1 - 𝛿) fraction (the “good” inputs) satisfy:

E|| xi - xj ||² ≤ ρ² for every pair of good i, j.

Aggregator(x1, …, xn) outputs xout with error:

E|| xout - xmean ||² ≤ c 𝛿 ρ², where xmean is the average of the good inputs.

𝛿max controls robustness, c controls accuracy.


We expect exact mean if

𝛿 = 0: use the average.  ρ = 0: use the majority.

Best we can do.
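The same definition in symbols (my rendering of the statement above; \(\bar{x}\) denotes the average of the good inputs):

```latex
\mathbb{E}\,\|x_i - x_j\|^2 \le \rho^2 \;\;\text{for all good } i, j
\quad\Longrightarrow\quad
\mathbb{E}\,\big\|\mathrm{Agg}(x_1,\dots,x_n) - \bar{x}\big\|^2 \le c\,\delta\,\rho^2,
\qquad \text{for any } \delta \le \delta_{\max}.
```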

Constructing a robust aggregator: centered clipping

Suppose we are given some inputs,

a “guess” v,

and a clipping threshold 𝝉.

Clip every input to within distance 𝝉 of the guess v, then average the clipped values.
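A minimal numpy sketch of this clip-and-average step (the helper name, defaults, and the multi-iteration loop are my own; iterating with v set to the previous output gives the multi-step variant):

```python
import numpy as np

def centered_clip(xs, v, tau, iters=1):
    """Centered clipping aggregator (sketch).

    xs   -- worker vectors, shape (n, d)
    v    -- starting guess for the aggregate, shape (d,)
    tau  -- clipping radius
    """
    xs = np.asarray(xs, dtype=float)
    for _ in range(iters):
        diffs = xs - v                                        # deviations from the guess
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + np.mean(scale * diffs, axis=0)                # average of clipped deviations
    return v
```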


Theorem. Centered clip is a robust aggregator for 𝛿max=0.15 and c = O(1) for arbitrary starting point v.

If v is good, 1 iteration suffices.

  • Cheap, only O(n) cost
  • Compatible with secure aggregation
  • Compatible with all reduce and asynchrony

[Karimireddy et al. ICML 2021]


  • Experiment: long-tailed MNIST dataset

  • Centered clip beats all other methods

  • For guess v, use aggregate output of previous round

Part II:

Necessity of history in i.i.d. federated learning

Necessity of history: ideal update

Assume for this part that the workers' data is i.i.d.

Necessity of history: approximate update

  • Suppose we successfully defend against Byzantine attacks, up to some small per-round error e



  • Attacks can be coupled across time

  • Error e adds up over time

  • Eventually, leads to large divergence


Necessity of history: theorem

This is impossible to avoid for any algorithm that is ‘memory-less’, i.e. whose output does not change if we permute its inputs.

Theorem: The output x of any permutation invariant algorithm necessarily has an error:

[Karimireddy et al. ICML 2021]

Necessity of history: experiment

  • “A little is enough” (ALIE) attacks on standard MNIST.

  • Dotted line is ideal accuracy

  • All aggregators (solid lines) suffer a 20–60% drop in accuracy

Using history

  • Noise / variance is independent across time

  • Attacks are coupled

Using history: idea

Noise is independent across rounds => averaging reduces its magnitude.

Time-coupled attacks don’t get smaller => easier to identify.
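A toy numeric check of this intuition, contrasting i.i.d. noise with a constant, time-coupled shift averaged over rounds (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                  # number of rounds averaged over

noise = rng.normal(size=T)                # honest noise: independent across rounds
attack = np.ones(T)                       # time-coupled attack: same shift every round

print(abs(noise.mean()))     # ~0.03, shrinks like 1/sqrt(T)
print(abs(attack.mean()))    # 1.0, does not shrink -> stands out after averaging
```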


Using history: momentum

  • Use momentum locally

Momentum effectively averages over history.


  • Aggregate worker momentums instead of gradients

Momentum reduces variance but introduces bias. Convergence?
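A sketch of the worker-side momentum update that gets sent to the server (β is the momentum parameter, e.g. 0.9 or 0.99; the server-side lines in the comment reuse the centered_clip sketch from Part I and are placeholders, not the exact algorithm):

```python
def worker_momentum(m_prev, grad, beta=0.99):
    """Worker-side momentum: an exponential average of this worker's own gradients.

    The worker sends m (not the raw gradient); independent gradient noise
    is averaged down over rounds, while time-coupled attacks are not,
    which makes them easier for the robust aggregator to reject.
    """
    return beta * m_prev + (1.0 - beta) * grad

# Server side, each round (sketch):
#   momenta = [worker_momentum(m[i], stochastic_grad(i, x), beta) for i in range(n)]
#   x = x - lr * centered_clip(momenta, v=prev_agg, tau=tau)
```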

Using history: convergence theory

Theorem: Given any (𝛅max, c)-robust aggregator, and a Byzantine robust problem with a 𝛅 fraction of attackers and σ² variance, our algorithm outputs xout s.t.

[Karimireddy et al. ICML 2021]

  • Converges to optimum.

  • When 𝛅 = O(1/n), recovers the SGD rate.

  • Works with any robust aggregator + momentum.

Using history: experiment

  • “A little is enough” (ALIE) attacks on standard MNIST, with 0.99 momentum

  • Centered Clip + momentum = 0.99 matches the ideal performance

Summary

  • A new definition of robust aggregator

Centered clipping satisfies it.

  • Need to use history for Byzantine robustness. Using worker momentum suffices.

  • Centered clipping with worker momentum provably and practically defends against Byzantine attacks

Part III:

Over-parameterization for non-i.i.d. FL

Heterogeneity in FL: definition

In heterogeneous FL,

Bounded heterogeneity:
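The formulas on this slide did not survive the export; a standard way to write the heterogeneous objective and the bounded-heterogeneity assumption, in my own notation, is:

```latex
f(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\qquad
\frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(x) - \nabla f(x)\big\|^2 \;\le\; \zeta^2
\quad\text{for all } x .
```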

Heterogeneity in FL: challenge

  • Say worker 1 consistently sends a different update

  • Is worker 1 an attacker, or do they have different data?

Impossible to tell.

Heterogeneity in FL: impossibility

Theorem: For any algorithm, there exists a strongly convex problem for which a (1-𝛅) fraction of users satisfy

but the algorithm’s output necessarily has an error

Unlike in the i.i.d. case, convergence to the optimum in the non-i.i.d. setting seems impossible.

Possible workaround?

[Karimireddy et al. ICLR 2022]

Overparameterization in FL

Old assumption:

New assumption:
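The two conditions were lost in export; a plausible rendering in my own notation (the new one is a heterogeneity analogue of the strong-growth condition, and B is the constant the later theorem refers to):

```latex
\text{Old: } \frac{1}{n}\sum_{i}\big\|\nabla f_i(x)-\nabla f(x)\big\|^2 \le \zeta^2,
\qquad
\text{New: } \frac{1}{n}\sum_{i}\big\|\nabla f_i(x)-\nabla f(x)\big\|^2 \le B^2\,\big\|\nabla f(x)\big\|^2 .
```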

  • At optimum, RHS is 0 => LHS is 0 => one model fits all clients

  • “Strong growth” condition [Vaswani et al. 2019]

Using overparameterization: convergence theory

Theorem: Given any (𝛅max, c)-robust aggregator, and a Byzantine robust overparameterized problem with a 𝛅 fraction of attackers and σ² variance, our algorithm outputs xout s.t.

[Karimireddy et al. ICLR 2022]

  • Converges to optimum for small enough B (sufficient overparameterization)

Using overparameterization: experiment

(Plot: test loss vs. rounds.)

Experiment: centered clipping, under attack, for different model sizes.

Convergence improves by increasing model size.

Using overparameterization: intuition

  • We know the model can fit all users.

  • If we consistently get an update from one user that makes another user worse => it must be an attack => we can remove it.

  • Only attacks which do not conflict are incorporated => leaves train / test accuracy intact => “benign overfitting” [Bartlett et al. 2019]

  • Limitation: backdoor attacks

47 of 60

Part III:

Decentralized robustness

47

Decentralized learning

  • No server; users directly talk to each other

  • Communication is restricted to the edges of a graph

Decentralized SGD

  • Everyone maintains their own parameters xi

In each step,

  • Take an SGD step locally


  • Receive parameters xj from neighbors N(i)


  • Update with the averaged parameters (see the sketch below)
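A sketch of one synchronous round of these steps, using uniform gossip weights over each node's neighborhood (the dict layout and weighting are placeholders; in general one uses a doubly-stochastic mixing matrix):

```python
import numpy as np

def decentralized_sgd_round(x, grads, neighbors, lr=0.1):
    """One synchronous round of decentralized SGD (sketch).

    x         -- dict: node id -> that node's current parameters
    grads     -- dict: node id -> callable returning a stochastic gradient
    neighbors -- dict: node id -> list of neighbor ids N(i)
    """
    # 1. Every node takes an SGD step on its own data.
    half = {i: x[i] - lr * grads[i](x[i]) for i in x}
    # 2. Every node receives its neighbors' parameters and replaces its own
    #    with the average (uniform weights over N(i) and itself).
    return {i: np.mean([half[j] for j in neighbors[i]] + [half[i]], axis=0)
            for i in x}
```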

Byzantine Robust Decentralized learning

Can we do decentralized training in the presence of Byzantine agents?


New Properties:

  • You only trust yourself, no one else => everyone maintains their own, different xi.

  • All communication is filtered through your neighbors => topology is important.

  • Everyone is an aggregator and has some data.

Self-centered clipping

Clip all received values to your own parameters, i.e. use v = xi as the center.
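A sketch of one aggregation step under this rule, reusing the clip-and-average idea from Part I with the node's own parameters as the center (the helper and the uniform averaging are my own simplifications, not the exact method):

```python
import numpy as np

def self_centered_clip_step(x_i, neighbor_params, tau):
    """Aggregate neighbors' parameters, clipped around your own x_i (sketch)."""
    clipped = []
    for x_j in neighbor_params:                 # parameters received from N(i)
        diff = x_j - x_i
        norm = np.linalg.norm(diff)
        clipped.append(x_i + diff * min(1.0, tau / max(norm, 1e-12)))
    # Average your own parameters with the clipped neighbor parameters.
    return np.mean([x_i] + clipped, axis=0)
```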


Theorem. Self-centered clipping converges to a radius around the optimum of

where γ is the spectral gap of the communication graph.

[He et al. arXiv 2022]


Challenge of Information bottlenecks

2 cliques with 1 bad actor.


The bad node can pretend to be an entire additional clique.

From clique 1's perspective, it is impossible to tell which side is real / telling the truth.


[He et al. arXiv 2022]

Theorem. For any algorithm, there exists a decentralized problem with an O(1/n) fraction of attackers who control an O(1/n²) fraction of edges such that the error is

  • Not circumventable within the optimization framework: robust communication networks are very important.

  • Possibly preventable using proof-of-work / crypto techniques?

Conclusion


Byzantine -> Strategic

  • Why should agents join? Individual rationality.

  • Why should they tell you the truth? Incentive compatibility.


Thank you!

I sit at SDH 7, reach out!
