Byzantine Robust Collaborative Learning
Sai Praneeth Karimireddy
Lie He
Martin Jaggi
Byzantine robust learning
Collaborative learning
Federated learning
Byzantine robust learning
We protect against the worst case: can we ensure that no one can tamper with our training?
Part I: Robust Aggregators
Byzantine robustness: Robust Aggregator
What if an attacker sends ∞? The average is very sensitive: a single corrupted input can drag it arbitrarily far.
Idea: replace the average with a robust (aka "insensitive") aggregator.
Byzantine robustness: Robust Aggregator
The median is very robust: even if one input is ∞, the median barely moves.
Generalizations (e.g. coordinate-wise median, trimmed mean): [Yin et al. 2017]
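Below is a minimal Python sketch (mine, not code from the talk) of two such median-based generalizations, coordinate-wise median and coordinate-wise trimmed mean:

import numpy as np

def coordinate_median(xs):
    """xs: (n, d) array of worker updates; robust to < n/2 corrupted rows."""
    return np.median(xs, axis=0)

def trimmed_mean(xs, beta=0.1):
    """Drop the beta-fraction largest and smallest values in each
    coordinate, then average what remains."""
    n = xs.shape[0]
    k = int(np.floor(beta * n))      # entries trimmed on each side
    sorted_xs = np.sort(xs, axis=0)  # sort each coordinate independently
    return sorted_xs[k:n - k].mean(axis=0)

# Example: 9 honest updates near 1.0, one attacker sending huge values.
updates = np.concatenate([np.random.normal(1.0, 0.1, size=(9, 3)),
                          np.full((1, 3), 1e9)])
print(coordinate_median(updates))  # close to 1.0 despite the attacker
print(trimmed_mean(updates))       # also close to 1.0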
Median based aggregators: theoretical failure
Example: a stream of stochastic gradients (+1, -1, +1, -1, +1, …, -1). The mean of many samples concentrates around the true (small) expectation, but the median always sits at ±1.
Robust but inaccurate.
Theorem. There exist 1D strongly convex stochastic optimization problems where SGD converges but CM, RFA, and Krum do not, even without any Byzantine attackers. [Karimireddy et al. ICML 2021]
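A tiny numeric demo of this failure (my construction, mirroring the ±1 example above): gradients are +1 with probability 0.6 and -1 otherwise, so the true mean gradient is 0.2, yet the median locks onto +1.

import numpy as np

rng = np.random.default_rng(0)
grads = rng.choice([+1.0, -1.0], size=10_001, p=[0.6, 0.4])

print(grads.mean())      # ~0.2: the mean recovers the small true signal
print(np.median(grads))  # +1.0: robust, but badly biased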
Median based aggregators: experimental failure
[Experiment plot: training with median-based aggregators degrades even with no attackers!]
Robust aggregator: Definition
(𝛿max, c)-robust aggregator. For any 𝛿 < 𝛿max, given {x_1, …, x_n} such that a (1 − 𝛿) fraction (the "good" inputs) satisfies
$\mathbb{E}\|x_i - x_j\|^2 \le \rho^2$ for any fixed pair of good points $i, j$,
the aggregator outputs $x_{\mathrm{out}}$ with error
$\mathbb{E}\|x_{\mathrm{out}} - \bar{x}_{\mathrm{good}}\|^2 \le c\,\delta\rho^2$.
𝛿max controls robustness, c controls accuracy.
Sanity checks: if 𝛿 = 0 we expect the exact mean (use the average); if ρ = 0 the good inputs coincide (use the majority). An error of order 𝛿ρ² is the best we can do.
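To make the definition concrete, here is a small sketch (my own attack and parameter choices) that estimates the aggregation error E||x_out − x̄_good||² empirically:

import numpy as np

def aggregation_error(aggregator, n=100, delta=0.1, d=5, trials=200):
    """Empirical E||aggregator(inputs) - mean(good inputs)||^2 under a
    crude 'send huge values' attack on a delta fraction of inputs."""
    rng = np.random.default_rng(0)
    n_bad = int(delta * n)
    errs = []
    for _ in range(trials):
        good = rng.normal(0.0, 1.0, size=(n - n_bad, d))
        bad = np.full((n_bad, d), 1e6)
        out = aggregator(np.concatenate([good, bad]))
        errs.append(np.sum((out - good.mean(axis=0)) ** 2))
    return float(np.mean(errs))

print(aggregation_error(lambda xs: xs.mean(axis=0)))        # average: huge error
print(aggregation_error(lambda xs: np.median(xs, axis=0)))  # median: small error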
Constructing a robust aggregator: centered clipping
Suppose we are given some inputs {x_1, …, x_n}, a "guess" v of the mean, and a clipping threshold 𝝉.
Clip all inputs to within distance 𝝉 of the guess, then average:
$x_{\mathrm{out}} = v + \frac{1}{n}\sum_{i=1}^{n} (x_i - v)\,\min\Big(1, \frac{\tau}{\|x_i - v\|}\Big)$
The output can be re-used as the next guess and the procedure iterated.
Theorem. Centered clipping is a robust aggregator with 𝛿max = 0.15 and c = O(1) for an arbitrary starting point v. If v is good, 1 iteration suffices. [Karimireddy et al. ICML 2021]
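A minimal implementation sketch of the formula above (the iteration count and threshold here are illustrative choices, not the paper's tuning):

import numpy as np

def centered_clip(xs, v=None, tau=10.0, iters=3):
    """xs: (n, d) worker updates; v: initial guess of the mean (default 0).
    Each input is shrunk to lie within distance tau of the current
    center, and the clipped points are averaged to re-center."""
    n, d = xs.shape
    v = np.zeros(d) if v is None else v
    for _ in range(iters):
        diffs = xs - v
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (scale * diffs).mean(axis=0)  # average of clipped inputs
    return v

# Example: 85 honest updates near 1.0, 15 attackers sending huge values.
xs = np.concatenate([np.random.normal(1.0, 0.5, size=(85, 10)),
                     np.full((15, 10), 1e9)])
print(centered_clip(xs))  # stays near the honest mean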
Part II: Necessity of history in i.i.d. federated learning
Necessity of history: ideal update
Assume for this part that the workers' data is i.i.d.
Necessity of history: approximate update
In practice, each aggregation step incurs an error e.
[Figure: the per-step errors e accumulate into a net error that drives the iterates away from the ideal path.]
Necessity of history: theorem
This accumulation is impossible to avoid for any algorithm which is "memory-less", i.e. whose output doesn't change if we permute the inputs.
Theorem: The output x of any permutation-invariant algorithm necessarily suffers an error that does not vanish with more rounds. [Karimireddy et al. ICML 2021]
Necessity of history: experiment
Using history
Using history: idea
Honest noise is independent across rounds => averaging over time reduces its magnitude.
Time-coupled attacks don't get smaller under averaging => they become easier to identify.
Using history: momentum
Worker momentum effectively averages each worker's gradients over history (an exponential moving average):
$m_i^{(t)} = (1-\beta)\, g_i^{(t)} + \beta\, m_i^{(t-1)}$
Momentum reduces variance but introduces bias. Does it still converge?
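A sketch of how the pieces fit together (hypothetical names; `aggregate` is any robust aggregator such as centered clipping): each worker smooths its own gradients over rounds, and only the smoothed vectors are aggregated.

def robust_momentum_step(params, worker_grads, momenta, aggregate,
                         beta=0.9, lr=0.1):
    """params: (d,) model; worker_grads, momenta: (n, d) numpy arrays.
    Returns the updated model and the updated per-worker momenta."""
    momenta = (1 - beta) * worker_grads + beta * momenta  # average over history
    update = aggregate(momenta)   # robust aggregation of the momenta
    return params - lr * update, momenta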
Using history: convergence theory
Theorem: Given any (𝛅max, c)-robust aggregator and a Byzantine robust problem with a 𝛅-fraction of attackers and 𝞂² gradient variance, our algorithm outputs x_out with an error floor of order c𝛅𝞂², matching the lower bound. [Karimireddy et al. ICML 2021]
Using history: experiment
Summary
Centered clipping satisfies the (𝛿max, c)-robust aggregator definition.
Part III: Over-parameterization for non-i.i.d. FL
Heterogeneity in FL: definition
In heterogeneous FL, each worker i optimizes its own loss f_i over its own data distribution.
Bounded heterogeneity: $\frac{1}{n}\sum_{i}\|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2$ for all x, where $f = \frac{1}{n}\sum_i f_i$.
Heterogeneity in FL: challenge
An honest worker with unusual data looks exactly like an attacker: from the updates alone, it is impossible to tell.
Heterogeneity in FL: impossibility
Theorem: For any algorithm, there exists a strongly convex problem for which a (1 − 𝛅) fraction of users satisfy bounded heterogeneity ζ², but the algorithm's output necessarily has an error of order 𝛅ζ². [Karimireddy et al. ICLR 2022]
Unlike the i.i.d. case, convergence to the optimum seems impossible with non-i.i.d. data. Possible workaround?
Overparameterization in FL
Old assumption: heterogeneity is uniformly bounded by a constant ζ².
New assumption: an overparameterized model can fit all workers simultaneously, so heterogeneity vanishes at the optimum.
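One plausible way to formalize the two assumptions, based on my reading of [Karimireddy et al. ICLR 2022] (the exact form used on the slide may differ):

% Old: heterogeneity bounded by a constant, uniformly over x.
\frac{1}{n}\sum_{i=1}^{n}\bigl\|\nabla f_i(x)-\nabla f(x)\bigr\|^2 \;\le\; \zeta^2 \quad \forall x

% New (overparameterized): heterogeneity scales with the global gradient,
% and hence vanishes at any stationary point.
\frac{1}{n}\sum_{i=1}^{n}\bigl\|\nabla f_i(x)-\nabla f(x)\bigr\|^2 \;\le\; B^2\bigl\|\nabla f(x)\bigr\|^2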
Using overparameterization: convergence theory
Theorem: Given any (𝛅max, c)-robust aggregator and a Byzantine robust overparameterized problem with a 𝛅-fraction of attackers and 𝞂² variance, our algorithm converges to the optimum, removing the 𝛅ζ² error floor. [Karimireddy et al. ICLR 2022]
Using overparameterization: experiment
[Plot: test loss vs. rounds. Centered clipping, under attack, for different model sizes.]
Convergence improves with increasing model size.
Using overparameterization: intuition
Part IV: Decentralized robustness
Decentralized learning
Decentralized SGD
In each step, every node takes a local gradient step and averages its parameters with those of its neighbors in the communication graph.
[Figure: five nodes x1–x5 connected in a graph, exchanging parameters.]
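A minimal sketch of one such round (my illustration; W encodes the topology):

import numpy as np

def dsgd_step(params, grads, W, lr=0.1):
    """params, grads: (n, d) per-node parameters and gradients.
    W: (n, n) symmetric doubly-stochastic mixing matrix with W[i, j] > 0
    only if nodes i and j are neighbors."""
    half_step = params - lr * grads  # local SGD step at every node
    return W @ half_step             # gossip averaging with neighbors

# Example: 3 nodes on a path graph with Metropolis mixing weights.
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])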
[Figure: the same graph, with node x4 replaced by a Byzantine attacker.]
Byzantine robust decentralized learning
Can we do decentralized training in the presence of Byzantine agents?
Byzantine robust decentralized learning
New properties: there is no trusted server, and nodes only see their immediate neighbors, so an attacker's influence depends on its position
=> topology is important.
Self-centered clipping
Each node clips its neighbors' values toward its own parameters, i.e. uses v = x_i as the clipping center.
Self-centered clipping
Theorem. Self-centered clipping converges to a radius around the optimum that depends on 𝛅 and the spectral gap γ of the communication graph. [He et al. arXiv 2022]
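A minimal sketch of one gossip round (my illustration, reusing the centered-clipping scaling from Part I):

import numpy as np

def self_centered_clip_round(params, W, tau=1.0):
    """params: (n, d) per-node parameters; W: (n, n) mixing weights.
    Node i clips each neighbor's parameters toward its own x_i before
    mixing, so no neighbor can pull it further than tau per round."""
    n, d = params.shape
    new_params = np.empty_like(params)
    for i in range(n):
        diffs = params - params[i]  # x_j - x_i for every node j
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        new_params[i] = params[i] + W[i] @ (scale * diffs)
    return new_params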
Challenge of information bottlenecks
Consider 2 cliques connected through 1 bad actor.
The bad node can simulate an entire fake clique: from clique 1's perspective it is impossible to tell who is real / who is telling the truth.
Theorem. For any algorithm, there exists a decentralized problem with an O(1/n) fraction of attackers, controlling an O(1/n²) fraction of edges, on which the error is bounded away from zero. [He et al. arXiv 2022]
Conclusion
Future direction: from Byzantine (worst-case) agents to strategic (rational) agents.
Thank you!
I sit at SDH 7, reach out!