SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, Ananda Theertha Suresh
Tight analysis of FedAvg when clients are heterogeneous (non-iid data)
Explain the degradation of FedAvg via the 'drift' in the client updates
Prove SCAFFOLD is resilient to heterogeneity and client sampling
Federated Learning: Setting
[Diagram: a central server (e.g. Google) holds the model; many clients (e.g. hospitals, phones) hold the data.]
[McMahan et al. 2016]
Federated Learning: Setting
In each round, the server sends the current model x to the clients; each client computes an update y on its local data and sends it back; the server aggregates the clients' updates into a new model x.
[McMahan et al. 2016]
Federated Learning: Characteristics
> The model is trained centrally, but data never leaves the clients
> Client data is heterogeneous (non-iid)
> Only a subset of clients participates in each round
> Communication rounds are the main cost
Federated Learning: Formalism
Minimize, over the model parameters, a (weighted) average over clients of the expected loss on each client's data.
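In symbols, a sketch with notation assumed here (client weights w_i summing to one, client data distribution D_i, per-sample loss ℓ_i):

```latex
% Federated objective: a (weighted) average over clients of each client's
% expected loss over its own data; x are the model parameters.
\[
\min_{x \in \mathbb{R}^d} \; f(x) := \sum_{i=1}^{N} w_i \, f_i(x),
\qquad
f_i(x) := \mathbb{E}_{z_i \sim \mathcal{D}_i}\bigl[\ell_i(x; z_i)\bigr].
\]
```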
Algorithms for Federated Learning
Solving FL: SGD
Mini-batch gradient with batch size K per client, averaged over all sampled clients.
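A sketch of the corresponding update, assuming η is the step size, S the sampled clients, and z_{i,1}, ..., z_{i,K} the batch drawn by client i:

```latex
% One round of mini-batch SGD for FL: each sampled client i computes a
% batch-of-K gradient at the current server model x; the server averages
% these over the sampled clients S and takes one step of size \eta.
\[
x \;\leftarrow\; x \;-\; \eta \cdot \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}}
\frac{1}{K} \sum_{k=1}^{K} \nabla \ell_i(x;\, z_{i,k}),
\qquad z_{i,k} \sim \mathcal{D}_i .
\]
```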
Solving FL: FedAvg
Each sampled client repeats K local SGD steps on its own data; the server then averages the results over all sampled clients.
[McMahan et al. 2016]
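A minimal NumPy sketch of this loop; `stochastic_grad(i, x)` is a hypothetical stand-in for client i's data pipeline, not part of the original algorithm statement:

```python
# Minimal NumPy sketch of FedAvg under the setting above.
# `stochastic_grad(i, x)` returns an unbiased stochastic gradient of
# client i's loss at parameters x (a NumPy array).
import numpy as np

def fedavg(x0, stochastic_grad, num_clients, rounds, clients_per_round,
           local_steps_K, local_lr, server_lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(rounds):
        sampled = rng.choice(num_clients, size=clients_per_round, replace=False)
        deltas = []
        for i in sampled:
            y = x.copy()
            for _ in range(local_steps_K):          # K local SGD steps
                y -= local_lr * stochastic_grad(i, y)
            deltas.append(y - x)                    # client sends its model change
        x += server_lr * np.mean(deltas, axis=0)    # server averages the changes
    return x
```

With server_lr = 1 this reduces to plain model averaging over the sampled clients.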
Solving FL: SCAFFOLD
Each sampled client again repeats K local steps, but with a new correction term added to the local gradient.
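A sketch of the corrected local step, with notation assumed here (g_i a stochastic gradient of client i's loss, c_i and c the client and server control variates, η_l the local step size):

```latex
% FedAvg local step:    y_i <- y_i - \eta_l g_i(y_i)
% SCAFFOLD local step:  add the correction (c - c_i)
\[
y_i \;\leftarrow\; y_i - \eta_l \bigl( g_i(y_i) - c_i + c \bigr),
\qquad \text{repeated } K \text{ times, starting from } y_i = x .
\]
```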
When does FedAvg fail?
FedAvg degrades with heterogeneous clients. Why?
Client updates: SGD vs. FedAvg
[Figure: surface of two client loss functions and the combined function, with its optimum marked.]
Client updates: SGD
The limit point of SGD is the optimum.
Client updates: FedAvg
Each client's local steps drift toward that client's own optimum, so the averaged model ends up away from the global optimum.
Drift in client updates: SGD vs. FedAvg
Convergence Rates: SGD
“the noise is in the noise and SGD don't care” [Chaturapruek et al. 2015]
Convergence Rates: SGD vs. FedAvg
Assume: (B,G)-similar gradients
Generalizes [Li et al. 2019] and [Khaled et al. 2019]
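One way to state the assumption (bounded gradient dissimilarity); the exact form of the constants may differ slightly from the slides:

```latex
% (B, G)-similar gradients (bounded gradient dissimilarity): for all x,
\[
\frac{1}{N} \sum_{i=1}^{N} \bigl\| \nabla f_i(x) \bigr\|^2
\;\le\; G^2 + B^2 \bigl\| \nabla f(x) \bigr\|^2 .
\]
% (B, G) = (1, 0) is the homogeneous (iid) case; larger G or B means
% more heterogeneous clients.
```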
[Table: convergence rates of FedAvg vs. SGD under (B,G)-similar gradients.]
Tightest rates for FedAvg; the analysis uses both server and client step-sizes.
Lower bound: FedAvg
Assume: (B,G)-similar gradients
The heterogeneity term G is necessary!
Theorem: For any G, we can find functions with (2, G)-similar gradients such that FedAvg with K > 1 local steps and arbitrary step-sizes always has error that grows with G.
Quick demo: SGD vs. FedAvg
> FedAvg needs smaller learning rate
> Slower than SGD
[Plot: lower is better.]
SCAFFOLD: stochastic controlled averaging
Main Idea: Use control variates
Main Idea: Corrected updates
Add correction terms to the local gradient so that each local step mimics a centralized update!
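A sketch of why this works, assuming the control variates track the gradients at the current server model x:

```latex
% If the control variates track the gradients at the server model x,
%   c_i \approx \nabla f_i(x)  and  c \approx \nabla f(x),
% then the corrected local direction is approximately the global gradient:
\[
g_i(y_i) - c_i + c
\;\approx\; \nabla f_i(y_i) - \nabla f_i(x) + \nabla f(x)
\;\approx\; \nabla f(y_i),
\]
% where the last step uses that client Hessians are similar and y_i stays
% close to x, so each local step mimics a centralized gradient step.
```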
Main Idea: Updating control variates
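A sketch of the updates, roughly following the paper's Algorithm 1 (S = sampled clients, N = total clients, η_g = server step size):

```latex
% After the K local steps take y_i from x to its final value, client i
% refreshes its control variate in one of two ways:
\[
c_i^{+} =
\begin{cases}
g_i(x), & \text{Option I: a fresh gradient at the server model } x,\\[4pt]
c_i - c + \dfrac{1}{K \eta_l}\,(x - y_i), & \text{Option II: reuse the local steps.}
\end{cases}
\]
% The server then aggregates over the sampled clients S (N = total clients):
\[
x \;\leftarrow\; x + \frac{\eta_g}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} (y_i - x),
\qquad
c \;\leftarrow\; c + \frac{1}{N} \sum_{i \in \mathcal{S}} \bigl( c_i^{+} - c_i \bigr).
\]
```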
SCAFFOLD: Algorithm
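Putting the pieces together, a NumPy sketch using the Option II control variate and the same hypothetical `stochastic_grad` helper as in the FedAvg sketch above:

```python
# NumPy sketch of SCAFFOLD (Option II control variates), mirroring the
# FedAvg sketch; `stochastic_grad(i, x)` is the same hypothetical helper.
import numpy as np

def scaffold(x0, stochastic_grad, num_clients, rounds, clients_per_round,
             local_steps_K, local_lr, server_lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    c = np.zeros_like(x)                                  # server control variate
    c_i = [np.zeros_like(x) for _ in range(num_clients)]  # client control variates
    for _ in range(rounds):
        sampled = rng.choice(num_clients, size=clients_per_round, replace=False)
        dx, dc = np.zeros_like(x), np.zeros_like(x)
        for i in sampled:
            y = x.copy()
            for _ in range(local_steps_K):
                # corrected local step: subtract c_i, add c
                y -= local_lr * (stochastic_grad(i, y) - c_i[i] + c)
            c_plus = c_i[i] - c + (x - y) / (local_steps_K * local_lr)  # Option II
            dx += (y - x) / clients_per_round
            dc += (c_plus - c_i[i]) / num_clients
            c_i[i] = c_plus
        x += server_lr * dx     # server model update
        c += dc                 # server control variate update
    return x
```

Each client keeps its own c_i between rounds; only the sampled clients' control variates change in a given round.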
SCAFFOLD: Quick demo
> SCAFFOLD works with same learning rate as SGD
> Faster than SGD!
[Plot: lower is better.]
SCAFFOLD: Client sampling
Different view: SAGA is a special case of SCAFFOLD with client sampling, where each client holds a single data point, one client is sampled per round, and a single local step is taken (K = 1).
SCAFFOLD: Variance-reduced convergence rates
> Better than FedAvg
> Variance-reduced rates!
SCAFFOLD: Why take more than 1 step?
Assume 𝛿-BHD (Bounded Hessian Dissimilarity): every client Hessian stays within 𝛿 of the global Hessian, and each f_i is 𝛿-weakly convex. Note that 𝛿 ≤ 2L always.
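One way to formalize the assumption:

```latex
% \delta-BHD: for all x and all clients i,
\[
\bigl\| \nabla^2 f_i(x) - \nabla^2 f(x) \bigr\| \;\le\; \delta,
\qquad
\nabla^2 f_i(x) \;\succeq\; -\delta I \quad (\text{i.e. } f_i \text{ is } \delta\text{-weakly convex}).
\]
% For L-smooth functions \delta \le 2L, and \delta = 0 when all clients
% share the same Hessian (e.g. quadratics differing only in linear terms).
```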
SCAFFOLD: Why take more than 1 step?
Assume: 𝛿 - BHD, quadratics
SCAFFOLD: Why take more than 1 step?
Assume: 𝛿 - BHD
> Best to take
> We replaced L with 𝛿 in the rates (typically 𝛿 << L)
> First rate to characterize improvement due to local steps!
SCAFFOLD: Why take more than 1 step?
Quick demo on scalar quadratics
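A toy illustration (not the demo from the talk): two scalar quadratic clients with different curvatures, exact gradients, and both clients participating in every round; all constants are chosen arbitrarily:

```python
# Hypothetical toy on two scalar quadratic clients f_i(x) = h_i (x - a_i)^2 / 2
# with different curvatures; exact gradients, both clients in every round.
import numpy as np

h = np.array([1.0, 0.1])             # client curvatures (Hessians)
a = np.array([1.0, -1.0])            # client optima
x_star = np.sum(h * a) / np.sum(h)   # global optimum of f = (f_1 + f_2) / 2

def grad(i, x):                      # exact gradient of client i
    return h[i] * (x - a[i])

def run(method, K, local_lr=0.1, rounds=30):
    x, c, c_i = 0.0, 0.0, np.zeros(2)
    errs = []
    for _ in range(rounds):
        dx, dc = 0.0, 0.0
        for i in range(2):
            y = x
            for _ in range(K):
                corr = (c - c_i[i]) if method == "scaffold" else 0.0
                y -= local_lr * (grad(i, y) + corr)
            if method == "scaffold":
                c_plus = c_i[i] - c + (x - y) / (K * local_lr)  # Option II
                dc += (c_plus - c_i[i]) / 2
                c_i[i] = c_plus
            dx += (y - x) / 2
        x, c = x + dx, c + dc
        errs.append(abs(x - x_star))
    return errs

for name, K in [("fedavg", 10), ("scaffold", 1), ("scaffold", 10)]:
    print(f"{name:9s} K={K:2d}  |x - x*| after 30 rounds: {run(name, K)[-1]:.2e}")
```

With these constants, FedAvg with K = 10 should plateau at a nonzero distance from x*, SCAFFOLD with K = 1 converges but makes little progress per round, and SCAFFOLD with K = 10 converges fastest per round, which is the point of taking more than one local step.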
Experiments
Experimental Setup
Performance of SCAFFOLD
[Plot: similarity = 0, 1 epoch per round, 20 sampled clients out of 400 total; x-axis: communication rounds.]
Effect of similarity
[Plot: test accuracy; 10 epochs per round, 20 sampled clients out of 100 total; x-axis: communication rounds.]
Effect of number of clients
SCAFFOLD with 5 sampled clients is better than FedAvg with 50!
> Total #clients = 400
> Total #categories = 47
> 1 epoch per round
> Similarity = 0
[Plot: x-axis: communication rounds.]
Takeaways
Some new work
Thank You.
Questions?