1 of 55

An Inferential Perspective on FL & Remarks on Data Efficiency of Meta-learning

Maruan Al-Shedivat

Based on joint work with:

Jenny Gillenwater, Liam Li, Afshin Rostamizadeh, Ameet Talwalkar, Eric Xing

FLOW seminar 3/24/2021

2 of 55

Outline & Relevant Papers

  • Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms. With Jenny Gillenwater, Afshin Rostamizadeh, Eric Xing. To appear at ICLR 2021.
  • On Data Efficiency of Meta-learning. With Liam Li, Ameet Talwalkar, Eric Xing. To appear at AISTATS 2021.

Part I (25-30 minutes): focus on standard FL

Part II (10-15 minutes): focus on personalized FL

3 of 55

Part I:

An Inferential Perspective on FL

4 of 55


Solve this problem using FedAvg (local SGD):

  • Optimize the global objective over multiple communication rounds.
  • At each round, a subset of clients runs local optimization and communicates with the server.

Federated Learning (FL) is usually formulated as a distributed optimization problem

local client objectives

global objective
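The objective itself did not survive the export; a standard formulation consistent with the "global objective" and "local client objectives" labels above (an assumed reconstruction, not the slide's exact notation) is:

```latex
% Assumed standard form of the FL objective, matching the labels above.
\min_{\theta}\; F(\theta) \;=\; \sum_{i=1}^{N} q_i\, f_i(\theta),
\qquad
f_i(\theta) \;=\; \mathbb{E}_{z \sim D_i}\big[\ell(\theta; z)\big],
\qquad
q_i \ge 0,\;\; \textstyle\sum_i q_i = 1 .
```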

[Diagram: the server requests a cohort of clients from the client population; selected clients run local SGD.]
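To make the FedAvg / local SGD procedure described above concrete, here is a minimal sketch of one training run. The helpers `sample_clients`, `client.batches`, and `client.loss_grad` are assumed names for illustration, not the authors' implementation:

```python
import itertools
import numpy as np

def fedavg_round(theta_server, cohort, local_steps=10, client_lr=0.1, server_lr=1.0):
    """One communication round of FedAvg: local SGD on each client, then averaging."""
    deltas = []
    for client in cohort:                                 # subset of clients for this round
        theta = theta_server.copy()
        for batch in itertools.islice(client.batches, local_steps):
            theta -= client_lr * client.loss_grad(theta, batch)   # local SGD step
        deltas.append(theta_server - theta)               # delta sent back to the server
    # The server treats the averaged delta as a pseudo-gradient (server_lr = 1 recovers plain averaging).
    return theta_server - server_lr * np.mean(deltas, axis=0)

def fedavg(theta0, client_pool, num_rounds=1000, cohort_size=100):
    theta = theta0
    for _ in range(num_rounds):                           # multiple communication rounds
        theta = fedavg_round(theta, sample_clients(client_pool, cohort_size))
    return theta
```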

5 of 55

Cross-device FL: Modeling Assumptions

10^1-10^3 clients per round

Bonawitz et al. “Towards Federated Learning at Scale: System Design.” arXiv:1902.01046

6 of 55

Cross-device FL: Modeling Assumptions

Bonawitz et al. “Towards Federated Learning at Scale: System Design.” arXiv:1902.01046

communication of the model is often the bottleneck

7 of 55

Cross-device FL: Modeling Assumptions

Bonawitz et al. “Towards Federated Learning at Scale: System Design.” arXiv:1902.01046

The cross-device federated setting:

  • A very large number of clients (1M+) ⇒ clients participate in ≤ 1 training round
  • Data distributions on clients are different ⇒ non-IID setting
  • Increasing # of steps per client is cheap compared to increasing # of rounds due to communication costs
  • Increasing # of clients per round often has a negligible overhead

8 of 55

Solve this problem using FedAvg (local SGD):

  • Optimize the global objective over multiple communication rounds.
  • At each round, a subset of clients runs local optimization and communicates with the server.

Federated Learning (FL) is usually formulated as a distributed optimization problem

local client objectives

global objective

Client-server communication is often slow & expensive. How can we speed up training?


[Diagram: server and client population.]

9 of 55

Solve this problem using FedAvg (local SGD):

  • Optimize the global objective over multiple communication rounds.
  • At each round, a subset of clients runs local optimization and communicates with the server.

Federated Learning (FL) is usually formulated as a distributed optimization problem

local client objectives

global objective

Client-server communication is often slow & expensive. How can we speed up training?

  • To speed up training (x10-100), we can make clients spend more time on local training at each round (e.g., do more local SGD steps) ⇒ make more local progress, thereby reducing the total number of communication rounds.


[Diagram: server and client population.]

10 of 55

Solve this problem using FedAvg (local SGD):

  • Optimize the global objective over multiple communication rounds.
  • At each round, a subset of clients runs local optimization and communicates with the server.

Federated Learning (FL) is usually formulated as a distributed optimization problem

local client objectives

global objective

Client-server communication is often slow & expensive. How can we speed up training?

  • To speed up training (x10-100), we can make clients spend more time on local training at each round (e.g., do more local SGD steps) ⇒ make more local progress, thereby reducing the total number of communication rounds.
  • Because of client data heterogeneity, it turns out that more local computation per round results in convergence to inferior models!


[Diagram: server and client population.]

11 of 55

Convergence Issues: Toy Example (Least Squares in 2D)

Least squares:
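The least-squares expression did not survive the export; a generic per-client form for the 2D toy example (an assumption, not the slide's exact notation) is:

```latex
% Assumed per-client least-squares objective for the 2D toy example.
f_i(\theta) \;=\; \tfrac{1}{2}\,\lVert X_i \theta - y_i \rVert_2^2 ,
\qquad \theta \in \mathbb{R}^2 .
```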

12 of 55

Convergence Issues: Toy Example (Least Squares in 2D)

Least squares:

13 of 55

Convergence Issues: Toy Example (Least Squares in 2D)

Least squares:

14 of 55

Federated Averaging: Fixed Points (for quadratic losses)

Centralized least squares optimum:

FedAvg fixed point (e steps per round):

Takeaways:

  • FedAvg (as well as FedProx and other methods) optimizes a surrogate objective function
  • There is a gap between the optimum of the true objective and the surrogate objective

Proposed fix: an algorithm similar to SCAFFOLD, which uses stateful clients that are revisited throughout the course of training.

Proposed fix: careful tuning of the client and server learning rate schedules that reduce the discrepancy.

15 of 55

Federated Optimization ⇒ Federated Posterior Inference

Federated learning is often formulated as an optimization problem:

expectation over clients

expectation over client data
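Only the annotations of the objective survived the export; an expectation form consistent with the two labels above (an assumption) is:

```latex
% Assumed expectation form of the FL objective, matching the labels above.
\min_{\theta}\;
\mathbb{E}_{i \sim \text{clients}}\Big[\,
\mathbb{E}_{z \sim D_i}\big[\ell(\theta; z)\big]\Big].
```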

16 of 55

Federated Optimization ⇒ Federated Posterior Inference

Federated learning is often formulated as an optimization problem:

If the loss function is the negative log likelihood then the solution of the optimization problem is the maximum likelihood estimator (MLE)

expectation over clients

expectation over client data

17 of 55

Federated Optimization ⇒ Federated Posterior Inference

Federated learning is often formulated as an optimization problem:

If the loss function is the negative log likelihood then the solution of the optimization problem is the maximum likelihood estimator (MLE)

An alternative to MLE is posterior inference: we would like to infer the posterior distribution over the parameters

expectation over clients

expectation over client data

under the uniform prior, posterior mode ≡ MLE

18 of 55

Federated Optimization ⇒ Federated Posterior Inference

Any posterior distribution decomposes into a product of sub-posteriors:
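The decomposition itself is missing from the export; a standard form consistent with the statement above and with the uniform-prior remark earlier in the deck (an assumption) is:

```latex
% Assumed form of the decomposition; the second proportionality uses a uniform prior.
P(\theta \mid D) \;\propto\; P(\theta)\prod_{i=1}^{N} P(D_i \mid \theta)
\;\propto\; \prod_{i=1}^{N} P(\theta \mid D_i).
```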

19 of 55

Federated Optimization ⇒ Federated Posterior Inference

Any posterior distribution decomposes into a product of sub-posteriors:

A high-level algorithm that will attain the global optimum:

  1. On each client, (approximately) infer the local posterior distribution
  2. Communicate information about the inferred local posteriors to the server
  3. Multiplicatively aggregate local posteriors into the global on the server
  4. The mode of the global posterior is the global optimum!

Key point: If we can do this efficiently, then we will have a globally consistent algorithm with stateless clients.

20 of 55

Example: Federated Quadratics

Quadratic objectives are negative log-likelihoods under a Gaussian model:

21 of 55

Example: Federated Quadratics

Quadratic objectives are negative log-likelihoods under a Gaussian model:

The global posterior is a product of Gaussians (⇒ also Gaussian):

22 of 55

Example: Federated Quadratics

Quadratic objectives are negative log-likelihoods under a Gaussian model:

The global posterior is a product of Gaussians (⇒ also Gaussian):

The global optimum is the posterior mode (coincides with the mean for Gaussians):
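The slide's equations are not recoverable from the export; the standard product-of-Gaussians identity they refer to, with local sub-posteriors N(μ_i, Σ_i), is:

```latex
% Mode (= mean) and covariance of a product of Gaussian sub-posteriors N(mu_i, Sigma_i).
\mu \;=\; \Big(\textstyle\sum_{i=1}^{N} \Sigma_i^{-1}\Big)^{-1}
\sum_{i=1}^{N} \Sigma_i^{-1} \mu_i ,
\qquad
\Sigma \;=\; \Big(\textstyle\sum_{i=1}^{N} \Sigma_i^{-1}\Big)^{-1}.
```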

23 of 55

Federated Posterior Averaging

If we can infer the posterior mode, we have solved the problem!

local posterior covariances

local posterior means

Key idea:

  • Estimate moments of the local posteriors on the clients.
  • Communicate this to the server that will infer the global posterior mean.

24 of 55

Federated Posterior Averaging

If we can infer the posterior mode, we have solved the problem!

Note: If posteriors are non-Gaussian, multimodal, etc., the above expression is simply a (federated) Laplace approximation of both local and global posteriors.

local posterior covariances

local posterior means

Key idea:

  • Estimate moments of the local posteriors on the clients.
  • Communicate this to the server that will infer the global posterior mean.

25 of 55

Federated Posterior Averaging

If we can infer the posterior mode, we have solved the problem!

Challenges:

(1) how to infer local posteriors efficiently?

(2) how to do aggregation on the server efficiently?

(3) how to communicate them to the server efficiently?

local posterior covariances

local posterior means

Key idea:

  • Estimate moments of the local posteriors on the clients.
  • Communicate this to the server that will infer the global posterior mean.

26 of 55

Local Posterior Inference: Stochastic Gradient MCMC

JMLR 2017

Key ideas:

  • SGD on the log-likelihood ≈ sampling from the posterior
  • Run SGD on the local objective (long enough for the Markov chain to mix)
  • Keep running SGD, collect approximate posterior samples, and use them to estimate the local posterior mean and covariance (see the sketch below)
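A minimal sketch of what the bullets above describe: run local SGD past a burn-in period and treat thinned iterates as approximate posterior samples. The helper names (`loss_grad`, `client_batches`) and the hyperparameter defaults are assumptions, not the authors' code:

```python
import numpy as np

def sample_posterior_sgd(theta_init, client_batches, loss_grad,
                         lr=0.05, burn_in=50, thin=5, num_samples=20):
    """SGD on the local objective; post-burn-in, thinned iterates serve as
    approximate samples from the local posterior (SG-MCMC-style)."""
    theta = theta_init.copy()
    samples = []
    for step, batch in enumerate(client_batches):
        theta = theta - lr * loss_grad(theta, batch)        # plain local SGD step
        if step >= burn_in and (step - burn_in) % thin == 0:
            samples.append(theta.copy())                    # collect an approximate sample
        if len(samples) == num_samples:
            break
    return np.stack(samples)                                # shape (num_samples, d)
```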

27 of 55

Global Posterior Inference: Computing the Matrix Inverse

denote:

The global posterior mean is the minimizer of the following objective:

Idea: solve it using SGD instead of matrix inverse!

Precisely the server update done by FedAvg, except client Δ’s are different

It also solves our communication problem: clients need to only send some new deltas to the server!
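The notation on this slide did not survive the export; an assumed reconstruction of the construction it describes: the global posterior mean minimizes a quadratic whose per-client gradient terms are exactly the new client deltas, so SGD over clients replaces the matrix inverse.

```latex
% Assumed reconstruction: mu minimizes a quadratic; its per-client gradients are the deltas.
\mu \;=\; \arg\min_{\theta}\;
\frac{1}{2}\sum_{i=1}^{N} (\theta - \mu_i)^{\top} \Sigma_i^{-1} (\theta - \mu_i),
\qquad
\nabla_{\theta} \;=\; \sum_{i=1}^{N}
\underbrace{\Sigma_i^{-1}(\theta - \mu_i)}_{\text{client delta } \Delta_i}.
```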

28 of 55

Final Hurdle: Local Computation of New Deltas

Server updates:

So, we need to be able to compute the deltas efficiently… Note: we cannot even store the local covariance, because it is a d × d matrix (where d is 1M+)

29 of 55

Final Hurdle: Local Computation of New Deltas

Server updates:

So, we need to be able to compute the deltas efficiently… Note: we cannot even store the local covariance, because it is a d × d matrix (where d is 1M+)

Good news: we can compute the deltas using only O(d) memory and O(d) compute per approximate posterior sample!

30 of 55

Final Hurdle: Local Computation of New Deltas

We propose to compute the deltas in two steps:

  1. Use a shrinkage estimator of the covariance [Ledoit & Wolf, 2004]:

31 of 55

Final Hurdle: Local Computation of New Deltas

We propose to compute the deltas in two steps:

  1. Use a shrinkage estimator of the covariance [Ledoit & Wolf, 2004], which can be written in the form of recursive rank-1 updates:

Estimator based on previous (l - 1) samples

Rank-1 update based on the l-th sample
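The estimator's exact expression is not recoverable from the export; a standard shrinkage form with a Welford-style rank-1 recursion, consistent with the two labels above (an assumption), is:

```latex
% Assumed shrinkage estimator and online (Welford-style) rank-1 updates of the
% sample mean and covariance after l samples theta_1, ..., theta_l.
\widehat{\Sigma} \;=\; \rho\, I + (1 - \rho)\, \widehat{S}_L ,
\qquad
\widehat{\mu}_l \;=\; \widehat{\mu}_{l-1} + \tfrac{1}{l}\,(\theta_l - \widehat{\mu}_{l-1}),
\qquad
\widehat{S}_l \;=\; \tfrac{1}{l-1}\Big[(l-2)\,\widehat{S}_{l-1}
  + (\theta_l - \widehat{\mu}_{l-1})(\theta_l - \widehat{\mu}_l)^{\top}\Big].
```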

32 of 55

Final Hurdle: Local Computation of New Deltas

We propose to compute the deltas in two steps:

  1. Use a shrinkage estimator of the covariance [Ledoit & Wolf, 2004].
  2. Compute the deltas exactly and online using dynamic programming (20 lines of code!) as we keep sampling from the local posterior using SG-MCMC.

The per-sample update requires a constant number of vector-vector multiplies, i.e., O(d) compute and O(d) memory on the client.
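To make the delta computation concrete, here is a minimal sketch that computes Δ = Σ̂⁻¹(θ − μ̂) for the shrinkage estimate (the delta form reconstructed earlier). For simplicity it applies the Woodbury identity to the whole batch of samples rather than the paper's streaming dynamic program, so the cost is O(d·L²) for L samples; names and defaults are assumptions:

```python
import numpy as np

def fedpa_delta(theta_server, samples, rho=0.1):
    """Delta = Sigma_hat^{-1} (theta_server - mu_hat), with the shrinkage estimate
    Sigma_hat = rho * I + (1 - rho) * S built from approximate posterior samples
    (rows of `samples`). Uses the Woodbury identity to avoid forming any d x d matrix."""
    samples = np.asarray(samples)                  # shape (L, d)
    L, _ = samples.shape
    mu_hat = samples.mean(axis=0)                  # local posterior mean estimate
    v = theta_server - mu_hat
    if L < 2 or rho >= 1.0:
        return v / max(rho, 1e-12)                 # identity-covariance fallback (FedAvg-like delta)
    B = (samples - mu_hat).T / np.sqrt(L - 1)      # low-rank factor: S = B @ B.T
    U = np.sqrt(1.0 - rho) * B                     # Sigma_hat = rho * I + U @ U.T
    small = rho * np.eye(L) + U.T @ U              # (L, L) system instead of (d, d)
    return (v - U @ np.linalg.solve(small, U.T @ v)) / rho
```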

33 of 55

Federated Posterior Averaging

If we can infer the posterior mode, we have solved the problem!

Challenges:

(✓) how to infer local posteriors efficiently?

(✓) how to do aggregation on the server efficiently?

(✓) how to communicate them to the server efficiently?

local posterior covariances

local posterior means

Key idea:

  • Estimate moments of the local posteriors on the clients.
  • Communicate this to the server that will infer the global posterior mean.

34 of 55

Federated Posterior Averaging (FedPA): The Algorithm

On the server:

  1. Distribute the initial state to clients
  2. Collect & average deltas from clients
  3. Take a gradient step:

On the clients:

  1. Run SGD-based MCMC
  2. As new samples arrive, keep computing deltas
  3. Send the final deltas to the server

Similar to FedOpt, we run SGD on the clients, but compute deltas differently.

Identical to (adaptive) FedOpt! [Reddi*, Charles*, et al., ICLR 2021]

Note: our new perspective suggests that known federated optimization algorithms are doing posterior inference under the Laplace approximation and estimating local covariances using the identity matrix
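Putting the pieces together, a minimal sketch of one FedPA round, reusing `sample_posterior_sgd` and `fedpa_delta` from the earlier sketches (`client.batches` and `client.loss_grad` are assumed accessors; this is not the authors' implementation):

```python
import numpy as np

def fedpa_round(theta_server, cohort, server_lr=1.0, rho=0.1):
    """One FedPA round: clients sample their local posteriors and return deltas;
    the server averages them and takes a gradient step (same rule as FedAvg/FedOpt)."""
    deltas = []
    for client in cohort:                                              # sampled cohort for this round
        samples = sample_posterior_sgd(theta_server, client.batches, client.loss_grad)
        deltas.append(fedpa_delta(theta_server, samples, rho=rho))     # ~ Sigma^{-1}(theta - mu)
    return theta_server - server_lr * np.mean(deltas, axis=0)          # server update step
```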

35 of 55

Back to our Toy Example

the “noise barrier”

36 of 55

FedAvg vs. FedPA: Bias and Variance of the Server Updates

The plots are for 10D federated least squares regression on multiple synthetically generated datasets.

FedAvg: bias is reduced by reducing the amount of local computation

FedPA: bias is reduced by increasing the amount of local computation

37 of 55

FedAvg vs. FedPA: Federated CIFAR100

  • Task: 100 class image classification, 500 clients (model: ResNet-18).
  • We “burn-in” FedPA by running it in the FedAvg regime for 400 rounds.
  • Starting round 400, we switch to FedPA computation of client deltas.

stands for multi-epoch

38 of 55

FedAvg vs. FedPA: Federated StackOverflow LR

  • Task: 500 class multi-label classification, bag of words features, 300K+ clients.
  • We “burn-in” FedPA by running it in the FedAvg regime for 800 rounds.
  • Starting round 800, we switch to FedPA computation of client deltas.

stands for multi-epoch

39 of 55

Concluding Thoughts for Part I

  • Federated learning can be approached as a probabilistic inference problem, which allows us to design new efficient FL algorithms + re-interpret well-known FedAvg
  • Bayesian ML/DL is typically used for quantification of predictive uncertainty. Turns out, it is also quite useful in distributed, communication-limited settings.

40 of 55

Part II:

On Data Efficiency of Meta-learning

(for personalized FL)

41 of 55

The goal of FL is not always to learn a global model

A simple way to personalize models:

  • Learn a model using FL (e.g., FedAvg)
  • At test time, fine-tune the model on the available labeled data

Bottom line: models obtained by FedAvg are often worse than purely local training for the majority of the clients ⇒ need model personalization

word prediction

image classification

42 of 55

The goal of FL is not always to learn a global model

Bottom line: models obtained by FedAvg are often worse than purely local training for the majority of the clients ⇒ need model personalization

word prediction

image classification

A simple way to personalize models:

  • Learn a model using FL (e.g., FedAvg)
  • At test time, fine-tune the model on the available labeled data

In the few-shot learning literature, this method is known as “Reptile:”

43 of 55

Background: Model-agnostic Meta-learning

  • Idea: learn a good initialization for stochastic gradient descent (SGD)

Finn et al., ICML 2017

44 of 55

Background: Model-agnostic Meta-learning

  • Idea: learn a good initialization for stochastic gradient descent (SGD)
  • Given a new task, produce a model using a gradient step (see the sketch below)
  • Meta-training is learning an initialization:
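The inline symbols on this slide were lost in the export; the standard MAML formulation they refer to (Finn et al., 2017) is:

```latex
% One-step MAML: adapt the initialization theta to task tau, then meta-learn theta.
\phi_{\tau} \;=\; \theta - \alpha\, \nabla_{\theta}\, \mathcal{L}_{\tau}(\theta)
\qquad\text{(adaptation)},
\qquad
\min_{\theta}\; \mathbb{E}_{\tau}\big[\mathcal{L}_{\tau}(\phi_{\tau})\big]
\qquad\text{(meta-training)}.
```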

Intuition:

Finn et al., ICML 2017

45 of 55

The Question of Interest

Why?

  • It matters in many practical settings (e.g., personalized federated learning, where the number of clients, the amount of data, and compute are all limited)

How can we characterize the data efficiency of modern meta-learning algorithms?

How?

  1. Theoretically: we derive generalization bounds for modern meta-learning
  2. Empirically: we confirm the predictions of our theory through experiments

46 of 55

Theoretical Analysis: Objective Functions

1. Empirical estimator of the transfer risk: task data is NOT split into subsets. Example methods: FedAvg/Reptile.

2. Hold-out estimator of the transfer risk: task data is split into a support set (used in the inner loop) and a query set (used in the outer loop). Example methods: MAML, ProtoNets.
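For concreteness, generic forms of the two estimators (an assumption, not the paper's exact notation; A(θ, S) denotes the inner-loop adaptation of initialization θ on data S, and L(φ, S) the loss of model φ on S):

```latex
% Assumed generic forms of the two transfer-risk estimators over T training tasks.
\widehat{R}_{\mathrm{emp}}(\theta) \;=\; \frac{1}{T}\sum_{t=1}^{T}
\mathcal{L}\big(\mathcal{A}(\theta, D_t),\, D_t\big),
\qquad
\widehat{R}_{\mathrm{ho}}(\theta) \;=\; \frac{1}{T}\sum_{t=1}^{T}
\mathcal{L}\big(\mathcal{A}(\theta, D_t^{\mathrm{s}}),\, D_t^{\mathrm{q}}\big).
```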

47 of 55

Theoretical Analysis: Meta-generalization Bounds

  1. The following bound holds for methods that optimize the empirical estimator (FedAvg/Reptile):

48 of 55

Theoretical Analysis: Meta-generalization Bounds

  • The following bound holds for methods that optimize the empirical estimator (FedAvg/Reptile):
  • The following bound holds for methods that optimize the hold-out estimator (MAML, ProtoNets):

49 of 55

Practical Implications

O1: As the # of tasks goes to infinity, the generalization error of hold-out-estimator methods goes to zero.

MAML vs. ProtoNets must be indistinguishable when trained on many tasks.

Performance of MAML and ProtoNets as a function of the # of training tasks:

  • The methods are identical when trained on a very large number of tasks (i.e., classical benchmarks).
  • Clear difference between the two in the regime of limited supervision.

50 of 55

Practical Implications

O1: As the # of tasks goes to infinity, the generalization error of hold-out-estimator methods goes to zero.

MAML vs. ProtoNets must be indistinguishable when trained on many tasks.

O2: FedAvg/Reptile (and other empirical-estimator methods) have an additive term in their bound that depends on the # of samples per task ⇒ worse performance unless a sufficient number of samples per task is provided at meta-training time.

51 of 55

Practical Implications

O1: As the # of tasks goes to infinity, the generalization error of hold-out-estimator methods goes to zero.

MAML vs. ProtoNets must be indistinguishable when trained on many tasks.

O2: FedAvg/Reptile (and other empirical-estimator methods) have an additive term in their bound that depends on the # of samples per task ⇒ worse performance unless a sufficient number of samples per task is provided at meta-training time.

O3: Meta-generalization depends on the algorithmic stability constants of the inner and outer loops ⇒ actively selecting labeled data in the inner loop may improve meta-generalization.

52 of 55

Practical Implications

53 of 55

To Learn More

  • Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms. With Jenny Gillenwater, Afshin Rostamizadeh, Eric Xing. To appear at ICLR 2021.
  • On Data Efficiency of Meta-learning. With Liam Li, Ameet Talwalkar, Eric Xing. To appear at AISTATS 2021.

Thank you!

Questions?

54 of 55

(blank)

55 of 55

Additional Results for Part I