1 of 77

An Introduction to Federated Computation

Akash Bharadwaj, Graham Cormode
Meta AI
{akashb, gcormode}@fb.com

https://akashb94.wixsite.com/website

2 of 77

Tutorial overview

  1. Introduction and Motivation
  2. Technical Preliminaries and Primitives
  3. Federated Algorithms
  4. Practical Considerations
  5. Advanced topics and open problems


3 of 77

Disclaimer

This tutorial is a survey of published work on federated computation.

It does not disclose non-public information about practices and deployments at Meta.

Images in this tutorial are sourced from Wikimedia Commons and attributed accordingly in the notes.

Full references are in the slides, available from the tutorial website https://akashb94.wixsite.com/website


4 of 77

Introduction and Motivation


5 of 77

What’s the catch with data today?

There is great sensitivity around handling private data:

  • Medical information: diagnoses, medication history, family history…
  • Financial information: salary, savings, debts, payment history…
  • Personal information: interests, political and religious beliefs, sexuality…

There are increasingly many legal protections on the collection and processing of personal data:

  • GDPR: Europe-wide data protection regulation enforced since 2018
  • ePD: ePrivacy regulation on electronic communications in Europe, ongoing
  • CCPA: California Consumer Privacy Act, similar to GDPR, since 2020

Companies want to differentiate their offerings by giving strong privacy guarantees

  • E.g., End-to-end encryption in messaging, Apple’s iOS 14

BUT … Modern data applications benefit from being informed by personal data:

  • Improving service quality by gathering statistics on usage patterns
  • Providing relevant advertising and helpful recommendations to users


6 of 77

What is Federated Computation?

Like MapReduce for highly decentralized data, with privacy built in

  • Storage is massively distributed (potentially billions of user devices)
  • Compute instructions are sent to where the data lives
  • Users own and keep their data => consent, privacy, and security are first-order concerns
  • Non-standard and limited bandwidth/compute/memory availability on nodes
  • Intermittent node availability
  • Highly ephemeral/unbalanced/non-stationary data

7 of 77

Privacy Principles

Privacy principles to build systems that offer these benefits while providing strong privacy guarantees:

  • Data minimization: Gather the least amount of data necessary to perform the service
  • Purpose limitation: Ensure that data gathered is for a specific, limited purpose
  • Ephemerality: Don’t retain data once its purpose has been met
  • Access controls and security: Restrict access to data
  • Aggregation: Analyze and store population level aggregates rather than individual data
  • Anonymization: Ensure data cannot be traced back to individual users
  • Plausible deniability: Introduce uncertainty about the accuracy of each user’s data disclosures


8 of 77

Private Computation Models

Different approaches to providing privacy address different scenarios and tradeoffs:

  • Distributed vs. centralized: Is data gathered centrally, or processed by distributed entities?
  • Syntactic vs. Statistical: Is data protected by enforcing rules, or by adding random noise?
  • Type of data handled: Tabular, graph, textual, image, video…
  • Type of task handled: materializing statistics, training a model…
  • Granularity of privacy: Record vs. device vs. user vs. group level privacy
  • Batch vs. stream: Is computation being performed on a finite/unchanging dataset?
  • Longitudinal vs. cross-sectional: Are statistics about users being tracked over time?
  • Scale: How many users contribute their data and how much data do they possess?
  • User availability: Can data disclosures be obtained on-demand uniformly at random?


9 of 77

Buffet of privacy technologies

[Figure: schematic of a federated round. The server checks client eligibility and sends inputs to clients; each client runs an on-device algorithm over on-device data and sends an ephemeral report (plus noise) to an aggregator (Σ), which feeds the server task. Overlaid privacy technologies: (1) on-device data management and manipulation; (2) prevention of memorization (local/distributed DP); (3) federated algorithms to minimize data; (4) anonymous channels; (5) secure aggregation (SMC, central DP, secure enclaves) of ephemeral reports.]

10 of 77

The Federated Approach

The federated approach to computation aims to support privacy requirements:

  • Data remains under the control of user clients (e.g., on their phone)
  • Only a small amount of necessary data is shared with central servers
  • Communication is done under strong security guarantees (e.g., encryption)
  • Additional privacy guarantees are provided (e.g., via adding random noise, anonymous communication channels, secure aggregation etc.)

So federated computation is secure, private and distributed (but not fully decentralized)

This builds on prior work that achieves subsets of these properties


11 of 77

The Federated Model

Horizontal federation: each client’s data has the same schema, from one or more individual users

  • E.g., each client is a mobile device with one individual’s browsing history
  • E.g., each client is a bank that holds multiple users’ financial data

Vertical federation: each client has data on different features of a common group of individual users

  • E.g., each client is a hospital department (cardiology, oncology) with specific information
  • E.g., each client is an advertiser with data on how users have interacted with their ads

Our focus in this tutorial is on the horizontally federated case with a coordinating server


[Figure: in horizontal federation, clients hold the same features for different individuals (rows of the data); in vertical federation, clients hold different features (columns) for the same individuals.]

12 of 77

Challenges for horizontal federation

Scale of the data involved:

  • Volumes of data are increasingly large, e.g., video data used to train ML models
  • It is not always practical, possible, or legal to gather such data from many distributed clients
  • It is more efficient to process data at rest to extract a few necessary features

The need to offer strong privacy guarantees:

  • Gathering a large amount of sensitive data creates risks during communication and storage
  • The client’s response to a query can be less revealing than the data used to answer it
  • Knowing that data remains on their device is a powerful and intuitive privacy promise to users


13 of 77

Federation Scenarios

Small: A few organizations (2–10) cooperate in a federated setting, each representing many users

  • Canonical example: hospitals pool medical data to better understand a disease

Medium: Several organizations (tens–hundreds) cooperate, each representing many users

  • Canonical example: banks share information on financial activity to find patterns of fraud

Large: Many entities (thousands–millions) cooperate, each representing a single user

  • Canonical example: an app on smartphones gathers statistics to improve performance

Computational power, access and resources vary: we expect a smartphone to be weaker than a bank!

Most focus has been on the medium to large scale, particularly for federated analytics


14 of 77

Types of Federated Computation

Federated learning refers to the task of training ML models over distributed data

  • Each client receives a candidate model, and proposes improvements from their data
  • A vast and hot topic in ML research, with >10K papers per year!

Federated analytics has recently emerged as a distinct topic of study

  • Clients collaborate to describe properties of the data distribution
  • An emerging topic, building on prior work on privacy and data summarization

This tutorial is focused on federated analytics (FA) while touching on federated learning (FL)

Many excellent resources available that dive deep into FL, e.g., sites.google.com/view/fl-tutorial/


15 of 77

The costs of federated computing

Does federation deepen the digital divide?

  • Participating in federated computing may require access to more powerful devices
  • Federated systems may prefer faster, more capable devices and overrepresent these

Is federated computation more costly overall?

  • Computation on many weak devices can be more energy intensive than powerful servers
  • Federation may require many rounds of interaction with client devices, so slower overall
    • One response is to design new protocols with few rounds, or even a single round, of interaction

Quantifying and tackling these issues forms open questions for future federated research…


16 of 77

Federated Computation Summary

  • Privacy-preserving computations distributed over collections of users at web scale
    • Akin to “a privacy-preserving MapReduce”: clients are mappers, servers are reducers
  • Data stays on client devices, only sufficient statistics are shared in adherence to privacy principles
  • Additional privacy comes from (local or central) differential privacy and secure multiparty computation
  • Federated Computation is increasingly widely adopted for ML and analytics applications:
    • Federated Learning: chiefly concerned with training machine learning models (via gradient descent)
    • Federated Analytics: gathering statistics and aggregates over client populations


17 of 77

Technical Preliminaries


18 of 77

Privacy and security formalizations

Privacy protects the output of the computation, while security protects the intermediate values

Various different formalizations of privacy and security can be instantiated in the federated setting:

  • Data minimization
  • Secure Multiparty Computation
  • Secure Aggregation
  • Syntactic Privacy
  • Statistical Privacy

Each has different emphasis and tradeoffs between privacy, security, computational costs, and accuracy


19 of 77

Data Minimization

Intuitive protection: Each client only sends the minimal amount of data necessary for the task at hand

  • E.g., report the total number of visits to a website, rather than full details of every visit
  • E.g., only send suggested updates to an ML model, instead of all data points
  • May still reveal private information, depending on the task! See “Deep Leakage from Gradients” [Zhu, Liu, Han 19]
  • In most cases, we seek a stronger guarantee than data minimization

Throughout, we assume that communications are encrypted between client and server

  • Primary privacy concern is what the server learns, or what can be learned from the server’s output


20 of 77

Secure Multiparty Computation (SMC)

  • Multiple honest, non-colluding servers cooperate to compute a function of the clients’ messages
  • No server sees the whole message, so only the final result of the computation is revealed
  • E.g., a simple secure multiparty computation (SMC) protocol to sum up client reports
    • Secret sharing: The client sends (mask) to one server, (data - mask) to another
    • Each server sums up all the (masked) shares
    • The servers then reveal their sums: the addition of these gives the sum of all client reports


[Figure: two clients with inputs x and y secret-share across two servers: one server receives x − r and y − s, the other receives r and s; the servers publish their sums (x − r + y − s) and (r + s), which add up to x + y.]
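As a toy illustration of the two-server summation above (a minimal sketch, not a hardened protocol; the modulus Q and the function names are our own choices):

    import secrets

    Q = 2**61 - 1  # public prime modulus; all arithmetic is mod Q

    def share(x):
        # Split x into two additive shares: (mask, x - mask) mod Q
        mask = secrets.randbelow(Q)
        return mask, (x - mask) % Q

    # Each client splits its value between server A and server B
    values = [3, 5, 10]
    shares = [share(v) for v in values]

    # Each server sums only the shares it received...
    sum_a = sum(s[0] for s in shares) % Q
    sum_b = sum(s[1] for s in shares) % Q

    # ...and only the combination of both totals reveals the overall sum
    assert (sum_a + sum_b) % Q == sum(values) % Q

Neither server alone learns anything about an individual value, since each share on its own is uniformly random.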

21 of 77

SMC primitives for federated computing

We can write [x] for a secret sharing of the value x: [x] = (x1, …, xk) so that ∑i xi = x

Many arithmetic operations can be applied on secret shares efficiently:

  • [x] + [y] = [x + y] : addition is performed entry-wise [x] + [y] = (x1 + y1, … xk + yk) = [x + y]
  • [x] - [y] = [x - y] : subtraction is dual to addition
  • c[x] = [cx] : multiplication by a (public) scalar value c

Other functions require some extra steps, and some preprocessing

  • E.g., Multiplication of two private values [x], [y] using helper values (a, b, c) such that (ab = c) [Beaver 91]
  • Comparison (greater than, less than) via conversion of shares to binary [Damgård et al 06, Nishide Ohta 07]

A core trade-off: more specific protocols can cost less than more general-purpose approaches


22 of 77

Secure Aggregation

Secure Aggregation addresses the special case that we want to compute the sum of vectors held by clients

Various implementations have been proposed with different tradeoffs and trust models:

  • Clients secret-share their data to 2 or more servers (SMC-like), who combine the results (previous slides)
  • Clients secret-share their data to all other clients, and all pass the shares to a trusted server to aggregate
  • Clients secret-share to O(log n) other clients, and all shares are combined by one server (described later)
  • Clients obtain a “mask” from secure enclave and release data+mask. Enclave sends sum of masks to server
  • A subset of clients cooperate to perform cryptographically secure aggregation [Roth et al 19]

Practical implementations emphasize handling client drop-outs: what happens when a client goes offline midway?
Other aspects: what trust is assumed? Is peer-to-peer connectivity needed? Is public key infrastructure needed?

[Figure: secure aggregation of client inputs x1, …, xn into the sum ∑i=1..n xi.]

23 of 77

Syntactic privacy: K-anonymity [Samarati Sweeney 98]

K-anonymity is a condition that requires each tuple in the output to be supported by at least k clients

E.g., only report a map grid cell as occupied if there are at least 10 active users there

K-anonymity has intuitive appeal but has been mostly deprecated by the research community

  • Sensitive information can still be leaked about individual users due to background knowledge
  • K-anonymity is sometimes useful in conjunction with statistical privacy for lay users
  • K-anonymity variants are still popular for some types of data, e.g., graph structured data

[Figure: k=5 anonymity, where each reported tuple is shared by at least 5 users.]

24 of 77

Statistical Privacy: Differential Privacy

Differential privacy limits the probability of seeing different outputs from an algorithm [Dwork Roth 14]

The output distribution should look similar when an individual user’s contributions are added or removed

More formally, for algorithm A, output set O, and adjacent datasets D, D’ differing in one individual:

    Pr[A(D) ∈ O] ≤ exp(ε) · Pr[A(D’) ∈ O] + δ

That is, the probability of any outcome is similar to that of the same outcome on similar data

Parameter ε captures the level of privacy: ε < 0.1 “high privacy”, ε > 1 “lower privacy”

Parameter δ is typically very small “wiggle room”


25 of 77

Achieving Differential Privacy

Canonical approach to achieving differential privacy in the central case [Dwork Roth 14]:

  • Compute the answer to the function exactly
  • Add noise from a statistical distribution (Laplace, Gaussian), scaled by the “sensitivity” of the function (sketched below)
    • Sensitivity: the maximum amount that the output of the function can change when one user is added or removed
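For the central case, this recipe is a few lines; a minimal sketch (ours, using NumPy) for a counting query, whose sensitivity is 1:

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # Laplace noise with scale sensitivity/epsilon gives epsilon-DP
        rng = np.random.default_rng()
        return true_answer + rng.laplace(scale=sensitivity / epsilon)

    # Adding or removing one user changes a count by at most 1
    noisy_count = laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5)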

Two intertwined challenges for the federated setting:

  1. How to compute the answer to the function over distributed inputs?
  2. How to ensure that sufficient noise is added?

Several approaches to these challenges: trusted entity, local noise, distributed noise


26 of 77

Local Differential Privacy

Under Local Differential Privacy (LDP), each client applies differential privacy locally and shares the result
No need to trust any other entity, as the DP guarantee holds for each client message

Randomized response [Warner 65] is at the core of most LDP algorithms

  • Essentially, randomly flip data bits with a carefully chosen probability to meet the DP requirement
  • Many algorithms represent each of N users’ data as binary information and apply RR
  • The error in the estimate is typically proportional to 1/√N, much larger than with central noise addition (see the sketch below)
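A minimal sketch of binary randomized response with the standard debiasing step (parameter names and the toy population are ours):

    import numpy as np

    def randomized_response(bits, epsilon, rng):
        # Keep the true bit with probability p = e^eps / (1 + e^eps), else flip it
        p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
        keep = rng.random(len(bits)) < p
        return np.where(keep, bits, 1 - bits)

    def estimate_fraction(reports, epsilon):
        # Debias the observed mean: E[report] = (1 - p) + f * (2p - 1)
        p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
        return (reports.mean() - (1 - p)) / (2 * p - 1)

    rng = np.random.default_rng(0)
    true_bits = (rng.random(100_000) < 0.3).astype(int)  # 30% of clients hold a 1
    reports = randomized_response(true_bits, epsilon=1.0, rng=rng)
    print(estimate_fraction(reports, epsilon=1.0))       # ~0.3, error ~ 1/sqrt(N)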

Local differential privacy has been widely deployed, touching 100s of millions of users:

  • In Google Chrome, to build browsing models (2014)
  • In Microsoft Windows, Apple’s iOS and macOS to collect app usage data (2017)
  • By Snap(chat) to instantiate ML models (2018)


27 of 77

Privacy and Security Summary

  • Security is concerned with protecting intermediate values (SMC, Secure aggregation)
  • Privacy is concerned with protecting the outputs (differential privacy variants)

Privacy can be applied at different levels:

  • Central DP has a central trusted entity gather the data and add the noise to give DP on their output
  • Local DP has each client add enough noise to mask their own contribution (less trust, more noise)
  • Distributed DP: each client adds a small amount of noise which sums to give a central DP guarantee

Federated computation may combine privacy and security, e.g., distributed DP plus secure aggregation


28 of 77

Federated Algorithms


29 of 77

Private Histograms

  • Private histograms are the bedrock of many privacy tasks
    • Abstraction: Each client holds one label out of a discrete set of size D
    • The exact histogram lists the aggregated counts of each label
  • Creating private histograms has been heavily studied under different differential privacy models:
    • Central DP: add your favorite flavor of DP noise (Laplace, Gaussian, symmetric geometric etc.)
    • Local DP: referred to as ‘frequency oracles’, apply some variant of randomized response at each client
    • Distributed DP: each client adds some small noise (Bernoulli, Pólya etc.) whose sum yields a central DP guarantee
  • Private histograms are used to describe data distributions
    • Via heavy hitters, range queries, quantiles etc.
  • We will show ways to build histograms in four different privacy/security models!


30 of 77

  1. Secure Aggregation Protocol with one server

Each client i has a vector xi and we want their sum ∑i xi, tolerating some client dropouts [Bell et al. 20]

Each client i picks k “neighbors” randomly, agrees on mask pairs mij = −mji with each neighbor j, plus shares of a random ri

Client i sends the message xi + ri + ∑j ∈ neighbors(i) mij

If all clients stay present, the mask pairs mij + mji cancel out

  • Server can request shares of ri from neighbors to remove ri

If a client i drops out, we must remove the unmatched masks mji from others

  • Server instead requests information on mji from neighbors to remove the mji values

Server obtains the correct sum, and cannot obtain any unmasked vector xi
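The core cancellation idea can be sketched as follows (a toy version of ours in which every pair of clients shares a mask; the self-masks ri and the dropout-recovery machinery of [Bell et al. 20] are omitted):

    import secrets
    from itertools import combinations

    Q = 2**61 - 1  # public modulus; reports are summed mod Q

    def masked_reports(values, pairs):
        # Each pair (i, j) agrees on a random mask; i adds it, j subtracts it
        masks = [0] * len(values)
        for i, j in pairs:
            m = secrets.randbelow(Q)
            masks[i] = (masks[i] + m) % Q
            masks[j] = (masks[j] - m) % Q
        return [(v + m) % Q for v, m in zip(values, masks)]

    values = [4, 7, 1, 9]
    pairs = list(combinations(range(len(values)), 2))  # toy case: all pairs
    reports = masked_reports(values, pairs)
    assert sum(reports) % Q == sum(values) % Q  # pair masks cancel in the sum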

[Figure: two clients send masked reports x1 + r1 + m12 and x2 + r2 + m21, where the pair masks satisfy m12 = −m21 and cancel in the sum.]

31 of 77

Secure Aggregation extensions

Rather than requiring all k neighbors to participate, we can relax to requiring only t out of k

  • Threshold secret sharing: the secret defines a polynomial f of degree t − 1; each neighbor holds a different evaluation of f, and any t evaluations suffice to reconstruct it

The secure aggregation protocol succeeds provided:

  • Fewer than t neighbors of each node are “corrupt” (collude with the server)
  • The graph of neighbors remains connected after removing corrupt and dropout clients
  • Each node has at least t neighbors surviving after dropouts

Choosing k to be logarithmic in n, the number of clients, ensures that these all hold with high probability

Each client only has to communicate their (masked) message + O(log n) extra information [Bell et al. 20]


32 of 77

Using Secure Aggregation for histograms

Each client holds a label li, and we want to build the histogram of all clients

Each client can create xi as the one-hot vector encoding of li, and perform secure aggregation over these xi

  • [Bell et al. 20] show a protocol to resist semi-honest servers trying to learn clients’ data

Scalability. If the set of labels is very big, the encoding of xi will be very large, causing high communication

  • “Distributed point functions” allow a compact encoding of sparse vectors in the SMC model [Gilboa Ishai 14]
  • The Prio system considers how to ensure that each client only votes for one label in SMC [Corrigan-Gibbs Boneh 17]


33 of 77

2. Frequency Oracles: LDP Histograms

Frequency oracles (FOs) collect a frequency distribution over a discrete domain of size D

  • Binary domain (D = 2): randomized response is optimal/unique
  • Small domain: (generalized) randomized response over the domain
  • Larger domain (D > 10): apply hashing to map to a smaller domain
    • Implemented using, e.g., the Hadamard transform as the hash function
  • Very large domain (D > 1000s): apply dimensionality reduction (sketching) to the input before the FO

Used in frequent item identification in systems at Google and Apple [Erlingsson et al 14, AppleDP 17]


34 of 77

LDP histograms

Unary encoding: apply randomized response to one-hot encoding of the input

  • Noise is improved by asymmetric bit flip probabilities [Wang et al 17]

Direct encoding (Generalized randomized response): pick one of D labels with appropriate probability

  • Pick the true label with probability p, each other label with probability (1 − p)/(D − 1)
  • ε-LDP requires p ≤ exp(ε) · (1 − p)/(D − 1); solving with equality gives p = exp(ε)/(exp(ε) + D − 1)

Local hashing: pick a public hash function hi mapping the input to g ≪ D options and apply direct encoding (see the sketch below)
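A sketch of direct encoding with its debiasing step (ours; the names and the toy experiment are illustrative):

    import numpy as np

    def direct_encoding(label, D, epsilon, rng):
        # Report the true label w.p. p = e^eps / (e^eps + D - 1), else a uniform other label
        p = np.exp(epsilon) / (np.exp(epsilon) + D - 1)
        if rng.random() < p:
            return label
        other = rng.integers(D - 1)
        return other if other < label else other + 1  # uniform over the D - 1 wrong labels

    def debias_histogram(reports, D, epsilon):
        # Invert the perturbation: E[count_v] = N*q + n_v*(p - q), with q = (1 - p)/(D - 1)
        p = np.exp(epsilon) / (np.exp(epsilon) + D - 1)
        q = (1 - p) / (D - 1)
        counts = np.bincount(np.asarray(reports), minlength=D)
        return (counts - len(reports) * q) / (p - q)

    rng = np.random.default_rng(0)
    reports = [direct_encoding(label=2, D=10, epsilon=1.0, rng=rng) for _ in range(50_000)]
    print(debias_histogram(reports, D=10, epsilon=1.0))  # entry 2 is ~50,000; others ~0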

[Figure: local hashing for LDP. A one-hot encoded input is hashed down to 2 options (e.g., hash value 0) and then perturbed with randomized response.]

35 of 77

3. Distributed noise generation for histograms

What if each user just adds a small amount of noise?

  • E.g., each of N users adds Gaussian noise with low variance σ²; the sum is Gaussian with variance Nσ²
  • E.g., each user adds Pólya-distributed noise; the sum follows a (discrete) Laplace distribution [Bagdasaryan et al 21]

Provided enough users add noise honestly, we can achieve a centralized DP guarantee on their sum!

  • If not enough users add noise, the protection is weaker
  • Need to use secure aggregation (or trusted aggregator) to provide security for the client reports
  • The noise has to be discrete if using secure aggregation

Many recent results on “shuffle” privacy – can often be seen as distributed noise generation with summation
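A toy sketch of the Gaussian variant (ours; secure aggregation is replaced by a plain sum purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    N, sigma = 100_000, 0.05

    values = rng.integers(0, 2, size=N)              # each client holds one bit
    noisy = values + rng.normal(0.0, sigma, size=N)  # per-client noise variance sigma^2

    # The aggregate carries Gaussian noise of variance N * sigma^2,
    # calibrated so the released sum satisfies a central DP guarantee
    print(noisy.sum(), values.sum())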


36 of 77

Distributed noise distributions

Recently, new distributions have been proposed tailored to the task of distributed noise generation

The Skellam mechanism is a discrete distribution [Agarwal et al 21]:

  • Easy to sample: generated as the difference of two Poisson draws (see the sketch below)
  • Discrete and closed under summation, as Poisson has these properties
  • Provides strong differential privacy guarantees under composition
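Sampling Skellam noise is indeed simple; a minimal sketch (ours):

    import numpy as np

    def skellam_noise(mu, size, rng):
        # Skellam(mu, mu) = difference of two independent Poisson(mu) draws
        return rng.poisson(mu, size) - rng.poisson(mu, size)

    rng = np.random.default_rng(0)
    per_client = skellam_noise(mu=0.5, size=1000, rng=rng)  # one draw per client
    # Sums of independent Skellams are Skellam, so the aggregated noise
    # is Skellam(500, 500): integer-valued and zero-mean
    print(per_client.sum())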

The (continuous) Arete distribution is tailored to the low privacy (ε ≫ 1) setting [Pagh Stausholm 22]

[Figure: Skellam distributions.]

37 of 77

4. Differentially Private Histograms via Sampling

Sample-and-threshold [Bharadwaj Cormode 22] gives a federated approach without explicit noise

  • Sample: Given a set of n clients, (Bernoulli) sample each one with probability p
    • Collect their labels, and build the exact histogram
  • Threshold: replace any count that is less than τ with 0

Claim: the output of this procedure gives an (ε, δ)-DP guarantee (provided the steps are kept secret)
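A minimal sketch of the mechanism (ours; the toy population is illustrative):

    import numpy as np
    from collections import Counter

    def sample_and_threshold(labels, p, tau, rng):
        # Bernoulli-sample each client w.p. p, count labels, drop counts below tau
        sampled = [x for x in labels if rng.random() < p]
        counts = Counter(sampled)
        return {label: c for label, c in counts.items() if c >= tau}

    rng = np.random.default_rng(7)
    labels = ["red"] * 900 + ["green"] * 80 + ["blue"] * 3
    print(sample_and_threshold(labels, p=0.2, tau=20, rng=rng))
    # "red" (~180 sampled copies) survives; "blue" almost surely does not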

[Figure: sample-and-threshold. Clients holding a favorite color (red, green, blue, black, …) each respond with sampling probability ps; the server sums (Σ) the sampled reports into a histogram.]

38 of 77

Histogram applications

Histograms can be used to build hierarchical histograms over an ordered domain, which in turn solve:

  • Heavy hitters: the items that occur most commonly in the input
  • Prefix queries: the count (sum) of items below a given value
  • Range queries: the count (sum) of items within a range of values
  • Quantiles: the median and percentiles of the data distribution

Accuracy guarantees follow from analyzing the sampling error (e.g., via multiplicative Chernoff bound)


39 of 77

Heavy Hitter Trie Search

  • Extend one layer of the trie per round by gathering a (private) histogram [Zhu et al 20]
  • Only extend prefixes that are popular enough in the histogram for the next round


Examples

  • Popular words typed
  • Top-k IP Prefixes in an area

Idea

Iteratively build a trie one character at a time with randomly sampled, anonymous clients voting in favor of the prefix that matches their data.
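A sketch of the iterative trie search (ours; each round would really query a fresh client sample through a private histogram, which the exact counts below stand in for):

    from collections import Counter

    def trie_heavy_hitters(words, threshold, max_len):
        # Grow popular prefixes one character per round, pruning unpopular ones
        prefixes = {""}
        for depth in range(1, max_len + 1):
            votes = Counter()
            for w in words:  # stand-in for one round's private histogram
                if len(w) >= depth and w[:depth - 1] in prefixes:
                    votes[w[:depth]] += 1
            prefixes = {p for p, c in votes.items() if c >= threshold}
            if not prefixes:
                break
            yield depth, prefixes

    words = ["cat"] * 40 + ["car"] * 35 + ["dog"] * 50 + ["cap"] * 3
    for depth, live in trie_heavy_hitters(words, threshold=20, max_len=3):
        print(depth, sorted(live))  # round 3 keeps {"car", "cat", "dog"}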

40 of 77

Further Histogram Applications

Some standard tricks allow us to reduce problems to histograms:

  • Range queries: ∑j=a..b yj. Overlay the universe with a tree, keep a histogram for each level, and answer a query by adding node weights (sketched below)
  • Quantile queries: medians etc. can be reduced to range queries
  • Marginal queries: compute the frequencies of a few carefully chosen marginals
  • Various other (graphical) models: express the models as linear functions of input frequencies
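A sketch of the tree trick for range queries (ours; in a private deployment each level's histogram would carry noise, and the domain size is a power of two here for simplicity):

    import numpy as np

    def dyadic_tree(hist):
        # One histogram per level: leaves, then pairwise sums upward
        levels = [np.asarray(hist, dtype=float)]
        while len(levels[-1]) > 1:
            levels.append(levels[-1].reshape(-1, 2).sum(axis=1))
        return levels

    def range_query(levels, a, b):
        # Sum hist[a..b] by touching O(log D) tree nodes instead of b - a + 1 leaves
        total, level = 0.0, 0
        while a <= b:
            if a % 2 == 1:                   # a is a right child: take it alone
                total += levels[level][a]; a += 1
            if b % 2 == 0:                   # b is a left child: take it alone
                total += levels[level][b]; b -= 1
            a, b, level = a // 2, b // 2, level + 1
        return total

    hist = [5, 1, 0, 2, 7, 3, 4, 6]
    levels = dyadic_tree(hist)
    assert range_query(levels, 2, 6) == sum(hist[2:7])

Because each level contributes at most two nodes, the noise in a private tree grows only logarithmically with the range length.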


41 of 77

Scalar mean estimation

Estimating the mean (and related quantities such as variance) is a basic question on distributed scalar data

Efforts aim to reduce the communication per client, which also reduces the information shared. Assume each value x is in [0, 1] (the first two approaches are sketched after the list)

  • Randomized rounding: with probability x, round x to 1, else 0. Apply randomized response for LDP
  • Subtractive dithering: pick a random threshold t and report whether x > t [Ben-Basat et al 21]
  • Bit-pushing: Sample random bits from a fixed precision representation [Cormode Markov 21]
  • Privacy-aware compression: Design an optimal probability matrix for a b-bit alphabet [Chaudhuri et al 22]

Each approach obtains a different trade-off between accuracy, privacy, and communication cost
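A sketch contrasting the first two approaches (ours; one bit per client, with no privacy noise shown):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.random(100_000)  # each client holds a value in [0, 1]

    # Randomized rounding: an unbiased single bit, E[bit] = x
    bits = (rng.random(x.shape) < x).astype(float)

    # Subtractive dithering: compare to a threshold t shared with the server
    t = rng.random(x.shape)
    report = (x > t).astype(float)
    estimate = report + t - 0.5   # server debiases each report using its copy of t

    print(x.mean(), bits.mean(), estimate.mean())  # all three are ~equal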


42 of 77

Vector Mean estimation

Federated Learning depends on computing the (weighted) mean of a collection of vectors [McMahan et al 17]

  • FedSGD: The server averages the local gradients from each client to update a current model
  • FedAvg: The server averages the local updates (multiple steps) from each client to a current model
  • Many subsequent variations proposed to improve the communication-accuracy-convergence tradeoff

Vector average computation is the main motivation for the design of secure aggregation

Vector mean estimation with (differential) privacy requires some extra steps:

  • Baseline: Independent noise addition on each coordinate
  • Clipping: bounding the magnitude of any coordinate reduces the level of noise needed
  • Distributed noise addition reduces the level of trust needed in an aggregator
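To make the FedAvg step concrete, a toy round for a linear model with squared loss (our sketch; real systems add client sampling, clipping, privacy noise, and secure aggregation):

    import numpy as np

    def fedavg_round(global_model, client_datasets, local_steps=5, lr=0.1):
        # Each client runs a few local SGD steps, then the server averages
        updates, weights = [], []
        for X, y in client_datasets:
            w = global_model.copy()
            for _ in range(local_steps):
                grad = 2 * X.T @ (X @ w - y) / len(y)
                w -= lr * grad
            updates.append(w)
            weights.append(len(y))  # weight clients by their data size
        return np.average(updates, axis=0, weights=weights)

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    clients = [(X := rng.normal(size=(50, 2)), X @ true_w) for _ in range(10)]
    model = np.zeros(2)
    for _ in range(20):
        model = fedavg_round(model, clients)
    print(model)  # converges to ~[2, -1]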


43 of 77

Federated Algorithms Summary

Federated protocols trade-off accuracy, privacy, security, computation and communication costs

  • Where and what noise is added to give privacy?
  • What primitives are used to achieve security (secure aggregation, or more advanced SMC?)
  • What accuracy and computational costs can be guaranteed?

Many emerging techniques for fundamental analytic tasks:

  • Histograms, quantiles, heavy hitters and range queries
  • Means of scalars and vectors, and generalizations to variance and higher moments


44 of 77

Practical Considerations


45 of 77

Federation in practice - status

Practical federated systems are moving from prototypes to production

Most documented systems focused on training ML models (FL) or frequency statistics (FA)

New federated analytics systems are on the cusp of deployment: a few papers discuss practical issues


46 of 77

Federated Analytics systems

The first generation of federated analytics tools adopted Local Differential Privacy to gather statistics:

  • Google’s RAPPOR system (collecting frequency statistics with LDP guarantees) [Erlingsson et al 14]
  • Apple’s DP deployment (collecting frequency statistics with LDP guarantees) [AppleDP 17]
  • Microsoft’s telemetry monitoring (collecting app usage statistics with LDP) [Ding et al 17]

However, LDP has been criticized for offering a poor accuracy-privacy tradeoff: results are very noisy

New approaches based on distributed noise generation and secure aggregation may be preferred


47 of 77

Federation Challenges

Moving the federated approach from prototype to practice has many challenges:

  • Technical: How to perform computations in a distributed and private way?
  • Latency: Updates from clients may be slow or missing. Clients may also not always be available.
  • Robustness: How to handle malformed responses and malicious actors?
  • Energy: May be more energy intensive to distribute computations
  • Fairness: Who is most influential in these computations?
  • Privacy: What privacy model to adopt and parameters to use? How to budget privacy?
  • Data distribution: Real data is often skewed, sparse, non-uniformly distributed and non-stationary
  • Device heterogeneity: User devices can vary in capability and functionality
  • Scale: Solutions differ for tens vs. thousands vs. millions of users


48 of 77

Privacy Granularity and Budgeting

Recall that the differential privacy definition can be applied at different granularities:

  • Group level: noise is added to protect shared data between users
  • User level: noise is added to mask the entire footprint of a user in the data
  • Device level: noise is added to mask the impact of a single device in the data
  • Event level: noise is added to mask a single event in the data

Other levels of privacy (e.g., app level) could be used based on preferences and applicability

If multiple/longitudinal queries are run on the same population, careful budgeting is needed to give an overall DP guarantee

  • How often should privacy budgets be refreshed?
  • Against whom are we providing privacy protections?


49 of 77

Availability and Latency of Clients

Availability of clients meeting the required criteria can be an issue

  • The system may only want to collect data when a mobile device is on wifi and charging
    • Ensure on-device logging is lightweight and happens outside of this window as well
  • This can limit client availability to night hours in the region of interest
    • Choose eligibility criteria that are as permissive as possible

Latency and dropouts: Clients may be slow to respond or not respond at all

  • Mobile device may lose network connectivity, be switched off or logged out
  • Late responses may still be of value if they can be incorporated into the results
  • Can ask more clients to submit responses than needed, in anticipation of dropouts
  • Devise asynchronous algorithms that don’t require all clients to check in quickly


50 of 77

Federated Data characteristics

The data observed in practical federated scenarios can be challenging to work with:

  • Data imbalance: some clients have many records, others have only a few
    • Should “rich” clients reduce to a single record, by sampling or aggregating their records?
  • Non-IID local data distributions: different clients’ data may vary a lot
    • May need to ensure that enough clients are sampled from each category
  • Skewness: values are often drawn from long-tail distributions
    • Clipping may be necessary to avoid excessive noise from privacy
  • Non-stationarity: distribution of client data may itself change over time
    • Avoid long lived queries
    • Cache local data
    • Use asynchronous federated algorithms


51 of 77

Building a federated stack

A practical federated stack should allow the following:

  • Register a new task for clients to participate in
    • Provide a high-level language for task specification
  • Allocate tasks to clients subject to privacy budgets
    • Balance the distribution of clients given a task
  • Receive responses from clients and compute results
    • Ensure that privacy guarantees are enforced before release
  • Manage access to the computed statistics
    • Share securely with downstream processes

The stack itself may need to be distributed over multiple servers!


52 of 77

Federation in practice

Anecdotally, the following issues are observed in practice:

  • Each round of federated computation is relatively fast: typically, a few minutes
    • Learning works with batches of a few hundred clients; analytics targets batches of thousands to millions
    • Using more clients reduces error, and can reduce the number of rounds needed
  • The algorithms used in practice are fairly robust to non-adversarial behavior
    • Accuracy is only modestly affected by client dropouts
  • Debugging is hard: the federated model does not allow data logging or replay
    • Testing on realistic data in simulation is needed before deployment


53 of 77

Advanced Topics and Open Problems


54 of 77

Federated Statistics

There are many natural data analytics tasks based on statistics to address in the federated setting:

  • Hypothesis testing: computing the necessary test statistics
    • And adjusting the test in light of privacy noise or sampling bias
  • Distribution comparison: measuring similarity of (empirical) distributions
    • E.g., each client holds a value: how similar is the global distribution to last week?
  • Statistical modeling: building simple graphical or regression models
    • Can this be done more cheaply than applying (federated) gradient descent?


55 of 77

Federated Analytics on structured data

Initial work on analytics has focused on data that can be modeled as scalars, sets or vectors

Open problem: providing private federated analytics on data with more complex structure

  • E.g.: Find approximate shortest paths in graphs where each client holds some edges
  • E.g.: Find approximate low-rank factorizations of distributed matrices
  • Many results in the (non-private) distributed setting to build on


56 of 77

Models of Federated Privacy

Current approaches explore a few combinations of security (SMC) and privacy (DP).
Other directions to consider include:

  • Stronger security models that are robust to malicious servers or cheating clients
  • Detecting or preventing “poisoning” attacks where clients collude to bias the results
  • Tighter privacy bounds based on e.g., Renyi DP and advanced “privacy accountants”


57 of 77

Federated Computation over time

More challenges to handle data changing over time!

  • Semantics: What should be the result of queries on changing data?
  • Streaming analytics: can a long-running FA query keep the result up to date?
  • Longitudinal issues: How to manage privacy if one user might participate often in data collection?
  • Parallelism: how to assign clients and manage privacy when there are many concurrent tasks?
  • Adaptive analytics: can the task be modified in response to the current data distribution?


58 of 77

Federated Tools and Systems

Currently, there are few public examples of general-purpose federated tools and systems

  • Although there are many federated learning projects with Python/C++ interfaces

Many research opportunities for building federated systems:

  • Production grade open source systems to handle federated computation
  • High-level query languages to allow complex tasks to be specified
  • User interface work to make it easy to launch new federated tasks and understand tradeoffs


59 of 77

Federated Learning improvements

Most machine learning is done via (stochastic) gradient descent

  • The model is improved by computing a gradient for each example

Observation: gradient descent is easy to distribute (in theory)

  • Each client computes their local gradient, and we average these
  • The federated SGD approach takes one step of SGD per round

There are many details and variations to make this practical!

  • Sample a batch of clients per round to do the update
  • Ask clients to do multiple update steps per round (FedAvg)
  • Meta-parameter settings (learning rate, momentum, etc.)
  • Ensure that malformed / skewed responses don’t delay convergence


60 of 77

Current Directions in Federated Learning

See ‘Advances and Open Problems in Federated Learning’ [Kairouz et al 21] :

  • Maximize performance: improve convergence-communication tradeoff
  • Tighter privacy bounds: give better results under DP guarantees
  • Distributed privacy: Distributed noise generation + Federated Learning
  • Robustness to poisoning attacks, membership inference, model inversion
  • Asynchronous FL to handle delays and different usage patterns
  • Model compression: smaller models, smaller updates
  • Better handling of heterogeneity and non-IID data distributions
  • Fairness: Ensuring fairness and reducing bias
  • Building robust systems and simulators to deploy FL: at least a dozen open source projects already!


61 of 77

Federated Learning beyond training

Most focus on Federated Learning (FL) has been on the core training procedure

A complete end-to-end learning solution has many additional steps:

  • Feature selection and feature normalization
  • Model calibration and evaluation
  • Model maintenance and checking post-deployment

These steps share the same privacy concerns as the core FL training, and need private federated solutions

  • E.g., build a lightweight classifier to identify which features are most important for learning
  • E.g., capture the ROC curve to estimate precision, recall, and the area under the curve

These tasks can be done effectively using federated computation primitives like private histograms!


62 of 77

Summary

Federated Computation is an emerging model for secure, private, distributed computation

  • Federated Learning has received significant attention and has been widely adopted
  • Other forms of federated computation such as analytics are on the cusp of impact

Many opportunities to contribute to this area of computation

  • Using techniques from multiparty computation, (differential) privacy and distributed computing
  • Many open problems to design new algorithms and build robust systems for federated computing

Thank you!


63 of 77

References


64 of 77

Preliminaries References

[Zhu, Liu, Han 19] Ligeng Zhu, Zhijian Liu, Song Han. Deep Leakage from Gradients. NeurIPS 2019

[Samarati Sweeney 98] Pierangela Samarati, Latanya Sweeney. Generalizing Data to Provide Anonymity when Disclosing Information. PODS 1998

[Dwork Roth 14] Cynthia Dwork, Aaron Roth. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9(3-4): 211-407 (2014)

[Roth et al 19] Edo Roth, Daniel Noble, Brett Hemenway Falk, Andreas Haeberlen. Honeycrisp: large-scale differentially private aggregation without a trusted core. SOSP 2019


65 of 77

Primitives References

[Beaver 91] Donald Beaver. Efficient Multiparty Protocols Using Circuit Randomization. CRYPTO 1991

[Damgård et al 06] Ivan Damgård, Matthias Fitzi, Eike Kiltz, Jesper Buus Nielsen, Tomas Toft. Unconditionally Secure Constant-Rounds Multi-party Computation for Equality, Comparison, Bits and Exponentiation. TCC 2006

[Nishide Ohta 07] Takashi Nishide, Kazuo Ohta. Constant-Round Multiparty Computation for Interval Test, Equality Test, and Comparison. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 90-A(5): 960-968 (2007)

[Dwork Roth 14] Cynthia Dwork, Aaron Roth. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9(3-4): 211-407 (2014)

[Warner 65] S. L. Warner. Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60(309): 63–69 (1965)


66 of 77

Algorithms References

[Bell et al. 20] James Bell, Kallista A. Bonawitz, Adrià Gascón, Tancrède Lepoint, Mariana Raykova. Secure Single-Server Aggregation with (Poly)Logarithmic Overhead. CCS 2020

[Bagdasaryan et al 21] Eugene Bagdasaryan, Peter Kairouz, Stefan Mellem, Adrià Gascón, Kallista A. Bonawitz, Deborah Estrin, Marco Gruteser. Towards Sparse Federated Analytics: Location Heatmaps under Distributed Differential Privacy with Secure Aggregation. CoRR abs/2111.02356 (2021)

[Corrigan-Gibbs Boneh 17] Henry Corrigan-Gibbs, Dan Boneh. Prio: Private, Robust, and Scalable Computation of Aggregate Statistics. NSDI 2017

[Gilboa Ishai 14] Niv Gilboa, Yuval Ishai. Distributed Point Functions and Their Applications. EUROCRYPT 2014


67 of 77

Algorithms References

[Erlingsson et al 14] Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. CCS 2014

[AppleDP 17] Differential Privacy Team, Apple. Learning with privacy at scale. 2017.

[Wang et al. 17] Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha. Locally Differentially Private Protocols for Frequency Estimation. USENIX Security Symposium 2017

[Agarwal et al 21] Naman Agarwal, Peter Kairouz, Ziyu Liu. The Skellam Mechanism for Differentially Private Federated Learning. NeurIPS 2021

[Pagh Stausholm 22] Rasmus Pagh, Nina Stausholm. Infinitely Divisible Noise in the Low Privacy Regime. arXiv preprint

[Bharadwaj Cormode 22] Akash Bharadwaj, Graham Cormode. Sample and Threshold Differential Privacy: Histograms and applications. AISTATS 2022


68 of 77

Algorithms References

[Zhu et al 20] Wennan Zhu, Peter Kairouz, Brendan McMahan, Haicheng Sun, Wei Li. Federated Heavy Hitters Discovery with Differential Privacy. AISTATS 2020

[Ben-Basat et al 21] Ran Ben-Basat, Michael Mitzenmacher, Shay Vargaftik. How to Send a Real Number Using a Single Bit (And Some Shared Randomness). ICALP 2021: 25:1-25:20

[Cormode Markov 21] Graham Cormode, Igor L. Markov. Bit-efficient Numerical Aggregation and Stronger Privacy for Trust in Federated Analytics. CoRR abs/2108.01521 (2021)

[Chaudhuri et al 22] Kamalika Chaudhuri, Chuan Guo, Mike Rabbat. Privacy-Aware Compression for Federated Data Analysis. CoRR abs/2203.08134 (2022)

[McMahan et al 17] McMahan, H. Brendan, Eider Moore, Daniel Ramage, and Seth Hampson. Communication-efficient learning of deep networks from decentralized data. AISTATS, 2017


69 of 77

Federation in Practice and Open Problems references

[Huba et al 21] Dzmitry Huba, John Nguyen, Kshitiz Malik, Ruiyu Zhu, Mike Rabbat, Ashkan Yousefpour, Carole-Jean Wu, Hongyuan Zhan, Pavel Ustinov, Harish Srinivas, Kaikai Wang, Anthony Shoumikhin, Jesik Min, Mani Malek. Papaya: Practical, Private, and Scalable Federated Learning. MLSys 2022.

[Ding et al 17] Bolin Ding, Janardhan Kulkarni, Sergey Yekhanin. Collecting Telemetry Data Privately. NIPS 2017: 3571-3580

[Kairouz et al 21] Peter Kairouz et al. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn. 14(1-2): 1-210 (2021)


70 of 77

Additional Slides


71 of 77

SMC multiplication

Multiplication of two private values [x], [y] can be achieved if we have some helper values [Beaver 91]

  • Beaver triples are arbitrary (randomly chosen) triples (a, b, c) such that (ab = c)
  • Parties receive [a], [b], [c] and compute [d] = [x] − [a] and [e] = [y] − [b]
  • Open d and e: no leakage, since a and b are randomly chosen
  • Shares of the product follow from xy = c + e·a + d·b + d·e, i.e., [xy] − de = [c] + e[a] + d[b]
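The whole exchange fits in a few lines (our sketch over a public prime field, with a trusted dealer supplying the triple):

    import secrets

    Q = 2**61 - 1  # public prime modulus

    def share(x):
        r = secrets.randbelow(Q)
        return [r, (x - r) % Q]  # two additive shares of x

    def reveal(sh):
        return sum(sh) % Q

    # Dealer prepares a random triple with a*b = c and shares each value
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    sa, sb, sc = share(a), share(b), share(a * b % Q)

    x, y = 12345, 67890
    sx, sy = share(x), share(y)

    # Parties open d = x - a and e = y - b: random-looking, so safe to reveal
    d = reveal([(sx[i] - sa[i]) % Q for i in range(2)])
    e = reveal([(sy[i] - sb[i]) % Q for i in range(2)])

    # Shares of x*y = c + e*a + d*b + d*e (the public d*e term is added once)
    sxy = [(sc[i] + e * sa[i] + d * sb[i]) % Q for i in range(2)]
    sxy[0] = (sxy[0] + d * e) % Q
    assert reveal(sxy) == (x * y) % Q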


72 of 77

SMC logic

We can also support other primitives in the multiparty computation setting:

  • Inequality: less-than, greater-than, not-equal-to
  • Comparators: min, max

A generic approach: convert an arithmetic sharing of x to a sharing of the binary representation of x

Then inequalities can be reduced to a subtraction and inspecting the “sign bit” [Damgård et al 06]

Other approaches use various arithmetic tricks [Nishide Ohta 07]

A core trade-off: more specific protocols can cost less than more general-purpose approaches

  • Some protocols can require many rounds of interaction, slowing them down
  • Computational cost can scale quadratically with the number of participating parties


73 of 77

Models of Differential Privacy

The definition of differential privacy can be applied in various places:

  • Central DP has a central trusted entity gather the data and add the noise to give DP on their output
  • Local DP has each client add enough noise to mask their own contribution (less trust, more noise)
  • Distributed DP: each client adds a small amount of noise which sums to give a central DP guarantee

Differential privacy can be applied at different granularities:

  • User-level privacy: DP is calibrated so that the entire contribution of a single user is masked
  • Device-level privacy: DP is calibrated so that the data from a single device is masked
  • Event-level privacy: DP is calibrated so that one event/action is masked by the noise added

User-level privacy can be harder to attain as some users have a large data footprint, can have multiple devices


74 of 77

Randomized Response

Each user has a single private bit, encoding e.g. political/sexual/religious preference, illness, etc.

  • The client tosses a (biased) coin and with probability p > ½, reports the true answer, else it lies
  • Collect responses from many users (N) and obtain an unbiased estimate of the population fraction
  • The error in the estimate is proportional to 1/√N
  • Gives (local) differential privacy with parameter ε = ln (p/(1-p))

Many more complicated algorithms can be built, e.g., to learn popular items, train ML models

BUT the noise-accuracy tradeoff for LDP is poor: the total noise variance is N times larger than is needed centrally!

  • Recent research has focused more on different ways to achieve distributed privacy


75 of 77

Sample-and-threshold privacy overview

The intuition is that sampling introduces uncertainty over the true counts of labels

Applying the threshold ensures that small counts can’t be distinguished

For instance, suppose p = 0.2, τ = 20.

  • Then counts much smaller than 100 are unlikely to survive the process
  • One cannot distinguish between, e.g., 20 or 21 copies of a label

Formalizing this requires giving appropriate tail bounds on the Binomial distribution


76 of 77

Sample-and-Threshold Implementation Details

How to perform the thresholding step?

  • Trusted entity (e.g., secure enclave) performs the steps, similar to instantiating the shuffle model
    • Arguably simpler/less trust than explicit noise addition
  • Multiple (two or more) servers obtain shares of the counts, use SMC protocols to do thresholding
    • Opportunity to use distributed point functions to make client communications compact

Communication can be small: the client just needs to share its label (log D bits), not a sparse histogram (D entries)

Asymptotically, sample-and-threshold is weaker than distributed noise (sampling noise ≫ privacy noise)

The main benefits come when some sampling is already built into the data collection


77 of 77

Case Study: Location heatmaps

Task: generate a geospatial heatmap of private locations [Bagdasaryan et al 21]

  • Privacy: distributed noise (Pólya distribution) and secure aggregation
    • Basic primitive: distributed differentially private histograms
  • Data Sparsity: use adaptive hierarchical histograms
    • Progressively refine coarse regions with sufficient activity
    • Conceptually similar to the trie heavy hitters approach
  • Accuracy: the method achieves similar MSE to non-adaptive algorithms, with much reduced communication
    • Measure error between data at original resolution and the coarser representation
