1 of 25

Privacy of Noisy SGD

Ajinkya K Mulay

More iterations without more privacy loss

2 of 25

Paper Metadata

Authors: Jason Altschuler, Kunal Talwar

Venue: NeurIPS 2022 (Oral Presentation)

Paper Link: https://arxiv.org/abs/2205.13710

OpenReview Link: https://openreview.net/forum?id=pDUYkwrx__w

NeurIPS Presentation: https://nips.cc/virtual/2022/poster/54462

3 of 25

Motivation

  • Noisy-SGD
    • Gold standard for private training
    • Privacy cost scales with number of iterations
    • More epochs preferred for better performance
  • Open Questions
    • What is the privacy loss of Noisy-SGD as a function of iterations?
    • Does the privacy loss of Noisy-SGD increase ad infinitum with iterations?

4 of 25

tl;dr

  • Privacy cost for Noisy-SGD does not increase ad infinitum*
    • Privacy cost scales linearly for initial burn-in period
    • Only release final iterate
    • Bounded parameter domain

*for a bounded, convex parameter domain and convex, smooth, Lipschitz loss functions

5 of 25

Privacy in ML models

We focus on GPT-2 and find that at least 0.1% of its text generations (a very conservative estimate) contain long verbatim strings that are “copy-pasted” from a document in its training set.

6 of 25

Privacy in ML models

We prompt GPT-3 with the beginning of chapter 3 of Harry Potter and the Philosopher’s Stone. The model correctly reproduces about one full page of the book (about 240 words) before making its first mistake.

7 of 25

Background: Differential Privacy

If the inclusion/removal of a data point from the training dataset does not change the output of an algorithm too much, we call this algorithm differentially private.

[Figure: two datasets that differ only in Joe’s data are run through the algorithm A(X); an adversary observes the output and tries to tell them apart.]

8 of 25

Background: (ε, δ)-Differential Privacy

If the inclusion/removal of a data point from the training dataset does not alter the algorithm’s output probability too much, we call this algorithm differentially private.

9 of 25

Background: (ε, δ)-Differential Privacy

Bounds the change in the algorithm’s output probability (formal definition below)

  • ε - the multiplicative factor, or privacy cost
    • for small ε, exp(ε) ≅ 1 + ε
    • typically choose ε ≅ 1
  • δ - the breach probability
    • δ should be much smaller than 1/n (where n is the dataset size)
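
For reference, the formal statement behind these bullets (the standard textbook definition, not specific to this paper): a randomized algorithm A is (ε, δ)-differentially private if, for every pair of adjacent datasets X and X′ (differing in one data point) and every set of outputs S,

Pr[A(X) ∈ S] ≤ exp(ε) · Pr[A(X′) ∈ S] + δ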

10 of 25

Differential Privacy Implications

  • Differential privacy protects against
    • Membership inference
    • Linkage attacks
    • Reconstructing datasets
  • Differential privacy can
    • Infer patterns / statistics - outcome of the algorithm is similar irrespective of who contributes data
    • Erase individual identities

11 of 25

Gaussian Noise and Differential Privacy

  • Add zero-mean Gaussian noise, with standard deviation proportional to the algorithm’s sensitivity, to the algorithm’s output (sketched below).
    • Sensitivity - the maximum possible change in the algorithm’s output due to any single data point
  • Notation
    • f: non-private algorithm
    • 𝛥2 f: (L2) sensitivity of f
    • A: private algorithm
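
A minimal sketch of the Gaussian mechanism in this notation. The noise calibration below is the classical one, σ = 𝛥2 f · sqrt(2 ln(1.25/δ)) / ε, which gives (ε, δ)-DP for ε < 1; the function and variable names are illustrative only.

    import numpy as np

    def gaussian_mechanism(f_output, l2_sensitivity, eps, delta):
        # Release f(X) + N(0, sigma^2 I), with sigma calibrated to (eps, delta)-DP.
        # Classical calibration, valid for eps < 1.
        sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        noise = np.random.normal(loc=0.0, scale=sigma, size=np.shape(f_output))
        return f_output + noise

    # Example: privately release the mean of 1000 values in [0, 1].
    # Replacing one value changes the mean by at most 1/n, so the sensitivity is 1/n.
    x = np.random.rand(1000)
    private_mean = gaussian_mechanism(np.mean(x), l2_sensitivity=1.0 / len(x), eps=1.0, delta=1e-6)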

12 of 25

Background: Noisy SGD

Notation

  • 𝜔: parameter estimates
  • Κ: convex, bounded parameter set
  • 𝜂: learning rate
  • Gt: mini-batch gradient
  • Z: Gaussian noise

Each iteration incurs a privacy cost of ε (one update step is sketched below)
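
A minimal sketch of one Noisy-SGD update in this notation, 𝜔 ← ΠΚ(𝜔 − 𝜂(Gt + Z)); the projection onto Κ and the noise scale σ are illustrative placeholders, not the paper’s exact setup.

    import numpy as np

    def noisy_sgd_step(w, minibatch, grad_fn, eta, sigma, project_onto_K):
        # One projected Noisy-SGD update: w <- Proj_K( w - eta * (G_t + Z) ).
        G_t = np.mean([grad_fn(w, x) for x in minibatch], axis=0)  # mini-batch gradient
        Z = np.random.normal(0.0, sigma, size=w.shape)             # isotropic Gaussian noise
        return project_onto_K(w - eta * (G_t + Z))

    # Illustrative choice of K: an L2 ball of radius R (so the diameter is D = 2R).
    def project_onto_l2_ball(w, R=1.0):
        norm = np.linalg.norm(w)
        return w if norm <= R else w * (R / norm)

Because the losses are assumed Lipschitz, the per-example gradients are already bounded, so no explicit gradient clipping appears here (unlike generic DP-SGD).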

13 of 25

Why do we care about the privacy cost?

Under basic composition, T iterations each costing ε add up to a total privacy cost of Tε, and exp(Tε) blows up as T grows and Tε ≫ 1.

Privacy loss stacks up!
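
A quick worked example under basic sequential composition (T mechanisms, each ε-DP, compose to Tε-DP): with ε = 0.01 per iteration and T = 1,000 iterations, the composed cost is Tε = 10, so the likelihood-ratio bound exp(Tε) ≈ 2.2 × 10⁴ - far too large to be a meaningful guarantee.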

14 of 25

Contraction with small gradient steps

With small step sizes, a simple gradient update contracts (or at least does not expand) the distance between two parameter vectors
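
The standard fact behind this picture (for a convex, M-smooth loss f and step size 𝜂 ≤ 2/M): the gradient-descent map φ(w) = w − 𝜂∇f(w) is non-expansive,

‖φ(w) − φ(w′)‖ ≤ ‖w − w′‖ for all w, w′,

and it becomes a strict contraction when f is additionally strongly convex (for suitably small 𝜂). Euclidean projection onto a convex set is also non-expansive, so the projected update inherits the same property.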

15 of 25

Intuition: Private SGD Training Dynamics

  • Adjacent datasets, same parameter initialization
    • Bounded parameters - parameters differ by at most D

16 of 25

Intuition: Private SGD Training Dynamics

  • Adjacent datasets, same parameter initialization
    • Bounded parameters - parameters differ by at most D
  • Adjacent datasets, different parameter initialization (Increases Privacy Loss)
    • Noisy SGD with batch sampling [Privacy Amplification By Sampling]
    • Gradients are indistinguishable under privacy
    • Sampling from the same dataset?

17 of 25

Intuition: Private SGD Training Dynamics

  • Adjacent datasets, same parameter initialization
    • Bounded parameters - parameters differ by at most D
  • Adjacent datasets, different parameter initialization (Increases Privacy Loss)
    • Noisy SGD with batch sampling [Privacy Amplification By Sampling]
    • Gradients are indistinguishable under privacy
    • Sampling from the same dataset?
  • Same dataset, different parameter initialization (Decreases Privacy Loss)
    • Parameters contract with SGD [Privacy Amplification By Iterations]
    • Parameters indistinguishable under privacy

Balance the losses!

18 of 25

Main Result 1: Privacy Upper Bound

Assuming D/(L𝜂) is a small multiplicative factor, we have essentially constant privacy after a few epochs!
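
A purely schematic illustration of this qualitative statement (and of the tl;dr slide), not the paper’s theorem or constants; eps_per_step and burn_in_iters are hypothetical placeholders:

    def schematic_privacy_cost(T, eps_per_step, burn_in_iters):
        # Schematic only: the privacy cost grows roughly linearly during the
        # burn-in period and then stops increasing with further iterations.
        return eps_per_step * min(T, burn_in_iters)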

19 of 25

Main Result 1: Theory

  • Parameter set is bounded
    • Parameters cannot be too far apart
  • We sample each batch with probability b/n
    • We get access to Privacy Amplification by Sampling
  • Can we rewrite Noisy SGD into a format that allows contractions? (see the note below)
    • Parameters will be pulled closer with more iterations
    • Get access to Privacy Amplification by Iterations
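
One hedged way to read the last bullet, following the “contractive noisy iteration” viewpoint from the privacy-amplification-by-iteration literature (Feldman et al., 2018): the projected update can be regrouped as

𝜔t+1 = ΠΚ( φt(𝜔t) + ζt ),  with φt(𝜔) = 𝜔 − 𝜂 Gt(𝜔) and ζt the injected Gaussian noise (rescaled by 𝜂 in the notation above),

where φt is non-expansive for 𝜂 ≤ 2/M (see the contraction slide) and the projection ΠΚ is non-expansive as well; iterating non-expansive maps with fresh Gaussian noise is exactly the setting where Privacy Amplification by Iteration applies.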

20 of 25

Main Result 1: Can we split up privacy analysis?

  • T−τ iterations (burn-in period)
    • The fewest iterations needed before the two outputs can become maximally distinguishable (their distance is bounded by the diameter D)
    • Sampling vs. contractions
  • Final τ iterations - the output is less or equally distinguishable compared to the model after T−τ iterations

21 of 25

Main Result 2: Tight Privacy Bound

  • Construct a counterexample with linear loss functions
  • The exact parameter distributions are unknown
    • Prove the result for auxiliary distributions that are stochastically worse

22 of 25

Limitations

  • Requires Euclidean Projection & constrained parameter set
    • harder to analyze
  • Cannot release intermediate models
  • Requires convexity
    • not realistic for complex DL problems
  • Lacking privacy-utility bounds and empirical results

23 of 25

Takeaways

  • Privacy Amplification by Boundedness
    • Better privacy analysis under convexity and smoothness
  • Useful for multi-epoch SGD

24 of 25

Future Work

  • Can we create contractions without the need for convexity?
  • What other techniques can we leverage that provide similar non-asymptotic bounds?
  • How could we deploy such a system for Federated Learning? What level of trust should the server have?
  • Could we use such a system for hyperparameter tuning under privacy?
  • Extensions based on adaptive step-sizes

25 of 25

Q&A

Thank You