1 of 25

Privacy of Noisy SGD

Ajinkya K Mulay

More iterations without more privacy loss

2 of 25

Paper Metadata

Authors: Jason Altschuler, Kunal Talwar

Venue: NeurIPS 2022 (Oral Presentation)

Paper Link: https://arxiv.org/abs/2205.13710

OpenReview Link: https://openreview.net/forum?id=pDUYkwrx__w

NeurIPS Presentation: https://nips.cc/virtual/2022/poster/54462

3 of 25

Motivation

  • Noisy-SGD
    • Gold standard for private training
    • Privacy cost scales with number of iterations
    • More epochs preferred for better performance
  • Open Questions
    • What is the privacy loss of Noisy-SGD as a function of iterations?
    • Does the privacy loss of Noisy-SGD increase ad infinitum with iterations?

4 of 25

tl;dr

  • Privacy cost for Noisy-SGD does not increase ad infinitum*
    • Privacy cost scales linearly for initial burn-in period
    • Only release final iterate
    • Bounded parameter domain

*for a bounded, convex parameter domain and convex, smooth, Lipschitz loss functions

5 of 25

Privacy in ML models

We focus on GPT-2 and find that at least 0.1% of its text generations (a very conservative estimate) contain long verbatim strings that are “copy-pasted” from a document in its training set.

6 of 25

Privacy in ML models

We prompt GPT-3 with the beginning of chapter 3 of Harry Potter and the Philosopher’s Stone. The model correctly reproduces about one full page of the book (about 240 words) before making its first mistake.

7 of 25

Background: Differential Privacy

If the inclusion/removal of a data point from the training dataset does not change the output of an algorithm too much, we call this algorithm differentially private.

[Figure: two datasets that differ only in Joe’s data are run through the algorithm A(X); an adversary observes the output and tries to tell them apart.]

8 of 25

Background: (ε, δ)-Differential Privacy

If the inclusion/removal of a data point from the training dataset does not alter the algorithm’s output probability too much, we call this algorithm differentially private.

9 of 25

Background: (ε, δ)-Differential Privacy

Bounds the change in the algorithm’s output probability (formal definition below)

  • ε - the multiplicative factor, or privacy cost
    • for small ε, exp(ε) ≅ 1 + ε
    • typically choose ε ≅ 1
  • δ - the breach probability
    • δ should be much smaller than 1/n (where n is the dataset size)
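
For reference, the formal statement behind these bullets (the standard textbook definition, not specific to this paper): a randomized algorithm A is (ε, δ)-differentially private if, for every pair of adjacent datasets X and X′ (differing in one data point) and every set of outputs S,

Pr[A(X) ∈ S] ≤ exp(ε) · Pr[A(X′) ∈ S] + δ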

10 of 25

Differential Privacy Implications

  • Differential privacy protects against
    • Membership inference
    • Linkage attacks
    • Reconstructing datasets
  • Differential privacy can
    • Infer patterns / statistics - outcome of the algorithm is similar irrespective of who contributes data
    • Erase individual identities

11 of 25

Gaussian Noise and Differential Privacy

  • Add zero-mean Gaussian noise, with standard deviation proportional to the algorithm’s sensitivity, to the algorithm’s output (sketched below).
    • Sensitivity - the maximum possible change in the algorithm’s output due to any single data point
  • Notation
    • f: non-private algorithm
    • 𝛥2 f: (L2) sensitivity of f
    • A: private algorithm
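
A minimal sketch of the Gaussian mechanism in this notation. The noise calibration below is the classical one, σ = 𝛥2 f · sqrt(2 ln(1.25/δ)) / ε, which gives (ε, δ)-DP for ε < 1; the function and variable names are illustrative only.

    import numpy as np

    def gaussian_mechanism(f_output, l2_sensitivity, eps, delta):
        # Release f(X) + N(0, sigma^2 I), with sigma calibrated to (eps, delta)-DP.
        # Classical calibration, valid for eps < 1.
        sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        noise = np.random.normal(loc=0.0, scale=sigma, size=np.shape(f_output))
        return f_output + noise

    # Example: privately release the mean of 1000 values in [0, 1].
    # Replacing one value changes the mean by at most 1/n, so the sensitivity is 1/n.
    x = np.random.rand(1000)
    private_mean = gaussian_mechanism(np.mean(x), l2_sensitivity=1.0 / len(x), eps=1.0, delta=1e-6)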

12 of 25

Background: Noisy SGD

Notation

  • 𝜔: parameter estimates
  • Κ: convex, bounded parameter set
  • 𝜂: learning rate
  • Gt: mini-batch gradient
  • Z: Gaussian noise

Each iteration incurs a privacy cost of ε (one update step is sketched below)
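
A minimal sketch of one Noisy-SGD update in this notation, 𝜔 ← ΠΚ(𝜔 − 𝜂(Gt + Z)); the projection onto Κ and the noise scale σ are illustrative placeholders, not the paper’s exact setup.

    import numpy as np

    def noisy_sgd_step(w, minibatch, grad_fn, eta, sigma, project_onto_K):
        # One projected Noisy-SGD update: w <- Proj_K( w - eta * (G_t + Z) ).
        G_t = np.mean([grad_fn(w, x) for x in minibatch], axis=0)  # mini-batch gradient
        Z = np.random.normal(0.0, sigma, size=w.shape)             # isotropic Gaussian noise
        return project_onto_K(w - eta * (G_t + Z))

    # Illustrative choice of K: an L2 ball of radius R (so the diameter is D = 2R).
    def project_onto_l2_ball(w, R=1.0):
        norm = np.linalg.norm(w)
        return w if norm <= R else w * (R / norm)

Because the losses are assumed Lipschitz, the per-example gradients are already bounded, so no explicit gradient clipping appears here (unlike generic DP-SGD).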

13 of 25

Why do we care about the privacy cost?

Under basic composition, T iterations each costing ε add up to a total privacy cost of Tε, and exp(Tε) blows up as T grows and Tε ≫ 1.

Privacy loss stacks up!
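
A quick worked example under basic sequential composition (T mechanisms, each ε-DP, compose to Tε-DP): with ε = 0.01 per iteration and T = 1,000 iterations, the composed cost is Tε = 10, so the likelihood-ratio bound exp(Tε) ≈ 2.2 × 10⁴ - far too large to be a meaningful guarantee.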

14 of 25

Contraction with small gradient steps

With small step sizes, a simple gradient update contracts (or at least does not expand) the distance between two parameter vectors
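
The standard fact behind this picture (for a convex, M-smooth loss f and step size 𝜂 ≤ 2/M): the gradient-descent map φ(w) = w − 𝜂∇f(w) is non-expansive,

‖φ(w) − φ(w′)‖ ≤ ‖w − w′‖ for all w, w′,

and it becomes a strict contraction when f is additionally strongly convex (for suitably small 𝜂). Euclidean projection onto a convex set is also non-expansive, so the projected update inherits the same property.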

15 of 25

Intuition: Private SGD Training Dynamics

  • Adjacent datasets, same parameter initialization
    • Bounded parameters - parameters differ by at most D

16 of 25

Intuition: Private SGD Training Dynamics

  • Adjacent datasets, same parameter initialization
    • Bounded parameters - parameters differ by at most D
  • Adjacent datasets, different parameter initialization (Increases Privacy Loss)
    • Noisy SGD with batch sampling [Privacy Amplification By Sampling]
    • Gradients are indistinguishable under privacy
    • Sampling from the same dataset?

17 of 25

Intuition: Private SGD Training Dynamics

  • Adjacent datasets, same parameter initialization
    • Bounded parameters - parameters differ by at most D
  • Adjacent datasets, different parameter initialization (Increases Privacy Loss)
    • Noisy SGD with batch sampling [Privacy Amplification By Sampling]
    • Gradients are indistinguishable under privacy
    • Sampling from the same dataset?
  • Same dataset, different parameter initialization (Decreases Privacy Loss)
    • Parameters contract with SGD [Privacy Amplification By Iterations]
    • Parameters indistinguishable under privacy

Balance the losses!

18 of 25

Main Result 1: Privacy Upper Bound

Assuming D/(L𝜂) is a small multiplicative factor, we have essentially constant privacy after a few epochs!
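
A purely schematic illustration of this qualitative statement (and of the tl;dr slide), not the paper’s theorem or constants; eps_per_step and burn_in_iters are hypothetical placeholders:

    def schematic_privacy_cost(T, eps_per_step, burn_in_iters):
        # Schematic only: the privacy cost grows roughly linearly during the
        # burn-in period and then stops increasing with further iterations.
        return eps_per_step * min(T, burn_in_iters)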

19 of 25

Main Result 1: Theory

  • Parameter set is bounded
    • Parameters cannot be too far apart
  • We sample each batch with probability b/n
    • We get access to Privacy Amplification by Sampling
  • Can we rewrite Noisy SGD into a format that allows contractions? (see the note below)
    • Parameters will be pulled closer with more iterations
    • Get access to Privacy Amplification by Iterations
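
One hedged way to read the last bullet, following the “contractive noisy iteration” viewpoint from the privacy-amplification-by-iteration literature (Feldman et al., 2018): the projected update can be regrouped as

𝜔t+1 = ΠΚ( φt(𝜔t) + ζt ),  with φt(𝜔) = 𝜔 − 𝜂 Gt(𝜔) and ζt the injected Gaussian noise (rescaled by 𝜂 in the notation above),

where φt is non-expansive for 𝜂 ≤ 2/M (see the contraction slide) and the projection ΠΚ is non-expansive as well; iterating non-expansive maps with fresh Gaussian noise is exactly the setting where Privacy Amplification by Iteration applies.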

20 of 25

Main Result 1: Can we split up privacy analysis?

  • T−τ iterations (burn-in period)
    • The fewest iterations needed before the two outputs can become maximally distinguishable (their distance is bounded by the diameter D)
    • Sampling vs. contractions
  • Final τ iterations - the output is less or equally distinguishable compared to the model after T−τ iterations

21 of 25

Main Result 2: Tight Privacy Bound

  • Construct a counterexample with linear loss functions
  • The exact parameter distributions are unknown
    • Prove the result for auxiliary distributions that are stochastically worse

22 of 25

Limitations

  • Requires Euclidean Projection & constrained parameter set
    • harder to analyze
  • Cannot release intermediate models
  • Requires convexity
    • not realistic for complex DL problems
  • Lacking privacy-utility bounds and empirical results

23 of 25

Takeaways

  • Privacy Amplification by Boundedness
    • Better privacy analysis under convexity and smoothness
  • Useful for multi-epoch SGD

24 of 25

Future Work

  • Can we create contractions without the need for convexity?
  • What other techniques can we leverage that provide similar non-asymptotic bounds?
  • How could we deploy such a system for Federated Learning? What level of trust should the server have?
  • Could we use such a system for hyperparameter tuning under privacy?
  • Extensions based on adaptive step-sizes

25 of 25

Q&A

Thank You