
Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform

11/01


With presentation slides adapted from this source.


Protect Models: Differential Privacy

  • Approach: make the training algorithms differentially private (DP)
  • Side note: this does not affect inference; a trained model can be executed any number of times
  • Existing DP training algorithms for ML (a minimal DP-SGD sketch follows this list):
    • SGD
    • Regressions
    • Collaborative filtering
    • Feature and model selection
    • Collection of implementations (tensorflow-privacy)
  • All assume a static database
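The DP-SGD item above is the canonical example: clip each example's gradient, average, and add Gaussian noise. Below is a minimal numpy sketch of that pattern; the function name and parameter values are illustrative, not Sage's or tensorflow-privacy's API.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD step (illustrative): clip per-example gradients,
    average them, then add Gaussian noise scaled to the clip norm."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # After averaging, one example changes the mean by at most
    # clip_norm / n, so the noise is calibrated to that sensitivity.
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    return params - lr * (mean_grad + rng.normal(0.0, sigma, mean_grad.shape))
```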


Practical challenges for a growing database:

  • Running out of privacy budget?
  • Balancing privacy and utility?


Sage: Key Ideas

  • Block Composition
    • Break historical data into “blocks”
    • Apply the privacy budget to each block individually
  • Privacy-Adaptive Training
    • Serve each DP query with a minimal budget
    • If a query fails because its budget is too tight, retry with a 2x budget limit


Sage Overview

Enforces a global privacy budget

Access control: assigns blocks to training iterations

Training:

  • Integrates existing DP training algorithms
  • May fail (RETRY)


Block Composition

  • Split the database into time-based blocks
  • Combine blocks into larger datasets for model training
  • Account for privacy loss on each block individually (see the accounting sketch below)

Cap on the max global privacy loss:

|PrivacyLoss(stream)| ≤ max_k |PrivacyLoss(D_k)|

New blocks are generated with zero privacy loss, so fresh data always arrives with a full budget.

[Figure: Model 1, Model 2, and Model 3 trained over different ranges of data blocks; X marks retired blocks]

Iterative Training

  • Conserve privacy budget so more DP queries can be served
    • Queries here are ML training rounds
  • Models trained with very low budgets perform poorly
    • The privacy-utility trade-off
  • Insight: training on more data (even at a low budget) improves accuracy


Iterative Training

  • Data selection
    • Retire blocks with no budget left
    • Assemble the training dataset from the available blocks
  • Conserve budget (see the sketch after the figure below)
    • Start with a small budget (ε₀, δ₀)
    • If training fails, retry with 2x the limits and/or collect more data blocks
    • Final budget ≤ 2x the best possible budget
    • Total budget usage ≤ 4x the final budget


[Figure: Sage Access Control]
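The budget-doubling loop above can be sketched as follows, reusing the hypothetical BlockAccountant from the block-composition section; train_dp and validate are placeholders for a DP training algorithm and a Sage-style validator, not real APIs.

```python
def privacy_adaptive_train(train_dp, validate, accountant, blocks,
                           eps0=0.1, max_rounds=6):
    """Retry DP training, doubling the budget until validation passes
    (illustrative sketch of privacy-adaptive training)."""
    eps = eps0
    for _ in range(max_rounds):
        # Data selection: retire blocks whose budget cannot cover eps.
        live = [b for b in blocks if accountant.remaining(b) >= eps]
        if not live or not accountant.try_charge(live, eps):
            return None  # out of budget everywhere; wait for new blocks
        model = train_dp(live, eps)
        if validate(model):
            return model  # accepted at the smallest budget that worked
        eps *= 2  # budget too tight: retry with a 2x limit
    return None
```

Because the budgets double, the retries form a geometric series, which is what keeps total consumption within a small constant factor of the final, successful budget.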


Iterative Training - Validation

  • Success metric
    • The trained model's accuracy meets a target threshold
  • SLAed DP validation (a toy version follows this list)
    • Statistical tests that account for DP randomness
    • Ensures the output model can serve high-quality predictions “with high probability”
    • Validators for Loss, Accuracy, and Sum
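To make the "account for DP randomness" point concrete, here is a toy accuracy validator. It covers only the Laplace noise added for DP (Sage's real validators also account for sampling error), and the function name and parameters are illustrative.

```python
import numpy as np

def dp_accuracy_validator(correct, total, target_acc, eps, eta=0.01, rng=None):
    """Accept a model only if its DP-noised accuracy estimate clears
    the target even after discounting the noise tail (toy sketch)."""
    rng = rng or np.random.default_rng()
    # Counting correct predictions has sensitivity 1, so adding
    # Laplace(1/eps) noise makes the released count eps-DP.
    noisy_correct = correct + rng.laplace(0.0, 1.0 / eps)
    # One-sided (1 - eta) bound on Laplace(1/eps) noise: for t >= 0,
    # P(noise > t) = 0.5 * exp(-eps * t), so t = ln(1/(2*eta)) / eps.
    margin = np.log(1.0 / (2.0 * eta)) / eps
    # A model below the target passes only if the noise exceeds the
    # margin, which happens with probability at most eta.
    return noisy_correct - margin >= target_acc * total
```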


Evaluation

  1. Benefits of block composition
  2. Importance of iterative training and DP validation
  3. Continuous operation: training new models on a growing database


Benefits of block composition

  • Traditional DP baseline
    • Splits each query into a sub-query per block (Fig. 7b)


Iterative training and DP-aware validation

  • UC DP baseline
    • SLAed DP validation without correcting for the impact of DP noise (Fig. 6b, Table 2)

Failure rate at 1% target probability (η = 0.01):

  Non DP: 0.2%
  UC DP: 1.7%
  Sage: 0.3%


Continuous operation of the ML pipeline

  • Block size: 1 hour of data


Discussion Questions

  • What is the main challenge stopping Sage from supporting user-level privacy?
  • What approaches can be taken at the feature level to prevent user-level data leakage, as noted in the paper, when feature columns are correlated (for an individual user's submitted data)?
  • Can we reuse abandoned blocks by shuffling and reorganizing them into new data blocks?
  • How do correlated events placed in different blocks affect the privacy budget of the (now-correlated) blocks?
  • To train ML models, Sage seems to pick blocks mostly indiscriminately, so long as the result passes validation. Is there a good way to extend Sage so that it can work with algorithms that require data from specific parts of the database (instead of just any blocks)?
