
Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform

11/01


With presentation slides adapted from this source.


Protect Models: Differential Privacy

  • Approach: make the training algorithms differentially private (DP)
  • Side note: this does not affect inference; a trained model can be executed any number of times
  • Existing DP training algorithms for ML (a minimal DP-SGD sketch follows this list):
    • SGD
    • Regressions
    • Collaborative filtering
    • Feature and model selection
    • Collection of implementations (tensorflow-privacy)
  • All assume a static database
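The DP-SGD item above is the canonical example: clip each example's gradient, average, and add Gaussian noise. Below is a minimal numpy sketch of that pattern; the function name and parameter values are illustrative, not Sage's or tensorflow-privacy's API.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD step (illustrative): clip per-example gradients,
    average them, then add Gaussian noise scaled to the clip norm."""
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # After averaging, one example changes the mean by at most
    # clip_norm / n, so the noise is calibrated to that sensitivity.
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    return params - lr * (mean_grad + rng.normal(0.0, sigma, mean_grad.shape))
```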


Practical challenges for a growing database:

  • Running out of privacy budget?
  • Balancing privacy and utility?


Sage: Key Ideas

  • Block Composition
    • Break historical data into “blocks”
    • Apply the privacy budget to each block individually
  • Privacy-Adaptive Training
    • Serve each DP query with a minimal budget
    • If a query fails because its budget is too tight, retry with a 2x budget limit


Sage Overview

Enforces a global privacy budget

Access control: assigns blocks to training iterations

Training:

  • Integrates existing DP training algorithms
  • May fail (RETRY)


Block Composition

  • Split the database into time-based blocks
  • Combine blocks into larger datasets for model training
  • Account for privacy loss on each block individually (see the accounting sketch below)

Cap on the max global privacy loss:

|PrivacyLoss(stream)| ≤ max_k |PrivacyLoss(D_k)|

New blocks are generated with zero privacy loss, so fresh data always arrives with a full budget.

[Figure: Model 1, Model 2, and Model 3 trained over different ranges of data blocks; X marks retired blocks]

Iterative Training

  • Conserve privacy budget so more DP queries can be served
    • Queries here are ML training rounds
  • Models trained with very low budgets perform poorly
    • The privacy-utility trade-off
  • Insight: training on more data (even at a low budget) improves accuracy


Iterative Training

  • Data selection
    • Retire blocks with no budget left
    • Assemble the training dataset from the available blocks
  • Conserve budget (see the sketch after the figure below)
    • Start with a small budget (ε₀, δ₀)
    • If training fails, retry with 2x the limits and/or collect more data blocks
    • Final budget ≤ 2x the best possible budget
    • Total budget usage ≤ 4x the final budget


[Figure: Sage Access Control]
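The budget-doubling loop above can be sketched as follows, reusing the hypothetical BlockAccountant from the block-composition section; train_dp and validate are placeholders for a DP training algorithm and a Sage-style validator, not real APIs.

```python
def privacy_adaptive_train(train_dp, validate, accountant, blocks,
                           eps0=0.1, max_rounds=6):
    """Retry DP training, doubling the budget until validation passes
    (illustrative sketch of privacy-adaptive training)."""
    eps = eps0
    for _ in range(max_rounds):
        # Data selection: retire blocks whose budget cannot cover eps.
        live = [b for b in blocks if accountant.remaining(b) >= eps]
        if not live or not accountant.try_charge(live, eps):
            return None  # out of budget everywhere; wait for new blocks
        model = train_dp(live, eps)
        if validate(model):
            return model  # accepted at the smallest budget that worked
        eps *= 2  # budget too tight: retry with a 2x limit
    return None
```

Because the budgets double, the retries form a geometric series, which is what keeps total consumption within a small constant factor of the final, successful budget.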


Iterative Training - Validation

  • Success metric
    • The trained model's accuracy meets a target threshold
  • SLAed DP validation (a toy version follows this list)
    • Statistical tests that account for DP randomness
    • Ensures the output model can serve high-quality predictions “with high probability”
    • Validators for Loss, Accuracy, and Sum
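To make the "account for DP randomness" point concrete, here is a toy accuracy validator. It covers only the Laplace noise added for DP (Sage's real validators also account for sampling error), and the function name and parameters are illustrative.

```python
import numpy as np

def dp_accuracy_validator(correct, total, target_acc, eps, eta=0.01, rng=None):
    """Accept a model only if its DP-noised accuracy estimate clears
    the target even after discounting the noise tail (toy sketch)."""
    rng = rng or np.random.default_rng()
    # Counting correct predictions has sensitivity 1, so adding
    # Laplace(1/eps) noise makes the released count eps-DP.
    noisy_correct = correct + rng.laplace(0.0, 1.0 / eps)
    # One-sided (1 - eta) bound on Laplace(1/eps) noise: for t >= 0,
    # P(noise > t) = 0.5 * exp(-eps * t), so t = ln(1/(2*eta)) / eps.
    margin = np.log(1.0 / (2.0 * eta)) / eps
    # A model below the target passes only if the noise exceeds the
    # margin, which happens with probability at most eta.
    return noisy_correct - margin >= target_acc * total
```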


Evaluation

  1. Benefits of block composition
  2. Importance of iterative training and DP validation
  3. Continuous operation: training new models on a growing database


Benefits of block composition

  • Traditional DP baseline
    • Splits each query into a sub-query per block (Fig. 7b)


Iterative training and DP-aware validation

  • UC DP baseline
    • SLAed DP validation without correcting for the impact of DP noise (Fig. 6b, Table 2)

Failure rate at 1% target probability (η = 0.01):

  Non DP: 0.2%
  UC DP: 1.7%
  Sage: 0.3%


Continuous operation of the ML pipeline

  • Block size: 1 hour of data


Discussion Questions

  • What is the main challenge stopping Sage from supporting user-level privacy?
  • What approaches can be taken at the feature level to prevent user-level data leakage, as noted in the paper, when feature columns are correlated (for an individual user's submitted data)?
  • Can we reuse abandoned blocks by shuffling and reorganizing them into new data blocks?
  • How do correlated events placed in different blocks affect the privacy budget of the (now-correlated) blocks?
  • To train ML models, Sage seems to pick blocks mostly indiscriminately, so long as the result passes validation. Is there a good way to extend Sage so that it can work with algorithms that require data from specific parts of the database (instead of just any blocks)?
