1 of 18

Scalable Privacy-Preserving Distributed Learning

D. Froelicher, J. R. Troncoso-Pastoriza, A. Pyrgelis, S. Sav, J. Sa Sousa, J.-P. Bossuat, J.-P. Hubaux

Laboratory for Data Security (LDS)


2 of 18

Distributed Learning - Motivation

The training of accurate and generalisable machine learning models requires large and diverse datasets

→ Multiple entities collaborate to train a machine learning model on their joint data


3 of 18

Distributed Learning - Problem

Sensitive/personal data are difficult to share

  • Stringent regulations, e.g., GDPR.
  • Complex/costly data-access agreements
  • High repercussions in case of data leakage
  • Competition among stakeholders

Sensitive data are often siloed → each entity trains independently on a smaller, less diverse dataset → less accurate models


4 of 18

Distributed Learning - Current Approaches

(a) Fully centralized: raw data are outsourced to a trusted 3rd party → data leakage risk
Examples: All of Us, EGA, Genomics England

(b) Meta-analysis: aggregated data are sent to a trusted 3rd party → data leakage risk
Examples: https://covidclinical.net/, sPLINK

(c) Decentralized: aggregated data are exchanged between the parties → data leakage risk
Examples: http://www.datashield.ac.uk, Personalized Health Train (PHT), vantage6

(d) Differential-privacy decentralized (= partial results obfuscation) → introduces bias
Examples:
- M. Kim et al. "Secure and Differentially Private Logistic Regression for Horizontally Distributed Data." TIFS, 2019.
- M. Abadi et al. "Deep Learning with Differential Privacy." In ACM CCS, 2016.
- K. Chaudhuri and C. Monteleoni. "Privacy-Preserving Logistic Regression." In NIPS, 2009.

(e) Cryptographic (SMC, HE) decentralized: secret-shared/encrypted data → limited number of parties
Examples:
- A. Gascón et al. "Privacy-Preserving Distributed Linear Regression on High-Dimensional Data." PETS, 2017.
- P. Mohassel and Y. Zhang. "SecureML: A System for Scalable Privacy-Preserving Machine Learning." In IEEE S&P, 2017.

SPINDLE (Multiparty Homomorphic Encryption (MHE)):
Data + model confidentiality as long as at least 1 entity is honest
→ No data outsourcing
→ Scales with the number of parties
→ Exact results

5 of 18

Building Blocks

Generic Secure Federated Learning with Data Confidentiality + Model Confidentiality

SPINDLE combines:
  • Multiparty Homomorphic Encryption (MHE)
  • Cooperative Gradient Descent
  • Extended MapReduce Abstraction
  • Generalized Linear Models: linear, logistic, and multinomial regression

6 of 18

MapReduce for Distributed Learning

(Diagram: data providers DP1-DP7 each run PREPARE.)

PREPARE:
DPs agree on security/learning parameters
Each DPi pre-processes its data
DPs agree on the initial global model

DP = Data Provider
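As a minimal sketch of what PREPARE could look like at one data provider, assuming a simple local standardization and an all-zeros initial global model (function and variable names below are illustrative, not SPINDLE's actual API):

```python
import numpy as np

def prepare(X_local, y_local, num_features):
    """Illustrative PREPARE step at one data provider (DPi); names are hypothetical."""
    # Standardize the local features so that all DPs train on comparable scales.
    X_std = (X_local - X_local.mean(axis=0)) / (X_local.std(axis=0) + 1e-8)
    # DPs agree (out of band) on an initial global model, e.g., all zeros.
    W_global_init = np.zeros(num_features)
    return X_std, y_local, W_global_init
```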

7 of 18

MapReduce for Distributed Learning

(Diagram: after PREPARE, each data provider DPi runs MAP to produce its local model Wi; the process iterates.)

MAP:

Each DPi locally trains on its data and uses the global model WG to update its local model Wi
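For illustration, here is a cleartext sketch of the MAP step for logistic regression (mini-batch gradient descent starting from the global model). In SPINDLE this computation runs on the encrypted model; the names and hyperparameters below are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_step(X_i, y_i, W_global, lr=0.1, local_iters=5, batch_size=32):
    """Illustrative MAP step: DPi refines the current global model WG on its own data."""
    W_i = W_global.copy()                      # start from the global model WG
    n = X_i.shape[0]
    rng = np.random.default_rng()
    for _ in range(local_iters):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        Xb, yb = X_i[idx], y_i[idx]
        grad = Xb.T @ (sigmoid(Xb @ W_i) - yb) / len(yb)   # logistic-loss gradient
        W_i -= lr * grad
    return W_i                                 # local model Wi, passed on to COMBINE
```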

8 of 18

MapReduce for Distributed Learning

(Diagram: each DPi runs MAP; the local models Wi are then aggregated along a tree, e.g., W3+W4+W5 and W2+W6+W7, yielding ∑ Wi.)

MAP:
Each DPi locally trains on its data and uses the global model WG to update its local model Wi

COMBINE:

All Wi are combined
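A sketch of COMBINE, assuming the combination is a plain sum of the local models as the slide's ∑ Wi suggests (with MHE these additions run directly on ciphertexts; names are illustrative):

```python
import numpy as np

def combine(local_models):
    """Illustrative COMBINE step: aggregate the local models Wi."""
    # Addition is associative, so the sum can be computed along a tree
    # (e.g., W3+W4+W5 and W2+W6+W7 in parallel) before the final aggregation.
    return np.sum(np.stack(local_models), axis=0)   # = sum over i of Wi
```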


9 of 18

MapReduce for Distributed Learning


MAP:

Each DPi locally trains on its data and uses the global model WG to update its local model Wi

COMBINE:

All Wi are combined

REDUCE:
DP1 updates the global model WG

REDUCE → WG = f(∑ Wi, WG(iter-1))
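The exact update function f is a learning-parameter choice; one plausible instantiation, sketched with illustrative names, averages the local models and blends them with the previous global model:

```python
def reduce_step(sum_Wi, W_global_prev, num_dps, momentum=0.0):
    """Illustrative REDUCE step run by DP1: update the global model WG."""
    W_avg = sum_Wi / num_dps                          # average of the local models
    return momentum * W_global_prev + (1.0 - momentum) * W_avg
```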


10 of 18

MapReduce for Distributed Learning


MAP:

Each DPi locally trains on its data and uses the global model WG to update its local model Wi

COMBINE:

All Wi are combined

REDUCE:
DP1 updates the global model WG

REDUCE → WG = f(∑ Wi, WG(iter-1))

PREDICTION:

DPi uses WG to predict on new data
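For completeness, a cleartext sketch of PREDICTION with the trained global model (in SPINDLE the prediction is computed under encryption with an approximated activation; names here are illustrative):

```python
import numpy as np

def predict(X_new, W_global, threshold=0.5):
    """Illustrative PREDICTION step: DPi applies the global model WG to new samples."""
    probs = 1.0 / (1.0 + np.exp(-(X_new @ W_global)))  # logistic activation
    return (probs >= threshold).astype(int)
```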


11 of 18

MapReduce for Secure Distributed Learning


MAP:

Each DPi locally trains on its data and uses the global model WG to update its local model Wi

COMBINE:

All Wi are combined

REDUCE:
DP1 updates the global model WG

PREDICTION:

DPi uses WG to predict on new data

(Key legend: each DP holds its own secret/public key pair; computations use a collective public key derived as a function of the DPs' public keys.)

12 of 18

Multiparty Homomorphic Encryption (MHE)


Public collective key

[1] J. H. Cheon, A. Kim, M. Kim, and Y. Song. Homomorphic encryption for arithmetic of approximate numbers. In ASIACRYPT, 2017.

[2] C. Mouchet, J. R. Troncoso-Pastoriza, J.-P. Bossuat, and J.-P. Hubaux. Multiparty homomorphic encryption: From theory to practice. In PETS, 2021.

  • Each DP can independently:
    • Encrypt a vector of values in one ciphertext with the collective public key
    • Compute (add, mult, rotations, …) on ciphertexts encrypted with the collective public key
  • All DPs collaborate to:
    • Decrypt ciphertexts encrypted with the collective public key
    • Replace costly cryptographic operations with lightweight interactive protocols

We rely on the adaptation to CKKS [1] of the multiparty scheme proposed by Mouchet et al. [2].

The collective public key is a function of the DPs' public keys; the corresponding secret keys are never revealed.

Ciphertext: E(v1, …, vN)
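SPINDLE builds on the multiparty variant of CKKS [1, 2]. To get a rough feel for the single-key CKKS operations alone (packing a vector in one ciphertext, slot-wise additions and multiplications), the snippet below uses the open-source TenSEAL library; this is only an illustration of the homomorphic layer, it is not what SPINDLE uses, and it does not show the multiparty aspects (collective public key, collective decryption).

```python
# Single-key CKKS illustration with the TenSEAL library (pip install tenseal).
# The multiparty aspects (collective key, threshold decryption) are NOT shown here.
import tenseal as ts

ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()          # needed for slot rotations

v1 = ts.ckks_vector(ctx, [1.0, 2.0, 3.0, 4.0])   # one ciphertext packs a whole vector
v2 = ts.ckks_vector(ctx, [0.5, 0.5, 0.5, 0.5])

ct = v1 + v2          # slot-wise addition on ciphertexts
ct = ct * v2          # slot-wise multiplication on ciphertexts
print(ct.decrypt())   # approximately [0.75, 1.25, 1.75, 2.25]
```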

13 of 18

Extended MapReduce for Secure Distributed Learning

(Recap diagram: PREPARE, then iterate MAP → COMBINE → REDUCE; PREDICTION uses the trained global model.)

Stochastic Gradient Descent: each local iteration multiplies the features with a samples' batch and applies the activation; on packed ciphertexts these matrix-vector products are computed with either a row-based or a diagonal approach.
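To give intuition for the diagonal approach, here is a cleartext numpy sketch of the classical diagonal method for a matrix-vector product, which uses only slot rotations and element-wise multiplications, exactly the operations available on packed ciphertexts. This illustrates the general technique, not SPINDLE's exact packing.

```python
import numpy as np

def matvec_diagonal(M, v):
    """Matrix-vector product using only rotations and element-wise products,
    the operations available on CKKS-packed ciphertexts (diagonal method)."""
    n = M.shape[0]                       # assume M is n x n and v has length n
    result = np.zeros(n)
    for d in range(n):
        diag_d = np.array([M[i, (i + d) % n] for i in range(n)])  # d-th generalized diagonal
        result += diag_d * np.roll(v, -d)                          # rotate v by d slots
    return result

M = np.arange(16, dtype=float).reshape(4, 4)
v = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(matvec_diagonal(M, v), M @ v)
```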

Activation: evaluated via a least-squares polynomial approximation. (Legend: encrypted vs. cleartext operands; the packing approach is selected based on the input dimension.)
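As an illustration of the least-squares activation approximation, the sketch below fits a low-degree polynomial to the sigmoid on a bounded interval, so that the activation can be evaluated with additions and multiplications only. The degree and interval here are illustrative placeholders; SPINDLE sets them through its approximation parameters.

```python
import numpy as np

def lsq_sigmoid_approx(degree=3, interval=(-8.0, 8.0), num_points=1000):
    """Least-squares polynomial approximation of the sigmoid on a bounded interval."""
    z = np.linspace(*interval, num_points)
    sigma = 1.0 / (1.0 + np.exp(-z))
    coeffs = np.polyfit(z, sigma, degree)     # least-squares fit
    return np.poly1d(coeffs)

approx = lsq_sigmoid_approx()
print(approx(0.0))        # ~0.5, close to sigmoid(0)
```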

14 of 18

Extended MapReduce for Secure Distributed Learning



Parameterization:
  • Input dimensions
  • Cryptographic/security parameters
  • Approximation parameters
  • Learning parameters
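For concreteness, the four parameter groups could be gathered in a configuration like the one below; all names and values are illustrative placeholders, not SPINDLE's recommended settings.

```python
# Illustrative parameterization only; concrete values must be chosen per use case.
config = {
    "input_dimensions": {"num_features": 32, "batch_size": 256},
    "crypto_security":  {"security_bits": 128, "ring_dimension": 2 ** 14},
    "approximation":    {"activation_degree": 3, "activation_interval": (-8, 8)},
    "learning":         {"learning_rate": 0.1, "local_iterations": 5, "global_iterations": 50},
}
```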

15 of 18

SPINDLE Evaluation: Accuracy

SPINDLE achieves accuracy close to that of the centralized solution and (almost) the same accuracy as non-secure distributed solutions

(Plots: accuracy for logistic regression, one-vs-all logistic regression, and multinomial regression.)

Evaluation Parameters

10 Data providers

128-bit security level

Legend

Dataset: Name [#samples x #features]

(1) Pima = Pima Indians Diabetes https://www.kaggle.com/uciml/pima-indians-diabetes-database

(2) BCW = Breast Cancer Wisconsin (Original)
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

(3,4) MNIST

Y. LeCun and C. Cortes. Handwritten digit database. 2010.

16 of 18

SPINDLE Evaluation: Performance

5 data providers, 25,600 records, 128-bit security. Each data provider: 2 Intel Xeon E5-2680 v3 CPUs (2.5 GHz, 24 threads on 12 cores), 256 GB RAM. Communication: 100 Mbps bandwidth, 20 ms delay.

Execution time increases better than logarithmically with the number of features

17 of 18

SPINDLE Evaluation: Performance

128-bit security level; default number of features = 32; |S| = number of data providers; n = global dataset size; b = batch size used in the stochastic gradient descent. Each data provider: 2 Intel Xeon E5-2680 v3 CPUs (2.5 GHz, 24 threads on 12 cores), 256 GB RAM. Communication: 100 Mbps bandwidth, 20 ms delay.

Runtime is almost independent of the number of data providers |S|

Scales linearly with the size n of the data providers' datasets

Efficient workload distribution

18 of 18

Conclusion


  • Generic framework to perform a cooperative gradient descent
    • Model and data confidentiality over the entire ML workflow
    • Data stays at the data providers’ premises
    • Similar accuracy as non-secure distributed solutions
    • Same tradeoff as non-secure solutions: between performance and accuracy, not between privacy and accuracy
    • Scales with the number of DPs, with the DPs’ dataset sizes and with the number of features

Future Work: build on our widely applicable solution to:

  • Securely train more complex ML models
    Sinem Sav, Apostolos Pyrgelis, Juan R. Troncoso-Pastoriza, David Froelicher, Jean-Philippe Bossuat, Joao Sa Sousa, and Jean-Pierre Hubaux. "POSEIDON: Privacy-Preserving Federated Neural Network Learning." In NDSS, 2021.
  • Target specific applications, e.g., Genome-Wide Association Study
    David Froelicher, Juan R. Troncoso-Pastoriza, Jean Louis Raisaro, Michel Cuendet, Joao Sa Sousa, Jacques Fellay, and Jean-Pierre Hubaux. "Truly Privacy-Preserving Federated Analytics for Precision Medicine with Multiparty Homomorphic Encryption." bioRxiv, 2021.

lds.epfl.ch