Scalable Privacy-Preserving Distributed Learning
D. Froelicher, J. R. Troncoso-Pastoriza, A. Pyrgelis, S. Sav, J. Sa Sousa, J.-P. Bossuat, J.-P. Hubaux
Laboratory for Data Security (LDS)
1
Distributed Learning - Motivation
The training of accurate and generalisable machine learning models requires large and diverse datasets
→ Multiple entities collaborate to train a machine learning model on their joint data
2
Distributed Learning - Problem
Sensitive/personal data are difficult to share
→ Sensitive data are often siloed
→ Each entity trains independently:
   → less diversity
   → smaller training dataset
   → less accurate
3
Distributed Learning - Current Approaches
4
(a) Fully centralized — raw data outsourced to a trusted 3rd party
Examples: All of Us, EGA, Genomics England
Drawback: data leakage

(b) Meta-analysis — aggregated data sent to a trusted 3rd party
Examples: https://covidclinical.net/, sPLINK
Drawback: data leakage

(c) Decentralized — aggregated data exchanged among the parties
Drawback: data leakage

(d) Differential Privacy Decentralized (= partial results obfuscation)
Examples:
- M. Kim et al. "Secure and Differentially Private Logistic Regression for Horizontally Distributed Data," TIFS 2019
- M. Abadi et al. "Deep Learning with Differential Privacy," ACM CCS 2016
- K. Chaudhuri and C. Monteleoni. "Privacy-Preserving Logistic Regression," NIPS 2009
Drawback: introduces bias

(e) Cryptographic (SMC, HE) Decentralized — secret-shared/encrypted data
Examples:
- A. Gascón et al. "Privacy-Preserving Distributed Linear Regression on High-Dimensional Data," PETS 2017
- P. Mohassel and Y. Zhang. "SecureML: A System for Scalable Privacy-Preserving Machine Learning," IEEE S&P 2017
Drawback: limited #parties
SPINDLE (Multiparty Homomorphic Encryption (MHE))
→ Data + Model Confidentiality as long as at least 1 entity is honest
→ No data outsourcing
→ Scales with #parties
→ Exact results
Building Blocks
Generic Secure Federated Learning with Data Confidentiality + Model Confidentiality
5
SPINDLE
Multiparty Homomorphic Encryption (MHE)
Cooperative Gradient Descent
Extended MapReduce Abstraction
Generalized Linear Models:
- Linear and logistic regression
- Multinomial regression
MapReduce for Distributed Learning
6
(Figure: seven data providers DP1–DP7, each executing PREPARE.)
PREPARE:
- DPs agree on security and learning parameters
- Each DPi pre-processes its data
- DPs agree on the initial global model
DP = Data Provider
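For concreteness, a hypothetical sketch of the kind of configuration the DPs could agree on during PREPARE (all names and values below are illustrative, not SPINDLE's actual parameters):

```python
# Hypothetical PREPARE agreement shared by all DPs (illustrative only).
prepare_params = {
    "security": {
        "scheme": "CKKS",                    # approximate-arithmetic HE scheme
        "security_level_bits": 128,
    },
    "learning": {
        "model": "logistic_regression",
        "learning_rate": 0.1,
        "batch_size": 32,
        "local_iterations": 5,               # MAP iterations per global iteration
        "global_iterations": 20,
    },
    "initial_global_model": [0.0] * 32,      # every DP starts from the same WG
}
```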
MapReduce for Distributed Learning
7
(Figure: after PREPARE, each DPi runs MAP to produce its local model Wi; the workflow iterates.)
MAP:
Each DPi locally trains on its data and uses the global model WG to update its local model Wi
MapReduce for Distributed Learning
8
(Figure: the local models are combined tree-wise, e.g. COMBINE → W4 + W3 + W5, COMBINE → W2 + W6 + W7, up to COMBINE → ∑ Wi; the workflow iterates.)
MAP:
Each DPi locally trains on its data and uses the global model WG to update its local model Wi
COMBINE:
All local models Wi are combined
MapReduce for Distributed Learning
9
(Figure: DP1 performs REDUCE → WG = f(∑ Wi, WG(iter-1)); the workflow iterates.)
MAP:
Each DPi locally trains on its data and uses the global model WG to update its local model Wi
COMBINE:
All local models Wi are combined
REDUCE:
DP1 updates the global model: WG = f(∑ Wi, WG(iter-1))
MapReduce for Distributed Learning
10
(Figure: after training, any DPi can run PREDICTION with the global model WG.)
MAP:
Each DPi locally trains on its data and uses the global model WG to update its local model Wi
COMBINE:
All local models Wi are combined
REDUCE:
DP1 updates the global model: WG = f(∑ Wi, WG(iter-1))
PREDICTION:
DPi uses WG to predict on new data
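As a point of reference before encryption is introduced, here is a minimal cleartext sketch of this MAP/COMBINE/REDUCE loop for a simple linear model. The function names, the choice of squared loss, and the blending rule used for f are illustrative assumptions, not SPINDLE's exact update rule:

```python
import numpy as np

def map_step(X_i, y_i, W_G, lr=0.1, local_iters=5, batch=32):
    """MAP at one DP: start from the global model WG and run a few
    local mini-batch gradient steps on the DP's own data."""
    W = W_G.copy()
    rng = np.random.default_rng(0)
    for _ in range(local_iters):
        idx = rng.choice(len(X_i), size=min(batch, len(X_i)), replace=False)
        Xb, yb = X_i[idx], y_i[idx]
        grad = Xb.T @ (Xb @ W - yb) / len(yb)   # squared-loss gradient (linear model)
        W -= lr * grad
    return W

def combine_step(local_models):
    """COMBINE: sum all local models (done tree-wise among the DPs)."""
    return np.sum(local_models, axis=0)

def reduce_step(summed_W, W_G_prev, num_dps, rho=0.5):
    """REDUCE (one possible f): blend the average of the local models
    with the previous global model."""
    return (1 - rho) * W_G_prev + rho * (summed_W / num_dps)

# Toy run with 3 data providers holding synthetic data.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
dps = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    dps.append((X, X @ true_w + 0.01 * rng.normal(size=100)))

W_G = np.zeros(2)
for _ in range(20):                                        # global iterations
    local = [map_step(X, y, W_G) for X, y in dps]          # MAP at every DP
    W_G = reduce_step(combine_step(local), W_G, len(dps))  # COMBINE + REDUCE at DP1
print(W_G)                                                 # ≈ [2, -1]
```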
MapReduce for Secure Distributed Learning
11
(Figure: the same MapReduce workflow, but all models and intermediate results are encrypted; ⟨·⟩ denotes a value encrypted under the DPs' Collective Public Key. Figure legend: Secret Key, Public Key, Collective Public Key.)
MAP:
Each DPi locally trains on its data and uses the encrypted global model ⟨WG⟩ to update its encrypted local model ⟨Wi⟩
COMBINE:
All ⟨Wi⟩ are combined
REDUCE:
DP1 updates the encrypted global model: ⟨WG⟩ = f(∑ ⟨Wi⟩, ⟨WG(iter-1)⟩)
PREDICTION:
DPi uses ⟨WG⟩ to predict on ⟨new data⟩
Multiparty Homomorphic Encryption (MHE)
12
[1] J. H. Cheon, A. Kim, M. Kim, and Y. Song. Homomorphic encryption for arithmetic of approximate numbers. In ASIACRYPT, 2017.
[2] C. Mouchet, J. R. Troncoso-Pastoriza, J.-P. Bossuat, and J.-P. Hubaux. Multiparty homomorphic encryption: From theory to practice. In PETS'21, Thursday 15th of July, Session 7B.
We rely on the adaptation to CKKS[1] of the multiparty scheme proposed by Mouchet et al. [2]
Collective public key = f(DPs' public keys); the corresponding secret keys are never revealed
Ciphertext: E(v1, …, vN) — a single ciphertext packs a vector of N values
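For intuition, a simplified sketch of how a collective public key can be generated in this family of RLWE-based multiparty schemes (notation simplified from Mouchet et al. [2]; not the exact protocol messages):

```latex
% Each DP_i holds a secret key s_i; the collective secret key
% s = \sum_i s_i is never materialized by anyone.
\begin{align*}
  b_i &= -a\, s_i + e_i
     && \text{share published by } \mathrm{DP}_i \text{ (common random } a,\ \text{fresh noise } e_i)\\
  \mathrm{cpk} &= \Big(\textstyle\sum_i b_i,\ a\Big)
     = \Big(-a \textstyle\sum_i s_i + \textstyle\sum_i e_i,\ a\Big)
\end{align*}
% Decrypting a ciphertext under cpk requires a contribution from every DP's s_i,
% which is why confidentiality holds as long as at least one DP is honest.
```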
Extended MapReduce for Secure Distributed Learning
13
(Figure: the secure MapReduce workflow iterates; the MAP step is detailed below.)
MAP:
Each DPi locally trains on its data and uses the encrypted global model ⟨WG⟩ to update its encrypted local model ⟨Wi⟩
COMBINE:
All ⟨Wi⟩ are combined
REDUCE:
DP1 updates the encrypted global model: ⟨WG⟩ = f(∑ ⟨Wi⟩, ⟨WG(iter-1)⟩)
PREDICTION:
DPi uses ⟨WG⟩ to predict on new data
Inside MAP — stochastic gradient descent on encrypted models:
- Each gradient update computes f(features, samples' batch): the cleartext local data (features, samples' batch) is multiplied with the encrypted model.
- The resulting encrypted matrix-vector products can be packed with a row-based approach or a diagonal approach (a sketch of the diagonal approach follows below).
- The activation function is evaluated on encrypted values via a least-squares polynomial approximation.
- The choice of packing depends on the input dimension.
(The figure's legend distinguishes encrypted from cleartext values.)
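A minimal cleartext illustration of the diagonal (Halevi–Shoup-style) packing approach for a matrix-vector product, using only element-wise multiplications and vector rotations, i.e. the operations available on packed ciphertexts (the helper name and dimensions are illustrative):

```python
import numpy as np

def diag_matvec(M, v):
    """Diagonal-approach matrix-vector product:
    y = sum_d diag_d(M) * rot(v, d), using only element-wise
    multiplications and rotations — the operations a packed
    (encrypted) vector supports."""
    n = len(v)
    y = np.zeros(n)
    for d in range(n):
        diag_d = np.array([M[i, (i + d) % n] for i in range(n)])  # d-th generalized diagonal
        y += diag_d * np.roll(v, -d)                               # v rotated by d positions
    return y

M = np.arange(16, dtype=float).reshape(4, 4)
v = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(diag_matvec(M, v), M @ v)   # matches the ordinary product
```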
Extended MapReduce for Secure Distributed Learning
14
(Figure: the secure MapReduce workflow; every step below must be parameterized.)
MAP:
Each DPi locally trains on its data and uses the encrypted global model ⟨WG⟩ to update its encrypted local model ⟨Wi⟩
COMBINE:
All ⟨Wi⟩ are combined
REDUCE:
DP1 updates the encrypted global model: ⟨WG⟩ = f(∑ ⟨Wi⟩, ⟨WG(iter-1)⟩)
PREDICTION:
DPi uses ⟨WG⟩ to predict on new data
Parameterization:
- Input dimensions
- Cryptographic/security parameters
- Approximation parameters (see the sketch below)
- Learning parameters
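To illustrate what the approximation parameters control, here is a small cleartext sketch of a least-squares polynomial approximation of the sigmoid activation (the degree and interval below are hypothetical choices; SPINDLE evaluates such a polynomial homomorphically, which is not shown here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Homomorphic schemes such as CKKS only evaluate polynomials, so the
# activation is replaced by a low-degree least-squares fit on a bounded interval.
degree, interval = 5, 8.0                      # hypothetical approximation parameters
xs = np.linspace(-interval, interval, 1000)
coeffs = np.polyfit(xs, sigmoid(xs), degree)   # least-squares polynomial fit
approx = np.poly1d(coeffs)

max_err = np.max(np.abs(approx(xs) - sigmoid(xs)))
print(f"max error on [-{interval}, {interval}]: {max_err:.3f}")
```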
SPINDLE Evaluation: Accuracy
SPINDLE achieves accuracy close to the centralized solution and (almost) the same accuracy as non-secure distributed solutions
15
(Plots: Logistic regression, L.R. one-vs-all, Multinomial regression)
Evaluation parameters: 10 data providers, 128-bit security level
Legend
Dataset: Name [#samples x #features]
(1) Pima = Pima Indians Diabetes, https://www.kaggle.com/uciml/pima-indians-diabetes-database
(2) BCW = Breast Cancer Wisconsin (Original), https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
(3,4) MNIST
Y. LeCun and C. Cortes. Handwritten digit database. 2010.
SPINDLE Evaluation: Performance
16
5 data providers, 25,600 records, 128-bit security; each data provider: 2 Intel Xeon E5-2680 v3 CPUs, 2.5GHz frequency, 24 threads on 12 cores, 256GB RAM. Communication: 100Mbps, 20ms delay
Better than logarithmic increase with the number of features
SPINDLE Evaluation: Performance
17
128-bit security level; default number of features = 32; |S| = number of data providers; n = global dataset size; b = batch size used in the stochastic gradient descent. One data provider: 2 Intel Xeon E5-2680 v3 CPUs, 2.5GHz frequency, 24 threads on 12 cores, 256GB RAM. Communication: 100Mbps, 20ms delay
Scales almost independently of the number of data providers |S|
Scales linearly with the size n of the data providers' datasets
Efficient workload distribution
Conclusion
18
Future Work: build on our widely-applicable solution to …
lds.epfl.ch