1 of 31

Machine Learning in Dark Mode

Federated Learning and Data Privacy

Michael Tang ’24

2 of 31

Federated Learning

3 of 31

What is federated learning (FL)?

  • Introduced by Google researchers in 2016
  • Central server, decentralized training data (i.e., data stays on your device)
  • Challenges: local data is unbalanced and non-IID; communication bandwidth is limited
  • Sits at the intersection of cryptography, databases, and machine learning
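The canonical algorithm for this setting is Federated Averaging (FedAvg, reference 2 in Further Reading): clients take a few local SGD steps, and the server averages the resulting models weighted by local dataset size. A minimal pure-Python sketch on a toy linear model — the data, client sizes, and hyperparameters are all hypothetical:

```python
import random

random.seed(0)

# Hypothetical toy setup: each client fits y = w*x on its own local data;
# dataset sizes are deliberately unbalanced, as in real FL deployments.
def make_client(n):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [3.0 * x + random.gauss(0, 0.1) for x in xs]
    return xs, ys

clients = [make_client(n) for n in (20, 50, 130)]

w_global = 0.0
for _ in range(30):                     # communication rounds
    updates = []
    for xs, ys in clients:
        w = w_global
        for _ in range(5):              # local SGD steps, on-device
            grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
            w -= 0.1 * grad
        updates.append((w, len(xs)))
    # Server: average client models weighted by local dataset size (FedAvg)
    total = sum(n for _, n in updates)
    w_global = sum(w * n for w, n in updates) / total

print(w_global)  # converges near the true slope 3.0
```

Only model parameters cross the network; the raw (x, y) pairs never leave the client.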

4 of 31

(Google Research 2017)

5 of 31

(Google Research 2021)

6 of 31

7 of 31

8 of 31

9 of 31

10 of 31

Federated learning in 2022

  • FATE
  • Substra
  • PySyft
  • TensorFlow Federated
  • IBM Federated Learning
  • NVIDIA Clara

11 of 31

Key FL challenges

  • Communication is expensive
  • Systems heterogeneity (i.e., devices differ in hardware, network, and power)
  • Statistical heterogeneity (i.e., users behave very differently, so local data is non-IID)
    • One direction: personalized modeling
  • Data privacy

12 of 31

Data Privacy

13 of 31

General Data Protection Regulation (GDPR)

  1. Lawfulness
  2. Fairness and transparency
  3. Purpose limitation
  4. Data minimization
  5. Accuracy
  6. Storage limitation
  7. Integrity and confidentiality
  8. Accountability

14 of 31

General Data Protection Regulation (GDPR)

  • Lawfulness
  • Fairness and transparency
  • Purpose limitation
  • Data minimization
  • Accuracy
  • Storage limitation
  • Integrity and confidentiality
  • Accountability

“Your cybersecurity measures need to be appropriate to the size and use of your network and information systems”

“You should identify the minimum amount of personal data you need to fulfil your purpose”

15 of 31

16 of 31

17 of 31

Data anonymization

  • Goal: protect against linkage attacks
  • Techniques
    • Generalization and suppression
    • Anatomization
    • Perturbation (the technique most closely associated with differential privacy)
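
Generalization and suppression can be made concrete with a k-anonymity check — a sketch using hypothetical medical records, where age and ZIP code are the quasi-identifiers a linkage attack would exploit:

```python
from collections import Counter

# Hypothetical records: (age, zip, diagnosis). Age and ZIP are quasi-identifiers
# that could be joined against a public dataset (e.g., voter rolls).
records = [
    (34, "02139", "flu"), (37, "02141", "flu"),
    (52, "02139", "asthma"), (58, "02144", "flu"),
]

def generalize(rec):
    age, zipcode, diagnosis = rec
    # Generalize age into decades; suppress the last two ZIP digits.
    decade = 10 * (age // 10)
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(rows, k):
    # Every quasi-identifier combination must appear at least k times.
    counts = Counter(row[:2] for row in rows)
    return all(c >= k for c in counts.values())

generalized = [generalize(r) for r in records]
print(is_k_anonymous(records, 2), is_k_anonymous(generalized, 2))  # False True
```

After generalization, each (age range, ZIP prefix) pair covers at least two people, so no single record can be linked.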

18 of 31

Linking

19 of 31

Anatomization
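
The idea of anatomization: instead of coarsening values, publish the quasi-identifiers and the sensitive attribute in two separate tables linked only by a group id. A sketch with the same hypothetical records as before:

```python
# Hypothetical records: (age, zip, diagnosis).
records = [
    (34, "02139", "flu"), (37, "02141", "flu"),
    (52, "02139", "asthma"), (58, "02144", "flu"),
]

# Illustrative grouping: two records per group.
groups = {0: records[:2], 1: records[2:]}

# Table 1: quasi-identifiers, published at full precision.
qi_table = [(gid, age, zipcode)
            for gid, rows in groups.items()
            for age, zipcode, _ in rows]

# Table 2: sensitive values, linked only by group id.
sensitive_table = [(gid, diagnosis)
                   for gid, rows in groups.items()
                   for _, _, diagnosis in rows]

# Within a group, an attacker cannot tell which diagnosis goes with which row.
print(qi_table)
print(sensitive_table)
```

Utility is preserved (exact ages and ZIPs survive) while the join between a person and their diagnosis is severed inside each group.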

20 of 31

Perturbation

  • Data swapping
  • Additive noise (case study: the 2020 U.S. Census)
  • Synthetic data generation
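
Additive noise is formalized by the Laplace mechanism from differential privacy: a counting query changes by at most 1 when one record changes (sensitivity 1), so adding Laplace(1/ε) noise gives ε-differential privacy. A sketch with hypothetical data — the dataset, query, and ε are illustrative:

```python
import math
import random

random.seed(7)

# Hypothetical data: ages of 10,000 individuals.
ages = [random.randint(18, 90) for _ in range(10_000)]

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def dp_count(data, predicate, epsilon):
    # Counting queries have sensitivity 1, so Laplace(1/epsilon)
    # noise suffices for epsilon-differential privacy.
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

noisy = dp_count(ages, lambda a: a >= 65, epsilon=0.5)
print(round(noisy))  # close to the exact count, perturbed by the noise
```

Smaller ε means a stronger privacy guarantee but a noisier answer — the same tradeoff the 2020 Census had to navigate.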

21 of 31

22 of 31

23 of 31

Secure multi-party computation (SMC)

  • Shamir's Secret Sharing demo
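
A minimal version of the demo — Shamir's (k, n) secret sharing over a prime field: the secret is the constant term of a random degree-(k−1) polynomial, each share is a point on it, and any k shares recover the secret by Lagrange interpolation at x = 0 (fewer than k reveal nothing). The prime and parameters below are illustrative:

```python
import random

P = 2**61 - 1  # a Mersenne prime; all arithmetic is in GF(P)

def make_shares(secret, k, n):
    # Random degree-(k-1) polynomial with the secret as constant term.
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation evaluated at x = 0.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

random.seed(1)
shares = make_shares(42, k=3, n=5)
print(reconstruct(shares[:3]))  # → 42; any 3 of the 5 shares recover the secret
```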

24 of 31

Homomorphic encryption
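
For instance, the Paillier cryptosystem is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server can aggregate encrypted updates without ever decrypting them. A toy sketch with deliberately tiny primes — a real deployment would use ≥2048-bit keys and a vetted library:

```python
import math
import random

random.seed(3)

# Toy Paillier keypair. These small primes are for illustration only.
p, q = 10007, 10009
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # valid precisely because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # L(x) = (x - 1) // n, then multiply by mu mod n.
    return (pow(c, lam, n2) - 1) // n * mu % n

a, b = encrypt(20), encrypt(22)
# Multiplying ciphertexts adds the underlying plaintexts — no decryption needed.
print(decrypt(a * b % n2))  # → 42
```

Fresh randomness r makes encryptions of the same plaintext look different, so the server learns nothing from repeated values.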

25 of 31

FL challenges

  • Inference attacks
  • Poisoning attacks
  • Malicious coordination server
    • Passive vs. active
  • Secure communication medium

26 of 31

Inference attacks

27 of 31

Poisoning attacks

28 of 31

FL challenges

  • Inference attacks → use SMC, e.g. secure aggregation
  • Poisoning attacks → anomaly detection? no robust defense yet
  • Malicious coordination server
    • Passive vs. active
  • Secure communication medium
  • Integrate differential privacy techniques at batch and user level

29 of 31

Secure Aggregation
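
The core idea of secure aggregation is pairwise additive masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so every individual upload looks random but the masks cancel in the server's sum. A toy sketch with hypothetical integer updates (real protocols also handle dropouts and use key agreement to derive the masks):

```python
import random

random.seed(5)

M = 2**32               # all arithmetic is mod M
updates = [7, 11, 24]   # each client's private model update (hypothetical)
n = len(updates)

# One shared random mask per client pair (i, j), i < j.
pair_masks = {(i, j): random.randrange(M)
              for i in range(n) for j in range(i + 1, n)}

def masked_upload(i):
    # Client i adds masks where it is the smaller index, subtracts otherwise.
    x = updates[i]
    for (a, b), s in pair_masks.items():
        if a == i:
            x = (x + s) % M
        elif b == i:
            x = (x - s) % M
    return x

uploads = [masked_upload(i) for i in range(n)]
# Every +s is matched by a -s, so the masks vanish in the aggregate.
print(sum(uploads) % M)  # → 42, i.e., sum(updates), yet no upload reveals its update
```

This directly addresses the inference-attack mitigation above: the server sees only the aggregate, never any single client's update.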

30 of 31

Outlook

  • Data privacy is a top concern
    • Facebook (now Meta) and the Cambridge Analytica scandal
    • Contact tracing
    • Apple CSAM delays
  • FL aligns naturally with GDPR principles such as data minimization
  • Further room for growth
    • SMC, differential privacy, encrypted transfer learning
    • Transparency issues — the privacy-accuracy-interpretability tradeoff

31 of 31

Further Reading

  1. https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
  2. https://arxiv.org/abs/1602.05629
  3. https://eprint.iacr.org/2017/281
  4. https://arxiv.org/abs/1912.04977 — highly recommended
  5. https://arxiv.org/abs/2003.02133
  6. https://arxiv.org/pdf/1908.07873.pdf
  7. https://arxiv.org/pdf/1710.06963.pdf — for folks interested in the technical ML
  8. https://arxiv.org/pdf/2011.05411.pdf — from the GDPR perspective