1 of 42

Collaborative and private AI model training to combat algorithmic bias

Val & Yuriko, Community Privacy Residency 2025

2 of 42

What is federated learning or client-side model training?

In traditional machine learning, all data must be centralized in one database before training a model.

Federated learning (often referred to as collaborative learning) is a decentralized approach that allows multiple data owners to jointly train a machine learning model without sharing their raw datasets. To further strengthen privacy, multi-party computation can be layered on top for secure communication and computation during model training.
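For intuition, here is a minimal sketch of the federated averaging idea. The function names and the toy linear model are illustrative assumptions, not a reference to any particular framework: each data owner computes an update on its own data, and only those updates leave the device.

    # A minimal, illustrative sketch of federated averaging (FedAvg).
    # Names and the toy linear model are hypothetical; real deployments use
    # frameworks such as Flower or TensorFlow Federated, often with secure
    # aggregation layered on top.
    import numpy as np

    def local_update(global_weights, X, y, lr=0.1, epochs=5):
        """Each data owner trains on its own data; raw data never leaves the client."""
        w = global_weights.copy()
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)  # gradient of squared error for a linear model
            w -= lr * grad
        return w, len(y)

    def federated_round(global_weights, clients):
        """The coordinator averages client updates, weighted by local dataset size."""
        updates = [local_update(global_weights, X, y) for X, y in clients]
        total = sum(n for _, n in updates)
        return sum(w * (n / total) for w, n in updates)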

3 of 42

4 of 42

What makes a good use case for federated learning?

  1. Data Privacy & Security Concerns
    • When data is sensitive, such as:
      • Healthcare data (patient records, medical imaging).
      • Financial data (bank transactions, credit scores).
      • Personal communications (private messages, voice data).
  2. Data is Naturally Distributed Across Many Devices or Locations
    • When data is generated and stored across multiple devices, servers, or institutions.
  3. Collaboration Between Multiple Organizations Without Centralizing Data
    • Useful for when independent institutions want to collaborate on improving models without revealing proprietary data.

5 of 42

Existing use cases for federated learning

  • Healthcare & Biomedical Research
    • Medical Imaging: Hospitals train AI models on patient scans (e.g., MRIs, CT scans) without sharing sensitive patient data. Example: NVIDIA Clara and the Federated Tumor Segmentation (FeTS) challenge.
  • Finance & Banking
    • Credit Scoring & Risk Assessment: Federated learning allows financial institutions to improve credit risk models without exposing customer data.
  • Retail & E-Commerce
    • Personalized Recommendations: Retailers refine recommendation algorithms across devices and locations while preserving user privacy.
  • Cybersecurity & Threat Detection
    • Malware Detection: Organizations train AI models across different systems to detect malware patterns without exposing sensitive security logs.

6 of 42

Going to bed after a fun day hanging out with all my new friends from the Community Privacy Residency

Why can’t I stop thinking about collaborative & private model training for mitigating algorithmic bias?!?!!?!?!

7 of 42

8 of 42

9 of 42

10 of 42

11 of 42

A primary cause of algorithmic bias is a lack of diversity in training data

“Because the algorithm used the results of its own predictions to improve its accuracy, it got stuck in a pattern of sexism against female candidates.”

12 of 42

Types of Bias in Algorithms

👎🏻 Sexism: Bias against women and marginalized genders

👎🏼 Racism: Bias or discrimination against racial minorities or BIPOC folks

👎🏽 Homophobia & Transphobia: Bias against LGBTQIA+ community

👎🏾 Ableism: Bias against folks with disabilities

👎🏿 Fatphobia: Bias against fat people

👎🏻 Whorephobia: Bias against whores / sex workers

👎🏼 Language Bias: Bias against people who speak with non-dominant accents, dialects, or languages

13 of 42

Why is mitigating algorithmic bias a good use case for federated learning?

Clear incentives to collaborate on both sides, paired with a need to keep the underlying data private

By collaborating, organizations can pool more diverse datasets from a variety of sources, which helps ensure that the model learns from a wide range of experiences and perspectives. This diversity is crucial because bias often arises when a model is trained on a dataset that is too homogeneous or unrepresentative of the wider population.

When sensitive data is used to train models, individuals may be hesitant to participate due to privacy concerns. This can lead to an underrepresentation of certain groups. With federated learning, users can contribute data without exposing it, which encourages wider participation.
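As a rough illustration, here is a hedged sketch of a per-group accuracy check; the (features, label, group) tuples and the model's predict() interface are assumptions made for the example. The gap this measures is exactly the kind of disparity that pooling more diverse (but still private) data sources is meant to shrink.

    # A hedged sketch of a per-group accuracy check. The (features, label, group)
    # tuples and the model's predict() interface are illustrative assumptions.
    from collections import defaultdict

    def per_group_accuracy(model, examples):
        """examples: iterable of (features, true_label, group) tuples."""
        correct, total = defaultdict(int), defaultdict(int)
        for features, label, group in examples:
            total[group] += 1
            correct[group] += int(model.predict(features) == label)
        return {g: correct[g] / total[g] for g in total}

    def accuracy_gap(model, examples):
        """Largest accuracy difference between any two groups (smaller is fairer)."""
        acc = per_group_accuracy(model, examples)
        return max(acc.values()) - min(acc.values())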

14 of 42

15 of 42

Regulatory Pressure, e.g. the EU AI Act

The EU AI Act, passed in 2024, is the world’s most comprehensive AI regulation so far. It requires mandatory bias and risk assessments for AI systems in high-risk areas like hiring, healthcare, policing, and banking.

Companies must demonstrate that their training data is representative and fair.

Non-compliance can result in fines of up to 7% of global annual turnover for the most serious violations.

This is the first time that legal liability for algorithmic bias is being directly tied to corporate revenue.

16 of 42

17 of 42

Non-profit and Business (For-profit)

Example: Hiring algorithm

Two potential collaborating entities:

  1. Indeed.com (Job search platform) : Private company
  2. Community-based job placement center: Nonprofit

Datasets from the local community-based job placement center can be used to help correct Indeed’s model so that it stops discriminating against job-seekers from under-represented communities. But these sensitive datasets need to be kept private from the other entity.

18 of 42

Non-profit and Business (For-profit)

Example: Content moderation & Detection algorithms

Two potential collaborating entities:

  • Instagram (🤮) or Bluesky : For-profit
  • Lips.social : Nonprofit

Datasets from Lips can be used to help correct Instagram’s model so that it stops discriminating against queer people and sex workers. But the Lips and Instagram datasets need to be kept private from each other, especially in the case of sex workers, whose privacy must be protected!

19 of 42

Alternative Data Governance Approaches

  • Data co-ops (e.g. The Movement Cooperative)
  • Personal, local AI (Kwaai.ai)
  • Community-governed AI (Lips.social, Metagov KOI)

20 of 42

21 of 42

Areas for further research

  • Researching potential partner entities for collaborative model training, and speaking with them about what data they need, what their model is good for, why we might want to collaborate (or why not), etc.
  • Participatory data governance
    • Data usage education
    • Consent interfaces
  • Fleshing out the data coalition / data co-op / data collective model: who governs the datasets?
  • Interviews!

22 of 42

Interviews!

✅ Joseph Lacey (SysOps for nonprofits & vulnerable communities)

✅ Rohini (Researcher & Technologist, Non-consensual image abuse expert)

✅ Elo (Applied cryptography for private credentials)

✅ Rudy Fraser (Developer of BlackSky Algorithms - Black creator feeds for Bluesky)

✅ Annie Brown (Founder of Reliabl.ai, Algorithmic bias researcher)

✅ Duncan McElfresh (Machine learning engineer)

✅ io (Cybersecurity expert for vulnerable communities)

✅ Josh Tan (Public AI advocate)

✅ Luke Miller (Community governed AI Developer)

✅ Josue Guillen (Texas Organizing Project, The Movement Co-operative)

💚 Sonam Jindal (Partnership on AI, met at RightsCon)

💚 Dr. Carolina Are (Blogger on a Pole, Algorithmic bias researcher & sex worker)

23 of 42

24 of 42

The more I learn, the more questions I have :)

Therefore, this presentation is a smattering of random but relevant rabbit holes I went down this week that get me both closer to and farther from an understanding of potential use cases of private and collaborative AI model training.

Let’s go!

25 of 42

26 of 42

Existing Federated Learning Use Cases

  • Smartphone keyboards: predictive text and autocorrect, voice assistants for speech recognition, biometric authentication (Big Tech)
    • Healthcare: training on datasets from different hospitals, research centers, etc. enables models to recognize rare diseases and improve diagnostic accuracy across diverse populations (Big Healthcare)
    • Self-driving cars: real-time learning (Big Cars)

27 of 42

Private & Federated Learning in Healthcare

28 of 42

29 of 42

Examples of algorithms where these biases are commonly found:

  • Determining evictions
  • Hiring
  • Calculating interest rates on loans
  • Determining whether or not a payment processor will work with your business
  • Content recommendation / moderation
  • Determining bond amounts
  • Social services eligibility
  • Insurance policies
  • Period prediction
  • etc.

30 of 42

“To be included or not be included, THAT is the question…”

31 of 42

Government AI Procurement

“When government AI systems base their determinations on biased data, their outputs can perpetuate harmful biases and strip marginalized beneficiaries of the government benefits they deserve.” - Outsourced and Automated

32 of 42

The report notes that governments are increasingly adopting AI systems due to growing pressure to meet demand for public and social services (a result of austerity)...

33 of 42

Further research into anti-discrimination laws in domains where AI is common:

  • The Patient Protection and Affordable Care Act (“ACA”) prohibits insurers from discriminating on the basis of health status.
  • The Genetic Information Nondiscrimination Act (“GINA”) prohibits discrimination by covered health insurers (and employers, who often provide health insurance) on the basis of genetic information.
  • Employers are prohibited from considering sex, race, age, and disability in hiring decisions, even though these factors can be directly predictive of neutral objectives, like maximizing employee hours worked or total sales.

34 of 42

Proxy Discrimination and the Limits of Legal Anti-Discrimination Law

“The continued evolution of AI and big data will cause proxy discrimination to increase substantially whenever anti-discrimination law seeks to prohibit the use of characteristics that are directly predictive of risk…

For these reasons, anti-discrimination laws that prohibit discrimination based on directly predictive characteristics must adapt to combat proxy discrimination in the age of AI and big data.”

35 of 42

Key Takeaways

  • AI is not inevitable
    • There are plenty of use cases where no model is the best solution
    • The first questions should be: what is the model for? who owns it? what is it being built to do? by whom, for whom?
  • The privacy needs of different communities vary (surprise, surprise!)
    • Human and organizational trust is sometimes enough to justify model training on private data (e.g. healthcare)
    • Emerging legal frameworks for trusted third parties that protect users (e.g. the GLIAnet Alliance)
  • When DWeb WINS… with greater data sovereignty comes greater model sovereignty
    • Data sovereignty: Individual and community-controlled data governance
    • Agency > privacy: Agency means having a choice between various providers and having access to the relevant information needed to make an informed, educated decision
    • Once individuals and communities regain greater control over our data, the possibilities for local AI and collaborative model training open wide.
  • Federated learning can exist with and without cryptographic privacy guarantees
    • The level of privacy that cryptographic methods ensure is a great fit for highly sensitive and vulnerable data (e.g. period trackers, abortion and reproductive justice databases), and might not make sense for other use cases (see the secure aggregation sketch after this list).
    • Continued development will rely on lots more computational power, hardware upgrades, etc.
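For a concrete sense of what those cryptographic guarantees can look like, below is a much-simplified sketch of secure aggregation with pairwise additive masks. Real protocols (e.g. Bonawitz et al., 2017) add key agreement, dropout handling, and finite-field arithmetic, so treat this strictly as intuition.

    # A much-simplified sketch of secure aggregation via pairwise additive masks.
    # Everything here is illustrative; production protocols also handle client
    # dropouts, key agreement, and arithmetic over a finite field.
    import numpy as np

    def masked_updates(updates, rng=None):
        """Each pair of clients shares a mask that cancels only in the full sum."""
        rng = rng or np.random.default_rng()
        masked = [u.astype(float).copy() for u in updates]
        for i in range(len(updates)):
            for j in range(i + 1, len(updates)):
                mask = rng.normal(size=updates[i].shape)
                masked[i] += mask  # client i adds the pairwise mask
                masked[j] -= mask  # client j subtracts the same mask
        return masked              # each masked update alone looks like noise

    def secure_sum(masked):
        return sum(masked)         # the masks cancel in the aggregate

    # Example: the coordinator learns only the sum of three clients' updates.
    clients = [np.array([1.0, 2.0]), np.array([0.5, 0.5]), np.array([2.0, 1.0])]
    assert np.allclose(secure_sum(masked_updates(clients)), sum(clients))

In the real protocol the pairwise masks are derived from key exchange between clients, so the coordinator never sees the raw updates or the masks themselves.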

36 of 42

Model or No Model?

👍🏻 LLMs (ideally public AI)

👍🏼 Content moderation (automatic image detection for extreme content no human should have to review/be exposed to)

👍🏽 Recommendation models for information or products

👍🏾 Healthcare: disease prediction

👍🏿 Decentralized personal AI assistants (digital twins)

*In these examples, we may want to use AI and therefore be invested in making these systems more inclusive, fair, and accurate.

👎🏻 Social services determinations

👎🏾 Facial or body recognition systems for surveillance

👎🏿 Income, employment, or credit verification

*For these examples, we might think about how relying exclusively on manual processes also enables bias; solutions will likely require some combination of manual review and fairness-ensuring automation.

37 of 42

Emerging Legal Frameworks

38 of 42

Emerging Community Data Governance Protocols & Practices

39 of 42

Areas for further research

  • Framework for understanding privacy in AI partnerships & procurement
  • Deep Dive: Content Moderation Use Case
    • Lips and BlackSky collaborative model training
    • Interview other potential collaborators for content moderation system-building to support marginalized communities and combat algorithmic bias
  • Data sovereignty & governance movement must WIN!
    • Building data co-ops, coalitions, trusts, etc.

40 of 42

Thoughts? Questions? Concerns?

41 of 42

Appendix

42 of 42