Measuring Fairness

Machine Learning in Production

Machine Learning in Production/AI Engineering · Claire Le Goues & Austin Henley · Spring 2025


Diving into Fairness...


Reading

Required:

Recommended:


Learning Goals

  • Understand different definitions of fairness
  • Discuss methods for measuring fairness
  • Outline interventions to improve fairness at the model level


Fairness: Measurements

How do we measure fairness of an ML model?


Fairness is still an actively studied & disputed concept!


Fairness: Measurements

  • Anti-classification (fairness through blindness)
  • Group fairness (independence)
  • Equalized odds (separation)
  • ...and numerous others and variations!


Running Example: Mortgage Applications

  • Large loans repaid over long periods
  • Home ownership is key path to build generational wealth
  • Past decisions often discriminatory (redlining)
  • Replace biased human decisions with an accurate ML model, using:
    • income, other debt, home value
    • past debt and payment behavior (credit score)


What is fair in mortgage applications?


What is fair in university admissions?


Recall: What is fair?

Fairness discourse asks questions about how to treat people and whether treating different groups of people differently is ethical. If two groups of people are systematically treated differently, this is often considered unfair.


Recall: What is fair?


What is fair in mortgage applications?

  1. Distribute loans equally across all groups of protected attribute(s) (e.g., ethnicity)
  2. Prioritize those who are more likely to pay back (e.g., higher income, good credit history)
  3. Prioritize those who are more in need

...


How mortgage lending pisses people off: Redlining

Withhold services (e.g., mortgage, education, retail) from people in neighborhoods deemed "risky"

Map of Philadelphia, 1936, Home Owners' Loan Corporation (HOLC)

  • Classification based on estimated "riskiness" of loans


How mortgage lending pisses people off: Past bias


Caveat on Intersectionality

Individuals can and do fall into multiple groups!

Subgroup fairness quickly becomes technically complicated.

For the purposes of this class, we therefore focus on the simple cases.


Fairness: Measurements

  • Anti-classification (fairness through blindness)
  • Group fairness (independence)
  • Equalized odds (separation)
  • ...and numerous others and variations!


Anti-Classification

  • Also called fairness through blindness or fairness through unawareness
  • Ignore certain sensitive attributes when making a decision
  • Example: Remove gender and race from mortgage model


Anti-Classification: Example

"After Ms. Horton removed all signs of Blackness, a second appraisal valued a Jacksonville home owned by her and her husband, Alex Horton, at 40 percent higher."


Anti-Classification

Easy to implement, but any limitations?


Recall: Proxies

Features correlate with protected attributes


Also, recall: Not all discrimination is harmful

  • Loan lending: Gender discrimination is illegal.
  • Medical domains: Gender-specificity may be desirable (e.g., women generally pay less for life insurance than men, since they tend to live longer).


Anti-Classification

  • Ignore certain sensitive attributes when making a decision
  • Advantage: Easy to implement and test
  • Limitations
    • Sensitive attributes may be correlated with other features
    • Some ML tasks need sensitive attributes (e.g., medical diagnosis)


Ensuring Anti-Classification

How to train models that are fair w.r.t. anti-classification?


Ensuring Anti-Classification

How to train models that are fair w.r.t. anti-classification?

  • Simply remove features for protected attributes from training and inference data
  • Null/randomize protected attribute during inference

(does not account for correlated attributes, is not required to)
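A minimal sketch of the first option, assuming a tabular pandas/scikit-learn setup; the column names (gender, race, approved), the numeric features, and the mortgage.csv file are hypothetical:

```python
# Sketch only: enforce anti-classification by construction.
# Assumes a DataFrame from a hypothetical "mortgage.csv" with numeric feature
# columns, protected columns "gender"/"race", and a label column "approved".
import pandas as pd
from sklearn.linear_model import LogisticRegression

PROTECTED = ["gender", "race"]

df = pd.read_csv("mortgage.csv")
X = df.drop(columns=PROTECTED + ["approved"])   # the model never sees protected attributes
y = df["approved"]
model = LogisticRegression(max_iter=1000).fit(X, y)

def predict(applicants: pd.DataFrame):
    # Apply the same column filter at inference time.
    return model.predict(applicants.drop(columns=PROTECTED, errors="ignore"))
```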


Anti-Classification Example


Testing Anti-Classification

How do we test that a classifier achieves anti-classification?


Testing Anti-Classification

Straightforward invariant for classifier f and protected attribute p: the prediction must not change when only p changes,

∀x. f(x[p ← 0]) = f(x[p ← 1])

(does not account for correlated attributes, is not required to)

  • Test with any test data, e.g., purely random data or existing test data
  • Any single inconsistency shows that the protected attribute was used. Can also report percentage of inconsistencies.
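A sketch of such a consistency test: flip the protected attribute and check that predictions do not change. The prediction function, the column name gender, and the 0/1 encoding are assumptions:

```python
import numpy as np
import pandas as pd

def anti_classification_violations(predict, X: pd.DataFrame, protected: str = "gender"):
    """Fraction of inputs whose prediction changes when only the protected
    attribute is flipped (assumes `protected` is encoded as 0/1)."""
    flipped = X.copy()
    flipped[protected] = 1 - flipped[protected]
    changed = predict(X) != predict(flipped)
    return float(np.mean(changed))  # any value > 0 shows the protected attribute was used
```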


Breakout: Cancer Prognosis

In groups, post to #lecture tagging members:

Does the model meet anti-classification fairness w.r.t. gender?

Write your calculation and reasoning!


Anti-Classification: Discussion

Testing for anti-classification is rarely needed, because it is easy to ensure by construction during training or inference!

  • Anti-classification is a good starting point to think about protected attributes
  • Useful baseline for comparison
  • Easy to implement, but only effective if (1) no proxies among features and (2) protected attributes add no predictive power


Fairness: Measurements

  • Anti-classification (fairness through blindness)
  • Group fairness (independence)
  • Equalized odds (separation)
  • ...and numerous others and variations!


Group fairness

Key idea: Outcomes matter, not accuracy!

Compare outcomes across two groups

  • Similar rates of accepted loans across racial/gender groups?
  • Similar chance of being hired/promoted between gender groups?
  • Similar rates of (predicted) recidivism across racial groups?


Disparate impact vs. disparate treatment

Disparate treatment: Practices or rules that treat a certain protected group(s) differently from others

  • e.g., Apply different mortgage rules for people from different backgrounds

Disparate impact: Neutral rules, but outcome is worse for one or more protected groups

  • Same rules are applied, but certain groups have a harder time obtaining mortgage in a particular neighborhood


Group fairness in discrimination law

Relates to disparate impact and the four-fifths rule

Can sue organizations for discrimination if they

  • mostly reject job applications from one minority group (identified by protected classes) and hire mostly from another
  • reject most loans from one minority group and more frequently accept applicants from another

Four-fifths rule: If the selection rate for a protected group is less than 80% of the selection rate for the group with the highest selection rate, there is adverse impact.
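As a sketch, the four-fifths check reduces to comparing per-group selection rates; the column names here are hypothetical:

```python
import pandas as pd

def violates_four_fifths(df: pd.DataFrame, group_col: str, selected_col: str) -> bool:
    """True if some group's selection rate is below 80% of the highest group's rate."""
    rates = df.groupby(group_col)[selected_col].mean()   # P[selected = 1 | group]
    return rates.min() / rates.max() < 0.8

# Worked example: group a hires 50 of 100 applicants (0.50), group b hires 30 of 100 (0.30);
# 0.30 / 0.50 = 0.6 < 0.8, so the rule flags adverse impact.
```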


Notation

  • A: protected attribute (e.g., gender, race)
  • Y: actual outcome (target variable)
  • Y': predicted outcome
  • R: score produced by the model (e.g., predicted probability of the positive outcome)


Group Fairness

P[Y' = 1 | A = a] = P[Y' = 1 | A = b]

Statistical property of independence: Y' ⊥ A

  • The rate of positive predictions (e.g., accepted loans) must be the same across all groups


Group Fairness Limitations

What are limitations of group fairness?


Group Fairness Limitations

  • Ignores possible correlation between Y and A
    • Rules out perfect predictor Y' = Y when Y & A are correlated!
  • Permits abuse and laziness: Can be satisfied by randomly assigning a positive outcome (Y' = 1) to protected groups
    • e.g., Randomly promote people (regardless of their job performance) to match the rate across all groups


Adjusting Thresholds for Group Fairness


Group Fairness Example


Adjusting Thresholds for Group Fairness

Mortgage application: P[R > 0.6 | A = 0] = P[R > 0.8 | A = 1]

Wouldn't group A = 1 argue it's unfair? When does this type of adjustment make sense?
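One way to make the adjustment concrete, as a sketch: choose a separate cutoff on the score R per group so that all groups end up with the same acceptance rate (the target rate and variable names are assumptions):

```python
import numpy as np

def per_group_thresholds(scores: np.ndarray, groups: np.ndarray, target_rate: float = 0.3):
    """For each group, the (1 - target_rate) quantile of its scores, i.e., the
    cutoff at which exactly `target_rate` of that group is accepted."""
    return {g: float(np.quantile(scores[groups == g], 1 - target_rate))
            for g in np.unique(groups)}

# A group with systematically lower scores receives a lower cutoff
# (e.g., 0.6 vs. 0.8 in the slide's example), which is what group A = 1 may contest.
```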


Testing Group Fairness

How would you test whether a classifier achieves group fairness?


Testing Group Fairness

Collect realistic, representative data (not randomly generated!)

  • Use existing validation/test data
  • Monitor production data
  • (Somehow) generate realistic test data, e.g., from the probability distribution of the population

Separately measure the rate of positive predictions

  • e.g., P[promoted = 1 | gender = M], P[promoted = 1 | gender = F] = ?

Report issue if the rates differ beyond some threshold 𝜖 across groups
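A sketch of this check over representative data with predictions recorded; the column names and the tolerance are assumptions:

```python
import pandas as pd

def group_fairness_gap(df: pd.DataFrame, group_col: str, pred_col: str) -> float:
    """Largest difference in the positive-prediction rate P[Y' = 1 | A = a] across groups."""
    rates = df.groupby(group_col)[pred_col].mean()
    return float(rates.max() - rates.min())

# e.g., flag an issue if group_fairness_gap(test_df, "gender", "promoted") > 0.05
```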


Breakout (cont'd): Cancer Prognosis

In groups, post to #lecture tagging members:

  • ~Does the model meet anti-classification fairness w.r.t. gender?~
  • Does the model meet group fairness?

P[Y' = 1 | A = a] = P[Y' = 1 | A = b]


Equalized odds

  • Anti-classification (fairness through blindness)
  • Group fairness (independence)
  • Equalized odds (separation)
  • ...and numerous others and variations!


Equalized odds

Key idea: Focus on accuracy (not outcomes) across two groups

  • Similar default rates on accepted loans across racial/gender groups?
  • Similar rate of "bad hires" and "missed stars" between gender groups?
  • Similar accuracy of predicted recidivism vs actual recidivism across racial groups?

Accuracy matters, not outcomes!


Equalized odds in discrimination law

Relates to disparate treatment

Typically, lawsuits claim that protected attributes (e.g., race, gender) were used in decisions even though they were irrelevant

  • e.g., fired over a complaint because of being Latino, whereas White employees with similar complaints were not fired

Must prove that the defendant intended to discriminate

  • Often difficult: typically relies on shifting justifications, inconsistent application of rules, or explicit remarks overheard or documented


Equalized odds

P[Y'=1∣Y=0,A=a] = P[Y'=1∣Y=0,A=b]

P[Y'=0∣Y=1,A=a] = P[Y'=0∣Y=1,A=b]

Statistical property of separation: Y' ⊥ A | Y

  • Prediction must be independent of the sensitive attribute conditional on the target variable


Review: Confusion Matrix

Can we explain separation in terms of model errors?

  • P[Y'=1∣Y=0,A=a] = P[Y'=1∣Y=0,A=b]
  • P[Y'=0∣Y=1,A=a] = P[Y'=0∣Y=1,A=b]


Separation

P[Y'=1∣Y=0,A=a] = P[Y'=1∣Y=0,A=b] (FPR parity)

P[Y'=0∣Y=1,A=a] = P[Y'=0∣Y=1,A=b] (FNR parity)

  • Y' ⊥ A | Y: Prediction must be independent of the sensitive attribute conditional on the target variable
  • i.e., All groups are susceptible to the same false positive/negative rates
  • Example: Y': Promotion decision, A: Gender of applicant, Y: Actual job performance


Equalized odds Example


Testing Separation

Requires realistic, representative test data (production telemetry or curated test data, not randomly generated data)

Separately measure false positive and false negative rates

  • e.g., for FNR, compare P[promoted = 0 | female, good employee] vs. P[promoted = 0 | male, good employee]

How is this different from testing group fairness?
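A sketch of this measurement, assuming representative data with ground-truth labels and 0/1 predictions (column names are hypothetical):

```python
import pandas as pd

def error_rates_by_group(df: pd.DataFrame, group_col: str, label_col: str, pred_col: str):
    """Per-group false positive and false negative rates; separation asks for parity."""
    rates = {}
    for g, sub in df.groupby(group_col):
        fpr = sub.loc[sub[label_col] == 0, pred_col].mean()       # P[Y' = 1 | Y = 0, A = g]
        fnr = 1 - sub.loc[sub[label_col] == 1, pred_col].mean()   # P[Y' = 0 | Y = 1, A = g]
        rates[g] = {"FPR": float(fpr), "FNR": float(fnr)}
    return rates

# Unlike the group-fairness check, this needs ground-truth labels (Y), because it
# compares error rates rather than raw positive-prediction rates.
```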


Breakout (cont'd): Cancer Prognosis

In groups, post to #lecture tagging members:

  • ~Does the model meet anti-classification fairness w.r.t. gender?~
  • ~Does the model meet group fairness?~
  • Does the model meet equalized odds?
  • Is the model fair enough to use?


Other fairness measures

  • Anti-classification (fairness through blindness)
  • Group fairness (independence)
  • Equalized odds (separation)
  • ...and numerous others and variations!


Many measures

Many measures have been proposed

  • Some are specialized for particular tasks (e.g., ranking, NLP)
  • Some consider the downstream utility of various outcomes
  • Most are similar to the three discussed here, comparing different measures derived from the confusion matrix (e.g., false positive rate, lift)


Outlook: Building Fair ML-Based Products

Next lecture: Fairness is a system-wide concern

  • Identifying and negotiating fairness requirements
  • Fairness beyond model predictions (product design, mitigations, data collection)
  • Fairness in process and teamwork, barriers and responsibilities
  • Documenting fairness at the interface
  • Monitoring
  • Promoting best practices


Summary

  • Three definitions of fairness: Anti-classification, group fairness, equalized odds
  • Tradeoffs between fairness criteria
    • What is the goal?
    • Key: how to deal with unequal starting positions
  • Improving fairness of a model
    • In all pipeline stages: data collection, data cleaning, training, inference, evaluation


Further Readings
