Part IV: Fairness in Machine Learning
Shahab Asoodeh (McMaster University)
Flavio P. Calmon (Harvard University)
Mario Diaz (Universidad Nacional Autónoma de México)
Haewon Jeong (UC Santa Barbara)
2022 IEEE International Symposium on Information Theory (ISIT)
Data-driven algorithms are increasingly applied to individual-level data to support decision-making in applications with individual-level consequences.
Recidivism prediction
Employment decisions/hiring
Personal finance
Do these algorithms discriminate based on race, sex, or another protected attribute?
Discrimination in Machine Learning
Discrimination is the prejudicial treatment of an individual based on membership in a protected group (e.g., race or sex).
US Equal Employment Opportunity Commission (EEOC)
[Barocas and Selbst’16]
Google Translate
(recorded 06/22/22)
ProPublica’16 (“Machine Bias”, on the COMPAS recidivism tool)
How can information-theoretic tools help ensure fair machine learning?
Fairness metrics widely used in the literature
A case study in applying information-theoretic tools to fair ML
Emerging aspects of fair ML
Setup
[Diagram: a model/classifier maps input features (e.g., past grades, questionnaire answers) to a predicted outcome (e.g., academic performance).]
Disparate Treatment vs. Disparate Impact
[Diagram: the classifier maps features (e.g., past grades, questionnaire answers) to a predicted outcome (e.g., academic performance); the group attribute (e.g., sex, age, race) may or may not enter the model.]
Disparate treatment (direct discrimination): occurs when the protected attributes are used directly in decision making. [Barocas and Selbst’16]
Disparate impact: group attributes are not used directly, but reliance on variables correlated with them leads to different outcome distributions for different groups. [Barocas and Selbst’16]
Changes in the input distribution across groups can therefore produce disparate impact on model performance.
Group fairness metrics
Group fairness metrics are usually defined in terms of differences between average outcomes and error rates across different populations
Building Group Fairness Metrics
[Table: confusion-matrix rates (TPR, FPR, TNR, FNR) computed separately for each population/group.]
Statistical parity: the rate of positive predictions is equal across groups. [Narayan, FAT* Tutorial 2017]
Equalized odds: TPR and FPR are equal across groups. [Hardt, Price, Srebro’16]
Equal opportunity: TPR is equal across groups. [Hardt, Price, Srebro’16]
There are (combinatorially many) other metrics and trade-offs you can derive from this table.
For trade-offs see, for example, [Lipton et al., 2018; Chouldechova, 2017; Kleinberg et al., 2016; Pleiss et al., 2017].
Important: several group fairness metrics can be written as linear constraints on the classifier. [Lemma 1 in Alghamdi et al. ISIT’20]
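To make these definitions concrete, here is a minimal sketch (mine, not from the tutorial) that computes the statistical parity, equal opportunity, and equalized odds gaps from binary labels, binary predictions, and a binary group attribute; the function names are illustrative.

```python
# A minimal sketch (not from the tutorial) of the group fairness metrics above,
# assuming binary labels, binary predictions, and a binary group attribute.
import numpy as np

def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary arrays."""
    tpr = np.mean(y_pred[y_true == 1])   # P(Yhat = 1 | Y = 1)
    fpr = np.mean(y_pred[y_true == 0])   # P(Yhat = 1 | Y = 0)
    return tpr, fpr

def group_fairness_gaps(y_true, y_pred, group):
    """Statistical parity, equal opportunity, and equalized odds gaps between two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    m0, m1 = (group == 0), (group == 1)
    sp_gap = abs(np.mean(y_pred[m0]) - np.mean(y_pred[m1]))       # statistical parity gap
    tpr0, fpr0 = rates(y_true[m0], y_pred[m0])
    tpr1, fpr1 = rates(y_true[m1], y_pred[m1])
    eo_gap = abs(tpr0 - tpr1)                                      # equal opportunity: TPR gap
    eodds_gap = max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))            # equalized odds: worst TPR/FPR gap
    return sp_gap, eo_gap, eodds_gap

# Toy usage with made-up predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(group_fairness_gaps(y_true, y_pred, group))
```

In practice each gap is compared against a tolerance rather than required to be exactly zero.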
Group fairness violations
Group (un)fairness: average performance changes when conditioned on a group attribute (e.g., race, age).
Facial recognition [Buolamwini, Gebru’18]
Healthcare [Science’18]
Education [The Guardian’20]
Ensuring group fairness (more on this later)
Pre-processing: [Hajian’13], [Zemel et al.’13], [Kamiran & Calders’12], [Hajian & Domingo-Ferrer’13], [Ruggieri’14], [C et al.’17], [Madras et al.’18], [Ghassami et al.’18], [Wang et al.’19], and more!
In-processing: [Fish et al.’16], [Zafar et al.’16], [T. Kamishima, S. Akaho & J. Sakuma’11], [A. Agarwal et al.’18]
Post-processing: [Hardt, Price, Srebro’16], [Wei, Ramamurthy, C.’20], [Menon & Williamson’18], [Yang et al.’20], [Alghamdi et al.’20]
Benchmarks: [Friedler et al.’19, “A Comparative Study of Fairness-Enhancing Interventions in Machine Learning”]
“Individual” fairness
[Dwork et al. 2011]
Key idea: “similar” individuals are treated “similarly”.
Formulation: a “Lipschitz” constraint on the classifier output: the difference between the outputs for two individuals is bounded by a measure of their similarity.
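As a concrete (and entirely illustrative) reading of that constraint, the sketch below counts pairs of individuals whose score difference exceeds L times their feature distance; the Euclidean distance and the constant L are assumptions, since choosing the similarity metric is the crux of [Dwork et al. 2011].

```python
# A minimal sketch (my illustration, not the tutorial's code) of checking the
# individual-fairness idea: |f(x) - f(x')| <= L * d(x, x') over pairs of individuals.
import numpy as np

def lipschitz_violations(scores, X, L=1.0):
    """Fraction of pairs whose score difference exceeds L times their feature distance."""
    scores, X = np.asarray(scores, dtype=float), np.asarray(X, dtype=float)
    n, violations, pairs = len(scores), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])      # similarity metric (here: Euclidean distance)
            if abs(scores[i] - scores[j]) > L * d + 1e-12:
                violations += 1
            pairs += 1
    return violations / pairs

# Toy usage: classifier scores for 4 individuals with 2 features each;
# the first two near-identical individuals receive very different scores.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
scores = [0.2, 0.8, 0.5, 0.55]
print(lipschitz_violations(scores, X, L=1.0))
```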
How can information-theoretic tools help ensure fair machine learning?
Fairness metrics widely used in the literature
A case study in applying information-theoretic tools to fair ML
Emerging aspects of fair ML
Motivation
Image is from: https://stanfordmlgroup.github.io/projects/chexnet/
[Table: each sample has features (a chest X-ray), a group attribute (sex: F, M, ...), and a label (pneumonia: N, N, ...).]
🤔 Should I use the group attribute?
Strategy 1: group-blind classifier
Training time: the classifier is trained on (features, label) pairs; the group attribute is not used.
Testing time: given the features of a new sample, the classifier outputs a predicted probability (e.g., pneumonia positive: 10%).
Strategy 2: coupled classifier
Training time: a single classifier is trained on (features, group attribute, label) triples.
Testing time: given the features and group attribute of a new sample, the classifier outputs a predicted probability (e.g., pneumonia positive: 10%).
Strategy 3: split classifiers
Training time: a separate classifier is trained for each group (e.g., one for sex: F, one for sex: M).
Testing time: a new sample is routed by its group attribute to the corresponding classifier, which outputs a predicted probability (e.g., pneumonia positive: 10% from one group's classifier, 15% from the other's).
Split classifiers are also called decoupled classifiers in [Dwork et al., 2018] and [Ustun et al., 2019].
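The sketch below (synthetic data, with scikit-learn logistic regression standing in for the X-ray model) shows one way the three strategies could be wired up; it is illustrative only.

```python
# A minimal sketch (synthetic data, illustrative only) of the three strategies above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
S = rng.integers(0, 2, size=n)                           # group attribute (e.g., sex)
X = rng.normal(size=(n, 5)) + S[:, None] * 0.5           # feature distribution depends on the group
y = (X[:, 0] + (2 * S - 1) * X[:, 1] > 0).astype(int)    # labeling rule that differs across groups

# Strategy 1: group-blind classifier (never sees S)
blind = LogisticRegression().fit(X, y)

# Strategy 2: coupled classifier (S appended as an extra feature)
Xs = np.hstack([X, S[:, None]])
coupled = LogisticRegression().fit(Xs, y)

# Strategy 3: split classifiers (one model per group)
split = {s: LogisticRegression().fit(X[S == s], y[S == s]) for s in (0, 1)}

x_new, s_new = X[:1], S[0]
print("group-blind:", blind.predict_proba(x_new)[0, 1])
print("coupled:    ", coupled.predict_proba(np.hstack([x_new, [[s_new]]]))[0, 1])
print("split:      ", split[s_new].predict_proba(x_new)[0, 1])
```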
Comparison
[Diagram comparing the three strategies (group-blind, coupled, and split classifiers) by how each uses the group attribute.]
Related work
Split classifiers are also called decoupled classifiers in [Dwork et al., 2018] and [Ustun et al., 2019].
How can a group attribute be used in ML systems, assuming it is ethical and legal to do so? [Dwork et al., 2018]
This work
[Diagram: the three strategies (group-blind, coupled, and split classifiers) and the group attribute.]
Drawback of group-blind classifier
[Scatter plot of +/- labeled samples for the two groups: the groups have different probability distributions.]
Drawback of split classifiers
[Scatter plot of +/- labeled samples per group: each split classifier is trained on a limited amount of samples.]
Factors
[Diagram: the three strategies (group-blind, coupled, and split classifiers) weighed against the two factors above: different probability distributions vs. a limited amount of samples per group.]
Goal
What is the gain of incorporating a group attribute in a classifier?
Notation
[Table: each sample has features (a chest X-ray), a binary group attribute (e.g., sex, encoded 0/1), and a binary label (e.g., pneumonia, encoded 0/1).]
Notation: unlabeled distribution and labeling function
Following [Ben-David et al., 2010], each group's data is described by an unlabeled distribution over the features (chest X-rays) and a labeling function mapping features to the binary label (e.g., pneumonia).
Notation: classifier and loss function
A probabilistic classifier maps features to a predicted probability (e.g., pneumonia positive: 10%); its performance on each group is measured by a loss function.
Benefit-of-splitting: an information-theoretic quantity
The benefit of splitting compares the worst-case loss over the two groups (group 0 and group 1) achieved by the best group-blind classifier with the worst-case loss achieved by the best split classifiers. [Ben-David et al., 2010]
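In symbols, the definition these slides point to can be sketched as follows; the notation (hypothesis class $\mathcal{H}$, loss $L_s$ on group $s$) is mine and may differ from the authors'.

```latex
% A sketch in my own notation (assumptions: hypothesis class \mathcal{H},
% per-group losses L_0, L_1); not necessarily the authors' exact formulation.
\[
  \text{Benefit-of-splitting}
  \;=\;
  \underbrace{\min_{h \in \mathcal{H}} \; \max_{s \in \{0,1\}} L_s(h)}_{\text{best group-blind classifier, worst-case group loss}}
  \;-\;
  \underbrace{\min_{h_0, h_1 \in \mathcal{H}} \; \max_{s \in \{0,1\}} L_s(h_s)}_{\text{best split classifiers, worst-case group loss}}
  \;\ge\; 0 .
\]
```

Since any group-blind classifier is also an admissible choice for both split classifiers, the difference is non-negative, which is the warm-up observation that follows.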
Benefit-of-splitting: warm up
Information-theoretically, splitting classifiers never harms any group.
When does splitting benefit model accuracy the most?
Bounding the benefit-of-splitting: two factors
Disagreement between labeling functions
Similarity between unlabeled distributions
Bounding the benefit-of-splitting: taxonomy of splitting
[Figure: example datasets; colors = subgroups (e.g., male/female), + / - = label. Panels: "Splitting does not bring much benefit", "Splitting does not necessarily bring much benefit", "Splitting benefits the most!"]
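A toy numerical illustration (mine, not the tutorial's) of a setting where splitting clearly helps, consistent with the "disagreement between labeling functions" factor above: the two groups share a similar feature distribution but have opposite labeling functions, so no single group-blind boundary can serve both.

```python
# A toy illustration (mine): when the two groups' labeling functions disagree,
# a group-blind classifier cannot serve both groups, while split classifiers can.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
S = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2))                                    # similar unlabeled distributions
y = np.where(S == 0, X[:, 0] > 0, X[:, 0] < 0).astype(int)     # opposite labeling functions

def worst_group_error(predict, X, y, S):
    return max(np.mean(predict(X[S == s]) != y[S == s]) for s in (0, 1))

blind = LogisticRegression().fit(X, y)
split = {s: LogisticRegression().fit(X[S == s], y[S == s]) for s in (0, 1)}

print("group-blind worst-group error:", worst_group_error(blind.predict, X, y, S))
print("split worst-group error:      ",
      max(np.mean(split[s].predict(X[S == s]) != y[S == s]) for s in (0, 1)))
```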
How can information-theoretic tools help ensure fair machine learning?
Fairness metrics widely used in the literature
A case study in applying information-theoretic tools to fair ML
Emerging aspects of fair ML
Fair Use of group attributes
[Ustun et al. ICML’19], [Paes, Long, Ustun, Calmon’22]
When collecting a group attribute, we must ensure:
Non-maleficence: do not harm.
Beneficence: do good.
Rationality: each group prefers (does at least as well with) the classifier assigned to it over a group-blind classifier.
Envy-freeness: each group prefers the classifier assigned to it over any other group's classifier.
Users of a model should also be incentivized to report their data features truthfully.
Minimax converse bound: at most 20 binary group attributes!
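A sketch of how rationality and envy-freeness could be checked empirically for a set of group-specific classifiers; the loss-comparison form below is my reading of the preference guarantees in [Ustun et al. ICML'19], and the helper names are illustrative.

```python
# A sketch (my reading of the preference conditions, not the authors' code) that
# empirically checks rationality and envy-freeness for split/decoupled classifiers.
import numpy as np

def group_loss(model, X, y):
    """Average 0-1 loss of `model` on the given group's data."""
    return np.mean(model.predict(X) != y)

def check_fair_use(assigned, pooled, X, y, S):
    """assigned: dict group -> classifier assigned to that group; pooled: group-blind classifier."""
    for s in np.unique(S):
        Xs, ys = X[S == s], y[S == s]
        own = group_loss(assigned[s], Xs, ys)
        # Rationality: the group does at least as well with its assigned classifier
        # as it would with the pooled (group-blind) classifier.
        rational = own <= group_loss(pooled, Xs, ys)
        # Envy-freeness: the group does at least as well with its assigned classifier
        # as it would with any other group's classifier.
        envy_free = all(own <= group_loss(assigned[t], Xs, ys)
                        for t in np.unique(S) if t != s)
        print(f"group {s}: rational={rational}, envy-free={envy_free}")
```

With the `split` and `blind` models from the earlier strategy sketch, `check_fair_use(split, blind, X, y, S)` prints one line per group.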
Predictive Multiplicity
[Diagram: the same classifier is trained on the same (features, label) data several times, with a different random initialization each time, yielding different sets of model parameters; at testing time, the resulting models assign different predicted probabilities to the same chest X-ray (e.g., pneumonia positive: 23%, 62%, 50%).]
Predictive Multiplicity: models with similar average performance may produce conflicting predictions on individual samples.
[Breiman’01, Fisher et al.’19, Semenova et al.’19, Marx, C, Ustun’20], [Hsu, C arXiv’22], [Creel and Hellman’21]
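A small experiment (synthetic data, mine) that reproduces the phenomenon: several networks trained with different random initializations reach similar accuracy yet disagree on individual samples. The reported disagreement fraction is in the spirit of the ambiguity-style metrics discussed in [Marx, C, Ustun'20], not their exact definition.

```python
# A minimal sketch (synthetic data, my illustration) of predictive multiplicity:
# models with similar average accuracy can disagree sharply on individual samples.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Train the same architecture several times with different random initializations.
models = [MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed).fit(X, y)
          for seed in range(5)]

accs = [m.score(X, y) for m in models]
probs = np.stack([m.predict_proba(X)[:, 1] for m in models])   # shape: (num_models, num_samples)

# Fraction of samples on which at least two models give conflicting labels.
preds = (probs > 0.5)
conflict = np.mean(preds.max(axis=0) != preds.min(axis=0))
print("accuracies:", np.round(accs, 3))
print("fraction of samples with conflicting predictions:", conflict)
```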
Where tools from information theory can help:
1. Discovering multiplicity
Can we delineate the Rashomon Set (the set of models whose loss is within a small tolerance of the best achievable) without (re)training thousands of models?
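For reference, the brute-force baseline that the question above wants to avoid looks like this: train many models, then keep those whose loss is within a tolerance eps of the best. The function below is my sketch with illustrative names.

```python
# A brute-force baseline (my sketch) for what "delineating the Rashomon Set" means:
# keep every trained model whose loss is within eps of the best one.
import numpy as np

def rashomon_set(models, loss_fn, X, y, eps=0.01):
    """Return the models whose loss is within eps of the best loss among `models`."""
    losses = np.array([loss_fn(m, X, y) for m in models])
    best = losses.min()
    return [m for m, l in zip(models, losses) if l <= best + eps]

# Example with the `models` list from the previous sketch and 0-1 loss:
zero_one = lambda m, X, y: np.mean(m.predict(X) != y)
# near_optimal = rashomon_set(models, zero_one, X, y, eps=0.01)
```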
Where tools from information theory can help:
2. Reporting multiplicity
[Hsu and C’22]
[Figure: for one sample, models in the Rashomon Set output different predicted probabilities (e.g., pneumonia positive: 23%, 62%, 50%).]
Channel capacity can be used to quantify multiplicity!
Rashomon Capacity: defined via a distribution over models in the Rashomon Set.
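My reading of the construction: fix a sample, treat each model in the Rashomon Set as one row of a channel (its predicted class probabilities for that sample), and compute that channel's capacity. The Blahut-Arimoto sketch below is mine, not the authors' code; the numbers reuse the 23% / 62% / 50% example from earlier.

```python
# A sketch (my reading of the idea): for one sample, stack each Rashomon-Set model's
# predicted class distribution as a row of a channel matrix and compute its capacity
# with the Blahut-Arimoto algorithm.
import numpy as np

def blahut_arimoto(W, iters=200):
    """Capacity (in bits) of a channel with rows W[m] = P(class | model m)."""
    M, C = W.shape
    p = np.full(M, 1.0 / M)                      # input distribution over models
    for _ in range(iters):
        q = p @ W                                # induced output (class) distribution
        # exponentiated KL divergence between each row and the current output distribution
        d = np.exp(np.sum(W * np.log((W + 1e-12) / (q + 1e-12)), axis=1))
        p = p * d
        p /= p.sum()
    q = p @ W
    return np.sum(p * np.sum(W * np.log2((W + 1e-12) / (q + 1e-12)), axis=1))

# Toy usage: three models in the Rashomon Set, binary classification, one fixed sample.
W = np.array([[0.77, 0.23],    # model 1: pneumonia positive 23%
              [0.38, 0.62],    # model 2: pneumonia positive 62%
              [0.50, 0.50]])   # model 3: pneumonia positive 50%
print("Rashomon Capacity (bits):", blahut_arimoto(W))
```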
Where tools from information theory can help:
3. Resolving multiplicity
Which model should we use?
[Hsu and C’22]
A direct application of Carathéodory's Theorem yields that, for each sample, at most c models capture the score variation measured by Rashomon Capacity.
Take-aways:
Up next: ensuring fairness in practice
Thanks!