Model Selection's Disparate Impact in Real-World Deep Learning Applications
Jessica Zosa Forde*1, A. Feder Cooper*2, Kweku Kwegyir-Aggrey1, Chris De Sa2, Michael Littman1
1Brown University 2Cornell University
TL;DR:
Srivastava et al., 2019
Which model is fairest? — Skin cancer detection
What if we considered breast cancer detection instead?
Context can shape model selection choices: in cancer detection, false negatives are more costly than false positives
Choice of comparison metric for model selection can implicate fairness
Model selection is under-explored as a potential site for introducing bias
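The metric-choice point above can be sketched with a toy selection criterion. Everything below is hypothetical (the cost weights, labels, and predictions are made up for illustration): two models with identical accuracy are ranked differently once false negatives are penalized more heavily than false positives.

```python
import numpy as np

def weighted_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Cost that penalizes false negatives more than false positives
    (illustrative weights, not taken from the poster)."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn_cost * fn + fp_cost * fp

# Two hypothetical models with identical accuracy but a different error mix.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0])
model_a = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # 2 FN, 0 FP
model_b = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # 0 FN, 2 FP

acc_a = np.mean(y_true == model_a)   # 0.75
acc_b = np.mean(y_true == model_b)   # 0.75
cost_a = weighted_cost(y_true, model_a)  # 10.0
cost_b = weighted_cost(y_true, model_b)  # 2.0
```

Under plain accuracy the models tie; under the FN-weighted cost, model B is selected.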
Vox, “Glad You Asked”, S2 E2, “Are We Automating Racism”, 2021
Model selection
Tobin, 2019
Hyperparameter optimization can significantly impact conclusions about algorithm performance and subsequent model selection
Model selection has also been studied for its role in reproducibility
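The hyperparameter-sensitivity point can be made concrete with a toy experiment. The tuning curves below are synthetic (not real results): model A is insensitive to its hyperparameters, model B has a higher ceiling but only once well tuned, so the model selected flips with the tuning budget.

```python
def best_after(trials, budget):
    """Best validation score found within the first `budget` tuning trials."""
    return max(trials[:budget])

# Synthetic tuning curves, for illustration only:
# model A is insensitive to hyperparameters; model B has a higher
# ceiling, but only once its hyperparameters are well tuned.
trials_a = [0.81] * 50
trials_b = [0.70 + 0.003 * i for i in range(50)]  # 0.700 ... 0.847

winner_small = "A" if best_after(trials_a, 5) > best_after(trials_b, 5) else "B"
winner_full  = "A" if best_after(trials_a, 50) > best_after(trials_b, 50) else "B"
# the selected model flips from A to B as the tuning budget grows
```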
Replicating CheXNet (Rajpurkar et al., 2017): a DenseNet-121 is pretrained on ImageNet, the weights are transferred, and the model is fine-tuned 50 times on NIH-14 with varied data batching, yielding CheXNet replicates 0–49.
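The replication setup can be sketched as follows; `finetune_replicate` is a hypothetical stand-in for the actual training loop, and only the seed-controlled data ordering is shown.

```python
import random

def finetune_replicate(seed, dataset):
    """One fine-tuning replicate (hypothetical sketch): the seed changes
    only the order in which training examples are presented, since the
    initialization and optimizer settings are fixed."""
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)  # per-replicate data ordering
    # ... fine-tune the ImageNet-pretrained DenseNet-121 in this order ...
    return order

# 50 replicates with seeds 0..49, matching the setup above
orders = [finetune_replicate(seed, list(range(8))) for seed in range(50)]
```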
We measure model variability to show that models with similar overall performance can perform differently across subpopulations
Subpopulations are domain-specific
For NIH-14, we look at differences in true positive rate between male and female patients
These differences also vary by predicted class (finding)
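A per-subpopulation TPR comparison can be sketched as below. This is an illustrative helper, not the authors' released code, and the labels, predictions, and group assignments are toy values.

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate over examples where y_true == 1."""
    pos = y_true == 1
    return np.mean(y_pred[pos] == 1) if pos.any() else float("nan")

def tpr_gap_by_group(y_true, y_pred, group):
    """Per-group TPR and the spread between subpopulations
    (e.g. male vs. female patients). Illustrative helper only."""
    g = np.asarray(group)
    per_group = {a: tpr(y_true[g == a], y_pred[g == a]) for a in np.unique(g)}
    vals = list(per_group.values())
    return per_group, max(vals) - min(vals)

# Toy data: label, prediction, and sex per patient (hypothetical).
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1])
sex    = np.array(["F", "F", "M", "M", "F", "M"])
per_group, gap = tpr_gap_by_group(y_true, y_pred, sex)
# F: TPR = 0.5, M: TPR = 1.0, gap = 0.5
```

In practice this would be computed per finding, since the gaps also vary by predicted class.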
CheXNet uses a fixed initialization, so varying the random seed only affects data ordering.
Because optimizer settings are also fixed, replicates have similar overall performance.
However, low variability in AUC does not guarantee similarity on other metrics.
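To make the last point concrete, here is a toy comparison with entirely made-up per-replicate numbers (none of these values come from the experiments): overall AUC barely moves across replicates, while the male/female TPR gap moves an order of magnitude more.

```python
import numpy as np

# Entirely hypothetical per-replicate metrics, for illustration only.
auc     = np.array([0.841, 0.843, 0.840, 0.842, 0.844])  # overall AUC
tpr_gap = np.array([0.02, 0.11, 0.05, 0.14, 0.01])       # male/female TPR gap

auc_spread = auc.max() - auc.min()          # ~0.004
gap_spread = tpr_gap.max() - tpr_gap.min()  # ~0.13
# similar AUC across replicates, very different subgroup TPR gaps
```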