Model Selection's Disparate Impact in Real-World Deep Learning Applications
Jessica Zosa Forde*1, A. Feder Cooper*2, Kweku Kwegyir-Aggrey1, Chris De Sa2, Michael Littman1
1Brown University 2Cornell University
TL;DR:
Srivastava et al., 2019
Which model is fairest? — Skin cancer detection
What if we considered breast cancer detection instead?
Context can shape model selection choices: in cancer detection, false negatives are more costly than false positives
Choice of comparison metric for model selection can implicate fairness
Model selection is under-explored as a potential site for introducing bias
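The metric-choice point above can be sketched with a toy selection criterion. Everything below is hypothetical (the cost weights, labels, and predictions are made up for illustration): two models with identical accuracy are ranked differently once false negatives are penalized more heavily than false positives.

```python
import numpy as np

def weighted_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Cost that penalizes false negatives more than false positives
    (illustrative weights, not taken from the poster)."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn_cost * fn + fp_cost * fp

# Two hypothetical models with identical accuracy but a different error mix.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0])
model_a = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # 2 FN, 0 FP
model_b = np.array([1, 1, 1, 1, 1, 1, 0, 0])  # 0 FN, 2 FP

acc_a = np.mean(y_true == model_a)   # 0.75
acc_b = np.mean(y_true == model_b)   # 0.75
cost_a = weighted_cost(y_true, model_a)  # 10.0
cost_b = weighted_cost(y_true, model_b)  # 2.0
```

Under plain accuracy the models tie; under the FN-weighted cost, model B is selected.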
Vox, “Glad You Asked”, S2 E2, “Are We Automating Racism”, 2021
Model selection
Tobin, 2019
Hyperparameter optimization can significantly impact conclusions about algorithm performance and subsequent model selection
Model selection has also been studied for its role in reproducibility
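The hyperparameter-sensitivity point can be made concrete with a toy experiment. The tuning curves below are synthetic (not real results): model A is insensitive to its hyperparameters, model B has a higher ceiling but only once well tuned, so the model selected flips with the tuning budget.

```python
def best_after(trials, budget):
    """Best validation score found within the first `budget` tuning trials."""
    return max(trials[:budget])

# Synthetic tuning curves, for illustration only:
# model A is insensitive to hyperparameters; model B has a higher
# ceiling, but only once its hyperparameters are well tuned.
trials_a = [0.81] * 50
trials_b = [0.70 + 0.003 * i for i in range(50)]  # 0.700 ... 0.847

winner_small = "A" if best_after(trials_a, 5) > best_after(trials_b, 5) else "B"
winner_full  = "A" if best_after(trials_a, 50) > best_after(trials_b, 50) else "B"
# the selected model flips from A to B as the tuning budget grows
```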
Replicating CheXNet (Rajpurkar et al., 2017): a DenseNet-121 is pretrained on ImageNet, the weights are transferred, and the model is fine-tuned 50 times on NIH-14 with varied data batching, yielding CheXNet replicates 0–49.
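The replication setup can be sketched as follows; `finetune_replicate` is a hypothetical stand-in for the actual training loop, and only the seed-controlled data ordering is shown.

```python
import random

def finetune_replicate(seed, dataset):
    """One fine-tuning replicate (hypothetical sketch): the seed changes
    only the order in which training examples are presented, since the
    initialization and optimizer settings are fixed."""
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)  # per-replicate data ordering
    # ... fine-tune the ImageNet-pretrained DenseNet-121 in this order ...
    return order

# 50 replicates with seeds 0..49, matching the setup above
orders = [finetune_replicate(seed, list(range(8))) for seed in range(50)]
```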
We measure model variability to show that models with similar overall performance can perform differently across subpopulations
Subpopulations are domain-specific
For NIH-14, we look at differences in true positive rate between male and female patients
These differences also vary by predicted class (finding)
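A per-subpopulation TPR comparison can be sketched as below. This is an illustrative helper, not the authors' released code, and the labels, predictions, and group assignments are toy values.

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate over examples where y_true == 1."""
    pos = y_true == 1
    return np.mean(y_pred[pos] == 1) if pos.any() else float("nan")

def tpr_gap_by_group(y_true, y_pred, group):
    """Per-group TPR and the spread between subpopulations
    (e.g. male vs. female patients). Illustrative helper only."""
    g = np.asarray(group)
    per_group = {a: tpr(y_true[g == a], y_pred[g == a]) for a in np.unique(g)}
    vals = list(per_group.values())
    return per_group, max(vals) - min(vals)

# Toy data: label, prediction, and sex per patient (hypothetical).
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1])
sex    = np.array(["F", "F", "M", "M", "F", "M"])
per_group, gap = tpr_gap_by_group(y_true, y_pred, sex)
# F: TPR = 0.5, M: TPR = 1.0, gap = 0.5
```

In practice this would be computed per finding, since the gaps also vary by predicted class.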
CheXNet uses a fixed initialization, so varying the random seed only affects data ordering.
Because optimizer settings are also fixed, replicates have similar overall performance.
However, low variability in AUC does not guarantee similarity on other metrics.
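To make the last point concrete, here is a toy comparison with entirely made-up per-replicate numbers (none of these values come from the experiments): overall AUC barely moves across replicates, while the male/female TPR gap moves an order of magnitude more.

```python
import numpy as np

# Entirely hypothetical per-replicate metrics, for illustration only.
auc     = np.array([0.841, 0.843, 0.840, 0.842, 0.844])  # overall AUC
tpr_gap = np.array([0.02, 0.11, 0.05, 0.14, 0.01])       # male/female TPR gap

auc_spread = auc.max() - auc.min()          # ~0.004
gap_spread = tpr_gap.max() - tpr_gap.min()  # ~0.13
# similar AUC across replicates, very different subgroup TPR gaps
```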