Generalization challenges and making models right for the right reasons in medicine
(with a focus on chest X-ray diagnostics)
Joseph Paul Cohen · Postdoctoral Fellow · Mila, Université de Montréal
Conflicts of Interest
None
Chester: a free open source tool to try deep learning
Kidney Donor Risk Index (KDRI)
[Feng 2006]
The Framingham Heart Study
Cardiovascular Disease Risk
[Online ~2018]
Emergency Room (Time limited human)
Rural Hospital (no radiologist nearby)
Chester: AI Radiology Assistant
[Cohen 2019]
Triage of cases by a non-expert
[Cohen, Chester: A Web Delivered Locally Computed Chest X-Ray Disease Prediction System, 2019] https://arxiv.org/abs/1901.11210
As an educational tool in school
*NOT FOR MEDICAL USE YET*
Chapter 1
Cross-domain generalization
What would lead to such strange results?
An online post about the system indicated some contention about these labels.
Bálint Botz - Evaluating chest x-rays using AI in your browser? — testing Chester:
Test data (AUC):

| Label         | NIH (Maryland, US) | PadChest (Spain) |
|---------------|--------------------|------------------|
| Mass          | 0.88               | 0.89             |
| Nodule        | 0.81               | 0.74             |
| Pneumonia     | 0.73               | 0.83             |
| Consolidation | 0.82               | 0.91             |
| Infiltration  | 0.73               | 0.60             |
Initial results when evaluating this model on an external dataset from Spain.
To investigate, a cross-domain evaluation is performed: each of the 5 largest chest X-ray datasets is used for both training and evaluation.
https://arxiv.org/abs/2002.02497
Each dataset's labels are generated using a different method, some automatic and some manual; label quality ranges from good to medium to variable.
We model p(y | x).
We might blame poor performance on a shift in x (covariate shift), but that would not explain why the model still works well for some y.
It seems more likely that there is a shift in y (concept shift), which would force us to condition the prediction on the domain.
Possibly the reality is that the labels depend on the domain. But we want objective predictions!
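The two kinds of shift discussed above can be written out explicitly. These are the standard textbook definitions, not equations taken from the slides, with NIH and PadChest standing in for any pair of domains:

```latex
% Factor the joint distribution over images x and labels y:
%   p(x, y) = p(y \mid x)\, p(x)

% Covariate shift: the input distribution changes across domains,
% while the labeling rule stays fixed:
p_{\mathrm{NIH}}(x) \neq p_{\mathrm{PadChest}}(x),
\qquad
p_{\mathrm{NIH}}(y \mid x) = p_{\mathrm{PadChest}}(y \mid x)

% Concept shift: the labeling rule itself differs, so a model would
% need to condition its prediction on the domain:
p_{\mathrm{NIH}}(y \mid x) \neq p_{\mathrm{PadChest}}(y \mid x)
\;\Rightarrow\; \text{predict } p(y \mid x, \mathrm{domain})
```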
We may think that training on local data addresses covariate shift.
However, training on local data provides better performance than training on all other data combined (>100k examples).
This is likely only adapting to local biases in the data, which may not match the reality in the images.
Cross domain validation analysis. Average over 3 seeds for all labels.
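The metric behind the cross-domain grid is per-label AUC, which can be computed without external libraries via the rank-sum (Mann-Whitney) formulation. A minimal sketch; the per-dataset training loop is elided, and the function name is mine:

```python
# Rank-based AUC (Mann-Whitney U formulation) for binary labels.
# In the cross-domain grid, this would be computed for a model trained
# on dataset i and evaluated on dataset j, for every (i, j) pair.
import numpy as np

def auc(scores, labels):
    """AUC for binary labels from model scores; assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    rank_sum = ranks[labels == 1].sum()
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Worked example: two positives, two negatives.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # → 0.75
```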
What is causing this shift?
Average kappa between models on a specific dataset, sorted by generalization accuracy. (Labels range from high AUC with low kappa to low AUC with high kappa.)
Common labels provide more consistency.
Are labels omitted because they are subject to a lot of inter-rater variability?
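The agreement statistic used above, Cohen's kappa, can be computed directly. A minimal numpy sketch; the function names are mine, not from the paper:

```python
import numpy as np

def cohen_kappa(a, b):
    """Cohen's kappa between two binary (0/1) label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                       # observed agreement
    pe = (np.mean(a) * np.mean(b)              # chance agreement on 1s
          + np.mean(1 - a) * np.mean(1 - b))   # chance agreement on 0s
    return (po - pe) / (1 - pe)

def average_pairwise_kappa(predictions):
    """Mean kappa over all model pairs; predictions: (n_models, n_images)."""
    n = len(predictions)
    return float(np.mean([cohen_kappa(predictions[i], predictions[j])
                          for i in range(n) for j in range(i + 1, n)]))

# Two models that agree everywhere reach the maximum kappa of 1.
preds = [[0, 1, 1, 0, 1],
         [0, 1, 1, 0, 1]]
print(average_pairwise_kappa(preds))  # → 1.0
```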
How to study concept drift?
We can use the weight vector at the classification layer for a specific task (this is just a logistic regression)
Network figure credit: Sara Sheehan
For each class: a = feature vector length, t = number of tasks, d = number of domains.
Minimize pairwise distances between each weight vector of the same task.
If the weight vectors do not merge together, then some concept drift is pulling them apart.
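The comparison described above can be sketched with the shapes defined on the slide (d domains, t tasks, a features). The choice of Euclidean distance and the function name are my assumptions:

```python
import numpy as np

def task_weight_distances(W):
    """Mean pairwise distance between same-task weight vectors across domains.

    W: array of shape (d, t, a) holding one classification-layer
    (logistic regression) weight vector per (domain, task).
    Returns one average pairwise distance per task.
    """
    d, t, _ = W.shape
    pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
    return np.array([np.mean([np.linalg.norm(W[i, k] - W[j, k])
                              for i, j in pairs])
                     for k in range(t)])

# Task 0's vectors coincide across domains (distance 0), while task 1's
# vectors are pulled apart, as concept drift would do.
W = np.zeros((2, 2, 3))
W[1, 1] = np.array([3.0, 4.0, 0.0])
print(task_weight_distances(W))  # → [0. 5.]
```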
Do distances between weight vectors explain anything about generalization?
Sorted by average distance over 3 seeds, some tasks group together more readily than others.
Plotting this distance against average generalization performance shows a slight trend.
Discussion
Chapter 2
Incorrect feature attribution
Incorrect feature attribution
Example: systematic discrepancy between the average image in two datasets (NIH vs. PadChest).
Models can overfit to confounding variables in the data.
Overfitting while predicting Emphysema [Viviano 2019]
[Zech, Confounding variables can degrade generalization performance of radiological deep learning models, 2018]
[Viviano, Underwhelming Generalization Improvements From Controlling Feature Attribution, 2019]
[Simpson, GradMask: Reduce Overfitting by Regularizing Saliency, 2019]
[Ross, Right for the Right Reasons, 2017]
Mitigation approaches
Feature engineering
During training
What if the feature artifact is correlated with the target label?
Is the reason that should be used for prediction known? What if it is not known?
GradMask Contrast loss
Right for the Right Reasons loss
Task: emphysema prediction
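As an illustration of the "Right for the Right Reasons" penalty [Ross 2017] named above: on a toy logistic model the input gradient has a closed form, so the penalty on forbidden (confounder) features can be written out directly. The function and variable names are illustrative, not taken from the papers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rrr_penalty(w, x, forbidden_mask):
    """Squared input-gradient magnitude on forbidden (confounder) features.

    For y = sigmoid(w . x) the input gradient is sigmoid'(z) * w, so the
    penalty || mask * dy/dx ||^2 pushes reliance on masked features to
    zero. In training, this term is added to the classification loss.
    """
    z = float(w @ x)
    grad_x = sigmoid(z) * (1.0 - sigmoid(z)) * w   # dy/dx in closed form
    return float(np.sum((forbidden_mask * grad_x) ** 2))

x = np.array([0.5, 2.0])
mask = np.array([0.0, 1.0])  # feature 1 is a known confounder (e.g. a site marker)
print(rrr_penalty(np.array([1.0, 0.0]), x, mask))      # → 0.0 (confounder ignored)
print(rrr_penalty(np.array([1.0, 1.0]), x, mask) > 0)  # → True (penalized)
```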
Although the saliency mask appears more correct, the model's performance does not improve.
SPC = site-pathology correlation.
Prof. Yoshua Bengio, PhD
Francis Dutil
Martin Weiss
Tristan Sylvain
Margaux Luck, PhD
Assya Trofimov
Vincent Frappier, PhD
Joseph Paul Cohen, PhD
Medical Research Lead
Shawn Tan
Sina Honari
Geneviève Boucher
Mandana Samiei
Georgy Derevyanko, PhD
Paul Bertin
Hashir Khan
Tobias Wuerfl
Becks Simpson
Karsten Roth
Hadrien Bertrand, PhD
Joseph Viviano
End