
Generalization challenges and making models right for the right reasons in medicine

(with a focus on chest X-ray diagnostics)

Joseph Paul Cohen
Postdoctoral Fellow
Mila, Université de Montréal


Conflicts of Interest

None


Chester: a free, open-source tool to try deep learning


Existing clinical prediction tools:

  • Kidney Donor Risk Index (KDRI) [Feng 2006]
  • The Framingham Heart Study: Cardiovascular Disease Risk [Online ~2018]

Chester: AI Radiology Assistant [Cohen 2019]

Use cases:

  • Emergency room (time-limited human)
  • Rural hospital (no radiologist nearby)
  • Triage of cases by a non-expert
  • As an educational tool in school

*NOT FOR MEDICAL USE YET*

[Cohen, Chester: A Web Delivered Locally Computed Chest X-Ray Disease Prediction System, 2019] https://arxiv.org/abs/1901.11210


Chapter 1

Cross-domain generalization



What would lead to such strange results?

An online post about the system indicated some contention about these labels.

Bálint Botz, "Evaluating chest x-rays using AI in your browser? — testing Chester":

Test data (AUC):

                 NIH (Maryland, US)   PadChest (Spain)
Mass                   0.88                0.89
Nodule                 0.81                0.74
Pneumonia              0.73                0.83
Consolidation          0.82                0.91
Infiltration           0.73                0.60

Initial results when evaluating this model on an external dataset from Spain.


To investigate, a cross-domain evaluation is performed: models are trained and evaluated on each of the 5 largest datasets.

https://arxiv.org/abs/2002.02497

Each dataset's labels are generated using a different method, some automatic and some manual.

(Figure: label quality per dataset, rated good, medium, or variable.)
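
A minimal sketch of this protocol on synthetic data (my own construction, not the paper's code). Each toy "domain" labels the same kind of features with a slightly different labeling function, standing in for each dataset's own automatic or manual labeler, so off-diagonal AUCs degrade; this also previews the concept-shift discussion below:

```python
# Toy cross-domain protocol: train on each "dataset", evaluate on all.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_domain(w):
    # Labels come from a domain-specific labeling function.
    X = rng.normal(size=(1000, 10))
    y = (X @ w + rng.normal(scale=0.5, size=1000) > 0).astype(int)
    return X, y

base = rng.normal(size=10)
domains = {
    "NIH": base,
    "PadChest": base + rng.normal(scale=0.8, size=10),
    "CheXpert": base + rng.normal(scale=0.8, size=10),
}
data = {name: make_domain(w) for name, w in domains.items()}

for train_name, (Xtr, ytr) in data.items():
    model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    for test_name, (Xte, yte) in data.items():
        auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
        print(f"train={train_name:9s} test={test_name:9s} AUC={auc:.2f}")
```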


We model p(y|x).

We may blame poor performance on a shift in x (covariate shift), but that would not account for why the model works well for some y.

It seems more likely that there is some shift in y (concept shift), which would force us to condition the prediction on the domain: p(y | x, domain) is possibly the reality.

But we want objective predictions!
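
Stated in standard distribution-shift notation (my summary; d and d' index domains):

```latex
\text{covariate shift:}\quad p_d(x) \neq p_{d'}(x) \ \text{ while } \ p_d(y \mid x) = p_{d'}(y \mid x)
\text{concept shift:}\quad p_d(y \mid x) \neq p_{d'}(y \mid x)
```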


We may think that training on local data addresses covariate shift.

However, training on local data provides better performance than training on all other data combined (>100k examples).

Likely the model is only adapting to the local biases in the data, which may not match the reality in the images.

(Figure: cross-domain validation analysis; average over 3 seeds for all labels.)


What is causing this shift?


  • Errors in labelling, as discussed by Oakden-Rayner (2019) and Majkowska et al. (2019), in part due to automatic labellers.
  • Discrepancy between the radiologist's vs the clinician's vs the automatic labeller's understanding of a radiology report (Brady et al., 2012).
  • Bias in clinical practice between doctors and their clinics (Busby et al., 2018) or limitations in objectivity (Cockshott & Park, 1983; Garland, 1949).
  • Interobserver variability (Moncada et al., 2011). It can be related to the medical culture, language, textbooks, or politics. Possibly even conceptual differences (e.g. what "football" means in the USA vs the rest of the world).


(Figure: average Kappa between models on a specific dataset, sorted by generalization accuracy; points range from high AUC with low Kappa to low AUC with high Kappa.)

Common labels provide more consistency.

Are labels omitted because they are subject to a lot of interrater variability?
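
For reference, the agreement between two models' predictions can be quantified with Cohen's kappa; a minimal scikit-learn sketch on placeholder predictions:

```python
# Chance-corrected agreement between two models' binary predictions
# on the same test set; the arrays here are synthetic placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
pred_a = rng.integers(0, 2, size=200)                          # model A
pred_b = np.where(rng.random(200) < 0.2, 1 - pred_a, pred_a)   # mostly agrees

kappa = cohen_kappa_score(pred_a, pred_b)
print(f"kappa = {kappa:.2f}")  # 1.0 = perfect agreement, ~0.0 = chance
```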


How to study concept shift?

We can use the weight vector at the classification layer for a specific task (just a logistic regression on top of the shared features).

(Network figure credit: Sara Sheehan)

For each class there is one weight vector per domain, giving a tensor of dimensions a × t × d, where
a: feature vector length
t: number of tasks
d: number of domains

Minimize the pairwise distances between the weight vectors of the same task across domains.

If the weight vectors don't merge together, then some concept shift is pulling them apart.
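
A minimal PyTorch sketch of this probe under my own naming (W is a hypothetical stack of per-task, per-domain classifier weight vectors; sizes are illustrative):

```python
# Pull the per-domain weight vectors of the same task toward each other
# by minimizing the sum of squared pairwise distances.
import torch

a, t, d = 1024, 18, 5                          # feature dim, tasks, domains
W = torch.randn(t, d, a, requires_grad=True)   # one vector per (task, domain)

def pairwise_distance_loss(W):
    diffs = W.unsqueeze(1) - W.unsqueeze(2)    # (t, d, d, a): all domain pairs
    return (diffs ** 2).sum()                  # squared L2 distances, summed

loss = pairwise_distance_loss(W)
loss.backward()  # in practice, minimized jointly with the task losses
print(loss.item())
```

If the vectors can merge without hurting task performance, the tasks agree across domains; if they cannot, concept shift is pulling them apart.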


Do distances between weight vectors explain anything about generalization?

Sorted by average distance over 3 seeds, some tasks group together more easily than others.

Plotting this distance against average generalization performance shows a slight trend.


Discussion

  • We believe the generalization gap is not due to a shift in the images but instead a shift in the labels.
  • Better automatic labeling may not be the solution.
  • There is general disagreement between radiologists and subjectivity in what is clinically relevant to include in a report.
  • We should consider each task prediction as defined by its training data, such as "NIH Pneumonia". One can present the outputs of multiple models to a user.
  • We assert that a solution is not to train on local data from a single hospital.


Chapter 2

Incorrect feature attribution



Incorrect feature attribution

(Figure: example of a systematic discrepancy between the average images of the NIH and PadChest datasets.)

Models can overfit to confounding variables in the data.

  • Merging datasets with different class imbalance (confounding artifacts from each hospital)
  • Labels confounding with each other
  • Demographics confounding with labels

(Figure: overfitting while predicting emphysema [Viviano 2019].)

[Zech, Confounding variables can degrade generalization performance of radiological deep learning models, 2018]

[Viviano, Underwhelming Generalization Improvements From Controlling Feature Attribution, 2019]

[Simpson, GradMask: Reduce Overfitting by Regularizing Saliency, 2019]

[Ross, Right for the Right Reasons, 2017]


Mitigation approaches

Feature engineering

  • Range normalization (divide by the maximum value)
  • Subspace alignment (align data from different domains using their eigenbases; see the sketch below)
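
A sketch of subspace alignment in the spirit of Fernando et al. (2013), on synthetic features (all names and sizes here are illustrative):

```python
# Align the source PCA basis to the target PCA basis, then compare
# features in the shared low-dimensional subspace.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Xs = rng.normal(size=(300, 50))          # source-domain features
Xt = rng.normal(size=(300, 50)) + 0.5    # shifted target-domain features

k = 10                                   # subspace dimensionality
Ps = PCA(n_components=k).fit(Xs).components_.T   # (50, k) source basis
Pt = PCA(n_components=k).fit(Xt).components_.T   # (50, k) target basis

M = Ps.T @ Pt                  # alignment between the two bases
Xs_aligned = Xs @ Ps @ M       # source features mapped toward target subspace
Xt_projected = Xt @ Pt         # target features in their own subspace
print(Xs_aligned.shape, Xt_projected.shape)      # (300, 10) (300, 10)
```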

During training

  • Reverse gradient (make an intermediate layer invariant to a label; sketch below) [Ganin & Lempitsky, 2014]
  • Right for the Right Reasons (regularize the saliency map) [Ross, Hughes, & Doshi-Velez, 2017]
  • GradMask (regularize the contrast saliency map between classes) [Simpson, 2019]
  • ActivDiff (regularize the representation to focus on the pathology) [Viviano, 2019]
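
The reverse-gradient trick amounts to an identity in the forward pass whose gradient is flipped (and scaled) in the backward pass, so the feature extractor is pushed to be invariant to whatever the downstream head predicts (e.g. the domain). A minimal PyTorch sketch in the spirit of Ganin & Lempitsky (2014):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

# Toy usage: features -> reversed features -> domain classifier.
features = torch.randn(8, 32, requires_grad=True)
domain_head = torch.nn.Linear(32, 2)
logits = domain_head(GradReverse.apply(features, 1.0))
logits.sum().backward()  # features.grad now carries the reversed signal
```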


What if the feature artifact is correlated with the target label? Is the reason that should be used for prediction known?

What if it is not known?


(Slide: the GradMask contrast loss and the Right for the Right Reasons loss; a sketch of both follows.)
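
Both losses can be sketched with input gradients in PyTorch. This is my hedged adaptation, not the papers' exact code: `annot` is assumed to be a binary mask marking the regions that should drive the prediction, and each penalty discourages saliency outside it.

```python
# Hedged sketches of two saliency-regularization losses for a binary
# classifier. `annot` marks the relevant region; gradients outside it
# are penalized.
import torch
import torch.nn.functional as F

def rrr_loss(model, x, y, annot, lam=1.0):
    # Right for the Right Reasons (adapted from Ross et al. 2017):
    # penalize input gradients of the log-probabilities outside `annot`.
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    ce = F.binary_cross_entropy_with_logits(logits, y)
    grads = torch.autograd.grad(F.logsigmoid(logits).sum(), x,
                                create_graph=True)[0]
    return ce + lam * (((1 - annot) * grads) ** 2).sum()

def gradmask_loss(model, x, y, annot, lam=1.0):
    # GradMask-style contrast term (adapted from Simpson et al. 2019):
    # the gradient of the contrast between the two classes should stay
    # inside `annot`.
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    ce = F.binary_cross_entropy_with_logits(logits, y)
    contrast = (2 * torch.sigmoid(logits) - 1).sum()  # p(disease) - p(healthy)
    grads = torch.autograd.grad(contrast, x, create_graph=True)[0]
    return ce + lam * ((1 - annot) * grads).abs().sum()

# Toy usage on a tiny linear model and random "images".
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 1))
x = torch.randn(4, 1, 8, 8)
y = torch.randint(0, 2, (4, 1)).float()
annot = torch.zeros_like(x)
annot[..., 2:6, 2:6] = 1.0  # pretend the pathology lives in the center
print(rrr_loss(model, x, y, annot).item(),
      gradmask_loss(model, x, y, annot).item())
```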


Task: emphysema prediction


Although the saliency mask appears more correct, the model does not improve. (SPC = site-pathology correlation.)


Pr. Yoshua Bengio, PhD
Francis Dutil
Martin Weiss
Tristan Sylvain
Margaux Luck, PhD
Assya Trofimov
Vincent Frappier, PhD
Joseph Paul Cohen, PhD (Medical Research Lead)
Shawn Tan
Sina Honari
Geneviève Boucher
Mandana Samiei
Georgy Derevyanko, PhD
Paul Bertin
Hashir Khan
Tobias Wuerfl
Becks Simpson
Karsten Roth
Hadrien Bertrand, PhD
Joseph Viviano


End