1 of 28

Is the ML reproducibility crisis a natural consequence?
The Reproducibility Crisis in ML-based Science Workshop

Michael Roberts
Department of Applied Mathematics and Theoretical Physics
On behalf of the AIX-COVNET Collaboration

email: mr808@cam.ac.uk

2 of 28

A pandemic of reproducibility issues

ML for COVID-19

ML for Incomplete Data

ML at scale

ML and code

3 of 28

ML for COVID-19 imaging

Joint with Derek Driggs, Matthew Thorpe, Julian Gilbey, Angelica I. Aviles-Rivero, James Rudd, Evis Sala, Carola-Bibiane Schönlieb and many AIX-COVNET members

Issues with …

Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0

4 of 28

COVID-19: A perfect test case for image based ML

  • Never before have we had access to data:
    • at such scale for a single disease
    • all collected around the same time period
    • collected on the same machines
    • for a disease with a short infection time

So why did image based ML fail to contribute significantly to the COVID-19 pandemic?

5 of 28

Systematic Review

  • Eligibility: Any papers using ML and CXR/CT imaging for COVID-19 diagnosis or prognosis.

Papers screened: 2,212 → eligible for review: 320 → remaining after quality review: 62 → of safe clinical utility: 0

[1] Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0

6 of 28

Basic pitfalls: Frankenstein datasets

  • Know where your data comes from!


Figure from Cruz et al. Public Covid-19 X-ray datasets and their impact on model bias – A systematic review of a significant problem (2021)
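One mechanical defence against Frankenstein datasets is to check for the same image appearing in more than one source before merging. A minimal sketch, with hypothetical dataset names and in-memory image bytes standing in for real files, using exact content hashing:

```python
import hashlib

# Hash every image's bytes and flag any that appear in more than one source.
datasets = {
    "source_a": [b"image-bytes-1", b"image-bytes-2"],
    "source_b": [b"image-bytes-2", b"image-bytes-3"],  # shares one image with source_a
}

seen = {}        # digest -> first dataset it was seen in
duplicates = []  # (first dataset, second dataset, short digest)
for name, images in datasets.items():
    for img in images:
        digest = hashlib.sha256(img).hexdigest()
        if digest in seen and seen[digest] != name:
            duplicates.append((seen[digest], name, digest[:8]))
        seen.setdefault(digest, name)

print(len(duplicates))
```

Note that exact hashing only catches byte-identical files; re-encoded or resized copies of the same scan need perceptual hashing or metadata checks.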


8 of 28

Basic pitfalls: biases in images


  • Know where your data comes from!
  • Appreciate the biases in your data.

9 of 28

Basic pitfalls: biases in labels


  • Know where your data comes from!
  • Appreciate the biases in your data.
  • Ground truth assigned based on images.

[Figure: example images with ground-truth labels Positive / Negative]

10 of 28

Basic pitfalls: biases in models


  • Know where your data comes from!
  • Appreciate the biases in your data.
  • Ground truth assigned based on images.
  • Resolution driven by pretrained networks

11 of 28

ML for Incomplete Data

Imputation

Classification

Joint with Tolou Shadbahr, Julian Gilbey, Jan Stanczuk, Philip Teare, Sören Dittmer, John Aston, Carola-Bibiane Schönlieb and many AIX-COVNET members

Issues with …

Shadbahr, T., Roberts, M., Stanczuk, J., Gilbey, J., Teare, P., Dittmer, S., ... & Schönlieb, C. B. Classification of datasets with imputed missing values: does imputation quality matter? (2022). arXiv preprint arXiv:2206.08478.

12 of 28

Why is imputation important?

  • QRISK is a model for predicting your risk of a cardiovascular event in the next 10 years.

  • Initially, the published paper [1] found no link between cholesterol and outcome.

  • Other researchers [2] showed that when only the complete records were analysed, the cholesterol link appeared.

  • After the imputation method was improved, the link to cholesterol was also recovered in the full analysis.

  • The algorithm is now a standard used in the UK NHS.

[1] Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P et al. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study BMJ 2007; 335 :136 doi:10.1136/bmj.39261.471806.55

[2] https://www.bmj.com/rapid-response/2011/11/01/multiple-imputation-needs-be-used-care-and-reported-detail

13 of 28

How should we measure quality?

[Figure: the same dataset imputed by three methods, compared against the true data]

  • Conditional expectation: MSE = 1.001641, W1 = 0.000286
  • MissForest: MSE = 1.094625, W1 = 0.000147
  • MICE: MSE = 3.007393, W1 = 0.000134

Mean / root mean square error is a common metric for measuring imputation quality.

W1 = Wasserstein distance between the imputed and real value distributions
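The tension between the two metrics can be reproduced on synthetic data. This is an illustrative sketch (not the paper's experiment): imputing with the mean minimises MSE but collapses the value distribution, while drawing from the estimated distribution keeps W1 small at the cost of a higher MSE.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
true_vals = rng.normal(0.0, 1.0, size=10_000)   # the values that went missing

imp_mean = np.zeros_like(true_vals)             # impute everything with the mean (0)
imp_draw = rng.normal(0.0, 1.0, size=10_000)    # impute by sampling from N(0, 1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

print(mse(true_vals, imp_mean), wasserstein_distance(true_vals, imp_mean))
print(mse(true_vals, imp_draw), wasserstein_distance(true_vals, imp_draw))
```

Mean imputation gives MSE ≈ 1 but W1 ≈ 0.8 (the mean absolute value of a standard normal), while sampling gives MSE ≈ 2 and W1 near 0: a low MSE does not imply good distributional quality.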

16 of 28

What are the issues?

  • We find that many new imputation methods fail to both:
    • recreate the data distribution
    • give stable imputations
  • This compromises model interpretability
  • Missingness may itself be informative for the models
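The stability point above can be made concrete with a toy check (hypothetical data, with simple random-draw imputation standing in for a real stochastic imputer): impute the same masked entries twice with different seeds and measure how much the two runs disagree.

```python
import numpy as np

rng1, rng2 = np.random.default_rng(0), np.random.default_rng(1)
observed = np.array([1.0, 2.0, 2.5, 3.0, 10.0])  # the non-missing values
n_missing = 1_000

# Two independent runs of the same stochastic imputation procedure.
imp_a = rng1.choice(observed, size=n_missing)
imp_b = rng2.choice(observed, size=n_missing)

# Mean absolute disagreement between runs: far from 0 means the
# per-value imputations are not reproducible.
instability = float(np.mean(np.abs(imp_a - imp_b)))
print(instability)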

17 of 28

ML at scale

Joint with Nick Gleadall, Daniel Kreuter, Samuel Tull, Julian Gilbey, James Rudd, John Aston, Carola-Bibiane Schönlieb, Willem Ouwehand and other members of AIX-COVNET and BloodCounts!

BloodCounts!

Issues with …

18 of 28

Motivation

  • The full blood count (FBC) is the most commonly used medical test: 3.6 billion are performed globally per year, around 10 million per day.

  • Essential for decision making in primary and secondary care

  • We are linking together thousands of machines.

19 of 28

Biases for machine learning

Observed FBC value = true value + error introduced by the machine + error introduced by clinical practice
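A toy simulation of the decomposition above (all numbers hypothetical), with a naive per-machine correction that aligns every analyser to the grand mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n_machines, n_per = 5, 1_000
true = rng.normal(7.0, 1.5, size=(n_machines, n_per))        # true white cell counts
machine_offset = rng.normal(0.0, 0.8, size=(n_machines, 1))  # per-analyser bias
observed = true + machine_offset

# Naive correction: shift each machine's measurements to the grand mean.
corrected = observed - observed.mean(axis=1, keepdims=True) + observed.mean()

def spread(x):
    """Standard deviation of the per-machine means."""
    return float(np.std(x.mean(axis=1)))

print(spread(observed), spread(corrected))
```

This centring would also erase genuine clinical differences between sites, which is exactly why the variability must be modelled rather than blindly removed.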

20 of 28

Example biases

Clinical biases (e.g. change in white cell count with time since the patient was bled; original vs corrected):

  • Time between bleed and testing

  • Temperature of the storage

  • Differing service at weekends

  • Seasonal variations in the blood

Machine biases (original vs corrected):

  • Measurements vary between machines

  • At surveillance scale, we will have 1000s of machines in the network

  • We must model and remove this variability

21 of 28

What are the issues?

  • There are no publications which strip out this bias
  • Many hospitals reserve one machine for urgent care
  • Models cannot generalise across sites where clinical practice differs

22 of 28

ML and code

Joint with Sören Dittmer, James Rudd, John Aston, Carola-Bibiane Schönlieb

Issues with …

23 of 28

Masquerading as software engineers

10 years ago, this was a software engineer…

Now it is also a:

  • Data scientist

  • Mathematician

  • Statistician

  • Machine learning engineer

24 of 28

Issues with the new world

An explosion of data and sources

We must often structure and preprocess this data ourselves

Without deep domain knowledge, we have lost control of the biases in the data

25 of 28

Issues with the new world

Idealistic illustration of machine learning

The reality …

Data loading → data pre-processing → model training → evaluation

26 of 28

Issues with the new world

  • Code is built as a monolith
  • It is tinkered with until it stops giving errors
  • Train using all the data and then analyse the results

[Figure: typical workflow for an ML project]
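The alternative to the monolith is the pipeline split into small, separately testable stages that check their own assumptions. A minimal sketch with entirely hypothetical names and a toy threshold "model":

```python
from typing import List, Tuple

def load_data() -> List[Tuple[float, int]]:
    """Stand-in loader returning (feature, label) pairs."""
    return [(0.1, 0), (0.9, 1), (0.4, 0), (0.8, 1)]

def preprocess(rows):
    # Enforce assumptions in code rather than hoping: labels must be binary.
    assert all(label in (0, 1) for _, label in rows), "non-binary label"
    return rows

def train(rows):
    # Toy "model": a threshold midway between the two class means.
    mean0 = sum(x for x, y in rows if y == 0) / sum(1 for _, y in rows if y == 0)
    mean1 = sum(x for x, y in rows if y == 1) / sum(1 for _, y in rows if y == 1)
    return (mean0 + mean1) / 2

def evaluate(threshold, rows):
    # Fraction of rows where the threshold rule agrees with the label.
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

rows = preprocess(load_data())
model = train(rows)
print(evaluate(model, rows))
```

Each stage can now be unit-tested and swapped independently, instead of tinkering with one blob until it stops erroring.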

27 of 28

General recommendations / thoughts

  • We need to rethink the incentive structure for machine-learning-based research.
  • Clinical trials are slow, and run in phases, for a reason…
  • Create a study plan in advance addressing imbalance, power, etc.
  • The literature needs purifying.
  • We all need to use checklists, and journals need to enforce their use.
  • Complex datasets + complex systems = a nightmare for reproducibility.
  • Practitioners need to rethink coding methodologies, e.g. Agile.
  • If you don’t enforce your assumptions in the code, someone will break them.
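As one concrete instance of enforcing an assumption in code (patient IDs here are hypothetical): the train and test sets must not share patients, a common leakage bug in the reviewed COVID-19 studies.

```python
# Guard clause that fails loudly instead of silently leaking patients
# between splits.
train_ids = {"p01", "p02", "p03"}
test_ids = {"p04", "p05"}

overlap = train_ids & test_ids
assert not overlap, f"patient-level leakage: {sorted(overlap)}"
print("no patient overlap between train and test")
```

A one-line assertion like this costs nothing and turns a subtle evaluation flaw into an immediate, visible failure.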

28 of 28

Thank you