1 of 28

Is the ML reproducibility crisis a natural consequence?
The Reproducibility Crisis in ML-based Science Workshop

Michael Roberts
Department of Applied Mathematics and Theoretical Physics
On behalf of the AIX-COVNET Collaboration

email: mr808@cam.ac.uk

2 of 28

A pandemic of reproducibility issues

ML for COVID-19

ML for Incomplete Data

ML at scale

ML and code

3 of 28

ML for COVID-19 imaging

Joint with Derek Driggs, Matthew Thorpe, Julian Gilbey, Angelica I. Aviles-Rivero, James Rudd, Evis Sala, Carola-Bibiane Schönlieb and many AIX-COVNET members

Issues with …

Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0

4 of 28

COVID-19: A perfect test case for image based ML

  • Never before have we had access to data:
    • at such scale for a single disease
    • all collected around the same time period
    • collected on the same machines
    • for a disease with a short infection time

So why did image based ML fail to contribute significantly to the COVID-19 pandemic?

5 of 28

Systematic Review

  • Eligibility: Any papers using ML and CXR/CT imaging for COVID-19 diagnosis or prognosis.

Papers screened: 2,212 → eligible for review: 320 → remaining after quality review: 62 → of safe clinical utility: 0

[1] Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0

6 of 28

Basic pitfalls: Frankenstein datasets

  • Know where your data comes from!


Figure from Cruz et al. Public Covid-19 X-ray datasets and their impact on model bias – A systematic review of a significant problem (2021)
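One mechanical defence against Frankenstein datasets is to check for the same image appearing in more than one source before merging. A minimal sketch, with hypothetical dataset names and in-memory image bytes standing in for real files, using exact content hashing:

```python
import hashlib

# Hash every image's bytes and flag any that appear in more than one source.
datasets = {
    "source_a": [b"image-bytes-1", b"image-bytes-2"],
    "source_b": [b"image-bytes-2", b"image-bytes-3"],  # shares one image with source_a
}

seen = {}        # digest -> first dataset it was seen in
duplicates = []  # (first dataset, second dataset, short digest)
for name, images in datasets.items():
    for img in images:
        digest = hashlib.sha256(img).hexdigest()
        if digest in seen and seen[digest] != name:
            duplicates.append((seen[digest], name, digest[:8]))
        seen.setdefault(digest, name)

print(len(duplicates))
```

Note that exact hashing only catches byte-identical files; re-encoded or resized copies of the same scan need perceptual hashing or metadata checks.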


8 of 28

Basic pitfalls: biases in images


  • Know where your data comes from!
  • Appreciate the biases in your data.

9 of 28

Basic pitfalls: biases in labels


  • Know where your data comes from!
  • Appreciate the biases in your data.
  • Ground truth assigned based on images.

[Figure: example images with ground-truth labels Positive / Negative]

10 of 28

Basic pitfalls: biases in models


  • Know where your data comes from!
  • Appreciate the biases in your data.
  • Ground truth assigned based on images.
  • Resolution driven by pretrained networks

11 of 28

ML for Incomplete Data

Imputation

Classification

Joint with Tolou Shadbahr, Julian Gilbey, Jan Stanczuk, Philip Teare, Sören Dittmer, John Aston, Carola-Bibiane Schönlieb and many AIX-COVNET members

Issues with …

Shadbahr, T., Roberts, M., Stanczuk, J., Gilbey, J., Teare, P., Dittmer, S., ... & Schönlieb, C. B. Classification of datasets with imputed missing values: does imputation quality matter? (2022). arXiv preprint arXiv:2206.08478.

12 of 28

Why is imputation important?

  • QRISK is a model for predicting your risk of a cardiovascular event in the next 10 years.

  • Initially, the published paper [1] found no link between cholesterol and outcome.

  • Other researchers [2] showed that when only the complete records were analysed, the cholesterol link appeared.

  • After the imputation method was improved, the link to cholesterol was also recovered in the full analysis.

  • The algorithm is now a standard used in the UK NHS.

[1] Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P et al. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study BMJ 2007; 335 :136 doi:10.1136/bmj.39261.471806.55

[2] https://www.bmj.com/rapid-response/2011/11/01/multiple-imputation-needs-be-used-care-and-reported-detail

13 of 28

How should we measure quality?

[Figure: the same dataset imputed by three methods, compared against the true data]

  • Conditional expectation: MSE = 1.001641, W1 = 0.000286
  • MissForest: MSE = 1.094625, W1 = 0.000147
  • MICE: MSE = 3.007393, W1 = 0.000134

Mean / root mean square error is a common metric for measuring imputation quality.

W1 = Wasserstein distance between the imputed and real value distributions
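The tension between the two metrics can be reproduced on synthetic data. This is an illustrative sketch (not the paper's experiment): imputing with the mean minimises MSE but collapses the value distribution, while drawing from the estimated distribution keeps W1 small at the cost of a higher MSE.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
true_vals = rng.normal(0.0, 1.0, size=10_000)   # the values that went missing

imp_mean = np.zeros_like(true_vals)             # impute everything with the mean (0)
imp_draw = rng.normal(0.0, 1.0, size=10_000)    # impute by sampling from N(0, 1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

print(mse(true_vals, imp_mean), wasserstein_distance(true_vals, imp_mean))
print(mse(true_vals, imp_draw), wasserstein_distance(true_vals, imp_draw))
```

Mean imputation gives MSE ≈ 1 but W1 ≈ 0.8 (the mean absolute value of a standard normal), while sampling gives MSE ≈ 2 and W1 near 0: a low MSE does not imply good distributional quality.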

16 of 28

What are the issues?

  • We find that many new imputation methods fail to both:
    • recreate the data distribution
    • give stable imputations
  • This compromises model interpretability
  • Missingness may itself be informative for the models
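The stability point above can be made concrete with a toy check (hypothetical data, with simple random-draw imputation standing in for a real stochastic imputer): impute the same masked entries twice with different seeds and measure how much the two runs disagree.

```python
import numpy as np

rng1, rng2 = np.random.default_rng(0), np.random.default_rng(1)
observed = np.array([1.0, 2.0, 2.5, 3.0, 10.0])  # the non-missing values
n_missing = 1_000

# Two independent runs of the same stochastic imputation procedure.
imp_a = rng1.choice(observed, size=n_missing)
imp_b = rng2.choice(observed, size=n_missing)

# Mean absolute disagreement between runs: far from 0 means the
# per-value imputations are not reproducible.
instability = float(np.mean(np.abs(imp_a - imp_b)))
print(instability)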

17 of 28

ML at scale

Joint with Nick Gleadall, Daniel Kreuter, Samuel Tull, Julian Gilbey, James Rudd, John Aston, Carola-Bibiane Schönlieb, Willem Ouwehand and other members of AIX-COVNET and BloodCounts!

BloodCounts!

Issues with …

18 of 28

Motivation

  • The full blood count (FBC) is the most commonly used medical test: 3.6 billion are performed globally per year, around 10 million per day.

  • Essential for decision making in primary and secondary care

  • We are linking together thousands of machines.

19 of 28

Biases for machine learning

Observed FBC value = true value + error introduced by the machine + error introduced by clinical practice
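A toy simulation of the decomposition above (all numbers hypothetical), with a naive per-machine correction that aligns every analyser to the grand mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n_machines, n_per = 5, 1_000
true = rng.normal(7.0, 1.5, size=(n_machines, n_per))        # true white cell counts
machine_offset = rng.normal(0.0, 0.8, size=(n_machines, 1))  # per-analyser bias
observed = true + machine_offset

# Naive correction: shift each machine's measurements to the grand mean.
corrected = observed - observed.mean(axis=1, keepdims=True) + observed.mean()

def spread(x):
    """Standard deviation of the per-machine means."""
    return float(np.std(x.mean(axis=1)))

print(spread(observed), spread(corrected))
```

This centring would also erase genuine clinical differences between sites, which is exactly why the variability must be modelled rather than blindly removed.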

20 of 28

Example biases

Clinical biases (e.g. change in white cell count with time since the patient was bled; original vs corrected):

  • Time between bleed and testing

  • Temperature of the storage

  • Differing service at weekends

  • Seasonal variations in the blood

Machine biases (original vs corrected):

  • Measurements vary between machines

  • At surveillance scale, we will have 1000s of machines in the network

  • We must model and remove this variability

21 of 28

What are the issues?

  • There are no publications which strip out this bias
  • Many hospitals reserve one machine for urgent care
  • Models cannot generalise across sites where clinical practice differs

22 of 28

ML and code

Joint with Sören Dittmer, James Rudd, John Aston, Carola-Bibiane Schönlieb

Issues with …

23 of 28

Masquerading as software engineers

10 years ago, this was a software engineer…

Now it is also a:

  • Data scientist

  • Mathematician

  • Statistician

  • Machine learning engineer

24 of 28

Issues with the new world

An explosion of data and sources

We must often structure and preprocess this data ourselves

Without deep domain knowledge, we have lost control of the biases in the data

25 of 28

Issues with the new world

Idealistic illustration of machine learning

The reality …

Data loading → data pre-processing → model training → evaluation

26 of 28

Issues with the new world

  • Code is built as a monolith
  • It is tinkered with until it stops giving errors
  • Train using all the data and then analyse the results

[Figure: typical workflow for an ML project]
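The alternative to the monolith is the pipeline split into small, separately testable stages that check their own assumptions. A minimal sketch with entirely hypothetical names and a toy threshold "model":

```python
from typing import List, Tuple

def load_data() -> List[Tuple[float, int]]:
    """Stand-in loader returning (feature, label) pairs."""
    return [(0.1, 0), (0.9, 1), (0.4, 0), (0.8, 1)]

def preprocess(rows):
    # Enforce assumptions in code rather than hoping: labels must be binary.
    assert all(label in (0, 1) for _, label in rows), "non-binary label"
    return rows

def train(rows):
    # Toy "model": a threshold midway between the two class means.
    mean0 = sum(x for x, y in rows if y == 0) / sum(1 for _, y in rows if y == 0)
    mean1 = sum(x for x, y in rows if y == 1) / sum(1 for _, y in rows if y == 1)
    return (mean0 + mean1) / 2

def evaluate(threshold, rows):
    # Fraction of rows where the threshold rule agrees with the label.
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

rows = preprocess(load_data())
model = train(rows)
print(evaluate(model, rows))
```

Each stage can now be unit-tested and swapped independently, instead of tinkering with one blob until it stops erroring.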

27 of 28

General recommendations / thoughts

  • We need to rethink the incentive structure for machine-learning-based research.
  • Clinical trials are slow, and run in phases, for a reason…
  • Create a study plan in advance addressing imbalance, power, etc.
  • The literature needs purifying.
  • We all need to use checklists, and journals need to enforce their use.
  • Complex datasets + complex systems = a nightmare for reproducibility.
  • Practitioners need to rethink coding methodologies, e.g. Agile.
  • If you don’t enforce your assumptions in the code, someone will break them.
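As one concrete instance of enforcing an assumption in code (patient IDs here are hypothetical): the train and test sets must not share patients, a common leakage bug in the reviewed COVID-19 studies.

```python
# Guard clause that fails loudly instead of silently leaking patients
# between splits.
train_ids = {"p01", "p02", "p03"}
test_ids = {"p04", "p05"}

overlap = train_ids & test_ids
assert not overlap, f"patient-level leakage: {sorted(overlap)}"
print("no patient overlap between train and test")
```

A one-line assertion like this costs nothing and turns a subtle evaluation flaw into an immediate, visible failure.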

28 of 28

Thank you