Is the ML reproducibility crisis a natural consequence?
The Reproducibility Crisis in ML-based Science Workshop
Michael Roberts
Department of Applied Mathematics and Theoretical Physics
Michael Roberts on behalf of the AIX-COVNET Collaboration
email: mr808@cam.ac.uk
A pandemic of reproducibility issues
ML for COVID-19
ML for Incomplete Data
ML at scale
ML and code
ML for COVID-19 imaging
Joint with Derek Driggs, Matthew Thorpe, Julian Gilbey, Angelica I. Aviles-Rivero, James Rudd, Evis Sala, Carola-Bibiane Schönlieb and many AIX-COVNET members
Issues with …
Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0
COVID-19: A perfect test case for image-based ML
So why did image-based ML fail to contribute significantly to the COVID-19 pandemic response?
Systematic Review
Screened: 2,212
Eligible for review: 320
After quality review: 62
Of safe clinical utility: 0
[1] Roberts, M., Driggs, D., Thorpe, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217 (2021). https://doi.org/10.1038/s42256-021-00307-0
Basic pitfalls: Frankenstein datasets
Figure from Cruz et al. Public Covid-19 X-ray datasets and their impact on model bias – A systematic review of a significant problem (2021)
Basic pitfalls: biases in images
Basic pitfalls: biases in labels
[Figure panels labelled: Positive, Negative, Positive, Negative]
Basic pitfalls: biases in models
ML for Incomplete Data
Imputation
Classification
Joint with Tolou Shadbahr, Julian Gilbey, Jan Stanczuk, Philip Teare, Sören Dittmer, John Aston, Carola-Bibiane Schönlieb and many AIX-COVNET members
Issues with …
Shadbahr, T., Roberts, M., Stanczuk, J., Gilbey, J., Teare, P., Dittmer, S., ... & Schönlieb, C. B. Classification of datasets with imputed missing values: does imputation quality matter? (2022). arXiv preprint arXiv:2206.08478.
Why is imputation important?
[1] Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P et al. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study BMJ 2007; 335 :136 doi:10.1136/bmj.39261.471806.55
[2] https://www.bmj.com/rapid-response/2011/11/01/multiple-imputation-needs-be-used-care-and-reported-detail
How should we measure quality?
Panels: True Data, Conditional Expectation, MissForest, MICE
Conditional Expectation: MSE = 1.001641, W1 = 0.000286
MissForest: MSE = 1.094625, W1 = 0.000147
MICE: MSE = 3.007393, W1 = 0.000134
Mean squared error (MSE), or root mean squared error (RMSE), is a common metric for measuring imputation quality.
W1 = Wasserstein distance between imputed and real value distributions
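The tension between the two metrics can be illustrated with a small synthetic sketch. This is not the experiment behind the figures quoted on this slide: the bivariate Gaussian data, the correlation `rho`, and the helper names `mse` and `wasserstein_1` are all assumptions chosen for illustration. It compares imputing a missing variable by its conditional expectation against imputing it by a random draw from the conditional distribution.

```python
import random


def mse(imputed, truth):
    """Mean squared error between imputed and true values."""
    return sum((a - b) ** 2 for a, b in zip(imputed, truth)) / len(truth)


def wasserstein_1(a, b):
    """W1 between two equal-sized 1-D samples: mean gap between sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


rng = random.Random(0)            # fixed seed so the run is repeatable
rho, n = 0.5, 20000               # assumed correlation and sample size
noise_sd = (1 - rho ** 2) ** 0.5  # conditional standard deviation of y given x

x = [rng.gauss(0, 1) for _ in range(n)]                       # always observed
y_true = [rho * xi + noise_sd * rng.gauss(0, 1) for xi in x]  # treated as missing

# Imputation A: the conditional expectation E[y | x] = rho * x
y_mean = [rho * xi for xi in x]
# Imputation B: a random draw from the conditional distribution of y given x
y_draw = [rho * xi + noise_sd * rng.gauss(0, 1) for xi in x]

results = {
    "conditional mean": {"mse": mse(y_mean, y_true),
                         "w1": wasserstein_1(y_mean, y_true)},
    "conditional draw": {"mse": mse(y_draw, y_true),
                         "w1": wasserstein_1(y_draw, y_true)},
}
for name, scores in results.items():
    print(f"{name}: MSE = {scores['mse']:.3f}, W1 = {scores['w1']:.3f}")
```

Under these assumptions the conditional mean should score better on MSE while matching the true distribution worse, because it shrinks the spread of the imputed values. A ranking by MSE alone can therefore disagree with a ranking by distributional distance.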
What are the issues?
ML at scale
Joint with Nick Gleadall, Daniel Kreuter, Samuel Tull, Julian Gilbey, James Rudd, John Aston, Carola-Bibiane Schönlieb, Willem Ouwehand and other members of AIX-COVNET and BloodCounts!
BloodCounts!
Issues with …
Motivation
Biases for machine learning
Observed FBC values
Error introduced due to machine
Error introduced due to clinical practice
True value
Example biases
Change in white cell count after bleeding the patient
Original
Corrected
Clinical biases
Machine biases
Original
Corrected
What are the issues?
ML and code
Joint with Sören Dittmer, James Rudd, John Aston, Carola-Bibiane Schönlieb
Issues with …
Marauding as software engineers
10 years ago, this was a software engineer…
Now it is also a:
Issues with the new world
An explosion of data and sources
We must often structure and preprocess this data ourselves
Without deep domain knowledge, we have lost control of the biases in the data
Issues with the new world
Idealistic illustration of machine learning
The reality …
Data loading
Data pre-processing
Model training
Evaluation
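One way to keep these four stages inspectable is to write each as a small, separately testable function with explicit seeds. The sketch below is only an illustration under stated assumptions: the synthetic one-feature data, the nearest-centroid model, and every function name (`load_data`, `split`, `fit_scaler`, `apply_scaler`, `train_model`, `evaluate`, `run_pipeline`) are hypothetical and not taken from the talk.

```python
import random
from statistics import mean, pstdev


def load_data(n=400, seed=0):
    """Stage 1, data loading: a synthetic stand-in for a real data source."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * label, 1.0), label))
    return rows


def split(rows, test_fraction=0.25, seed=0):
    """Hold out a test set before any statistics are computed."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train):
    """Stage 2, pre-processing: learn mean and spread from training rows only."""
    values = [v for v, _ in train]
    return mean(values), pstdev(values)


def apply_scaler(rows, scaler):
    mu, sd = scaler
    return [((v - mu) / sd, y) for v, y in rows]


def train_model(train):
    """Stage 3, model training: threshold halfway between the class centroids."""
    c0 = mean(v for v, y in train if y == 0)
    c1 = mean(v for v, y in train if y == 1)
    return (c0 + c1) / 2


def evaluate(threshold, test):
    """Stage 4, evaluation: accuracy on the held-out rows."""
    correct = sum((v > threshold) == bool(y) for v, y in test)
    return correct / len(test)


def run_pipeline(seed=0):
    train, test = split(load_data(seed=seed), seed=seed)
    scaler = fit_scaler(train)
    model = train_model(apply_scaler(train, scaler))
    return evaluate(model, apply_scaler(test, scaler))


print(f"held-out accuracy: {run_pipeline(seed=0):.3f}")
```

Because every source of randomness takes an explicit seed, two runs with the same seed give the same number, and each stage can be unit-tested in isolation.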
Issues with the new world
?
Typical workflow for an ML project
STOP
General recommendations / thoughts
Thank you