1 of 30

Reproducible and replicable Deep Learning

Tracking and automating experiments

Erik Ylipää

Linköping University

AIDA Data Hub AI Support Lead

National Bioinformatics Infrastructure Sweden (NBIS)

SciLifeLab

2 of 30

Goal - minimal manual experimentation

MLOps level 0: Manual process

MLOps: Continuous delivery and automation pipelines in machine learning

https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

3 of 30

MLflow

4 of 30

Experiment tracking

API for tracking parameters, metrics, models, images, tables, artifacts and more

Automatic logging for many frameworks (but we will use the manual logging)

5 of 30

Comes with graphical viewer

Easy to analyze parameters, follow training progress and find logged artifacts

6 of 30

MLflow - much more

Handling project dependencies and containerization of model

Model server - easy to build prediction services

Model tracking

7 of 30

Other tools

  • TensorBoard - easy to use with TensorFlow and PyTorch
    • mainly for tracking experiments and basic analysis, not deployment or automation
  • Weights and Biases (wandb) - popular for experiment tracking, but its SaaS model is difficult to integrate with sensitive-data platforms

8 of 30

Reproducibility in deep learning

9 of 30

Which kind of reproducibility would you prefer for machine learning experiments?

  • Exact Reproducibility: Achieving the same results using the exact same code, data, and environment.
  • Approximate Reproducibility: Achieving similar results even if the environment or some hyperparameters are slightly different.
  • Conceptual Reproducibility: Reproducing the core findings and insights even if different methods or data are used.

10 of 30

Getting the exact same results in machine learning is like watching a recording of an experiment

Good as a verification

But has the watcher reproduced the experiment in a meaningful way?

11 of 30

Useful machine learning results should give similar performance on my data!*

Wait, it was data all along?

Always has been.

*assuming my task is similar to the one I’m replicating

Reproducibility is too weak - the goal should be replicability

12 of 30

You only have access to limited data, but you should strive to do as much with it as possible

With clinical data, we rarely have the luxury of external test sets but should try to make the most of the ones we have

Wait, it was data all along?

Always has been.

13 of 30

14 of 30

Why use canonical test sets?

  • Because you must - for getting published
    • If possible - include the canonical test set performance as a point in your performance distribution

Human ML researchers overfitting to the test set, inspired by Hieronymus Bosch

15 of 30

Recipe for sampling

  • We want to answer the question - how robust is our method to sampling variation?
    • Cross-validation is a good default resampling strategy
    • Resist the urge to use a single split unless you have huge amounts of data
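The resampling idea can be sketched as a plain k-fold split over sample indices (pure Python; a real project would use a library implementation, with stratification where appropriate):

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Split sample indices into k disjoint (train, test) folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # deterministic shuffle
    folds = [indices[i::k] for i in range(k)]     # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Every sample lands in exactly one test fold
splits = list(kfold_indices(100, k=10))
print(len(splits))  # 10 (train, test) pairs
```

Training and evaluating once per fold yields a distribution of scores rather than a single number, which is what lets you talk about robustness to sampling variation.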

16 of 30

Nested cross-validation

  • Deep learning models typically need a development set (often confusingly called a validation set)
    • Try to control for the variation in dataset split using nested cross-validation
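The nesting can be sketched as an outer loop over test folds with an inner loop, on the remaining data only, to pick hyperparameters. Here `train_and_score` is a hypothetical stand-in for an actual training run:

```python
import random

def kfold(items, k, seed=0):
    """Yield (rest, held_out) splits of items over k folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        rest = [x for j, f in enumerate(folds) if j != i for x in f]
        yield rest, held_out

def train_and_score(train, dev, hyperparams):
    # Hypothetical stand-in: train a model on `train`, return its score on `dev`.
    return random.Random(hash((len(train), hyperparams))).random()

def nested_cv(data, candidates, outer_k=5, inner_k=5):
    outer_scores = []
    for outer_train, test in kfold(data, outer_k):
        # Inner loop: choose hyperparameters using development folds only
        def mean_dev_score(hp):
            scores = [train_and_score(tr, dev, hp)
                      for tr, dev in kfold(outer_train, inner_k)]
            return sum(scores) / len(scores)
        best_hp = max(candidates, key=mean_dev_score)
        # Evaluate the chosen configuration on the untouched test fold
        outer_scores.append(train_and_score(outer_train, test, best_hp))
    return outer_scores  # a distribution of test scores, one per outer fold

scores = nested_cv(list(range(100)), candidates=[0.1, 0.01, 0.001])
print(len(scores))  # one score per outer fold
```

The key property is that each test fold is never seen by the inner hyperparameter selection, so the outer scores estimate how the whole procedure generalizes, including the tuning.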

17 of 30

Machine learning results should replicate independent of which human runs the experiment!

A grad student manually searching for the optimal hyperparameters - Grad Student Descent

Assume that the method will need different hyperparameters for different datasets. These should not be discovered manually.

18 of 30

Hyperparameter Optimization

Shahriari, Bobak, et al. "Taking the human out of the loop: A review of Bayesian optimization." Proceedings of the IEEE 104.1 (2015): 148-175.
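Full Bayesian optimization needs a library, but even plain random search takes the human out of the loop. A self-contained sketch, where the objective is a made-up stand-in for "train a model, return its development score":

```python
import math
import random

def objective(learning_rate, weight_decay):
    """Made-up stand-in for a real training run; peaks near lr = 1e-3, wd = 0."""
    return -abs(math.log10(learning_rate) + 3) - 10 * weight_decay

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sample the learning rate on a log scale, as is common practice
        trial = {
            "learning_rate": 10 ** rng.uniform(-5, -1),
            "weight_decay": rng.uniform(0.0, 0.1),
        }
        score = objective(**trial)
        if best is None or score > best[0]:
            best = (score, trial)
    return best

score, params = random_search(n_trials=50)
print(params)  # best configuration found
```

A Bayesian optimizer replaces the uniform sampling with a model of the objective that proposes promising trials, but the surrounding loop - propose, train, score, record - is the same.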

19 of 30

20 of 30

Taking it to the extreme

  • If we care a lot about finding the most robust hyperparameters, we can resample for each hyperparameter trial
  • If you are limited by computation, search for hyperparameters on a single split instead
  • It’s still a generalizable method, and the nested cross-validation will tell you if it was good

21 of 30

Is it worth it?

  • Setting up experiments for proper evaluation is more engineering work
    • Is often amortized over future experiments
  • Running enough experiments to get reliable results is expensive
    • Doing nested cross-validation increases the number of experiments multiplicatively
      • E.g. train 100 models instead of 10 for nested 10x10 cross-validation
      • With hyperparameter optimization (HPO), this can become untenable: 20 HPO trials would increase this 20-fold (20x10x10 = 2000 model trainings)
    • But as compute becomes more powerful, this is a very easy way to make use of it

22 of 30

What hyperparameters to search over

  • Search budget is always limited, but these are good to include
    • Optimizer (e.g. AdamW):
      • Learning rate - most important
      • Momentum parameters (Beta 1 and 2)
      • Weight decay
      • Learning rate scheduler
    • Effective batch size (using gradient accumulation)
    • Regularization parameters
      • Dropout
      • Normalization layers
    • Neural network architecture
      • Number of layers
      • Size of layers
      • Random seed?
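The list above could be encoded as a search-space definition that an automated search loop draws from (all names and ranges here are illustrative, not prescriptions):

```python
import random

# Illustrative search space mirroring the bullet list above
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -2),  # log scale
    "beta1":         lambda rng: rng.uniform(0.85, 0.95),    # momentum terms
    "beta2":         lambda rng: rng.uniform(0.99, 0.9999),
    "weight_decay":  lambda rng: 10 ** rng.uniform(-4, -1),
    "batch_size":    lambda rng: rng.choice([16, 32, 64, 128]),  # effective, via grad. accumulation
    "dropout":       lambda rng: rng.uniform(0.0, 0.5),
    "num_layers":    lambda rng: rng.randint(2, 8),
    "hidden_size":   lambda rng: rng.choice([128, 256, 512]),
    "seed":          lambda rng: rng.randint(0, 2**31 - 1),  # yes, the seed too
}

def sample_config(seed=0):
    """Draw one full configuration from the search space."""
    rng = random.Random(seed)
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}

config = sample_config()
print(config)
```

Keeping the space in one declarative structure makes the search reproducible and easy to log alongside each trial.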

23 of 30

The random weight initialization

Åkesson, Julius, Johannes Töger, and Einar Heiberg. "Random effects during training: Implications for deep learning-based medical image segmentation." Computers in Biology and Medicine 180 (2024): 108944.

24 of 30

We need to talk about random seeds

  • Fixing the random seed is often done for “reproducibility”, but it’s often chosen as some arbitrary value (e.g. 42)
  • What the initial weights of the network are matters a lot; they are part of the architecture just like its size
  • Any arbitrary random seed is likely to produce a model which performs worse than a tuned random seed

Bethard, Steven. "We need to talk about random seeds." arXiv preprint arXiv:2210.13393 (2022).
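The point is easy to see even in a toy setting: the same initialization code with different seeds produces different weights, i.e. different models, and there is no reason the arbitrary choice 42 is the best one. A pure-Python sketch (real initializations use a framework's RNG, but the behavior is the same):

```python
import random

def init_weights(n, seed):
    """Toy stand-in for a framework's random weight initialization."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]

w42 = init_weights(10, seed=42)
w43 = init_weights(10, seed=43)

# Same seed -> identical weights; different seed -> a different model
assert init_weights(10, seed=42) == w42
assert w42 != w43
```

Treating the seed as just another hyperparameter, and reporting performance over several seeds, separates the effect of the method from the luck of one draw.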

25 of 30

In summary

  • Amount of compute is constantly increasing - we can use the increased capacity to improve replicability
    • Normalize not getting a machine learning result in a couple of hours
  • Use (nested) cross-validation
  • Automate as much as possible:
    • Track everything
    • Use hyperparameter optimization

26 of 30

Thank you!

erik.ylipaa@scilifelab.se

27 of 30

Scientific value:

  • Exact reproducibility: This is the gold standard for reproducibility because it verifies that results are not only theoretically sound but also practically achievable under the same conditions. It allows for a direct verification of claims made by the original research.
  • Conceptual reproducibility: This is highly valuable as it demonstrates that the underlying principles and findings are robust across different implementations and datasets. It contributes to the generalization of findings and their applicability to real-world scenarios.
  • Approximate reproducibility: Useful for validating that results are not highly sensitive to minor variations. While not as stringent as exact reproducibility, it still provides confidence in the robustness of findings.

29 of 30

What makes a good algorithm?

Abstract: In this work we present a novel algorithm which can sort a list in O(1) time. This is considerably faster than existing algorithms, which sort lists in O(N log N) time for comparison-based sorting or O(N) for index-based sorting. We demonstrate the performance of our algorithm on the commonly used range(10) benchmark.

def constant_sort(l):
    return [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

30 of 30

What is best?

  • Each time you train the machine, you get the exact same result
  • Each time you train the machine, you get slightly different results