Intro to Reproducibility!
Candace Savonen
Image created by Candace Savonen using Avataars.
Variable A
Variable B
R = 0.893
This code runs well on my computer, let me email it to you!
So exciting!
ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR
ERROR ERROR
Ruby’s code and data
Error: file path “Ruby’s computer/Ruby’s file/final_version10.R” not found
Re:Re:Re: Data
Hi Ruby, I don’t understand what this code is supposed to be doing...
Re:Re:Re: Data
Hi Avi, It works for me?
Variable A
Variable B
Variable A
Variable B
R² = 0.893
R² = 0.891
Ruby’s code and data
Data
Code
Variable A
Variable B
Results
Ruby the Researcher
Repeatable: keeping everything the same but repeating the analysis - do we get the same results?
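A minimal sketch of what a repeatability check could look like, assuming a toy analysis in Python. The function names, the seed, and the simulated data are hypothetical and are not Ruby's actual analysis; the point is only that the same code on the same data should give the same number on every run.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def run_analysis(seed=42, n=100):
    """Toy analysis (hypothetical): simulate Variable A and B, return their correlation."""
    rng = random.Random(seed)  # fixed seed so the simulated data are identical each run
    variable_a = [rng.gauss(0, 1) for _ in range(n)]
    variable_b = [a + rng.gauss(0, 0.5) for a in variable_a]
    return pearson_r(variable_a, variable_b)


first_run = run_analysis()
second_run = run_analysis()
print("Repeatable:", first_run == second_run)
```

If the two runs disagree, something uncontrolled (for example, an unset random seed) is affecting the result.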
Data
Code
Variable A
Variable B
Results
Ruby the Researcher
Reproducible: using the same data and analysis but in the hands of another researcher - do we get the same results?
Avi the Associate
Data
Code
New Data
Same Code
Ruby the Researcher
Replicable: with new data do we obtain the same inferences?
Avi the Associate
Variable A and B are positively correlated
Code
Based on a figure from Essawy et al., 2020: https://doi.org/10.1016/j.envsoft.2020.104753
Effort
Time
Replicability
new researcher, new data
Reproducibility
new researcher, same data
Repeatability
same researcher, same data
Reproducibility saves everyone time and effort!
Ruby’s code
ERROR
Ruby’s code
Now Ruby
Future Ruby
Ruby’s code - not as reproducible
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
Ruby’s code - made reproducible
ERROR
Reproducibility is a tortoise’s game - it’s an incremental and slow process but it has high payoffs!
Science is a community effort – we each are contributing a domino!
Reproducibility is iterative work!
Image created by Candace Savonen.
Not repeatable (every analysis starts here)
Ran once
Re-runs sometimes
Re-runs reliably in most contexts
Re-runs in every situation and gets the same result every time
Perfectly reproducible (no analysis reaches here)
A tale of two differential expression analyses…
Example 1:
Example 2:
Reproducibility != Correctness
Reproducibility ~ Consistency
But you could be consistently wrong in the same way…
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
Now!
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
Tomorrow!
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
We will cover more about this on Day 3!
Metadata - the often overlooked but mightily crucial data
What are metadata?
Anything and everything that should be known about your samples!
A B C D
E F G H
sample_id | mouse_id | processing_date | treatment | … |
A | 1 | 3-10-21 | None | … |
B | 1 | 4-12-21 | None | … |
C | 2 | 3-10-21 | None | … |
D | 2 | 4-12-21 | None | … |
E | 3 | 3-10-21 | Morphine | … |
F | 3 | 4-12-21 | Morphine | … |
G | 4 | 3-10-21 | Morphine | … |
H | 4 | 4-12-21 | Morphine | … |
I know everything I need to know about these samples from their metadata!
Examples of metadata categories:
Rules for creating metadata (from Broman & Woo, 2017)
Be Consistent
Choose good names for things
Write Dates as YYYY-MM-DD
No Empty Cells
Put Just One Thing in a Cell
Make it a Rectangle
Rules for creating metadata continued (from Broman & Woo, 2017)
Create a Data Dictionary
No Calculations in the Raw Data Files
Do Not Use Font Color or Highlighting as Data
Make Backups
Use Data Validation to Avoid Errors
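Several of these rules can be checked automatically. The sketch below is one possible validator, not a method from Broman & Woo: it checks that the table is a rectangle, that no cell is empty, and that any column whose name ends in `_date` uses YYYY-MM-DD. The function names, the `_date` naming convention, and the example rows are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def is_iso_date(value):
    """True only for zero-padded, valid calendar dates written as YYYY-MM-DD."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").strftime("%Y-%m-%d") == value
    except ValueError:
        return False


def check_metadata(csv_text):
    """Return a list of human-readable problems found in a metadata table."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    for line_number, record in enumerate(records, start=2):
        # "Make it a Rectangle": every row needs as many cells as the header.
        if len(record) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} cells, found {len(record)}"
            )
            continue
        for column, value in zip(header, record):
            if value.strip() == "":
                # "No Empty Cells"
                problems.append(f"line {line_number}: empty cell in '{column}'")
            elif column.endswith("_date") and not is_iso_date(value):
                # "Write Dates as YYYY-MM-DD"
                problems.append(
                    f"line {line_number}: '{value}' in '{column}' is not YYYY-MM-DD"
                )
    return problems


# Hypothetical example rows, loosely modeled on the sample table in these slides.
example = "\n".join([
    "sample_id,mouse_id,processing_date,treatment",
    "A,1,2021-03-10,None",
    "B,1,4-12-21,None",
    "C,2,2021-03-10,",
])
for problem in check_metadata(example):
    print(problem)
```

Running a check like this whenever the metadata file changes catches formatting drift before it reaches the analysis.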
If you have human data, the metadata is probably loaded with PII and/or PHI
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
A usable, well-documented analysis is more likely to be used and disseminated!
CC-BY by jhudatascience.org
Documentation that every project should have!
LLM | What is it really good at? | What does it struggle with? |
Bard | | |
ChatGPT | | |
Claude | | |
Phind | | |
Use AI tools that are trained for the task you are trying to do!
LLMs can be good at explaining what legacy or confusing code is doing
def my_function(x):
    result = x
    for i in range(10):
        for j in range(5):
            result = result + 2 * (i + 1) * (j + 1) * (i % 2 == 0 and j % 2 == 0) - 1
    return result
Wait, what is this code even?
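One way to answer that question, whether you or an AI assistant does the untangling, is to write a plainer version and confirm it behaves the same as the original. The sketch below is an illustration added here, not part of the original slides; `my_function_readable` is a hypothetical name, and the check assumes integer inputs.

```python
def my_function(x):
    # Original, hard-to-read version from the slide.
    result = x
    for i in range(10):
        for j in range(5):
            result = result + 2 * (i + 1) * (j + 1) * (i % 2 == 0 and j % 2 == 0) - 1
    return result


def my_function_readable(x):
    """Plainer rewrite: the same arithmetic, spelled out step by step."""
    penalty = 10 * 5  # each of the 50 loop iterations subtracts 1
    bonus = 0
    for i in range(0, 10, 2):      # only even i contribute
        for j in range(0, 5, 2):   # only even j contribute
            bonus += 2 * (i + 1) * (j + 1)
    return x + bonus - penalty


# Check that both versions agree on a handful of integer inputs.
for value in [0, 1, -3, 100]:
    assert my_function(value) == my_function_readable(value)
```

Worked through by hand, the bonus is 450 and the penalty is 50, so for integer inputs the whole function reduces to adding 400 to `x`. Always verify an explanation like this against the original code rather than trusting it outright.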
plot-data-2020-9-11.tsv
plot-data-20-10-2020.tsv
plot-data-20-10-2020-clean.tsv
plot_final.R
plot_final_FINAL.R
plot_final_old.R
plot.py
functions.R
functions-old.R
plot-final.png
plot-new.png
AI Coding Assistants
Free options!
Getting a basic strategy for how to write code for something
Example prompts:
Reviewing existing code for improvements
Example prompts:
Annotating or improving documentation
Example prompts:
https://hutchdatascience.org/AI_for_software/annotating-your-code.html
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
Goals for an organizational scheme:
Image created by Candace Savonen.
Chaos reigns - nothing can be found
You lose sleep worrying about your file naming
Perfectly organized but maybe not maintainable
Disorganized and unmanageable
Maintainably organized
Project Organization tips (not one size fits all)
TIP | Example |
Use informative names | metadata_df, expressed_gene_list, 02_tumor_heatmap.py |
Number scripts in the order they are run | 01_download_data.sh |
Keep like files with like files | Keeping results and raw data each in their own folders |
Central document (like a README) | README.md |
Dates in file names aren’t necessary | run_analysis.sh |
A central script that re-runs the whole thing | run_analysis.sh |
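The last two tips in the table combine well: if scripts are numbered in run order, a central script can discover and run them automatically. The sketch below is one possible version in Python, assuming the numbered steps are Python files named like `01_something.py`; the slides' own example is a shell script, `run_analysis.sh`. The function names and the `scripts` folder are hypothetical.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches file names such as 01_download_data.py or 02_tumor_heatmap.py.
SCRIPT_PATTERN = re.compile(r"^\d{2}_.+\.py$")


def ordered_scripts(directory):
    """Return the numbered scripts in a folder, sorted by their numeric prefix."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and SCRIPT_PATTERN.match(path.name)
    )


def run_all(directory):
    """Run every numbered script in order, stopping at the first failure."""
    for script in ordered_scripts(directory):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([sys.executable, str(script)], check=True)
```

Calling `run_all("scripts")` would then re-run the whole analysis from the first step to the last, so nobody has to remember the order.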
Don’t be afraid of folders!
Building Reproducibility Skill Sets!
Part 2
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
final_final_version_100%_up_to_date
final_version_2021
final_version_edit_10_2021
final_version_edit5
final_version_edit4
for_realz_final_edit2
final_final_version_100%_up_to_date
final_version_2021
final_version_edit5
final_version_edit4
for_realz_final_edit2
Now was it “final_final_version_100%_up_to_date” or “final_version_edit5” that I was working from?
Course: Intro to Reproducibility and Advanced Reproducibility
VERSION CONTROL
NEEDED!
Cancer data genomics atlas study
American cancer journal of America
“Code and data are available upon request by email.”
I’m requesting the code from your recent paper in the journal....
The corresponding author’s inbox
999,999,565473
OPEN SOURCE
NEEDED!
Open source! Code is publicly available for reuse and repurpose
Oh no! My computer broke! Good thing GitHub has all my work.
Backups! For the worst-case scenario
Why did we write the code this way? I don’t remember… Good thing Git tracks this file’s history, so I can look back and remind myself how it got this way.
Keep better records! Cut down on re-finding the same thing
main
test analysis
main with test analysis added
time/work
Keep the “perfected” version safe! Free to experiment with easy restore capabilities
main
Avi’s changes added
Avi’s and Ruby’s changes added
Easier to work simultaneously with teammates!
No emailing files back and forth
Code is best done as a team activity!
If this isn’t an option, at least use a stand-in reviewer: AI
Why GitHub?
What’s a pull request?
A way to propose changes so that they can be discussed before they are incorporated
https://github.com/FredHutch/gimap/pull/33
Direct changes | Pull request model |
Less verification | More eyes to check the changes |
Knowledge silos more likely to happen | More people know what’s happening (knowledge transfer) |
Mistakes may infiltrate the public version | Better chance of catching mistakes sooner |
Records of the decisions made may be more diffuse or nonexistent | Better records of communication about the analysis/software |
Intro to GitHub -
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
We don’t work alone!
Sometimes the collaborator with questions is “Future You”.
Readability >>>>> Cleverness
Ruby’s code
ERROR
Ruby’s code
Now Ruby
Future Ruby
With AI tools, even a “lone programmer” can still get a code review!
Variable A
Variable B
R = 0.893
This code runs well on my computer, let me email it to you!
So exciting!
Main goals of an original code author
Ruby wants to merge 5 commits into main from a-new-branch
Reviewers
Avi
Code changes #10
I updated the project and added new files.
Owner
...
Ruby commented 10 minutes ago
1870
0
10
2
Ummm… I’m a bit confused. Can you explain the context of these changes? This is a lot for me to try to follow. I’m also unsure what kind of feedback you are looking for.
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Ruby wants to merge 5 commits into main from a-new-branch
Reviewers
Avi
Update documentation for heatmap-script.R
Background
In this previous PR we updated the heatmap-script.R file. But now the documentation in the README is out of date. This PR aims to update the README accordingly.
Approach
I updated the README with information on the new arguments we added. This also required me to update the Usage section and recommendations there.
Feedback needed
Can you look at the Usage section and try running the command and steps described there? I am concerned that this section is not clear enough but I am not sure how to add clarity. Please let me know if you have suggestions on this point.
Owner
...
Ruby commented 10 minutes ago
1
0
2
2
Ruby this is great! I was able to dig into this and give you feedback at the places you asked. Let me know what you think of my ideas and comments!
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Image from https://phauer.com/2018/code-review-guidelines/
Main goals of a reviewer
Empathy is an important part of effective code review!
Image from https://quickbirdstudios.com/blog/code-review-best-practices-guidelines/
Remember that the author of the pull request has put time and effort into this!
This code needs work.
Don’t use the formattR package it’s inefficient and takes forever to run.
You didn’t style the last chunk of code.
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Ruby, thanks for all this work! This is a great start! I have a few questions so we can further polish this code.
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
CI/CD (Continuous Integration/Continuous Deployment): a software concept that is useful for reproducible data analyses!
Course: GitHub Automation for Scientists
Every time a change is proposed, we check it before incorporating it.
Change made → Run a series of tests
CI/CD: Continuous Integration/Continuous Deployment
I will re-run this analysis whenever changes are made to it.
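A minimal sketch of the kind of check an automated re-run could perform, assuming the analysis produces one summary statistic (a correlation) and that a previously accepted value has been recorded. The function names, the example data, the recorded value, and the tolerance are all hypothetical; a real project would choose its own checks.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def check_against_recorded(current, recorded, tolerance=0.01):
    """True when the re-computed statistic is within `tolerance` of the recorded one."""
    return math.isclose(current, recorded, rel_tol=0.0, abs_tol=tolerance)


def ci_check():
    """Re-run the (toy) analysis and return 0 on success, 1 on failure."""
    variable_a = [1, 2, 3, 4, 5]   # hypothetical stand-in data
    variable_b = [2, 4, 5, 4, 5]
    recorded_r = 0.7746            # hypothetical value accepted in an earlier run
    current_r = pearson_r(variable_a, variable_b)
    if check_against_recorded(current_r, recorded_r, tolerance=0.001):
        print(f"OK: R = {current_r:.4f}")
        return 0
    print(f"CHANGED: R = {current_r:.4f}, expected about {recorded_r}")
    return 1
```

An automation service can run a script like this on every proposed change and treat a non-zero return value as a failed check, so a shifted result is noticed right away rather than several changes later.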
Work smarter, not harder: automation in a nutshell
Continuous integration / continuous deployment
You’re a construction manager
Should you:
OR
Problems accumulate without using CI/CD
Time/Effort
Re-run
Re-run
A bug being introduced
3 bugs to track down!
Catching changes/problems early with CI/CD
Time/Effort
Re-run
Re-run
Re-run
Re-run
1 bug
1 bug
1 bug
Catching changes/problems early with CI/CD
R = 0.902
R = 0.902
R = 0.905
R = 0.902
Time/Effort
Re-run
Re-run
Re-run
Re-run
This is only the beginning!