ITN: GitHub and Reproducibility Skillsets
Slides: https://bit.ly/itcr_2024_GitHub
Candace Savonen, Carrie Wright, and Kate Isaac
ITN Workshops!
While you are waiting:
Join at slido.com #7691632
Have your phone (or a separate tab) handy for interactive polls!
What is your favorite candy?
This helps us keep you informed of upcoming workshops, and to follow up and see how you are using what you learned.
How confident do you feel about using GitHub?
What is the ITN?
ITCR Training Network
Catalyzing informatics research through training opportunities
User preparedness
Gap
Tool usability
Informatics research is hindered by a gap between different types of experts
CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/
Catalyzing Informatics for Research
Elements of ITN:
ITN courses
Write durable code
Organize your project
Understand the importance of code review
Use computational notebooks
Concepts discussed in Introductory Reproducibility in Cancer Informatics course:
Make your project open source with GitHub
Manage package versions
Document analyses
CC-BY jhudatascience.org
Use automation (GitHub actions)
Engage in code review
Use a Docker image
Discussed in the sequel course: Advanced Reproducibility for Cancer Informatics
Utilize version control
Get comfortable with GitHub concepts and workflow
Modify a Docker image
CC-BY jhudatascience.org
What is reproducibility?
Image created by Candace Savonen using Avataars.
Variable A
Variable B
R = 0.893
This code runs well on my computer, let me email it to you!
So exciting!
Image created by Candace Savonen using Avataars.
ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR
Ruby’s code and data
Image created by Candace Savonen using Avataars.
Error: file path “Ruby’s computer/Ruby’s file/final_version10.R” not found
Re:Re:Re: Data
Hi Ruby, I don’t understand what this code is supposed to be doing...
Re:Re:Re: Data
Hi Avi, It works for me?
Image created by Candace Savonen using Avataars.
[Figure: two plots of Variable A against Variable B, one with R = 0.893 and one with R = 0.891]
Ruby’s code and data
Based off of a figure from Essawy et al, 2020 https://doi.org/10.1016/j.envsoft.2020.104753
Effort
Time
Replicability
new researcher, new data
Reproducibility
new researcher, same data
Repeatability
same researcher, same data
Reproducibility saves everyone time and effort!
Image created by Candace Savonen using Avataars.
Ruby’s code
ERROR
Ruby’s code
Now Ruby
Future Ruby
Image created by Candace Savonen using Avataars.
Ruby’s code - not as reproducible
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
ERROR
Image created by Candace Savonen using Avataars.
Ruby’s code - made reproducibly
ERROR
Reproducibility is a tortoise’s game - it’s an incremental and slow process but it has high payoffs!
Image by Candace Savonen
I will re-run the code consistently and instantly upon whatever trigger I’m given.
Robots can help you with reproducibility!
Reproducibility is iterative work!
Image created by Candace Savonen.
Ran once
Re-runs sometimes
Re-runs in every situation and gets the same result every time
Re-runs reliably in most contexts
Perfectly reproducible
No analysis reaches here
Not repeatable
Every analysis starts here
Reproducibility != Correctness
Reproducibility ~ Consistency
But you could be consistently wrong in the same way…
Tips for reproducibility:
*Everything that doesn’t violate IRB privacy and ethical data handling guidelines
Next session!
A usable, well-documented analysis is more likely to be used and disseminated!
CC-BY by jhudatascience.org
Documentation that every project should have!
LLM | What is it really good at? | What does it struggle with? |
Bard | | |
ChatGPT | | |
Claude | | |
Phind | | |
Use AI tools that are trained for the task you are trying to do!
LLMs can be good at telling you what historical or confusing code is doing
Image created by Candace Savonen using Avataars.
def my_function(x):
    result = x
    for i in range(10):
        for j in range(5):
            result = result + 2 * (i + 1) * (j + 1) * (i % 2 == 0 and j % 2 == 0) - 1
    return result
Wait, what is this code even?
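As a rough illustration of the kind of explanation an AI assistant might give, here is a sketch that restates the confusing function in a more readable form and checks that both versions agree. The name `my_function_readable` is made up for this example, and the indentation of the original function is an assumption, since the slide text does not preserve it.

```python
def my_function(x):
    # Original function from the slide (indentation assumed).
    result = x
    for i in range(10):
        for j in range(5):
            result = result + 2 * (i + 1) * (j + 1) * (i % 2 == 0 and j % 2 == 0) - 1
    return result


def my_function_readable(x):
    # Hypothetical rewrite: every one of the 10 * 5 iterations subtracts 1,
    # and iterations where both i and j are even add 2 * (i + 1) * (j + 1).
    bonus = sum(
        2 * (i + 1) * (j + 1)
        for i in range(0, 10, 2)  # even i: 0, 2, 4, 6, 8
        for j in range(0, 5, 2)   # even j: 0, 2, 4
    )
    penalty = 10 * 5  # one subtraction per loop iteration
    return x + bonus - penalty


# Both versions agree, so the whole function just adds a constant to x.
for value in (0, 1, -7, 100):
    assert my_function(value) == my_function_readable(value)
```

Whether the explanation comes from a teammate or an AI tool, it is worth verifying like this before trusting it.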
plot-data-2020-9-11.tsv
plot-data-20-10-2020.tsv
plot-data-20-10-2020-clean.tsv
plot_final.R
plot_final_FINAL.R
plot_final_old.R
plot.py
functions.R
functions-old.R
plot-final.png
plot-new.png
AI Coding Assistants
Free options!
Getting a basic strategy for how to write code for something
Example prompts:
Reviewing existing code for improvements
Example prompts:
Annotating or improving documentation
Example prompts:
https://hutchdatascience.org/AI_for_software/annotating-your-code.html
Goals for an organizational scheme:
Image created by Candace Savonen.
Chaos reigns - nothing can be found
You lose sleep worrying about your file naming
Perfectly organized but maybe not maintainable
Disorganized and unmanageable
Maintainably organized
Project Organization tips (not one size fits all)
TIP | Example |
Use informative names | metadata_df, expressed_gene_list, 02_tumor_heatmap.py |
Number scripts in the order they are run | 01_download_data.sh |
Keep like files with like files | Keeping results and raw data each in their own folders |
Central document (like a README) | README.md |
Dates in file names aren’t necessary | run_analysis.sh |
A central script that re-runs the whole thing | run_analysis.sh |
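One way the "number your scripts" and "central script" tips can work together, sketched under assumptions: a central script can discover the numbered scripts and process them in order. The function name `ordered_scripts` and the file list below are hypothetical, and this sketch only works out the order; it does not run anything.

```python
import re


def ordered_scripts(filenames):
    """Return only the numbered scripts, sorted by their numeric prefix."""
    numbered = []
    for name in filenames:
        # A numbered script starts with digits followed by an underscore.
        match = re.match(r"^(\d+)_", name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


# Hypothetical project contents, using example names from the table above.
files = ["02_tumor_heatmap.py", "README.md", "01_download_data.sh", "run_analysis.sh"]
print(ordered_scripts(files))
```

Files without a numeric prefix (the README, the central script itself) are skipped, so the numbering doubles as documentation of the run order.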
final_final_version_100%_up_to_date
final_version_2021
final_version_edit_10_2021
final_version_edit5
final_version_edit4
for_realz_final_edit2
final_final_version_100%_up_to_date
final_version_2021
final_version_edit5
final_version_edit4
for_realz_final_edit2
Now was it “final_final_version_100%_up_to_date” or “final_version_edit5” that I was working from?
Course: Intro to Reproducibility and Advanced Reproducibility
VERSION CONTROL
NEEDED!
Image created by Candace Savonen using Avataars
Cancer data genomics atlas study
American cancer journal of America
“Code and data are available upon request by email.”
I’m requesting the code from your recent paper in the journal....
The corresponding author’s inbox
999,999,565473
Course: Intro to Reproducibility and Advanced Reproducibility
OPEN SOURCE
NEEDED!
Open source! Code is publicly available for reuse and repurpose
Oh no! My computer broke! – Good thing GitHub has all my work.
Back Ups! For the worst case scenario
Why did we write the code this way? I don’t remember… Good thing through git tracking I can look into this file’s history and remind myself how it became this.
Keep better records! Cut down on re-finding the same thing
main
test analysis
main with test analysis added
time/work
Keep the “perfected” version safe! Free to experiment with easy restore capabilities
main
Avi’s changes added
Avi’s and Ruby’s changes added
Easier to work simultaneously with teammates!
No emailing files back and forth
Code is best done as a team activity!
If this isn’t an option – at least a fake person: aka AI
Why GitHub?
Activity 1: Git-ting familiar with branches
Go to the website for the ITN Workshops for this conference: https://bit.ly/ITCR_2024
Create the repository for this activity
Let’s use this sandbox repository to practice GitHub things: https://github.com/fhdsl/reproducibility-sandbox
https://bit.ly/ITCR_2024
Getting familiar with the repository
Create your own copy of this repository
What’s a pull request?
A way to propose changes so that they can be discussed before they are incorporated
Real Life Example:
Direct changes:
Less verification
Knowledge silos more likely to happen
Mistakes may infiltrate the public version
Records of the decisions made may be more diffuse or nonexistent
Pull request model:
More eyes to check the changes
More people know what’s happening (knowledge transfer)
Better chance of catching mistakes sooner
Better records of communication about the analysis/software
Open up GitHub Desktop
Log in to GitHub when prompted
Cloning our repository!
Image by Candace Savonen
remote repository = project that is stored on the internet
e.g. https://github.com/your-username/repository-name
local repository = project copy on your computer
e.g. ~/some-filepath/repository-name
repository-name
repository-name
Image by Candace Savonen
https://github.com/your-username/repository-name
~/yourfilepath/repository-name
clone = copy a remote repository to your local computer
repository-name
repository-name
Image by Candace Savonen
reproducible-R-example
the main branch = the main copy of your project
The main branch is curated, working, and always ready for others to use!
Window Juggling!
GitHub Desktop
Text editor
Online GitHub
Files
This part might differ for you!
OR it might look like this…
GitHub Desktop
Documentation about setting up a different default code editor:
GitHub Desktop
Image by Candace Savonen
Create a new branch! This will be your working copy
GitHub Desktop
Image by Candace Savonen
repository-name
main
readme-edit
repository-name
We can do what we like with readme-edit knowing that main will remain safe
Create an AI generated README
AI can be good at summarizing things!
Click on the Show in Finder button
Open up 01-heatmap.Rmd and copy all the text in the file
Go to Phind and ask it to write a README from the pasted text from 01-heatmap.Rmd
Get the results from Phind: scroll to the bottom and click the squares to copy
Go to the text editor and open up README.md
Paste the results from Phind into README.md
Save the README.md file after editing
Adding changes to a branch
GitHub Desktop
Go back to GitHub Desktop…
Image by Candace Savonen
The changes you make to any files in this repository should show up here
Type in a commit message
Click commit!
Check box
git add README.md
git commit -m "add stuff to README"
git push --set-upstream origin a-new-branch
GitHub Desktop
Image by Candace Savonen
push = add changes that are on a-new-branch to the remote repository on GitHub
repository-name
repository-name
repository-name
your-username/repository-name
Making the pull request
Image by Candace Savonen
repository-name
main
a-new-branch
repository-name
commits to a-new-branch
The version of the code that has a nifty improvement
A pull request will show the difference between main and a-new-branch so you can scrutinize this feature before adding it to main
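To get a feel for what "the difference between main and a-new-branch" looks like, here is a small sketch using Python's standard `difflib` module. The README contents are invented for illustration, and GitHub computes its diffs with git rather than with this module; the sketch only mimics the plus/minus view you see in a PR.

```python
import difflib

# Hypothetical README contents on each branch (made up for this example).
main_readme = [
    "# My analysis\n",
    "Run the script.\n",
]
branch_readme = [
    "# My analysis\n",
    "Run `run_analysis.sh` to reproduce the heatmap.\n",
]

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(
    difflib.unified_diff(
        main_readme,
        branch_readme,
        fromfile="main/README.md",
        tofile="a-new-branch/README.md",
    )
)
print("".join(diff))
```

Lines starting with `-` exist only on main and lines starting with `+` exist only on the new branch, which is the same convention the Files changed view uses.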
Add some information about the changes in the PR and the reasoning for it
Check that the branches make sense; the PR should be set to merge into main automatically
Awesome, you just made a pull request (PR)!
Repeat steps 5-7 until you’ve addressed the update you had in mind
Only needs to be done once per repository/project!
Summary
2. Open file
3. Edit file
4. Add changes
5. Commit changes
6. Publish changes
7. Open PR
The check boxes
In your editor
In your editor
Command line summary
git clone https://github.com/username/repo-name
git checkout -b a-new-branch
Use code editor
git add README.md
git commit -m "add stuff to README"
git push --set-upstream origin a-new-branch
Go to GitHub
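The command line summary above can also be sketched as a small Python helper that assembles the same commands for a given repository and branch. The function name `git_workflow_commands` is hypothetical, and the sketch only builds and prints the commands; it does not run git.

```python
import shlex


def git_workflow_commands(repo_url, branch, files, message):
    """Build the git commands from the summary as argument lists (nothing is run)."""
    return [
        ["git", "clone", repo_url],
        ["git", "checkout", "-b", branch],
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", branch],
    ]


# Placeholder values taken from the summary above.
commands = git_workflow_commands(
    "https://github.com/username/repo-name",
    "a-new-branch",
    ["README.md"],
    "add stuff to README",
)
for command in commands:
    # shlex.join quotes arguments that contain spaces, such as the commit message.
    print(shlex.join(command))
```

If you wanted to actually run these, each list could be passed to `subprocess.run(command, check=True)`, with every command after the clone run from inside the cloned folder. Editing the file and opening the PR on GitHub still happen outside of git.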
We don’t work alone!
Sometimes the collaborator with questions is “Future You”.
Readability >>>>> Cleverness
Image created by Candace Savonen using Avataars.
Ruby’s code
ERROR
Ruby’s code
Now Ruby
Future Ruby
DRY = Don’t Repeat Yourself
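A minimal sketch of DRY in practice, with made-up data: instead of copying the same rescaling arithmetic for every variable, the logic lives in one function that is reused. The function name `scale_to_unit_range` and the values are assumptions for illustration only.

```python
def scale_to_unit_range(values):
    """Rescale numbers so the smallest becomes 0 and the largest becomes 1.

    Assumes the values are not all equal (otherwise the range is zero).
    """
    lowest, highest = min(values), max(values)
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical measurements for two variables.
variable_a = [2, 4, 6]
variable_b = [10, 20, 40]

# The same function is reused instead of repeating the arithmetic twice.
scaled_a = scale_to_unit_range(variable_a)
scaled_b = scale_to_unit_range(variable_b)
print(scaled_a)
print(scaled_b)
```

If the rescaling ever needs to change, it changes in one place, and a reviewer only has to read it once.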
AI tools make it so even if you are a “lone programmer” you can still have a code review!
Variable A
Variable B
R = 0.893
This code runs well on my computer, let me email it to you!
So exciting!
Main goals of an original code author
Ruby wants to merge 5 commits into main from a-new-branch
Reviewers
Avi
Code changes #10
I updated the project and added new files.
Owner
...
Ruby commented 10 minutes ago
1870
0
10
2
Ummm… I’m a bit confused. Can you explain the context of these changes? This is a lot for me to try to follow. I’m also unsure what kind of feedback you are looking for.
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Ruby wants to merge 5 commits into main from a-new-branch
Reviewers
Avi
Update documentation for heatmap-script.R
Background
In this previous PR we updated the heatmap-script.R file. But now the documentation in the README is out of date. This PR aims to update the README accordingly.
Approach
I updated the README with information on the new arguments we added. This also required me to update the Usage section and recommendations there.
Feedback needed
Can you look at the Usage section and try running the command and steps described there? I am concerned that this section is not clear enough but I am not sure how to add clarity. Please let me know if you have suggestions on this point.
Owner
...
Ruby commented 10 minutes ago
1
0
2
2
Ruby this is great! I was able to dig into this and give you feedback at the places you asked. Let me know what you think of my ideas and comments!
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Image from https://phauer.com/2018/code-review-guidelines/
Main goals of a reviewer
Communication and empathy are important parts of effective code review!
Image from https://quickbirdstudios.com/blog/code-review-best-practices-guidelines/
Remember the author of the pull request has been putting time and effort into this!
This code needs work.
Don’t use the formattR package it’s inefficient and takes forever to run.
You didn’t style the last chunk of code.
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
Ruby, thanks for all this work! This is a great start! I have a few questions so we can further polish this code.
Owner
...
Avi commented 8 minutes ago
Ruby requested a review from Avi 10 minutes ago
CI/CD: Continuous Integration/Continuous Deployment, a software concept that is useful for reproducible data analyses!
Course: GitHub Automation for Scientists
Every time a change is proposed we check it before incorporating it
Change made → Run a series of tests
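As a sketch of what one such automated test could look like for a data analysis: after every proposed change, re-run the analysis and compare the new result to a stored reference value. The function name `result_is_consistent` and the tolerance are assumptions for illustration; the R values are simply the example numbers used in this workshop's figures.

```python
def result_is_consistent(new_value, reference_value, tolerance):
    """Return True when a re-run result is within `tolerance` of the reference."""
    return abs(new_value - reference_value) <= tolerance


reference_r = 0.902  # result before the change
rerun_r = 0.905      # result after re-running with the proposed change

# Hypothetical tolerance: how much drift we accept without investigating.
if result_is_consistent(rerun_r, reference_r, tolerance=0.001):
    print("Result unchanged: safe to incorporate the change")
else:
    print("Result drifted: investigate before merging")
```

A check like this tests consistency, not correctness, so it complements code review rather than replacing it.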
I will re-run this analysis whenever changes are made to it.
Think smarter not harder: automation in a nutshell
Continuous integration / continuous deployment
You’re a construction manager
Should you:
OR
Problems accumulate without using CI/CD
Time/Effort
Re-run
Re-run
A bug being introduced
3 bugs to track down!
Catching changes/problems early with CI/CD
Time/Effort
Re-run
Re-run
Re-run
Re-run
1 bug
1 bug
1 bug
Catching changes/problems early with CI/CD
R = 0.902
R = 0.902
R = 0.905
R = 0.902
Time/Effort
Re-run
Re-run
Re-run
Re-run
This is only the beginning!
Activity 2: Navigating PR components
Activity 2a: Leaving inline comments
Return to your pull request on GitHub
Click the Files changed tab
Click any plus sign to leave a comment
Click the "Add a suggestion" button and add a change, then click "Add single comment"
Commit the new suggestion!
Go back to the Conversation tab
Scroll to the bottom.
You should see a new commit!
Activity 2b: Updating local branches
Click the Fetch origin button
Follow the prompt to pull origin
What happened? We can check the history.
We just pulled the README change we made on GitHub, so it is now in our local version!
If we check the file on our computer now, it will show the change!
Activity 2c: Exploring GitHub Actions
Scroll to the bottom of your PR
This is a GitHub Action Log!
Activity 2d: Merge your PR
Go to the Conversation tab.
Scroll down and click the Merge button
You have now merged the new changes into the main branch!
Summary: what did we do overall?
How confident do you feel about using GitHub and reproducibility skill sets *now*?
How likely is this workshop to have a positive impact on your work?
How likely would you be to recommend this workshop?
What did you like most about the workshop?
Please share any recommendations you have for improvements.
Demographics Survey
If you are a presenter, go get a ribbon