1 of 92

Intro to Reproducibility!

Candace Savonen

2 of 92

Image created by Candace Savonen using Avataars.

Variable A

Variable B

R = 0.893

This code runs well on my computer, let me email it to you!

So exciting!

3 of 92

Image created by Candace Savonen using Avataars.

ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR

ERROR ERROR

Ruby’s code and data

4 of 92

Image created by Candace Savonen using Avataars.

Error: file path “Ruby’s computer/Ruby’s file/final_version10.R” not found

Re:Re:Re: Data

Hi Ruby, I don’t understand what this code is supposed to be doing...

Re:Re:Re: Data

Hi Avi, It works for me?

5 of 92

Image created by Candace Savonen using Avataars.

Variable A

Variable B

Variable A

Variable B

R² = 0.893

R² = 0.891

Ruby’s code and data

6 of 92

Image created by Candace Savonen using Avataars.

Data

Code

Variable A

Variable B

Results

Ruby the Researcher

Repeatable: keeping everything the same but repeating the analysis - do we get the same results?
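
One way to check repeatability in practice is to run the same analysis twice on the same data and compare the results. The sketch below is a hypothetical stand-in, not Ruby's actual analysis: the simulated data, the seed, and the function names `make_data` and `pearson_r` are all assumptions made for illustration.

```python
import math
import random


def make_data(seed, n=50):
    """Simulate two positively related variables (hypothetical stand-in data)."""
    rng = random.Random(seed)  # a fixed seed makes the "random" data repeatable
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [x + rng.gauss(0, 0.5) for x in a]
    return a, b


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Same data, same code, run twice: a repeatable analysis gives identical results.
first_run = pearson_r(*make_data(seed=2024))
second_run = pearson_r(*make_data(seed=2024))
print(first_run == second_run)
```

If the seed were left unset, the two runs would most likely disagree, which is one of the simplest ways an analysis fails to be repeatable.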

7 of 92

Image created by Candace Savonen using Avataars.

Data

Code

Variable A

Variable B

Results

Ruby the Researcher

Reproducible: using the same data and analysis but in the hands of another researcher - do we get the same results?

Avi the Associate

Data

Code

8 of 92

Image created by Candace Savonen using Avataars.

New Data

Same Code

Ruby the Researcher

Replicable: with new data do we obtain the same inferences?

Avi the Associate

Variable A and B are positively correlated

Code

9 of 92

Based on a figure from Essawy et al., 2020 https://doi.org/10.1016/j.envsoft.2020.104753

Effort

Time

Replicability

new researcher, new data

Reproducibility

new researcher, same data

Repeatability

same researcher, same data

10 of 92

Reproducibility saves everyone time and effort!

11 of 92

Image created by Candace Savonen using Avataars.

Ruby’s code

ERROR

Ruby’s code

Now Ruby

Future Ruby

12 of 92

Image created by Candace Savonen using Avataars.

Ruby’s code - not as reproducible

ERROR

ERROR

ERROR

ERROR

ERROR

ERROR

ERROR

13 of 92

Image created by Candace Savonen using Avataars.

Ruby’s code - made reproducibly

ERROR

14 of 92

Reproducibility is a tortoise’s game - it’s an incremental and slow process but it has high payoffs!

15 of 92

Science is a community effort – we each are contributing a domino!

16 of 92

Reproducibility is iterative work!

Image created by Candace Savonen.

Ran once

Re-runs sometimes

Re-runs in every situation and gets the same result every time

Re-runs reliably in most contexts

Perfectly reproducible

No analysis reaches here

Not repeatable

Every analysis starts here

17 of 92

A tale of two differential expression analyses…

18 of 92

Example 1:

19 of 92

Example 2:

20 of 92

Reproducibility != Correctness

Reproducibility ~ Consistency

But you could be consistently wrong in the same way….
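
A toy illustration of this point, with hypothetical function names: the buggy mean below returns the same answer on every run, so it is perfectly repeatable, yet it is wrong every time.

```python
def buggy_mean(values):
    """Intended to compute the mean, but divides by the wrong number (a deliberate bug)."""
    return sum(values) / (len(values) - 1)


def correct_mean(values):
    """The arithmetic mean."""
    return sum(values) / len(values)


data = [2, 4, 6, 8]

# Re-running gives the same result each time: consistent...
runs = [buggy_mean(data) for _ in range(3)]
print(runs)

# ...but consistently wrong compared with the correct calculation.
print(correct_mean(data))
```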

21 of 92

Tips for reproducibility:

  1. Write it down! (Documentation!)
  2. Organize and re-organize
  3. Managing computing environments (Containers)
  4. Have everything* publicly available (GitHub)
  5. Have others look it over (code review)
  6. Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

Now!

22 of 92

Tips for reproducibility:

  • Write it down! (Documentation!)
  • Organize and re-organize
  • Managing computing environments (Containers)
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

Tomorrow!

23 of 92

Tips for reproducibility:

  • Write it down! (Documentation!)
  • Organize and re-organize
  • Managing computing environments (Containers)
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

Will cover more about this Day 3!

24 of 92

Metadata - the often overlooked but mightily crucial data

25 of 92

What are metadata?

Anything and everything that should be known about your samples!

A B C D

E F G H

sample_id  mouse_id  processing_date  treatment
A          1         3-10-21          None
B          1         4-12-21          None
C          2         3-10-21          None
D          2         4-12-21          None
E          3         3-10-21          Morphine
F          3         4-12-21          Morphine
G          4         3-10-21          Morphine
H          4         4-12-21          Morphine

I know everything I need to know about these samples from their metadata!

26 of 92

Examples of metadata categories:

  • Patient/organism of origin
  • Patient/organism information
    • Demographics
    • Disease state
    • Treatment state
    • Time point (if applicable)
  • Processing information
    • Batch information
    • Processing details (E.g. Isolation methods: Poly-A vs Ribo-minus)
  • Anything that should be known about the samples and their handling!

27 of 92

Rules for creating metadata (from Broman & Woo, 2017)

Be Consistent

Choose good names for things

Write Dates as YYYY-MM-DD

No Empty Cells

Put Just One Thing in a Cell

Make it a Rectangle

28 of 92

Rules for creating metadata continued (from Broman & Woo, 2017)

Create a Data Dictionary

No Calculations in the Raw Data Files

Do Not Use Font Color or Highlighting as Data

Make Backups

Use Data Validation to Avoid Errors
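
A few of these rules can be checked automatically. The following is a minimal sketch, not a method from Broman & Woo: the function name `find_metadata_problems` is hypothetical, the column name `processing_date` is borrowed from the example table earlier, and the two rows are made up to trigger the checks.

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # shape of YYYY-MM-DD


def find_metadata_problems(rows, date_column="processing_date"):
    """Return a list of human-readable problems found in a list of row dicts."""
    problems = []
    if not rows:
        return problems
    expected_columns = set(rows[0])
    for number, row in enumerate(rows, start=1):
        # "Make it a Rectangle": every row has the same columns.
        if set(row) != expected_columns:
            problems.append(f"row {number}: columns differ from row 1")
        # "No Empty Cells"
        for column, value in row.items():
            if value is None or str(value).strip() == "":
                problems.append(f"row {number}: empty cell in {column}")
        # "Write Dates as YYYY-MM-DD" (checks the shape only, not calendar validity)
        date = row.get(date_column)
        if date and not DATE_PATTERN.match(str(date)):
            problems.append(f"row {number}: {date_column} is not YYYY-MM-DD")
    return problems


# Made-up rows: the second one has an empty cell and a badly formatted date.
rows = [
    {"sample_id": "A", "mouse_id": "1", "processing_date": "2021-03-10", "treatment": "None"},
    {"sample_id": "B", "mouse_id": "1", "processing_date": "4-12-21", "treatment": ""},
]
for problem in find_metadata_problems(rows):
    print(problem)
```

A check like this could run every time the metadata file changes, so problems are caught when they are introduced rather than during analysis.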

29 of 92

If you have human data, the metadata is probably loaded with PII (personally identifiable information) and/or PHI (protected health information)

30 of 92

31 of 92

Tips for reproducibility:

  • Write it down! (Documentation)
  • Organize and re-organize
  • Managing computing environments (Containers)
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

32 of 92

A useable, well-documented analysis is more likely to be used and disseminated!

CC-BY by jhudatascience.org

33 of 92

34 of 92

Documentation that every project should have!

  1. READMEs
    1. Background knowledge
    2. Usage info
    3. Software requirements to run the thing
    4. Basics on how the files are organized

35 of 92

36 of 92

Documentation that every project should have!

  • READMEs
    • Background knowledge
    • Usage info
    • Software requirements to run the thing
    • Basics on how the files are organized
  • Code annotations:
    • Explain historical decisions
    • Explain “quirks” of the code
    • Say where more development is needed (TODO)
    • Summarize the goals!

37 of 92

LLM

What is it really good at?

What does it struggle with?

Bard

  • Most human-like interaction
  • Answers oddball questions
  • Willing to answer “I don’t know”
  • Gives least amount of detail in answers
  • Has been known to give incorrect answers

ChatGPT

  • Most popular, which means most tested
  • Good all-around LLM
  • Unlikely to change answer even when told previous answer was wrong
  • Invents citations
  • Known hallucination issues

Claude

  • Good all-around LLM
  • Offers specific advice when editing a writing sample for tone
  • Best understanding of clever word play
  • Can sometimes require prodding to give additional detail
  • Doesn’t easily save threads at this time (but this is changing!)

Phind

  • Great for technical programming questions
  • Provides links to sources unprompted
  • Offers many programming options at once
  • Tends to plagiarize sources directly when used for writing

38 of 92


Use AI tools that are trained for the task you are trying to do!

39 of 92

LLMs can be good at telling you what historical or confusing code is doing

Image created by Candace Savonen using Avataars.

def my_function(x):
    result = x
    for i in range(10):
        for j in range(5):
            result = result + 2 * (i + 1) * (j + 1) * (i % 2 == 0 and j % 2 == 0) - 1
    return result

Wait, what is this code even?
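
For reference, here is one way the function above could be worked through and annotated by hand. The name `add_400` and the comments are one reading of the code, not part of the original slide; the check at the end confirms that the two versions agree for a range of inputs.

```python
def my_function(x):
    # Original version from the slide, with indentation restored.
    result = x
    for i in range(10):
        for j in range(5):
            result = result + 2 * (i + 1) * (j + 1) * (i % 2 == 0 and j % 2 == 0) - 1
    return result


def add_400(x):
    """Annotated equivalent of my_function (hypothetical name).

    The loops run 10 * 5 = 50 times and subtract 1 each time (-50).
    When i and j are both even, 2 * (i + 1) * (j + 1) is also added:
    2 * (1 + 3 + 5 + 7 + 9) * (1 + 3 + 5) = 2 * 25 * 9 = 450.
    Net effect: x + 450 - 50 = x + 400.
    """
    return x + 400


# Check that the two versions agree.
print(all(my_function(x) == add_400(x) for x in range(-5, 6)))
```

Whether the explanation comes from a person or an LLM, a quick equivalence check like the last line is a cheap way to verify it before trusting it.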

plot-data-2020-9-11.tsv

plot-data-20-10-2020.tsv

plot-data-20-10-2020-clean.tsv

plot_final.R

plot_final_FINAL.R

plot_final_old.R

plot.py

functions.R

functions-old.R

plot-final.png

plot-new.png

40 of 92

AI Coding Assistants

Free options!

41 of 92

Getting a basic strategy for how to write code for something

Example prompts:

    • How might I go about doing ______ ?
    • How could I structure code that would do ______ ?
    • Is it possible to create a package that does ______?
    • What packages could I use to make code that does _______ ?

42 of 92

Reviewing existing code for improvements

Example prompts:

    • Can you tell me how I could make this code more readable?
    • Can you help fix the formatting, styling, and indent errors on this code?
    • Can you recommend how I could make this code more reproducible?

43 of 92

Annotating or improving documentation

Example prompts:

    • Can you annotate this code?
    • Can you explain to me what this code is doing?
    • Can you create a README for this code?

https://hutchdatascience.org/AI_for_software/annotating-your-code.html

44 of 92

Tips for reproducibility:

  • Write it down! (Documentation)
  • Organize and re-organize
  • Managing computing environments (Containers)
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

45 of 92

Goals for an organizational scheme:

46 of 92

Image created by Candace Savonen.

Chaos reigns - nothing can be found

You lose sleep worrying about your file naming

Perfectly organized but maybe not maintainable

Disorganized and unmanageable

Maintainably organized

47 of 92

Project Organization tips (not one size fits all)

TIP                                            Example
Use informative names                          metadata_df, expressed_gene_list, 02_tumor_heatmap.py
Number scripts in the order they are run       01_download_data.sh
Keep like files with like files                Keeping results and raw data each in their own folders
Central document (like a README)               README.md
Dates in file names aren’t necessary           run_analysis.sh
A central script that re-runs the whole thing  run_analysis.sh
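
Numbering scripts and having a central script work well together: if scripts are numbered in run order, a central script can find and run them in sequence. The sketch below only shows the ordering step. `scripts_in_run_order` is a hypothetical helper, and `03_make_report.py` is a made-up file name added for illustration.

```python
import re


def scripts_in_run_order(filenames):
    """Return only the numbered scripts, sorted by their numeric prefix."""
    numbered = []
    for name in filenames:
        match = re.match(r"^(\d+)_", name)
        if match:  # skip files without a numeric prefix, such as README.md
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


files = [
    "README.md",
    "02_tumor_heatmap.py",
    "run_analysis.sh",
    "01_download_data.sh",
    "03_make_report.py",
]
for script in scripts_in_run_order(files):
    print(script)
```

A central script such as run_analysis.sh could then call each of these in turn, so the whole analysis re-runs with one command.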

48 of 92

Don’t be afraid of folders!

49 of 92

Building Reproducibility Skill Sets!

Part 2

50 of 92

Tips for reproducibility:

  • Write it down!
  • Organize and re-organize
  • Managing computing environments
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

51 of 92

final_final_version_100%_up_to_date

final_version_2021

final_version_edit_10_2021

final_version_edit5

final_version_edit4

for_realz_final_edit2

final_final_version_100%_up_to_date

final_version_2021

final_version_edit5

final_version_edit4

for_realz_final_edit2

Now was it “final_final_version_100%_up_to_date” or “final_version_edit5” that I was working from?

52 of 92


VERSION CONTROL

NEEDED!

53 of 92

Image created by Candace Savonen using Avataars

Cancer data genomics atlas study

American cancer journal of America

“Code and data are available upon request by email.”

I’m requesting the code from your recent paper in the journal....

The corresponding author’s inbox

999,999,565473

54 of 92


OPEN SOURCE

NEEDED!

55 of 92

Open source! Code is publicly available for reuse and repurpose

56 of 92

Oh no! My computer broke! Good thing GitHub has all my work.

Back Ups! For the worst case scenario

57 of 92

Why did we write the code this way? I don’t remember… Good thing through git tracking I can look into this file’s history and remind myself how it became this.

Keep better records! Cut down on re-finding the same thing

58 of 92

main

test analysis

main with test analysis added

time/work

Keep the “perfected” version safe! Free to experiment with easy restore capabilities

59 of 92

main

Avi’s changes added

Avi’s and Ruby’s changes added

Easier to work simultaneously with teammates!

No emailing files back and forth

60 of 92

Code is best done as a team activity!

61 of 92

Code is best done as a team activity!

If a teammate isn’t an option, at least use a “fake person”, aka AI!

62 of 92

Why GitHub?

  • Open Source - Transparency and helping the community by sharing!

  • Version Controlled - Better records; better back ups; better teamwork!

  • Free

  • Popular (Bitbucket is fine too)

63 of 92

What’s a pull request?

A way to propose changes so that they can be discussed before they are incorporated

https://github.com/FredHutch/gimap/pull/33

64 of 92

Pull request model

Direct changes

Less verification

Knowledge silos more likely to happen

Mistakes may infiltrate the public version

Records of the decisions made may be more diffuse or nonexistent

More eyes to check the changes

More people know what’s happening

(Knowledge transfer)

Better chance of catching mistakes sooner

Better records of communication about the analysis/software

65 of 92

66 of 92

Intro to GitHub -

How to file a pull request!

67 of 92

Tips for reproducibility:

  • Write it down! (Documentation!)
  • Organize and re-organize
  • Managing computing environments (Containers)
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

68 of 92

We don’t work alone!

Sometimes the collaborator with questions is “Future You”.

Readability >>>>> Cleverness

69 of 92

Image created by Candace Savonen using Avataars.

Ruby’s code

ERROR

Ruby’s code

Now Ruby

Future Ruby

70 of 92

71 of 92

Even if you are a “lone programmer”, AI tools mean you can still get a code review!

72 of 92

Variable A

Variable B

R = 0.893

This code runs well on my computer, let me email it to you!

So exciting!

73 of 92

Main goals of an original code author

  • Set up your reviewer for success by erring toward overcommunicating

  • Interpret reviews positively!

  • Determine solutions collaboratively.

74 of 92

Ruby wants to merge 5 commits into main from a-new-branch

Reviewers

Avi

Code changes #10

I updated the project and added new files.

Owner

...

Ruby commented 10 minutes ago

  • 999100057849 - 1901

1870

0

10

2

Ummm… I’m a bit confused. Can you explain the context of these changes? This is a lot for me to try to follow. I’m also unsure what kind of feedback you are looking for.

Owner

...

Avi commented 8 minutes ago

Ruby requested a review from Avi 10 minutes ago

75 of 92

Ruby wants to merge 5 commits into main from a-new-branch

Reviewers

Avi

Update documentation for heatmap-script.R

Background

In this previous PR we updated the heatmap-script.R file. But now the documentation in the README is out of date. This PR aims to update the README accordingly.

Approach

I updated the README with information on the new arguments we added. This also required me to update the Usage section and recommendations there.

Feedback needed

Can you look at the Usage section and try running the command and steps described there? I am concerned that this section is not clear enough but I am not sure how to add clarity. Please let me know if you have suggestions on this point.

Owner

...

Ruby commented 10 minutes ago

  • 50 - 19

1

0

2

2

Ruby this is great! I was able to dig into this and give you feedback at the places you asked. Let me know what you think of my ideas and comments!

Owner

...

Avi commented 8 minutes ago

Ruby requested a review from Avi 10 minutes ago

76 of 92

Image from https://phauer.com/2018/code-review-guidelines/

77 of 92

Main goals of a reviewer

  • Identify opportunities for learning and improvement.

  • Communicate these with positivity and empathy.

  • Determine solutions collaboratively.

78 of 92

Empathy is an important part of effective code review!

79 of 92

Image from https://quickbirdstudios.com/blog/code-review-best-practices-guidelines/

Remember the author of the pull request has been putting time and effort into this!

80 of 92

This code needs work.

Don’t use the formattR package it’s inefficient and takes forever to run.

You didn’t style the last chunk of code.

Owner

...

Avi commented 8 minutes ago

Ruby requested a review from Avi 10 minutes ago

81 of 92

Ruby, thanks for all this work! This is a great start! I have a few questions so we can further polish this code.

  • Is your usage of the formattR package because of the weird formatting of the data.tsv file? Perhaps we can brainstorm another approach to this that would allow us to get rid of this package requirement.
  • I think that in your last chunk you may have forgotten to style the code according to the conventions for this repository. Perhaps we can discuss how to introduce something that helps all authors of this repository adhere to the conventions. This may be a case where we can use automation or a checklist to help.

Owner

...

Avi commented 8 minutes ago

Ruby requested a review from Avi 10 minutes ago

82 of 92

Tips for reproducibility:

  • Write it down! (Documentation!)
  • Organize and re-organize
  • Managing computing environments (Containers)
  • Have everything* publicly available (GitHub)
  • Have others look it over (code review)
  • Automate monotonous stuff (CI/CD)

*Everything that doesn’t violate IRB privacy and ethical data handling guidelines

83 of 92

CI/CD:

Continuous Integration/

Continuous Deployment

– A software concept that is useful for reproducible data analyses!

84 of 92

Every time a change is proposed we check it before incorporating it

Change made → Run a series of tests

CI/CD: Continuous Integration/ Continuous Deployment

85 of 92

I will re-run this analysis whenever changes are made to it.

86 of 92

Think smarter not harder: automation in a nutshell

87 of 92

Continuous integration / continuous deployment

You’re a construction manager

Should you:

  1. Check that your construction plans are good and meet safety and engineering standards as you build?

OR

  2. Build the entire building without consulting anyone, and only have these things checked after you are done, when fixing anything is basically a demolition job?

88 of 92

Problems accumulate without using CI/CD

Time/Effort

Re-run

Re-run

A bug being introduced

3 bugs to track down!

89 of 92

Catching changes/problems early with CI/CD

Time/Effort

Re-run

Re-run

Re-run

Re-run

1 bug

1 bug

1 bug

90 of 92

Catching changes/problems early with CI/CD

R = 0.902

R = 0.902

R = 0.905

R = 0.902

Time/Effort

Re-run

Re-run

Re-run

Re-run
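
The kind of check pictured on this slide can be automated. A minimal sketch, assuming the expected value R = 0.902 from the slide and a tolerance of 0.001 chosen only for illustration; `result_changed` is a hypothetical helper that a CI workflow could call after each re-run.

```python
def result_changed(r_value, expected=0.902, tolerance=0.001):
    """Return True if a re-run's R differs from the expected value by more than the tolerance."""
    return abs(r_value - expected) > tolerance


# R values from four successive re-runs, as on the slide.
reruns = [0.902, 0.902, 0.905, 0.902]
for run_number, r_value in enumerate(reruns, start=1):
    status = "CHANGED" if result_changed(r_value) else "ok"
    print(f"re-run {run_number}: R = {r_value} ({status})")
```

Because the check runs on every change, the third re-run is flagged right when the result shifts, rather than being discovered several changes later.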

91 of 92

This is only the beginning!

  • Intro to Reproducibility
    • DRY code
    • Data organization
    • GitHub intro
  • Advanced Reproducibility
    • Intro to Containers
    • Intro to GitHub Actions
    • GitHub intermediate
    • Code review
  • GitHub Automation for Scientists
    • Principles of data sharing
    • CI/CD principles in science and analyses
    • GitHub Actions + Docker
  • Containers for Scientists STILL UNDER DEVELOPMENT
