1 of 21

Software preservation is necessary for reproducibility

Vicky Rampin

@VickyRampin | vicky.rampin@nyu.edu

New York University

2 of 21

What is reproducibility?

  • Reproducibility - independent people use the same research materials and conditions to verify a claim
    • Need access to the full environment and materials in order to be able to test reproducibility of a claim
  • Sustaining research software in service of reproducibility requires in-depth curation of data + software together
    • It’s about reproducing computational workflows in the long term, not just giving access to software in sandbox environments (though it does need access to those sandbox environments)

3 of 21

Reproducibility on a spectrum

  • Reviewable: sufficient detail for peer review (default)
  • Replicable: original results can be reached with original materials
  • Confirmable: main conclusion can be reached without original materials
  • Auditable: process & tools archived and re-usable
  • Open/Reproducible: auditable research made openly available

4 of 21

COMPUTATIONAL

ENVIRONMENT

DOCUMENTATION

CODE & DATA

ARTICLE

REVIEWABLE RESEARCH

REPLICABLE RESEARCH

AUDITABLE RESEARCH

CONFIRMABLE RESEARCH

5 of 21

COMPUTATIONAL

ENVIRONMENT

DOCUMENTATION

CODE & DATA

ARTICLE

REVIEWABLE RESEARCH

REPLICABLE RESEARCH

AUDITABLE RESEARCH

CONFIRMABLE RESEARCH

Entry points for sustainability efforts

6 of 21

Exact computational environments matter!

  • Story from Ars Technica about errors in Python scripts commonly used for analysis in chemistry
  • It’s the problem of DEPENDENCY HELL
    • the problem with the scripts was a specific Python module they used, glob, which returns files in a different order depending on the OS
  • ~160 papers affected

“The scripts [...] were found to return correct results on macOS Mavericks and Windows 10. But on macOS Mojave and Ubuntu, the results were off by nearly a full percent.”
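A minimal sketch of the usual defence against this class of bug, assuming the analysis only needs a stable file order: sort the result of glob explicitly instead of relying on whatever order the operating system returns. The function name `list_outputs` and the `*.out` pattern are illustrative and do not come from the affected scripts.

```python
import glob
import os
import tempfile


def list_outputs(directory, pattern="*.out"):
    """Return matching file paths in a deterministic (sorted) order."""
    # glob.glob makes no promise about ordering, so sort explicitly
    # to get the same order on every operating system.
    return sorted(glob.glob(os.path.join(directory, pattern)))


# Small demonstration with throwaway files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.out", "a.out", "c.out"):
        open(os.path.join(tmp, name), "w").close()
    print([os.path.basename(p) for p in list_outputs(tmp)])
```

Sorting fixes the ordering assumption only; it does not capture the rest of the environment, which is the gap the following slides address.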

7 of 21

Another example of dependency hell...

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements

  • Significant differences in the results of a neuroimaging analysis depending on the version of the software, the hardware, and the operating system
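As a rough illustration of why these details are worth recording, the sketch below collects a few basic facts about the machine and interpreter so they can be stored next to a result. It uses only the Python standard library; the function name `environment_record` and the chosen fields are assumptions for illustration, and this is far less than what a tracing tool captures.

```python
import json
import platform
import sys


def environment_record():
    """Collect basic facts about the machine and interpreter running an analysis."""
    return {
        "os": platform.system(),            # e.g. the operating system family
        "os_release": platform.release(),   # OS or kernel release string
        "machine": platform.machine(),      # hardware architecture
        "python_version": platform.python_version(),
        "executable": sys.executable,       # path of the running interpreter
    }


# Print the record as JSON so it can be saved alongside the analysis output.
print(json.dumps(environment_record(), indent=2))
```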

8 of 21

ReproZip - the reproducibility packer!

  • Open source tool originally developed in 2013 at NYU
  • Captures and packs all dependencies automatically in a small bundle (.RPZ)
    • Captures provenance and order-of-execution
  • Then, automatically rerun research in its original environment on a different machine!

9 of 21

How ReproZip helps with sustainability

Well-bundled:

  • Captures *everything* a research process needs to rerun with lots of *extremely* detailed metadata!

Generalizable:

  • The RPZ format is simple but effective and very generalizable: it can interoperate with, be read by, and be run with lots of software

Future-proofed:

  • Can always add/remove unpackers to give users in the future full access to the bundle. As long as there are VMs, containers, EaaSI, or Linux, we can re-execute the bundle contents!!

10 of 21

Some reasons to use ReproZip!

  • Automatically creates a long-lasting bundle of research! Bundles created as far back as 2013 can still be reproduced.
  • You can not only reproduce work, but also extend the work in an RPZ file by adding new inputs and editing configurations
  • Multiple interfaces to work with ReproZip - command-line interface, GUI for unpacking (and soon for packing too!), Jupyter notebook extension
  • No lock-in! You can access materials in RPZ files without using ReproZip.
  • It’s open source, so you can use it in your applications - many have!!
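To make the "no lock-in" point concrete, here is a minimal sketch that lists the contents of a bundle using only Python's standard `tarfile` module. It assumes an RPZ file is an ordinary tar archive; the member names in the demonstration (`METADATA/config.yml`, `DATA.tar.gz`) are assumptions about the bundle layout and should be checked against the ReproZip documentation. The function name `list_bundle_members` and the file `demo.rpz` are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def list_bundle_members(path):
    """Return the names of the entries in a tar archive, such as an .rpz bundle."""
    # "r:*" lets tarfile detect the compression (if any) on its own.
    with tarfile.open(path, "r:*") as bundle:
        return bundle.getnames()


# Demonstration on a stand-in archive, since no real bundle ships with this example.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.rpz")
    with tarfile.open(demo, "w") as tar:
        for name in ("METADATA/config.yml", "DATA.tar.gz"):  # assumed layout
            payload = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    print(list_bundle_members(demo))
```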

11 of 21

ReproZip Ecosystem

12 of 21

High-level overview of ReproZip in workflow

  1. Work normally: once you’re done, use ReproZip!
  2. ReproZip Trace: use ReproZip trace to find all dependencies
  3. ReproZip Pack: make RPZ bundle with everything
  4. Publish RPZ: in an institutional, disciplinary, or general repository!
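The trace and pack steps of this workflow could be scripted roughly as follows. The sketch only builds and prints the two command lines (`reprozip trace <command>` and `reprozip pack <bundle>`), so it runs without ReproZip installed. The experiment command `python analysis.py` and the bundle name `analysis.rpz` are hypothetical placeholders.

```python
import shlex


def packing_commands(experiment_cmd, bundle_name):
    """Build the ReproZip command lines for tracing an experiment and packing it."""
    # Step 2: run the experiment under the tracer to record its dependencies.
    trace = ["reprozip", "trace"] + list(experiment_cmd)
    # Step 3: pack everything the trace found into a single .rpz bundle.
    pack = ["reprozip", "pack", bundle_name]
    return [trace, pack]


# Print the commands; "analysis.py" and "analysis.rpz" are illustrative names.
for cmd in packing_commands(["python", "analysis.py"], "analysis.rpz"):
    print(shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)` on a Linux machine where ReproZip is installed.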

13 of 21

What can ReproZip pack?

  • Data analysis scripts / software (any language)
  • Graphical software
  • Interactive tools
  • Client-server applications (including databases)
  • Jupyter notebooks
  • MPI experiments

… and more! If you can run it, ReproZip can probably pack it

14 of 21

Packing

Research Process (e.g. a website with DB)

Computational Environment E (Linux)

reprozip

Executing

Tracing

Creating Configuration

Configuration File

Reproducible Bundle

(.rpz file)

Configuring

Packing

What ReproZip tracks & keeps:

  • Data: input files, output files, parameters
  • Workflow: executable programs and steps
  • Environment: environment variables, software used, dependencies, …

Original

Author(s)

15 of 21

Unpacking

reprounzip

Unpacking

Computational Environment E’ (potentially different than E)

directory

Linux

chroot

vagrant

Linux, macOS, Windows

docker

Provenance Graph

VisTrails

Linux

Linux, macOS, Windows

Reproducible

Bundle

(.rpz file)

Singularity (upcoming)

ReproServer

Secondary User(s)
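A secondary user's side of the diagram can be sketched in the same way. The code below builds the `reprounzip <unpacker> setup` and `reprounzip <unpacker> run` command lines for one of the unpackers named on this slide, without executing them. The bundle name `analysis.rpz` and the working directory `analysis_dir` are hypothetical, and the vagrant and docker unpackers are assumed to need their own plugin packages installed.

```python
import shlex

# Unpackers named on the slide.
UNPACKERS = ("directory", "chroot", "vagrant", "docker")


def unpacking_commands(unpacker, bundle, workdir):
    """Build the reprounzip command lines to set up and rerun a bundle."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker}")
    # First unpack the bundle into a working directory for the chosen unpacker...
    setup = ["reprounzip", unpacker, "setup", bundle, workdir]
    # ...then rerun the packed experiment from that directory.
    run = ["reprounzip", unpacker, "run", workdir]
    return [setup, run]


# Print the commands; the file and directory names are illustrative.
for cmd in unpacking_commands("docker", "analysis.rpz", "analysis_dir"):
    print(shlex.join(cmd))
```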

16 of 21

Some current uses of ReproZip & ReproServer

Facilitating peer review

Sharing reproducible research

Backend reproducibility

Computational science tools

NeuroDocker to minify docker containers

Spot to reconstruct provenance graphs

Metadata capture & query

Digital Preservation

17 of 21

Example: packing digital humanities plots

18 of 21

Example: unpacking digital humanities plots

19 of 21

BONUS: unpacking digital humanities plots in-browser

20 of 21

Overall summary

  • All research is founded on different computational environments
    • We need to understand those environments for reproducibility
    • We need those environments preserved for reproducibility
  • We can help software preservation by propagating principles for better stewardship of research materials BEFORE they get to the archive
  • ReproZip is a tool for creating preservation-ready, reproducible bundles of research that include the software along with in-depth technical metadata about it

21 of 21

Thank you!

Happy to take questions!

Try out ReproZip - reprozip.org

Contact -

vicky.rampin@nyu.edu

VickyRampin everywhere
