Managing Workflows with Snakemake

Morgridge brown bag 2020-04-30

Outline

  • Motivation

  • Snakemake: quick description

  • Demo!

Motivation

Two questions a computational scientist should think about:

  • “Am I automating myself enough?”
    • “Automated” == “reproducible”
    • Continual self-automation ⇒ exponential productivity

  • “Am I harnessing all of the compute power available to me?”
    • Thanks, HTCondor!

An example computational workflow:

(it’s a Directed Acyclic Graph)

[Figure: workflow DAG. Data and Other data feed into Preprocessed data; Preprocessed data fans out to Analysis outputs for parameter settings 1 through N; these feed into Summaries and Plots.]

Snakemake: quick description

  • What is it?
    • A system for automating DAGs of compute jobs
    • Helps users parallelize execution across available CPUs

  • Contrasted with DAGMan:
    • Snakemake is a Python package; install it via pip or conda
    • People can run Snakemake with or without access to a cluster ⇒ reproducibility!

  • What does it look like?
    • GNU Make + Python
    • You write a Snakefile that sits in a directory.
      • The Snakefile contains rules defining a DAG of jobs.
      • Snakefile rules may contain Python syntax, for increased expressiveness.
    • When you run Snakemake in a directory containing “Snakefile”, it will build and run the DAG of jobs.
    • You can specify the number of cores to use, or you can tell it to submit jobs to a cluster.
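As a sketch, a minimal Snakefile for a preprocess-then-plot workflow might look like the following. The rule names, file paths, and helper scripts here are illustrative, not taken from the demo:

```snakemake
# Hypothetical two-step workflow: preprocess raw data, then plot it.
# The "all" rule names the final target; Snakemake works backwards
# from it to build the DAG of jobs.
rule all:
    input:
        "results/plot.png"

rule preprocess:
    input:
        "data/raw.csv"
    output:
        "results/clean.csv"
    shell:
        "python scripts/preprocess.py {input} {output}"

rule plot:
    input:
        "results/clean.csv"
    output:
        "results/plot.png"
    shell:
        "python scripts/plot.py {input} {output}"
```

Running `snakemake --cores 4` in the directory containing this Snakefile builds the DAG from these rules and runs up to four independent jobs in parallel.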

Demo: Nested Cross-Validation

  • Given a prediction task, nested CV estimates the performance of a *family* of predictors.
  • Two “levels” of cross-validation:
    • Inner level: for tuning hyperparameters (picking ‘best’ in family)
    • Outer level: for estimating performance
  • Computationally expensive!
    • Scikit-learn has only a limited ability to parallelize nested CV; we can do much better.
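A minimal sketch of nested CV in scikit-learn, to make the two levels concrete. The dataset, model, and hyperparameter grid below are illustrative choices, not the ones used in the demo:

```python
# Nested cross-validation sketch: an inner GridSearchCV tunes
# hyperparameters, and an outer cross_val_score estimates the
# performance of the whole tuning procedure.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner level: pick the 'best' member of the family (here, C for an SVM).
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer level: estimate performance; each outer fold refits the inner search.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

Every fit in the outer and inner loops is an independent job, which is exactly the kind of fan-out the workflow DAG can hand to Snakemake.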

OK, time for the actual demo.