Managing Workflows with Snakemake

Morgridge brown bag 2020-04-30

Outline

  • Motivation

  • Snakemake: quick description

  • Demo!

Motivation

Two questions a computational scientist should think about:

  • “Am I automating myself enough?”
    • “Automated” == “reproducible”
    • Continual self-automation ⇒ exponential productivity

  • “Am I harnessing all of the compute power available to me?”
    • Thanks, HTCondor!

An example computational workflow:

(it’s a Directed Acyclic Graph)

[Figure: workflow DAG. Data and Other data feed into Preprocessed data; Preprocessed data fans out to Analysis outputs for parameter settings 1 through N; these feed into Summaries and Plots.]

Snakemake: quick description

  • What is it?
    • A system for automating DAGs of compute jobs
    • Helps users parallelize execution across available CPUs

  • Contrasted with DAGMan:
    • Snakemake is a Python package; install it via pip or conda
    • People can run Snakemake with or without access to a cluster ⇒ reproducibility!

  • What does it look like?
    • GNU Make + Python
    • You write a Snakefile that sits in a directory.
      • The Snakefile contains rules defining a DAG of jobs.
      • Snakefile rules may contain Python syntax, for increased expressiveness.
    • When you run Snakemake in a directory containing “Snakefile”, it will build and run the DAG of jobs.
    • You can specify the number of cores to use, or you can tell it to submit jobs to a cluster.
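As a sketch, a minimal Snakefile for a preprocess-then-plot workflow might look like the following. The rule names, file paths, and helper scripts here are illustrative, not taken from the demo:

```snakemake
# Hypothetical two-step workflow: preprocess raw data, then plot it.
# The "all" rule names the final target; Snakemake works backwards
# from it to build the DAG of jobs.
rule all:
    input:
        "results/plot.png"

rule preprocess:
    input:
        "data/raw.csv"
    output:
        "results/clean.csv"
    shell:
        "python scripts/preprocess.py {input} {output}"

rule plot:
    input:
        "results/clean.csv"
    output:
        "results/plot.png"
    shell:
        "python scripts/plot.py {input} {output}"
```

Running `snakemake --cores 4` in the directory containing this Snakefile builds the DAG from these rules and runs up to four independent jobs in parallel.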

Demo: Nested Cross-Validation

  • Given a prediction task, nested CV estimates the performance of a *family* of predictors.
  • Two “levels” of cross-validation:
    • Inner level: for tuning hyperparameters (picking ‘best’ in family)
    • Outer level: for estimating performance
  • Computationally expensive!
    • Scikit-learn has only a limited ability to parallelize nested CV; we can do much better.
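A minimal sketch of nested CV in scikit-learn, to make the two levels concrete. The dataset, model, and hyperparameter grid below are illustrative choices, not the ones used in the demo:

```python
# Nested cross-validation sketch: an inner GridSearchCV tunes
# hyperparameters, and an outer cross_val_score estimates the
# performance of the whole tuning procedure.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner level: pick the 'best' member of the family (here, C for an SVM).
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer level: estimate performance; each outer fold refits the inner search.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

Every fit in the outer and inner loops is an independent job, which is exactly the kind of fan-out the workflow DAG can hand to Snakemake.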

OK, time for the actual demo.