Modularization
The key to making your life easier when moving to the cloud is not to port whole pipelines as monolithic structures, but to break them up into bite-sized pieces.
Q: How does one eat a large meal?
A: One bite at a time.
[Slide diagram: the five components of any task (inputs, parameters, environment, execution, outputs), labeled by where each comes from: provided by the user, usually provided by the sysadmin, or produced by our code.]
Inputs: files
Parameters: values (ints, strings, etc) or constants (e.g. a reference genome)
Environment:
- What programs or modules are available?
- What environment variables are present?
Execution:
- How do we run our code? As a script? On the CLI? Using a workflow description language?
Outputs: the files and results our code produces
What are the components of a given task/workflow/pipeline/script?
inputs
parameters
environment
execution
outputs
Local Compute
Environment, execution, inputs, outputs and parameters all live in the same space
- Scaling: extremely limited
- Storage: limited
This setup looks like a mess for a reason.
HPC / Cluster Compute
Compute somewhat separated from inputs, parameters, outputs.
- Environment managed by modules or users.
- Execution (via bash scripts and a scheduler)
But others can’t rerun our experiments, and our environment is fragile (and expensive!)
[Slide diagram: the same five components split across the cluster. Inputs, parameters, and outputs live in storage alongside our scripts; the environment and execution live on the compute nodes; a scheduler sits between the two.]
Generic cloud separates this even further (but places it behind unfamiliar interfaces)
Scaling: mostly limitless
Execution: workflow language
Environment: containers or VM images
Inputs: usually sit in cloud storage
Parameters: sent to the cloud
Outputs: often in cloud storage
Note how much the cloud looks like HPC.
Pros:
- Easy to share our environment / inputs
- Better transparency of our environment
Cons:
- High development cost
- Major learning curve
- Can get locked into vendors
Serverless takes this one step further:
- We remove any thoughts about the environment.
- We shrink our inputs so that our processes can be made ephemeral.
- We package our parameters and inputs with our code.
- We run on an essentially unlimited number of tiny, identical compute environments.
- We reduce our outputs back into a single piece if needed.
Pros:
- Scaling: unlimited and dynamic
- Totally transferable and reproducible when done right
- Cheapest compute
Cons:
- Major development cost
- Mind-bending learning curve
- Requires (or assumes) independence across inputs
- Most importantly, it runs counter to how many bioinformatics workflows have traditionally worked
Let’s start stretching our brains
Let’s start with an example to get us into the cloud, then we’ll work on getting closer to our preferred serverless architecture.
First, a poll:
What interface do most of y'all use for analyses? CLI? GUI? R console? Python REPL?
DNA alignment (BWA mem) example
Inputs: FASTQ files
Parameters: FASTA, indexes, command line args
Environment: BWA executable, BWA version
Execution: CLI invocation (in a workflow or Bash)
Outputs: BAM files
Let’s live-code the components of moving BWA to a cloud architecture, then discuss the weaknesses of this approach and ways we could improve it.
Step 1: set up the environment
We aren’t going to re-write all of BWA into our own code, so we will wrap it in a container or VM.
Step 2: Modularize our workflow as much as we can
From FASTQ + FASTA to BAM, we have two steps (tasks):
- Index: build the BWA index from the reference FASTA.
- Align: run BWA mem to align the FASTQ reads against that index.
Together, these tasks compose a single workflow. Can you think of how else we could divide this work?
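Looking ahead to the workflow language we adopt in Step 3, one way the composition might be wired is the WDL sketch below. The names (BwaAlignment, Index, Align, reference_fasta, fastq, aligned_bam) are illustrative rather than prescribed by these slides, and the tasks themselves are filled in under Steps 3 through 6.

version 1.0

workflow BwaAlignment {
  input {
    File reference_fasta
    File fastq
  }

  # Task 1: build the BWA index from the reference FASTA
  call Index {
    input: reference_fasta = reference_fasta
  }

  # Task 2: align the reads against that index
  call Align {
    input:
      reference_fasta = reference_fasta,
      index_files = Index.index_files,
      fastq = fastq
  }

  # Workflow-level output: the final BAM we want back (see Step 6)
  output {
    File aligned_bam = Align.aligned_bam
  }
}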
Step 3: Define our parameters and inputs in a workflow script
We’ll use WDL, which can run locally, on compute clusters, or on the Broad’s FireCloud.
Index:
- Input: FASTA file
- Params: none
- Output: BWA indices
Align:
- Input: BWA indices
- Input: FASTQ
- Param: # threads
- (Param: # splits, if we chunk the work in Step 3.5)
- Output: aligned BAM
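A minimal sketch of those declarations in WDL 1.0, assuming the variable names below (they are not fixed by the slides). These fragments only become valid tasks once the command, runtime, and output sections from Steps 4 through 6 are added.

task Index {
  input {
    File reference_fasta        # Input: FASTA file; no parameters needed
  }
  # command / runtime / output sections: see Steps 4-6
}

task Align {
  input {
    File reference_fasta
    Array[File] index_files     # Input: BWA indices produced by Index
    File fastq                  # Input: reads to align
    Int threads = 4             # Param: # threads (the default is an assumption)
  }
  # command / runtime / output sections: see Steps 4-6
}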
Optional step 3.5: break our workflow into smaller chunks
Why?
Why not?
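One common way to chunk the work, sketched in WDL: scatter the alignment over pieces of the FASTQ and merge afterwards. SplitFastq and MergeBams are hypothetical helper tasks (not specified in these slides), splits would be the "# splits" parameter added as a workflow input, and the snippet replaces the single Align call inside the workflow body from Step 2.

# Split the reads into chunks (hypothetical helper task)
call SplitFastq {
  input: fastq = fastq, splits = splits
}

# Run one small, short-lived Align job per chunk; this is the step
# that pushes us toward a serverless-style architecture
scatter (chunk in SplitFastq.chunks) {
  call Align {
    input:
      reference_fasta = reference_fasta,
      index_files = Index.index_files,
      fastq = chunk
  }
}

# Reduce the per-chunk BAMs back into one file (hypothetical helper task)
call MergeBams {
  input: bams = Align.aligned_bam   # after a scatter, this is an Array[File]
}

This only pays off when the chunks are truly independent of one another, the same assumption called out in the serverless trade-offs above.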
Step 4: define our execution command in our workflow script.
We need to tell WDL exactly the command line we want, including variables (rather than hardcoded names) for our inputs.
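For example, the two command sections might look like the sketch below. Piping through samtools to produce a BAM is an assumption (the slides only say the output is a BAM file), as is the symlinking used to put the reference and its indices in one place.

# Index task: work on a local copy so the index files land in the
# task's working directory, where we can collect them as outputs.
command <<<
  cp ~{reference_fasta} ~{basename(reference_fasta)}
  bwa index ~{basename(reference_fasta)}
>>>

# Align task: note the ~{} placeholders instead of hardcoded filenames.
command <<<
  # BWA expects the index files to sit next to the FASTA, and executors
  # may localize them to different directories, so link them together.
  ln -s ~{reference_fasta} ~{sep=' ' index_files} .
  bwa mem -t ~{threads} ~{basename(reference_fasta)} ~{fastq} \
    | samtools view -b -o aligned.bam -
>>>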
Step 5: define our environment in the WDL
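In WDL this lives in each task's runtime section. For example (the image name is a hypothetical container bundling bwa and samtools, standing in for whatever we built in Step 1, and the resource numbers are placeholders):

runtime {
  docker: "my-registry/bwa-samtools:0.7.17"   # hypothetical image from Step 1
  cpu: threads                                # reuse the task's thread parameter
  memory: "8 GB"
}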
Step 6: define our outputs, so we can extract them from the cloud.
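For example, each task declares what it produced, and the workflow declares what should be copied back out of cloud storage for us. File names follow the earlier sketches and are illustrative.

# Index task: collect every index file BWA wrote next to the FASTA copy.
output {
  Array[File] index_files = glob("~{basename(reference_fasta)}.*")
}

# Align task: the BAM named in the Step 4 command section.
output {
  File aligned_bam = "aligned.bam"
}

# Workflow level (from the Step 2 sketch): this is the file the engine
# delocalizes back to storage for us.
output {
  File aligned_bam = Align.aligned_bam
}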
How would we make this serverless?
What’s wrong (and right) with our work here?