Modularization
The key to making your life easier when moving to the cloud is not to port whole pipelines as monolithic structures, but to break them up into bite-sized pieces.
Q: How does one eat a large meal?
A: One bite at a time.
[Slide diagram: the five components of any task (inputs, parameters, environment, execution, outputs), labeled by where each comes from: provided by the user, usually provided by the sysadmin, or produced by our code.]
Inputs: files
Parameters: values (ints, strings, etc) or constants (e.g. a reference genome)
Environment:
- What programs or modules are available?
- What environment variables are present?
Execution:
- How do we run our code? As a script? On the CLI? Using a workflow description language?
Outputs: the files and results our code produces
What are the components of a given task/workflow/pipeline/script?
inputs
parameters
environment
execution
outputs
Local Compute
Environment, execution, inputs, outputs and parameters all live in the same space
- Scaling: extremely limited
- Storage: limited
This setup looks like a mess for a reason.
HPC / Cluster Compute
Compute somewhat separated from inputs, parameters, outputs.
- Environment managed by modules or users.
- Execution (via bash scripts and a scheduler)
But others can’t rerun our experiments, and our environment is fragile (and expensive!)
[Slide diagram: the same five components split across the cluster. Inputs, parameters, and outputs live in storage alongside our scripts; the environment and execution live on the compute nodes; a scheduler sits between the two.]
Generic cloud separates this even further (but places it behind unfamiliar interfaces)
Scaling: mostly limitless
Execution: workflow language
Environment: containers or VM images
Inputs: usually sit in cloud storage
Parameters: sent to the cloud
Outputs: often in cloud storage
Note how much the cloud looks like HPC.
Pros:
- Easy to share our environment / inputs
- Better transparency of our environment
Cons:
- High development cost
- Major learning curve
- Can get locked into vendors
Serverless takes this one step further:
- We remove any thoughts about the environment.
- We shrink our inputs so that our processes can be made ephemeral.
- We package our parameters and inputs with our code.
- We run on an essentially unlimited number of tiny, identical compute environments.
- We reduce our outputs back into a single piece if needed.
Pros:
- Scaling: unlimited and dynamic
- Totally transferable and reproducible when done right
- Cheapest compute
Cons:
- Major development cost
- Mind-bending learning curve
- Requires (or assumes) independence across inputs
- Most importantly, it runs counter to how many bioinformatics workflows have traditionally worked
Let’s start stretching our brains
Let’s start with an example to get us into the cloud, then we’ll work on getting closer to our preferred serverless architecture.
First, a poll:
What interface do most of y'all use for analyses? CLI? GUI? R console? Python REPL?
DNA alignment (BWA mem) example
Inputs: FASTQ files
Parameters: FASTA, indexes, command line args
Environment: BWA executable, BWA version
Execution: CLI invocation (in a workflow or Bash)
Outputs: BAM files
Let’s live-code the components of moving BWA to a cloud architecture, then discuss the weaknesses of this approach and ways we could improve it.
Step 1: set up the environment
We aren’t going to re-write all of BWA into our own code, so we will wrap it in a container or VM.
Step 2: Modularize our workflow as much as we can
From FASTQ + FASTA to BAM, we have two steps (tasks):
- Index: build the BWA index from the reference FASTA.
- Align: run BWA mem to align the FASTQ reads against that index.
Together, these tasks compose a single workflow. Can you think of how else we could divide this work?
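Looking ahead to the workflow language we adopt in Step 3, one way the composition might be wired is the WDL sketch below. The names (BwaAlignment, Index, Align, reference_fasta, fastq, aligned_bam) are illustrative rather than prescribed by these slides, and the tasks themselves are filled in under Steps 3 through 6.

version 1.0

workflow BwaAlignment {
  input {
    File reference_fasta
    File fastq
  }

  # Task 1: build the BWA index from the reference FASTA
  call Index {
    input: reference_fasta = reference_fasta
  }

  # Task 2: align the reads against that index
  call Align {
    input:
      reference_fasta = reference_fasta,
      index_files = Index.index_files,
      fastq = fastq
  }

  # Workflow-level output: the final BAM we want back (see Step 6)
  output {
    File aligned_bam = Align.aligned_bam
  }
}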
Step 3: Define our parameters and inputs in a workflow script
We’ll use WDL, which can run locally, on compute clusters, or on the Broad’s FireCloud.
Index:
- Input: FASTA file
- Params: none
- Output: BWA indices
Align:
- Input: BWA indices
- Input: FASTQ
- Param: # threads
- (Param: # splits, if we chunk the work in Step 3.5)
- Output: aligned BAM
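A minimal sketch of those declarations in WDL 1.0, assuming the variable names below (they are not fixed by the slides). These fragments only become valid tasks once the command, runtime, and output sections from Steps 4 through 6 are added.

task Index {
  input {
    File reference_fasta        # Input: FASTA file; no parameters needed
  }
  # command / runtime / output sections: see Steps 4-6
}

task Align {
  input {
    File reference_fasta
    Array[File] index_files     # Input: BWA indices produced by Index
    File fastq                  # Input: reads to align
    Int threads = 4             # Param: # threads (the default is an assumption)
  }
  # command / runtime / output sections: see Steps 4-6
}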
Optional step 3.5: break our workflow into smaller chunks
Why?
Why not?
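One common way to chunk the work, sketched in WDL: scatter the alignment over pieces of the FASTQ and merge afterwards. SplitFastq and MergeBams are hypothetical helper tasks (not specified in these slides), splits would be the "# splits" parameter added as a workflow input, and the snippet replaces the single Align call inside the workflow body from Step 2.

# Split the reads into chunks (hypothetical helper task)
call SplitFastq {
  input: fastq = fastq, splits = splits
}

# Run one small, short-lived Align job per chunk; this is the step
# that pushes us toward a serverless-style architecture
scatter (chunk in SplitFastq.chunks) {
  call Align {
    input:
      reference_fasta = reference_fasta,
      index_files = Index.index_files,
      fastq = chunk
  }
}

# Reduce the per-chunk BAMs back into one file (hypothetical helper task)
call MergeBams {
  input: bams = Align.aligned_bam   # after a scatter, this is an Array[File]
}

This only pays off when the chunks are truly independent of one another, the same assumption called out in the serverless trade-offs above.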
Step 4: define our execution command in our workflow script.
We need to tell WDL exactly the command line we want, including variables (rather than hardcoded names) for our inputs.
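For example, the two command sections might look like the sketch below. Piping through samtools to produce a BAM is an assumption (the slides only say the output is a BAM file), as is the symlinking used to put the reference and its indices in one place.

# Index task: work on a local copy so the index files land in the
# task's working directory, where we can collect them as outputs.
command <<<
  cp ~{reference_fasta} ~{basename(reference_fasta)}
  bwa index ~{basename(reference_fasta)}
>>>

# Align task: note the ~{} placeholders instead of hardcoded filenames.
command <<<
  # BWA expects the index files to sit next to the FASTA, and executors
  # may localize them to different directories, so link them together.
  ln -s ~{reference_fasta} ~{sep=' ' index_files} .
  bwa mem -t ~{threads} ~{basename(reference_fasta)} ~{fastq} \
    | samtools view -b -o aligned.bam -
>>>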
Step 5: define our environment in the WDL
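In WDL this lives in each task's runtime section. For example (the image name is a hypothetical container bundling bwa and samtools, standing in for whatever we built in Step 1, and the resource numbers are placeholders):

runtime {
  docker: "my-registry/bwa-samtools:0.7.17"   # hypothetical image from Step 1
  cpu: threads                                # reuse the task's thread parameter
  memory: "8 GB"
}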
Step 6: define our outputs, so we can extract them from the cloud.
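For example, each task declares what it produced, and the workflow declares what should be copied back out of cloud storage for us. File names follow the earlier sketches and are illustrative.

# Index task: collect every index file BWA wrote next to the FASTA copy.
output {
  Array[File] index_files = glob("~{basename(reference_fasta)}.*")
}

# Align task: the BAM named in the Step 4 command section.
output {
  File aligned_bam = "aligned.bam"
}

# Workflow level (from the Step 2 sketch): this is the file the engine
# delocalizes back to storage for us.
output {
  File aligned_bam = Align.aligned_bam
}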
How would we make this serverless?
What’s wrong (and right) with our work here?