1 of 115

Dockstore Fundamentals:

Introduction to Docker and Descriptors for Reproducible Analysis

Louise Cabansay, Software Engineer, UC Santa Cruz Genomics Institute

Andrew Duncan, Software Engineer, Ontario Institute for Cancer Research

Denis Yuen, Senior Software Engineer, Ontario Institute for Cancer Research


2 of 115

Learning Objectives

  • What is Dockstore?
  • Introduction to Docker
  • Introduction to Descriptors
    • Overview of descriptor languages (CWL, WDL, Nextflow)
    • Practice using WDL
  • Dockstore Usage
    • Key features
    • Best practices
  • Our overall goal today is to give you a good foundation in the basics of Dockstore. We will provide a variety of take-home materials to guide you on where to go next.


3 of 115

Format and Setup

  • Lecture + Examples (slides)
  • Q/A:
    • #dockstore-fundamentals on Discord can also be a great place for questions
    • Click “yes” in Zoom when finished with an exercise
    • Introduce TAs


4 of 115

Format and Setup

  • Hands-on practice (Instruqt)
    • Message us on Discord if you did not get the link
    • Browser-based tutorial environment for your exercises
    • Close the sidebar in Instruqt for additional screen space


Chrome Browser Required (currently a bug on Instruqt w/ Firefox)

Download: https://www.google.com/chrome/

5 of 115

What is Dockstore?

Dockstore is a free and open source platform for sharing scientific tools and workflows.

Portability

Interoperability

Reproducibility


“An app store for bioinformatics”

6 of 115

Portability:


Software is “packaged” using container technology and described using descriptor languages

  • Analysis can be moved from environment-to-environment (local machines, clouds, servers) and yet be guaranteed to run on anything that supports Docker

(Diagram: Container + Descriptor)

7 of 115

What’s on Dockstore? Tools and Workflows

Tool: Container + Descriptor

Workflow: Tools + Descriptor

A tool uses a single container and performs a single action or step that is outlined by a descriptor.

A workflow can use multiple containers and execute multiple actions or steps, still outlined by a descriptor.

8 of 115

Example:

(Diagram: BWA-MEM shown as a Tool, a single container bundling Debian OS, binaries, GNU Make, GCC, and zlibs, performing one step (ex: alignment). A Variant Calling Pipeline shown as a Workflow with multiple steps: processing, alignment, call variants.)

A tool uses a single container and performs a single action or step that is outlined by a descriptor.

A workflow can use multiple containers and execute multiple actions or steps, still outlined by a descriptor.

*note: you can register a single tool as a workflow on Dockstore, but a multi-step workflow cannot be registered as a tool.

ex: BWA-MEM can technically be registered as both

9 of 115

Interoperability:

(Diagram: Source Control, Docker Registries, and Analysis Environments surrounding Dockstore.)

Integration with various sites allows Dockstore to function as a centralized catalog of bioinformatics tools and workflows.

By following GA4GH API standards, Dockstore enables users to run tools and workflows in their preferred compute and analysis environments.

700+ tools and workflows published to Dockstore

10 of 115

Reproducibility: Create, Share, Use

  • Dockstore is a place for researchers and developers to share their work so that others can also use it
  • The combination of containers and workflow languages minimizes redundant and error-prone installation
    • Increases the transparency of analysis methods
    • Allows others to verify results and apply existing methods into their own research
  • Other Dockstore features to increase reproducibility:
    • Versioning, snapshots, and metadata handling
    • Generating Digital Object Identifiers (DOIs) via Zenodo
    • Organizations and Collections for sharing and findability
    • and more!


11 of 115

Docker Basics

*as used on Dockstore


12 of 115

What is a container? What is Docker?


Container:

A container encapsulates all the software dependencies associated with running a program.

  • Allows for portable software that runs quickly and reliably from one computing environment to another.

Docker:

A particularly popular brand of container

  • Some users may wish to explore alternatives like Singularity

13 of 115

What kinds of problems are solved by containers?

  • Installation problems
    • Software was built on a different OS (executable files don’t run)
    • Install documentation is unclear or out of date
  • Dependency problems
    • Software requires different version than what is available on machine (ex. Java, Python)
    • Multiple programs have shared dependencies, but different versions of those dependencies
  • Portability problems
    • Software can be run on any host OS that has Docker installed


14 of 115

Docker Concepts: Container, Image, Registry


Image:

Packaged-up code with all of its dependencies, at rest.

  • Allows for portable software that runs quickly and reliably from one computing environment to another.

Official Image

  • Official images on Docker Hub are regularly updated and scanned for vulnerabilities

Registry:

Repositories where users can store images privately or publicly in the cloud.

Dockstore itself does not host images, but rather gets them from Image Registries:

  • Docker Hub
  • Quay.io
  • Google Container Registry (GCR)

Container:

A running image

  • Packaged up, isolated software environment
  • All dependencies included

**the terms container and image are often used interchangeably, but there is a slight distinction.
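For example, pulling an official image from a registry and confirming it is stored locally (a minimal sketch; the tag is just an illustration):

docker pull ubuntu:18.04    # download the official ubuntu image from Docker Hub
docker image ls             # the image now appears in your local image list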

15 of 115

Docker Ecosystem

(Diagram: the Docker CLI on the Host Machine talks to the Docker Daemon, which manages local Images and Containers. docker pull and docker build bring images onto the host, docker run starts containers, and images move to and from an Image Registry: Docker Hub, Quay.io, GCR.)

16 of 115

Docker Ecosystem

(Same diagram, highlighting docker pull: fetch an image from a registry onto the host machine.)

17 of 115

Docker Ecosystem

(Same diagram, highlighting docker run: start a container from a local image.)

Note: the docker run command will also pull from a registry if the image doesn't already exist on the machine.

18 of 115

Start Up Instruqt


Wait for the 'Start' button to show up.

19 of 115

Docker Client (CLI)

A command-line utility for:

  • downloading and building Docker images
  • running Docker containers
  • managing both images and containers (think disk space, cleanup)


docker [sub-command] [-flag options] [arguments]
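For instance, a few concrete invocations of this pattern (all standard Docker commands):

docker info                  # show system-wide information
docker image ls              # list images stored locally
docker container ls -a       # list containers, including stopped ones
docker system prune          # clean up stopped containers and dangling images (reclaims disk space)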

20 of 115

Basic Docker Sub Commands:

Docker has a whole library of commands; here are some basic examples:

docker info [OPTIONS]
    Display system-wide information about your installation of Docker.

docker image [COMMAND]
    Manage Docker images.

docker container [COMMAND]
    Manage Docker containers.

docker run [-flags] [registry name]/[path to image repository]:[tag] [arguments]
    Run Docker containers.

21 of 115

How are containers commonly used?

Run and done

  1. Execute docker run command
  2. Actions executed in container (stuff happens)
  3. Container stops running after completion


22 of 115

How are containers commonly used?

Run and done

  • Execute docker run command
  • Actions executed in container (stuff happens)
  • Container stops running after completion

Run continuously

  1. Execute docker run command
  2. Container starts and runs in the background continuously
  3. Other processes can interact with the container (stuff happens)
  4. Container keeps running unless it is stopped


This will be your main way of using containers!
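As a sketch of the 'run continuously' pattern (nginx is a stand-in image, not part of today's exercises):

docker run -d --name myserver nginx    # -d (detached) starts the container in the background
docker ps                              # the container is still running
docker stop myserver                   # stops it; otherwise it keeps running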

23 of 115

How do I ‘run’ a container?

docker run [-flag options] [registry name]/[path to image repository]:[tag] [arguments]

  • run command: the base of every invocation.
  • -flag options: additional options (ex. --name to specify a name for the container).
  • registry name: dockerhub (the default), quay.io, or gcr.io; only specify the registry if it's not Docker Hub.
  • path to image repository: both official containers and user containers are available.
  • tag: the 'version' of the image you want to run.
  • arguments: generally, what gets passed into the container (ex: the command you want to run).
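Putting the pieces together, a hypothetical invocation using the samtools image that appears later in this training:

# [-flags]        [registry]/[image repository]:[tag]     [arguments]
docker run --name mytools quay.io/ldcabansay/samtools:latest samtools --version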

24 of 115


25 of 115

Exercise #1a: Running containers

Exercise:

Use the whalesay container from Docker Hub to print a welcome message:

docker run docker/whalesay cowsay "fill me in"

docker run [-flag options] [registry name]/[path to image repository]:[tag] [arguments]

26 of 115

Exploring containers interactively

docker run [-flag options] [registry name]/[path to image repository]:[tag] [arguments]

Example: enter the samtools container and confirm that samtools is installed:

docker run -it quay.io/ldcabansay/samtools:latest

  • The '-it' flags drop you inside the container
  • (-i) keeps STDIN open for interactive use and (-t) allocates a terminal
  • You can then interact with the container's terminal like a normal command line

27 of 115

Sharing data between host and container

Bind mounts (-v) (aka two-way data binding)

  • Map a host directory to a directory in a container
  • Any files added to either directory are available in the other

docker run -v [ host path where data is ]:[ container path to put data ] ...

docker run -v /usr/data:/tmp/data ...

Output stored in the container directory /tmp/data will also be available on the host at /usr/data.

Note: using absolute paths is highly recommended
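A quick sketch of the two-way binding (paths are illustrative):

touch /usr/data/example.txt                                  # create a file on the host...
docker run -v /usr/data:/tmp/data ubuntu:18.04 ls /tmp/data  # ...and it is visible inside the container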

28 of 115

Exercise #1b: exploring containers

docker run -v [ host path ]:[ container path ] ...

Exercises:

1. Enter the samtools container, but this time bring in some data!

docker run -it -v /root/bcc2020-training/data:/data quay.io/ldcabansay/samtools:latest

2. Convert a sam file to a bam file using the samtools container:

docker run -v /root/bcc2020-training/data:/data quay.io/ldcabansay/samtools:latest samtools view -S -b /data/mini.sam -o /data/mini.bam

29 of 115

Data binding: In-depth (Extra Reading)

  • Useful tips and tricks to know about data binding
  • Advanced features and caveats to keep in mind
  • Read more here: https://docs.docker.com/storage/


30 of 115

Dockerfiles: Custom Images

  • Sometimes an existing image isn’t available for the software we want to use or an existing image may be lacking something we require
  • Dockerfiles are used to create our own custom images
    • Starts from a base image
    • Contains a series of steps that set up our environment
  • We can then also share these custom images via an image registry


31 of 115

Primer: How is software installed and used?

The author of a Dockerfile programmatically details the software installation and any other steps for environment setup. An image built from the Dockerfile can then be used 'off-the-shelf' by others.

Package managers
  • Managed collection of software with automated install, upgrades, and removal

Executable files or binaries (ex: *.jar, grep, tar, diff, md5sum)
  • Software that has already been built or compiled into an executable file

Building or running from source files
  • Compiling from source builds an executable (requires a compiler & dependencies)
  • For languages that don't need compiling (Python), the necessary runtime dependencies must be present in the environment

32 of 115

Dockerfiles Overview:

A simple text file with instructions to build an image:

  • Plain-text, one-instruction-per-line syntax (FROM, RUN, ENV, ...)
  • Start from a base image
  • Metadata
  • Install software and dependencies
  • Set up scripts
  • Other environment prep
  • Define commands to run when the container starts

Dockerfile instructions:

FROM          base image (start)
MAINTAINER    author metadata
RUN           commands to: install software, install dependencies, run scripts, misc. environment setup
ENV           environment variables
CMD           command to execute when the container starts (optional)

33 of 115

Dockerfiles - local

(Diagram: a Dockerfile on the Host Machine is fed to docker build, which has the Docker Daemon create a local Image; docker run then starts Containers from it, and docker pull fetches existing images.)

Dockerfile: configuration to set up a docker image.

34 of 115

Dockerfiles - local

(Same diagram, highlighting docker build: build a local image from your Dockerfile.)

35 of 115

Example: BWA (via package manager)

  • Images of containers are built from Dockerfiles
  • Dockerfiles describe the packaged up environment:
    • operating system or base image to build upon
    • dependencies needed for the software
    • the actual analysis software

Dockerfile:

#######################################################
# Dockerfile to build a sample container for bwa
#######################################################

# Start with a base image
FROM ubuntu:18.04

# Add file author/maintainer and contact info (optional)
MAINTAINER Louise Cabansay <lcabansa@ucsc.edu>

# set user you want to install packages as
USER root

# update package manager & install dependencies (if any)
RUN apt update

# install analysis software from package manager
RUN apt install -y bwa

36 of 115

Basic Commands: docker build

docker build [-flag options] [build context]

docker build -t bwa:v1.0 .
    ( -t ) : builds an image and tags it bwa:v1.0 (run from the directory containing the Dockerfile)

docker build -t bwa:v1.0 -f dockerfiles/bwa/Dockerfile .
    ( -f ) : builds a specific Dockerfile by providing its path (relative to the build context)

docker image ls
    View built Docker images

37 of 115


38 of 115

Exercise #2a: Writing your first Dockerfile: tabix

Dockerfiles describe the packaged up environment:

  • operating system or base image to build upon:
    • ubuntu:18.04
  • Install dependencies needed: N/A
  • Install actual analysis software:
    • tabix

Dockerfile template:

# Start with a base image
FROM { base image name }

# Add file author/maintainer and contact info (optional)
MAINTAINER {your name} <youremail@research.edu>

# set user you want to install packages as
USER root

# update package manager & install dependencies (if any)
RUN apt update

# install analysis software from package manager
RUN apt install -y { software package name }

Build an image from the Dockerfile:

docker image build -t { name } -f { path to dockerfile } .

39 of 115

Exercise #2b: Try out your new container!

Exercises:

1. Verify that your image was built (get the image ID to use in part 2):

docker image ls

2. Use your local image to view the tabix command help:

docker run [image id] tabix

docker container run [-flag options] [registry name]/[path to image repository]:[tag] [args]

40 of 115

Ex: bamstats (executable)

  • Dockerfiles describe the packaged up environment:
    • operating system or base image to build upon
    • dependencies needed for the software
    • the actual analysis software
      • Here it’s an already compiled executable
    • scripts or commands to complete software and environment set-up

Dockerfile:

# Start with a base image
FROM ubuntu:14.04

# Add file author/maintainer and contact info (optional)
MAINTAINER Brian OConnor <briandoconnor@gmail.com>

# install software dependencies
USER root
RUN apt-get -m update && apt-get install -y wget unzip \
    openjdk-7-jre zip

# manual software installation from source
# get the tool and install it in /usr/local/bin
RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip

# commands/scripts to finish software setup
RUN unzip BAMStats-1.25.zip && \
    rm BAMStats-1.25.zip && \
    mv BAMStats-1.25 /opt/
COPY bin/bamstats /usr/local/bin/
RUN chmod a+x /usr/local/bin/bamstats

# switch back to the ubuntu user so this tool (and the files written) are not owned by root
RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 -m ubuntu
USER ubuntu

# command /bin/bash is executed when container starts
CMD ["/bin/bash"]

41 of 115

Example: samtools (compile from source files)

  • Images of containers are built from Dockerfiles
  • Dockerfiles describe the packaged up environment:
    • operating system or base image to build upon
    • dependencies needed for the software
    • the actual analysis software
    • scripts or commands to complete environment set-up

Dockerfile:

# Start with a base image
FROM ubuntu:18.04

# Add file author/maintainer and contact info (optional)
MAINTAINER Louise Cabansay <lcabansa@ucsc.edu>

# install software dependencies
RUN apt update && apt -y upgrade && apt install -y \
    wget build-essential libncurses5-dev zlib1g-dev \
    libbz2-dev liblzma-dev libcurl3-dev

WORKDIR /usr/src

# get the software source files, then compile them
RUN wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2 && \
    tar xjf samtools-1.10.tar.bz2 && \
    rm samtools-1.10.tar.bz2 && \
    cd samtools-1.10 && \
    ./configure --prefix $(pwd) && \
    make

# add newly built executables to path
ENV PATH="/usr/src/samtools-1.10:${PATH}"

42 of 115

Sharing your Dockerfiles and Images

A Dockerfile contains the configuration to package up your software into an image.

(Diagram: Dockerfile → Source Control → Image Registry)

Dockstore recommends storing your Dockerfile in an external repository (Bitbucket, GitHub, GitLab) and then registering your source controlled Dockerfile to an image registry (Docker Hub, Quay.io, Google Container Registry, etc)
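For example, pushing a locally built image to Docker Hub might look like this (the account and repository names are placeholders):

docker tag bwa:v1.0 myaccount/bwa:v1.0    # re-tag the local image under your registry account
docker login                              # authenticate to Docker Hub
docker push myaccount/bwa:v1.0            # upload the image so others (and Dockstore) can reference it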

43 of 115

Best Practices (Take home reading)

  • Start from official images ( https://docs.docker.com/docker-hub/official_images/ )
  • Only add what you need (keep containers light)
    • One way to do this is to use multiple containers in your workflows
    • Recommend not including large reference data within containers
      • Instead provide that data at runtime through data binding or via platform
  • Version your containers for re-use later (use --tag when building)


44 of 115

What Next?

Docker is great: it tells us how to install software.

However, it doesn’t tell us how to use software.

Descriptor languages are the solution!


45 of 115

Break


46 of 115

What Next?

Docker is great: it tells us how to install software.

However, it doesn’t tell us how to use software.

Descriptor languages are the solution!


47 of 115

Intro to Descriptors (WDL)


48 of 115

Components and Concepts shared by Descriptors


Descriptor:

A workflow language used to describe how to run your pipeline.

  • Which containers
  • What steps and when
  • Define parameters
    • I/O data
    • compute requirements
  • Metadata

Parameter File (WDL, CWL):

  • Specifies the actual input/output files (local, ftp, http, or cloud)
  • Set compute resources
  • JSON, YAML

Container:

Packaged up code with all of its dependencies. This allows for portable software that runs quickly and reliably from one computing environment to another.

49 of 115

CWL: Common Workflow Language


  • Open and portable standard for describing analysis workflows and tools
  • Alternative to WDL/nextflow that might be present in your lab/cloud
  • The upcoming tutorial example is also available in CWL

Implementations/Engines:

  • CWL doesn’t have a single official engine
  • CWLtool - reference implementation
  • Other implementations include Arvados, Toil, and Cromwell

Analysis Platforms (Launch-with)

  • Seven Bridges Cancer Genomics Cloud (CGC)

50 of 115

Nextflow


  • A fluent domain-specific language, also for scientific workflows using software containers
  • View it as an alternative to CWL/WDL that might be present in your lab/cloud
  • The upcoming tutorial example is also available in Nextflow

Running nextflow workflows

  • Works on local machines, HPC, AWS, Google Cloud
  • Cloud support via Seqera Labs

51 of 115

WDL: Workflow Description Language


  • A human-readable and -writable descriptor language

Engines:

  • Cromwell: “A Workflow Management System geared towards scientific workflows”
    • The first execution engine that understands WDL
  • Other engines: Toil, miniWDL

Analysis Platforms (Launch-with)

  • Terra (AnVIL, BioDataCatalyst)
  • DNAstack
  • DNAnexus
  • Also works on local machines, HPC, and Cloud

52 of 115

What’s in a WDL? Top-level Components

3 top-level components that are part of the core structure of a WDL script

  • Workflow
  • Call
  • Task


53 of 115

Top-level Components - Workflow

Workflow: Code block that defines the overall workflow. You can think of it as an outline.

workflow.wdl:

workflow myWorkflowName {
}

54 of 115

Top-level Components - Workflow

Workflow: Code block that defines the overall workflow. You can think of it as an outline.

  • Inputs (optional)
    • Ex: use when building more complex workflows that will re-use inputs for multiple purposes

  • Outputs
    • Specify the output you want to keep from a run of the entire workflow
    • Signals to Cromwell to keep track of these outputs and save them somewhere

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  output {...}
}
55 of 115

Top-level Components - Call

Call: Component that defines which tasks the workflow will run

  • Located within workflow block
  • can also specify input parameters to pass to that task (optional)

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  call task_A
  call task_B {
    input: ...
  }
  output {...}
}

56 of 115

Top-level Components - Task

Task: Defines all the information necessary to perform an action.

  • Tasks are referred to within a call, but are actually defined outside of the workflow block

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  call task_A
  call task_B {
    input: ...
  }
  output {...}
}

task task_A { ... }
task task_B { ... }

57 of 115

What’s in a Task?

Task: Defines all the information necessary to perform an action

task.wdl:

task doSomething {
}

58 of 115

What’s in a Task? Command

Task: Defines all the information necessary to perform an action

  • Command (‘the action’) - required!
    • Defines the command(s) that will be run in the execution environment
    • Can be multiple lines/commands

task.wdl:

task doSomething {
  command {
    echo Hello World!
    cat ${myName}
  }
}

59 of 115

What’s in a Task? Inputs

Task: Defines all the information necessary to perform an action

  • Inputs
    • Optional, only required if the task will have inputs
    • All inputs must be typed
      • string, int, file, etc
    • Individual inputs can also be optional, denoted by '?' (see below)
    • Can also set a default value (see below)

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
}

Optional input:   input { File? myName }
Default value:    input { String myName = "Foobar" }

60 of 115

What’s in a Task? Outputs

Task: Defines all the information necessary to perform an action

  • Outputs
    • The outputs section defines which values should be exposed as outputs after a successful run of the task
      • Especially useful when using output of one task as input of another
    • Outputs must be typed
      • ex: string, int, file, etc

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
}

61 of 115

Simple Example: HelloWorld.wdl

HelloWorld.wdl:

version 1.0

# add and name a workflow block
workflow hello_world {
  call hello

  # important: add output for whole workflow
  output {
    File helloFile = hello.outFile
  }
}

# define the 'hello' task
task hello {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output { File outFile = "Hello.txt" }
}

62 of 115

Parameter JSON (simple):

  • A parameter JSON specifies the actual input values to fill-in for the WDL parameters when running the workflow

The key is structured as {workflow name}.{task name}.{parameter name}; the value can be a path, string, int, array, etc.

hello.json:

{
  "hello_world.hello.myName": "/<usr>/bcc2020/wdl-training/exercise1/name.txt"
}

Note: using absolute paths is highly recommended

63 of 115


64 of 115

Exercise #1: Run your first wdl

  • Single task workflow
    • Workflow block not required

HelloWorld.wdl:

version 1.0

workflow hello_world {
  call hello
  output { File helloFile = hello.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output { File outFile = "Hello.txt" }
}

Run with the Dockstore CLI:

dockstore workflow launch --local-entry HelloWorld.wdl --json hello.json

65 of 115

Overview: Dockstore CLI

A handy command-line resource to help users develop content locally.

  • Run descriptors by automatically calling to Cromwell or CWLtool
    • Local descriptors
    • Remote descriptors pulled from Dockstore
  • Generate a JSON parameter template based on a given descriptor
  • Built-in plugins enable fetching remote input data (http, s3, gs)


Example execution with the Dockstore Command Line Interface (CLI):

dockstore workflow launch --local-entry HelloWorld.wdl --json hello.json

66 of 115

What’s in a Task? Runtime

Task: Defines all the information necessary to perform an action

  • Runtime
    • Defines context/environment
      • docker containers
      • compute resources

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
  runtime {
    docker: "ubuntu:latest"
    memory: "1GB"
  }
}

    • For cloud platforms, runtime parameters can be used to allocate/configure resources when spinning up a compute instance
      • CPU
      • Disk
      • Memory
      • Instance types
      • Region
      • Preemptibility
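A sketch of a more fully specified runtime section, using attribute names supported by Cromwell on Google Cloud (available attributes vary by engine and backend):

runtime {
  docker: "ubuntu:latest"
  cpu: 2                          # request 2 CPU cores
  memory: "4 GB"
  disks: "local-disk 50 HDD"      # attach a 50 GB disk
  preemptible: 1                  # allow one attempt on a cheaper preemptible instance
}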

67 of 115

What’s in a Task? Parameterization

Task: Defines all the information necessary to perform an action

  • Parameterized values
    • Variables that define placeholder requirements
    • flexibility, reuse, repurposing
    • Pretty much everything can be parameterized
      • Commands
      • Inputs, outputs
      • Runtime requirements

task.wdl:

task doSomething {
  input {
    File myName
    String outFile
    # parameterized runtime values also come in as inputs
    String docker_image
    Int memory_gb
  }
  command {
    echo Hello World! > ${outFile}
    cat ${myName} >> ${outFile}
  }
  output {
    File out = "${outFile}"
  }
  runtime {
    docker: docker_image
    memory: "${memory_gb}"
  }
}

But is this always a best practice?

68 of 115

A task can have declarations which are intermediate values rather than inputs.

  • Think of these as variables that you can define to help execute a task
  • Typed: String, Int, File, etc
  • Can use or build upon:
    • the input values given to task
    • methods from WDL standard library
    • other non-input declarations
  • After being defined, they can be used in other sections: command, outputs, runtime

task.wdl:

task doSomething {
  input { String myName }

  # creating non-input declarations
  String myString = "hi " + myName
  String outFile = myName + ".out"

  # example usage in command
  command {
    echo ${myString} > ${outFile}
  }

  # example usage in output
  output {
    File out = "${outFile}"
  }
}

69 of 115

WDL Standard Library (Take Home)

Built-in functions or methods provided by the core WDL language

  • Helpful in writing more complex workflows
  • Examples:
    • File handling (read, write different kinds files: JSON, tsv, etc )
    • Working with stdin, stdout
    • Data manipulation/handling: arrays, objects, strings, etc
    • Mapping values
    • Much more!
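As a small sketch, a task that leans on two functions seen elsewhere in this training, basename() and stdout() (the task itself is illustrative):

task count_lines {
  input { File input_sam }

  # basename() strips the directory from a path
  String name = basename(input_sam)

  command {
    echo ${name}
    wc -l ${input_sam}
  }

  output {
    # stdout() captures the command's standard output as a File
    File report = stdout()
  }
}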


70 of 115

WDL Standard Library (Simple)

task.wdl (writing output to a file):

task hello {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
}

Output: "hello.outFile" : ".../Hello.txt"

task.wdl (capturing stdout via the standard library):

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output {
    File outFile = stdout()
  }
}

Output: "hello.outFile" : ".../stdout"

71 of 115

Primer Exercise #2

For our second exercise we’re going to parameterize a simple workflow.

Goal: generate statistics about an alignment file

  • Software: samtools (flagstat)
  • Input: sam file
  • Output: alignment statistics

There will be multiple ways to solve this assignment. This is a chance for you to apply the things we’ve learned to a real bioinformatics workflow.


72 of 115

Exercise #2: Complete metrics.wdl

  1. Set the runtime to use the samtools docker container: quay.io/ldcabansay/samtools:latest
  2. Parameterize the samtools command in the flagstat task
  3. (optional) If you make any new inputs, be sure to update metrics.json

Reminder, runtime syntax:

runtime {
  docker: "ubuntu:latest"
  memory: "1GB"
}

metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # non-parameterized flagstat command
  command {
    samtools flagstat mini.sam > mini.sam.metrics
  }

  output {
    File metrics = "mini.sam.metrics"
  }

  # set some parameterized runtime parameters
  runtime {
    docker: # set
  }
}

73 of 115

Exercise #2: Solution* - Descriptors

The skeleton, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # non-parameterized flagstat command
  command {
    samtools flagstat mini.sam > mini.sam.metrics
  }

  output {
    File metrics = "mini.sam.metrics"
  }

  # set some parameterized runtime parameters
  runtime {
    docker: # set
  }
}

One possible solution, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # slightly parameterized flagstat command
  command {
    samtools flagstat ${input_sam} > mini.sam.metrics
  }

  output { File metrics = "mini.sam.metrics" }

  # set docker runtime
  runtime {
    docker: "quay.io/ldcabansay/samtools:latest"
  }
}

*note: this is one example solution, multiple are possible

74 of 115

Exercise #2: Solution vs Solution2

Solution 1, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # slightly parameterized flagstat command
  command {
    samtools flagstat ${input_sam} > mini.sam.metrics
  }

  output { File metrics = "mini.sam.metrics" }

  # set docker runtime
  runtime {
    docker: "quay.io/ldcabansay/samtools:latest"
  }
}

Solution 2, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input {
    File input_sam
    String docker_image
  }

  # create a string to help parameterize the command
  String stats = basename(input_sam) + ".metrics"

  command {
    samtools flagstat ${input_sam} > ${stats}
  }

  output { File metrics = "${stats}" }

  # set a parameterized docker runtime
  runtime {
    docker: docker_image
  }
}

*note: this is one example solution, multiple are possible

75 of 115

Break


76 of 115

Multi-task workflows:

  • In real use cases, workflows are typically multi-task pipelines that build upon individual steps
    • The output of one task can serve as the input of another
  • Each task in a workflow can have its own isolated runtime environment
    • Docker image
    • Task specific compute requirements


77 of 115

Example: Multi-task workflow

HelloWorld.wdl:

version 1.0

workflow hello_world {
  call hello
  output { File helloFile = hello.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output { File outFile = stdout() }
}

GoodbyeWorld.wdl:

version 1.0

workflow goodbye_world {
  call goodbye
  output { File byeFile = goodbye.outFile }
}

task goodbye {
  input { File greeting }
  command {
    cat ${greeting}
    echo See you later!
  }
  output { File outFile = stdout() }
}

78 of 115

Example: Multi-task workflow - HelloGoodbye.wdl

HelloGoodbye.wdl:

version 1.0

workflow HelloGoodbye {
  call hello
  call goodbye {
    input: greeting = hello.outFile
  }
  output { File hello_goodbye = goodbye.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output { File outFile = stdout() }
}

task goodbye {
  input { File greeting }
  command {
    cat ${greeting}
    echo See you later!
  }
  output { File outFile = stdout() }
}

HelloGoodbye.json:

{
  "HelloGoodbye.hello.myName": "/root/bcc2020-training/wdl-training/exercise3/hello_examples/name.txt"
}

79 of 115

WDL Imports

A WDL file may contain import statements to include WDL code from other sources.

  • Useful in keeping large multi-task workflows organized
  • Helpful way to reuse and build modular workflows


80 of 115

Imports: Concepts

  • Primary Descriptor
    • The ‘main’ descriptor file, anchors relevant imports
  • Sub-workflow aka sub-wdl
    • the imported, external WDL
  • Namespaces
    • prevents name collisions, organizes tasks/workflows into groups
  • Aliases
    • A custom name given to a namespace (optional)
    • Not specifying an alias defaults the namespace to the filename without '.wdl'

primary-descriptor.wdl:

import "<resource>" as <alias>

workflow primary {
  call task_A
  call <alias>.taskOne {
    input: ...
  }
}

task task_A { ... }

JSON mapping, keyed as {workflow name}.{task name}.{parameter name}:

"primary.taskOne.param_name": "<value of param or path if file>"

81 of 115

Example: No Imports vs Imports

HelloGoodbye.wdl (no imports):

version 1.0

workflow HelloGoodbye {
  call hello
  call goodbye {
    input: greeting = hello.outFile
  }
  output { File hello_goodbye = goodbye.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output { File outFile = stdout() }
}

task goodbye {
  input { File greeting }
  command {
    cat ${greeting}
    echo See you later!
  }
  output { File outFile = stdout() }
}

HelloGoodbye_imports.wdl (with imports):

version 1.0

# add import statements to bring in sub-workflows
# if not given, namespace/alias = filename minus '.wdl'
import "HelloWorld.wdl"

# otherwise, namespace = <alias>
import "GoodbyeWorld.wdl" as bye

workflow HelloGoodbye {
  # call the hello task, syntax: <alias>.taskname
  call HelloWorld.hello

  # call the goodbye task, syntax: <alias>.taskname
  call bye.goodbye {
    input: greeting = hello.outFile
  }

  # same as before, define workflow outputs
  output { File hello_goodbye = goodbye.outFile }
}

82 of 115

Example: No Imports vs Imports

(Same side-by-side comparison as the previous slide.)

Do we have to change the JSON when running HelloGoodbye using imports?

83 of 115

Example: No Imports vs Imports (no comments)

HelloGoodbye.wdl (no imports): unchanged from the previous slides.

HelloGoodbye_imports.wdl:

version 1.0

import "HelloWorld.wdl"
import "GoodbyeWorld.wdl" as bye

workflow HelloGoodbye {
  call HelloWorld.hello
  call bye.goodbye {
    input: greeting = hello.outFile
  }
  output { File hello_goodbye = goodbye.outFile }
}

84 of 115

Primer for Exercise #3: (if time permits)

  • Aligner (bwa)
    • Align sequence files (FASTQs) to a reference
    • Produces an alignment file: sam
  • Metrics (samtools flagstat)
    • Evaluates an alignment file (sam or bam)
    • Reports statistics about the alignment
  • What we’ll make:
    • A workflow that does both of these tasks, first aligns, then generates statistics about the alignment
      • Without imports
      • With imports


85 of 115

Ex: BWA Aligner

  • Specify WDL version

  • Define workflow, call, and task(s)

  • Define parameters
    • Input and output

    • Parameterized command(s)

    • Runtime environment
    • Compute resource requirements
  • Metadata
    • Authorship, contact information, etc

aligner.wdl:

version 1.0

workflow alignReads {
  call bwa_align
  output { File output_sam = bwa_align.output_sam }
}

task bwa_align {
  input {
    String sample_name
    String docker_image
    String? bwa_options
    Int memory_gb
    File read1_fastq
    File read2_fastq
    File ref_fasta
    File ref_fasta_fai
    File ref_fasta_amb
    File ref_fasta_ann
    File ref_fasta_bwt
    File ref_fasta_pac
    File ref_fasta_sa
  }

  String output_sam_name = sample_name + ".sam"

  command {
    bwa mem ${bwa_options} ${ref_fasta} \
      ${read1_fastq} ${read2_fastq} > ${output_sam_name}
  }

  output { File output_sam = "${output_sam_name}" }

  runtime {
    docker: docker_image
    memory: "${memory_gb}" + "GB"
  }

  meta {
    author: "Foo Bar"
    email: "foobar@university.edu"
  }
}

86 of 115

Example: Metrics.wdl (samtools flagstat)

Same as the solution to exercise #2

metrics.wdl:

version 1.0

workflow metrics {
  call Flagstat
  output { File align_metrics = Flagstat.metrics }
}

task Flagstat {
  input {
    File input_sam
    String docker_image
  }

  # create a string to help parameterize the command
  String stats = basename(input_sam) + ".metrics"

  command {
    samtools flagstat ${input_sam} > ${stats}
  }

  output { File metrics = "${stats}" }

  # set a parameterized docker runtime
  runtime {
    docker: docker_image
  }
}

87 of 115

Exercise#3: Writing a multi-task workflow (if time permits)

Create a multi-task workflow: align_and_metrics.wdl

  • aligner.wdl
    • Generates a sam file from given fastq sequence files and reference files
  • metrics.wdl
    • Generates alignment statistics of a given sam/bam file.
  • You may create align_and_metrics.wdl with or without imports. A skeleton is provided to you for each type.


88 of 115

Importing workflows: Best Practices & Tips (Take home)

  • Each import should perform a specific action and be runnable on its own (given the correct input)
  • Use descriptive aliases for namespaces
  • Use descriptive names for tasks and workflows
  • Imports can be local file paths or HTTP(s) paths
    • HTTP(s) example: grabbing a raw file from a repository like GitHub
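For example, an HTTP(s) import of a raw file hosted on GitHub might look like this (the URL is a placeholder):

import "https://raw.githubusercontent.com/<org>/<repo>/<branch>/HelloWorld.wdl" as hello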


89 of 115

Importing workflows: Caveats (Take home)

  • Not all workflow engines or platforms support imports (ex. DNAnexus)
  • Levels of support also vary; some features don't work
    • Ex. Terra only recently started supporting local file path imports, and only if the descriptor is in GitHub
  • Learn more about which platforms support imports:

https://docs.dockstore.org/en/develop/end-user-topics/language-support.html


90 of 115

Metadata

  • meta section within the workflow section
  • Optional key/value pairs
  • Useful for author, email, description

Parameter Metadata

  • parameter_meta section within a task
  • Optional key/value pairs
  • Describe parameters
  • Key must map to a task input or output

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
}
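A sketch of the same task with both metadata sections filled in (the values are illustrative):

task doSomething {
  input { File myName }

  parameter_meta {
    myName: "Text file whose contents are appended to the greeting"
  }

  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }

  output {
    File outFile = "Hello.txt"
  }

  meta {
    author: "Foo Bar"
    email: "foobar@university.edu"
  }
}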

91 of 115

More trainings and tutorial content:


92 of 115

Summary, Dockstore, and next steps!


93 of 115

What is Dockstore?

Dockstore is a free and open source platform for sharing scientific tools and workflows.

Portability

Interoperability

Reproducibility


“An app store for bioinformatics”

94 of 115

Dockstore Ecosystem

(Diagram: Source Control, Docker Registries, and Analysis Environments surrounding Dockstore.)

Store your containers and descriptors on your preferred sites.

Register these as tools and workflows on Dockstore, allowing for a centralized catalog of bioinformatics resources.

Dockstore's launch-with feature enables users to export tools and workflows to a variety of cloud compute platforms.

95 of 115

Partner Platforms

  • DNAstack
  • DNAnexus
  • Terra
  • CGC : Cancer Genomics Cloud (Seven Bridges)
  • AnVIL
  • BioData Catalyst

Launching Analysis

Structural Variant Calling using Graph Genomes

Contributed by: Jean Monlong and Charles Markello (VG Team, UC Santa Cruz, Genomics Institute)

96 of 115

Launching Analysis - Example


Cumulus Workflow: https://dockstore.org/workflows/github.com/klarman-cell-observatory/cumulus/Cumulus:0.15.0?tab=info

AnVIL Organization, Cumulus Collection: https://dockstore.org/organizations/anvil/collections/Cumulus

Contributed by: Bo Li & Yiming Yang (Cumulus Team, Broad Institute)

97 of 115

General Best Practices

  • Include authorship and contact information in the primary descriptor file
  • Include a workflow description either in the README or the descriptor file
  • Use releases/tags (ex. v1.1) instead of branches for versioning
  • Test parameter files associated with workflows to provide samples that users can try out
    • Use publicly available data
    • Real world examples or simple test data
  • Add labels to your workflow to improve findability
  • Use checker workflows to test compatibility with different environments
  • Use the snapshot and DOI feature to improve reproducibility


98 of 115

DOIs

Create snapshots and digital object identifiers for your workflows to permanently capture the state of a workflow for publication

Creating Snapshots and Requesting DOIs — Dockstore documentation

Examples:

Forward: https://doi.org/10.5281/zenodo.3889018

Backward: https://dockstore.org/workflows/github.com/dockstore/hello_world:master?tab=versions


99 of 115

Organizations


Landing page to showcase tools and workflows

  • You can organize and collect workflows from other people
  • The same workflow may appear in multiple organizations, based on lab, funding source, etc.
  • Organizations and Collections — Dockstore documentation

Example: a COVID-19 collection that submits to Nextstrain https://dockstore.org/organizations/BroadInstitute/collections/pgs

  • Contributed by: Daniel Park (Viral Genomics Group, Broad Institute)

100 of 115

Getting Help on Dockstore

User forum at https://discuss.dockstore.org/

  • Topics embedded with each tool, workflow, and documentation page.
  • Talk about bioinformatics, workflows, and get help on development


101 of 115

Documentation and Tutorials

  • Example Topics:
    • Launching Tools and Workflows
    • Writing checker workflows
    • Developing File Provisioning Plugins
    • Creating Organizations
    • And many more!


102 of 115

Dockstore Ecosystem


Dockstore is thankful to its many contributors, users, and partners. This community has pulled together a library of over 700 tools and workflows. In the diagram to the right we've highlighted a few select contributors to give a sense of what has been occurring in this space.

103 of 115

The Dockstore Team


Louise Cabansay

Natalie Perez

Melaina Legaspi

Charles Reid

Emily Soth

Andy Chen

Benedict Paten

Elnaz Sarbar

Charles Overbeck

Walt Shands

David Steinberg

Nneka Olunwa

Lincoln Stein

Denis Yuen

Andrew Duncan

Gary Luu

Gregory Hogue

104 of 115

Acknowledgements


This work was funded by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-168).

Funded by:

105 of 115

Extra Slides for Q&A


106 of 115

Additional Readings

Note: -v has historically been how volumes are mounted, however --mount is an equivalent option with a different syntax
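For example, the bind mount from the earlier exercise could equivalently be written with --mount (a sketch):

docker run --mount type=bind,source=/usr/data,target=/tmp/data ubuntu:18.04 ls /tmp/data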


107 of 115

Exercise #1a: Using Docker

Docker has a whole library of commands; here are some basic examples:

docker info
    Display system-wide information about your installation of Docker.

docker image help
    Manage Docker images.

docker container help
    Manage Docker containers.

docker container run hello-world
    Run the official hello-world docker container from Docker Hub.

108 of 115

Exercise #1b: Explore the Dockstore CLI (Take Home)

Make a JSON template based off a local WDL:

dockstore workflow convert wdl2json --wdl hello-task.wdl > convert.json

Make a JSON template based off a descriptor located remotely on Dockstore:

dockstore workflow convert entry2json --entry [ dockstore identifier ] > [ parameter.json ]

Run a descriptor located remotely on Dockstore:

dockstore workflow launch --entry [ dockstore identifier ] --json [ parameter.json ]

109 of 115

Scatter Gather ( take home reading )

Scatter

  • Given an array of values, run the same task on each value in parallel (ex. Array of Files)

Gather

  • Collect the results of running each scatter command in an array

Beginner Example - Scatter Gather Pipeline

Advanced Example - Use scatter-gather to joint call genotypes
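A minimal sketch of scatter-gather in WDL, reusing the flagstat task from the earlier exercises (task definition omitted):

workflow scatter_metrics {
  input { Array[File] input_sams }

  # scatter: run flagstat on every file in parallel
  scatter (sam in input_sams) {
    call flagstat { input: input_sam = sam }
  }

  # gather: outputs of a scattered call are collected into an array
  output { Array[File] all_metrics = flagstat.metrics }
}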


110 of 115

What’s in a WDL? Top-level Summary

Workflow: Code block that defines the overall workflow.

  • think of it as an ‘outline’

Call: Defines which tasks to run

  • can also specify input parameters to pass to that task.
  • Located within the workflow block

Task: Defines all the information necessary to perform an action.

  • Tasks to run are specified by a ‘call’ inside the workflow block
  • Full definition of task is done outside of the workflow block

3 top-level components are part of the core structure of a WDL script:

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  call task_A
  call task_B {
    input: ...
  }
  output {...}
}

task task_A { ... }
task task_B { ... }

111 of 115

What’s in a Task? Summary

Task: Defines all the information necessary to perform an action in a parameterized way.

  • Command (‘the action’)
    • the shell command(s) that will be executed when the task is run
  • Inputs and Outputs
    • All inputs and outputs must be typed (ex: string, int, file, etc)
  • Non-input declarations
    • Intermediate variables to help run task
  • Runtime
    • Defines context/environment
      • container
      • compute resources

task.wdl:

task doSomething {
  input {
    File myName
    String outFile
    String docker_image
    Int memory_gb
  }
  command {
    echo Hello World! > ${outFile}
    cat ${myName} >> ${outFile}
  }
  output {
    File out = "${outFile}"
  }
  runtime {
    docker: docker_image
    memory: "${memory_gb}"
  }
}

112 of 115

Summary

A workflow:

  • A workflow block (with inputs and outputs)
  • A call section to define which task(s) to run
  • One or more task(s) defining what the workflow will do
  • A meta section

A task:

  • An input section (required if the task will have inputs)
  • Non-input declarations (as many as needed, optional)
  • A command section (required)
  • A runtime section (optional)
  • An output section (required if the task will have outputs)
  • A parameter_meta section (optional)


113 of 115

Ways to Register to Dockstore

(Diagram: two paths into Dockstore.)

Containerized Tool: a Dockerfile hosted in external source control is built by a build system into a Docker image; on Dockstore, register the tool descriptor and point to the docker image(s) on Quay or Docker Hub.

Workflow (Tools + Descriptor): register workflow and tool descriptors from external source control.

As of Dockstore 1.9.0, you can install the Dockstore GitHub App to automatically update Dockstore when a workflow is updated on GitHub.

114 of 115

Dockstore Ecosystem

(Same Dockstore ecosystem diagram and summary as shown earlier: store your containers and descriptors on your preferred sites, register them on Dockstore, and launch with your preferred analysis environment.)

115 of 115

Language Support
