1 of 115

Dockstore Fundamentals:

Introduction to Docker and Descriptors for Reproducible Analysis

Louise Cabansay, Software Engineer, UC Santa Cruz Genomics Institute

Andrew Duncan, Software Engineer, Ontario Institute for Cancer Research

Denis Yuen, Senior Software Engineer, Ontario Institute for Cancer Research


2 of 115

Learning Objectives

  • What is Dockstore?
  • Introduction to Docker
  • Introduction to Descriptors
    • Overview of descriptor languages (CWL, WDL, Nextflow)
    • Practice using WDL
  • Dockstore Usage
    • Key features
    • Best practices
  • Our overall goal today is to give you a good foundation in the basics of Dockstore. We will provide a variety of take-home materials to guide you on where to go next.


3 of 115

Format and Setup

  • Lecture + Examples (slides)
  • Q/A:
    • #dockstore-fundamentals on Discord can also be a great place for questions
    • Click “yes” in Zoom when finished with an exercise
    • Introduce TAs


4 of 115

Format and Setup

  • Hands-on practice (Instruqt)
    • Message us on Discord if you did not get the link
    • Browser-based tutorial environment for your exercises
    • Close the sidebar in Instruqt for additional screen space


Chrome Browser Required (currently a bug on Instruqt w/ Firefox)

Download: https://www.google.com/chrome/

5 of 115

What is Dockstore?

Dockstore is a free and open source platform for sharing scientific tools and workflows.

Portability

Interoperability

Reproducibility


“An app store for bioinformatics”

6 of 115

Portability:


Software is “packaged” using container technology and described using descriptor languages

  • Analysis can be moved from environment-to-environment (local machines, clouds, servers) and yet be guaranteed to run on anything that supports Docker

(Diagram: Container + Descriptor)

7 of 115

What’s on Dockstore? Tools and Workflows

Tool: Container + Descriptor

Workflow: Tools + Descriptor

A tool uses a single container and performs a single action or step that is outlined by a descriptor.

A workflow can use multiple containers and execute multiple actions or steps, still outlined by a descriptor.

8 of 115

Example:

(Diagram: BWA-MEM shown as a Tool, a single container bundling Debian OS, binaries, GNU Make, GCC, and zlibs, performing one step (ex: alignment). A Variant Calling Pipeline shown as a Workflow with multiple steps: processing, alignment, call variants.)

A tool uses a single container and performs a single action or step that is outlined by a descriptor.

A workflow can use multiple containers and execute multiple actions or steps, still outlined by a descriptor.

*note: you can register a single tool as a workflow on Dockstore, but a multi-step workflow cannot be registered as a tool.

ex: BWA-MEM can technically be registered as both

9 of 115

Interoperability:

(Diagram: Source Control, Docker Registries, and Analysis Environments surrounding Dockstore.)

Integration with various sites allows Dockstore to function as a centralized catalog of bioinformatics tools and workflows.

By following GA4GH API standards, Dockstore enables users to run tools and workflows in their preferred compute and analysis environments.

700+ tools and workflows published to Dockstore

10 of 115

Reproducibility: Create, Share, Use

  • Dockstore is a place for researchers and developers to share their work so that others can also use it
  • The combination of containers and workflow languages minimizes redundant and error-prone installation
    • Increases the transparency of analysis methods
    • Allows others to verify results and apply existing methods into their own research
  • Other Dockstore features to increase reproducibility:
    • Versioning, snapshots, and metadata handling
    • Generating Digital Object Identifiers (DOIs) via Zenodo
    • Organizations and Collections for sharing and findability
    • and more!


11 of 115

Docker Basics

*as used on Dockstore


12 of 115

What is a container? What is Docker?


Container:

A container encapsulates all the software dependencies associated with running a program.

  • Allows for portable software that runs quickly and reliably from one computing environment to another.

Docker:

A particularly popular brand of container

  • Some users may wish to explore alternatives like Singularity

13 of 115

What kinds of problems are solved by containers?

  • Installation problems
    • Software was built on a different OS (executable files don’t run)
    • Install documentation is unclear or out of date
  • Dependency problems
    • Software requires different version than what is available on machine (ex. Java, Python)
    • Multiple programs have shared dependencies, but different versions of those dependencies
  • Portability problems
    • Software can be run on any host OS that has Docker installed


14 of 115

Docker Concepts: Container, Image, Registry


Image:

Packaged-up code with all of its dependencies, at rest.

  • Allows for portable software that runs quickly and reliably from one computing environment to another.

Official Image

  • Official images on Docker Hub are regularly updated and scanned for vulnerabilities

Registry:

Repositories where users can store images privately or publicly in the cloud.

Dockstore itself does not host images, but rather gets them from Image Registries:

  • Docker Hub
  • Quay.io
  • Google Container Registry (GCR)

Container:

A running image

  • Packaged up, isolated software environment
  • All dependencies included

**the terms container and image are often used interchangeably, but there is a slight distinction.
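For example, pulling an official image from a registry and confirming it is stored locally (a minimal sketch; the tag is just an illustration):

docker pull ubuntu:18.04    # download the official ubuntu image from Docker Hub
docker image ls             # the image now appears in your local image list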

15 of 115

Docker Ecosystem

(Diagram: the Docker CLI on the Host Machine talks to the Docker Daemon, which manages local Images and Containers. docker pull and docker build bring images onto the host, docker run starts containers, and images move to and from an Image Registry: Docker Hub, Quay.io, GCR.)

16 of 115

Docker Ecosystem

(Same diagram, highlighting docker pull: fetch an image from a registry onto the host machine.)

17 of 115

Docker Ecosystem

(Same diagram, highlighting docker run: start a container from a local image.)

Note: the docker run command will also pull from a registry if the image doesn't already exist on the machine.

18 of 115

Start Up Instruqt


Wait for the 'Start' button to show up.

19 of 115

Docker Client (CLI)

A command-line utility for:

  • downloading and building Docker images
  • running Docker containers
  • managing both images and containers (think disk space, cleanup)


docker [sub-command] [-flag options] [arguments]
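For instance, a few concrete invocations of this pattern (all standard Docker commands):

docker info                  # show system-wide information
docker image ls              # list images stored locally
docker container ls -a       # list containers, including stopped ones
docker system prune          # clean up stopped containers and dangling images (reclaims disk space)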

20 of 115

Basic Docker Sub Commands:

Docker has a whole library of commands; here are some basic examples:

docker info [OPTIONS]
    Display system-wide information about your installation of Docker.

docker image [COMMAND]
    Manage Docker images.

docker container [COMMAND]
    Manage Docker containers.

docker run [-flags] [registry name]/[path to image repository]:[tag] [arguments]
    Run Docker containers.

21 of 115

How are containers commonly used?

Run and done

  1. Execute docker run command
  2. Actions executed in container (stuff happens)
  3. Container stops running after completion


22 of 115

How are containers commonly used?

Run and done

  • Execute docker run command
  • Actions executed in container (stuff happens)
  • Container stops running after completion

Run continuously

  1. Execute docker run command
  2. Container starts and runs in the background continuously
  3. Other processes can interact with the container (stuff happens)
  4. Container keeps running unless it is stopped


This will be your main way of using containers!
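As a sketch of the 'run continuously' pattern (nginx is a stand-in image, not part of today's exercises):

docker run -d --name myserver nginx    # -d (detached) starts the container in the background
docker ps                              # the container is still running
docker stop myserver                   # stops it; otherwise it keeps running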

23 of 115

How do I ‘run’ a container?

docker run [-flag options] [registry name]/[path to image repository]:[tag] [arguments]

  • run command: the base of every invocation.
  • -flag options: additional options (ex. --name to specify a name for the container).
  • registry name: dockerhub (the default), quay.io, or gcr.io; only specify the registry if it's not Docker Hub.
  • path to image repository: both official containers and user containers are available.
  • tag: the 'version' of the image you want to run.
  • arguments: generally, what gets passed into the container (ex: the command you want to run).
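Putting the pieces together, a hypothetical invocation using the samtools image that appears later in this training:

# [-flags]        [registry]/[image repository]:[tag]     [arguments]
docker run --name mytools quay.io/ldcabansay/samtools:latest samtools --version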

24 of 115


25 of 115

Exercise #1a: Running containers

Exercise:

Use the whalesay container from Docker Hub to print a welcome message:

docker run docker/whalesay cowsay "fill me in"

docker run [-flag options] [registry name]/[path to image repository]:[tag] [arguments]

26 of 115

Exploring containers interactively

docker run [-flag options] [registry name]/[path to image repository]:[tag] [arguments]

Example: enter the samtools container and confirm that samtools is installed:

docker run -it quay.io/ldcabansay/samtools:latest

  • The '-it' flags drop you inside the container
  • (-i) keeps STDIN open for interactive use and (-t) allocates a terminal
  • You can then interact with the container's terminal like a normal command line

27 of 115

Sharing data between host and container

Bind mounts (-v) (aka two-way data binding)

  • Map a host directory to a directory in a container
  • Any files added to either directory are available in the other

docker run -v [ host path where data is ]:[ container path to put data ] ...

docker run -v /usr/data:/tmp/data ...

Output stored in the container directory /tmp/data will also be available on the host at /usr/data.

Note: using absolute paths is highly recommended
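A quick sketch of the two-way binding (paths are illustrative):

touch /usr/data/example.txt                                  # create a file on the host...
docker run -v /usr/data:/tmp/data ubuntu:18.04 ls /tmp/data  # ...and it is visible inside the container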

28 of 115

Exercise #1b: exploring containers

docker run -v [ host path ]:[ container path ] ...

Exercises:

1. Enter the samtools container, but this time bring in some data!

docker run -it -v /root/bcc2020-training/data:/data quay.io/ldcabansay/samtools:latest

2. Convert a sam file to a bam file using the samtools container:

docker run -v /root/bcc2020-training/data:/data quay.io/ldcabansay/samtools:latest samtools view -S -b /data/mini.sam -o /data/mini.bam

29 of 115

Data binding: In-depth (Extra Reading)

  • Useful tips and tricks to know about data binding
  • Advanced features and caveats to keep in mind
  • Read more here: https://docs.docker.com/storage/


30 of 115

Dockerfiles: Custom Images

  • Sometimes an existing image isn’t available for the software we want to use or an existing image may be lacking something we require
  • Dockerfiles are used to create our own custom images
    • Starts from a base image
    • Contains a series of steps that set up our environment
  • We can then also share these custom images via an image registry


31 of 115

Primer: How is software installed and used?

The author of a Dockerfile programmatically details the software installation and any other steps for environment setup. An image built from the Dockerfile can then be used 'off-the-shelf' by others.

Package managers
  • Managed collection of software with automated install, upgrades, and removal

Executable files or binaries (ex: *.jar, grep, tar, diff, md5sum)
  • Software that has already been built or compiled into an executable file

Building or running from source files
  • Compiling from source builds an executable (requires a compiler & dependencies)
  • For languages that don't need compiling (Python), the necessary runtime dependencies must be present in the environment

32 of 115

Dockerfiles Overview:

A simple text file with instructions to build an image:

  • Plain-text, one-instruction-per-line syntax (FROM, RUN, ENV, ...)
  • Start from a base image
  • Metadata
  • Install software and dependencies
  • Set up scripts
  • Other environment prep
  • Define commands to run when the container starts

Dockerfile instructions:

FROM          base image (start)
MAINTAINER    author metadata
RUN           commands to: install software, install dependencies, run scripts, misc. environment setup
ENV           environment variables
CMD           command to execute when the container starts (optional)

33 of 115

Dockerfiles - local

(Diagram: a Dockerfile on the Host Machine is fed to docker build, which has the Docker Daemon create a local Image; docker run then starts Containers from it, and docker pull fetches existing images.)

Dockerfile: configuration to set up a docker image.

34 of 115

Dockerfiles - local

(Same diagram, highlighting docker build: build a local image from your Dockerfile.)

35 of 115

Example: BWA (via package manager)

  • Images of containers are built from Dockerfiles
  • Dockerfiles describe the packaged up environment:
    • operating system or base image to build upon
    • dependencies needed for the software
    • the actual analysis software

Dockerfile:

#######################################################
# Dockerfile to build a sample container for bwa
#######################################################

# Start with a base image
FROM ubuntu:18.04

# Add file author/maintainer and contact info (optional)
MAINTAINER Louise Cabansay <lcabansa@ucsc.edu>

# set user you want to install packages as
USER root

# update package manager & install dependencies (if any)
RUN apt update

# install analysis software from package manager
RUN apt install -y bwa

36 of 115

Basic Commands: docker build

docker build [-flag options] [build context]

docker build -t bwa:v1.0 .
    ( -t ) : builds an image and tags it bwa:v1.0 (run from the directory containing the Dockerfile)

docker build -t bwa:v1.0 -f dockerfiles/bwa/Dockerfile .
    ( -f ) : builds a specific Dockerfile by providing its path (relative to the build context)

docker image ls
    View built Docker images

37 of 115


38 of 115

Exercise #2a: Writing your first Dockerfile: tabix

Dockerfiles describe the packaged up environment:

  • operating system or base image to build upon:
    • ubuntu:18.04
  • Install dependencies needed: N/A
  • Install actual analysis software:
    • tabix

Dockerfile template:

# Start with a base image
FROM { base image name }

# Add file author/maintainer and contact info (optional)
MAINTAINER {your name} <youremail@research.edu>

# set user you want to install packages as
USER root

# update package manager & install dependencies (if any)
RUN apt update

# install analysis software from package manager
RUN apt install -y { software package name }

Build an image from the Dockerfile:

docker image build -t { name } -f { path to dockerfile } .

39 of 115

Exercise #2b: Try out your new container!

Exercises:

1. Verify that your image was built (get the image ID to use in part 2):

docker image ls

2. Use your local image to view the tabix command help:

docker run [image id] tabix

docker container run [-flag options] [registry name]/[path to image repository]:[tag] [args]

40 of 115

Ex: bamstats (executable)

  • Dockerfiles describe the packaged up environment:
    • operating system or base image to build upon
    • dependencies needed for the software
    • the actual analysis software
      • Here it’s an already compiled executable
    • scripts or commands to complete software and environment set-up

Dockerfile:

# Start with a base image
FROM ubuntu:14.04

# Add file author/maintainer and contact info (optional)
MAINTAINER Brian OConnor <briandoconnor@gmail.com>

# install software dependencies
USER root
RUN apt-get -m update && apt-get install -y wget unzip \
    openjdk-7-jre zip

# manual software installation from source
# get the tool and install it in /usr/local/bin
RUN wget -q http://downloads.sourceforge.net/project/bamstats/BAMStats-1.25.zip

# commands/scripts to finish software setup
RUN unzip BAMStats-1.25.zip && \
    rm BAMStats-1.25.zip && \
    mv BAMStats-1.25 /opt/
COPY bin/bamstats /usr/local/bin/
RUN chmod a+x /usr/local/bin/bamstats

# switch back to the ubuntu user so this tool (and the files written) are not owned by root
RUN groupadd -r -g 1000 ubuntu && useradd -r -g ubuntu -u 1000 -m ubuntu
USER ubuntu

# command /bin/bash is executed when container starts
CMD ["/bin/bash"]

41 of 115

Example: samtools (compile from source files)

  • Images of containers are built from Dockerfiles
  • Dockerfiles describe the packaged up environment:
    • operating system or base image to build upon
    • dependencies needed for the software
    • the actual analysis software
    • scripts or commands to complete environment set-up

Dockerfile:

# Start with a base image
FROM ubuntu:18.04

# Add file author/maintainer and contact info (optional)
MAINTAINER Louise Cabansay <lcabansa@ucsc.edu>

# install software dependencies
RUN apt update && apt -y upgrade && apt install -y \
    wget build-essential libncurses5-dev zlib1g-dev \
    libbz2-dev liblzma-dev libcurl3-dev

WORKDIR /usr/src

# get the software source files, then compile them
RUN wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2 && \
    tar xjf samtools-1.10.tar.bz2 && \
    rm samtools-1.10.tar.bz2 && \
    cd samtools-1.10 && \
    ./configure --prefix $(pwd) && \
    make

# add newly built executables to path
ENV PATH="/usr/src/samtools-1.10:${PATH}"

42 of 115

Sharing your Dockerfiles and Images

A Dockerfile contains the configuration to package up your software into an image.

(Diagram: Dockerfile → Source Control → Image Registry)

Dockstore recommends storing your Dockerfile in an external repository (Bitbucket, GitHub, GitLab) and then registering your source controlled Dockerfile to an image registry (Docker Hub, Quay.io, Google Container Registry, etc)
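For example, pushing a locally built image to Docker Hub might look like this (the account and repository names are placeholders):

docker tag bwa:v1.0 myaccount/bwa:v1.0    # re-tag the local image under your registry account
docker login                              # authenticate to Docker Hub
docker push myaccount/bwa:v1.0            # upload the image so others (and Dockstore) can reference it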

43 of 115

Best Practices (Take home reading)

  • Start from official images ( https://docs.docker.com/docker-hub/official_images/ )
  • Only add what you need (keep containers light)
    • One way to do this is to use multiple containers in your workflows
    • Recommend not including large reference data within containers
      • Instead provide that data at runtime through data binding or via platform
  • Version your containers for re-use later (use --tag when building)


44 of 115

What Next?

Docker is great: it tells us how to install software.

However, it doesn’t tell us how to use software.

Descriptor languages are the solution!


45 of 115

Break


46 of 115

What Next?

Docker is great: it tells us how to install software.

However, it doesn’t tell us how to use software.

Descriptor languages are the solution!


47 of 115

Intro to Descriptors (WDL)


48 of 115

Components and Concepts shared by Descriptors


Descriptor:

A workflow language used to describe how to run your pipeline.

  • Which containers
  • What steps and when
  • Define parameters
    • I/O data
    • compute requirements
  • Metadata

Parameter File (WDL, CWL):

  • Specifies the actual input/output files (local, ftp, http, or cloud)
  • Set compute resources
  • JSON, YAML

Container:

Packaged up code with all of its dependencies. This allows for portable software that runs quickly and reliably from one computing environment to another.

49 of 115

CWL: Common Workflow Language


  • Open and portable standard for describing analysis workflows and tools
  • Alternative to WDL/nextflow that might be present in your lab/cloud
  • The upcoming tutorial example is also available in CWL

Implementations/Engines:

  • CWL doesn’t have a single official engine
  • CWLtool - reference implementation
  • Other implementations include Arvados, Toil, and Cromwell

Analysis Platforms (Launch-with)

  • Seven Bridges Cancer Genomics Cloud (CGC)

50 of 115

Nextflow


  • A fluent domain-specific language, also for scientific workflows using software containers
  • View it as an alternative to CWL/WDL that might be present in your lab/cloud
  • The upcoming tutorial example is also available in Nextflow

Running nextflow workflows

  • Works on local machines, HPC, AWS, Google Cloud
  • Cloud support via Seqera Labs

51 of 115

WDL: Workflow Description Language


  • A human-readable and -writable descriptor language

Engines:

  • Cromwell: “A Workflow Management System geared towards scientific workflows”
    • The first execution engine that understands WDL
  • Other engines: Toil, miniWDL

Analysis Platforms (Launch-with)

  • Terra (AnVIL, BioDataCatalyst)
  • DNAstack
  • DNAnexus
  • Also works on local machines, HPC, and Cloud

52 of 115

What’s in a WDL? Top-level Components

3 top-level components that are part of the core structure of a WDL script

  • Workflow
  • Call
  • Task


53 of 115

Top-level Components - Workflow

Workflow: Code block that defines the overall workflow. You can think of it as an outline.

workflow.wdl:

workflow myWorkflowName {
}

54 of 115

Top-level Components - Workflow

Workflow: Code block that defines the overall workflow. You can think of it as an outline.

  • Inputs (optional)
    • Ex: use when building more complex workflows that will re-use inputs for multiple purposes

  • Outputs
    • Specify the output you want to keep from a run of the entire workflow
    • Signals to Cromwell to keep track of these outputs and save them somewhere

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  output {...}
}
55 of 115

Top-level Components - Call

Call: Component that defines which tasks the workflow will run

  • Located within workflow block
  • can also specify input parameters to pass to that task (optional)

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  call task_A
  call task_B {
    input: ...
  }
  output {...}
}

56 of 115

Top-level Components - Task

Task: Defines all the information necessary to perform an action.

  • Tasks are referred to within a call, but are actually defined outside of the workflow block

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  call task_A
  call task_B {
    input: ...
  }
  output {...}
}

task task_A { ... }
task task_B { ... }

57 of 115

What’s in a Task?

Task: Defines all the information necessary to perform an action

task.wdl:

task doSomething {
}

58 of 115

What’s in a Task? Command

Task: Defines all the information necessary to perform an action

  • Command (‘the action’) - required!
    • Defines the command(s) that will be run in the execution environment
    • Can be multiple lines/commands

task.wdl:

task doSomething {
  command {
    echo Hello World!
    cat ${myName}
  }
}

59 of 115

What’s in a Task? Inputs

Task: Defines all the information necessary to perform an action

  • Inputs
    • Optional, only required if the task will have inputs
    • All inputs must be typed
      • string, int, file, etc
    • Individual inputs can also be optional, denoted by '?' (see below)
    • Can also set a default value (see below)

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
}

Optional input:   input { File? myName }
Default value:    input { String myName = "Foobar" }

60 of 115

What’s in a Task? Outputs

Task: Defines all the information necessary to perform an action

  • Outputs
    • The outputs section defines which values should be exposed as outputs after a successful run of the task
      • Especially useful when using output of one task as input of another
    • Outputs must be typed
      • ex: string, int, file, etc

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
}

61 of 115

Simple Example: HelloWorld.wdl

HelloWorld.wdl:

version 1.0

# add and name a workflow block
workflow hello_world {
  call hello

  # important: add output for whole workflow
  output {
    File helloFile = hello.outFile
  }
}

# define the 'hello' task
task hello {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output { File outFile = "Hello.txt" }
}

62 of 115

Parameter JSON (simple):

  • A parameter JSON specifies the actual input values to fill-in for the WDL parameters when running the workflow

The key is structured as {workflow name}.{task name}.{parameter name}; the value can be a path, string, int, array, etc.

hello.json:

{
  "hello_world.hello.myName": "/<usr>/bcc2020/wdl-training/exercise1/name.txt"
}

Note: using absolute paths is highly recommended

63 of 115


64 of 115

Exercise #1: Run your first wdl

  • Single task workflow
    • Workflow block not required

HelloWorld.wdl:

version 1.0

workflow hello_world {
  call hello
  output { File helloFile = hello.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output { File outFile = "Hello.txt" }
}

Run with the Dockstore CLI:

dockstore workflow launch --local-entry HelloWorld.wdl --json hello.json

65 of 115

Overview: Dockstore CLI

A handy command-line resource to help users develop content locally.

  • Run descriptors by automatically calling to Cromwell or CWLtool
    • Local descriptors
    • Remote descriptors pulled from Dockstore
  • Generate a JSON parameter template based on a given descriptor
  • Built-in plugins enable fetching remote input data (http, s3, gs)


Example execution with the Dockstore Command Line Interface (CLI):

dockstore workflow launch --local-entry HelloWorld.wdl --json hello.json

66 of 115

What’s in a Task? Runtime

Task: Defines all the information necessary to perform an action

  • Runtime
    • Defines context/environment
      • docker containers
      • compute resources

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
  runtime {
    docker: "ubuntu:latest"
    memory: "1GB"
  }
}

    • For cloud platforms, runtime parameters can be used to allocate/configure resources when spinning up a compute instance
      • CPU
      • Disk
      • Memory
      • Instance types
      • Region
      • Preemptibility
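A sketch of a more fully specified runtime section, using attribute names supported by Cromwell on Google Cloud (available attributes vary by engine and backend):

runtime {
  docker: "ubuntu:latest"
  cpu: 2                          # request 2 CPU cores
  memory: "4 GB"
  disks: "local-disk 50 HDD"      # attach a 50 GB disk
  preemptible: 1                  # allow one attempt on a cheaper preemptible instance
}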

67 of 115

What’s in a Task? Parameterization

Task: Defines all the information necessary to perform an action

  • Parameterized values
    • Variables that define placeholder requirements
    • flexibility, reuse, repurposing
    • Pretty much everything can be parameterized
      • Commands
      • Inputs, outputs
      • Runtime requirements

task.wdl:

task doSomething {
  input {
    File myName
    String outFile
    # parameterized runtime values also come in as inputs
    String docker_image
    Int memory_gb
  }
  command {
    echo Hello World! > ${outFile}
    cat ${myName} >> ${outFile}
  }
  output {
    File out = "${outFile}"
  }
  runtime {
    docker: docker_image
    memory: "${memory_gb}"
  }
}

But is this always a best practice?

68 of 115

A task can have declarations which are intermediate values rather than inputs.

  • Think of these as variables that you can define to help execute a task
  • Typed: String, Int, File, etc
  • Can use or build upon:
    • the input values given to task
    • methods from WDL standard library
    • other non-input declarations
  • After being defined, they can be used in other sections: command, outputs, runtime

task.wdl:

task doSomething {
  input { String myName }

  # creating non-input declarations
  String myString = "hi " + myName
  String outFile = myName + ".out"

  # example usage in command
  command {
    echo ${myString} > ${outFile}
  }

  # example usage in output
  output {
    File out = "${outFile}"
  }
}

69 of 115

WDL Standard Library (Take Home)

Built-in functions or methods provided by the core WDL language

  • Helpful in writing more complex workflows
  • Examples:
    • File handling (read, write different kinds files: JSON, tsv, etc )
    • Working with stdin, stdout
    • Data manipulation/handling: arrays, objects, strings, etc
    • Mapping values
    • Much more!
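As a small sketch, a task that leans on two functions seen elsewhere in this training, basename() and stdout() (the task itself is illustrative):

task count_lines {
  input { File input_sam }

  # basename() strips the directory from a path
  String name = basename(input_sam)

  command {
    echo ${name}
    wc -l ${input_sam}
  }

  output {
    # stdout() captures the command's standard output as a File
    File report = stdout()
  }
}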


70 of 115

WDL Standard Library (Simple)

task.wdl (writing output to a file):

task hello {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
}

Output: "hello.outFile" : ".../Hello.txt"

task.wdl (capturing stdout via the standard library):

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output {
    File outFile = stdout()
  }
}

Output: "hello.outFile" : ".../stdout"

71 of 115

Primer Exercise #2

For our second exercise we’re going to parameterize a simple workflow.

Goal: generate statistics about an alignment file

  • Software: samtools (flagstat)
  • Input: sam file
  • Output: alignment statistics

There will be multiple ways to solve this assignment. This is a chance for you to apply the things we’ve learned to a real bioinformatics workflow.


72 of 115

Exercise #2: Complete metrics.wdl

  1. Set the runtime to use the samtools docker container: quay.io/ldcabansay/samtools:latest
  2. Parameterize the samtools command in the flagstat task
  3. (optional) If you make any new inputs, be sure to update metrics.json

Reminder, runtime syntax:

runtime {
  docker: "ubuntu:latest"
  memory: "1GB"
}

metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # non-parameterized flagstat command
  command {
    samtools flagstat mini.sam > mini.sam.metrics
  }

  output {
    File metrics = "mini.sam.metrics"
  }

  # set some parameterized runtime parameters
  runtime {
    docker: # set
  }
}

73 of 115

Exercise #2: Solution* - Descriptors

The skeleton, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # non-parameterized flagstat command
  command {
    samtools flagstat mini.sam > mini.sam.metrics
  }

  output {
    File metrics = "mini.sam.metrics"
  }

  # set some parameterized runtime parameters
  runtime {
    docker: # set
  }
}

One possible solution, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # slightly parameterized flagstat command
  command {
    samtools flagstat ${input_sam} > mini.sam.metrics
  }

  output { File metrics = "mini.sam.metrics" }

  # set docker runtime
  runtime {
    docker: "quay.io/ldcabansay/samtools:latest"
  }
}

*note: this is one example solution, multiple are possible

74 of 115

Exercise #2: Solution vs Solution2

Solution 1, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input { File input_sam }

  # slightly parameterized flagstat command
  command {
    samtools flagstat ${input_sam} > mini.sam.metrics
  }

  output { File metrics = "mini.sam.metrics" }

  # set docker runtime
  runtime {
    docker: "quay.io/ldcabansay/samtools:latest"
  }
}

Solution 2, metrics.wdl:

version 1.0

workflow metrics {
  call flagstat
  output { File align_metrics = flagstat.metrics }
}

task flagstat {
  input {
    File input_sam
    String docker_image
  }

  # create a string to help parameterize the command
  String stats = basename(input_sam) + ".metrics"

  command {
    samtools flagstat ${input_sam} > ${stats}
  }

  output { File metrics = "${stats}" }

  # set a parameterized docker runtime
  runtime {
    docker: docker_image
  }
}

*note: this is one example solution, multiple are possible

75 of 115

Break


76 of 115

Multi-task workflows:

  • In real use cases, workflows are typically multi-task pipelines that build upon individual steps
    • The output of one task can serve as the input of another
  • Each task in a workflow can have its own isolated runtime environment
    • Docker image
    • Task specific compute requirements


77 of 115

Example: Multi-task workflow

HelloWorld.wdl:

version 1.0

workflow hello_world {
  call hello
  output { File helloFile = hello.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output { File outFile = stdout() }
}

GoodbyeWorld.wdl:

version 1.0

workflow goodbye_world {
  call goodbye
  output { File byeFile = goodbye.outFile }
}

task goodbye {
  input { File greeting }
  command {
    cat ${greeting}
    echo See you later!
  }
  output { File outFile = stdout() }
}

78 of 115

Example: Multi-task workflow - HelloGoodbye.wdl

HelloGoodbye.wdl:

version 1.0

workflow HelloGoodbye {
  call hello
  call goodbye {
    input: greeting = hello.outFile
  }
  output { File hello_goodbye = goodbye.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output { File outFile = stdout() }
}

task goodbye {
  input { File greeting }
  command {
    cat ${greeting}
    echo See you later!
  }
  output { File outFile = stdout() }
}

HelloGoodbye.json:

{
  "HelloGoodbye.hello.myName": "/root/bcc2020-training/wdl-training/exercise3/hello_examples/name.txt"
}

79 of 115

WDL Imports

A WDL file may contain import statements to include WDL code from other sources.

  • Useful in keeping large multi-task workflows organized
  • Helpful way to reuse and build modular workflows


80 of 115

Imports: Concepts

  • Primary Descriptor
    • The ‘main’ descriptor file, anchors relevant imports
  • Sub-workflow aka sub-wdl
    • the imported, external WDL
  • Namespaces
    • prevents name collisions, organizes tasks/workflows into groups
  • Aliases
    • A custom name given to a namespace (optional)
    • Not specifying an alias defaults the namespace to the filename without '.wdl'

primary-descriptor.wdl:

import "<resource>" as <alias>

workflow primary {
  call task_A
  call <alias>.taskOne {
    input: ...
  }
}

task task_A { ... }

JSON mapping, keyed as {workflow name}.{task name}.{parameter name}:

"primary.taskOne.param_name": "<value of param or path if file>"

81 of 115

Example: No Imports vs Imports

HelloGoodbye.wdl (no imports):

version 1.0

workflow HelloGoodbye {
  call hello
  call goodbye {
    input: greeting = hello.outFile
  }
  output { File hello_goodbye = goodbye.outFile }
}

task hello {
  input { File myName }
  command {
    echo Hello World!
    cat ${myName}
  }
  output { File outFile = stdout() }
}

task goodbye {
  input { File greeting }
  command {
    cat ${greeting}
    echo See you later!
  }
  output { File outFile = stdout() }
}

HelloGoodbye_imports.wdl (with imports):

version 1.0

# add import statements to bring in sub-workflows
# if not given, namespace/alias = filename minus '.wdl'
import "HelloWorld.wdl"

# otherwise, namespace = <alias>
import "GoodbyeWorld.wdl" as bye

workflow HelloGoodbye {
  # call the hello task, syntax: <alias>.taskname
  call HelloWorld.hello

  # call the goodbye task, syntax: <alias>.taskname
  call bye.goodbye {
    input: greeting = hello.outFile
  }

  # same as before, define workflow outputs
  output { File hello_goodbye = goodbye.outFile }
}

82 of 115

Example: No Imports vs Imports

(Same side-by-side comparison as the previous slide.)

Do we have to change the JSON when running HelloGoodbye using imports?

83 of 115

Example: No Imports vs Imports (no comments)

HelloGoodbye.wdl (no imports): unchanged from the previous slides.

HelloGoodbye_imports.wdl:

version 1.0

import "HelloWorld.wdl"
import "GoodbyeWorld.wdl" as bye

workflow HelloGoodbye {
  call HelloWorld.hello
  call bye.goodbye {
    input: greeting = hello.outFile
  }
  output { File hello_goodbye = goodbye.outFile }
}

84 of 115

Primer for Exercise #3: (if time permits)

  • Aligner (bwa)
    • Align sequence files (FASTQs) to a reference
    • Produces an alignment file: sam
  • Metrics (samtools flagstat)
    • Evaluates an alignment file (sam or bam)
    • Reports statistics about the alignment
  • What we’ll make:
    • A workflow that does both of these tasks, first aligns, then generates statistics about the alignment
      • Without imports
      • With imports


85 of 115

Ex: BWA Aligner

  • Specify WDL version

  • Define workflow, call, and task(s)

  • Define parameters
    • Input and output

    • Parameterized command(s)

    • Runtime environment
    • Compute resource requirements
  • Metadata
    • Authorship, contact information, etc

aligner.wdl:

version 1.0

workflow alignReads {
  call bwa_align
  output { File output_sam = bwa_align.output_sam }
}

task bwa_align {
  input {
    String sample_name
    String docker_image
    String? bwa_options
    Int memory_gb
    File read1_fastq
    File read2_fastq
    File ref_fasta
    File ref_fasta_fai
    File ref_fasta_amb
    File ref_fasta_ann
    File ref_fasta_bwt
    File ref_fasta_pac
    File ref_fasta_sa
  }

  String output_sam_name = sample_name + ".sam"

  command {
    bwa mem ${bwa_options} ${ref_fasta} \
      ${read1_fastq} ${read2_fastq} > ${output_sam_name}
  }

  output { File output_sam = "${output_sam_name}" }

  runtime {
    docker: docker_image
    memory: "${memory_gb}" + "GB"
  }

  meta {
    author: "Foo Bar"
    email: "foobar@university.edu"
  }
}

86 of 115

Example: Metrics.wdl (samtools flagstat)

Same as the solution to exercise #2

metrics.wdl:

version 1.0

workflow metrics {
  call Flagstat
  output { File align_metrics = Flagstat.metrics }
}

task Flagstat {
  input {
    File input_sam
    String docker_image
  }

  # create a string to help parameterize the command
  String stats = basename(input_sam) + ".metrics"

  command {
    samtools flagstat ${input_sam} > ${stats}
  }

  output { File metrics = "${stats}" }

  # set a parameterized docker runtime
  runtime {
    docker: docker_image
  }
}

87 of 115

Exercise#3: Writing a multi-task workflow (if time permits)

Create a multi-task workflow: align_and_metrics.wdl

  • aligner.wdl
    • Generates a sam file from given fastq sequence files and reference files
  • metrics.wdl
    • Generates alignment statistics of a given sam/bam file.
  • You may create align_and_metrics.wdl with or without imports. A skeleton is provided to you for each type.


88 of 115

Importing workflows: Best Practices & Tips (Take home)

  • Each import should perform a specific action and be runnable on its own (given the correct input)
  • Use descriptive aliases for namespaces
  • Use descriptive names for tasks and workflows
  • Imports can be local file paths or HTTP(s) paths
    • HTTP(s) example: grabbing a raw file from a repository like GitHub
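For example, an HTTP(s) import of a raw file hosted on GitHub might look like this (the URL is a placeholder):

import "https://raw.githubusercontent.com/<org>/<repo>/<branch>/HelloWorld.wdl" as hello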


89 of 115

Importing workflows: Caveats (Take home)

  • Not all workflow engines or platforms support imports (ex. DNAnexus)
  • Levels of support also vary; some features don't work
    • Ex. Terra only recently started supporting local file path imports, and only if the descriptor is in GitHub
  • Learn more about which platforms support imports:

https://docs.dockstore.org/en/develop/end-user-topics/language-support.html


90 of 115

Metadata

  • meta section within the workflow section
  • Optional key/value pairs
  • Useful for author, email, description

Parameter Metadata

  • parameter_meta section within a task
  • Optional key/value pairs
  • Describe parameters
  • Key must map to a task input or output

task.wdl:

task doSomething {
  input { File myName }
  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }
  output {
    File outFile = "Hello.txt"
  }
}
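A sketch of the same task with both metadata sections filled in (the values are illustrative):

task doSomething {
  input { File myName }

  parameter_meta {
    myName: "Text file whose contents are appended to the greeting"
  }

  command {
    echo Hello World! > Hello.txt
    cat ${myName} >> Hello.txt
  }

  output {
    File outFile = "Hello.txt"
  }

  meta {
    author: "Foo Bar"
    email: "foobar@university.edu"
  }
}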

91 of 115

More trainings and tutorial content:


92 of 115

Summary, Dockstore, and next steps!


93 of 115

What is Dockstore?

Dockstore is a free and open source platform for sharing scientific tools and workflows.

Portability

Interoperability

Reproducibility


“An app store for bioinformatics”

94 of 115

Dockstore Ecosystem

(Diagram: Source Control, Docker Registries, and Analysis Environments surrounding Dockstore.)

Store your containers and descriptors on your preferred sites.

Register these as tools and workflows on Dockstore, allowing for a centralized catalog of bioinformatics resources.

Dockstore's launch-with feature enables users to export tools and workflows to a variety of cloud compute platforms.

95 of 115

Partner Platforms

  • DNAstack
  • DNAnexus
  • Terra
  • CGC : Cancer Genomics Cloud (Seven Bridges)
  • AnVIL
  • BioData Catalyst

Launching Analysis

Structural Variant Calling using Graph Genomes

Contributed by: Jean Monlong and Charles Markello (VG Team, UC Santa Cruz, Genomics Institute)

96 of 115

Launching Analysis - Example


Cumulus Workflow: https://dockstore.org/workflows/github.com/klarman-cell-observatory/cumulus/Cumulus:0.15.0?tab=info

AnVIL Organization, Cumulus Collection: https://dockstore.org/organizations/anvil/collections/Cumulus

Contributed by: Bo Li & Yiming Yang (Cumulus Team, Broad Institute)

97 of 115

General Best Practices

  • Include authorship and contact information in the primary descriptor file
  • Include a workflow description either in the README or the descriptor file
  • Use releases/tags (ex. v1.1) instead of branches for versioning
  • Test parameter files associated with workflows to provide samples that users can try out
    • Use publicly available data
    • Real world examples or simple test data
  • Add labels to your workflow to improve findability
  • Use checker workflows to test compatibility with different environments
  • Use the snapshot and DOI feature to improve reproducibility


98 of 115

DOIs

Create snapshots and digital object identifiers for your workflows to permanently capture the state of a workflow for publication

Creating Snapshots and Requesting DOIs — Dockstore documentation

Examples:

Forward: https://doi.org/10.5281/zenodo.3889018

Backward: https://dockstore.org/workflows/github.com/dockstore/hello_world:master?tab=versions


99 of 115

Organizations


Landing page to showcase tools and workflows

  • You can organize and collect workflows from other people
  • The same workflow may appear in multiple organizations, based on lab, funding source, etc.
  • Organizations and Collections — Dockstore documentation

Example: a COVID-19 collection that submits to Nextstrain https://dockstore.org/organizations/BroadInstitute/collections/pgs

  • Contributed by: Daniel Park (Viral Genomics Group, Broad Institute)

100 of 115

Getting Help on Dockstore

User forum at https://discuss.dockstore.org/

  • Topics embedded with each tool, workflow, and documentation page.
  • Talk about bioinformatics, workflows, and get help on development


101 of 115

Documentation and Tutorials

  • Example Topics:
    • Launching Tools and Workflows
    • Writing checker workflows
    • Developing File Provisioning Plugins
    • Creating Organizations
    • And many more!


102 of 115

Dockstore Ecosystem


Dockstore is thankful to its many contributors, users, and partners. This community has pulled together a library of over 700 tools and workflows. In the diagram to the right we've highlighted a few select contributors to give a sense of what has been occurring in this space.

103 of 115

The Dockstore Team


Louise Cabansay

Natalie Perez

Melaina Legaspi

Charles Reid

Emily Soth

Andy Chen

Benedict Paten

Elnaz Sarbar

Charles Overbeck

Walt Shands

David Steinberg

Nneka Olunwa

Lincoln Stein

Denis Yuen

Andrew Duncan

Gary Luu

Gregory Hogue

104 of 115

Acknowledgements


This work was funded by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-168).

Funded by:

105 of 115

Extra Slides for Q&A


106 of 115

Additional Readings

Note: -v has historically been how volumes are mounted, however --mount is an equivalent option with a different syntax
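For example, the bind mount from the earlier exercise could equivalently be written with --mount (a sketch):

docker run --mount type=bind,source=/usr/data,target=/tmp/data ubuntu:18.04 ls /tmp/data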


107 of 115

Exercise #1a: Using Docker

Docker has a whole library of commands; here are some basic examples:

docker info
    Display system-wide information about your installation of Docker.

docker image help
    Manage Docker images.

docker container help
    Manage Docker containers.

docker container run hello-world
    Run the official hello-world docker container from Docker Hub.

108 of 115

Exercise #1b: Explore the Dockstore CLI (Take Home)

Make a JSON template based off a local WDL:

dockstore workflow convert wdl2json --wdl hello-task.wdl > convert.json

Make a JSON template based off a descriptor located remotely on Dockstore:

dockstore workflow convert entry2json --entry [ dockstore identifier ] > [ parameter.json ]

Run a descriptor located remotely on Dockstore:

dockstore workflow launch --entry [ dockstore identifier ] --json [ parameter.json ]

109 of 115

Scatter Gather ( take home reading )

Scatter

  • Given an array of values, run the same task on each value in parallel (ex. Array of Files)

Gather

  • Collect the results of running each scatter command in an array

Beginner Example - Scatter Gather Pipeline

Advanced Example - Use scatter-gather to joint call genotypes
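A minimal sketch of scatter-gather in WDL, reusing the flagstat task from the earlier exercises (task definition omitted):

workflow scatter_metrics {
  input { Array[File] input_sams }

  # scatter: run flagstat on every file in parallel
  scatter (sam in input_sams) {
    call flagstat { input: input_sam = sam }
  }

  # gather: outputs of a scattered call are collected into an array
  output { Array[File] all_metrics = flagstat.metrics }
}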


110 of 115

What’s in a WDL? Top-level Summary

Workflow: Code block that defines the overall workflow.

  • think of it as an ‘outline’

Call: Defines which tasks to run

  • can also specify input parameters to pass to that task.
  • Located within the workflow block

Task: Defines all the information necessary to perform an action.

  • Tasks to run are specified by a ‘call’ inside the workflow block
  • Full definition of task is done outside of the workflow block

3 top-level components are part of the core structure of a WDL script:

workflow.wdl:

workflow myWorkflowName {
  input {
    ...
  }
  call task_A
  call task_B {
    input: ...
  }
  output {...}
}

task task_A { ... }
task task_B { ... }

111 of 115

What’s in a Task? Summary

Task: Defines all the information necessary to perform an action in a parameterized way.

  • Command (‘the action’)
    • the shell command(s) that will be executed when the task is run
  • Inputs and Outputs
    • All inputs and outputs must be typed (ex: string, int, file, etc)
  • Non-input declarations
    • Intermediate variables to help run task
  • Runtime
    • Defines context/environment
      • container
      • compute resources

task.wdl:

task doSomething {
  input {
    File myName
    String outFile
    String docker_image
    Int memory_gb
  }
  command {
    echo Hello World! > ${outFile}
    cat ${myName} >> ${outFile}
  }
  output {
    File out = "${outFile}"
  }
  runtime {
    docker: docker_image
    memory: "${memory_gb}"
  }
}

112 of 115

Summary

A workflow:

  • A workflow block (with inputs and outputs)
  • A call section to define which task(s) to run
  • One or more task(s) defining what the workflow will do
  • A meta section

A task:

  • An input section (required if the task will have inputs)
  • Non-input declarations (as many as needed, optional)
  • A command section (required)
  • A runtime section (optional)
  • An output section (required if the task will have outputs)
  • A parameter_meta section (optional)


113 of 115

Ways to Register to Dockstore

(Diagram: two paths into Dockstore.)

Containerized Tool: a Dockerfile hosted in external source control is built by a build system into a Docker image; on Dockstore, register the tool descriptor and point to the docker image(s) on Quay or Docker Hub.

Workflow (Tools + Descriptor): register workflow and tool descriptors from external source control.

As of Dockstore 1.9.0, you can install the Dockstore GitHub App to automatically update Dockstore when a workflow is updated on GitHub.

114 of 115

Dockstore Ecosystem

(Same Dockstore ecosystem diagram and summary as shown earlier: store your containers and descriptors on your preferred sites, register them on Dockstore, and launch with your preferred analysis environment.)

115 of 115

Language Support
