1 of 45

An introduction to the CWL standards

Michael R. Crusoe, CWL Project Lead, @biocrusoe

2018-12-09 #CommonWL

11th NBDC/DBCLS BioHackathon (BioHackathon 2018), Matsue, Japan

https://tinyurl.com/biohack18-cwl

2 of 45

From Phoenix, Arizona (Sonoran Desert), USA

Studied at Arizona State: Comp. Sci.; time in industry as a developer & system administrator (Google, others); returned to ASU for B.S. in Microbiology.

Introduced to bioinformatics via Anolis (lizard) genome assembly and analysis (Kenro Kusumi, Arizona State)

Returned to software engineering as a Research Software Engineer for the khmer project (C. Titus Brown; Michigan State, then Univ. of California, Davis)

Now based out of Vilnius, Lithuania

CWL Project Lead; assisting EOSC Pilot, ELIXIR, ASTERICS, & NIH Data Commons

3 of 45

4 of 45

Common Workflow Language v1.0

  • Common declarative format for tool & workflow execution
  • Community-based standards effort, not a specific software package; very extensible
  • Defined with a schema, specification, & test suite
  • Designed for shared-nothing clusters, academic clusters, cloud environments, and local execution
  • Supports the use of containers (e.g. Docker) and shared research computing clusters with locally installed software

5 of 45

From the Life Sciences…

6 of 45

…to (astro)physics and beyond

7 of 45

Why have a standard?

  • Standards create a surface for collaboration that promotes innovation
  • Researchers frequently dip in and out of different systems, but interoperability is not a basic feature
  • Funders, journals, and other sources of incentives prefer standards over proprietary or single-source approaches

8 of 45

Timeline

2014: Bioinformatics Open Source Conference CodeFest: 4 software engineers and a whiteboard

2015: CWL “draft-2” version; commercial vendor (Seven Bridges Genomics) releases product in December.

2016: CWL v1.0 released; GA4GH begins WES on top of CWL.

2017: CWL v1.0.1 and v1.0.2 released. Now 4 open source implementations.

2018: IBM released their CWL implementation for LSF. CWL v1.1, with a multitude of corner-case clarifications, is being finalized for release soon.

CWL in NIH Data Commons & European Open Science Cloud

9 of 45

The CWL model for tools

CWL tool descriptions turn POSIX command-line data analysis tools into functions

  • well defined and named inputs & outputs
  • typed

These inputs and outputs are connected into “data flow” style workflows

The reference CWL runner runs on Microsoft Windows using Docker software containers

10 of 45

Well described tools and workflows → Save time, money

CWL tool descriptions can self-describe the “shape” of the computation

  • # of cores
  • memory needs
  • temporary and output storage estimations

These can be fixed values, or they can be computed prior to scheduling from the input data & its metadata

http://www.commonwl.org/v1.0/CommandLineTool.html#Runtime_environment
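A minimal sketch of how such a “shape” might be declared with ResourceRequirement. The numbers and the sizing rule (twice the input size) are illustrative assumptions, not recommendations; the tool itself mirrors the samtools-sort example used later in this deck.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, sort]

requirements:
  # Needed because the outdirMin expression below uses JavaScript arithmetic.
  InlineJavascriptRequirement: {}
  ResourceRequirement:
    coresMin: 4        # fixed value: minimum number of cores
    ramMin: 8192       # fixed value: minimum RAM, in mebibytes
    tmpdirMin: 2048    # fixed value: minimum temporary storage, in mebibytes
    # Computed before scheduling from input metadata (file size in bytes):
    # assume the output needs about twice the input size, in mebibytes.
    outdirMin: "$(Math.ceil(inputs.aligned_sequences.size / 1048576) * 2)"

inputs:
  aligned_sequences:
    type: File
    inputBinding:
      position: 1

outputs:
  sorted_aligned_sequences:
    type: stdout
```

A platform can read these values before the job starts and pick a suitably sized node, rather than discovering the need at run time.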

11 of 45

Data locality with CWL

Input and output files are modeled in CWL as rich objects with an identifier (URI/IRI) and other metadata.

Platforms that understand CWL can use these identifiers to send compute to, or near, the location of the data.

In combination with resource matchmaking, this can conversely result in data being sent to specialized compute, as configured by the operator (or by machine learning).
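For illustration, a File value in an input object might look like the sketch below. The URL is hypothetical; the point is only that the location is an identifier a platform can inspect before deciding where to run the job.

```yaml
# Input object (job file) for a tool with an "aligned_sequences" File input.
aligned_sequences:
  class: File
  # Identifier of the data; hypothetical URL used only for illustration.
  location: https://data.example.org/alignments/example.bam
  # File format as an IRI (EDAM: BAM).
  format: http://edamontology.org/format_2572
```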

12 of 45

Software Containers & CWL

CWL v1.0.x has built-in (optional) support for Docker software containers. The CWL reference runner can run Docker containers using the Docker, Singularity, uDocker, or dx-docker runtimes.

CWL descriptions can also contain more generic software requirements; these can be used to make applications available using Docker, Singularity, conda, Debian, or any other packaging or distribution system (like CVMFS).

http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwareRequirement

Example with reference CWL runner: https://github.com/common-workflow-language/cwltool#leveraging-softwarerequirements-beta
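A sketch of how a tool description might carry both kinds of hint side by side. This is a fragment of a CommandLineTool, not a complete document; the version number is an illustrative assumption, and the container image name is the one used in the samtools-sort example later in this deck.

```yaml
hints:
  # Generic, packaging-system-neutral description of the software needed.
  SoftwareRequirement:
    packages:
      samtools:
        version: [ "1.9" ]   # illustrative version, not taken from the slides
  # Container-based alternative for platforms that can run Docker images.
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
```

Because both are listed under hints, a runner is free to satisfy whichever one suits its environment.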

13 of 45

Open Source Implementations

Full list at https://www.commonwl.org/#Implementations

Arvados from Curoverse / Veritas Genetics

CWLEXEC from IBM LSF

CWL-Airflow from BioWardrobe Team, CCHMC

Toil from UCSC & community contributors

Rabix Bunny from Seven Bridges

REANA from CERN

14 of 45

Open Source Implementations

Full list at https://www.commonwl.org/#Implementations

Some are full platforms, others are just workflow executors.

Execution environments include:

  • Local (Linux, OS X, Windows)
  • HPC: Slurm, GridEngine, PBS, LSF, HTCondor, Apache Airflow
  • Cloud: Amazon AWS, Google GCP, Mesos, OpenStack, MS Azure, Kubernetes

15 of 45

EBI’s metagenomics workflow scripts -> CWL

https://www.ebi.ac.uk/metagenomics/pipelines/3.0

9522 lines of Python, BASH, and Perl code (data analysis workflow logic mixed with operational details)

converted into

2560 lines of CWL descriptions

https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl

(Lines of code counts via https://github.com/AlDanial/cloc#Stable)

16 of 45

EBI’s metagenomics -> CWL project

17 of 45

Extensibility a core feature

Vendors are encouraged to develop new features as well-marked extensions. (Inspired by modern web standards development practices.)

These extensions are then candidates for inclusion as official extensions, or perhaps as required elements of a future version of the standard.

Example: arv:PartitionRequirement will be part of CWL v1.1 as BatchQueue.

18 of 45

Software Containers & CWL

A future version of the CWL standard will switch from the “Docker” image format for software containers to the Open Container Initiative image format standard.

19 of 45

Use Cases for the CWL standards

Publication reproducibility, reusability

Workflow creation & improvement across institutions and continents

Contests & challenges

Analysis on non-public data sets, possibly using GA4GH job & workflow submission API

20 of 45

Linked Data & CWL

  • Hyperlinks are common currency
  • Bring your own RDF ontologies for metadata
  • Supports SPARQL to query

Example: the EDAM ontology (ELIXIR-DK) can be used to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file.

21 of 45

Related Efforts

US FDA & George Washington U initiative; now @ IEEE P2791 Working Group.

Input from Japan is desired: regulators, pharma, clinical perspectives

https://osf.io/h59uh/

"An emerging approach to the publication, and exchange of scholarly information on the Web."

http://www.researchobject.org/

22 of 45

BH2018 Proposal

Work on new features for CWL (improve proposals, start implementation, write tests)

  • enhanced metadata handling: https://github.com/common-workflow-language/common-workflow-language/issues/710
  • further restrictions on input type: https://github.com/common-workflow-language/common-workflow-language/issues/764
    • Example: a number between 23 and 42
    • Example: a string with a length of more than 12 characters

23 of 45

Key Points

CWL, as a standard, allows us to move the interface between the researcher and the infrastructure to a much higher layer. This frees the researcher to focus on their work and frees the e-infra providers to better optimize and balance their systems.

This workflow standard already has a growing ecosystem: training materials (in three languages), visualizers, support for popular text editors and IDEs, standalone GUI, and more

24 of 45

Thanks!

Please join us in Vilnius, Lithuania

2019-03-09 ~ 2019-03-11

“Debian Med 2019 Sprint” for Scientific Software Packaging & Containerization

(other packaging systems are welcome: Conda, Guix, etc…)

https://wiki.debian.org/Sprints/2019/DebianMed2019

Vilnius, Lithuania is 13.5-14.5 hours flying from Tokyo

(via Helsinki, Finland or Warsaw, Poland)

¥100,000 - ¥140,000 round trip airfare

€24 (¥3,100) / night for single room accommodations

https://www.commonwl.org

25 of 45

Backup slides!

26 of 45

Editors, viewers, utilities, etc.

Rabix CWL GUI (“Composer”) also integrated into the Arvados Platform

Text editor support for Atom, Vim, Emacs, Visual Studio, IntelliJ, and gedit, courtesy of community contributors

https://www.commonwl.org/#Editors_and_viewers

https://www.commonwl.org/#Converters_and_code_generators

28 of 45

Community Based Standards development

Different model than traditional nation-based or regulatory approach

We adopted the Open-Stand.org Modern Paradigm for Standards: Cooperation, Adherence to Principles (Due process, Broad consensus, Transparency, Balance, Openness), Collective Empowerment, (Free) Availability, Voluntary Adoption

29 of 45

Workflows

  • Specify data dependencies between steps
  • Scatter/gather on steps
  • Can nest workflows in steps
  • Still working on:
    • Conditionals & looping
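As a sketch of nesting, a step can run another workflow instead of a tool. The sub-workflow file name (grep-and-count.cwl) and the parameter names are assumptions made for illustration.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  # Optional CWL feature that allows a step to run another Workflow.
  SubworkflowFeatureRequirement: {}

inputs:
  pattern: string
  infiles: File[]

outputs:
  count:
    type: File
    outputSource: count_matches/outfile

steps:
  count_matches:
    run: grep-and-count.cwl   # hypothetical file containing a Workflow
    in:
      pattern: pattern
      infiles: infiles
    out: [outfile]
```

From the outer workflow's point of view, the nested workflow is just another step with named inputs and outputs.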

30 of 45

How to search for a tool, or for a workflow

GitHub: search for CWL documents using extension:cwl cwlVersion + <your search terms>, for example extension:cwl cwlVersion picard

Google: search for CWL documents using filetype:cwl cwlVersion + <your search terms>, for example filetype:cwl cwlVersion picard

Can also browse https://view.commonwl.org/workflows

31 of 45

Example: samtools-sort.cwl

# File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates

# Input parameters
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary alignment format
    inputBinding:
      position: 1

# Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572

# Executable
baseCommand: [samtools, sort]

# Runtime environment
hints:
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort

# Linked data support
$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]

32 of 45

File type & metadata

  • Identify as a CommandLineTool object
  • Core spec includes simple comments
  • Metadata about tool extensible to arbitrary RDF vocabularies, e.g.
    • Biotools & EDAM
    • Dublin Core Terms (DCT)
    • Description of a Project (DOAP)

class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
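A sketch of how metadata from an external RDF vocabulary might be attached to the tool, here Dublin Core Terms for the creator. The use of FOAF for the name, and the identifier and name themselves, are placeholders chosen for illustration.

```yaml
cwlVersion: v1.0
class: CommandLineTool
doc: Sort by chromosomal coordinates
baseCommand: [samtools, sort]

inputs:
  aligned_sequences:
    type: File
    inputBinding:
      position: 1

outputs:
  sorted_aligned_sequences:
    type: stdout

# Metadata outside the core spec: keys are prefixed with a declared namespace.
dct:creator:
  "@id": "https://orcid.org/0000-0000-0000-0000"   # placeholder identifier
  foaf:name: Jane Researcher                       # placeholder name

$namespaces:
  dct: http://purl.org/dc/terms/
  foaf: http://xmlns.com/foaf/0.1/
```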

33 of 45

Runtime Environment

  • Define the execution environment of the tool
  • “requirements” must be fulfilled, otherwise it is an error
  • “hints” are soft requirements (express preference but not an error if not satisfied)
  • Also used to enable optional CWL features
    • Mechanism for defining extensions

hints:
  DockerRequirement:
    dockerPull: quay.io/[...]samtools-sort
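To make the difference concrete, the fragment below sketches one entry of each kind. It is a fragment of a tool description, and the choice of InlineJavascriptRequirement as the hard requirement is an illustrative assumption.

```yaml
# Hard requirement: a runner that cannot satisfy this must report an error.
# This entry also shows a requirement switching on an optional CWL feature.
requirements:
  InlineJavascriptRequirement: {}

# Soft requirement: a runner should use this container if it can, but it
# may fall back to locally installed software without raising an error.
hints:
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
```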

34 of 45

Input parameters

  • Specify name & type of input parameters
    • Based on the Apache Avro type system
    • null, boolean, int, string, float, array, record
    • File formats can be IANA Media/MIME types, or from domain specific ontologies, like EDAM for bioinformatics
  • “inputBinding”: describes how to turn parameter value into actual command line argument

inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary format
    inputBinding:
      position: 1
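A few more type declarations, as a sketch. The parameter names other than aligned_sequences are hypothetical, and the iana namespace prefix is an assumption about how one might refer to IANA media types.

```yaml
inputs:
  sample_name: string         # shorthand form: a required string
  threads: int?               # "?" marks the parameter as optional (may be null)
  extra_alignments: File[]    # "[]" declares an array of files
  sample_sheet:
    type: File
    format: iana:text/csv     # file format given as an IANA media type
  aligned_sequences:
    type: File
    format: edam:format_2572  # file format from the EDAM ontology (BAM)

$namespaces:
  edam: http://edamontology.org/
  iana: https://www.iana.org/assignments/media-types/
```

None of these parameters has an inputBinding, so none of them would appear on the command line; the binding is a separate, optional part of each declaration.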

36 of 45

Command Line Building

  • Associate input values with parameters
  • Apply input bindings to generate strings
  • Sort by “position”
  • Prefix “base command”

Tool description:

inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 1
baseCommand: [samtools, sort]

Input object:

aligned_sequences:
  class: File
  location: example.bam
  format: http://edamontology.org/format_2572

Resulting command line:

["samtools", "sort", "example.bam"]
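With more than one binding, the “sort by position” step becomes visible. The sketch below adds a hypothetical threads parameter bound to the samtools sort -@ option; the file path is shortened to its name, as in the example above.

```yaml
# Tool description fragment
baseCommand: [samtools, sort]
inputs:
  aligned_sequences:
    type: File
    inputBinding:
      position: 2        # sorted after the threads option
  threads:
    type: int
    inputBinding:
      prefix: "-@"       # emitted as a separate argument before the value
      position: 1

# Input object:
#   threads: 4
#   aligned_sequences: {class: File, location: example.bam}
#
# Resulting command line:
#   ["samtools", "sort", "-@", "4", "example.bam"]
```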

37 of 45

Output parameters

  • Specify name & type of output parameters
  • In this example, capture the STDOUT stream from “samtools sort” and tag it as being BAM formatted.

outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
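When a tool writes to a named file rather than to STDOUT, the output can be collected with a glob pattern instead. A sketch, assuming the output file name sorted.bam:

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, sort]

# Ask samtools to write to a named file; the file name is an assumption.
arguments: ["-o", "sorted.bam"]

inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 1

outputs:
  sorted_aligned_sequences:
    type: File
    format: edam:format_2572
    outputBinding:
      glob: sorted.bam   # pick up the file from the output directory

$namespaces: { edam: "http://edamontology.org/" }
```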

39 of 45

Example: grep & count

class: Workflow
cwlVersion: v1.0

inputs:
  pattern: string
  infiles: File[]

outputs:
  outfile:
    type: File
    outputSource: wc/outfile    # Connect output of "wc" to workflow output

requirements:
  - class: ScatterFeatureRequirement

steps:
  grep:
    run: grep.cwl               # Tool to run
    in:
      pattern: pattern
      infile: infiles
    scatter: infile             # Scatter over input array
    out: [outfile]

  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile     # Connect output of "grep" to input of "wc"
    out: [outfile]
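The slides do not show grep.cwl and wc.cwl themselves. The two tool descriptions below are a sketch of what they might contain, with parameter names matching the workflow; each document would be saved as its own file, and the successCodes entry is an added assumption.

```yaml
# grep.cwl (hypothetical contents)
cwlVersion: v1.0
class: CommandLineTool
baseCommand: grep
successCodes: [0, 1]     # grep exits with 1 when nothing matches; not a failure here
inputs:
  pattern:
    type: string
    inputBinding:
      position: 1
  infile:
    type: File
    inputBinding:
      position: 2
outputs:
  outfile:
    type: stdout         # matching lines are captured from STDOUT
---
# wc.cwl (hypothetical contents)
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  infiles:
    type: File[]         # receives the gathered array of grep outputs
    inputBinding:
      position: 1
outputs:
  outfile:
    type: stdout         # line counts are captured from STDOUT
```

Because the grep step is scattered over infiles, its outfile becomes an array, which is why wc.cwl declares infiles as File[].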

40 of 45

Funding

Currently, only one FTE! (M. Crusoe). Lots of in-kind donations from participant projects & vendors.

NGO/charity in the USA is legal home of the project (Software Freedom Conservancy, a 501(c)(3))

M. Crusoe recently formed a public enterprise in Lithuania (VšĮ "Darbo eigos") to assist with coordinating & funding CWL work in Europe.

CWL is a standards community & pan-discipline; most traditional funding sources don’t know what to do with us.

41 of 45

CWL Design principles

  • Low barrier to entry for implementers
  • Support tooling such as generators, GUIs, converters
  • Allow extensions, but must be well marked
  • Be part of linked data ecosystem
  • Be pragmatic

42 of 45

43 of 45

The LOFAR pre-facet calibration pipeline

LOFAR pipelines are currently written in a “parset” language unique to that team

Gijs Molenaar packaged the software in the KERN suite (3rd party software packages for Ubuntu Linux LTS) and used those packages to create Docker/Singularity containers

Gijs (with some assistance from me) then converted the “parset” based pipeline to a Common Workflow Language version

44 of 45

The LOFAR pre-facet calibration pipeline

45 of 45

Searching for Pulsars with PRESTO (& CWL)
