An introduction to the CWL standards
Michael R. Crusoe CWL Project Lead @biocrusoe
2018-12-09 #CommonWL
11th NBDC/DBCLS BioHackathon (BioHackathon 2018), Matsue, Japan
https://tinyurl.com/biohack18-cwl
From Phoenix, Arizona (Sonoran Desert), USA
Studied at Arizona State: Comp. Sci.; time in industry as a developer & system administrator (Google, others); returned to ASU for B.S. in Microbiology.
Introduced to bioinformatics via Anolis (lizard) genome assembly and analysis (Kenro Kusumi, Arizona State)
Returned to software engineering as a Research Software Engineer for k-h-mer project (C. Titus Brown; Michigan State then Univ. of California, Davis)
Now based out of Vilnius, Lithuania
CWL Project Lead; assisting EOSC Pilot, ELIXIR, ASTERICS, & NIH Data Commons
Common Workflow Language v1.0
From the Life Sciences…
…to (astro)physics and beyond
Why have a standard?
Timeline
2014: Bioinformatics Open Source Conference CodeFest: 4 software engineers and a whiteboard
2015: CWL “draft-2” version, commercial vendor (Seven Bridges Genomics) releases product in December.
2016: CWL v1.0 released; GA4GH begins WES on top of CWL.
2017: CWL v1.0.1 and v1.0.2 released. Now 4 open source implementations.
2018: IBM released their CWL implementation for LSF. CWL v1.1, with a multitude of corner case clarifications, is being finalized for release soon.
CWL in NIH Data Commons & European Open Science Cloud
The CWL model for tools
CWL tool descriptions turn POSIX† command-line data analysis tools into functions
The inputs and outputs of these functions are connected into “data flow” style workflows
†The reference CWL runner runs on Microsoft Windows using Docker software containers
Well described tools and workflows → Save time, money
CWL tool descriptions can self describe the “shape” of the computation
The resource requirements can be fixed values, or can be computed prior to scheduling from the input data & its metadata
http://www.commonwl.org/v1.0/CommandLineTool.html#Runtime_environment
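As a rough sketch of how a tool description might declare the “shape” of its computation, the fragment below mixes fixed values with one value computed from the input. The input name aligned_sequences is borrowed from the samtools example later in this deck. All numbers are illustrative assumptions, and the computed value assumes the runner fills in the optional size field of the File object.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, sort]
requirements:
  InlineJavascriptRequirement: {}   # needed because the expression below uses JavaScript arithmetic
  ResourceRequirement:
    coresMin: 4      # fixed value: at least 4 CPU cores (illustrative)
    ramMin: 8192     # fixed value, in mebibytes (illustrative)
    # computed before scheduling from input metadata:
    # reserve roughly twice the input size, converted from bytes to mebibytes
    outdirMin: $(Math.ceil(2 * inputs.aligned_sequences.size / 1048576))
inputs:
  aligned_sequences:
    type: File
    inputBinding:
      position: 1
outputs:
  sorted_aligned_sequences:
    type: stdout
```

A platform can read these values before the job starts and pick a suitably sized machine or queue.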
Data locality with CWL
Input and output files are modeled in CWL as rich objects with an identifier (URI/IRI) and other metadata.
Platforms that understand CWL can use these identifiers to send compute to, or near, the location of the data.
In combination with the resource matchmaking, this can conversely result in data being sent to specialized compute, as configured by the operator (or machine learning).
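For illustration, a File entry in an input object might look like the sketch below. The field names (class, location, format, size, secondaryFiles) come from the CWL v1.0 File type; the example.org URLs and the size are hypothetical placeholders.

```yaml
aligned_sequences:
  class: File
  # identifier (URI/IRI) that tells the platform where the data lives
  location: https://example.org/data/example.bam    # hypothetical URL
  format: http://edamontology.org/format_2572       # BAM
  size: 1073741824                                  # bytes; illustrative value
  secondaryFiles:
    - class: File
      location: https://example.org/data/example.bam.bai   # hypothetical index file
```

Because the location is an identifier rather than a local path, a CWL-aware platform is free to decide whether to move the data or the computation.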
Software Containers & CWL
CWL v1.0.x has built-in (optional) support for Docker software containers. The CWL reference runner has support for running Docker containers using the Docker, Singularity, uDocker, or dx-docker runtimes.
CWL descriptions can also contain more generic software requirements; these can be used to make applications available using Docker, Singularity, conda, Debian, or any other packaging system (like CVMFS).
http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwareRequirement
Example with reference CWL runner: https://github.com/common-workflow-language/cwltool#leveraging-softwarerequirements-beta
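A minimal sketch of how the two mechanisms can sit side by side in one tool description is shown below. The version string is an illustrative assumption and is not checked against the contents of the container image.

```yaml
hints:
  # generic, packaging-system-neutral statement of what the tool needs
  SoftwareRequirement:
    packages:
      - package: samtools
        version: [ "1.9" ]    # illustrative version
  # one concrete way to satisfy it: a Docker-format container image
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
```

Since both entries are hints, a runner without container support may ignore the DockerRequirement and resolve the package name through whatever packaging system the site provides.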
Open Source Implementations
Full list at https://www.commonwl.org/#Implementations
Arvados from Curoverse / Veritas Genetics
CWLEXEC from IBM LSF
CWL-Airflow from BioWardrobe Team, CCHMC
Toil from UCSC & community contributors
Rabix Bunny from Seven Bridges
REANA from CERN
Some are full platforms, others are just workflow executors.
Execution environments include:
EBI’s metagenomics workflow scripts -> CWL
https://www.ebi.ac.uk/metagenomics/pipelines/3.0
9522 lines of Python, BASH, and Perl code (data analysis workflow logic mixed with operational details)
converted into
2560 lines of CWL descriptions
https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl
(Lines of code counts via https://github.com/AlDanial/cloc#Stable)
EBI’s metagenomics -> CWL project
Courtesy EMBL-EBI Metagenomics, visualization from
Extensibility a core feature
Vendors are encouraged to develop new features as well-marked extensions. (Inspired by modern web standards development practices.)
These extensions are then candidates for inclusion as official extensions, or perhaps required elements of a future version of the standard.
Example: arv:PartitionRequirement will be part of CWL v1.1 as BatchQueue.
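As a sketch of what a “well-marked” extension looks like in practice, the fragment below puts the vendor extension under its own namespace prefix and lists it as a hint. The namespace URI and the partition field are written from memory of the Arvados documentation and should be checked against it; example_partition is a hypothetical name.

```yaml
cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  arv: "http://arvados.org/cwl#"     # assumed Arvados extension namespace
hints:
  arv:PartitionRequirement:
    partition: example_partition     # hypothetical partition / queue name
baseCommand: echo
inputs: []
outputs: []
```

Listing the extension under hints rather than requirements means runners that do not recognize it can still execute the tool.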
Software Containers & CWL
A future version of the CWL standard will switch from the “Docker” image format for software containers to the Open Container Initiative image format standard.
Use Cases for the CWL standards
Publication reproducibility, reusability
Workflow creation & improvement across institutions and continents
Contests & challenges
Analysis on non-public data sets, possibly using GA4GH job & workflow submission API
Linked Data & CWL
Example: the EDAM ontology (ELIXIR-DK) can be used to specify file formats and reason about them: the “FASTQ Sanger” encoding is a type of FASTQ file.
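A hedged sketch of that reasoning: the tool below asks for any FASTQ file, while the input object (shown in comments) declares the more specific FASTQ-sanger format. The input name reads and the file name sample.fastq are hypothetical. The assumption is that the runner loads the ontology listed under $schemas and uses its subclass relations when checking formats.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  reads:
    type: File
    format: edam:format_1930     # FASTQ (the general format)
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]

# Input object (a separate file) with the more specific format:
# reads:
#   class: File
#   location: sample.fastq
#   format: http://edamontology.org/format_1932   # FASTQ-sanger, a kind of FASTQ
```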
Related Efforts
US FDA & George Washington U initiative; now @ IEEE P2791 Working Group.
Desired to have input from Japan: regulators, pharma, clinical perspectives
https://osf.io/h59uh/
"An emerging approach to the publication, and exchange of scholarly information on the Web."
http://www.researchobject.org/
BH2018 Proposal
Work on new features for CWL (improve proposals, start implementation, write tests)
Key Points
CWL, as a standard, allows us to move the interface between the researcher and the infrastructure to a much higher layer. This frees the researcher to focus on their work and frees the e-infra providers to better optimize and balance their systems.
This workflow standard already has a growing ecosystem: training materials (in three languages), visualizers, support for popular text editors and IDEs, standalone GUI, and more
Thanks!
Please join us in Vilnius, Lithuania
2019-03-09 ~ 2019-03-11
“Debian Med 2019 Sprint” for Scientific Software Packaging & Containerization
(other packaging systems are welcome: Conda, Guix, etc…)
https://wiki.debian.org/Sprints/2019/DebianMed2019
Vilnius, Lithuania is 13.5-14.5 hours flying from Tokyo
(via Helsinki, Finland or Warsaw, Poland)
¥100,000 - ¥140,000 round trip airfare
€24 (¥3,100) / night for single room accommodations
Backup slides!
Editors, viewers, utilities, etc.
Rabix CWL GUI (“Composer”) also integrated into the Arvados Platform
Text editor support for Atom, Vim, emacs, Visual Studio, IntelliJ, and gedit courtesy community contributors
https://www.commonwl.org/#Editors_and_viewers
https://www.commonwl.org/#Converters_and_code_generators
Community Based Standards development
Different model than traditional nation-based or regulatory approach
We adopted the Open-Stand.org Modern Paradigm for Standards: Cooperation, Adherence to Principles (Due process, Broad consensus, Transparency, Balance, Openness), Collective Empowerment, (Free) Availability, Voluntary Adoption
Workflows
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
How to search for a tool, or for a workflow
GitHub: search for CWL documents using extension:cwl cwlVersion + <your search terms>, for example extension:cwl cwlVersion picard.
Google: search for CWL documents using filetype:cwl cwlVersion + <your search terms>, for example filetype:cwl cwlVersion picard.
Can also browse https://view.commonwl.org/workflows
Example: samtools-sort.cwl
# File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
# Input parameters
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary alignment format
    inputBinding:
      position: 1
# Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
# Executable
baseCommand: [samtools, sort]
# Runtime environment
hints:
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
# Linked data support
$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Runtime Environment
hints:
  DockerRequirement:
    dockerPull: quay.io/[...]samtools-sort
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Input parameters
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary format
    inputBinding:
      position: 1
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Command Line Building
Tool description (excerpt):
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 1
baseCommand: [samtools, sort]
Input object:
aligned_sequences:
  class: File
  location: example.bam
  format: http://edamontology.org/format_2572
Resulting command line:
["samtools", "sort", "example.bam"]
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
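To show how more than one binding is combined, here is a sketch that adds a second, hypothetical input named threads with a prefix. Bindings are ordered by position, and a prefix is emitted as its own argument just before the value. As on the slide, the file path is shown in simplified form; a real runner passes the path where it staged the file.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, sort]
inputs:
  threads:                 # hypothetical extra input, not on the original slide
    type: int
    default: 4
    inputBinding:
      position: 1
      prefix: "-@"         # samtools option for the number of additional threads
  aligned_sequences:
    type: File
    inputBinding:
      position: 2
outputs:
  sorted_aligned_sequences:
    type: stdout
# With the input object
#   aligned_sequences: { class: File, location: example.bam }
# the command line is built as:
#   ["samtools", "sort", "-@", "4", "example.bam"]
```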
Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
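The type: stdout shorthand can also be written out in full. The sketch below is meant to behave the same way, with the captured file given an explicit name; sorted.bam is a hypothetical file name chosen for the example.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, sort]
stdout: sorted.bam          # redirect the tool's standard output to this file
inputs:
  aligned_sequences:
    type: File
    inputBinding:
      position: 1
outputs:
  sorted_aligned_sequences:
    type: File
    outputBinding:
      glob: sorted.bam      # collect the redirected file as the output
```

With the shorthand, the runner picks the stdout file name itself when none is given.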
Example: grep & count
class: Workflow
cwlVersion: v1.0
inputs:
  pattern: string
  infiles: File[]
outputs:
  outfile:
    type: File
    outputSource: wc/outfile  # Connect output of “wc” to workflow output
requirements:
  - class: ScatterFeatureRequirement
steps:
  grep:
    run: grep.cwl  # Tool to run
    in:
      pattern: pattern
      infile: infiles
    scatter: infile  # Scatter over input array
    out: [outfile]
  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile  # Connect output of “grep” to input of “wc”
    out: [outfile]
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
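The slide does not show grep.cwl itself. One possible shape for it, as a sketch under assumptions, is given below. Only the parameter names pattern, infile, and outfile are fixed by the workflow; everything else here is hypothetical.

```yaml
# grep.cwl (hypothetical sketch)
cwlVersion: v1.0
class: CommandLineTool
baseCommand: grep
successCodes: [0, 1]    # grep exits with 1 when nothing matches; treat that as success
inputs:
  pattern:
    type: string
    inputBinding:
      position: 1
  infile:
    type: File
    inputBinding:
      position: 2
outputs:
  outfile:
    type: stdout        # matching lines are captured as a File
```

Because the workflow scatters over infile, this tool runs once per element of infiles, and grep/outfile becomes an array of files. That array matches an input such as infiles: File[] on a wc.cwl written along the same lines.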
Funding
Currently, only one FTE! (M. Crusoe). Lots of in-kind donations from participant projects & vendors.
NGO/charity in the USA is legal home of the project (Software Freedom Conservancy, a 501(c)(3))
M. Crusoe recently formed a public enterprise in Lithuania (VšĮ "Darbo eigos") to assist with coordinating & funding CWL work in Europe.
CWL is a standards community & pan-discipline; most traditional funding sources don’t know what to do with us.
CWL Design principles
Web: http://reana.io
Docs: http://reana.readthedocs.io
Twitter: https://twitter.com/reanahub
GitHub: https://github.com/reanahub
The LOFAR pre-facet calibration pipeline
LOFAR pipelines are currently written in a “parset” language unique to that team
Gijs Molenaar packaged the software in the KERN suite (3rd party software packages for Ubuntu Linux LTS) and used those packages to create Docker/Singularity containers
Gijs (with some assistance from me) then converted the “parset” based pipeline to a Common Workflow Language version
Searching for Pulsars with PRESTO (& CWL)