Introduction to CWL (a hands-on session using a GUI)
Hashtags: #CommonWL #CommonWLjp
HANDS-ON WORK SETUP 1 (preparation)
Update (December 5, 2018, 15:30): if Docker does not install properly on a Mac, could you please try this as well?
Windows:
Docker for Windows 10 Professional or Enterprise: https://download.docker.com/win/stable/Docker%20for%20Windows%20Installer.exe
Docker for other versions of Windows: https://download.docker.com/win/stable/DockerToolbox.exe
Java Development Kit 8+: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Rabix Composer for Windows: https://github.com/rabix/composer/releases/download/1.0.1-rc.1/rabix-composer.Setup.1.0.1-rc.1.exe
Mac OS X:
Docker for Mac OS X: https://download.docker.com/mac/stable/Docker.dmg
Java Development Kit 8+: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Rabix Composer for Mac OS X: https://github.com/rabix/composer/releases/download/1.0.1-rc.1/rabix-composer-1.0.1-rc.1.dmg
Linux:
Docker: pick your distribution from https://www.docker.com/community-edition
Java Development Kit 8+: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html or sudo apt-get install openjdk-8-jdk-headless
Rabix Composer for Linux: https://github.com/rabix/composer/releases/download/1.0.1/rabix-composer-1.0.1-x86_64.AppImage
To check that Docker was installed correctly, please try the following command beforehand:
docker pull mgrast/pdf2wordcloud:demo
Common Workflow Language v1.0
From the Life Sciences…
Timeline
2014: Bioinformatics Open Source Conference (BOSC) CodeFest: it started with four software engineers and a whiteboard
2015: CWL “draft-2”; in December, a commercial vendor (Seven Bridges Genomics) released a product
2016: CWL v1.0 released
2017: CWL v1.0.1 and v1.0.2 released. Four implementations available
2018: IBM released a CWL implementation for LSF. CWL v1.1 is planned to be finalized.
CWL How It WORKS
CWL How It WORKS example
Example: extract the text from a PDF and count the occurrences of a specific word
CommandLineTool: pdf2text
CommandLineTool: grep
CommandLineTool: wc
2. Define a workflow (Workflow) that runs the commands defined in step 1 in sequence, as many of them as needed
Workflow
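3. Run the workflow defined in step 2. At this point, choose an execution engine, mainly according to the computing environment you will run on
Workflow
Execution engines: cwltool, cwlexec, toil, Galaxy
HPC, Slurm, Gridengine...
AWS, Azure, GCP…
OpenStack, CloudStack
Kubernetes, Airflow
Local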
COMMANDLINETOOL
Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Example: samtools-sort.cwl
# File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates

# Input parameters
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary alignment format
    inputBinding:
      position: 1

# Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572

# Executable
baseCommand: [samtools, sort]

# Runtime environment
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.8--4

# Linked data support
$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]
File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
Runtime Environment
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.8--4
Input parameters
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572  # BAM binary format
    inputBinding:
      position: 1
Command Line Building
Tool description:
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 1
baseCommand: [samtools, sort]
Input object:
aligned_sequences:
  class: File
  location: example.bam
  format: http://edamontology.org/format_2572
Resulting command line:
["samtools", "sort", "example.bam"]
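As a further illustration of how the command line is assembled, the fragment below adds a second input with a prefix. This is only a sketch and not part of the original example: the `threads` input and its value are assumptions added for illustration (`-@` is the samtools option for additional threads). Bindings are ordered by `position`, and the expected result is given in the closing comment.

```yaml
# Tool description fragment (sketch; `threads` is a hypothetical extra input)
baseCommand: [samtools, sort]
inputs:
  threads:
    type: int
    inputBinding:
      position: 1
      prefix: "-@"        # emitted as two arguments: -@ <value>
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 2         # placed after `threads` because of its position

# With an input object such as
#   threads: 4
#   aligned_sequences: {class: File, location: example.bam}
# a runner would be expected to build: samtools sort -@ 4 example.bam
```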
Output parameters
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
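For comparison, an output does not have to be captured from standard output. The sketch below shows one possible alternative, in which the tool writes a named file that is then collected with a glob pattern. The file name `sorted.bam` and the use of `arguments` are assumptions made for illustration; they are not taken from the original samtools-sort.cwl.

```yaml
# Alternative to `type: stdout` (sketch): write to a named file and collect it
baseCommand: [samtools, sort]
arguments: ["-o", sorted.bam]     # hypothetical output file name
outputs:
  sorted_aligned_sequences:
    type: File
    format: edam:format_2572
    outputBinding:
      glob: sorted.bam            # must match the name passed to -o
```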
HANDS ON #1: run tools
Goal: run your first CWL tool (pdf2text) using Rabix Composer
git clone https://github.com/mr-c/CWL-Quick-Start.git
Workflows
Example: grep & count
class: Workflow
cwlVersion: v1.0
inputs:
  pattern: string
  infiles: File[]
outputs:
  outfile:
    type: File
    outputSource: wc/outfile
requirements:
  - class: ScatterFeatureRequirement
steps:
  grep:
    run: grep.cwl
    in:
      pattern: pattern
      infile: infiles
    scatter: infile
    out: [outfile]
  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile
    out: [outfile]
Example: grep & count
class: Workflow
cwlVersion: v1.0
inputs:
  pattern: string
  infiles: File[]
outputs:
  outfile:
    type: File
    outputSource: wc/outfile   # the output of "wc" becomes the workflow output
requirements:
  - class: ScatterFeatureRequirement
steps:
  grep:
    run: grep.cwl              # the command-line tool to run
    in:                        # the search string and the target files passed in as inputs
      pattern: pattern
      infile: infiles
    scatter: infile            # run grep once for each file in the input array
    out: [outfile]
  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile    # the output of "grep" becomes the input of "wc"
    out: [outfile]
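To run this workflow, a runner also needs an input object that supplies values for `pattern` and `infiles`. The following is a minimal sketch of such a job file; the file name, the search string, and the two input paths are all hypothetical.

```yaml
# grep-count.job.yaml (hypothetical file name and contents)
pattern: CWL               # the string that grep searches for
infiles:                   # File[]; one grep job is run per entry
  - class: File
    path: chapter1.txt     # hypothetical input file
  - class: File
    path: chapter2.txt     # hypothetical input file
```

Assuming the workflow were saved as grep-count.cwl, it could be run with the reference runner in the same way as the hands-on examples: cwltool grep-count.cwl grep-count.job.yaml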
HANDS ON #2: run workflows
Goal: run your first CWL workflow (pdf2wordcloud) in Rabix Composer
Software Containers & CWL
CWL v1.0.x supports Docker containers. (Some execution engines may not support them.) The CWL reference runner (cwltool) can run Docker containers with the Docker, Singularity, uDocker, and dx-docker runtimes. (The patch "Allow running dockerd as a non-root user (Rootless mode)" is being merged into upstream Docker, so that may become usable as well.)
http://www.commonwl.org/v1.0/CommandLineTool.html#SoftwareRequirement
Example with reference CWL runner: https://github.com/common-workflow-language/cwltool#leveraging-softwarerequirements-beta
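As a rough sketch of what the SoftwareRequirement linked above can look like, the fragment below declares the software dependency by name and version next to the container image used earlier in this deck. The version string is an assumption inferred from the image tag, and whether a runner can resolve the package depends on its configuration.

```yaml
# Sketch: describe the dependency as well as the container that provides it
hints:
  SoftwareRequirement:
    packages:
      samtools:
        version: ["1.8"]          # assumed from the image tag below
  DockerRequirement:
    dockerPull: quay.io/biocontainers/samtools:1.8--4
```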
Open Source Implementations
Full list at https://www.commonwl.org/#Implementations
Arvados from Curoverse / Veritas Genetics
CWLEXEC from IBM LSF
CWL-Airflow from BioWardrobe Team, CCHMC
Toil from UCSC & community contributors
Rabix Bunny from Seven Bridges
REANA from CERN
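Some implementations support everything, including the optional features, while others support only the required features. Check the list for details and find the one that suits you.
These implementations can run in a variety of environments, such as local machines, HPC clusters, and clouds.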
How to search for a tool, or for a workflow
To search for CWL files on GitHub: extension:cwl cwlVersion + <your search terms>, e.g. extension:cwl cwlVersion picard
To search for CWL files on Google: filetype:cwl cwlVersion + <your search terms>, e.g. filetype:cwl cwlVersion picard
You can also find them here: https://view.commonwl.org/workflows
Editors, viewers, utilities, etc.
The Rabix CWL GUI (“Composer”) used in this session is also used on the Arvados Platform
Thanks to community contributors, the following text editors have CWL support such as syntax highlighting: Atom, Vim, emacs, Visual Studio, IntelliJ, gedit
https://www.commonwl.org/#Editors_and_viewers
https://www.commonwl.org/#Converters_and_code_generators
EBI’s metagenomics workflow scripts -> CWL
https://www.ebi.ac.uk/metagenomics/pipelines/3.0
9,522 lines of Python, Bash, and Perl code became 2,560 lines once converted to CWL.
https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl
Most of the removed code was job-scheduling code unrelated to the analysis itself.
(Lines of code counts via https://github.com/AlDanial/cloc#Stable)
EBI’s metagenomics -> CWL project
Courtesy EMBL-EBI Metagenomics, visualization from
Data
Interoperability
Tools
Compute
Training
ELIXIR: European infrastructure for biological information
Data infrastructure for Europe’s life-science research:
Marine metagenomics
Human data
Crop and forest plants
Rare diseases
ELIXIR Hub based alongside EMBL-EBI in Hinxton
…to (astro)physics and beyond
Astronomy
Radio Astronomy
General
CERN / REANA
Thanks!
ASTERICS-OBELICS Workshop 2017 / Barcelona
10/10/2017
Backup slides!
Key Points
CWL, as a standard, allows us to move the interface between the researcher and the infrastructure to a much higher layer. This frees the researcher to focus on their work and frees the e-infrastructure providers to better optimize and balance their systems.
This workflow standard already has a growing ecosystem: training materials (in three languages), visualizers, support for popular text editors and IDEs, standalone GUI, and more
Data locality with CWL
Input and output files are modeled in CWL as rich objects with an identifier (URI/IRI) and other metadata.
Platforms that understand CWL can use these identifiers to send compute to, or near, the location of the data.
In combination with resource matchmaking, this can conversely result in data being sent to specialized compute as configured by the operator (or by machine learning).
Use Cases for the CWL standards
Publication reproducibility, reusability
Workflow creation & improvement across institutions and continents
Contests & challenges
Analysis on non-public data sets, possibly using GA4GH job & workflow submission API
Well described tools and workflows → Save time, money
A CWL tool description can state how many resources the tool needs.
These can be fixed values, or they can be computed prior to scheduling from the input data and its metadata.
http://www.commonwl.org/v1.0/CommandLineTool.html#Runtime_environment
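A minimal sketch of both styles follows: two fixed values and one value computed from the input's metadata. The numbers are hypothetical, the input name `aligned_sequences` is borrowed from the samtools example, and the expression assumes that the runner has populated the `size` field of the input file.

```yaml
requirements:
  InlineJavascriptRequirement: {}   # needed for the arithmetic expression below
  ResourceRequirement:
    coresMin: 2                     # fixed value (hypothetical)
    ramMin: 4096                    # fixed value in mebibytes (hypothetical)
    # computed before scheduling: twice the input size, converted to mebibytes
    outdirMin: $(Math.ceil(2 * inputs.aligned_sequences.size / (1024 * 1024)))
```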
Community Based Standards development
Different model than traditional nation-based or regulatory approach
We adopted the Open-Stand.org Modern Paradigm for Standards: Cooperation, Adherence to Principles (Due process, Broad consensus, Transparency, Balance, Openness), Collective Empowerment, (Free) Availability, Voluntary Adoption
Extensibility a core feature
Vendors are encouraged to develop new features as well-marked extensions. (Inspired by modern web standards development practices)
These extensions are then candidates for inclusion as official extensions, or perhaps required elements of a future version of the standard.
Example: arv:PartitionRequirement will be part of CWL v1.1 as BatchQueue.
The CWL model for tools
CWL tool descriptions turn POSIX† command-line data analysis tools into functions
The inputs and outputs of these functions are connected into “data flow” style workflows
†The reference CWL runner runs on Microsoft Windows using Docker software containers
Why have a standard?
ResearchObject.org standard overview
Software Containers & CWL
A future version of the CWL standard will switch from the “Docker” image format for software containers to the Open Container Initiative image format standard.
Example: grep & count
class: Workflow
cwlVersion: v1.0
inputs:
  pattern: string
  infiles: File[]
outputs:
  outfile:
    type: File
    outputSource: wc/outfile   # Connect output of "wc" to workflow output
requirements:
  - class: ScatterFeatureRequirement
steps:
  grep:
    run: grep.cwl              # Tool to run
    in:
      pattern: pattern
      infile: infiles
    scatter: infile            # Scatter over input array
    out: [outfile]
  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile    # Connect output of "grep" to input of "wc"
    out: [outfile]
HANDS ON #1: run tools, bonus
$ virtualenv cwl
Running virtualenv with interpreter /usr/bin/python2
New python executable in /home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/cwl/bin/python2
Also creating executable in /home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/cwl/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.
$ . cwl/bin/activate
$ pip install cwltool
Collecting cwltool
Downloading https://files.pythonhosted.org/packages/92/a5/d9739eb51b3e2d55438a194ef7bd7f55ae8785a0219563006fdbab37b80a/cwltool-1.0.20180622214234-py2.py3-none-any.whl (642kB)
100% |████████████████████████████████| 645kB 2.9MB/s
Collecting mypy-extensions (from cwltool)
[...]
HANDS ON #1: run tools, bonus
$ cwltool CWL/Tools/pdftotext.cwl CWL/Tools/pdftotext.job.yaml
/home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/cwl/bin/cwltool 1.0.20180622214234
Resolved 'CWL/Tools/pdftotext.cwl' to 'file:///home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/CWL/Tools/pdftotext.cwl'
[job pdftotext.cwl] /tmp/tmp22cRNF$ docker \
run \
-i \
--volume=/tmp/tmp22cRNF:/var/spool/cwl:rw \
--volume=/tmp/tmp5pmG8f:/tmp:rw \
--volume=/home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/Documents/demo.pdf:/var/lib/cwl/stg3e807c8f-9764-4b22-b534-a4da46590b95/demo.pdf:ro \
--workdir=/var/spool/cwl \
--read-only=true \
--user=1000:1000 \
--rm \
--env=TMPDIR=/tmp \
--env=HOME=/var/spool/cwl \
mgrast/pdf2wordcloud:demo \
pdftotext \
/var/lib/cwl/stg3e807c8f-9764-4b22-b534-a4da46590b95/demo.pdf \
demo.txt
[job pdftotext.cwl] completed success
{
"extractedText": {
"checksum": "sha1$a30c1b995e02fa2e7fb2f4357eda9bcebd8d9253",
"basename": "demo.txt",
"location": "file:///home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/demo.txt",
"path": "/home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/demo.txt",
"class": "File",
"size": 19545
}
}
Final process status is success
HANDS ON #2: run workflows, bonus
$ cwltool CWL/Workflows/pdf2wordcloud.cwl CWL/Workflows/demo-1.job.yaml
/home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/cwl/bin/cwltool 1.0.20180622214234
Resolved 'CWL/Workflows/pdf2wordcloud.cwl' to 'file:///home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/CWL/Workflows/pdf2wordcloud.cwl'
[workflow ] start
[step pdf2text] start
[job pdf2text] /tmp/tmpNWRMks$ docker \
run \
-i \
--volume=/tmp/tmpNWRMks:/var/spool/cwl:rw \
--volume=/tmp/tmp8amxe_:/tmp:rw \
[...]
[job pdf2text] completed success
[step pdf2text] completed success
[step text2wordCloud] start
[job text2wordCloud] /tmp/tmp8cZpSg$ docker \
run \
-i \
[...]
extracted.txt.png \
--text \
/var/lib/cwl/stgee0f1453-fec7-4a8e-95cf-982a11d55562/extracted.txt
[job text2wordCloud] completed success
[step text2wordCloud] completed success
[workflow ] completed success
{
"words": {
"checksum": "sha1$3a72cfb5ce870a9275af86d5471e19cc95b43053",
"basename": "extracted.txt.png",
"location": "file:///home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/extracted.txt.png",
"path": "/home/hmenager/gccbosc2018-cwltutorial/CWL-Quick-Start/extracted.txt.png",
"class": "File",
"size": 59295
}
}
Final process status is success
Linked Data & CWL
Example: can use the EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file
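The fragment below sketches how this might look in a tool description. The input name `reads` and the file name are hypothetical, and the two EDAM identifiers (format_1930 for FASTQ, format_1932 for FASTQ-sanger) are given as an assumption to be checked against the ontology. With the ontology loaded through `$schemas`, a runner can in principle recognise the more specific format as a kind of FASTQ.

```yaml
# Tool description fragment (sketch): accept any FASTQ file
inputs:
  reads:                          # hypothetical input name
    type: File
    format: edam:format_1930      # FASTQ (assumed EDAM identifier)
$namespaces: { edam: "http://edamontology.org/" }
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]

# An input object could then supply the more specific format, for example:
#   reads:
#     class: File
#     location: sample.fastq                         (hypothetical file)
#     format: http://edamontology.org/format_1932    (FASTQ-sanger, assumed)
```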
CWL Design principles
Web: http://reana.io
Docs: http://reana.readthedocs.io
Twitter: https://twitter.com/reanahub
GitHub: https://github.com/reanahub
The LOFAR pre-facet calibration pipeline
LOFAR pipelines are currently written in the ‘parsets’ language, which is unique to that team
Gijs Molenaar packaged the software in the KERN suite (3rd party software packages for Ubuntu Linux LTS) and used those packages to create Docker/Singularity containers
Gijs (with some assistance from me) then converted the “parset” based pipeline to a Common Workflow Language version
Searching for Pulsars with PRESTO (& CWL)