FAIR Computational workflows
International FAIR Convergence Symposium 2020
Sarah Cohen-Boulakia*, Daniel Garijo**, Carole Goble***
* Université Paris-Saclay, France
** Information Sciences Institute, University of Southern California
*** The University of Manchester, UK
Welcome!
Pipelines are designed to analyse scientific datasets.
Scientific workflow systems have been developed to assist users.
279 registrations!
Housekeeping
Cameras on for speakers and chairs only. Please turn off your camera unless presenting.
The presentations will be shared and the session is being recorded.
You are all muted by default: to ask a question use the chat AND raise your hand. The host can unmute you during the discussion.
Please use the chat for discussion; mark your point as QUESTION if you are addressing it to the panellists.
If you are tweeting use the hashtag #FAIRconvergence
Session 1 - Tools for FAIR workflows
Chair: Daniel Garijo
5 mins questions at the end - put your questions in the chat
The WorkflowHub.eu FAIR workflow registry
Carole Goble, The University of Manchester / ELIXIR-UK / EOSC-Life
And the WorkflowHub Club
carole.goble@manchester.ac.uk
FAIR Computational Workflows Session, 30th November 2020, FAIR Convergence Symposium
https://workflowhub.eu
Beta Release September 2020
FIND and ACCESS Workflows
The workflows registered are INTEROPERABLE and REUSABLE
Workflows are FAIR and the Registry is FAIR too.
Workflows are FAIR objects in their own right.
A Registry for Computational Workflows.
https://workflowhub.eu
A Registry for Computational Workflows.
Workflow Management System agnostic.
Workflows may remain in their native repositories in their native form.
Open to workflows from all disciplines and any country.
Sponsored by the European Life Science community.
The WorkflowHub Club open community.
Beta Release September 2020
FAIR and richly featured
Registry
Entry
Nextflow
Native form
Common Workflow Language
Workflows organized by:
Teams
Collections
Properties
Search & Browsing
Makers are custodians of their own workflows
Preserve personal attribution, affiliations and contribution credit
70 workflows
29 COVID-19 workflows
85 people registered
15 countries
32 organisations
26 teams
Submitters: credited for registering and curating entries
Contributors: credited for developing workflows
Scripts
Any WfMS
FAIR Machine Processable Metadata
Metadata about a workflow
Schema.org markup
Canonical description of the workflow
Links to containerized tools
Alongside native descriptions
Metadata for organizing & packaging
components of a workflow
An exchange format for workflows
Collection of files and file ids (urls)
Simple upload, download and access
Testing & Monitoring Systems
Search and Launch from within WfMS
TRS API
Register (push) / Harvest (pull)
Other registries
Partnering with specific workflow management systems for advanced integration and rich features
Your WfMS here
snapshots
versions
provenance
identifiers
citation
referencing
community standards
common
registration
metadata
canonical workflow descriptions & mark-up
licenses
analytics
API
supplementary materials
test, example data
documentation
Links to monitoring & testing services
packaging
LIVING CONTENT
Workflow registry
open for business!
https://workflowhub.eu
WorkflowHub.eu Open for Business!
We gratefully acknowledge the WorkflowHub Club, Bioschemas Group, RO-Crate community, CWL Community and our WfMS partners in Galaxy, Snakemake, Nextflow, CWL, SCIPION, NMRPipe
Packaging workflows with RO-Crate
Stian Soiland-Reyes
The University of Manchester, BioExcel, ELIXIR-UK
University of Amsterdam
This work is licensed under a Creative Commons Attribution 4.0 International License.
This work is funded by the European Union contracts:
BioExcel CoE: H2020-INFRAEDI-02-2018-823830, H2020-EINFRA-2015-1-675728
EOSC-Life: H2020-INFRAEOSC-2018-2-824087
Annual reminder on FAIR principles
Interoperable
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
Findable
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
Reusable
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
Accessible
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available
tl;dr: machine-readable metadata
Best practices for �workflow reproducibility
Methods
(..)
De novo assembly and binning
Raw reads from each run were first assembled with SPAdes v.3.10.0 [20] with option --meta [21]. Thereafter, MetaBAT 2 [15] (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.16 [45] and then calculating the corresponding read depths of each individual contig with samtools v.1.5 [46] (‘samtools view -Sbu’ followed by ‘samtools sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.7 [22] using the lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.2 [47] (options -Z 1000 --hmmonly --cut_ga) using the Rfam [48] covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v.2.0 [49] using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards [23] (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥50% completeness and <10% contamination). (...)
(..)
Assignment of MAGs to reference databases
Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG [8]. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 2018 [13,16,17,18,19], including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database [52]. For each database, the function ‘mash sketch’ from Mash v.2.0 [53] was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with ‘mash dist’ to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.23 [54] to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.
(..)
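The excerpt above defines a quality score (QS = completeness − 5 × contamination) and the MIMAG quality tiers entirely in prose. As a sketch of what that classification logic looks like when made executable (the function names are illustrative, not from the paper):

```python
def quality_score(completeness: float, contamination: float) -> float:
    """QS as defined in the methods text: completeness - 5 x contamination."""
    return completeness - 5 * contamination

def mimag_class(completeness, contamination, has_rrna_5s_16s_23s=False, n_trnas=0):
    """Classify a MAG per the MIMAG criteria quoted above.

    high:   >90% completeness, <5% contamination, 5S/16S/23S rRNAs present,
            and at least 18 tRNAs
    medium: >=50% completeness and <10% contamination
    """
    if (completeness > 90 and contamination < 5
            and has_rrna_5s_16s_23s and n_trnas >= 18):
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"

print(quality_score(95.0, 1.0))          # 90.0
print(mimag_class(95.0, 1.0, True, 20))  # high
print(mimag_class(60.0, 4.0))            # medium
```

The point of the slide stands: when the method lives only in prose like the excerpt, every reader must re-derive logic like this by hand; a workflow captures it once.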
Best practices for �workflow reproducibility
Semantic Web world vs Real World
Peter Sefton at Open Repositories 2019
Excessive FAIR considered dangerous for your health.
2018 reboot → RO-Crate
RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata.
It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.
RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see RO-Crate background.
The RO-Crate specification details how to capture a set of files and resources as a dataset with associated metadata – including contextual entities like people, organizations, publishers, funding, licensing, provenance, workflows, geographical places, subjects and repositories.
A growing list of RO-Crate tools and libraries simplify creation and consumption of RO-Crates, including the graphical interface Describo.
Join the RO-Crate community to help shape the specification or get help with using it!
2018 reboot: Building the RO-Crate Community
Monthly telecons (4th Thursday of the month), everyone welcome!
https://s.apache.org/ro-crate-minutes
RO-Crate Specification
Best Practice Guidance, not strict specifications
Developer-friendly rather than semantic correctness
Focus on JSON, but gradual path to extensibility with Linked Data (example: how to do ad-hoc terms)
Opinionated profile of schema.org
Example-driven documentation, not strict schemas
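To make the "opinionated profile of schema.org" concrete, here is a sketch of a minimal `ro-crate-metadata.json`, built as a Python dict. The overall shape (context, metadata descriptor, root `./` Dataset) follows the RO-Crate 1.1 specification; the file names and the workflow entity are illustrative:

```python
import json

# Minimal RO-Crate metadata: a flattened JSON-LD graph using the
# schema.org-based RO-Crate context (structure per RO-Crate 1.1).
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # The metadata file describes the root dataset ("./")
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        },
        {   # The root dataset lists its parts and a usage license
            "@id": "./",
            "@type": "Dataset",
            "name": "Example workflow crate",
            "hasPart": [{"@id": "workflow.cwl"}],  # illustrative file name
            "license": "https://spdx.org/licenses/Apache-2.0",
        },
        {   # A workflow file, typed per the workflow profile
            "@id": "workflow.cwl",
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
            "name": "Example CWL workflow",
        },
    ],
}

print(json.dumps(crate, indent=2))
```

This is what "developer-friendly rather than semantic correctness" buys: the file is plain JSON a developer can write by hand, yet still valid JSON-LD for Linked Data tooling.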
Workflows in RO-Crate
Separation of concern
Interoperability, Explainability
Software packaging
Tool registries
Distribution, Packaging
Storage, Repositories �(including the Web!)
Describing, Relating, Typing, Contextualizing
Interactivity, Scalability
Challenges:
Packaging zoo - how to choose?
Using “just enough” of the stack
Lossy interoperability between layers
Avoid everyone concluding
“I’ll just make my own JSON/API”
http://
Identifiers (incl URIs)
urn:uuid:1ca3b9dc-a97c-408d-ab1c-8431909e343a
manifest-sha512.txt
a0ae93…77fb data/ro-crate-metadata.json
e5fec4…500b data/ro-crate-preview.html
a2f562…f3fa data/workflow.cwl
481bb7…10b7 data/chipseq_20200910.json
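The manifest lines above follow the BagIt-style `<checksum>  <path>` layout. A sketch of verifying such a `manifest-sha512.txt` with the standard library only (the helper name is mine):

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest_path: str) -> dict:
    """Check each 'checksum  path' line of a BagIt-style manifest.

    Returns {path: True/False}; missing files count as False.
    """
    base = Path(manifest_path).parent
    results = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = base / rel_path
        if not target.is_file():
            results[rel_path] = False
            continue
        actual = hashlib.sha512(target.read_bytes()).hexdigest()
        results[rel_path] = (actual == expected)
    return results
```

Fixity checks like this are what make an exchanged crate trustworthy: any consumer can confirm the payload files are the ones the publisher described.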
What is “the workflow”?
Same conceptual workflow; multiple executable flavours for different workflow engines and specific use-cases (e.g. COVID-19)
Tooling!
How can I use it?
While we’re mostly focusing on the specification, some tools already exist for working with RO-Crates:
These applications use or expose RO-Crates:
Software Libraries
from rocrate import rocrate_api

# Workflow and extra file paths
wf_path = "test/test-data/Genomics-4-PE_Variation.ga"
extra_files = ["test/test-data/extra_file.txt"]

# Create base package
wf_crate = rocrate_api.make_workflow_rocrate(workflow_path=wf_path,
                                             wf_type="Galaxy",
                                             include_files=extra_files)

# Add authors info
author_metadata = {'name': 'Jaco Pastorius'}
jaco = wf_crate.add_person('#jaco', author_metadata)
wf_crate.creator = jaco

# Write to zip file
out_path = "/home/test_user/wf_crate.zip"
wf_crate.write_zip(out_path)
FAIR is not just machine-readable!
/
ro-crate-metadata.json
ro-crate-preview.html
nextflow.log
results/
Timeline
History:
2018-10 RO-Lite conceived at IEEE eScience 2018
2019-02 RO-Lite 0.1 drafted
2019-11 RO-Crate 1.0 released
2020-10 RO-Crate 1.1 released
Next:
Workflow Run Provenance
Workflow Run job & test data
Containers and clouds
RO-Crate in Zenodo?
More tooling!
More in session 260: FAIR Data Provenance
Wednesday 15:00-17:00 UTC
FAIR Principles 4 Workflows with CWL and friends
Michael R. Crusoe
Common Workflow Language Project Lead
Annual reminder on FAIR principles
Who will issue identifiers for workflows?
Should this be a Workflow-specific service?
Yes, Zenodo will give us a DOI for free, but I can’t query a Zenodo DOI for workflow metadata.
Nor can I use a Zenodo DOI to immediately get a URL to download the workflow. However, if the workflow is stored inside a Workflow RO-Crate, then the DOI becomes actionable.
What API should we use to query workflow registries?
The Global Alliance for Genomics and Health (GA4GH) has specified the Tool Registry Service (TRS) API, implemented by dockstore.org and workflowhub.eu.
Are the needs and assumptions of a human health association the correct ones for all scientific/research workflow users?
GA4GH is working through the complexities of Authorization and Authentication (FAIR Principle A1.2)
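The TRS API mentioned above is a plain HTTP/JSON interface, so querying a registry for its workflows needs nothing beyond the standard library. A sketch (the `/tools` path is from GA4GH TRS v2; the WorkflowHub base URL is an assumption about that particular deployment, and the helper names are mine):

```python
import json
import urllib.request

# Deployment-specific assumption for WorkflowHub's TRS endpoint
TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"

def trs_tools_url(base_url: str, limit: int = 10) -> str:
    """Build the GA4GH TRS v2 'list tools' URL (GET /tools)."""
    return f"{base_url.rstrip('/')}/tools?limit={limit}"

def list_trs_tools(base_url: str = TRS_BASE, limit: int = 10):
    """Fetch tool entries from a TRS endpoint. Each entry carries an id,
    name and versions that a client can resolve to a workflow descriptor."""
    with urllib.request.urlopen(trs_tools_url(base_url, limit)) as resp:
        return json.load(resp)
```

Because two independent registries implement the same API, a client written against one can find and access workflows in the other, which is the F and A of FAIR at the registry level.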
Requests for Workflow Environment Developers
Please
FAIR Principles & Workflows with CWL and friends
CWL + WorkflowHub.eu + CWLProv + GA4GH TRS + schema.org can be used to support all 17 of the FAIR principles — BUT fully realizing them is not automatic!
What about when WorkflowHub.eu content and API evolve? (We should archive the RO-Crates.)
What if workflow authors don’t self-annotate sufficiently?
What if schema.org moves the canonical location of their RDF again?
Let’s work together to answer these questions!
OpenML
Sharing and reproducing
machine learning experiments
Organizing and automating machine learning
Joaquin Vanschoren and the OpenML team
Machine Learning: art or science?
data engineers
Process with many actors and tools (model lifecycle)
raw data
predict
ML engineers
Deployment
train
preprocess
What if…
we could organize the world’s machine learning information
and make it universally accessible and useful?
Machine Learning (FAIR) objects
Datasets
(+ meta-data)
Flows
(algorithm meta-data)
Runs
(model meta-data)
Tasks
(problem meta-data)
how to evaluate
dependencies,
detailed structure
(pipelines, neural nets)
configurations,
evaluations
file, url, version
Come with a lot of meta-data
Share and rediscover all used dataset, flows, and runs
For every model, get the exact dataset and algorithms used
How do I reproduce this result?
Reproducibility
metadata
metadata
metadata
How do I collect all this metadata?
System of execution
Manual annotations. Easy, but heterogeneous
System of record
meta-database
experiments, projects, models,…
visualizations, search
log_param('x', 1)
log_data_version('x', 1)
log_metric('x', 1)
Auto-annotation. Requires APIs, tool integration
meta-database
object store
experiments, projects,
visualizations, search
OpenML API
tool integrations
auto-log on demand
Also indexes all datasets, flows, runs
OpenML: A global machine learning lab
OpenML
Notebooks
Local apps
Cloud jobs
REST API
APIs in Python, R, Java,...
Web UI
data.publish()
pipeline.publish()
run.publish()
import via API
openml.get_data(1)
openml.get_flow(1)
openml.get_run(1)
All (meta)data is collected and organized automatically
Frictionless machine learning
Share from where you create
Import into your favorite working environment (in uniform formats)
Run wherever you want
data.publish()
OpenML
get_dataset(id)
run.publish()
Web UI (openml.org) beta: new.openml.org
datasets
flows (pipelines)
runs (performance)
accuracy
Tool integrations
from sklearn import ensemble
from openml import tasks, runs
clf = ensemble.RandomForestClassifier()
task = tasks.get_task(3954)
run = runs.run_model_on_task(clf, task)
run.publish()
More examples on https://docs.openml.org/Python-examples/
Tool integrations
import torch.nn
from openml import tasks, runs

model = torch.nn.Sequential(
    processing_net, features_net, results_net)
task = tasks.get_task(3954)
run = runs.run_model_on_task(model, task)
run.publish()
Full example on https://openml.github.io/blog/
Reproduce flows (automagically)
pipeline = sklearn.make_pipeline(...)
flow = sklearn.model_to_flow(pipeline)
run = openml.run_model_on_task(pipeline, task)
id = run.publish()

run = openml.get_run(id)
pipeline = openml.get_flow(run.flow_id, reinstantiate=True)
150,000+ yearly users
8,000+ registered contributors
500+ publications
20,000+ datasets
8,000+ flows
10,000,000+ runs
OpenML Community
Thanks to the entire OpenML star team
Jan van Rijn
Matthias Feurer
Heidi Seibold
Bernd Bischl
Andreas Müller
Erin LeDell
Giuseppe Casalicchio
Michel Lang
Pieter Gijsbers
Sahithya Ravi
Bilge Celik
Prabhant Singh
Janek Thomas
and many more!
Marcel Wever
Neil Lawrence
Markus Weimer
FAIR Workflow Traces for Scientific Workflow Research and Development
In collaboration with: Tainã Coleman, Loïc Pottier, Henri Casanova, and Ewa Deelman
WorkflowHub is a community framework that provides a collection of tools for analyzing workflow execution traces, producing realistic synthetic workflow traces, and simulating workflow executions
Concept
Traces
Collection of open access workflow traces from a production workflow system
This collection of workflow traces forms an initial set of small- and large-scale workflow configurations
Python Package
An open-source Python package to analyze traces and generate representative synthetic traces in the same format
Analyses can be performed to produce statistical summaries of workflow performance characteristics
WorkflowHub’s Python package attempts to fit data with 23 probability distributions provided as part of SciPy’s statistics submodule
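A simplified sketch of that fitting step: fit each candidate `scipy.stats` distribution to the observed values (e.g. task runtimes) and keep the one with the smallest Kolmogorov-Smirnov statistic. The real package tries 23 distributions; this toy version tries three, and the helper name is mine:

```python
import numpy as np
from scipy import stats

def best_fit(samples, candidates=("norm", "expon", "uniform")):
    """Return the name of the scipy.stats distribution that best fits
    the samples, judged by the Kolmogorov-Smirnov statistic."""
    best_name, best_ks = None, float("inf")
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(samples)  # maximum-likelihood parameter estimates
        ks = stats.kstest(samples, name, args=params).statistic
        if ks < best_ks:
            best_name, best_ks = name, ks
    return best_name

rng = np.random.default_rng(42)
print(best_fit(rng.normal(loc=10, scale=2, size=1000)))
```

The fitted distribution (name plus parameters) is what a workflow recipe stores, so the trace generator can later draw realistic synthetic task runtimes from it.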
Trace Generator
The WorkflowHub package provides a number of workflow recipes for generating realistic synthetic workflow traces
Accuracy
Questions (5 min)
Session 2 - Feedback from workflow users & developers
Jupyter and Galaxy
15:35 - 16:45 Feedback from workflow users & developers - 5-minute pitch talks
Chair: Sarah Cohen-Boulakia
Jupyter Notebooks
Galaxy
5 mins questions at the end - put your questions in the chat
From Notebooks to FAIR Workflows
Daniel Garijo
Information Sciences Institute and�Department of Computer Science
http://w3id.org/people/dgarijo
@dgarijov�dgarijo@isi.edu
Computational Notebooks
Figure source: https://towardsdatascience.com/the-complete-guide-to-jupyter-notebooks-for-data-science-8ff3591f69a4
The good:
The ugly:
See “I don’t like notebooks”, by Joel Grus
See “I like notebooks”, by Jeremy Howard
Notebooks and FAIR support
Principle | Support (Notebook) | Notebook Metadata
FINDABLE | N/A | No standard metadata
ACCESSIBLE | Markdown, JSON | N/A
INTEROPERABLE | Python code is usually compatible between notebooks (assuming the same dependencies are available) | N/A
REUSABLE | Notebooks are usually explained by combining visualization and markdown narrative. Provenance traces are often included. | N/A
When dependencies are available
From Notebooks to FAIR workflows
Component
i1
i2
o1
Caveat: the notebook is converted into a black box; its dataflow is lost.
Why?
How?
From Notebooks to FAIR workflows
Ideally: Capture the dataflow in a notebook and modularize cells
NiW: Converting Notebooks into Workflows to Capture Dataflow and Provenance. Carvalho, L. A. M. C.; Wang, R.; Gil, Y.; and Garijo, D. In Proceedings of the Workshop on Capturing Scientific Knowledge (SciKnow), held in conjunction with the ACM International Conference on Knowledge Capture (K-CAP), Austin, Texas, 2017.
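The idea of capturing a notebook's dataflow can be sketched with the standard `ast` module: for each cell, record which names it defines and which it reads, then link a cell to the earlier cell that produced each name it reads. This is only a toy approximation of the NiW approach, and the cell contents are illustrative:

```python
import ast

def cell_io(code: str):
    """Return (defined, used) variable names for one code cell."""
    tree = ast.parse(code)
    defined, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return defined, used

def dataflow(cells):
    """Edges (producer_cell, consumer_cell, variable) between cells."""
    edges, producers = [], {}
    for i, code in enumerate(cells):
        defined, used = cell_io(code)
        for name in used:
            if name in producers:
                edges.append((producers[name], i, name))
        for name in defined:
            producers[name] = i  # later cells read the latest definition
    return edges

cells = ["data = [1, 2, 3]",
         "total = sum(data)",
         "print(total)"]
print(dataflow(cells))  # [(0, 1, 'data'), (1, 2, 'total')]
```

Edges like these are exactly what turns a linear notebook into workflow components with explicit inputs and outputs, instead of the black box in the previous slide.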
Bridging the gap: From notebooks to FAIR workflows
How to ensure we can transform notebooks into FAIR workflows?
[1] ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility. Sheeba Samuel and Birgitta König-Ries. 17th International Semantic Web Conference (ISWC) Demo Track 2018
Complex synchronous and asynchronous workflows
from metagenomics and COVID-19 research to drug design and climate science.
Björn Grüning - ELIXIR Germany
www.elixir-europe.org
A Data Analysis Gateway for Everyone!
Web-based User Interface (+ API access)
Data Management
From Tools to (async) Workflows
Combine 2,500 apps into powerful workflows
Schema.org-compliant microdata can be attached to tools and workflows.
Synchronous Workflows
covid19.galaxyproject.org
Genomics
419
23K+
Exporting provenance data from Galaxy
Experiences from trying to …
Ignacio Eguinoa - ELIXIR Belgium
www.elixir-europe.org
Galaxy as a workflow execution platform
Save and export the details after a workflow execution
How to capture this provenance data and metadata?
Relevant aspects
Galaxy execution model: What Galaxy does when you run a tool
Separation between the GUI and the command line executable
Galaxy containers
and its applications
Bert Droesbeke
Galaxy - training - containers
docker run -p 8080:80 bgruening/galaxy-stable
training.galaxyproject.org/training-material
github.com/bgruening/docker-galaxy-stable
The concept
github.com/ELIXIR-Belgium/BioContainers_for_training
Applications
training.galaxyproject.org/training-material
covid19.galaxyproject.org
github.com/ELIXIR-Belgium/ena-upload-container
doi.org/10.1371/journal.pcbi.1005616
ENA-upload container
docker run -p "8080:80" --privileged quay.io/galaxy/ena-upload
rdm.elixir-belgium.org/covid-19
We’ve got you covered
Questions (5 min)
Session 2 - Feedback from workflow users & developers
Nextflow and Snakemake
Chair: Sarah Cohen-Boulakia
Nextflow
Snakemake
5 mins questions at the end - put your questions in the chat
Nextflow-based reproducible indexing for FAIDARE, a FAIR plant data discovery portal
Cyril Pommier & Célia Michotey
INRAE, ELIXIR-FR
November 30 2020 / Cyril Pommier & Célia Michotey
FAIDARE: Plant data discovery web portal
IBET
INRA
VIB
EBI
Data Harvester
Swagger
MCPD
Extract
Transform
Load
Data harvester
Feedback: pros
Feedback: cons
Perspectives
FAIR phylogenetic workflows with Nextflow?
International FAIR convergence Symposium 2020
FAIR Workflows Session
Frédéric Lemoine
Evolutionary Bioinformatics
11/30/2020
Phylogenetics
What is Phylogenetics?
103 | F. Lemoine | FAIR phylogenetic workflows | 30/11/2020
Used in many fields:
Phylogenetic workflow skeleton
Sequences
Alignment
Tree
Genbank
GISAID
Uniprot
orthoDB
TreeBase
…
…
EASY?
iTOL
Usual characteristics
- Many tools (PhyML, Mafft, Blast, etc.), with tons of parameters (model parameters, etc.)
- Many formats (Phylip, Fasta, Newick, PhyloXML, Nexus, etc.)
- Lots of data manipulation (reformatting, transforming, filtering, etc.), sometimes manual
- Lots of “dedicated” scripts (Python, bash, awk, Perl, etc.): methods are rarely fully described
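The "lots of data manipulation" bullet usually means small format-shuffling scripts like this one, converting an aligned FASTA file to relaxed sequential PHYLIP with the standard library. This is a toy illustration, not a script from any specific pipeline:

```python
def fasta_to_phylip(fasta_text: str) -> str:
    """Convert aligned FASTA to relaxed (sequential) PHYLIP.

    Assumes all sequences have equal length, as in a multiple alignment.
    """
    records, name = {}, None
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]   # keep the identifier only
            records[name] = []
        elif name is not None:
            records[name].append(line.strip())
    seqs = {n: "".join(parts) for n, parts in records.items()}
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned (unequal lengths)")
    header = f"{len(seqs)} {lengths.pop()}"
    body = "\n".join(f"{n}  {s}" for n, s in seqs.items())
    return header + "\n" + body

fasta = ">seq1 human\nACGT\n>seq2 mouse\nAC-T\n"
print(fasta_to_phylip(fasta))
```

Scripts like this rarely appear in methods sections, which is precisely why phylogenetic analyses are hard to reproduce without a workflow that pins them down.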
FAIR?
- Data (input & output): sometimes
- Workflows: rarely
Architecture
Workflow:
Data:
https://github.com/evolbioinfo/
https://hub.docker.com/orgs/evolbioinfo
~100 tools
~200 images
Phylogenetic workflows examples / a first step
BOOSTER* (Nextflow):
* https://booster.pasteur.fr, Nature 2018
Phylogenetic workflows examples
Covid-Align* (Nextflow): accurate multiple alignment of SARS-CoV-2 sequences
* https://covalign.pasteur.cloud, Bioinformatics 2020
What is missing?
The workflows developed are on their way to being FAIR, but challenges remain:
Institut Pasteur
25-28 rue du docteur Roux
75724 Paris Cedex 15
https://research.pasteur.fr/fr/team/evolutionary-bioinformatics/
Thank you!
Evolutionary Bioinformatics Unit
Towards making produced data more findable and reusable
Knowledge Graphs, workflow provenance, and tools registries�
Alban Gaignard (1), Hala Skaf-Molli (2), Khalid Belhajjame (3)
(1) institut du thorax, CNRS
(2) LS2N, Nantes University
(3) LAMSADE, Paris-Dauphine University, PSL

International FAIR Convergence Symposium (FAIR Workflows Session)
Life scientists often say “it’s easier to reprocess”... so how can we better reuse data?
Workflows to the rescue
Snakemake workflow engine
What about Snakemake + provenance?
Genericity: can be an advantage, but there are strong limitations when considering community terms such as “gene expression level” or “reference genome”
Fine-grained: large graphs reporting each tool execution and consumed/produced data
PROV-O W3C standard
Automating FAIRification of life science data?
Methods and tools: graph pattern matching, inference rules, SPARQL, Python, Jupyter
http://www.semantic-web-journal.net/system/files/swj2257.pdf (10.3233/SW-200374)
“Which was the reference genome used to produce this VCF file?”
“Which subset of the data should be re-investigated?”
A Gaignard, H Skaf-Molli, K Belhajjame Findable and reusable workflow data products: A genomic workflow case study. Semantic Web Journal. https://doi.org/10.3233/SW-200374
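Questions like “which reference genome produced this VCF file?” amount to traversing PROV-O `wasDerivedFrom` edges. A minimal stdlib sketch over toy provenance triples (the actual work queries RDF graphs with SPARQL, which this only imitates; the entity names are illustrative):

```python
# Toy provenance graph: (entity, "prov:wasDerivedFrom", source) triples,
# standing in for the PROV-O RDF graph queried with SPARQL in the paper.
triples = [
    ("calls.vcf", "prov:wasDerivedFrom", "aligned.bam"),
    ("aligned.bam", "prov:wasDerivedFrom", "reads.fastq"),
    ("aligned.bam", "prov:wasDerivedFrom", "GRCh38.fa"),
]

def derivation_sources(entity, triples):
    """All transitive prov:wasDerivedFrom ancestors of an entity."""
    sources, frontier = set(), [entity]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "prov:wasDerivedFrom" and o not in sources:
                sources.add(o)
                frontier.append(o)
    return sources

# "Which reference genome was used to produce this VCF file?"
print(sorted(derivation_sources("calls.vcf", triples)))
# → ['GRCh38.fa', 'aligned.bam', 'reads.fastq']
```

With domain annotations attached (e.g. typing `GRCh38.fa` as a “reference genome”), the same traversal answers the community-level questions quoted above rather than just file-level lineage.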
Semantic tools registries
Bio.Tools + EDAM ontology
Example:
Machine- and human-oriented summaries
Conclusion and perspectives
Acknowledgments
Audrey Bihouée, �Institut du Thorax, �BiRD Bioinformatics facility, University of Nantes
Hala Skaf-Molli, �University of Nantes, LS2N
Khalid Belhajjame, LAMSADE, University of Paris-Dauphine, PSL
Questions?
alban.gaignard@univ-nantes.fr
Questions (5 min)
Wrap up
Thanks to all the speakers!
Thanks to all the participants!
Slides (https://tinyurl.com/FAIRWFSlides) and the recording will be made available.