1 of 119

Session 3 -

Wooclap & Panelist Questions

→ Please go to https://app.wooclap.com/FAIRWF

2 of 119

FAIR Computational workflows

International FAIR Convergence Symposium 2020

Sarah Cohen-Boulakia*, Daniel Garijo**, Carole Goble***

* Université Paris-Saclay, France
** Information Sciences Institute, University of Southern California
*** The University of Manchester, UK

3 of 119

Welcome!

Pipelines are designed to analyse scientific datasets.

Scientific workflow systems have been developed to assist users.

  • Workflows should be FAIR in their own right, have an important role to play in making data FAIR and need access to FAIR data
  • What are the tools offered today to help produce FAIR data and develop FAIR workflows?
  • What is the feedback of scientific workflow users? Current challenges?
  • Pitch talks then Wooclap Questions

279 registrations!

4 of 119

Housekeeping

Cameras on for speakers and chairs only. Please turn off your camera unless presenting.

The presentations will be shared and the session is being recorded.

You are all muted by default: to ask a question use the chat AND raise your hand. The host can unmute you during the discussion.

Please use the chat for discussion; mark your point as QUESTION if you are addressing it to the panellists.

If you are tweeting use the hashtag #FAIRconvergence


5 of 119

Session 1 - Tools for FAIR workflows

Chair : Daniel Garijo

  • 15.05 Carole Goble (ELIXIR-UK; EOSC-Life; The University of Manchester): The WorkflowHub.eu FAIR workflow registry
  • 15.10 Stian Soiland-Reyes (BioExcel; The University of Manchester; UvA): RO-Crate for FAIR workflow packaging
  • 15.15 Michael R. Crusoe (CWL; VU Amsterdam, ELIXIR-NL): The CWL Standards support the FAIR Principles
  • 15.20 Joaquin Vanschoren (OpenML community): Sharing machine learning workflows with OpenML
  • 15.25 Rafael Ferreira da Silva (ISI, USC): FAIR Workflow Traces for Scientific Workflow Research and Development (workflowhub.org)

5 mins questions at the end - put your questions in the chat

6 of 119

The WorkflowHub.eu FAIR workflow registry

Carole Goble, The University of Manchester / ELIXIR-UK / EOSC-Life

And the WorkflowHub Club

carole.goble@manchester.ac.uk

FAIR Computational Workflows Session, 30th November 2020, FAIR Convergence Symposium

7 of 119

https://workflowhub.eu

Beta Release September 2020

FIND and ACCESS Workflows

The workflows registered are INTEROPERABLE and REUSABLE

Workflows are FAIR and the Registry is FAIR too.

Workflows are FAIR objects in their own right.

A Registry for Computational Workflows.

8 of 119

https://workflowhub.eu

A Registry for Computational Workflows.

Workflow Management System agnostic.

Workflows may remain in their native repositories in their native form.

Open to workflows from all disciplines and any country.

Sponsored by the European Life Science community.

The WorkflowHub Club open community.

Beta Release September 2020

9 of 119


FAIR and richly featured registry entry (screenshot): a Nextflow workflow in its native form, alongside a Common Workflow Language description.

10 of 119

Workflows organized by:

Teams

Collections

Properties

  • Tags
  • Type
  • Status
  • Dates, etc.

Search & Browsing

Makers are custodians of their own workflows

Preserve personal attribution, affiliations and contribution credit

11 of 119

70 workflows · 29 COVID-19 workflows · 85 people registered · 15 countries · 32 organisations · 26 teams

Submitters: credited for registering and curating entries

Contributors: credited for developing workflows

Scripts and any WfMS

12 of 119

FAIR Machine Processable Metadata

Metadata about a workflow: Schema.org markup

Canonical description of the workflow: links to containerized tools, alongside native descriptions

Metadata for organizing & packaging the components of a workflow: an exchange format for workflows
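For illustration, the Schema.org mark-up for a registered workflow might look like the sketch below, written here as a Python dictionary. The property names follow the Bioschemas ComputationalWorkflow profile; the values and the exact field selection are illustrative assumptions, not an actual WorkflowHub record.

import json

# Illustrative sketch (not an actual WorkflowHub record): Schema.org / Bioschemas-style
# metadata for a registered workflow, serialised as JSON-LD.
workflow_metadata = {
    "@context": "https://schema.org",
    "@type": "ComputationalWorkflow",                       # Bioschemas ComputationalWorkflow profile
    "name": "Example variant-calling workflow",              # hypothetical name
    "programmingLanguage": "Nextflow",                       # native workflow language
    "creator": {"@type": "Person", "name": "Jane Researcher"},  # hypothetical author
    "license": "https://spdx.org/licenses/Apache-2.0",
    "version": "1.0",
    "url": "https://workflowhub.eu/workflows/1",              # hypothetical registry entry
}

print(json.dumps(workflow_metadata, indent=2))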

13 of 119

[Diagram] Simple registration: a collection of files and file ids (URLs), with upload, download and access.

Links to Testing & Monitoring Systems.

TRS API: Search and Launch from within WfMS; Register (push) / Harvest (pull) with other registries.

14 of 119

Partnering with specific Workflow Management Systems for advanced integration and rich features

Your WfMS here

15 of 119

snapshots · versions · provenance · identifiers · citation · referencing · community standards · common registration metadata · canonical workflow descriptions & mark-up · licenses · analytics · API · supplementary materials · test & example data · documentation · links to monitoring & testing services · packaging

LIVING CONTENT

16 of 119


Workflow registry open for business!

https://workflowhub.eu

17 of 119

WorkflowHub.eu Open for Business!

We gratefully acknowledge the WorkflowHub Club, Bioschemas Group, RO-Crate community, CWL Community and our WfMS partners in Galaxy, Snakemake, Nextflow, CWL, SCIPION, NMRPipe

18 of 119

Packaging workflows with RO-Crate

Stian Soiland-Reyes, The University of Manchester, BioExcel, ELIXIR-UK

University of Amsterdam

This work is funded by the European Union contracts BioExcel CoE H2020-INFRAEDI-02-2018-823830, H2020-EINFRA-2015-1-675728, and EOSC-Life H2020-INFRAEOSC-2018-2-824087

19 of 119

Annual reminder on FAIR principles

Interoperable

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

Findable

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

Reusable

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

Accessible

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

tl;dr: machine-readable metadata

20 of 119

Best practices for workflow reproducibility

Methods

(..)

De novo assembly and binning

Raw reads from each run were first assembled with SPAdes v.3.10.0 with option --meta. Thereafter, MetaBAT 2 (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.16 and then calculating the corresponding read depths of each individual contig with samtools v.1.5 (‘samtools view -Sbu’ followed by ‘samtools sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.7 using the lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.2 (options -Z 1000 --hmmonly --cut_ga) using the Rfam covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v.2.0 using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination). (...)

(..)

Assignment of MAGs to reference databases

Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 2018, including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database. For each database, the function ‘mash sketch’ from Mash v.2.0 was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with ‘mash dist’ to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.23 to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.

(..)

21 of 119

Best practices for workflow reproducibility

22 of 119

23 of 119

Semantic Web world vs Real World

Peter Sefton at Open Repositories 2019

Excessive FAIR considered dangerous for your health.

24 of 119

2018 reboot → RO-Crate

RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata.

It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.

RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see RO-Crate background.

The RO-Crate specification details how to capture a set of files and resources as a dataset with associated metadata – including contextual entities like people, organizations, publishers, funding, licensing, provenance, workflows, geographical places, subjects and repositories.

A growing list of RO-Crate tools and libraries simplify creation and consumption of RO-Crates, including the graphical interface Describo.

Join the RO-Crate community to help shape the specification or get help with using it!

25 of 119

2018 reboot: Building the RO-Crate Community

Monthly telcons (4th Thursday of the month), everyone welcome! https://s.apache.org/ro-crate-minutes

26 of 119

RO-Crate Specification

27 of 119

Best Practice Guidance, not strict specifications

Developer-friendly rather than semantic correctness

Focus on JSON, but gradual path to extensibility with Linked Data (example: how to do ad-hoc terms); a minimal metadata skeleton is sketched below

Opinionated profile of schema.org

Example-driven documentation, not strict schemas
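For illustration, a minimal ro-crate-metadata.json skeleton can be built as a plain Python dictionary, as sketched below. File names and the author entity are placeholders; the structure follows the RO-Crate 1.1 layout (a metadata file descriptor, a root dataset, plus data and contextual entities).

import json

# Minimal RO-Crate 1.1 metadata skeleton; file names and people are placeholders.
ro_crate_metadata = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # descriptor for the metadata file itself
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root dataset: the crate's payload
            "@id": "./",
            "@type": "Dataset",
            "name": "Example crate",
            "hasPart": [{"@id": "workflow.cwl"}],
            "author": {"@id": "#alice"},
        },
        {"@id": "workflow.cwl", "@type": "File"},                        # a data entity
        {"@id": "#alice", "@type": "Person", "name": "Alice Example"},   # a contextual entity
    ],
}

print(json.dumps(ro_crate_metadata, indent=2))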

28 of 119

Workflows in RO-Crate

29 of 119

Separation of concerns

Interoperability, Explainability

Software packaging

Tool registries

Distribution, Packaging

Storage, Repositories (including the Web!)

Describing, Relating, Typing, Contextualizing

Interactivity, Scalability

Challenges:

Packaging zoo - how to choose?

Using “just enough” of the stack

Lossy interoperability between layers

Avoid everyone concluding “I’ll just make my own JSON/API”

Identifiers (incl. URIs): http://… , urn:uuid:1ca3b9dc-a97c-408d-ab1c-8431909e343a

{ "@id": "https://doi.org/10.4225/59/59672c09f4a4b",
  "@type": "Dataset",
  "hasPart": [ … ]
}

manifest-sha512.txt

a0ae93…77fb data/ro-crate-metadata.json
e5fec4…500b data/ro-crate-preview.html
a2f562…f3fa data/workflow.cwl
481bb7…10b7 data/chipseq_20200910.json

30 of 119

What is “the workflow”?

Same conceptual workflow; multiple executable flavours for different workflow engines and specific use-cases (e.g. COVID-19)

31 of 119

Tooling!

How can I use it?

While we’re mostly focusing on the specification, some tools already exist for working with RO-Crates:

  • Describo interactive desktop application to create, update and export RO-Crates for different profiles. (~ beta)
  • CalcyteJS is a command-line tool to help create RO-Crates and HTML-readable rendering (~ beta)
  • ro-crate - JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta)
  • ro-crate-js - utility to render HTML from RO-Crate (~ alpha)
  • ro-crate-ruby Ruby library to consume/produce RO-Crates (~ alpha)
  • ro-crate-py Python library to consume/produce RO-Crates (~ planning)

These applications use or expose RO-Crates:

32 of 119

Software Libraries

from rocrate import rocrate_api

# Workflow and extra file paths
wf_path = "test/test-data/Genomics-4-PE_Variation.ga"
extra_files = ["test/test-data/extra_file.txt"]

# Create base package
wf_crate = rocrate_api.make_workflow_rocrate(workflow_path=wf_path,
                                             wf_type="Galaxy",
                                             include_files=extra_files)

# Add authors info
author_metadata = {'name': 'Jaco Pastorius'}
jaco = wf_crate.add_person('#jaco', author_metadata)
wf_crate.creator = jaco

# Write to zip file
out_path = "/home/test_user/wf_crate.zip"
wf_crate.write_zip(out_path)

33 of 119

FAIR is not just machine-readable!

/

ro-crate-metadata.json

ro-crate-preview.html

nextflow.log

results/

34 of 119

Timeline

History: 2018-10 RO-Lite conceived at IEEE eScience 2018

2019-02 RO-Lite 0.1 drafted

2019-11 RO-Crate 1.0 released

2020-10 RO-Crate 1.1 released

Next:

Workflow Run Provenance

Workflow Run job & test data

Containers and clouds

RO-Crate in Zenodo?

More tooling!

Next RO-Crate community call: 2021-01-07

https://www.researchobject.org/ro-crate/

More in session 260: FAIR Data Provenance

Wednesday 15:00-17:00 UTC

35 of 119

FAIR Principles 4 Workflows with CWL and friends

Michael R. Crusoe�Common Workflow Language Project Lead

36 of 119

Annual reminder on FAIR principles

Interoperable

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

Findable

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

Reusable

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

Accessible

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

37 of 119

Who will issue identifiers for workflows?

Should this be a Workflow-specific service?

Yes, Zenodo will give us a DOI for free, but I can’t query a Zenodo DOI for workflow metadata.

Nor can I use a Zenodo DOI to immediately get a URL to download the workflow.
– However, if the workflow is stored inside a Workflow RO-Crate, then the DOI becomes actionable

38 of 119

What API should we use to query workflow registries?

The Global Alliance for Genomics and Health (GA4GH) has specified the TRS (Tool Registry Service) API, implemented by dockstore.org and workflowhub.eu.

Are the needs and assumptions of a human health association the correct ones for all scientific/research workflow users?

GA4GH is working through the complexities of Authorization and Authentication (FAIR Principle A1.2)
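For illustration, listing registered workflows from a TRS v2 endpoint might look like the sketch below; the base URL and the response fields shown are assumptions based on the GA4GH TRS specification rather than a verified call.

import requests

# Hedged sketch: list tools/workflows from a GA4GH TRS v2 endpoint.
# The base URL is an assumption; dockstore.org and workflowhub.eu expose comparable endpoints.
TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"

resp = requests.get(f"{TRS_BASE}/tools", params={"limit": 10})
resp.raise_for_status()

for tool in resp.json():
    # TRS v2 Tool objects carry an id, an organization and (optionally) a name
    print(tool.get("id"), tool.get("organization"), tool.get("name"))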

39 of 119

Requests for Workflow Environment Developers

Please

  • Make it easy and attractive for your users to add metadata
  • Prompt your users to apply a software license, preferably a major and widely recognized open source license
  • Integrate with WorkflowHub.eu and domain specific registries for discovery and publishing
  • Collect provenance information (if you don’t already) and export it automatically in CWLProv format (a minimal invocation sketch follows this list)
    • If you don’t like something about CWLProv, please work with us to improve it!
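For the CWLProv request above, a minimal sketch of automatic provenance capture with cwltool is shown below; the workflow and job file names are placeholders.

import subprocess

# Hedged sketch: run a CWL workflow with cwltool and capture a CWLProv research object.
# "workflow.cwl" and "job.yml" are placeholder file names.
subprocess.run(
    [
        "cwltool",
        "--provenance", "provenance_ro/",  # folder where the CWLProv research object is written
        "workflow.cwl",
        "job.yml",
    ],
    check=True,
)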

40 of 119

FAIR Principles & Workflows with CWL and friends

CWL + WorkflowHub.eu + CWLProv + GA4GH TRS + schema.org can be used to support all 17 of the FAIR principles — BUT fully realizing them is not automatic!

What about when WorkflowHub.eu content and API evolves? (we should archive the RO-Crates)

What if workflow authors don’t self-annotate sufficiently?

What if schema.org moves the canonical location of their RDF again?

Let’s work together to answer these questions!

41 of 119

OpenML

Sharing and reproducing

machine learning experiments

Organizing and automating machine learning

Joaquin Vanschoren and the OpenML team

42 of 119

Machine Learning: art or science?

A process with many actors and tools (the model lifecycle): raw data → preprocess → train → predict → deployment, involving data engineers and ML engineers.

43 of 119

What if…

we could organize the world’s machine learning information

and make it universally accessible and useful?

44 of 119

Machine Learning (FAIR) objects

Datasets (+ meta-data): file, url, version

Tasks (problem meta-data): how to evaluate

Flows (algorithm meta-data): dependencies, detailed structure (pipelines, neural nets)

Runs (model meta-data): configurations, evaluations

All come with a lot of meta-data

45 of 119

Share and rediscover all used datasets, flows, and runs

For every model, get the exact dataset and algorithms used

How do I reproduce this result?

Reproducibility


How do I collect all this metadata?

46 of 119

System of execution

Manual annotations. Easy, but heterogeneous

System of record

meta-database

experiments, projects, models,…

visualizations, search

log_param('x', 1)
log_data_version('x', 1)
log_metric('x', 1)

47 of 119

Auto-annotation. Requires APIs, tool integration

meta-database

object store

experiments, projects,

visualizations, search

OpenML API

tool integrations

auto-log on demand

Also indexes all datasets, flows, runs

48 of 119

OpenML: A global machine learning lab

OpenML

Notebooks

Local apps

Cloud jobs

REST API

APIs in Python, R, Java,...

Web UI

data.publish()

pipeline.publish()

run.publish()

import via API

openml.get_data(1)

openml.get_flow(1)

openml.get_run(1)

All (meta)data is collected and organized automatically

49 of 119

Frictionless machine learning

Share from where you create

Import into your favorite working environment (in uniform formats)

Run wherever you want

data.publish()

OpenML

get_dataset(id)

run.publish()

50 of 119

Web UI (openml.org) beta: new.openml.org

datasets

flows (pipelines)

runs (performance)

accuracy

51 of 119

Tool integrations

from sklearn import ensemble
from openml import tasks, runs

clf = ensemble.RandomForestClassifier()
task = tasks.get_task(3954)
run = runs.run_model_on_task(clf, task)
run.publish()

More examples on https://docs.openml.org/Python-examples/

52 of 119

Tool integrations

import torch.nn
from openml import tasks, runs

model = torch.nn.Sequential(
    processing_net, features_net, results_net)
task = tasks.get_task(3954)
run = runs.run_model_on_task(model, task)
run.publish()

Full example on https://openml.github.io/blog/

53 of 119

Reproduce flows (automagically)

pipeline = sklearn.make_pipeline(...)
flow = sklearn.model_to_flow(pipeline)
run = openml.run_model_on_task(pipeline, task)
id = run.publish()

run = openml.get_run(id)
pipeline = openml.get_flow(run.flow_id, reinstantiate=True)

54 of 119

150000+ yearly users

8000+ registered contributors

500+ publications

20000+ datasets

8000+ flows

10,000,000+

OpenML Community

55 of 119

Thanks to the entire OpenML star team

Jan van Rijn

Matthias Feurer

Heidi Seibold

Bernd Bischl

Andreas Müller

Erin LeDell

Giuseppe Casalicchio

Michel Lang

Pieter Gijsbers

Sahithya Ravi

Bilge Celik

Prabhant Singh

Janek Thomas

and many more!

Marcel Wever

Neil Lawrence

Markus Weimer

56 of 119

FAIR Workflow Traces for Scientific Workflow Research and Development

In collaboration with: Tainã Coleman, Loïc Pottier, Henri Casanova, and Ewa Deelman

57 of 119

WorkflowHub is a community framework that provides a collection of tools for analyzing workflow execution traces, producing realistic synthetic workflow traces, and simulating workflow executions

Concept

58 of 119

Traces

Collection of open access workflow traces from a production workflow system

This collection of workflow traces forms an initial set of small- and large-scale workflow configurations that:

  • consume/produce large volumes of data
  • have structures that are sufficiently complex and heterogeneous

59 of 119

Python Package

An open source Python package to analyze traces and generate representative synthetic traces in the same format.

Analyses can be performed to produce statistical summaries of workflow performance characteristics.

WorkflowHub’s Python package attempts to fit data with 23 probability distributions provided as part of SciPy’s statistics submodule (a sketch of the approach follows below).
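The fitting step can be sketched directly with SciPy, as below. This illustrates the general approach (fit candidate distributions to observed task runtimes and keep the best by a goodness-of-fit statistic); it is not the package's actual implementation, and the runtime values are invented.

import numpy as np
from scipy import stats

# Illustrative only: fit a few candidate distributions to observed task runtimes
# and keep the one with the lowest Kolmogorov-Smirnov statistic.
runtimes = np.array([12.3, 14.1, 11.8, 35.2, 13.0, 12.7, 40.5, 13.9])  # made-up values

candidates = ["norm", "lognorm", "gamma"]

best_name, best_stat = None, np.inf
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(runtimes)                     # maximum-likelihood parameter fit
    ks_stat, _ = stats.kstest(runtimes, name, args=params)
    if ks_stat < best_stat:
        best_name, best_stat = name, ks_stat

print(best_name, best_stat)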

60 of 119

Trace Generator

The WorkflowHub package provides a number of workflow recipes for generating realistic synthetic workflow traces (a usage sketch follows below).
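A usage sketch of the recipe idea is shown below. The module paths, class and method names are assumptions about the package's API and should be treated as illustrative, not authoritative.

# Illustrative sketch only: module paths, class and method names are assumptions
# about the workflowhub Python package API, not verified imports.
from workflowhub import WorkflowGenerator              # hypothetical import path
from workflowhub.generator import EpigenomicsRecipe    # hypothetical recipe class

recipe = EpigenomicsRecipe.from_num_tasks(num_tasks=300)    # target size of the synthetic workflow
generator = WorkflowGenerator(recipe)
workflow = generator.build_workflow("epigenomics-synthetic")
workflow.write_json("epigenomics-synthetic.json")            # same JSON trace format as the collection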

61 of 119

Accuracy

62 of 119

Questions (5 min)

63 of 119

Session 2 - Feedback from workflow users & developers

Jupyter and Galaxy

15:35 - 16:45 Feedback from workflow users & developers - 5-min pitch talks

Chair : Sarah Cohen-Boulakia

Jupyter Notebooks

  • 15.35 Daniel Garijo (ISI, USC): From notebooks to FAIR workflows

Galaxy

  • 15.40 Björn Grüning (Galaxy Europe, ELIXIR-DE): Complex synchronous and asynchronous workflows - from Metagenomics and COVID-19 research to Drug-Design and Climate science.
  • 15.45 Ignacio Eguinoa (VIB): Managing provenance data from workflow execution using Galaxy and RO-Crate.
  • 15.50 Bert Droesbeke (VIB): Galaxy workflows for COVID e.g. cleanup of viral reads and submission to ENA, containers for training workflows

5 mins questions at the end - put your questions in the chat

64 of 119

From Notebooks to FAIR Workflows

Daniel Garijo

Information Sciences Institute and�Department of Computer Science

http://w3id.org/people/dgarijo

@dgarijov
dgarijo@isi.edu

65 of 119

Computational Notebooks


Figure source: https://towardsdatascience.com/the-complete-guide-to-jupyter-notebooks-for-data-science-8ff3591f69a4

The good:

  • Narrative, visualizations and code
  • Easy to prototype and test
  • Wide community support
  • Good to showcase usage examples

The ugly:

  • Opaque software dependencies
  • Local file dependencies and parameters
  • Hidden states and unordered cells
  • Difficult to integrate within a pipeline

See “I don’t like notebooks”, by Joel Grus

See “I like notebooks”, by Jeremy Howard

66 of 119

Notebooks and FAIR support


Principle | Support (Notebook) | Notebook Metadata
FINDABLE | N/A | No standard metadata
ACCESSIBLE | Markdown, JSON | N/A
INTEROPERABLE | Python code is usually compatible between notebooks (assuming the same dependencies are available) | N/A
REUSABLE | Notebooks are usually explained combining visualization and markdown narrative; provenance traces are often included (when dependencies are available) | N/A

67 of 119

From Notebooks to FAIR workflows


[Diagram] The notebook wrapped as a workflow component with inputs i1, i2 and output o1.

Caveat: the notebook is converted into a black box; the internal dataflow is lost.

Why?

  • Easier integration of multiple tools and environments
  • Easier to maintain and deploy
  • Clear dataflow for complex pipelines (easier to inspect)
  • Workflows can be described in community repositories (findability support)

How?

  • Use annotations to create a workflow component

68 of 119

From Notebooks to FAIR workflows


Ideally: Capture the dataflow in a notebook and modularize cells

NiW: Converting Notebooks into Workflows to Capture Dataflow and Provenance. Carvalho, L. A. M. C.; Wang, R.; Gil, Y.; and Garijo, D. In Proceedings of the Workshop on Capturing Scientific Knowledge (SciKnow), held in conjunction with the ACM International Conference on Knowledge Capture (K-CAP), Austin, Texas, 2017.

69 of 119

Bridging the gap: From notebooks to FAIR workflows

How to ensure we can transform notebooks into FAIR workflows?

  • Automate ways to enrich and annotate notebooks with metadata (e.g. [1])

  • Establish and follow good practices for software engineering:
    • Modularity of cells
    • Parametrization of notebook (description of I/O)
    • Meaningful variable names

  • Test notebook execution with Docker/Binder
    • Tests sequential execution of cells (a minimal sketch follows below)


[1] ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility. Sheeba Samuel and Birgitta König-Ries. 17th International Semantic Web Conference (ISWC) Demo Track 2018
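One concrete way to test sequential cell execution (locally or inside a Docker/Binder image) is sketched below with nbformat and nbconvert's ExecutePreprocessor; the notebook file name is a placeholder.

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Sketch: execute all cells of a notebook top-to-bottom and fail loudly if any cell errors.
# "analysis.ipynb" is a placeholder file name.
nb = nbformat.read("analysis.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})    # raises CellExecutionError on failure

nbformat.write(nb, "analysis.executed.ipynb")     # keep the executed copy as evidence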

70 of 119

Complex synchronous and asynchronous workflows

from Metagenomics and COVID-19 research to Drug-Design and Climate science.

Björn Grüning - ELIXIR Germany

www.elixir-europe.org

71 of 119

A Data Analysis Gateway for Everyone!

72 of 119

Web-based User Interface (+ API access)

73 of 119

Data Management

  • Rich, persistent metadata system
  • Histories
  • Data libraries
  • 2-stage delete process
  • Access management control
  • User quota

  • Group/role model
  • Reference data / user data
  • Data importer and data exporter
    • bag-it import
    • RO-Crate export (next talk)
    • Invenio integration WIP
  • LIMS integration (everything is scriptable)

74 of 119

From Tools to (async) Workflows

Combine 2,500+ apps into powerful workflows

Schema.org-compliant microdata can be attached to tools and workflows.

75 of 119

Synchronous Workflows

  • Notebooks can be stored as Galaxy assets
  • Notebooks can be shared and visualised like any other Galaxy object
  • Galaxy data can be used inside a Notebook

76 of 119

covid19.galaxyproject.org

  • All workflows available in WorkflowHub.eu
  • All data available on usegalaxy.eu
  • All provenance information available on usegalaxy.eu

77 of 119

Genomics

78 of 119

79 of 119

419

23K+

80 of 119

Exporting provenance data from Galaxy

Experiences from trying to …

Ignacio Eguinoa - ELIXIR Belgium

www.elixir-europe.org

81 of 119

Galaxy as a workflow execution platform

Save and export the details after a workflow execution

  • A system initially designed for bench scientists.
      • hides some (or all) of the execution details
      • sets up an enriched environment where jobs are executed.
  • The execution “unit” is represented by a tool (as it is defined in Galaxy)
      • not necessarily a single unit of common software (Linux command, bwa, bedtools, samtools...) -> granularity of steps
      • the user only sees the outputs of a step defined in the tool.
  • The tool definition is preprocessed to generate the executable code
      • you can’t statically parse a tool to obtain the executing code.

82 of 119

How to capture this provenance data and metadata?

  • RO-Crate:
      • Directory-based organization of: inputs, outputs, scripts, intermediate files, reference data, etc.
      • Schema based metadata.
      • Semantically rich annotation of the analysis process using CWLProv

83 of 119

Relevant aspects

  • Not yet available in Galaxy: Exporting an interoperable and detailed package of a job/workflow execution.
  • Goal: extract information to be consumed by other systems in the FAIR ecosystem → important to involve other stakeholders in the standard(s) used to pack data: archiving systems, data management platforms, etc.
  • Galaxy is moving to a more open format in terms of defining the execution:
    • Workflows format2, dynamic tool definition.
  • Preprocessing of the tool definition to obtain an executable piece of code is dispersed through different parts of the Galaxy machinery.
    • References to built-in data (e.g. reference genomes)
    • Tools can reference server built-in files.
  • Finest granularity possible is determined by the tool definition.
    • It can contain several “steps” in one, e.g. mapping + sorting the resulting BAM file
  • Galaxy tool format provides lots of useful information about the structure of the tool.
    • Clear inputs and outputs + formats.
    • Environment details based on clear dependencies list.

84 of 119

85 of 119

Galaxy execution model: What Galaxy does when you run a tool

  • Standard Galaxy tools are defined in a directory which normally contains one or more files
    • XML tool definition
      • structured inputs and outputs
      • template code to generate job command line code.
    • Tool data files: static data that can be used by the tool.
    • Extra modules/script files: .py, .sh ...
  • Life of a job in Galaxy:
    • Obtain dictionary with inputs (UI parsing, API call parameter, Rerun previous job..)
    • Use this as input for the parsing: generate a shell script from the tool command line definition.
    • Set up the environment based on requirements list
    • Run the generated script in the environment.

86 of 119

Separation between the GUI and the command line executable

87 of 119

Galaxy containers

and its applications

Bert Droesbeke

88 of 119

Galaxy - training - containers

  • Galaxy Training Network: 192 tutorials spread over 21 topics
  • You can run the workflows on usegalaxy.eu / .org
  • Use available test-data or your own data
  • Public instances are not always appropriate
    • Running on large local data sets
    • Sensitive data
    • Bad internet connection
  • Solution: Galaxy Docker containers

docker run -p 8080:80 bgruening/galaxy-stable

training.galaxyproject.org/training-material

github.com/bgruening/docker-galaxy-stable

89 of 119

The concept

  • Workflow-centric approach
  • Disadvantages of the container
    • Performance drawback
    • Big image size

github.com/ELIXIR-Belgium/BioContainers_for_training

90 of 119

Applications

  • Galaxy Training Network (GTN)

  • SARS-CoV-2 analysis workflows

  • E-Biokit Africa

  • ENA-upload container

training.galaxyproject.org/training-material

covid19.galaxyproject.org

github.com/ELIXIR-Belgium/ena-upload-container

doi.org/10.1371/journal.pcbi.1005616

91 of 119

ENA-upload container

  • Custom tailored Galaxy container
  • Credential system for brokering
  • Remove human traces from raw reads
    • based on pipeline from ENA
  • Submit reads and metadata to ENA
    • based on ena-upload-cli
  • Step-by-step guide for SARS-CoV-2 raw read submission

docker run -p "8080:80" --privileged quay.io/galaxy/ena-upload

rdm.elixir-belgium.org/covid-19

92 of 119

We’ve got you covered

  • Running Galaxy workflows
    • Public instances
    • Local instances
    • Containers with everything included

93 of 119

Questions (5 min)

94 of 119

Session 2 -

Feedback from workflow users & developers

Nextflow and Snakemake

Chair : Sarah Cohen-Boulakia

Nextflow

  • 16.00 Célia Michotey & Cyril Pommier (INRAE, ELIXIR-FR): Reproducible indexation using Nextflow for FAIDARE, a FAIR plant data discovery portal
  • 16.05 Frédéric Lemoine (Institut Pasteur): FAIR phylogenetic workflows with Nextflow

Snakemake

  • 16.10 Alban Gaignard (CNRS, ELIXIR-FR): Towards more findable and reusable produced data - knowledge graphs, provenance instrumented workflows and tools registries

5 mins questions at the end - put your questions in the chat

95 of 119

Nextflow based reproducible indexation for FAIDARE, a FAIR plant data discovery portal

Cyril Pommier & Célia Michotey

INRAE, ELIXIR-FR


96 of 119

FAIDARE: Plant data discovery web portal

IBET

INRA

VIB

EBI

Data Harvester

Swagger

MCPD

  • Discoverability of public data on plant biology
  • Federation of established data repositories.
  • Based on the Breeding API (BrAPI) specifications
  • Genotype and phenotype datasets first.


97 of 119

FAIDARE: Plant data discovery web portal

  • Centralized index (Elasticsearch)
  • Data/Metadata harvester

IBET

INRA

VIB

EBI

Data Harvester

Swagger

MCPD

Extract

Transform

Load


98 of 119

Data harvester

  • Extract metadata from semi-harmonized sources
  • Nextflow
    • Orchestration of all sources ET(L)
    • INRAE Gitlab
    • One specific case: URGI databases extraction
    • Error recovery
    • Parallelisation
    • Reproducibility
      • Workflow: Nextflow
      • Environment: Docker based
  • Python Extract and Transform scripts


99 of 119

Feedback: pros

  • Efficient orchestration
  • Execution status clear
  • Reproducible
  • Execution reports
    • Bottleneck identification
  • Error Analysis
    • Clear
    • Each step has dedicated folder
      • Individual logs
      • Commands
      • Data
    • But...


100 of 119

Feedback: cons

  • Error Analysis
    • Encapsulation makes reproducing individual steps complicated
    • Complexity and debugging of encapsulated python scripts (brapi-etl-faidare)?
  • Error recovery can be random
    • Difficulties for debugging
  • Black box effect
    • Due to a simple Nextflow wrapper encapsulating complex scripts?
  • Non-universal portability
    • macOS
    • Some *nix workstations: memory resource management


101 of 119

Perspectives

  • Nextflow is the right technology for us
  • Environment limitation
    • Dedicated VM
  • Too much complexity in the harvester
    • Better usage of Nextflow capabilities
    • → Refactoring to make the Python ETL more atomic


102 of 119

FAIR phylogenetic workflows with Nextflow?

International FAIR convergence Symposium 2020

FAIR Workflows Session

Frédéric Lemoine

Evolutionary Bioinformatics

11/30/2020

103 of 119

Phylogenetics

What is Phylogenetics?

  • In Biology, Phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species or populations);
  • Nowadays, it is almost exclusively based on molecular data (DNA, protein sequences).


Used in many fields:

  • Epidemiology
  • Genomics
  • Oncology
  • Forensics
  • ...

104 of 119

Phylogenetic workflow skeleton


[Diagram] Sequences → Alignment → Tree, with data from GenBank, GISAID, UniProt, OrthoDB and TreeBase, and visualisation in iTOL. EASY?

Usual characteristics

-Many tools (PhyML, Mafft, Blast, etc.), with tons of parameters (model parameters, etc.);

-Many formats (Phylip, Fasta, Newick, PhyloXML, Nexus, etc.);

-Lots of data manipulations (Reformatting, transforming, filtering, etc.), sometimes manual;

-Require lots of “dedicated” scripts (python, bash, awk, perl, etc.): methods rarely fully described

FAIR?

-Data (Input & Output data): Sometimes

-Workflows: Rarely

105 of 119

Architecture

Workflow:

  • Findable? ~ Lack meta-data
  • Accessible? e.g. Github/Zenodo
  • Interoperable? X
  • Reusable?
      • Rerun / replicate /reproduce
      • ! Repurposing (nf modules)

Data:

  • Often lack metadata /provenance


https://github.com/evolbioinfo/

https://hub.docker.com/orgs/evolbioinfo

~100 tools

~200 images

106 of 119

Phylogenetic workflows examples / a first step

BOOSTER* (Nextflow):

  • Workflow
    • Available
    • Difficult to reuse (specific)
    • No containers


107 of 119

Phylogenetic workflows examples

Covid-Align* (Nextflow): accurate multiple alignment of SARS-CoV-2 sequences

  • Workflow
    • Available (github)
    • Interoperable? X
    • Use Singularity


* https://covalign.pasteur.cloud, Bioinformatics 2020

108 of 119

What is missing?

Developed workflows are somewhat on their way to being FAIR, but challenges remain:

    • Providing a provenance template to Nextflow (possible in Nextflow, but no proper “prov” support)
    • Systematically annotating tools/containers and each process with ontologies (e.g. EDAM)
    • No fully supported “interoperable” workflow language
    • Workflow repositories: WorkflowHub, nf-core?
      1. Metadata
      2. Systematic Tests
      3. Follow guidelines (containers, common structure, etc.)


109 of 119


Institut Pasteur

25-28 rue du docteur Roux

75724 Paris Cedex 15

https://research.pasteur.fr/fr/team/evolutionary-bioinformatics/

Thank you!

Evolutionary Bioinformatics Unit

110 of 119

Towards more findable and reusable produced data

Knowledge Graphs, workflow provenance, and tools registries

Alban Gaignard (1), Hala Skaf-Molli (2), Khalid Belhajjame (3)

1 Institut du Thorax, CNRS

2 LS2N, Nantes University

3 LAMSADE, Paris-Dauphine University, PSL

International FAIR Convergence Symposium (FAIR Workflows Session)

111 of 119

Life scientists often say “it’s easier to reprocess” ... how to better reuse data?

Workflows to the rescue

  • Automation for large scale data analysis
  • Abstraction to represent, share scientific protocols
  • Transparency through provenance metadata

Snakemake workflow engine

  • Rule-based workflow engine written in Python
  • Data parallelism, execution on clusters / HPC → massive data production
  • No machine-readable reports
  • No standard metadata for data findability and sharing

112 of 119

What about Snakemake + provenance?

Genericity: can be an advantage, but a strong limitation when considering community terms such as “gene expression level” or “reference genome”

    • need for domain-specific annotations → FAIR processed data

Fine-grained: large graphs reporting each tool execution and consumed/produced data

    • need for humanly tractable datasets → FAIR summaries

PROV-O W3C standard

113 of 119

Automate FAIRification of Life Science data?

Methods and tools: graph pattern matching, inference rules, SPARQL, Python, Jupyter

“Which was the reference genome used to produce this VCF file?”

“Which subset of the data should be re-investigated?” (a query sketch follows below)

A. Gaignard, H. Skaf-Molli, K. Belhajjame. Findable and reusable workflow data products: A genomic workflow case study. Semantic Web Journal. https://doi.org/10.3233/SW-200374
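The first question can be sketched as a SPARQL query over a PROV-O graph exported from an instrumented run, as below; the file name, the entity filter and the exact property path are illustrative assumptions, not the queries used in the paper.

from rdflib import Graph

# Sketch: query a PROV-O provenance graph exported from an instrumented workflow run.
# "provenance.ttl" and the VCF file filter are placeholders.
g = Graph()
g.parse("provenance.ttl", format="turtle")

query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?source ?label WHERE {
  ?vcf prov:wasDerivedFrom+ ?source .          # walk the derivation chain upstream
  OPTIONAL { ?source rdfs:label ?label }
  FILTER(CONTAINS(STR(?vcf), "sample1.vcf"))   # the VCF file we are asking about
}
"""

for row in g.query(query):
    print(row.source, row.label)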

114 of 119

Semantic tools registries

Bio.Tools + EDAM ontology

  • Scientific topics
  • Nature of processing
  • Nature of data (inputs / outputs)

Example :

  • JASPAR predicts transcription factor (TF) binding sites …
  • JASPAR takes as input TF names or IDs.

115 of 119

Machine- and human-oriented summaries

  1. It’s possible to automatically display the typical bioinformatics tasks the data originate from
  2. It’s possible to document data with text leveraging ontology definitions (EDAM)
  3. It’s possible to automatically produce machine-oriented nanopublications

116 of 119

Conclusion and perspectives

    • Provenance as a raw material to assemble FAIR experiment reports
    • Domain-specific annotations + tool / workflow registries are essential! → Still missing: a bridge between Snakemake workflows and tool IDs
  • Potential applications to machine learning workflows
    • Improve explainability of predictions
    • Improve reuse of pre-trained models
  • How to mine distributed workflows / provenance traces on the web? → Findability as a key requirement to envisage (distributed/federated) querying

117 of 119

Acknowledgments

Audrey Bihouée, Institut du Thorax, BiRD Bioinformatics facility, University of Nantes

Hala Skaf-Molli, University of Nantes, LS2N

Khalid Belhajjame, LAMSADE, University of Paris-Dauphine, PSL

Questions ?

alban.gaignard@univ-nantes.fr

118 of 119

Questions (5 min)

119 of 119

Wrap up

Thanks to all the speakers!

Thanks to all the participants!

  • Workflow Systems are a promising means to (automatically) produce FAIR data
  • Workflows reach beyond WfMS - scripts, notebooks, traces
  • Workflows should be FAIR in their own right
    • The FAIR Principles for workflows
    • Research Data Alliance FAIR for Research Software (FAIR4RS) WG

Slides (https://tinyurl.com/FAIRWFSlides) and Recording will be made available.