1 of 119

Session 3 -

Wooclap & Panelist Questions

→ Please go to https://app.wooclap.com/FAIRWF

2 of 119

FAIR Computational workflows

International FAIR Convergence Symposium 2020

Sarah Cohen-Boulakia*, Daniel Garijo**, Carole Goble***

* Université Paris-Saclay, France
** Information Sciences Institute, University of Southern California
*** The University of Manchester, UK

3 of 119

Welcome!

Pipelines are designed to analyse scientific datasets.

Scientific workflow systems have been developed to assist users.

  • Workflows should be FAIR in their own right, have an important role to play in making data FAIR and need access to FAIR data
  • What are the tools offered today to help produce FAIR data and develop FAIR workflows?
  • What is the feedback of scientific workflow users? Current challenges?
  • Pitch talks then Wooclap Questions

279 registrations!

4 of 119

Housekeeping

Cameras on for speakers and chairs only. Please turn off your camera unless presenting.

The presentations will be shared and the session is being recorded.

You are all muted by default: to ask a question use the chat AND raise your hand. The host can unmute you during the discussion.

Please use the chat for discussion; mark your point as QUESTION if you are addressing it to the panellists.

If you are tweeting use the hashtag #FAIRconvergence


5 of 119

Session 1 - Tools for FAIR workflows

Chair : Daniel Garijo

  • 15.05 Carole Goble (ELIXIR-UK; EOSC-Life; The University of Manchester): The WorkflowHub.eu FAIR workflow registry
  • 15.10 Stian Soiland-Reyes (BioExcel; The University of Manchester; UvA): RO-Crate for FAIR workflow packaging
  • 15.15 Michael R. Crusoe (CWL; VU Amsterdam, ELIXIR-NL): The CWL Standards support the FAIR Principles
  • 15.20 Joaquin Vanschoren (OpenML community): Sharing machine learning workflows with OpenML
  • 15.25 Rafael Ferreira da Silva (ISI, USC): FAIR Workflow Traces for Scientific Workflow Research and Development (workflowhub.org)

5 mins questions at the end - put your questions in the chat

6 of 119

The WorkflowHub.eu FAIR workflow registry

Carole Goble, The University of Manchester / ELIXIR-UK / EOSC-Life

And the WorkflowHub Club

carole.goble@manchester.ac.uk

FAIR Computational Workflows Session, 30th November 2020, FAIR Convergence Symposium

7 of 119

https://workflowhub.eu

Beta Release September 2020

FIND and ACCESS Workflows

The workflows registered are INTEROPERABLE and REUSABLE

Workflows are FAIR and the Registry is FAIR too.

Workflows are FAIR objects in their own right.

A Registry for Computational Workflows.

8 of 119

https://workflowhub.eu

A Registry for Computational Workflows.

Workflow Management System agnostic.

Workflows may remain in their native repositories in their native form.

Open to workflows from all disciplines and any country.

Sponsored by the European Life Science community.

The WorkflowHub Club open community.

Beta Release September 2020

9 of 119


FAIR and richly featured registry entry (screenshot): a Nextflow workflow in its native form, alongside a Common Workflow Language description.

10 of 119

Workflows organized by:

Teams

Collections

Properties

  • Tags
  • Type
  • Status
  • Dates, etc.

Search & Browsing

Makers are custodians of their own workflows

Preserve personal attribution, affiliations and contribution credit

11 of 119

70 workflows · 29 COVID-19 workflows · 85 people registered · 15 countries · 32 organisations · 26 teams

Submitters: credited for registering and curating entries

Contributors: credited for developing workflows

Scripts and any WfMS

12 of 119

FAIR Machine Processable Metadata

Metadata about a workflow: Schema.org markup

Canonical description of the workflow: links to containerized tools, alongside native descriptions

Metadata for organizing & packaging the components of a workflow: an exchange format for workflows
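For illustration, the Schema.org mark-up for a registered workflow might look like the sketch below, written here as a Python dictionary. The property names follow the Bioschemas ComputationalWorkflow profile; the values and the exact field selection are illustrative assumptions, not an actual WorkflowHub record.

import json

# Illustrative sketch (not an actual WorkflowHub record): Schema.org / Bioschemas-style
# metadata for a registered workflow, serialised as JSON-LD.
workflow_metadata = {
    "@context": "https://schema.org",
    "@type": "ComputationalWorkflow",                       # Bioschemas ComputationalWorkflow profile
    "name": "Example variant-calling workflow",              # hypothetical name
    "programmingLanguage": "Nextflow",                       # native workflow language
    "creator": {"@type": "Person", "name": "Jane Researcher"},  # hypothetical author
    "license": "https://spdx.org/licenses/Apache-2.0",
    "version": "1.0",
    "url": "https://workflowhub.eu/workflows/1",              # hypothetical registry entry
}

print(json.dumps(workflow_metadata, indent=2))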

13 of 119

[Diagram] Simple registration: a collection of files and file ids (URLs), with upload, download and access.

Links to Testing & Monitoring Systems.

TRS API: Search and Launch from within WfMS; Register (push) / Harvest (pull) with other registries.

14 of 119

Partnering with specific Workflow Management Systems for advanced integration and rich features

Your WfMS here

15 of 119

snapshots · versions · provenance · identifiers · citation · referencing · community standards · common registration metadata · canonical workflow descriptions & mark-up · licenses · analytics · API · supplementary materials · test & example data · documentation · links to monitoring & testing services · packaging

LIVING CONTENT

16 of 119


Workflow registry open for business!

https://workflowhub.eu

17 of 119

WorkflowHub.eu Open for Business!

We gratefully acknowledge the WorkflowHub Club, Bioschemas Group, RO-Crate community, CWL Community and our WfMS partners in Galaxy, Snakemake, Nextflow, CWL, SCIPION, NMRPipe

18 of 119

Packaging workflows with RO-Crate

Stian Soiland-Reyes, The University of Manchester, BioExcel, ELIXIR-UK

University of Amsterdam

This work is funded by the European Union contracts BioExcel CoE H2020-INFRAEDI-02-2018-823830, H2020-EINFRA-2015-1-675728, and EOSC-Life H2020-INFRAEOSC-2018-2-824087

19 of 119

Annual reminder on FAIR principles

Interoperable

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

Findable

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

Reusable

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

Accessible

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

tl;dr: machine-readable metadata

20 of 119

Best practices for workflow reproducibility

Methods

(..)

De novo assembly and binning

Raw reads from each run were first assembled with SPAdes v.3.10.0 with option --meta. Thereafter, MetaBAT 2 (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.16 and then calculating the corresponding read depths of each individual contig with samtools v.1.5 (‘samtools view -Sbu’ followed by ‘samtools sort’) together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The QS of each metagenome-assembled genome (MAG) was estimated with CheckM v.1.0.7 using the lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.2 (options -Z 1000 --hmmonly --cut_ga) using the Rfam covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v.2.0 using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥ 50% completeness and <10% contamination). (...)

(..)

Assignment of MAGs to reference databases

Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 2018, including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database. For each database, the function ‘mash sketch’ from Mash v.2.0 was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with ‘mash dist’ to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.23 to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.

(..)

21 of 119

Best practices for workflow reproducibility

22 of 119

23 of 119

Semantic Web world vs Real World

Peter Sefton at Open Repositories 2019

Excessive FAIR considered dangerous for your health.

24 of 119

2018 reboot → RO-Crate

RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata.

It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.

RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see RO-Crate background.

The RO-Crate specification details how to capture a set of files and resources as a dataset with associated metadata – including contextual entities like people, organizations, publishers, funding, licensing, provenance, workflows, geographical places, subjects and repositories.

A growing list of RO-Crate tools and libraries simplify creation and consumption of RO-Crates, including the graphical interface Describo.

Join the RO-Crate community to help shape the specification or get help with using it!

25 of 119

2018 reboot: Building the RO-Crate Community

Monthly telcons (4th Thursday of the month), everyone welcome! https://s.apache.org/ro-crate-minutes

26 of 119

RO-Crate Specification

27 of 119

Best Practice Guidance, not strict specifications

Developer-friendly rather than semantic correctness

Focus on JSON, but gradual path to extensibility with Linked Data (example: how to do ad-hoc terms); a minimal metadata skeleton is sketched below

Opinionated profile of schema.org

Example-driven documentation, not strict schemas
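For illustration, a minimal ro-crate-metadata.json skeleton can be built as a plain Python dictionary, as sketched below. File names and the author entity are placeholders; the structure follows the RO-Crate 1.1 layout (a metadata file descriptor, a root dataset, plus data and contextual entities).

import json

# Minimal RO-Crate 1.1 metadata skeleton; file names and people are placeholders.
ro_crate_metadata = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # descriptor for the metadata file itself
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root dataset: the crate's payload
            "@id": "./",
            "@type": "Dataset",
            "name": "Example crate",
            "hasPart": [{"@id": "workflow.cwl"}],
            "author": {"@id": "#alice"},
        },
        {"@id": "workflow.cwl", "@type": "File"},                        # a data entity
        {"@id": "#alice", "@type": "Person", "name": "Alice Example"},   # a contextual entity
    ],
}

print(json.dumps(ro_crate_metadata, indent=2))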

28 of 119

Workflows in RO-Crate

29 of 119

Separation of concerns

Interoperability, Explainability

Software packaging

Tool registries

Distribution, Packaging

Storage, Repositories (including the Web!)

Describing, Relating, Typing, Contextualizing

Interactivity, Scalability

Challenges:

Packaging zoo - how to choose?

Using “just enough” of the stack

Lossy interoperability between layers

Avoid everyone concluding “I’ll just make my own JSON/API”

Identifiers (incl. URIs): http://… , urn:uuid:1ca3b9dc-a97c-408d-ab1c-8431909e343a

{ "@id": "https://doi.org/10.4225/59/59672c09f4a4b",
  "@type": "Dataset",
  "hasPart": [ … ]
}

manifest-sha512.txt

a0ae93…77fb data/ro-crate-metadata.json
e5fec4…500b data/ro-crate-preview.html
a2f562…f3fa data/workflow.cwl
481bb7…10b7 data/chipseq_20200910.json

30 of 119

What is “the workflow”?

Same conceptual workflow; multiple executable flavours for different workflow engines and specific use-cases (e.g. COVID-19)

31 of 119

Tooling!

How can I use it?

While we’re mostly focusing on the specification, some tools already exist for working with RO-Crates:

  • Describo interactive desktop application to create, update and export RO-Crates for different profiles. (~ beta)
  • CalcyteJS is a command-line tool to help create RO-Crates and HTML-readable rendering (~ beta)
  • ro-crate - JavaScript/NodeJS library for RO-Crate rendering as HTML. (~ beta)
  • ro-crate-js - utility to render HTML from RO-Crate (~ alpha)
  • ro-crate-ruby Ruby library to consume/produce RO-Crates (~ alpha)
  • ro-crate-py Python library to consume/produce RO-Crates (~ planning)

These applications use or expose RO-Crates:

32 of 119

Software Libraries

from rocrate import rocrate_api

# Workflow and extra file paths
wf_path = "test/test-data/Genomics-4-PE_Variation.ga"
extra_files = ["test/test-data/extra_file.txt"]

# Create base package
wf_crate = rocrate_api.make_workflow_rocrate(workflow_path=wf_path,
                                             wf_type="Galaxy",
                                             include_files=extra_files)

# Add authors info
author_metadata = {'name': 'Jaco Pastorius'}
jaco = wf_crate.add_person('#jaco', author_metadata)
wf_crate.creator = jaco

# Write to zip file
out_path = "/home/test_user/wf_crate.zip"
wf_crate.write_zip(out_path)

33 of 119

FAIR is not just machine-readable!

/

ro-crate-metadata.json

ro-crate-preview.html

nextflow.log

results/

34 of 119

Timeline

History: 2018-10 RO-Lite conceived at IEEE eScience 2018

2019-02 RO-Lite 0.1 drafted

2019-11 RO-Crate 1.0 released

2020-10 RO-Crate 1.1 released

Next:

Workflow Run Provenance

Workflow Run job & test data

Containers and clouds

RO-Crate in Zenodo?

More tooling!

Next RO-Crate community call: 2021-01-07

https://www.researchobject.org/ro-crate/

More in session 260: FAIR Data Provenance

Wednesday 15:00-17:00 UTC

35 of 119

FAIR Principles 4 Workflows with CWL and friends

Michael R. Crusoe�Common Workflow Language Project Lead

36 of 119

Annual reminder on FAIR principles

Interoperable

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

Findable

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

Reusable

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

Accessible

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

37 of 119

Who will issue identifiers for workflows?

Should this be a Workflow-specific service?

Yes, Zenodo will give us a DOI for free, but I can’t query a Zenodo DOI for workflow metadata.

Nor can I use a Zenodo DOI to immediately get a URL to download the workflow.
– However, if the workflow is stored inside a Workflow RO-Crate, then the DOI becomes actionable

38 of 119

What API should we use to query workflow registries?

The Global Alliance for Genomics and Health (GA4GH) has specified the TRS (Tool Registry Service) API, implemented by dockstore.org and workflowhub.eu.

Are the needs and assumptions of a human health association the correct ones for all scientific/research workflow users?

GA4GH is working through the complexities of Authorization and Authentication (FAIR Principle A1.2)
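For illustration, listing registered workflows from a TRS v2 endpoint might look like the sketch below; the base URL and the response fields shown are assumptions based on the GA4GH TRS specification rather than a verified call.

import requests

# Hedged sketch: list tools/workflows from a GA4GH TRS v2 endpoint.
# The base URL is an assumption; dockstore.org and workflowhub.eu expose comparable endpoints.
TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"

resp = requests.get(f"{TRS_BASE}/tools", params={"limit": 10})
resp.raise_for_status()

for tool in resp.json():
    # TRS v2 Tool objects carry an id, an organization and (optionally) a name
    print(tool.get("id"), tool.get("organization"), tool.get("name"))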

39 of 119

Requests for Workflow Environment Developers

Please

  • Make it easy and attractive for your users to add metadata
  • Prompt your users to apply a software license, preferably a major and widely recognized open source license
  • Integrate with WorkflowHub.eu and domain specific registries for discovery and publishing
  • Collect provenance information (if you don’t already) and export it automatically in CWLProv format (a minimal invocation sketch follows this list)
    • If you don’t like something about CWLProv, please work with us to improve it!
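For the CWLProv request above, a minimal sketch of automatic provenance capture with cwltool is shown below; the workflow and job file names are placeholders.

import subprocess

# Hedged sketch: run a CWL workflow with cwltool and capture a CWLProv research object.
# "workflow.cwl" and "job.yml" are placeholder file names.
subprocess.run(
    [
        "cwltool",
        "--provenance", "provenance_ro/",  # folder where the CWLProv research object is written
        "workflow.cwl",
        "job.yml",
    ],
    check=True,
)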

40 of 119

FAIR Principles & Workflows with CWL and friends

CWL + WorkflowHub.eu + CWLProv + GA4GH TRS + schema.org can be used to support all 17 of the FAIR principles — BUT fully realizing them is not automatic!

What about when WorkflowHub.eu content and API evolves? (we should archive the RO-Crates)

What if workflow authors don’t self-annotate sufficiently?

What if schema.org moves the canonical location of their RDF again?

Let’s work together to answer these questions!

41 of 119

OpenML

Sharing and reproducing

machine learning experiments

Organizing and automating machine learning

Joaquin Vanschoren and the OpenML team

42 of 119

Machine Learning: art or science?

A process with many actors and tools (the model lifecycle): raw data → preprocess → train → predict → deployment, involving data engineers and ML engineers.

43 of 119

What if…

we could organize the world’s machine learning information

and make it universally accessible and useful?

44 of 119

Machine Learning (FAIR) objects

Datasets (+ meta-data): file, url, version

Tasks (problem meta-data): how to evaluate

Flows (algorithm meta-data): dependencies, detailed structure (pipelines, neural nets)

Runs (model meta-data): configurations, evaluations

All come with a lot of meta-data

45 of 119

Share and rediscover all used datasets, flows, and runs

For every model, get the exact dataset and algorithms used

How do I reproduce this result?

Reproducibility


How do I collect all this metadata?

46 of 119

System of execution

Manual annotations. Easy, but heterogeneous

System of record

meta-database

experiments, projects, models,…

visualizations, search

log_param('x', 1)
log_data_version('x', 1)
log_metric('x', 1)

47 of 119

Auto-annotation. Requires APIs, tool integration

meta-database

object store

experiments, projects,

visualizations, search

OpenML API

tool integrations

auto-log on demand

Also indexes all datasets, flows, runs

48 of 119

OpenML: A global machine learning lab

OpenML

Notebooks

Local apps

Cloud jobs

REST API

APIs in Python, R, Java,...

Web UI

data.publish()

pipeline.publish()

run.publish()

import via API

openml.get_data(1)

openml.get_flow(1)

openml.get_run(1)

All (meta)data is collected and organized automatically

49 of 119

Frictionless machine learning

Share from where you create

Import into your favorite working environment (in uniform formats)

Run wherever you want

data.publish()

OpenML

get_dataset(id)

run.publish()

50 of 119

Web UI (openml.org) beta: new.openml.org

datasets

flows (pipelines)

runs (performance)

accuracy

51 of 119

Tool integrations

from sklearn import ensemble
from openml import tasks, runs

clf = ensemble.RandomForestClassifier()
task = tasks.get_task(3954)
run = runs.run_model_on_task(clf, task)
run.publish()

More examples on https://docs.openml.org/Python-examples/

52 of 119

Tool integrations

import torch.nn
from openml import tasks, runs

model = torch.nn.Sequential(
    processing_net, features_net, results_net)
task = tasks.get_task(3954)
run = runs.run_model_on_task(model, task)
run.publish()

Full example on https://openml.github.io/blog/

53 of 119

Reproduce flows (automagically)

pipeline = sklearn.make_pipeline(...)
flow = sklearn.model_to_flow(pipeline)
run = openml.run_model_on_task(pipeline, task)
id = run.publish()

run = openml.get_run(id)
pipeline = openml.get_flow(run.flow_id, reinstantiate=True)

54 of 119

150000+ yearly users

8000+ registered contributors

500+ publications

20000+ datasets

8000+ flows

10,000,000+

OpenML Community

55 of 119

Thanks to the entire OpenML star team

Jan van Rijn

Matthias Feurer

Heidi Seibold

Bernd Bischl

Andreas Müller

Erin LeDell

Giuseppe Casalicchio

Michel Lang

Pieter Gijsbers

Sahithya Ravi

Bilge Celik

Prabhant Singh

Janek Thomas

and many more!

Marcel Wever

Neil Lawrence

Markus Weimer

56 of 119

FAIR Workflow Traces for Scientific Workflow Research and Development

In collaboration with: Tainã Coleman, Loïc Pottier, Henri Casanova, and Ewa Deelman

57 of 119

WorkflowHub is a community framework that provides a collection of tools for analyzing workflow execution traces, producing realistic synthetic workflow traces, and simulating workflow executions

Concept

58 of 119

Traces

Collection of open access workflow traces from a production workflow system

This collection of workflow traces forms an initial set of small- and large-scale workflow configurations that:

  • consume/produce large volumes of data
  • have structures that are sufficiently complex and heterogeneous

59 of 119

Python Package

An open source Python package to analyze traces and generate representative synthetic traces in the same format.

Analyses can be performed to produce statistical summaries of workflow performance characteristics.

WorkflowHub’s Python package attempts to fit data with 23 probability distributions provided as part of SciPy’s statistics submodule (a sketch of the approach follows below).
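The fitting step can be sketched directly with SciPy, as below. This illustrates the general approach (fit candidate distributions to observed task runtimes and keep the best by a goodness-of-fit statistic); it is not the package's actual implementation, and the runtime values are invented.

import numpy as np
from scipy import stats

# Illustrative only: fit a few candidate distributions to observed task runtimes
# and keep the one with the lowest Kolmogorov-Smirnov statistic.
runtimes = np.array([12.3, 14.1, 11.8, 35.2, 13.0, 12.7, 40.5, 13.9])  # made-up values

candidates = ["norm", "lognorm", "gamma"]

best_name, best_stat = None, np.inf
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(runtimes)                     # maximum-likelihood parameter fit
    ks_stat, _ = stats.kstest(runtimes, name, args=params)
    if ks_stat < best_stat:
        best_name, best_stat = name, ks_stat

print(best_name, best_stat)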

60 of 119

Trace Generator

The WorkflowHub package provides a number of workflow recipes for generating realistic synthetic workflow traces (a usage sketch follows below).
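A usage sketch of the recipe idea is shown below. The module paths, class and method names are assumptions about the package's API and should be treated as illustrative, not authoritative.

# Illustrative sketch only: module paths, class and method names are assumptions
# about the workflowhub Python package API, not verified imports.
from workflowhub import WorkflowGenerator              # hypothetical import path
from workflowhub.generator import EpigenomicsRecipe    # hypothetical recipe class

recipe = EpigenomicsRecipe.from_num_tasks(num_tasks=300)    # target size of the synthetic workflow
generator = WorkflowGenerator(recipe)
workflow = generator.build_workflow("epigenomics-synthetic")
workflow.write_json("epigenomics-synthetic.json")            # same JSON trace format as the collection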

61 of 119

Accuracy

62 of 119

Questions (5 min)

63 of 119

Session 2 - Feedback from workflow users & developers

Jupyter and Galaxy

15:35 - 16:45 Feedback from workflow users & developers - 5-min pitch talks

Chair : Sarah Cohen-Boulakia

Jupyter Notebooks

  • 15.35 Daniel Garijo (ISI, USC): From notebooks to FAIR workflows

Galaxy

  • 15.40 Björn Grüning (Galaxy Europe, ELIXIR-DE): Complex synchronous and asynchronous workflows - from Metagenomics and COVID-19 research to Drug-Design and Climate science.
  • 15.45 Ignacio Eguinoa (VIB): Managing provenance data from workflow execution using Galaxy and RO-Crate.
  • 15.50 Bert Droesbeke (VIB): Galaxy workflows for COVID e.g. cleanup of viral reads and submission to ENA, containers for training workflows

5 mins questions at the end - put your questions in the chat

64 of 119

From Notebooks to FAIR Workflows

Daniel Garijo

Information Sciences Institute and�Department of Computer Science

http://w3id.org/people/dgarijo

@dgarijov
dgarijo@isi.edu

65 of 119

Computational Notebooks


Figure source: https://towardsdatascience.com/the-complete-guide-to-jupyter-notebooks-for-data-science-8ff3591f69a4

The good:

  • Narrative, visualizations and code
  • Easy to prototype and test
  • Wide community support
  • Good to showcase usage examples

The ugly:

  • Opaque software dependencies
  • Local file dependencies and parameters
  • Hidden states and unordered cells
  • Difficult to integrate within a pipeline

See “I don’t like notebooks”, by Joel Grus

See “I like notebooks”, by Jeremy Howard

66 of 119

Notebooks and FAIR support


Principle | Support (Notebook) | Notebook Metadata
FINDABLE | N/A | No standard metadata
ACCESSIBLE | Markdown, JSON | N/A
INTEROPERABLE | Python code is usually compatible between notebooks (assuming the same dependencies are available) | N/A
REUSABLE | Notebooks are usually explained combining visualization and markdown narrative; provenance traces are often included (when dependencies are available) | N/A

67 of 119

From Notebooks to FAIR workflows


[Diagram] The notebook wrapped as a workflow component with inputs i1, i2 and output o1.

Caveat: the notebook is converted into a black box; the internal dataflow is lost.

Why?

  • Easier integration of multiple tools and environments
  • Easier to maintain and deploy
  • Clear dataflow for complex pipelines (easier to inspect)
  • Workflows can be described in community repositories (findability support)

How?

  • Use annotations to create a workflow component

68 of 119

From Notebooks to FAIR workflows


Ideally: Capture the dataflow in a notebook and modularize cells

NiW: Converting Notebooks into Workflows to Capture Dataflow and Provenance. Carvalho, L. A. M. C.; Wang, R.; Gil, Y.; and Garijo, D. In Proceedings of the Workshop on Capturing Scientific Knowledge (SciKnow), held in conjunction with the ACM International Conference on Knowledge Capture (K-CAP), Austin, Texas, 2017.

69 of 119

Bridging the gap: From notebooks to FAIR workflows

How to ensure we can transform notebooks into FAIR workflows?

  • Automate ways to enrich and annotate notebooks with metadata (e.g. [1])

  • Establish and follow good practices for software engineering:
    • Modularity of cells
    • Parametrization of notebook (description of I/O)
    • Meaningful variable names

  • Test notebook execution with Docker/Binder
    • Tests sequential execution of cells (a minimal sketch follows below)


[1] ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility. Sheeba Samuel and Birgitta König-Ries. 17th International Semantic Web Conference (ISWC) Demo Track 2018
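One concrete way to test sequential cell execution (locally or inside a Docker/Binder image) is sketched below with nbformat and nbconvert's ExecutePreprocessor; the notebook file name is a placeholder.

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Sketch: execute all cells of a notebook top-to-bottom and fail loudly if any cell errors.
# "analysis.ipynb" is a placeholder file name.
nb = nbformat.read("analysis.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})    # raises CellExecutionError on failure

nbformat.write(nb, "analysis.executed.ipynb")     # keep the executed copy as evidence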

70 of 119

Complex synchronous and asynchronous workflows

from Metagenomics and COVID-19 research to Drug-Design and Climate science.

Björn Grüning - ELIXIR Germany

www.elixir-europe.org

71 of 119

A Data Analysis Gateway for Everyone!

72 of 119

Web-based User Interface (+ API access)

73 of 119

Data Management

  • Rich, persistent metadata system
  • Histories
  • Data libraries
  • 2-stage delete process
  • Access management control
  • User quota

  • Group/role model
  • Reference data / user data
  • Data importer and data exporter
    • bag-it import
    • RO-Crate export (next talk)
    • Invenio integration WIP
  • LIMS integration (everything is scriptable)

74 of 119

From Tools to (async) Workflows

Combine 2,500+ apps into powerful workflows

Schema.org-compliant microdata can be attached to tools and workflows.

75 of 119

Synchronous Workflows

  • Notebooks can be stored as Galaxy assets
  • Notebooks can be shared and visualised like any other Galaxy object
  • Galaxy data can be used inside a Notebook

76 of 119

covid19.galaxyproject.org

  • All workflows available in WorkflowHub.eu
  • All data available on usegalaxy.eu
  • All provenance information available on usegalaxy.eu

77 of 119

Genomics

78 of 119

79 of 119

419

23K+

80 of 119

Exporting provenance data from Galaxy

Experiences from trying to …

Ignacio Eguinoa - ELIXIR Belgium

www.elixir-europe.org

81 of 119

Galaxy as a workflow execution platform

Save and export the details after a workflow execution

  • A system initially designed for bench scientists.
      • hides some (or all) of the execution details
      • sets up an enriched environment where jobs are executed.
  • The execution “unit” is represented by a tool (as it is defined in Galaxy)
      • not necessarily a single unit of common software (Linux command, bwa, bedtools, samtools...) -> granularity of steps
      • the user only sees the outputs of a step defined in the tool.
  • The tool definition is preprocessed to generate the executable code
      • you can’t statically parse a tool to obtain the executing code.

82 of 119

How to capture this provenance data and metadata?

  • RO-Crate:
      • Directory-based organization of: inputs, outputs, scripts, intermediate files, reference data, etc.
      • Schema based metadata.
      • Semantically rich annotation of the analysis process using CWLProv

83 of 119

Relevant aspects

  • Not yet available in Galaxy: Exporting an interoperable and detailed package of a job/workflow execution.
  • Goal: extract information to be consumed by other systems in the FAIR ecosystem → important to involve other stakeholders in the standard(s) used to pack data: archiving systems, data management platforms, etc.
  • Galaxy is moving to a more open format in terms of defining the execution:
    • Workflows format2, dynamic tool definition.
  • Preprocessing of the tool definition to obtain an executable piece of code is dispersed through different parts of the Galaxy machinery.
    • References to built-in data (e.g. reference genomes)
    • Tools can reference server built-in files.
  • Finest granularity possible is determined by the tool definition.
    • It can contain several “steps” in one, e.g. mapping + sorting the resulting BAM file
  • Galaxy tool format provides lots of useful information about the structure of the tool.
    • Clear inputs and outputs + formats.
    • Environment details based on clear dependencies list.

84 of 119

85 of 119

Galaxy execution model: What Galaxy does when you run a tool

  • Standard Galaxy tools are defined in a directory which normally contains one or more files
    • XML tool definition
      • structured inputs and outputs
      • template code to generate job command line code.
    • Tool data files: static data that can be used by the tool.
    • Extra modules/script files: .py, .sh ...
  • Life of a job in Galaxy:
    • Obtain dictionary with inputs (UI parsing, API call parameter, Rerun previous job..)
    • Use this as input for the parsing: generate a shell script from the tool command line definition.
    • Set up the environment based on requirements list
    • Run the generated script in the environment.

86 of 119

Separation between the GUI and the command line executable

87 of 119

Galaxy containers

and its applications

Bert Droesbeke

88 of 119

Galaxy - training - containers

  • Galaxy Training Network: 192 tutorials spread over 21 topics
  • You can run the workflows on usegalaxy.eu / .org
  • Use available test-data or your own data
  • Public instances are not always appropriate
    • Running on large local data sets
    • Sensitive data
    • Bad internet connection
  • Solution: Galaxy Docker containers

docker run -p 8080:80 bgruening/galaxy-stable

training.galaxyproject.org/training-material

github.com/bgruening/docker-galaxy-stable

89 of 119

The concept

  • Workflow-centric approach
  • Disadvantages of the container
    • Performance drawback
    • Big image size

github.com/ELIXIR-Belgium/BioContainers_for_training

90 of 119

Applications

  • Galaxy Training Network (GTN)

  • SARS-CoV-2 analysis workflows

  • E-Biokit Africa

  • ENA-upload container

training.galaxyproject.org/training-material

covid19.galaxyproject.org

github.com/ELIXIR-Belgium/ena-upload-container

doi.org/10.1371/journal.pcbi.1005616

91 of 119

ENA-upload container

  • Custom tailored Galaxy container
  • Credential system for brokering
  • Remove human traces from raw reads
    • based on pipeline from ENA
  • Submit reads and metadata to ENA
    • based on ena-upload-cli
  • Step-by-step guide for SARS-CoV-2 raw read submission

docker run -p "8080:80" --privileged quay.io/galaxy/ena-upload

rdm.elixir-belgium.org/covid-19

92 of 119

We’ve got you covered

  • Running Galaxy workflows
    • Public instances
    • Local instances
    • Containers with everything included

93 of 119

Questions (5 min)

94 of 119

Session 2 -

Feedback from workflow users & developers

Nextflow and Snakemake

Chair : Sarah Cohen-Boulakia

Nextflow

  • 16.00 Célia Michotey & Cyril Pommier (INRAE, ELIXIR-FR): Reproducible indexation using Nextflow for FAIDARE, a FAIR plant data discovery portal
  • 16.05 Frédéric Lemoine (Institut Pasteur): FAIR phylogenetic workflows with Nextflow

Snakemake

  • 16.10 Alban Gaignard (CNRS, ELIXIR-FR): Towards more findable and reusable produced data - knowledge graphs, provenance instrumented workflows and tools registries

5 mins questions at the end - put your questions in the chat

95 of 119

Nextflow based reproducible indexation for FAIDARE, a FAIR plant data discovery portal

Cyril Pommier & Célia Michotey

INRAE, ELIXIR-FR


96 of 119

FAIDARE: Plant data discovery web portal

IBET

INRA

VIB

EBI

Data Harvester

Swagger

MCPD

  • Discoverability of public data on plant biology
  • Federation of established data repositories.
  • Based on the Breeding API (BrAPI) specifications
  • Genotype and phenotype datasets first.


97 of 119

FAIDARE: Plant data discovery web portal

  • Centralized index (Elasticsearch)
  • Data/Metadata harvester

IBET

INRA

VIB

EBI

Data Harvester

Swagger

MCPD

Extract

Transform

Load


98 of 119

Data harvester

  • Extract metadata from semi-harmonized sources
  • Nextflow
    • Orchestration of all sources ET(L)
    • INRAE Gitlab
    • One specific case: URGI databases extraction
    • Error recovery
    • Parallelisation
    • Reproducibility
      • Workflow: Nextflow
      • Environment: Docker based
  • Python Extract and Transform scripts


99 of 119

Feedback: pros

  • Efficient orchestration
  • Execution status clear
  • Reproducible
  • Execution reports
    • Bottleneck identification
  • Error Analysis
    • Clear
    • Each step has dedicated folder
      • Individual logs
      • Commands
      • Data
    • But...


100 of 119

Feedback: cons

  • Error Analysis
    • Encapsulation makes reproducing individual steps complicated
    • Complexity and debugging of encapsulated python scripts (brapi-etl-faidare)?
  • Error recovery can be random
    • Difficulties for debugging
  • Black box effect
    • Due to a simple Nextflow wrapper encapsulating complex scripts?
  • Non-universal portability
    • macOS
    • Some *nix workstations: memory resource management


101 of 119

Perspectives

  • Nextflow is the right technology for us
  • Environment limitation
    • Dedicated VM
  • Too much complexity in the harvester
    • Better usage of Nextflow capabilities
    • → Refactoring to make the Python ETL more atomic


102 of 119

FAIR phylogenetic workflows with Nextflow?

International FAIR convergence Symposium 2020

FAIR Workflows Session

Frédéric Lemoine

Evolutionary Bioinformatics

11/30/2020

103 of 119

Phylogenetics

What is Phylogenetics?

  • In Biology, Phylogenetics is the study of the evolutionary history and relationships among individuals or groups of organisms (e.g. species or populations);
  • Nowadays, it is almost exclusively based on molecular data (DNA, protein sequences).


Used in many fields:

  • Epidemiology
  • Genomics
  • Oncology
  • Forensics
  • ...

104 of 119

Phylogenetic workflow skeleton


[Diagram] Sequences → Alignment → Tree, with data from GenBank, GISAID, UniProt, OrthoDB and TreeBase, and visualisation in iTOL. EASY?

Usual characteristics

-Many tools (PhyML, Mafft, Blast, etc.), with tons of parameters (model parameters, etc.);

-Many formats (Phylip, Fasta, Newick, PhyloXML, Nexus, etc.);

-Lots of data manipulations (Reformatting, transforming, filtering, etc.), sometimes manual;

-Require lots of “dedicated” scripts (python, bash, awk, perl, etc.): methods rarely fully described

FAIR?

-Data (Input & Output data): Sometimes

-Workflows: Rarely

105 of 119

Architecture

Workflow:

  • Findable? ~ Lack meta-data
  • Accessible? e.g. Github/Zenodo
  • Interoperable? X
  • Reusable?
      • Rerun / replicate /reproduce
      • ! Repurposing (nf modules)

Data:

  • Often lack metadata /provenance


https://github.com/evolbioinfo/

https://hub.docker.com/orgs/evolbioinfo

~100 tools

~200 images

106 of 119

Phylogenetic workflows examples / a first step

BOOSTER* (Nextflow):

  • Workflow
    • Available
    • Difficult to reuse (specific)
    • No containers


107 of 119

Phylogenetic workflows examples

Covid-Align* (Nextflow): accurate multiple alignment of SARS-CoV-2 sequences

  • Workflow
    • Available (github)
    • Interoperable? X
    • Use Singularity


* https://covalign.pasteur.cloud, Bioinformatics 2020

108 of 119

What is missing?

Developed workflows are somewhat on their way to being FAIR, but challenges remain:

    • Providing a provenance template to Nextflow (possible in Nextflow, but no proper “prov” support)
    • Systematically annotating tools/containers and each process with ontologies (e.g. EDAM)
    • No fully supported “interoperable” workflow language
    • Workflow repositories: WorkflowHub, nf-core?
      1. Metadata
      2. Systematic Tests
      3. Follow guidelines (containers, common structure, etc.)


109 of 119


Institut Pasteur

25-28 rue du docteur Roux

75724 Paris Cedex 15

https://research.pasteur.fr/fr/team/evolutionary-bioinformatics/

Thank you!

Evolutionary Bioinformatics Unit

110 of 119

Towards more findable and reusable produced data

Knowledge Graphs, workflow provenance, and tools registries

Alban Gaignard (1), Hala Skaf-Molli (2), Khalid Belhajjame (3)

1 Institut du Thorax, CNRS

2 LS2N, Nantes University

3 LAMSADE, Paris-Dauphine University, PSL

International FAIR Convergence Symposium (FAIR Workflows Session)

111 of 119

Life scientists often say “it’s easier to reprocess” ... how to better reuse data?

Workflows to the rescue

  • Automation for large scale data analysis
  • Abstraction to represent, share scientific protocols
  • Transparency through provenance metadata

Snakemake workflow engine

  • Rule-based workflow engine written in Python
  • Data parallelism, execution on clusters / HPC → massive data production
  • No machine-readable reports
  • No standard metadata for data findability and sharing

112 of 119

What about Snakemake + provenance?

Genericity: can be an advantage, but a strong limitation when considering community terms such as “gene expression level” or “reference genome”

    • need for domain-specific annotations → FAIR processed data

Fine-grained: large graphs reporting each tool execution and consumed/produced data

    • need for humanly tractable datasets → FAIR summaries

PROV-O W3C standard

113 of 119

Automate FAIRification of Life Science data?

Methods and tools: graph pattern matching, inference rules, SPARQL, Python, Jupyter

“Which was the reference genome used to produce this VCF file?”

“Which subset of the data should be re-investigated?” (a query sketch follows below)

A. Gaignard, H. Skaf-Molli, K. Belhajjame. Findable and reusable workflow data products: A genomic workflow case study. Semantic Web Journal. https://doi.org/10.3233/SW-200374
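The first question can be sketched as a SPARQL query over a PROV-O graph exported from an instrumented run, as below; the file name, the entity filter and the exact property path are illustrative assumptions, not the queries used in the paper.

from rdflib import Graph

# Sketch: query a PROV-O provenance graph exported from an instrumented workflow run.
# "provenance.ttl" and the VCF file filter are placeholders.
g = Graph()
g.parse("provenance.ttl", format="turtle")

query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?source ?label WHERE {
  ?vcf prov:wasDerivedFrom+ ?source .          # walk the derivation chain upstream
  OPTIONAL { ?source rdfs:label ?label }
  FILTER(CONTAINS(STR(?vcf), "sample1.vcf"))   # the VCF file we are asking about
}
"""

for row in g.query(query):
    print(row.source, row.label)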

114 of 119

Semantic tools registries

Bio.Tools + EDAM ontology

  • Scientific topics
  • Nature of processing
  • Nature of data (inputs / outputs)

Example :

  • JASPAR predicts transcription factor (TF) binding sites …
  • JASPAR takes as input TF names or IDs.

115 of 119

Machine- and human-oriented summaries

  1. It’s possible to automatically display the typical bioinformatics tasks the data originate from
  2. It’s possible to document data with text leveraging ontology definitions (EDAM)
  3. It’s possible to automatically produce machine-oriented nanopublications

116 of 119

Conclusion and perspectives

    • Provenance as a raw material to assemble FAIR experiment reports
    • Domain-specific annotations + tool / workflow registries are essential! → Still missing: a bridge between Snakemake workflows and tool IDs
  • Potential applications to machine learning workflows
    • Improve explainability of predictions
    • Improve reuse of pre-trained models
  • How to mine distributed workflows / provenance traces on the web? → Findability as a key requirement to envisage (distributed/federated) querying

117 of 119

Acknowledgments

Audrey Bihouée, Institut du Thorax, BiRD Bioinformatics facility, University of Nantes

Hala Skaf-Molli, University of Nantes, LS2N

Khalid Belhajjame, LAMSADE, University of Paris-Dauphine, PSL

Questions ?

alban.gaignard@univ-nantes.fr

118 of 119

Questions (5 min)

119 of 119

Wrap up

Thanks to all the speakers!

Thanks to all the participants!

  • Workflow Systems are a promising means to (automatically) produce FAIR data
  • Workflows reach beyond WfMS - scripts, notebooks, traces
  • Workflows should be FAIR in their own right
    • The FAIR Principles for workflows
    • Research Data Alliance FAIR for Research Software (FAIR4RS) WG

Slides (https://tinyurl.com/FAIRWFSlides) and Recording will be made available.