From Scientific Workflow Provenance
to Linked Experiment Reports
Hala Skaf-Molli
University of Nantes
Distributed Data Management Team
SemSci 2018@ISWC: Enabling Open Semantic Science
9 October 2018, Monterey, California, USA
Reuse instead of re-executing a scientific workflow?
Scientific Workflow for RNA-seq
Scientific experiment for RNA sequencing
to quantify gene expression levels under multiple biological conditions
Running an RNA-seq workflow is resource-intensive
| | 1 sample | 300 samples |
| Input data | 2 x 17 GB | 10.2 TB |
| 1-core CPU | 170 hours | 5.9 years |
| 32-core CPU | 32 hours | 14 months |
| Output data | 12 GB | 3.6 TB |
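The 300-sample figures follow from the single-sample ones by simple scaling. The sketch below reproduces that arithmetic under stated assumptions: samples are processed one after another, storage units are decimal (a factor of 1000 between giga and tera), and a year has 8,760 hours. The function name `scale_costs` is illustrative. The results land close to the rounded figures in the table.

```python
HOURS_PER_YEAR = 24 * 365              # assumption: non-leap year
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # assumption: average month


def scale_costs(n_samples: int) -> dict:
    """Scale the per-sample RNA-seq costs from the table to n_samples.

    Assumes samples are processed sequentially and share no
    intermediate results.
    """
    input_gb = 2 * 17 * n_samples    # two 17-giga input files per sample
    output_gb = 12 * n_samples       # 12 giga of output per sample
    hours_1_core = 170 * n_samples
    hours_32_cores = 32 * n_samples
    return {
        "input_tb": input_gb / 1000,
        "output_tb": output_gb / 1000,
        "years_1_core": hours_1_core / HOURS_PER_YEAR,
        "months_32_cores": hours_32_cores / HOURS_PER_MONTH,
    }


costs = scale_costs(300)
for name, value in costs.items():
    print(f"{name}: {value:.1f}")
```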
Reuse instead of re-running!
How can we find it?
With provenance?
PROV-O: The provenance ontology
https://www.w3.org/TR/prov-o
PROV traces in action ...
Plenty of information for reproducibility, but not for reuse!
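To make the point concrete, here is a minimal sketch of what a PROV-O trace records, written with plain Python tuples rather than an RDF library. The properties `prov:used` and `prov:wasGeneratedBy` come from PROV-O; the file and step names are invented for illustration. The trace is enough to work out how an output was produced, but nothing in it says which species or hypothesis the experiment concerns.

```python
# A toy provenance trace as (subject, predicate, object) triples.
# prov:used and prov:wasGeneratedBy are PROV-O properties; every
# resource name below is hypothetical.
TRACE = [
    ("alignment", "prov:used", "sample1.fastq"),
    ("alignment", "prov:used", "reference.fa"),
    ("aligned.bam", "prov:wasGeneratedBy", "alignment"),
    ("counting", "prov:used", "aligned.bam"),
    ("counts.tsv", "prov:wasGeneratedBy", "counting"),
]


def lineage(entity: str, trace: list) -> set:
    """Return every entity that `entity` transitively depends on."""
    ancestors = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        # Activities that generated the current entity.
        activities = [o for s, p, o in trace
                      if s == current and p == "prov:wasGeneratedBy"]
        for activity in activities:
            # Inputs used by those activities.
            for s, p, o in trace:
                if s == activity and p == "prov:used" and o not in ancestors:
                    ancestors.add(o)
                    stack.append(o)
    return ancestors


inputs_of_counts = lineage("counts.tsv", TRACE)
print(sorted(inputs_of_counts))
```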
Need to know the context of the experiment.
Hypothesis, material, ...
Domain-specific reports = PROV + context
As a scientist, I can search for the species associated with the samples (NCBITaxon:9606 is Homo sapiens), the scientific hypotheses, ...
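The idea of adding context to provenance can be sketched as follows. This is an illustration, not the authors' data model: the predicates `ex:hasSpecies` and `ex:hasHypothesis` and all resource names are hypothetical. Only `prov:wasDerivedFrom` (PROV-O) and the NCBITaxon identifiers are real terms.

```python
# Plain provenance: which result was derived from which sample.
PROVENANCE = [
    ("counts_A.tsv", "prov:wasDerivedFrom", "sample_A"),
    ("counts_B.tsv", "prov:wasDerivedFrom", "sample_B"),
]

# Domain context added on top (hypothetical ex: predicates).
CONTEXT = [
    ("sample_A", "ex:hasSpecies", "NCBITaxon:9606"),   # Homo sapiens
    ("sample_B", "ex:hasSpecies", "NCBITaxon:10090"),  # Mus musculus
    ("sample_A", "ex:hasHypothesis", "hypothesis_1"),
]


def results_for_species(taxon: str, provenance: list, context: list) -> list:
    """Find results derived from samples annotated with `taxon`."""
    samples = {s for s, p, o in context
               if p == "ex:hasSpecies" and o == taxon}
    return sorted(s for s, p, o in provenance
                  if p == "prov:wasDerivedFrom" and o in samples)


human_results = results_for_species("NCBITaxon:9606", PROVENANCE, CONTEXT)
print(human_results)
```

Without the `CONTEXT` triples, the same search has nothing to match on, which is the gap the domain-specific report is meant to fill.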
Experiment reporting
PoeM: from provenance to reports
Alban Gaignard, Hala Skaf-Molli, Audrey Bihouée: From Scientific Workflow Patterns to 5-star Linked Open Data. 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016.
Domain-specific annotations
Producing reports with rules
A rule as a SPARQL query
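A report-producing rule of this kind can be pictured as a SPARQL CONSTRUCT query: the WHERE clause matches a provenance pattern and the CONSTRUCT clause emits a report statement. The sketch below is not the actual PoeM rule. The `ex:` vocabulary and all resource names are hypothetical, and a small pure-Python matcher stands in for a SPARQL engine so the example runs without extra libraries.

```python
# The rule, written as SPARQL for readability (not executed here).
RULE_SPARQL = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ex:   <http://example.org/>
CONSTRUCT { ?out ex:basedOnReferenceGenome ?genome }
WHERE {
  ?out  prov:wasGeneratedBy ?step .
  ?step prov:used ?genome .
  ?genome a ex:ReferenceGenome .
}
"""

# Annotated provenance; names are illustrative only.
TRIPLES = {
    ("aligned.bam", "prov:wasGeneratedBy", "alignment"),
    ("alignment", "prov:used", "sample1.fastq"),
    ("alignment", "prov:used", "GRCh37"),
    ("GRCh37", "rdf:type", "ex:ReferenceGenome"),
}


def apply_rule(triples: set) -> set:
    """Evaluate the same pattern as RULE_SPARQL by nested matching."""
    produced = set()
    for out, p1, step in triples:
        if p1 != "prov:wasGeneratedBy":
            continue
        for s2, p2, genome in triples:
            is_genome = (genome, "rdf:type", "ex:ReferenceGenome") in triples
            if s2 == step and p2 == "prov:used" and is_genome:
                produced.add((out, "ex:basedOnReferenceGenome", genome))
    return produced


report = apply_rule(TRIPLES)
print(report)
```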
Experimentation
Experiment report for 6 samples
Now I can retrieve the species associated with the samples, the scientific hypotheses, ...
Summary
From single-site to multi-site...
Why multi-site workflows?
A Multi-site Workflow
DNA Data Pre-processing
Depends on research questions
Chained Workflow
Data exchange
Running in heterogeneous systems
Generate reports as for a single workflow and combine them?
Provenance heterogeneity
Systems with different PROV extensions
We need to reconcile heterogeneous provenance.
Approach
Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing and Bridging Cross-Workflow Provenance. ESWC (Satellite Events) 2017: 219-234
Multi-provenance entities linking: owl:sameAs
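Once two identifiers from different traces are declared `owl:sameAs`, triples from both traces can be rewritten onto one canonical identifier, which joins the lineage across sites. The sketch below shows one way to do this with a union-find structure. It is an illustration under assumptions, not the SHARP implementation: the `siteA:` and `siteB:` identifiers and the helper name `canonical_ids` are hypothetical.

```python
def canonical_ids(same_as_pairs: list) -> dict:
    """Map each identifier to one representative of its owl:sameAs class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller id as representative.
            lo, hi = sorted((root_a, root_b))
            parent[hi] = lo
    return {x: find(x) for x in list(parent)}


# The output of site A's workflow is the input of site B's workflow.
SAME_AS = [("siteA:variants.vcf", "siteB:input.vcf")]
TRACE_A = [("siteA:variants.vcf", "prov:wasGeneratedBy", "siteA:variant_calling")]
TRACE_B = [("siteB:phenotype_prediction", "prov:used", "siteB:input.vcf")]

mapping = canonical_ids(SAME_AS)
merged = [
    (mapping.get(s, s), p, mapping.get(o, o))
    for s, p, o in TRACE_A + TRACE_B
]
for triple in merged:
    print(triple)
```

After the rewrite, the activity at site B is seen to use the very entity that site A generated, so a lineage query can cross the boundary between the two systems.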
Provenance Harmonization
Provenance Saturation
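Saturation means applying inference rules until no new triple appears. The sketch below uses a single rule, sub-property entailment, and repeats it to a fixed point. It is a simplified illustration, not the rule set used by the authors. In PROV-O, `prov:used` and `prov:wasGeneratedBy` are sub-properties of `prov:wasInfluencedBy`; the system-specific property `sysA:consumed` and the resource names are hypothetical.

```python
# Sub-property declarations: child property -> parent property.
SUB_PROPERTY_OF = {
    "sysA:consumed": "prov:used",  # hypothetical system extension
    "prov:used": "prov:wasInfluencedBy",
    "prov:wasGeneratedBy": "prov:wasInfluencedBy",
}


def saturate(triples: set) -> set:
    """Add inferred triples until a fixed point is reached."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(closed):
            parent = SUB_PROPERTY_OF.get(p)
            if parent and (s, parent, o) not in closed:
                closed.add((s, parent, o))
                changed = True
    return closed


TRACE = {("sysA:step1", "sysA:consumed", "sysA:data.vcf")}
saturated = saturate(TRACE)
for triple in sorted(saturated):
    print(triple)
```

After saturation, a trace written with a system-specific extension can be queried with standard PROV properties only.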
Top 10 predicates
Most abstract relation for data lineage
Experiment reports as Nanopublications
Life Science Question: “Which reference genome was used when predicting the phenotypes?”
Summary
management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
Next steps ..
Thanks to
Khalid Belhajjame
University of Paris-Dauphine
Alban Gaignard
Institut du Thorax
LS2N, CNRS
Audrey Bihouée
Institut du Thorax
University of Nantes
Thank you ...