1 of 56

From Scientific Workflow Provenance to Linked Experiment Reports

Hala Skaf-Molli

University of Nantes

Distributed Data Management Team

SemSci 2018@ISWC: Enabling Open Semantic Science

9 October 2018, Monterey, California, USA

3 of 56

Reuse instead of re-executing a scientific workflow?

4 of 56

Scientific Workflow for RNA-seq

Scientific experiment for RNA sequencing to quantify gene expression levels under multiple biological conditions

10 of 56

Running RNA-seq workflows is resource-consuming

                1 sample     300 samples
Input data      2 x 17 GB    10.2 TB
1-core CPU      170 hours    5.9 years
32-core CPU     32 hours     14 months
Output data     12 GB        3.6 TB
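
The 300-sample figures follow from scaling the single-sample figures linearly. A minimal sketch of that arithmetic, assuming the samples are processed one after another, sizes are read as decimal gigabytes and terabytes, and a year has 365 days and a month 30 days. None of these conventions are stated on the slide, so the durations land near the slide's rounded values rather than exactly on them:

```python
# Linear scaling of the single-sample figures to 300 samples.
# Assumptions (not from the slides): sequential processing,
# decimal units (1 TB = 1000 GB), 365-day years, 30-day months.

SAMPLES = 300

input_gb_per_sample = 2 * 17   # two 17 GB input files per sample
output_gb_per_sample = 12
hours_1_core = 170
hours_32_cores = 32


def scale(per_sample, samples=SAMPLES):
    """Total for `samples` samples processed sequentially."""
    return per_sample * samples


total_input_tb = scale(input_gb_per_sample) / 1000
total_output_tb = scale(output_gb_per_sample) / 1000
years_1_core = scale(hours_1_core) / (24 * 365)
months_32_cores = scale(hours_32_cores) / (24 * 30)

print(f"input data : {total_input_tb:.1f} TB")
print(f"output data: {total_output_tb:.1f} TB")
print(f"1 core     : {years_1_core:.1f} years")
print(f"32 cores   : {months_32_cores:.1f} months")
```

Whatever rounding is used, the order of magnitude is the point: re-running the workflow for a full cohort costs years of single-core compute and terabytes of storage.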

11 of 56

Reuse instead of re-running!

12 of 56

How can I find it?

With provenance?

13 of 56

PROV-O: The provenance ontology

https://www.w3.org/TR/prov-o

14 of 56

PROV traces in action ...

  • Large PROV graph
  • Technical vocabularies

A lot of information for reproducibility, not for reuse!

15 of 56

Need to know the context of the experiment.

Hypothesis, material, ...

16 of 56

Domain-specific reports = PROV + context

As a scientist, I can search for the species associated with the samples (NCBITaxon:9606 is Homo sapiens), scientific hypotheses, ...

20 of 56

Experiment reporting

  • Experiment reports already exist, but they are written manually in HTML.
  • We want to generate them automatically.
  • Why is it challenging?
    • Reports are domain-dependent.

21 of 56

PoeM: from provenance to reports

Alban Gaignard, Hala Skaf-Molli, Audrey Bihouée: From Scientific Workflow Patterns to 5-star Linked Open Data. 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016.

22 of 56

Domain-specific annotations

  • Link steps of this sequence pattern to domain vocabularies
  • P-Plan and EDAM ontology

23 of 56

Domain-specific annotations

  • MicroPublication and Open Annotation ontologies

24 of 56

Producing reports with rules

29 of 56

A rule as a SPARQL query

31 of 56

Experimentation

  • Real-life RNA-seq workflow to study 6 mouse samples
  • Workflow implemented in Galaxy
  • We developed a tool to export PROV traces from Galaxy histories as Linked Data
  • Code and demo: http://poem.univ-nantes.fr

33 of 56

Experiment report for 6 samples

Now I can retrieve the species associated with the samples, the scientific hypotheses, ...

34 of 56

Summary

  • PoeM is a “semi-automated” approach for generating linked experiment reports.
  • PoeM is designed for mining a single-site workflow.

35 of 56

From single-site to multi-site...

  • OK, I can generate reports for a single workflow ...
  • Can I generate reports for a multi-site workflow?

36 of 56

Why multi-site workflows?

  • Scientific experiments involve geographically distributed partners
  • Each partner has its own infrastructure and specific expertise
    • Multi-site workflow

37 of 56

A Multi-site Workflow

DNA Data Pre-processing

Depends on research questions

38 of 56

Chained Workflow

Data exchange

39 of 56

Running in heterogeneous systems

40 of 56

Generate reports as for a single workflow and combine them?

Provenance heterogeneity

41 of 56

Systems with different PROV extensions

42 of 56

We need to reconcile heterogeneous provenance.

43 of 56

Approach

Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing and Bridging Cross-Workflow Provenance. ESWC (Satellite Events) 2017: 219-234

44 of 56

Multi-provenance entities linking: owl:sameAs

  • Two different PROV entities associated with the same file content should be considered the same
    • Compute the SHA-512 fingerprint of each file
    • Annotate PROV entities with the SHA-512 digest
    • Automatically produce owl:sameAs links using a SPARQL CONSTRUCT query
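
The linking idea above can be sketched in a few lines. This is only an illustration, not the approach's actual implementation: it swaps the SPARQL CONSTRUCT query for a plain Python grouping step, and the entity IRIs, file contents, and the function name `link_same_content` are hypothetical. Only `hashlib.sha512` from the Python standard library is used.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def sha512_digest(content: bytes) -> str:
    """Hex-encoded SHA-512 fingerprint of a file's content."""
    return hashlib.sha512(content).hexdigest()


def link_same_content(entities):
    """Return owl:sameAs triples for entities whose files share a digest.

    `entities` maps a PROV entity IRI to the bytes of the file it describes.
    """
    by_digest = defaultdict(list)
    for iri, content in entities.items():
        by_digest[sha512_digest(content)].append(iri)

    triples = set()
    for iris in by_digest.values():
        # One triple per unordered pair of entities with identical content.
        for left, right in combinations(sorted(iris), 2):
            triples.add((left, OWL_SAME_AS, right))
    return triples


# Hypothetical entities recorded by two different workflow systems.
entities = {
    "http://example.org/galaxy/dataset/42": b"chr1\t100\tA\tG\n",
    "http://example.org/taverna/data/7": b"chr1\t100\tA\tG\n",
    "http://example.org/taverna/data/8": b"chr2\t200\tC\tT\n",
}

same_as = link_same_content(entities)
for triple in sorted(same_as):
    print(triple)
```

Only the two entities with byte-identical content are linked. In an RDF setting the digest would be stored as an annotation on each entity, and the pairing step would be the job of the CONSTRUCT query.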

45 of 56

Provenance Harmonization

  • Data integration problem [1]
    • Mediator, mappings, etc.
  • Provenance vocabularies in workflow systems are extensions of PROV-O
    • Concepts and relations are “implicitly” mapped to PROV-O

  1. A. Halevy and Z. Ives. Principles of Data Integration. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2012.

46 of 56

Provenance Saturation

  • PROV Inferences
    • Repeatedly inferring new facts based on existing ones until no new facts are produced
  • We implemented provenance constraints using Jena rules and the Jena OWL reasoner.
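
The saturation loop described above can be sketched as a fixpoint computation. This is not the Jena-based implementation mentioned on the slide: it swaps in plain Python and applies only two RDFS rules (transitivity of rdfs:subPropertyOf and propagation of statements to super-properties). The `ex:` names are hypothetical stand-ins for a workflow system's PROV extension. The fact that prov:wasGeneratedBy is a sub-property of prov:wasInfluencedBy comes from PROV-O.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"


def saturate(triples):
    """Add inferred triples until an iteration produces nothing new."""
    closure = set(triples)
    while True:
        sub_props = {(s, o) for s, p, o in closure if p == SUB_PROPERTY_OF}
        inferred = set()
        for sub, sup in sub_props:
            # Rule 1: subPropertyOf is transitive.
            for sub2, sup2 in sub_props:
                if sup == sub2:
                    inferred.add((sub, SUB_PROPERTY_OF, sup2))
            # Rule 2: a statement made with a sub-property also holds
            # for its super-property.
            for s, p, o in closure:
                if p == sub:
                    inferred.add((s, sup, o))
        new_facts = inferred - closure
        if not new_facts:
            return closure  # fixpoint reached
        closure |= new_facts


graph = {
    # Hypothetical system-specific extension of PROV-O.
    ("ex:wasOutputFrom", SUB_PROPERTY_OF, "prov:wasGeneratedBy"),
    # Declared in PROV-O.
    ("prov:wasGeneratedBy", SUB_PROPERTY_OF, "prov:wasInfluencedBy"),
    # Hypothetical trace statement using the extension vocabulary.
    ("ex:bam_file", "ex:wasOutputFrom", "ex:alignment_run"),
}

saturated = saturate(graph)
for triple in sorted(saturated - graph):
    print(triple)
```

After saturation, a trace written in a system-specific vocabulary can also be queried with plain PROV-O terms, which is what makes traces from different systems comparable.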

47 of 56

Top 10 predicates

48 of 56

Most abstract relation for data lineage

49 of 56

Experiment reports as Nanopublications

  • Linking data to scientific context
    • P-Plan, EDAM, SIO

50 of 56

Life Science Question: “Which reference genome was used when predicting the phenotypes?”

51 of 56

Summary

  • Preliminary results of a “semi-automated” approach for FAIR [1] experiment reports.
  • Single or multi-site scientific workflow:
    • MicroPublication, Nanopublication

  1. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).

52 of 56

Summary

  • Provenance and domain-specific annotations are major ingredients for producing findable and reusable data.

53 of 56

Next steps ...

  • Evaluation with life scientists and bioinformaticians
  • Decentralized provenance harmonization
  • Federated queries over decentralized provenance and decentralized linked experiment reports

54 of 56

Thanks to

Khalid Belhajjame

University of Paris-Dauphine

Alban Gaignard

Institut du Thorax

LS2N, CNRS

Audrey Bihouée

Institut du Thorax

University of Nantes

55 of 56

References

  • Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Cross-workflow Provenance. SeWeBMeDA@ESWC 2017: 50-64
  • Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Galaxy and Taverna Workflow Provenance. SeWeBMeDA@ESWC 2017: 80-83 (demo)
  • Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing and Bridging Cross-Workflow Provenance. ESWC (Satellite Events) 2017: 219-234
  • Alban Gaignard, Hala Skaf-Molli, Audrey Bihouée: From Scientific Workflow Patterns to 5-star Linked Open Data. 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016.

56 of 56

Thank you ...