1 of 26

Bioschemas: Marking up biodiversity websites to improve data discovery and web-scale integration

* Wimmics: AI in bridging social semantics and formal semantics on the Web

Franck MICHEL*

Workshop on Data Standards & Common Language

NFDI 4 Biodiversity

Franck MICHEL - Université Côte d’Azur, CNRS, Inria, I3S, France

2 of 26

Semantic markup for web pages

3 of 26

Schema.org: semantic markup for resources on the internet

Collaborative community project founded in 2011 by Google, Microsoft, Yahoo and Yandex

Defines a common vocabulary to mark up resources on the internet

What we are talking about: types (797)

What we can say about those things: properties (1453)

4 of 26

Bioschemas: schema.org extension for Life Sciences

Community initiative built on top of Schema.org

Aim

Help search engines understand and index webpages

Improve resource discoverability and interoperability

Approach

Reuse/extend Schema.org for life sciences

Keep it simple (no complex domain ontology)

Provide guidelines on how to mark up resources

      • Minimum/recommended/optional properties
      • Link to other vocabularies & domain ontologies

Support software

[Diagram labels: Specification; Data model; Minimum information; Controlled vocabularies; Cardinality; Documentation; Examples; New (properties | types)]

5 of 26

Bioschemas: schema.org extension for Life Sciences

Released terms

  • BioChemEntity
  • ChemicalSubstance
  • ComputationalTool
  • ComputationalWorkflow
  • DataCatalog
  • Dataset
  • Gene
  • MolecularEntity
  • Protein
  • Sample
  • Taxon

Terms in draft status

  • BioSample
  • LabProtocol
  • Phenotype
  • ProteinStructure
  • RNA
  • TaxonName

7 of 26

Taxon

Type: http://schema.org/Taxon

Profile: https://bioschemas.org/profiles/Taxon provides usage recommendations

8 of 26

Example: webpage about taxon Delphinapterus leucas

<script type="application/ld+json">
{
  "@context": [
    "http://schema.org",
    { "dct": "http://purl.org/dc/terms/" }
  ],
  "@type": "Taxon",
  "@id": "60932",
  "dct:conformsTo": {
    "@id": "https://bioschemas.org/profiles/Taxon/0.6-RELEASE",
    "@type": "CreativeWork"
  },
  "name": "Delphinapterus leucas (Pallas, 1776)",
  "taxonRank": "species"
}
</script>
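A profile distinguishes minimum, recommended and optional properties, so a page's markup can be checked against it mechanically. The following is only a sketch of that idea: the `MINIMUM` and `RECOMMENDED` lists are illustrative assumptions, not the authoritative content of the Taxon profile, and `missing_properties` is a hypothetical helper name.

```python
import json

# Assumption: illustrative property lists, NOT the authoritative Taxon profile.
MINIMUM = ("name", "taxonRank")
RECOMMENDED = ("parentTaxon", "alternateName", "identifier", "sameAs")


def missing_properties(markup: dict) -> dict:
    """Return the minimum and recommended properties absent from `markup`."""
    return {
        "minimum": [p for p in MINIMUM if p not in markup],
        "recommended": [p for p in RECOMMENDED if p not in markup],
    }


# The JSON-LD body of the example above, without the <script> wrapper.
EXAMPLE = json.loads("""
{
  "@context": ["http://schema.org", {"dct": "http://purl.org/dc/terms/"}],
  "@type": "Taxon",
  "@id": "60932",
  "dct:conformsTo": {
    "@id": "https://bioschemas.org/profiles/Taxon/0.6-RELEASE",
    "@type": "CreativeWork"
  },
  "name": "Delphinapterus leucas (Pallas, 1776)",
  "taxonRank": "species"
}
""")

report = missing_properties(EXAMPLE)
print(report)
```

With these assumed lists, the short example satisfies the minimum level but none of the recommended properties, which the richer markup on the next slides adds.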

9 of 26

Example markup of a page about taxon Delphinapterus leucas

<script type="application/ld+json">
{
  "@context": [
    "http://schema.org",
    {
      "dct": "http://purl.org/dc/terms/",
      "dwc": "http://rs.tdwg.org/dwc/terms/",
      "dwc:vernacularName": { "@container": "@language" }
    }
  ],
  "@type": "Taxon",
  "@id": "60932",
  "dct:conformsTo": {
    "@id": "https://bioschemas.org/profiles/Taxon/0.6-RELEASE",
    "@type": "CreativeWork"
  },
  "name": "Delphinapterus leucas (Pallas, 1776)",
  "taxonRank": ["species", { "@id": "http://www.wikidata.org/entity/Q7432" }],
  "additionalType": "dwc:Taxon",
  "alternateName": ["Balaena albicans Muller, 1776", "Beluga catodon Gray, 1846"],
  "dwc:vernacularName": [
    { "@language": "en", "@value": "Beluga Whale" },
    { "@language": "fr", "@value": "Bélouga" }
  ],
  ...

10 of 26

Example markup of a page about taxon Delphinapterus leucas

  ...
  "parentTaxon": {
    "@type": "Taxon",
    "name": "Delphinapterus Lacépède, 1804",
    "mainEntityOfPage": "https://inpn.mnhn.fr/espece/cd_nom/191588?lg=en",
    "taxonRank": "genus"
  },
  "image": "https://inpn.mnhn.fr/photos/uploads/webtofs/inpn/3/181473.jpg",
  "sameAs": [
    "http://doris.ffessm.fr/Especes/Delphinapterus-leucas-Beluga-868",
    "http://www.marinespecies.org/aphia.php?p=taxdetails&id=137115",
    "http://www.iucnredlist.org/details/6335"
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "name": "WoRMS id",
      "propertyID": "http://www.wikidata.org/entity/P850",
      "value": "137115"
    }
  ]
}
</script>
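The `sameAs` and `identifier` properties are what let a consumer connect this page to the same taxon in other databases. As a rough sketch of how a client might read them, assuming the markup has already been parsed into a Python dictionary (the helper names `as_list` and `cross_references` are hypothetical):

```python
def as_list(value):
    """Wrap a single JSON-LD value in a list; leave lists unchanged."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def cross_references(markup: dict) -> dict:
    """Collect sameAs links and PropertyValue identifiers from Taxon markup."""
    identifiers = {}
    for item in as_list(markup.get("identifier")):
        if isinstance(item, dict) and item.get("@type") == "PropertyValue":
            # Key each identifier by the property that defines it.
            identifiers[item.get("propertyID")] = item.get("value")
    return {"sameAs": as_list(markup.get("sameAs")), "identifiers": identifiers}


# A subset of the example markup above.
TAXON = {
    "@type": "Taxon",
    "name": "Delphinapterus leucas (Pallas, 1776)",
    "sameAs": [
        "http://www.marinespecies.org/aphia.php?p=taxdetails&id=137115",
        "http://www.iucnredlist.org/details/6335",
    ],
    "identifier": [
        {
            "@type": "PropertyValue",
            "name": "WoRMS id",
            "propertyID": "http://www.wikidata.org/entity/P850",
            "value": "137115",
        }
    ],
}

refs = cross_references(TAXON)
print(refs["identifiers"])
```

Normalizing single values to lists matters because JSON-LD allows a property to carry either one value or an array of values.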

11 of 26

TaxonName

Taxon 0.7-DRAFT

12 of 26

BioSample

13 of 26

Live deployments

Photo: https://www.flickr.com/photos/35034363287@N01/2284904309

14 of 26

Bioschemas: profiles & deployments

Released Profiles

  • ChemicalSubstance
  • ComputationalTool
  • ComputationalWorkflow
  • DataCatalog
  • Dataset
  • Gene
  • MolecularEntity
  • Protein
  • Sample
  • Taxon, TaxonName

Picture: Carole Goble, Turing Lecture 2018

100+ deployments, 100M webpages

15 of 26

Taxon/TaxonName deployments

Leisure sea fishing legislation.

PSB Int. for Plant Phenotype Analysis

Profiles for researchers, organizations, journals, publishers…

16 of 26

Why do (early) deployments matter?

  • A way for the community to show its interest in having these terms
  • Necessary for Schema.org to endorse new types
  • First step to foster novel applications (chicken & egg)

17 of 26

Searching, aggregating, exploiting Bioschemas markup

18 of 26

Exploiting Bioschemas markup

19 of 26

Community Registry: IDPcentral

FAIR community registry of Bioschemas metadata

[Diagram labels: BMUSE; Protein (name, description...); SequenceAnnotation (name, description...); SequenceRange (rangeStart, rangeEnd...)]

Slide by Alasdair Gray. (Bio)schemas: Making (not only life sciences) data resources more Interoperable and Discoverable on the Web. NFDI InfraTalk, 2022.

20 of 26

BMUSE: Bioschemas Markup Scraper and Extractor

  • Built by/for Bioschemas
  • Data harvester: list of URLs and/or sitemaps
  • Extracts markup: JSON-LD and RDFa, static or dynamic
  • Returns 1 file per harvested page with provenance
  • Compute- and time-intensive, in particular with dynamic pages

Slide by Alasdair Gray. (Bio)schemas: Making (not only life sciences) data resources more Interoperable and Discoverable on the Web. NFDI InfraTalk, 2022.

21 of 26

From page-centric data to concept-centric knowledge

Harvested markup is page-centric.

But multiple sites/pages may represent the same concept, each site using its own IDs.

Need for concept merging:

  • ID reconciliation through UniProt (when possible)
  • Deal with name discrepancies
  • Keep provenance information
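To make the merging step concrete, here is a minimal sketch that groups page-level records by a shared identifier, keeps every distinct name, and records which pages contributed. It assumes reconciliation has already produced a common identifier for each record; the record layout, the `merge_by_identifier` name and the example.org URLs are illustrative assumptions, not the registry's actual data model.

```python
def merge_by_identifier(records):
    """Merge page-centric records into concept-centric entries keyed by identifier."""
    merged = {}
    for rec in records:
        key = rec["identifier"]
        concept = merged.setdefault(
            key, {"identifier": key, "names": [], "sources": []}
        )
        # Keep name variants instead of silently picking one.
        if rec["name"] not in concept["names"]:
            concept["names"].append(rec["name"])
        # Provenance: remember every page that described this concept.
        concept["sources"].append(rec["source"])
    return merged


# Illustrative records: two pages describing the same protein, one describing another.
RECORDS = [
    {"identifier": "P04637", "name": "Cellular tumor antigen p53",
     "source": "https://example.org/site-a/protein/1"},
    {"identifier": "P04637", "name": "p53",
     "source": "https://example.org/site-b/entry/tp53"},
    {"identifier": "P69905", "name": "Hemoglobin subunit alpha",
     "source": "https://example.org/site-a/protein/2"},
]

concepts = merge_by_identifier(RECORDS)
for concept in concepts.values():
    print(concept["identifier"], concept["names"], len(concept["sources"]))
```

Records without a reconcilable identifier are the hard case and are not handled here.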

Slide by Alasdair Gray. (Bio)schemas: Making (not only life sciences) data resources more Interoperable and Discoverable on the Web. NFDI InfraTalk, 2022.

22 of 26

Next steps

23 of 26

Bioschemas work on biodiversity

Currently:

Taxon, TaxonName: links to DwC terms

BioSample, Dataset…

Future:

Occurrence: links to DwC occurrences?

Specimen: links to ABCD, openDS, MIDS?

Traits: links to trait ontologies?

24 of 26

Marking up biodiversity resources… at scale

GBIF, EoL, CoL, iDigBio, DiSSCo…

Museum collections,

Literature (BHL, Plazi…),

Citizen science platforms,

Independent institutions,

Associations,

Grey literature…

25 of 26

Take-aways

Marking up webpages

    • Increases data visibility and discoverability
    • Relatively inexpensive
    • Connects unconnected pieces of data, e.g. “grey literature”

Let’s have search engines do the job for us!

    • Discovery is the 1st step towards data integration
    • Connect pieces of data at web scale
    • Existing search engines to do that for us?
    • Develop “domain-specific search engines”
      • Dataset search engines
      • Species Search Engine?

Not the magic bullet

    • High-level description, not a rich ontology
    • No standard for values, e.g. taxonomic ranks
    • How to link to 3rd party domain vocabularies?
    • Concept merging issue
    • …
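The lack of a standard for values shows up as soon as markup from several sites is compared: a rank may be a plain string, an IRI, or both, as in the beluga example. One possible consumer-side workaround is a lookup table, sketched below. The table is a hand-made assumption that contains only the Wikidata IRI used in the example markup, and `normalize_ranks` is a hypothetical helper.

```python
# Assumption: a hand-made lookup table, not a community standard.
RANK_IRIS = {"http://www.wikidata.org/entity/Q7432": "species"}


def normalize_ranks(value) -> set:
    """Reduce a taxonRank value (string, IRI object, or list of both) to labels."""
    values = value if isinstance(value, list) else [value]
    ranks = set()
    for item in values:
        if isinstance(item, dict):
            # IRI reference such as {"@id": "..."}; unknown IRIs are dropped.
            label = RANK_IRIS.get(item.get("@id"))
            if label:
                ranks.add(label)
        elif isinstance(item, str):
            # Plain strings may themselves be IRIs or free-text labels.
            ranks.add(RANK_IRIS.get(item, item.strip().lower()))
    return ranks


print(normalize_ranks(["species", {"@id": "http://www.wikidata.org/entity/Q7432"}]))
```

Such a table has to be maintained by each consumer, which is the cost the bullet above points at.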

26 of 26

https://bioschemas.org/

https://github.com/BioSchemas/specifications/wiki

Questions?
