1 of 50

(Bio)schemas: Making (not only life sciences) data resources more Interoperable and Discoverable on the Web

Alasdair J.G. Gray
Bioschemas Steering Council Chair
Heriot-Watt University – ELIXIR-UK

NFDI InfraTalk

4 April 2022

2 of 50

Bioschemas: Markup for the Life Sciences

Picture: Carole Goble, Turing Lecture 2018

3 of 50

Schema.org: Enhanced Search Results

Picture: Carole Goble, Turing Lecture 2018

4 of 50

Global, lightweight vocabulary of terms

What we are talking about

  • Types (792)

What we can say about those things

  • Properties (1,447)

5 of 50

Google Search

http://bioschemas.org

6 of 50

Google Search

Oct 2020

Nov 2021

7 of 50

Google Dataset Search (Nov 2021)

https://datasetsearch.research.google.com

https://www.blog.google/products/search/making-it-easier-discover-datasets/

8 of 50

Datasets

Schema definition:

  • Dataset: A body of structured information describing some topic(s) of interest (http://schema.org/Dataset)
  • ~120 properties including:
    • name
    • description
    • isFamilyFriendly
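
To show the shape of such a description, here is a minimal sketch of a Dataset record built as JSON-LD. Only the type and the three property names come from the slide; every value is an invented placeholder.

```python
import json

# Minimal Schema.org Dataset description as a JSON-LD document.
# All values below are hypothetical placeholders for illustration.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example protein annotation set",
    "description": "An illustrative dataset record used to show the markup shape.",
    "isFamilyFriendly": True,
}

# Serialise to the text that would be published alongside the web page.
dataset_json = json.dumps(dataset, indent=2)
print(dataset_json)
```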

9 of 50

Google Dataset Profile

  • 2 required properties
    • Used for Google Dataset Search
  • 10 recommended properties
  • Link to DataCatalog
  • Link to DataDownload

Other profiles: Events, Jobs, ...

https://developers.google.com/search/docs/data-types/dataset

10 of 50

Bioschemas: Schema.org for the life sciences

Profile over Schema.org model + Bioschemas extensions

Layer of constraints + documentation

  • Aim: Improve data discoverability and interoperability in the Life Sciences
  • Approach:
    • Add Life Science types to Schema.org
      • 6 Merged into Pending
    • Provide usage guidelines and examples
      • 6 Minimal properties
      • Link to domain ontologies
    • Support software

Data model

Minimum information

Controlled vocabularies

Cardinality

Documentation

New (properties | types)

Bioschemas Profile

Data model

Marginality (Minimum | Recommended | Optional)

Controlled vocabularies

Cardinality (ONE | MANY)

Documentation

Examples

New types and properties
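
The idea of a profile as a layer of constraints can be sketched as a small check of markup against marginality and cardinality rules. The profile table below is hypothetical and far smaller than any real Bioschemas profile; the function name `check_markup` is likewise an assumption made for illustration.

```python
# Hypothetical profile: property -> (marginality, cardinality).
# Real Bioschemas profiles define their own, much longer, property tables.
EXAMPLE_PROFILE = {
    "identifier": ("Minimum", "ONE"),
    "name": ("Minimum", "ONE"),
    "description": ("Recommended", "ONE"),
    "sameAs": ("Optional", "MANY"),
}

def check_markup(markup, profile):
    """Return a list of problems found when comparing markup to a profile."""
    problems = []
    for prop, (marginality, cardinality) in profile.items():
        value = markup.get(prop)
        if value is None:
            # Only Minimum properties are mandatory.
            if marginality == "Minimum":
                problems.append(f"missing minimum property: {prop}")
            continue
        # A ONE property must not carry a list with several values.
        if cardinality == "ONE" and isinstance(value, list) and len(value) > 1:
            problems.append(f"property {prop} allows only one value")
    return problems

markup = {
    "@type": "Protein",
    "name": "Example protein",
    "sameAs": ["https://example.org/a", "https://example.org/b"],
}
print(check_markup(markup, EXAMPLE_PROFILE))
```

Recommended and Optional properties are deliberately not reported when absent, which mirrors the three marginality levels listed above.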

11 of 50

Bioschemas: Lightweight semantics

  • Many domain ontologies
    • Designed to model biology
  • Bioschemas focus on search!

12 of 50

Findable

Accessible

Interoperable

Reusable

  • Globally unique identifiers
  • Community defined enriched metadata
  • Indexable by search engines
  • JSON-LD/RDFa
  • Link to controlled vocabularies
  • Links to other resources
  • License
  • Provenance
  • Retrievable
  • HTTP

13 of 50

Bioschemas Community

bioschemas.org/liveDeploys

  • 23 Types
  • 37 Profiles
  • 83 Sites
  • Over 80M Pages
  • 162 Profile deployments
  • 61 ELIXIR deployments

14 of 50

Markup

15 of 50

Live Deployment List: bioschemas.org/liveDeploys

16 of 50

Existing Deployed Markup

MolecularEntity profile

  • ChEMBL
  • Guide to Pharmacology
  • MassBank Europe
  • Scholia

17 of 50

ChemicalSubstance Profile: https://bioschemas.org/profiles/ChemicalSubstance

18 of 50

MolecularEntity Profile: https://bioschemas.org/profiles/MolecularEntity

19 of 50

Adding Bioschemas to a Resource

  1. Decide on page type: e.g. Gene, Protein, ...
  2. Map to Bioschemas profile
    • Use generator
  3. Embed JSON-LD into pages
  4. Register live deployment
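
Step 3 can be sketched as follows: serialise the markup and wrap it in the `script` element that carries JSON-LD in an HTML page. This is not the Bioschemas generator, only an illustration; the function name and the Gene values are hypothetical.

```python
import json

def jsonld_script_tag(markup):
    """Serialise markup and wrap it in the script element used for JSON-LD."""
    body = json.dumps(markup, indent=2)
    # Escape "</" so that a string value cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'

# Hypothetical markup for a Gene page.
gene_markup = {
    "@context": "https://schema.org",
    "@type": "Gene",
    "identifier": "EXAMPLE:0001",
    "name": "Example gene",
}

tag = jsonld_script_tag(gene_markup)
print(tag)
```

The resulting element is typically placed in the page's `head`, so the markup travels with the page that a harvester or search engine fetches.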

http://tiny.cc/bs-live-deploy

20 of 50

Profile Creation Process

Mapping

Profile

Use cases

Mockup

Adoption

Testing

Application

21 of 50

Bioschemas: Profiles & Deployments

Released Profiles

  • ChemicalSubstance
  • ComputationalTool
  • ComputationalWorkflow
  • DataCatalog
  • Dataset
  • Gene
  • MolecularEntity
  • Protein
  • Sample
  • Taxon

Picture: Carole Goble, Turing Lecture 2018

100+ Deployments: Logos of some sites with Bioschemas markup

22 of 50

Exploiting Bioschemas Markup

23 of 50

Specialised Search: TeSS

  • contact
  • description
  • endDate
  • eventType
  • hostInstitution
  • location
  • name
  • startDate

Bioschemas Course:

  • contact
  • description
  • endDate
  • eventType
  • hostInstitution
  • description
  • keywords
  • name
  • about
  • abstract

Bioschemas CourseInstance:

Bioschemas TrainingMaterial:

toxicology

No need for custom APIs

No concept merging

24 of 50

Bioschemas Profile for Workflows

270+ workflow management systems

Bioschemas profile

  • Minimum Information for Registering a Computational Workflow
  • Creators, inputs, outputs, WfMS type, …

Working with WfMS providers to extract and add Bioschemas markup

Workflow Registry for discovery

Workflow Package for exchange & portability

Search

Mark-up, Validation

https://bioschemas.org/profiles/ComputationalWorkflow/

25 of 50

Community Registry: IDPcentral

  • name
  • description
  • ...

Protein

No need for custom APIs

FAIR community registry of Bioschemas metadata

  • name
  • description
  • ...

SequenceAnnotation

  • rangeStart
  • rangeEnd
  • ...

SequenceRange

Concept merging

26 of 50

IDP Data Sources

  • Curated from publications
    • Structure disorder
    • Functional annotation
  • 2,038 protein entries
  • Experimental and predicted
    • All known protein sequences
  • 189M protein entries
    • 2,074 included in sitemap
  • Deposition database of protein structural assemblies
  • 172 protein assemblies
    • 90 distinct proteins

27 of 50

Bioschemas Markup for IDP

  • Protein
  • SequenceAnnotation
  • SequenceRange
  • Taxon
  • Dataset
  • ScholarlyArticle

Not shown:

  • DataCatalog

Legend

Red: Schema.org

Blue: Pending Schema.org

Green: Bioschemas

28 of 50

BMUSE: Bioschemas Markup Scraper and Extractor

  • Data harvester
    • List of URLs and sitemaps
  • Extracts markup
    • JSON-LD or RDFa or both
    • Static or dynamic
  • Returns markup with provenance
    • Where
    • When
    • Tool version
  • File per page of harvested markup
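
The harvesting step can be illustrated with a small sketch that pulls JSON-LD out of a static page and attaches provenance (where, when, tool version). This is not BMUSE's implementation: it uses only the Python standard library, ignores RDFa, and cannot handle dynamically generated markup. The names `JsonLdExtractor`, `harvest`, and the version string are assumptions.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

def harvest(html, source_url, tool_version="sketch-0.1"):
    """Return the page's JSON-LD blocks together with provenance."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return {
        "source": source_url,                                   # where
        "retrieved": datetime.now(timezone.utc).isoformat(),    # when
        "toolVersion": tool_version,                            # tool version
        "markup": parser.blocks,
    }

# Hypothetical static page with embedded markup.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Protein", "name": "Example protein"}
</script>
</head><body><p>Static page</p></body></html>"""

record = harvest(page, "https://example.org/protein/1")
print(record["markup"])
```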

29 of 50

Harvested Markup

  • Markup harvested through standard API
    • HTTP Get requests
    • Saves time as common API and model (Bioschemas) for all sites
  • Harvested markup is page centric
    • Multiple sites represent the same concept
    • Sites use their own IDs

No need for custom APIs

Concept merging

30 of 50

Identity Reconciliation

  • All sources include cross-reference to UniProt (schema:sameAs)
  • Differences in UniProt URL
    • https://www.uniprot.org/uniprot/
    • http://purl.uniprot.org/uniprot/
  • Queries extract UniProt accession
  • IDP-KG IDs generated as https://idpcentral.org/id/<accession>
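
The reconciliation described above can be sketched as: strip either UniProt URL prefix to recover the accession, then mint the IDP-KG identifier from it. The slide says this is done with queries over the harvested data; the regular expression and function names here are an illustrative stand-in, and the accession in the example is used only as sample input.

```python
import re

# Matches both URL forms listed above and captures the trailing accession.
UNIPROT_URL = re.compile(
    r"^https?://(?:www|purl)\.uniprot\.org/uniprot/([A-Za-z0-9]+)/?$"
)

def uniprot_accession(url):
    """Return the accession from a UniProt URL, or None if it does not match."""
    match = UNIPROT_URL.match(url.strip())
    return match.group(1) if match else None

def idpkg_id(same_as_url):
    """Mint an IDP-KG identifier from a schema:sameAs UniProt link."""
    accession = uniprot_accession(same_as_url)
    if accession is None:
        return None
    return f"https://idpcentral.org/id/{accession}"

# Two sources pointing at the same protein through different URL forms.
a = idpkg_id("https://www.uniprot.org/uniprot/P04637")
b = idpkg_id("http://purl.uniprot.org/uniprot/P04637")
print(a, a == b)
```

Because both URL forms collapse to one identifier, entries from different sites that describe the same protein end up on the same node.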

Concept merging

31 of 50

IDP-KG: Merging Entries Options

  1. Pick canonical source: useful for values like name
    • Supplement with other entries when values are missing
  2. Concatenate entries from all sources: useful for values like synonym
  3. Keep all values with provenance of source: the Wikidata approach

Concept merging

Keep all values with provenance of source
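
The three merging options can be compared side by side in a short sketch. The source names, property values, and function names are all invented for illustration and do not reflect how IDP-KG stores its data.

```python
def pick_canonical(entries, prop, source_order):
    """Option 1: take the value from the first source, in priority order, that has one."""
    for source in source_order:
        value = entries.get(source, {}).get(prop)
        if value is not None:
            return value
    return None

def concatenate(entries, prop):
    """Option 2: union of values across sources, order kept, duplicates dropped."""
    merged = []
    for record in entries.values():
        for value in record.get(prop, []):
            if value not in merged:
                merged.append(value)
    return merged

def keep_with_provenance(entries, prop):
    """Option 3: keep every value together with the source it came from."""
    return [
        {"value": record[prop], "source": source}
        for source, record in entries.items()
        if prop in record
    ]

# Hypothetical entries for one protein from three sources.
entries = {
    "sourceA": {"name": "Example protein", "synonym": ["EP1", "ExP"]},
    "sourceB": {"synonym": ["EP1", "Example-1"]},
    "sourceC": {"name": "Example protein 1", "synonym": []},
}

print(pick_canonical(entries, "name", ["sourceB", "sourceA", "sourceC"]))
print(concatenate(entries, "synonym"))
print(keep_with_provenance(entries, "name"))
```

Option 3 loses nothing and defers the choice of value to whoever queries the graph, at the cost of larger results.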

32 of 50

Data Verification

  • Process tested with small samples of data with known properties:
    • Number of proteins per source
    • Overlap of proteins
  • Full harvest run and overlaps verified against sources
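
A check of this kind can be sketched with plain set operations: count proteins per source and count the overlap for each pair of sources, then compare against the known figures. The sample identifiers and the function name are hypothetical.

```python
from itertools import combinations

def source_report(proteins_by_source):
    """Return per-source protein counts and pairwise overlap counts."""
    counts = {source: len(ids) for source, ids in proteins_by_source.items()}
    overlaps = {
        (a, b): len(proteins_by_source[a] & proteins_by_source[b])
        for a, b in combinations(sorted(proteins_by_source), 2)
    }
    return counts, overlaps

# Small sample with known properties, as used for testing the process.
sample = {
    "sourceA": {"P1", "P2", "P3"},
    "sourceB": {"P2", "P3", "P4"},
    "sourceC": {"P3"},
}

counts, overlaps = source_report(sample)
print(counts)
print(overlaps)
```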

33 of 50

IDP-KG Interfaces

SPARQL Endpoint

Annotation count per protein

REST API

34 of 50

Querying IDP-KG

35 of 50

Querying IDP-KG

Count by type

Protein information

Annotation count per protein

Annotation information

Annotation count by term code
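
As an example of the first query type, a "count by type" query can be written in SPARQL 1.1 and sent as an HTTP GET request following the SPARQL Protocol. The sketch only builds the request URL and makes no network call; the endpoint address is a placeholder, not the real IDP-KG endpoint, and the query is a generic one rather than the query shown on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Generic SPARQL 1.1 query: number of resources per rdf:type.
COUNT_BY_TYPE = """
SELECT ?type (COUNT(?s) AS ?count)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?count)
"""

def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL with the query in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"

# Placeholder endpoint; substitute the address of the SPARQL endpoint in use.
url = sparql_get_url("https://example.org/sparql", COUNT_BY_TYPE)
print(url)

# Decoding the URL recovers the original query text.
print(parse_qs(urlparse(url).query)["query"][0])
```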

36 of 50

37 of 50

Bioschemas Data Harvest

38 of 50

Project 29:

#29_bioschemas

Alban Gaignard, Leyla Jael Garcia Castro, Alasdair J. G. Gray / Online

Pages harvested

413,748

Sites harvested

25 (partially)

https://swel.macs.hw.ac.uk/data/

39 of 50

Global Research Graph: OpenAIRE

No need for custom APIs

Concept merging

Terminology transformation

Markup needs to have:

  • Sitemap: Currently only 28/73
  • Link to
    • Publication
    • Dataset
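
The sitemap requirement matters because a harvester uses it to find the pages to visit. A minimal sketch of reading page URLs from a sitemap in the standard sitemaps.org XML format follows; the example document and function name are invented, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_xml):
    """Return the page URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]

# Hypothetical sitemap with two pages.
example_sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/protein/1</loc></url>
  <url><loc>https://example.org/protein/2</loc></url>
</urlset>"""

urls = sitemap_urls(example_sitemap)
print(urls)
```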

40 of 50

EOSC-ENHANCE Bioschemas use case

VALIDATE

REGISTER

Added-value services

Slide credit: Paolo Manghi, ISTI, CNR, Pisa, Italy

41 of 50

Project 29:

#29_bioschemas

Alban Gaignard, Leyla Jael Garcia Castro, Alasdair J. G. Gray / Online

Bioschemas

  • Deploy markup
    • Include links to publications
  • Implement sitemap

FAIRsharing

  • Register on FAIRsharing

Harvesting & Mapping

  • Harvest data
  • Map Bioschemas to DataCite

Lorem 4

  • Harvest data
  • Align identifiers
  • Store in KG

42 of 50

Connecting TeSS and bio.tools

Work by Alban Gaignard (ELIXIR-FR)

2019

43 of 50

Automated Data Curation

  • description
  • keywords
  • name
  • provider
  • url

Bioschemas DataCatalog:

  • alternateName
  • citation
  • dateCreated
  • license

44 of 50

Data Exchange Without an API: MarRef → BioSamples

https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md

Bioschemas Scraper

45 of 50

Rich snippet generation

46 of 50

Summary

47 of 50

Bioschemas: Markup for the Life Sciences

Picture: Carole Goble, Turing Lecture 2018

48 of 50

Bioschemas

What?

  • Exploiting schema.org to make Life Sciences resources more discoverable
    • Search engines will index and understand markup

How?

  • Extending Schema.org vocabulary for life sciences
  • Guidelines on how to mark up resources
  • Software to support deployment and consumption

Approach can be followed by other domains!

49 of 50

Future Challenges

  • Harvesting is compute- and time-intensive
    • For both provider and consumer
    • Solutions based on data dumps are being discussed
  • Data quality varies widely
    • Requires validators

50 of 50

Acknowledgements: http://bioschemas.org/people

Join Bioschemas: http://bioschemas.org/howtojoin/

bioschemas.org

@bioschemas

github.com/bioschemas

Bioschemas Community Call

4th Monday of the month

17:00 CET, 25 April 2022

tiny.cc/bs-slack