1 of 50

(Bio)schemas: Making (not only life sciences) data resources more Interoperable and Discoverable on the Web

Alasdair J.G. Gray
Bioschemas Steering Council Chair
Heriot-Watt University – ELIXIR-UK

NFDI InfraTalk

4 April 2022

2 of 50

Bioschemas: Markup for the Life Sciences

Picture: Carole Goble, Turing Lecture 2018

3 of 50

Schema.org: Enhanced Search Results

Picture: Carole Goble, Turing Lecture 2018

4 of 50

Global, lightweight vocabulary of terms

Types (792): what we are talking about

Properties (1,447): what we can say about those things

5 of 50

Google Search

http://bioschemas.org

6 of 50

Google Search

Oct 2020

Nov 2021

7 of 50

Google Dataset Search (Nov 2021)

https://datasetsearch.research.google.com

https://www.blog.google/products/search/making-it-easier-discover-datasets/

8 of 50

Datasets

Schema definition:

Dataset: A body of structured information describing some topic(s) of interest
http://schema.org/Dataset
~120 properties including:

name
description
isFamilyFriendly

9 of 50

Google Dataset Profile

2 required properties

Used for Google Dataset Search

10 recommended properties
Link to DataCatalog
Link to DataDownload

Other profiles: Events, Jobs, ...

https://developers.google.com/search/docs/data-types/dataset
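
To make the Dataset markup on the last two slides concrete, here is a minimal sketch in Python that builds a Schema.org Dataset description and serialises it as JSON-LD. All values (names, URLs) are invented placeholders. It assumes `name` and `description` are the two required properties of the Google profile; check the guidelines linked above before relying on that.

```python
import json

# Placeholder values for illustration only; this is not a real dataset.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    # Assumed to be the two required properties of the Google profile.
    "name": "Example protein disorder annotations",
    "description": "A placeholder description of an example dataset.",
    "isFamilyFriendly": True,
    # Link to the catalogue that contains the dataset.
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example catalogue",
        "url": "https://example.org/catalogue",
    },
    # Link to a downloadable distribution of the dataset.
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/example.csv",
    },
}

# Serialise the description as a JSON-LD document.
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```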

10 of 50

Bioschemas: Schema.org for the life sciences

Profile over Schema.org model + Bioschemas extensions

Layer of constraints + documentation

Aim: Improve data discoverability and interoperability in the Life Sciences
Approach:

Add Life Science types to Schema.org

6 Merged into Pending

Provide usage guidelines and examples

6 Minimal properties
Link to domain ontologies

Support software

Data model

Minimum information

Controlled vocabularies

Cardinality

Documentation

New (properties | types)

Bioschemas Profile

Data model

Marginality (Minimum | Recommended | Optional)

Controlled vocabularies

Cardinality (ONE | MANY)

Documentation

Examples

New types and properties
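
One way to read a profile is as a table of rules (marginality and cardinality per property) that markup can be checked against. The sketch below illustrates that idea in Python. The profile fragment, the `PropertyRule` class and the `check_markup` function are hypothetical names invented here; they are not the actual Bioschemas profile definitions or validator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyRule:
    marginality: str  # "Minimum", "Recommended" or "Optional"
    cardinality: str  # "ONE" or "MANY"


# Hypothetical profile fragment, invented for illustration.
EXAMPLE_PROFILE = {
    "name": PropertyRule("Minimum", "ONE"),
    "identifier": PropertyRule("Minimum", "ONE"),
    "description": PropertyRule("Recommended", "ONE"),
    "sameAs": PropertyRule("Optional", "MANY"),
}


def check_markup(markup: dict, profile: dict) -> list[str]:
    """Return a list of human-readable problems found in the markup."""
    problems = []
    for prop, rule in profile.items():
        value = markup.get(prop)
        if value is None:
            # Only Minimum properties are mandatory.
            if rule.marginality == "Minimum":
                problems.append(f"missing minimum property: {prop}")
            continue
        # A ONE property must not carry a list of several values.
        if rule.cardinality == "ONE" and isinstance(value, list) and len(value) > 1:
            problems.append(f"property {prop} allows only one value")
    return problems


markup = {
    "@type": "Protein",
    "name": ["Example protein", "Another name"],
    "sameAs": ["https://example.org/a", "https://example.org/b"],
}
print(check_markup(markup, EXAMPLE_PROFILE))
```

Recommended and Optional properties are deliberately not reported when absent; a stricter checker could emit warnings for missing Recommended properties.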

11 of 50

Bioschemas: Lightweight semantics

Many domain ontologies

Designed to model biology

Bioschemas focuses on search!

12 of 50

Findable

Accessible

Interoperable

Reusable

Globally unique identifiers
Community defined enriched metadata
Indexable by search engines

JSON-LD/RDFa
Link to controlled vocabularies
Links to other resources

License
Provenance

Retrievable
HTTP

13 of 50

Bioschemas Community

bioschemas.org/liveDeploys

23

Types

37

Profiles

83

Sites

80M

Pages

Over

162

Profile deployments

61

ELIXIR deployments

14 of 50

Markup

15 of 50

Live Deployment List
bioschemas.org/liveDeploys

16 of 50

Existing Deployed Markup

MolecularEntity profile

ChEMBL
Guide to Pharmacology
MassBank Europe
Scholia

17 of 50

ChemicalSubstance Profile: https://bioschemas.org/profiles/ChemicalSubstance

18 of 50

MolecularEntity Profile: https://bioschemas.org/profiles/MolecularEntity

19 of 50

Adding Bioschemas to a Resource

Decide on page type: e.g. Gene, Protein, ...
Map to Bioschemas profile

Use generator

Embed JSON-LD into pages
Register live deployment

http://tiny.cc/bs-live-deploy
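
The "Embed JSON-LD into pages" step can be sketched as follows, assuming the page is assembled as a string in Python. The protein record, its identifier and the example.org URL are placeholders, and `jsonld_script_tag` is a hypothetical helper, not part of any Bioschemas tool.

```python
import json


def jsonld_script_tag(markup: dict) -> str:
    """Wrap markup in the script element used to embed JSON-LD in HTML."""
    body = json.dumps(markup, indent=2)
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder record; the identifier and URL are invented.
protein = {
    "@context": "https://schema.org",
    "@type": "Protein",
    "@id": "https://example.org/protein/P00000",
    "name": "Example protein",
    "identifier": "P00000",
}

page = (
    "<html><head><title>Example protein</title>\n"
    f"{jsonld_script_tag(protein)}\n"
    "</head><body>...</body></html>"
)
print(page)
```

In practice the markup would be produced by the page template or by the generator mentioned above, with one block per page describing that page's main entity.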

20 of 50

Profile Creation Process

Mapping

Profile

Use cases

Mockup

Adoption

Testing

Application

21 of 50

Bioschemas: Profiles & Deployments

Released Profiles

ChemicalSubstance
ComputationalTool
ComputationalWorkflow
DataCatalog
Dataset
Gene
MolecularEntity
Protein
Sample
Taxon

Picture: Carole Goble, Turing Lecture 2018

100+ Deployments: Logos of some sites with Bioschemas markup

22 of 50

Exploiting Bioschemas Markup

23 of 50

Specialised Search: TeSS

29 November 2018

contact
description
endDate
eventType
hostInstitution

location
name
startDate
…

Bioschemas Course:

contact
description
endDate
eventType
hostInstitution

description
keywords
name
about
abstract

Bioschemas CourseInstance:

Bioschemas TrainingMaterial:

toxicology

No need for custom APIs

No concept merging

24 of 50

Bioschemas Profile for Workflows

270+ workflow management systems

Bioschemas profile

- Minimum Information for Registering a Computational Workflow

- Creators, inputs, outputs, WfMS type ….

Working with WfMS providers to extract and add Bioschemas markup

Workflow Registry for discovery

Workflow Package

for exchange & portability

Search

Mark-up,

Validation

https://bioschemas.org/profiles/ComputationalWorkflow/

25 of 50

Community Registry: IDPcentral

name
description
...

Protein

No need for custom APIs

FAIR community registry of Bioschemas metadata

name
description
...

SequenceAnnotation

rangeStart
rangeEnd
...

SequenceRange

Concept merging

26 of 50

IDP Data Sources

Curated from publications

Structure disorder
Functional annotation

2,038 protein entries

Experimental and predicted

All known protein sequences

189M protein entries

2,074 included in sitemap

Deposition database of protein structural assemblies
172 protein assemblies

90 distinct proteins

27 of 50

Bioschemas Markup for IDP

Protein
SequenceAnnotation
SequenceRange
Taxon
Dataset
ScholarlyArticle

Not shown:

DataCatalog

Legend

Red: Schema.org

Blue: Pending Schema.org

Green: Bioschemas

28 of 50

BMUSE: Bioschemas Markup Scraper and Extractor

Data harvester

List of URLs and sitemaps

Extracts markup

JSON-LD or RDFa or both
Static or dynamic

Returns markup with provenance

Where
When
Tool version

File per page of harvested markup

https://github.com/HW-SWeL/BMUSE

https://app.swaggerhub.com/apis-docs/swel/BMUSE/
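
To illustrate what "extracts markup" and "returns markup with provenance" can mean in practice, here is a simplified sketch using only the Python standard library. It is not BMUSE's implementation: it handles static JSON-LD only (no RDFa, no dynamically rendered pages), and the `JsonLdExtractor` class, the `extract_with_provenance` function and the provenance field names are assumptions made for this example.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_with_provenance(html: str, url: str, tool_version: str) -> dict:
    """Return the JSON-LD found in a page plus where/when/tool provenance."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return {
        "markup": parser.blocks,
        "provenance": {
            "where": url,
            "when": datetime.now(timezone.utc).isoformat(),
            "toolVersion": tool_version,
        },
    }


html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Protein", "name": "Example"}'
    "</script></head><body></body></html>"
)
record = extract_with_provenance(html, "https://example.org/p/1", "sketch-0.1")
print(json.dumps(record, indent=2))
```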

29 of 50

Harvested Markup

Markup harvested through standard API

HTTP Get requests
Saves time as common API and model (Bioschemas) for all sites

Harvested markup is page centric

Multiple sites represent the same concept
Sites use their own IDs

No need for custom APIs

Concept merging
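
Because every site exposes its pages in the same way, a harvester only needs a sitemap and plain HTTP GET requests. The sketch below shows that pattern with the Python standard library; the sitemap content and the example.org URLs are placeholders, and `urls_from_sitemap` and `fetch_page` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return the page URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def fetch_page(url: str) -> str:
    """Retrieve one page with a plain HTTP GET request."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder sitemap; a real harvest would download it from the site.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/protein/1</loc></url>
  <url><loc>https://example.org/protein/2</loc></url>
</urlset>
"""

for page_url in urls_from_sitemap(EXAMPLE_SITEMAP):
    # A real harvester would call fetch_page(page_url) here.
    print(page_url)
```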

30 of 50

Identity Reconciliation

All sources include cross-reference to UniProt (schema:sameAs)

Differences in UniProt URL

https://www.uniprot.org/uniprot/
http://purl.uniprot.org/uniprot/

Queries extract UniProt accession

IDP-KG IDs generated as https://idpcentral.org/id/<accession>

Concept merging
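
The reconciliation step can be illustrated with a short Python sketch. The slide states that queries extract the accession; a regular expression is swapped in here purely for illustration. The accession P04637 is used only as an example value, and the function names are hypothetical.

```python
import re

# Matches both URL forms listed on the slide and captures the accession.
UNIPROT_URL = re.compile(
    r"^https?://(?:www|purl)\.uniprot\.org/uniprot/([A-Za-z0-9]+)"
)


def uniprot_accession(same_as_url):
    """Return the UniProt accession in a schema:sameAs URL, or None."""
    match = UNIPROT_URL.match(same_as_url)
    return match.group(1) if match else None


def idp_kg_id(same_as_url):
    """Build the IDP-KG identifier for a UniProt cross-reference."""
    accession = uniprot_accession(same_as_url)
    return f"https://idpcentral.org/id/{accession}" if accession else None


# Both URL styles resolve to the same knowledge-graph identifier.
print(idp_kg_id("https://www.uniprot.org/uniprot/P04637"))
print(idp_kg_id("http://purl.uniprot.org/uniprot/P04637"))
```

Pages from different sites that point at the same accession therefore end up attached to one node in the knowledge graph.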

31 of 50

IDP-KG: Merging Entries Options

Pick canonical source: useful for values like name

Supplement with other entries when values are missing

Concatenate entries from all sources: useful for values like synonym
Keep all values with provenance of source: the Wikidata approach

Concept merging

Keep all values with provenance of source
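
The four merging options can be compared side by side in a small Python sketch. The source names and values are invented placeholders, and the functions illustrate the options rather than reproduce the IDP-KG code.

```python
def pick_canonical(entries, prop, canonical_source):
    """Option 1: take the value from one trusted source."""
    return entries.get(canonical_source, {}).get(prop)


def supplement(entries, prop, source_order):
    """Option 2: use the first source, in priority order, that has a value."""
    for source in source_order:
        value = entries.get(source, {}).get(prop)
        if value is not None:
            return value
    return None


def concatenate(entries, prop):
    """Option 3: collect the distinct values from all sources."""
    values = []
    for record in entries.values():
        for value in record.get(prop, []):
            if value not in values:
                values.append(value)
    return values


def keep_with_provenance(entries, prop):
    """Option 4: keep every value together with the source that stated it."""
    return [
        {"value": record[prop], "source": source}
        for source, record in entries.items()
        if prop in record
    ]


# Placeholder entries for one protein, keyed by (hypothetical) source.
entries = {
    "sourceA": {"name": "Protein X", "synonym": ["X", "PX"]},
    "sourceB": {"name": "Protein X (B)", "synonym": ["X", "X-1"]},
    "sourceC": {"synonym": ["PX"]},
}

print(pick_canonical(entries, "name", "sourceA"))
print(supplement(entries, "name", ["sourceC", "sourceB"]))
print(concatenate(entries, "synonym"))
print(keep_with_provenance(entries, "name"))
```

The last option loses nothing and defers the choice of value to the consumer, at the cost of more data to store and query.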

32 of 50

Data Verification

Process tested with small samples of data with known properties:

Number of proteins per source
Overlap of proteins

Full harvest run and overlaps verified against sources

Query to analyse the number of proteins by dataset groups
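
A check of this kind can be sketched in Python by grouping proteins according to the exact set of sources that contain them. The source names and protein identifiers below are placeholders with known properties, in the spirit of the small test samples mentioned above; the talk's own analysis used a query over the knowledge graph.

```python
from collections import Counter


def proteins_by_source_group(proteins_per_source):
    """Count proteins by the exact combination of sources that contain them."""
    sources_of = {}
    for source, proteins in proteins_per_source.items():
        for protein in proteins:
            sources_of.setdefault(protein, set()).add(source)
    return Counter(frozenset(sources) for sources in sources_of.values())


# Small sample with known overlaps (placeholder identifiers).
sample = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3"},
    "C": {"p3", "p4"},
}

groups = proteins_by_source_group(sample)
for sources, count in groups.items():
    print(sorted(sources), count)
```

Comparing these counts with the numbers reported by each source is a quick way to spot pages that were missed or merged incorrectly.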

33 of 50

IDP-KG Interfaces

SPARQL Endpoint

Annotation count per protein

REST API

34 of 50

Querying IDP-KG

35 of 50

Querying IDP-KG

Count by type

Protein information

Annotation count per protein

Annotation information

Annotation count by term code
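
As an illustration of the simplest of these queries, "count by type", the sketch below gives a generic SPARQL 1.1 formulation as a string and an equivalent computation over in-memory triples in Python. The query is not necessarily the one used by IDP-KG, and the triples and the example.org type IRI are placeholders.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Generic SPARQL 1.1 query for the same count; shown for reference only.
COUNT_BY_TYPE_SPARQL = """
SELECT ?type (COUNT(DISTINCT ?s) AS ?count)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?count)
"""


def count_by_type(triples):
    """Count distinct subjects per rdf:type in (subject, predicate, object) triples."""
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    return Counter(rdf_type for _, rdf_type in typed)


# Placeholder triples; the duplicate statement is counted once.
triples = [
    ("https://example.org/p1", RDF_TYPE, "https://schema.org/Protein"),
    ("https://example.org/p2", RDF_TYPE, "https://schema.org/Protein"),
    ("https://example.org/p1", RDF_TYPE, "https://schema.org/Protein"),
    ("https://example.org/a1", RDF_TYPE, "https://example.org/types/SequenceAnnotation"),
    ("https://example.org/p1", "https://schema.org/name", "Example"),
]

counts = count_by_type(triples)
for rdf_type, count in counts.most_common():
    print(rdf_type, count)
```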

36 of 50

37 of 50

Bioschemas Data Harvest

https://github.com/HW-SWeL/COVID-19

38 of 50

Project 29:

#29_bioschemas

Alban Gaignard, Leyla Jael Garcia Castro, Alasdair J. G. Gray / Online

Pages harvested

413,748

Sites harvested

25 (partially)

https://swel.macs.hw.ac.uk/data/

39 of 50

Global Research Graph: OpenAIRE

No need for custom APIs

Concept merging

Terminology transformation

Markup needs to have:

Sitemap: Currently only 28/73
Link to

Publication
Dataset

40 of 50

EOSC-ENHANCE Bioschemas use case

VALIDATE

REGISTER

Added-value services

Slide credit: Paolo Manghi, ISTI, CNR, Pisa, Italy

41 of 50

Bioschemas

Deploy markup

Include links to publications

Implement sitemap

FAIRsharing

Register on FAIRsharing

Harvesting & Mapping

Harvest data
Map Bioschemas to DataCite

Lorem 4

Harvest data
Align identifiers
Store in KG

42 of 50

Connecting TeSS and bio.tools

https://github.com/bio-tools/biotoolsShim/blob/master/json2rdf/BioSchemas-Tess-Bio.Tools.ipynb

Work by Alban Gaignard (ELIXIR-FR)

2019

43 of 50

Automated Data Curation

description
keywords
name
provider
url

Bioschemas DataCatalog:

alternateName
citation
dateCreated
license
…

44 of 50

Data Exchange: Without an API
MarRef → BioSamples

https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md

Bioschemas Scraper

45 of 50

Rich snippet generation

46 of 50

Summary

47 of 50

Bioschemas: Markup for the Life Sciences

Picture: Carole Goble, Turing Lecture 2018

48 of 50

Bioschemas

What?

Exploiting schema.org to make Life Sciences resources more discoverable

Search engines will index and understand markup

How?

Extending Schema.org vocabulary for life sciences
Guidelines on how to mark up resources
Software to support deployment and consumption

Approach can be followed by other domains!

49 of 50

Future Challenges

Harvesting is compute and time intensive

Both provider and consumer
Solutions around data dumps being discussed

Data quality is very variable

Requires validators

DOI: 10.37044/osf.io/y6gbq

50 of 50

Acknowledgements http://bioschemas.org/people

Join Bioschemas: http://bioschemas.org/howtojoin/

bioschemas.org

@bioschemas

github.com/bioschemas

Bioschemas Community Call

4th Monday of the month

17:00 CET, 25 April 2022

tiny.cc/bs-slack