1 of 40

Biolink Model

Workshop

2 of 40

Overview

  • Introduction to Biolink Model
  • Q&A and Discussion
  • Modeling a test dataset using Biolink Model
  • KG-COVID-19 and Biolink Model
  • Addressing community use-cases
  • Aligning community schemas

3 of 40

Introduction

4 of 40

Ontology

“Ontology is a formal specification of a shared conceptualization”

Tom Gruber

An ontology is a formal specification of the concepts in a particular domain of knowledge. The concepts are arranged in a hierarchy, typically a directed acyclic graph, in which each concept is defined in relation to other concepts in the graph.

5 of 40

Knowledge Graph

  • A knowledge graph has many definitions
  • A broad and inclusive definition:

A knowledge graph (KG) is a graph that represents knowledge where entities are represented as nodes and relationships between these entities are represented as edges.

6 of 40

Knowledge Graph

  • Nodes in a network represent entities (or concepts) and edges represent relationships between entities
  • Graph formalisms:
    • Property graphs
    • RDF graphs
  • Knowledge graphs (KGs) have been around since the 1970s (as semantic networks) or since 2012 (Google Knowledge Graph)
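
As a rough illustration of the two formalisms named above, the same small graph can be held as a property graph (nodes and edges that carry key/value properties) or flattened into RDF-style triples. This is a minimal sketch in plain Python; the `EX:` identifiers, property names, and the `to_triples` helper are illustrative assumptions, not taken from any particular KG or library.

```python
# Property-graph style: nodes and edges are records that carry properties.
# All identifiers and property names below are hypothetical.
nodes = {
    "EX:gene1": {"name": "example gene", "category": "gene"},
    "EX:disease1": {"name": "example disease", "category": "disease"},
}
edges = [
    {
        "subject": "EX:gene1",
        "predicate": "related_to",
        "object": "EX:disease1",
        "provided_by": "example-source",
    },
]


def to_triples(nodes, edges):
    """Flatten a property graph into (subject, predicate, object) triples.

    Edge properties such as provided_by are dropped here; carrying them
    in RDF needs an extra mechanism (for example, reification).
    """
    triples = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            triples.append((node_id, key, value))
    for edge in edges:
        triples.append((edge["subject"], edge["predicate"], edge["object"]))
    return triples


triples = to_triples(nodes, edges)
for triple in triples:
    print(triple)
```

The sketch makes one difference visible: a property graph attaches properties directly to an edge, while a plain triple has no slot for them.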

7 of 40

Knowledge Graph

  • KGs are at the peak
  • Several KGs in industry
  • And a proliferating number of KGs in life sciences

8 of 40

Biomedical KGs

  • Semantic MEDLINE Database
  • Hetionet
  • Wikidata
  • Monarch Initiative
  • SPOKE
  • COVID-19 Graph
  • KG-COVID-19
  • ...

9 of 40

Advantages of KGs

  • Ability to flexibly represent heterogeneous data and knowledge
  • Ontological KGs allow for deductive inference through logical rules
  • KGs are useful for graph-based machine learning
    • Edge prediction
    • Node classification
  • Store of information for downstream applications

10 of 40

Challenges with KGs

  • The formalism used for representing nodes and edges
  • The vocabulary used for representing nodes and edges
    • In the case of nodes: ‘Gene’, ‘gene’, ‘gene locus’, ‘genomic feature’, ‘sequenceFeature’
    • In the case of edges: ‘has_phenotype’, ‘has phenotype’, ‘HAS PHENOTYPE’
  • KGs typically lack schemas and are developed as silos
  • Choice of identifiers
  • Modeling decisions for representing data from various sources
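
The edge-label variants listed above differ only in case and separators, so a simple normalization step can collapse them. The sketch below is a minimal illustration in plain Python; `normalize_label` is a hypothetical helper, not part of any Biolink tool.

```python
import re


def normalize_label(label: str) -> str:
    """Lower-case a label and collapse runs of spaces or hyphens into '_'."""
    return re.sub(r"[\s\-]+", "_", label.strip()).lower()


variants = ["has_phenotype", "has phenotype", "HAS PHENOTYPE"]
normalized = {normalize_label(v) for v in variants}
print(normalized)
```

String normalization only handles spelling differences. Variants such as ‘gene’, ‘gene locus’, and ‘sequenceFeature’ differ in meaning, so reconciling them needs an agreed vocabulary and explicit mappings.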

11 of 40

NCATS Biomedical Data Translator

12 of 40

NCATS Biomedical Data Translator

  • Goal: build a system capable of taking existing biological and biomedical datasets and translating them into insights
  • Multi-site, multi-year project
  • 14 teams spread across the US and the Netherlands
    • Knowledge Providers
    • Autonomous Relay Agents
    • Autonomous Relay System

13 of 40

Overview of the architecture

https://doi.org/10.1111/cts.12591

14 of 40

Overview of the architecture

We as a consortium agree on 3 things:

  • Shared data model
  • Shared specification for exchange
  • Shared set of tools

15 of 40

Biolink Model

16 of 40

Biolink Model

  • A high-level data model for representing biological and biomedical knowledge
  • Bridges multiple vocabularies and ontologies
  • Agnostic to the graph formalism used to represent knowledge
  • The model consists of:
    • Entities
    • Associations
    • Predicates
    • Properties

17 of 40

Entities

  • Nodes in a graph
  • Represent entities found in biological and biomedical knowledge
  • Arranged in a hierarchy
  • Root of all entities is the ‘named thing’ class

Examples: Gene, Protein, Disease, Phenotypic Feature

18 of 40

Entities

Each entity class has

  • its own unique stable URI
  • mappings to other ontologies
  • list of valid ID prefixes

Entity classes are higher-level terms that can be used to categorize nodes in a KG.

For more detailed typing, one can use specific terms from an ontology.
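
To make the hierarchy and the per-class list of valid ID prefixes concrete, here is a minimal sketch in plain Python. The class names come from the slides, but the flattened hierarchy, the prefix lists, and the helpers `ancestors` and `has_valid_prefix` are illustrative assumptions; the real model has more intermediate classes and longer prefix lists.

```python
# Toy slice of an entity hierarchy: class name -> parent (None for the root).
# The shape of this hierarchy is simplified for illustration.
PARENT = {
    "named thing": None,
    "gene": "named thing",
    "disease": "named thing",
}

# Assumed, partial prefix lists; consult the model for the actual values.
VALID_PREFIXES = {
    "gene": ["HGNC", "NCBIGene"],
    "disease": ["MONDO"],
}


def ancestors(cls):
    """Return the chain of parent classes, nearest first, up to the root."""
    chain = []
    parent = PARENT[cls]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def has_valid_prefix(curie, cls):
    """Check whether the prefix of a CURIE (prefix:local_id) is listed for cls."""
    prefix = curie.split(":", 1)[0]
    return prefix in VALID_PREFIXES.get(cls, [])


print(ancestors("gene"))
print(has_valid_prefix("HGNC:1100", "gene"))
print(has_valid_prefix("MONDO:0000001", "gene"))
```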

19 of 40

Associations

  • Edges in a graph
  • Represent assertions or statements
  • A hierarchy of associations
  • The root of all associations is the ‘association’ class.

Examples: GeneToGeneAssociation, GeneToDiseaseAssociation, DiseaseToPhenotypicFeatureAssociation

20 of 40

Associations

  • An association connects a subject node and an object node via a predicate
  • The nature of the association is defined based on its properties
  • An association can have properties like provided_by, evidence, publications
  • Certain associations can have additional properties that are unique, as in the case of DiseaseToPhenotypicFeatureAssociation
    • frequency_qualifier
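
The structure described above can be sketched with Python dataclasses: a base association holds the subject, predicate, object, and shared properties, and a subclass adds its own property. This is a hand-written illustration under stated assumptions, not the classes generated from the model; the field types and all `EX:` values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Association:
    """An edge: a subject node linked to an object node via a predicate."""
    subject: str
    predicate: str
    object: str
    provided_by: Optional[str] = None
    evidence: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


@dataclass
class DiseaseToPhenotypicFeatureAssociation(Association):
    """Adds a property that only makes sense for this kind of association."""
    frequency_qualifier: Optional[str] = None


assoc = DiseaseToPhenotypicFeatureAssociation(
    subject="EX:disease1",
    predicate="has_phenotype",
    object="EX:phenotype1",
    provided_by="example-source",
    frequency_qualifier="EX:frequent",
)
print(assoc)
```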

21 of 40

Predicates

  • High-level relationships
  • Used as the predicate in a statement
  • Have mappings to other ontologies
    • Most commonly to the Relation Ontology (RO)

22 of 40

Biolink Model

  • Common dialect for representing knowledge
  • Bridge across Neo4j and RDF graphs
  • Mapping existing models to the Biolink Model
  • Shared space for discussion on:
    • Evidence
    • Provenance
    • Context
    • Confidence

23 of 40

Curation process

Within Translator,

  • Weekly data modeling calls
  • Following a set of well-defined guidelines for governance
  • Set up weekly Help Desk for users

Broadly, we want to support use cases from the wider community.

24 of 40

Tools to work with Biolink Model

25 of 40

Biolinkml

  • Biolink Modeling Language: https://github.com/biolink/biolinkml
  • A modeling framework
  • YAML as the source of truth
  • Generates:
    • JSON Schema: validation for JSON
    • Python Dataclasses: building Python APIs and writing ETL
    • Java classes: building Java APIs and writing ETL
    • GraphQL: building APIs on top of data stores
    • JSON-LD context: RDF to JSON serialization
    • RDF Turtle: RDF graphs
    • OWL: reasoning
    • Shape Expressions (ShEx): validation of RDF graphs

26 of 40

Biolinkml

  • Polymorphism
    • (mixins/traits + strict inheritance)
    • Class-specific overrides of slots
  • Rich annotation
  • Imports (borrowed from OWL)
  • Formal semantics

[Figure: Core metamodel (simplified subset). Boxes: schema, element, definition, class definition, slot definition. Labeled relations: imports 0..*, has 0..*, is_a 0..1, range 0..1.]
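
The following is one plausible reading of this slide, written as plain Python dataclasses: a schema imports other schemas and holds class and slot definitions, each definition has at most one `is_a` parent, a slot has at most one `range`, and classes can also pull in slots from mixins. It is a sketch under those assumptions, not the BiolinkML metamodel itself; `all_slots` and the example class and slot names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SlotDefinition:
    name: str
    is_a: Optional[str] = None   # is_a 0..1: at most one parent slot
    range: Optional[str] = None  # range 0..1: at most one value type


@dataclass
class ClassDefinition:
    name: str
    is_a: Optional[str] = None   # is_a 0..1: strict single inheritance
    mixins: List[str] = field(default_factory=list)
    slots: List[str] = field(default_factory=list)


@dataclass
class Schema:
    name: str
    imports: List[str] = field(default_factory=list)  # imports 0..*
    classes: Dict[str, ClassDefinition] = field(default_factory=dict)
    slots: Dict[str, SlotDefinition] = field(default_factory=dict)


def all_slots(schema, class_name):
    """Collect slots from a class, its mixins, and its is_a ancestors."""
    collected = []
    current = schema.classes.get(class_name)
    while current is not None:
        sources = [current] + [schema.classes[m] for m in current.mixins]
        for source in sources:
            for slot in source.slots:
                if slot not in collected:
                    collected.append(slot)
        current = schema.classes.get(current.is_a) if current.is_a else None
    return collected


schema = Schema(
    name="example",
    classes={
        "named thing": ClassDefinition("named thing", slots=["id", "name"]),
        "thing with taxon": ClassDefinition("thing with taxon", slots=["in taxon"]),
        "gene": ClassDefinition(
            "gene", is_a="named thing",
            mixins=["thing with taxon"], slots=["symbol"],
        ),
    },
)
print(all_slots(schema, "gene"))
```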

27 of 40

Biolink Model Toolkit

  • A Python API for working with the Biolink Model
  • Provides convenience methods for querying the model
  • https://github.com/biolink/biolink-model-toolkit

28 of 40

Knowledge Graph Exchange

  • https://github.com/biolink/kgx
  • A Python library and set of command line utilities for exchanging Knowledge Graphs (KGs) that conform to or are aligned to the Biolink Model.
  • KGX allows you to work with:
    • SPARQL endpoints or RDF serializations
    • Neo4j endpoints or Neo4j dumps
    • CSV/TSV
    • JSON
    • OWL
    • OBOGraph JSON

29 of 40

Knowledge Graph Exchange

  • Transform KGs from one graph formalism to another
  • Create KGs or subgraphs
  • Merge two or more KGs
  • Validate a KG against the Biolink Model
  • Apply graph operations
    • Graph Merge
    • Clique Merge
    • ID remapping
    • Graph summary
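
Two of the operations listed above, ID remapping and graph merge, can be sketched in plain Python. This does not use the KGX API; `remap_ids`, `merge_graphs`, the dictionary layout, and the `EX:`/`ALT:` identifiers are all assumptions made for illustration.

```python
def remap_ids(nodes, edges, id_map):
    """Rewrite node and edge identifiers using id_map (old id -> new id)."""
    new_nodes = {id_map.get(nid, nid): props for nid, props in nodes.items()}
    new_edges = [
        {
            **edge,
            "subject": id_map.get(edge["subject"], edge["subject"]),
            "object": id_map.get(edge["object"], edge["object"]),
        }
        for edge in edges
    ]
    return new_nodes, new_edges


def merge_graphs(graphs):
    """Union several (nodes, edges) graphs.

    Properties of nodes that share an id are combined; edges with the same
    subject, predicate, and object are kept once.
    """
    merged_nodes, merged_edges, seen = {}, [], set()
    for nodes, edges in graphs:
        for nid, props in nodes.items():
            merged_nodes.setdefault(nid, {}).update(props)
        for edge in edges:
            key = (edge["subject"], edge["predicate"], edge["object"])
            if key not in seen:
                seen.add(key)
                merged_edges.append(edge)
    return merged_nodes, merged_edges


kg_a = (
    {"EX:g1": {"name": "gene one"}, "EX:d1": {"name": "disease one"}},
    [{"subject": "EX:g1", "predicate": "related_to", "object": "EX:d1"}],
)
kg_b = (
    {"ALT:g1": {"category": "gene"}, "EX:d1": {"category": "disease"}},
    [{"subject": "ALT:g1", "predicate": "related_to", "object": "EX:d1"}],
)

# Bring kg_b onto the same identifiers as kg_a, then merge.
kg_b_remapped = remap_ids(*kg_b, id_map={"ALT:g1": "EX:g1"})
merged_nodes, merged_edges = merge_graphs([kg_a, kg_b_remapped])
print(merged_nodes)
print(merged_edges)
```

Remapping before merging is what lets the two graphs share nodes; without it, `ALT:g1` and `EX:g1` would remain separate nodes with duplicate edges.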

30 of 40

Biolink Model in the real world

31 of 40

SRI Reference KG

  • A Biolink Model compliant version of the Monarch Knowledge Graph
  • Built for the NCATS Biomedical Data Translator
  • Contains ontologies like GO, HP, MONDO, UBERON, ChEBI
  • Contains datasets like BioGRID, Reactome, ClinVar, Orphanet, OMIM, STRING
  • Available at scigraph.ncats.io

32 of 40

KG-COVID-19

  • A framework for building a KG with datasets relevant for COVID-19
  • Part of an even larger project called the National Virtual Biotechnology Laboratory
  • First instance of the Knowledge Graph Hub
  • https://github.com/Knowledge-Graph-Hub/kg-covid-19

33 of 40

Illuminating the Druggable Genome

  • Build a KG as a substrate for machine learning with the goal of link prediction
  • Uses Biolink Model for representing various curated datasets and databases like DrugCentral

34 of 40

Bringing it all together

35 of 40

Knowledge Graph Hub

  • The goal of KG Hub is to serve as a collective resource that simplifies the process of generating biological and biomedical KGs, thereby lowering the barrier to entry for new participants
  • In a KG Hub, each independent effort for building a KG is an instance of the KG Hub
  • Each KG follows a set of design principles, one of which is the use of Biolink Model as a schema

36 of 40

Knowledge Graph Hub

  • Also provides recommendations on a set of design patterns to use for building and exchanging KGs
  • More information about KG Hub can be found at knowledge-graph-hub.github.io
  • Instances of KG Hub can be found at https://github.com/knowledge-Graph-Hub

37 of 40

Biolink Model - Future directions

  • The model itself is constantly being refined
  • Future additions include:
    • Variants and genotypes (GENO)
    • Evidence, Provenance, Context, and Confidence (SEPIO)
    • Genomic locations (FALDO)
  • Align to other controlled vocabularies, schemas, and ontologies

38 of 40

Contributing to the Biolink Model

  • GitHub: https://github.com/biolink/biolink-model/
    • Issues for requesting improvements to the model
    • Pull Requests for contributing to the model
  • Documentation: https://biolink.github.io/biolink-model/
  • Biolink Model on Gitter

39 of 40

Acknowledgement

  • Members from NCATS Biomedical Data Translator Consortium
  • Members from Monarch Initiative
  • Members of BBOP

Funding

  • NCATS Biomedical Data Translator grant 1OT2TR003449-01

  • Chris Mungall (LBNL)
  • Harold Solbrig (Johns Hopkins University)
  • Anne Thessen (Oregon State University)
  • Matt Brush (Oregon Health & Science University)
  • Richard Bruskiewich (STAR Informatics)
  • Justin Reese (LBNL)
  • Jim Balhoff (RENCI)
  • Melissa Haendel (Oregon State University)
  • Kent Shefchek (Oregon State University)
  • Michel Dumontier (Maastricht University)
  • Vincent Emonet (Maastricht University)
  • Stephen Ramsey (Oregon State University)
  • Andrew Su (Scripps Research Institute)
  • Karamarie Fecho (RENCI)
  • Chris Bizon (RENCI)
  • Steven Cox (RENCI)
  • Vlado Dancik (Broad Institute)

40 of 40

Q&A