1 of 51

Building CoronaWhy Knowledge Graph

FREYA Guest webinar

Slava Tykhonov

Senior Information Scientist

(DANS-KNAW, the Netherlands)

01.09.2020

2 of 51

About me: DANS-KNAW projects (2016-2020)

  • CLARIAH+ (ongoing)
  • EOSC Synergy (ongoing)
  • SSHOC Dataverse (ongoing)
  • CESSDA DataverseEU 2018
  • Time Machine Europe Supervisor at DANS-KNAW
  • PARTHENOS Horizon 2020
  • CESSDA PID (Persistent Identifiers) Horizon 2020
  • CLARIAH
  • RDA (Research Data Alliance) PITTS Horizon 2020
  • CESSDA SaW H2020-EU.1.4.1.1 Horizon 2020


Source: LinkedIn

3 of 51

Motivation


4 of 51

7 weeks in lockdown in Spain


5 of 51

About CoronaWhy


1300+ people registered in the organization, more than 300 actively contributing!

6 of 51

COVID-19 Open Research Dataset Challenge (CORD-19)

It all started with this (March 2020):

“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 52,000 scholarly articles, including 41,000 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)


7 of 51

Motivation of CoronaWhy community members


Credits: Andre Ye

8 of 51

CoronaWhy Funding

Initial: $5k from Google (GCP) and $4k from Amazon (AWS), April 2020

Donations: $9k and £15k to sustain the CoronaWhy infrastructure


9 of 51

CoronaWhy Community Tasks (March-April)

  1. Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or the survival outcome of the infection
  2. Task-Ties explores transmission, incubation and environmental stability
  3. Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
  4. COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review
  5. Named Entity Recognition across the entire corpus of CORD-19 papers with full text


10 of 51

CORD-19 affiliations recognized with Deep Learning


11 of 51

Collaboration with other organizations

  • Harvard Medical School, INDRA integration
  • Helix Group, Stanford University
  • NASA JPL, COVID-19 knowledge graph and GeoParser
  • Kaggle, coronamed application
  • Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, knowledge graph
  • dcyphr, a platform for creating and engaging with distillations of academic articles
  • CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies)

We’ve got almost endless data streams...


12 of 51

Looking for Commons

Mercè Crosas, “Harvard Data Commons”


13 of 51

Building a horizontal platform to serve vertical teams


Source: CoronaWhy infrastructure introduction

14 of 51

Turning FAIR into reality!


DANS-KNAW is one of the worldwide leaders in FAIR data (FAIRsFAIR)

15 of 51

Standing on the Shoulders of Giants: infrastructure


16 of 51

Standing on the Shoulders of Giants: Big Data of the Past


17 of 51

Dataverse as data integration point


  • Available as a service for the community since April 2020
  • Used by CoronaWhy vertical teams for data exchange and sharing
  • Intended to help researchers make their data FAIR
  • One of the biggest COVID-19 data archives in the world, with 700k files
  • New teams get their own data containers and can reuse data collected and produced by others

http://datasets.coronawhy.org
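
As a minimal sketch of how a team might pull data from this integration point, the snippet below queries the standard Dataverse Search API that every Dataverse installation exposes; the search term and page size are only illustrative.

  import requests

  BASE = "http://datasets.coronawhy.org"  # CoronaWhy Dataverse from the slide above

  # Standard Dataverse Search API; no API token is needed for public, published datasets.
  r = requests.get(f"{BASE}/api/search",
                   params={"q": "covid", "type": "dataset", "per_page": 5})
  r.raise_for_status()
  for item in r.json()["data"]["items"]:
      print(item["global_id"], "-", item["name"])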

18 of 51

Datasets from CoronaWhy vertical teams


19 of 51

COVID-19 data files verification


We verify every file by importing its contents into a dataframe.

All column names (variables) extracted from tabular data are available as labels in the file metadata.

We've enabled Dataverse data previewers to browse through the content of files without downloading them!

We're starting internal challenges to build ML models for metadata classification.
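
A minimal sketch of this verification step, assuming a tabular file that Dataverse has ingested (such files are served as tab-delimited by default); the file id is hypothetical.

  import io
  import requests
  import pandas as pd

  BASE = "http://datasets.coronawhy.org"
  FILE_ID = 12345  # hypothetical database id of a tabular file

  r = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}")
  r.raise_for_status()
  df = pd.read_csv(io.StringIO(r.text), sep="\t")

  # The column names become candidate variable labels for the file-level metadata.
  print(f"{len(df)} rows; variables: {list(df.columns)}")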

20 of 51

Dataverse content in Jupyter notebooks


21 of 51

COVID-19 Data Crowdsourcing

The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.

We approach GitHub repository owners by creating issues in their repos and inviting them to help us.

More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.

Bottom-up data collection works!


22 of 51

Challenge of data integration and various ontologies

CORD-19 collection workflows with the NLP pipeline:

  • manual annotation and labelling of COVID-19-related papers
  • automatic entity extraction and classification of text fragments (see the sketch below)
  • statement extraction and curation
  • linking papers to specific research questions via relationship extraction

Dataverse Data Lake streaming COVID-19 datasets from various sources:

  • medical data
  • socio-economic data
  • political data and statistics
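
As an illustration of the automatic entity extraction step, here is a minimal spaCy sketch; the deck does not specify a model, so the small general-purpose English model is used here, and a biomedical model such as scispacy's en_core_sci_sm could be swapped in for CORD-19 text.

  import spacy

  # Requires: python -m spacy download en_core_web_sm
  nlp = spacy.load("en_core_web_sm")

  text = ("The incubation period of SARS-CoV-2 is estimated at five to six days, "
          "and environmental stability varies with temperature and humidity.")
  for ent in nlp(text).ents:
      print(ent.text, "->", ent.label_)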


23 of 51

The importance of standards and ontologies

Generic controlled vocabularies for linking metadata in bibliographic collections are well known: ORCID, GRID, GeoNames, Getty.

Medical knowledge graphs powered by:

  • Biological Expression Language (BEL)
  • Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH)
  • Wikidata (Open ontology) - Wikipedia

Integration based on metadata standards:

  • MARC21, Dublin Core (DC), Data Documentation Initiative (DDI)


24 of 51

Biological Expression Language (BEL)


BEL was integrated into the CoronaWhy infrastructure in April 2020

25 of 51

Statements extraction with INDRA


Source: EMMAA (Ecosystem of Machine-maintained Models with Automated Assembly)

“INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system, originally developed for molecular systems biology and currently being generalized to other domains.”

Developed as a part of Harvard Program in Therapeutic Science and the Laboratory of Systems Pharmacology at Harvard Medical School.

http://indra.bio

26 of 51

Knowledge Graph curation in INDRA


27 of 51

Advanced Text Analysis (NLP pipeline)


We need a good understanding of all domain-specific texts to make the right statements:

28 of 51

Building domain specific knowledge graphs


  • We're collecting all possible COVID-19 data and archiving it in our Dataverse
  • Looking for various related controlled vocabularies and ontologies
  • Building and reusing conversion pipelines to get all data values linked in RDF format (see the sketch below)

The ultimate goal is to automate the process of knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning.
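
A minimal sketch of such a conversion pipeline using rdflib; the namespace, property names and the observation value are hypothetical, while Q2807 is the Wikidata identifier for Madrid.

  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import RDF, RDFS

  CW = Namespace("http://data.coronawhy.org/resource/")   # hypothetical namespace
  WD = Namespace("http://www.wikidata.org/entity/")

  g = Graph()
  g.bind("cw", CW)

  obs = CW["observation/1"]
  g.add((obs, RDF.type, CW.Observation))                  # hypothetical class
  g.add((obs, RDFS.label, Literal("ICU admissions, Madrid, 2020-04-01")))
  g.add((obs, CW.location, WD.Q2807))                     # link the data value to Wikidata (Madrid)
  g.add((obs, CW.value, Literal(128)))                    # hypothetical value

  g.serialize(destination="coronawhy_sample.ttl", format="turtle")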

29 of 51

Visual graph of COVID-19 dataset


30 of 51

SPARQL endpoint for CoronaWhy KG


Source: YASGUI
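
For teams that prefer scripting over YASGUI, the same endpoint can be queried from Python with SPARQLWrapper; the endpoint URL below is hypothetical, since the deck only states that Virtuoso and GraphDB expose public SPARQL endpoints.

  from SPARQLWrapper import SPARQLWrapper, JSON

  sparql = SPARQLWrapper("http://sparql.coronawhy.org/sparql")  # hypothetical endpoint URL
  sparql.setQuery("""
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT ?s ?label WHERE {
          ?s rdfs:label ?label .
          FILTER(CONTAINS(LCASE(STR(?label)), "covid"))
      } LIMIT 10
  """)
  sparql.setReturnFormat(JSON)
  for row in sparql.query().convert()["results"]["bindings"]:
      print(row["s"]["value"], row["label"]["value"])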

31 of 51

Do you know many people who can use SPARQL?


Source: Semantic annotation of the Laboratory Chemical Safety Summary in PubChem

32 of 51

CLARIAH conclusions

“By developing these decentralised, yet controlled Knowledge Graph development practices we have contributed to increasing interoperability in the humanities and enabling new research opportunities to a wide range of scholars. However, we observe that users without Semantic Web knowledge find these technologies hard to use, and place high value in end-user tools that enable engagement. Therefore, for the future we emphasise the importance of tools to specifically target the goals of concrete communities – in our case, the analytical and quantitative answering of humanities research questions for humanities scholars. In this sense, usability is not just important in a tool context; in our view, we need to empower users in deciding under what models these tools operate.” (CLARIAH: Enabling Interoperability Between Humanities Disciplines with Ontologies)

Chicken-and-egg problem: users are building tools without data models and ontologies, but in reality they need to build a knowledge graph with common ontologies first!


33 of 51

Linked Data integration challenges

  • datasets are very heterogeneous and multilingual
  • data usually lacks sufficient quality control
  • data providers use different modelling schemas and styles
  • linked data cleansing and versioning is very difficult to track and maintain properly, and web resources aren't persistent
  • even modern data repositories provide only metadata records describing the data, without giving access to the individual data items stored in files
  • it is difficult to assign and manually keep entity relationships in a knowledge graph up to date

CoronaWhy has too many information streams, which seem impossible to integrate and give back to COVID-19 researchers. So, do we have a solution?


34 of 51

Bibliographic Framework (BIBFRAME) as a Web of Data

“The Library of Congress officially launched its Bibliographic Framework Initiative in May 2011. The Initiative aims to re-envision and, in the long run, implement a new bibliographic environment for libraries that makes "the network" central and makes interconnectedness commonplace.”

“Instead of thousands of catalogers repeatedly describing the same resources, the effort of one cataloger could be shared with many.” (Source)

In 2019 the Library of Congress BIBFRAME 2.0 Pilot was announced.

Let's take a journey and move from a domain-specific ontology to a bibliographic one!


35 of 51

BIBFRAME 2.0 concepts


  • Work. The highest level of abstraction; a Work, in the BIBFRAME context, reflects the conceptual essence of the cataloged resource: authors, languages, and what it is about (subjects).
  • Instance. A Work may have one or more individual, material embodiments, for example a particular published form. These are Instances of the Work. An Instance reflects information such as its publisher, place and date of publication, and format.
  • Item. An Item is an actual copy (physical or electronic) of an Instance. It reflects information such as its location (physical or virtual), shelf mark, and barcode.
  • Agents. Agents are people, organizations, jurisdictions, etc., associated with a Work or Instance through roles such as author, editor, artist, photographer, composer, illustrator, etc.
  • Subjects. A Work might be “about” one or more concepts. Such a concept is said to be a “subject” of the Work. Concepts that may be subjects include topics, places, temporal expressions, events, works, instances, items, agents, etc.
  • Events. Occurrences, the recording of which may be the content of a Work.
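
To make the Work/Instance/Item chain concrete, here is a minimal rdflib sketch using the BIBFRAME 2.0 namespace; the example.org identifiers are hypothetical, and the MeSH URI is the descriptor for COVID-19.

  from rdflib import Graph, Literal, Namespace, URIRef
  from rdflib.namespace import RDF, RDFS

  BF = Namespace("http://id.loc.gov/ontologies/bibframe/")
  EX = Namespace("http://example.org/cord19/")              # hypothetical identifiers

  g = Graph()
  g.bind("bf", BF)

  work, instance, item = EX.work1, EX.instance1, EX.item1
  g.add((work, RDF.type, BF.Work))
  g.add((work, RDFS.label, Literal("A CORD-19 paper as a conceptual Work")))
  g.add((work, BF.subject, URIRef("http://id.nlm.nih.gov/mesh/D000086382")))  # MeSH: COVID-19

  g.add((instance, RDF.type, BF.Instance))
  g.add((instance, BF.instanceOf, work))                    # the published embodiment of the Work

  g.add((item, RDF.type, BF.Item))
  g.add((item, BF.itemOf, instance))                        # a concrete (here: electronic) copy

  print(g.serialize(format="turtle"))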

36 of 51

MARC as a foundation of the structured Data Hub


The MARC standard was developed in the 1960s to create records that could be read by computers and shared among libraries. The term MARC is an abbreviation for MAchine-Readable Cataloging.

The MARC 21 bibliographic format was created for the international community. It’s very rich, with more than 2,000 data elements defined!

It's identified by its ISO (International Organization for Standardization) number: ISO 2709.

37 of 51

How to integrate data in the common KG?

  • Use MARC 21 as the basis for all bibliographic and authority records
  • All controlled vocabularies should be expressed in the MARC 21 Format for Authority Data; we need to build an authority linking process with a “human in the loop” approach that allows AI-predicted links to be verified.
  • Different MARC 21 fields can be linked to different ontologies and even interlinked. For example, we can have some entities linked to both MeSH and Wikidata in the same bibliographic record to increase the interoperability of the Knowledge Graph.
  • Every CORD-19 paper can get metadata enrichment provided by any research team working on the NLP extraction of entities or relations, or on linking controlled vocabularies together.


38 of 51

COVID-19 paper in MARC 21 representation


Structure of the bibliographic record:

  • authority records contain information about authors and affiliations
  • medical entities extracted by the NLP pipeline are interlinked in 650x fields
  • some metadata fields are generated and filled by Machine Learning models, others are contributed by human experts
  • provenance information is kept in 833x fields, indicating fully or partially machine-generated records
  • relations between entities are stored in 730x fields
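
A minimal pymarc sketch of such an enriched record (pymarc 4.x subfield syntax; newer pymarc versions use Subfield objects); the title is hypothetical, and the MeSH link in 650 $0 shows how an NLP-extracted entity could be interlinked.

  from pymarc import Record, Field

  record = Record()
  record.add_field(Field(tag="245", indicators=["0", "0"],
                         subfields=["a", "Clinical features of hospitalised COVID-19 patients"]))  # hypothetical title
  # 650 with second indicator 2 = Medical Subject Headings; $0 carries the authority URI.
  record.add_field(Field(tag="650", indicators=[" ", "2"],
                         subfields=["a", "Coronavirus Infections",
                                    "0", "http://id.nlm.nih.gov/mesh/D018352"]))
  print(record)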

39 of 51

VuFind discovery tool for libraries powered by MARC 21


40 of 51

Landing page of publications from CORD-19


41 of 51

CORD-19 collection in BIBFRAME 2.0


42 of 51

CoronaWhy Graph published as RDF


43 of 51

Human-in-the-Loop for Machine Learning

“Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination.” (attributed to Albert Einstein)

“A combination of AI and Human Intelligence gives rise to an extremely high level of accuracy and intelligence (Super Intelligence)”


44 of 51

CLARIAH and Network Digital Heritage (NDE) GraphiQL


CoronaWhy runs its own instance of NDE on a Kubernetes cluster and maintains support for additional ontologies (MeSH, ICD, NDF) available via their SPARQL endpoints!

If you want to query the API:

curl "http://nde.dataverse-dev.coronawhy.org/nde/graphql?query=%20%7B%20terms(match%3A%22COVID%22%2Cdataset%3A%5B%22wikidata%22%5D)%20%7B%20dataset%20terms%20%7Buri%2C%20altLabel%7D%20%7D%20%7D"

45 of 51

Increasing Dataverse metadata interoperability


Support for external controlled vocabularies contributed by the SSHOC project (data infrastructure for the EOSC)

46 of 51

Hypothes.is annotations as a peer review service


  1. The AI pipeline does domain-specific entity extraction and ranking of relevant CORD-19 papers.
  2. Automatically extracted entities and statements will be added, and important fragments should be highlighted.
  3. Human annotators should verify the results and validate all statements.
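
A minimal sketch of how such annotations could be read back through the public Hypothes.is search API; the paper URL is hypothetical, and an API token is only needed for private groups.

  import requests

  paper_url = "https://example.org/cord19/paper-123"        # hypothetical annotated paper URL
  r = requests.get("https://api.hypothes.is/api/search",
                   params={"uri": paper_url, "limit": 20})
  r.raise_for_status()
  for row in r.json()["rows"]:
      print(row["user"], "-", row["text"][:80])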

47 of 51

Doccano annotation with Machine Learning

48 of 51

Building an Operating System for Open Science


The CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks, such as cancer research.

All services are built from Open Source components.

Data is processed and published in a FAIR way, and provenance information is part of our Data Lake.

Data evaluation and credibility are the top priority; we provide tools for the expert community to verify our datasets.

The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research.

49 of 51

CoronaWhy Common Research and Data Infrastructure

Data preprocessing pipeline implemented in Jupyter notebooks running in Docker with extra modules.

Dataverse as the data repository to store data from automatic and curated workflows.

Elasticsearch holds CORD-19 indexes at section and sentence level with spaCy enrichments. Other indexes: MeSH, GeoNames, GRID.

Hypothes.is and Doccano annotation services to annotate publications.

Virtuoso and GraphDB with public SPARQL endpoints to query the COVID-19 Knowledge Graph.

Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula.

https://github.com/CoronaWhy/covid-19-infrastructure
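
As a sketch of how the sentence-level CORD-19 index might be queried; the host, index name and field name below are hypothetical, since the deck does not publish them.

  import requests

  ES = "http://elastic.coronawhy.org:9200"                  # hypothetical host
  INDEX = "cord19-sentences"                                # hypothetical sentence-level index

  query = {"query": {"match": {"sentence": "incubation period"}}, "size": 3}
  r = requests.post(f"{ES}/{INDEX}/_search", json=query)
  r.raise_for_status()
  for hit in r.json()["hits"]["hits"]:
      print(hit["_score"], hit["_source"].get("sentence", "")[:80])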


50 of 51


Source: CoronaWhy API built on FastAPI framework for Python
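
The endpoint below is a minimal FastAPI sketch of the pattern shown on this slide, not the actual CoronaWhy API; the route and parameters are hypothetical.

  from fastapi import FastAPI

  app = FastAPI(title="CoronaWhy API (illustrative sketch)")

  @app.get("/search")
  def search(q: str, limit: int = 10):
      # A real endpoint would query the Elasticsearch/Dataverse backends;
      # here we only echo the parameters to show the FastAPI pattern.
      return {"query": q, "limit": limit, "results": []}

  # Run locally with: uvicorn main:app --reload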

51 of 51

Thank you! Questions?

@Slava Tykhonov on CoronaWhy Slack

vyacheslav.tykhonov@dans.knaw.nl

www.coronawhy.org

www.dans.knaw.nl/en
