Building CoronaWhy Knowledge Graph
FREYA Guest webinar
Slava Tykhonov
Senior Information Scientist
(DANS-KNAW, the Netherlands)
01.09.2020
About me: DANS-KNAW projects (2016-2020)
Source: LinkedIn
Motivation
7 weeks in lockdown in Spain
About CoronaWhy
1300+ people registered in the organization, more than 300 actively contributing!
COVID-19 Open Research Dataset Challenge (CORD-19)
It all started with this (March 2020):
“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 52,000 scholarly articles, including 41,000 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)
Motivation of CoronaWhy community members
Credits: Andre Ye
CoronaWhy Funding
Initial: $5k from Google on GCP and $4k from Amazon on AWS (April 2020)
Donations: $9k and £15k to sustain the CoronaWhy infrastructure
CoronaWhy Community Tasks (March-April)
CORD-19 affiliations recognized with Deep Learning
Collaboration with other organizations
We’ve got almost endless data streams...
Looking for Commons
Mercè Crosas, “Harvard Data Commons”
Building a horizontal platform to serve vertical teams
Source: CoronaWhy infrastructure introduction
Standing on the Shoulders of Giants: infrastructure
Standing on the Shoulders of Giants: Big Data of the Past
Dataverse as data integration point
Datasets from CoronaWhy vertical teams
Source: CoronaWhy Dataverse
COVID-19 data file verification
We verify every file by importing its contents into a dataframe (see the sketch below).
All column names (variables) extracted from tabular data are available as labels in the file metadata.
We’ve enabled Dataverse data previewers to browse the content of files without downloading them!
We’re starting internal challenges to build ML models for metadata classification.
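A minimal sketch of that verification step, assuming pandas and a CSV input (the file name is hypothetical):

import pandas as pd

# Import the file contents into a dataframe; a parse failure flags the file.
df = pd.read_csv("harvested/covid19_cases.csv")

# Column names (variables) become labels in the file metadata.
labels = list(df.columns)
print(labels)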
Dataverse content in Jupyter notebooks
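For example, CoronaWhy Dataverse content can be pulled straight into a notebook with pyDataverse (a sketch; the base URL, DOI, and file index are illustrative):

from pyDataverse.api import NativeApi, DataAccessApi

base_url = "https://dataverse.coronawhy.org"  # illustrative
api = NativeApi(base_url)
data_api = DataAccessApi(base_url)

# Fetch dataset metadata by its persistent identifier (hypothetical DOI).
dataset = api.get_dataset("doi:10.5072/FK2/EXAMPLE")

# Download the first data file listed in the latest version.
files = dataset.json()["data"]["latestVersion"]["files"]
file_id = files[0]["dataFile"]["id"]
content = data_api.get_datafile(file_id).content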
COVID-19 Data Crowdsourcing
The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.
We approach GitHub repository owners by creating issues in their repos and inviting them to help us.
More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.
Bottom-up data collection works!
Challenge of data integration and various ontologies
CORD-19 collection workflows with NLP pipeline:
Dataverse Data Lake streaming COVID-19 datasets from various sources:
The importance of standards and ontologies
Generic controlled vocabularies to link metadata in the bibliographic collections are well known: ORCID, GRID, GeoNames, Getty.
Medical knowledge graphs powered by:
Integration based on metadata standards:
Biological Expression Language (BEL)
BEL was integrated into the CoronaWhy infrastructure in April 2020. A BEL statement encodes a biological relation in a compact, computable form, for example: p(HGNC:IL6) increases bp(GO:"inflammatory response").
Statements extraction with INDRA
Source: EMMAA (Ecosystem of Machine-maintained Models with Automated Assembly)
“INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system, originally developed for molecular systems biology and currently being generalized to other domains.”
Developed as part of the Harvard Program in Therapeutic Science and the Laboratory of Systems Pharmacology at Harvard Medical School.
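A minimal sketch of statement extraction with INDRA, assuming the TRIPS reader web service is reachable (the input sentence is made up):

from indra.sources import trips

# Read one sentence and assemble INDRA Statements from it.
tp = trips.process_text("BRAF phosphorylates MAP2K1.")
for stmt in tp.statements:
    print(stmt)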
Knowledge Graph curation in INDRA
Advanced Text Analysis (NLP pipeline)
Source: D.Shlepakov, How to Build Knowledge Graph, notebook
We need a good understanding of all domain-specific texts to make the right statements:
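For instance, biomedical entities can be extracted from CORD-19 sentences with spaCy and a scispaCy model (a sketch; assumes the en_core_sci_sm model is installed, and the sentence is made up):

import spacy

nlp = spacy.load("en_core_sci_sm")  # scispaCy biomedical model
doc = nlp("ACE2 is the entry receptor for SARS-CoV-2 in human lung cells.")
for ent in doc.ents:
    print(ent.text, ent.label_)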
Building domain specific knowledge graphs
The ultimate goal is to automate the process of knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning.
Visual graph of COVID-19 dataset
Source: CoronaWhy GraphDB
SPARQL endpoint for CoronaWhy KG
Source: YASGUI
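The endpoint can also be queried from Python with SPARQLWrapper (a sketch; the endpoint URL is illustrative):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://graphdb.coronawhy.org/sparql")  # illustrative
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])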
Do you know many people who can use SPARQL?
Source: Semantic annotation of the Laboratory Chemical Safety Summary in PubChem
CLARIAH conclusions
“By developing these decentralised, yet controlled Knowledge Graph development practices we have contributed to increasing interoperability in the humanities and enabling new research opportunities to a wide range of scholars. However, we observe that users without Semantic Web knowledge find these technologies hard to use, and place high value in end-user tools that enable engagement. Therefore, for the future we emphasise the importance of tools to specifically target the goals of concrete communities – in our case, the analytical and quantitative answering of humanities research questions for humanities scholars. In this sense, usability is not just important in a tool context; in our view, we need to empower users in deciding under what models these tools operate.” (CLARIAH: Enabling Interoperability Between Humanities Disciplines with Ontologies)
Chicken-and-egg problem: users build tools without data models and ontologies, but in reality they need to build a knowledge graph with common ontologies first!
Linked Data integration challenges
CoronaWhy has too many information streams, which seem impossible to integrate and give back to COVID-19 researchers. So, do we have a solution?
Bibliographic Framework (BIBFRAME) as a Web of Data
“The Library of Congress officially launched its Bibliographic Framework Initiative in May 2011. The Initiative aims to re-envision and, in the long run, implement a new bibliographic environment for libraries that makes "the network" central and makes interconnectedness commonplace.”
“Instead of thousands of catalogers repeatedly describing the same resources, the effort of one cataloger could be shared with many.” (Source)
In 2019 BIBFRAME 2.0, the Library of Congress Pilot, was announced.
Let’s take a journey and move from a domain-specific ontology to a bibliographic one!
BIBFRAME 2.0 concepts
Source: the Library of Congress
MARC as a foundation of the structured Data Hub
Source: the Library of Congress, USA
MARC standard was developed in the 1960s to create records that could be read by computers and shared among libraries. The term MARC is an abbreviation for MAchine Readable Cataloging.
The MARC 21 bibliographic format was created for the international community. It’s very rich, with more than 2,000 data elements defined!
It’s identified by its ISO (International Organization for Standardization) number: ISO 2709.
How do we integrate data into the common KG?
COVID-19 paper in MARC 21 representation
Structure of the bibliographic record:
Source: CoronaWhy CORD-19 portal
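To give a feel for the format, here is a sketch of building such a record with the pymarc library (assuming the pymarc 4.x-style API; all field values are made up):

from pymarc import Record, Field

record = Record()
record.add_field(
    Field(tag="100", indicators=["1", " "], subfields=["a", "Doe, Jane"]),
    Field(tag="245", indicators=["1", "0"],
          subfields=["a", "A hypothetical COVID-19 paper title"]),
    Field(tag="856", indicators=["4", "0"],
          subfields=["u", "https://example.org/cord19/paper"]),
)
print(record)            # human-readable view of the record
data = record.as_marc()  # binary ISO 2709 serialization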
VuFind discovery tool for libraries powered by MARC 21
Source: CoronaWhy VuFind
Landing page of publications from CORD-19
Source: CoronaWhy VuFind
CORD-19 collection in BIBFRAME 2.0
CoronaWhy Graph published as RDF
Source: CoronaWhy Dataverse
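A sketch of loading such an RDF export with rdflib (the file name and serialization are assumptions):

from rdflib import Graph

g = Graph()
g.parse("coronawhy-graph.ttl", format="turtle")  # hypothetical export file
print(len(g), "triples loaded")
for s, p, o in list(g)[:5]:  # peek at a few triples
    print(s, p, o)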
Human-in-the-Loop for Machine Learning
“Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination.” (attributed to Albert Einstein)
“A combination of AI and Human Intelligence gives rise to an extremely high level of accuracy and intelligence (Super Intelligence)”
Source: Hackernoon.com
CLARIAH and Network Digital Heritage (NDE) GraphiQL
CoronaWhy runs its own instance of NDE on a Kubernetes cluster and maintains support for additional ontologies (MeSH, ICD, NDF) available via their SPARQL endpoints!
If you want to query the API:
curl "http://nde.dataverse-dev.coronawhy.org/nde/graphql?query=%20%7B%20terms(match%3A%22COVID%22%2Cdataset%3A%5B%22wikidata%22%5D)%20%7B%20dataset%20terms%20%7Buri%2C%20altLabel%7D%20%7D%20%7D"
Increasing Dataverse metadata interoperability
Support for external controlled vocabularies contributed by the SSHOC project (data infrastructure for the EOSC)
Hypothes.is annotations as a peer review service
Doccano annotation with Machine Learning
Building an Operating System for Open Science
The CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks like cancer research.
All services are built from Open Source components.
Data is processed and published in a FAIR way; provenance information is part of our Data Lake.
Data evaluation and credibility are the top priority; we provide tools for the expert community to verify our datasets.
The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research.
CoronaWhy Common Research and Data Infrastructure
Data preprocessing pipeline implemented in Jupyter notebooks on Docker with extra modules.
Dataverse as a data repository to store data from automatic and curated workflows.
Elasticsearch holds CORD-19 indexes at the section and sentence level with spaCy enrichments (see the sketch after this list). Other indexes: MeSH, GeoNames, GRID.
Hypothes.is and Doccano annotation services to annotate publications.
Virtuoso and GraphDB with public SPARQL endpoints to query the COVID-19 Knowledge Graph.
Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula
https://github.com/CoronaWhy/covid-19-infrastructure
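A sketch of querying the sentence-level CORD-19 index with the official Python client (assumes an Elasticsearch 8.x client; the host, index, and field names are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative host
resp = es.search(index="cord19-sentences", size=5,
                 query={"match": {"sentence": "ACE2 receptor"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("sentence"))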
Source: CoronaWhy API, built on the FastAPI framework for Python
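For flavour, a minimal FastAPI service in the spirit of the CoronaWhy API (the route and response shape are hypothetical):

from fastapi import FastAPI

app = FastAPI(title="CoronaWhy API sketch")

@app.get("/cord19/search")
def search(q: str, limit: int = 10):
    # The real service would query the CORD-19 indexes here.
    return {"query": q, "limit": limit, "results": []}

# Run with: uvicorn main:app --reload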