1 of 51

Building CoronaWhy Knowledge Graph

FREYA Guest webinar

Slava Tykhonov

Senior Information Scientist

(DANS-KNAW, the Netherlands)

01.09.2020

2 of 51

About me: DANS-KNAW projects (2016-2020)

  • CLARIAH+ (ongoing)
  • EOSC Synergy (ongoing)
  • SSHOC Dataverse (ongoing)
  • CESSDA DataverseEU 2018
  • Time Machine Europe Supervisor at DANS-KNAW
  • PARTHENOS Horizon 2020
  • CESSDA PID (Persistent Identifiers) Horizon 2020
  • CLARIAH
  • RDA (Research Data Alliance) PITTS Horizon 2020
  • CESSDA SaW H2020-EU.1.4.1.1 Horizon 2020


Source: LinkedIn

3 of 51

Motivation


4 of 51

7 weeks in lockdown in Spain


5 of 51

About CoronaWhy


1300+ people registered in the organization, more than 300 actively contributing!

6 of 51

COVID-19 Open Research Dataset Challenge (CORD-19)

It all started with this (March 2020):

“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 52,000 scholarly articles, including 41,000 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)


7 of 51

Motivation of CoronaWhy community members


Credits: Andre Ye

8 of 51

CoronaWhy Funding

Initial: $5k from Google (GCP) and $4k from Amazon (AWS), April 2020

Donations: $9k and £15k to sustain the CoronaWhy infrastructure


9 of 51

CoronaWhy Community Tasks (March-April)

  1. Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or the survival outcome of the infection
  2. Task-Ties explores transmission, incubation and environmental stability
  3. Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
  4. COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review
  5. Named Entity Recognition across the entire corpus of CORD-19 papers with full text


10 of 51

CORD-19 affiliations recognized with Deep Learning


11 of 51

Collaboration with other organizations

  • Harvard Medical School, INDRA integration
  • Helix Group, Stanford University
  • NASA JPL, COVID-19 knowledge graph and GeoParser
  • Kaggle, coronamed application
  • Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, knowledge graph
  • dcyphr, a platform for creating and engaging with distillations of academic articles
  • CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies)

We’ve got almost endless data streams...


12 of 51

Looking for Commons

Mercè Crosas, “Harvard Data Commons”


13 of 51

Building a horizontal platform to serve vertical teams


Source: CoronaWhy infrastructure introduction

14 of 51

Turning FAIR into reality!


DANS-KNAW is one of the worldwide leaders in FAIR data (FAIRsFAIR)

15 of 51

Standing on the Shoulders of Giants: infrastructure


16 of 51

Standing on the Shoulders of Giants: Big Data of the Past


17 of 51

Dataverse as data integration point


  • Available as a service for the community since April 2020
  • Used by CoronaWhy vertical teams for data exchange and sharing
  • Intended to help researchers make their data FAIR
  • One of the biggest COVID-19 data archives in the world, with 700k files
  • New teams get their own data containers and can reuse data collected and produced by others

http://datasets.coronawhy.org
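
As a minimal sketch of how a team might pull data from this integration point, the snippet below queries the standard Dataverse Search API that every Dataverse installation exposes; the search term and page size are only illustrative.

  import requests

  BASE = "http://datasets.coronawhy.org"  # CoronaWhy Dataverse from the slide above

  # Standard Dataverse Search API; no API token is needed for public, published datasets.
  r = requests.get(f"{BASE}/api/search",
                   params={"q": "covid", "type": "dataset", "per_page": 5})
  r.raise_for_status()
  for item in r.json()["data"]["items"]:
      print(item["global_id"], "-", item["name"])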

18 of 51

Datasets from CoronaWhy vertical teams


19 of 51

COVID-19 data files verification


We verify every file by importing its contents into a dataframe.

All column names (variables) extracted from tabular data are available as labels in the file metadata.

We've enabled Dataverse data previewers to browse through the content of files without downloading them!

We're starting internal challenges to build ML models for metadata classification.
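
A minimal sketch of this verification step, assuming a tabular file that Dataverse has ingested (such files are served as tab-delimited by default); the file id is hypothetical.

  import io
  import requests
  import pandas as pd

  BASE = "http://datasets.coronawhy.org"
  FILE_ID = 12345  # hypothetical database id of a tabular file

  r = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}")
  r.raise_for_status()
  df = pd.read_csv(io.StringIO(r.text), sep="\t")

  # The column names become candidate variable labels for the file-level metadata.
  print(f"{len(df)} rows; variables: {list(df.columns)}")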

20 of 51

Dataverse content in Jupyter notebooks


21 of 51

COVID-19 Data Crowdsourcing

The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.

We approach GitHub repository owners by creating issues in their repos and inviting them to help us.

More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.

Bottom-up data collection works!


22 of 51

Challenge of data integration and various ontologies

CORD-19 collection workflows with the NLP pipeline:

  • manual annotation and labelling of COVID-19-related papers
  • automatic entity extraction and classification of text fragments (see the sketch below)
  • statement extraction and curation
  • linking papers to specific research questions via relationship extraction

Dataverse Data Lake streaming COVID-19 datasets from various sources:

  • medical data
  • socio-economic data
  • political data and statistics
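
As an illustration of the automatic entity extraction step, here is a minimal spaCy sketch; the deck does not specify a model, so the small general-purpose English model is used here, and a biomedical model such as scispacy's en_core_sci_sm could be swapped in for CORD-19 text.

  import spacy

  # Requires: python -m spacy download en_core_web_sm
  nlp = spacy.load("en_core_web_sm")

  text = ("The incubation period of SARS-CoV-2 is estimated at five to six days, "
          "and environmental stability varies with temperature and humidity.")
  for ent in nlp(text).ents:
      print(ent.text, "->", ent.label_)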


23 of 51

The importance of standards and ontologies

Generic controlled vocabularies for linking metadata in bibliographic collections are well known: ORCID, GRID, GeoNames, Getty.

Medical knowledge graphs powered by:

  • Biological Expression Language (BEL)
  • Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH)
  • Wikidata (Open ontology) - Wikipedia

Integration based on metadata standards:

  • MARC21, Dublin Core (DC), Data Documentation Initiative (DDI)


24 of 51

Biological Expression Language (BEL)


BEL was integrated into the CoronaWhy infrastructure in April 2020

25 of 51

Statements extraction with INDRA


Source: EMMAA (Ecosystem of Machine-maintained Models with Automated Assembly)

“INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system, originally developed for molecular systems biology and currently being generalized to other domains.”

Developed as a part of Harvard Program in Therapeutic Science and the Laboratory of Systems Pharmacology at Harvard Medical School.

http://indra.bio

26 of 51

Knowledge Graph curation in INDRA


27 of 51

Advanced Text Analysis (NLP pipeline)


We need a good understanding of all domain-specific texts to make the right statements:

28 of 51

Building domain specific knowledge graphs


  • We're collecting all possible COVID-19 data and archiving it in our Dataverse
  • Looking for various related controlled vocabularies and ontologies
  • Building and reusing conversion pipelines to get all data values linked in RDF format (see the sketch below)

The ultimate goal is to automate the process of knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning.
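
A minimal sketch of such a conversion pipeline using rdflib; the namespace, property names and the observation value are hypothetical, while Q2807 is the Wikidata identifier for Madrid.

  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import RDF, RDFS

  CW = Namespace("http://data.coronawhy.org/resource/")   # hypothetical namespace
  WD = Namespace("http://www.wikidata.org/entity/")

  g = Graph()
  g.bind("cw", CW)

  obs = CW["observation/1"]
  g.add((obs, RDF.type, CW.Observation))                  # hypothetical class
  g.add((obs, RDFS.label, Literal("ICU admissions, Madrid, 2020-04-01")))
  g.add((obs, CW.location, WD.Q2807))                     # link the data value to Wikidata (Madrid)
  g.add((obs, CW.value, Literal(128)))                    # hypothetical value

  g.serialize(destination="coronawhy_sample.ttl", format="turtle")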

29 of 51

Visual graph of COVID-19 dataset


30 of 51

SPARQL endpoint for CoronaWhy KG


Source: YASGUI
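
For teams that prefer scripting over YASGUI, the same endpoint can be queried from Python with SPARQLWrapper; the endpoint URL below is hypothetical, since the deck only states that Virtuoso and GraphDB expose public SPARQL endpoints.

  from SPARQLWrapper import SPARQLWrapper, JSON

  sparql = SPARQLWrapper("http://sparql.coronawhy.org/sparql")  # hypothetical endpoint URL
  sparql.setQuery("""
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT ?s ?label WHERE {
          ?s rdfs:label ?label .
          FILTER(CONTAINS(LCASE(STR(?label)), "covid"))
      } LIMIT 10
  """)
  sparql.setReturnFormat(JSON)
  for row in sparql.query().convert()["results"]["bindings"]:
      print(row["s"]["value"], row["label"]["value"])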

31 of 51

Do you know many people who can use SPARQL?


Source: Semantic annotation of the Laboratory Chemical Safety Summary in PubChem

32 of 51

CLARIAH conclusions

“By developing these decentralised, yet controlled Knowledge Graph development practices we have contributed to increasing interoperability in the humanities and enabling new research opportunities to a wide range of scholars. However, we observe that users without Semantic Web knowledge find these technologies hard to use, and place high value in end-user tools that enable engagement. Therefore, for the future we emphasise the importance of tools to specifically target the goals of concrete communities – in our case, the analytical and quantitative answering of humanities research questions for humanities scholars. In this sense, usability is not just important in a tool context; in our view, we need to empower users in deciding under what models these tools operate.” (CLARIAH: Enabling Interoperability Between Humanities Disciplines with Ontologies)

Chicken-and-egg problem: users are building tools without data models and ontologies, but in reality they need to build a knowledge graph with common ontologies first!


33 of 51

Linked Data integration challenges

  • datasets are very heterogeneous and multilingual
  • data usually lacks sufficient quality control
  • data providers use different modelling schemas and styles
  • linked data cleansing and versioning is very difficult to track and maintain properly, and web resources aren't persistent
  • even modern data repositories provide only metadata records describing the data, without giving access to the individual data items stored in files
  • it is difficult to assign and manually keep entity relationships in a knowledge graph up to date

CoronaWhy has too many information streams, which seem impossible to integrate and give back to COVID-19 researchers. So, do we have a solution?


34 of 51

Bibliographic Framework (BIBFRAME) as a Web of Data

“The Library of Congress officially launched its Bibliographic Framework Initiative in May 2011. The Initiative aims to re-envision and, in the long run, implement a new bibliographic environment for libraries that makes "the network" central and makes interconnectedness commonplace.”

“Instead of thousands of catalogers repeatedly describing the same resources, the effort of one cataloger could be shared with many.” (Source)

In 2019 the Library of Congress BIBFRAME 2.0 Pilot was announced.

Let's take a journey and move from a domain-specific ontology to a bibliographic one!


35 of 51

BIBFRAME 2.0 concepts


  • Work. The highest level of abstraction; a Work, in the BIBFRAME context, reflects the conceptual essence of the cataloged resource: authors, languages, and what it is about (subjects).
  • Instance. A Work may have one or more individual, material embodiments, for example a particular published form. These are Instances of the Work. An Instance reflects information such as its publisher, place and date of publication, and format.
  • Item. An Item is an actual copy (physical or electronic) of an Instance. It reflects information such as its location (physical or virtual), shelf mark, and barcode.
  • Agents. Agents are people, organizations, jurisdictions, etc., associated with a Work or Instance through roles such as author, editor, artist, photographer, composer, illustrator, etc.
  • Subjects. A Work might be “about” one or more concepts. Such a concept is said to be a “subject” of the Work. Concepts that may be subjects include topics, places, temporal expressions, events, works, instances, items, agents, etc.
  • Events. Occurrences, the recording of which may be the content of a Work.
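
To make the Work/Instance/Item chain concrete, here is a minimal rdflib sketch using the BIBFRAME 2.0 namespace; the example.org identifiers are hypothetical, and the MeSH URI is the descriptor for COVID-19.

  from rdflib import Graph, Literal, Namespace, URIRef
  from rdflib.namespace import RDF, RDFS

  BF = Namespace("http://id.loc.gov/ontologies/bibframe/")
  EX = Namespace("http://example.org/cord19/")              # hypothetical identifiers

  g = Graph()
  g.bind("bf", BF)

  work, instance, item = EX.work1, EX.instance1, EX.item1
  g.add((work, RDF.type, BF.Work))
  g.add((work, RDFS.label, Literal("A CORD-19 paper as a conceptual Work")))
  g.add((work, BF.subject, URIRef("http://id.nlm.nih.gov/mesh/D000086382")))  # MeSH: COVID-19

  g.add((instance, RDF.type, BF.Instance))
  g.add((instance, BF.instanceOf, work))                    # the published embodiment of the Work

  g.add((item, RDF.type, BF.Item))
  g.add((item, BF.itemOf, instance))                        # a concrete (here: electronic) copy

  print(g.serialize(format="turtle"))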

36 of 51

MARC as a foundation of the structured Data Hub


The MARC standard was developed in the 1960s to create records that could be read by computers and shared among libraries. The term MARC is an abbreviation for MAchine-Readable Cataloging.

The MARC 21 bibliographic format was created for the international community. It’s very rich, with more than 2,000 data elements defined!

It's identified by its ISO (International Organization for Standardization) number: ISO 2709.

37 of 51

How to integrate data in the common KG?

  • Use MARC 21 as the basis for all bibliographic and authority records
  • All controlled vocabularies should be expressed in the MARC 21 Format for Authority Data; we need to build an authority linking process with a “human in the loop” approach that allows AI-predicted links to be verified.
  • Different MARC 21 fields can be linked to different ontologies and even interlinked. For example, we can have some entities linked to both MeSH and Wikidata in the same bibliographic record to increase the interoperability of the Knowledge Graph.
  • Every CORD-19 paper can get metadata enrichment provided by any research team working on the NLP extraction of entities or relations, or on linking controlled vocabularies together.


38 of 51

COVID-19 paper in MARC 21 representation


Structure of the bibliographic record:

  • authority records contain information about authors and affiliations
  • medical entities extracted by the NLP pipeline are interlinked in 650x fields
  • some metadata fields are generated and filled by Machine Learning models, others are contributed by human experts
  • provenance information is kept in 833x fields, indicating fully or partially machine-generated records
  • relations between entities are stored in 730x fields
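
A minimal pymarc sketch of such an enriched record (pymarc 4.x subfield syntax; newer pymarc versions use Subfield objects); the title is hypothetical, and the MeSH link in 650 $0 shows how an NLP-extracted entity could be interlinked.

  from pymarc import Record, Field

  record = Record()
  record.add_field(Field(tag="245", indicators=["0", "0"],
                         subfields=["a", "Clinical features of hospitalised COVID-19 patients"]))  # hypothetical title
  # 650 with second indicator 2 = Medical Subject Headings; $0 carries the authority URI.
  record.add_field(Field(tag="650", indicators=[" ", "2"],
                         subfields=["a", "Coronavirus Infections",
                                    "0", "http://id.nlm.nih.gov/mesh/D018352"]))
  print(record)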

39 of 51

VuFind discovery tool for libraries powered by MARC 21


40 of 51

Landing page of publications from CORD-19


41 of 51

CORD-19 collection in BIBFRAME 2.0


42 of 51

CoronaWhy Graph published as RDF


43 of 51

Human-in-the-Loop for Machine Learning

“Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination.” (attributed to Albert Einstein)

“A combination of AI and Human Intelligence gives rise to an extremely high level of accuracy and intelligence (Super Intelligence)”


44 of 51

CLARIAH and Network Digital Heritage (NDE) GraphiQL


CoronaWhy runs its own instance of NDE on a Kubernetes cluster and maintains support for additional ontologies (MeSH, ICD, NDF) available via their SPARQL endpoints!

If you want to query the API:

curl "http://nde.dataverse-dev.coronawhy.org/nde/graphql?query=%20%7B%20terms(match%3A%22COVID%22%2Cdataset%3A%5B%22wikidata%22%5D)%20%7B%20dataset%20terms%20%7Buri%2C%20altLabel%7D%20%7D%20%7D"

45 of 51

Increasing Dataverse metadata interoperability


Support for external controlled vocabularies contributed by the SSHOC project (data infrastructure for the EOSC)

46 of 51

Hypothes.is annotations as a peer review service


  1. The AI pipeline does domain-specific entity extraction and ranking of relevant CORD-19 papers.
  2. Automatically extracted entities and statements will be added, and important fragments should be highlighted.
  3. Human annotators should verify the results and validate all statements.
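
A minimal sketch of how such annotations could be read back through the public Hypothes.is search API; the paper URL is hypothetical, and an API token is only needed for private groups.

  import requests

  paper_url = "https://example.org/cord19/paper-123"        # hypothetical annotated paper URL
  r = requests.get("https://api.hypothes.is/api/search",
                   params={"uri": paper_url, "limit": 20})
  r.raise_for_status()
  for row in r.json()["rows"]:
      print(row["user"], "-", row["text"][:80])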

47 of 51

Doccano annotation with Machine Learning

48 of 51

Building an Operating System for Open Science


The CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks, such as cancer research.

All services are built from Open Source components.

Data is processed and published in a FAIR way, and provenance information is part of our Data Lake.

Data evaluation and credibility are the top priority; we provide tools for the expert community to verify our datasets.

The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research.

49 of 51

CoronaWhy Common Research and Data Infrastructure

Data preprocessing pipeline implemented in Jupyter notebooks running in Docker with extra modules.

Dataverse as the data repository to store data from automatic and curated workflows.

Elasticsearch holds CORD-19 indexes at section and sentence level with spaCy enrichments. Other indexes: MeSH, GeoNames, GRID.

Hypothes.is and Doccano annotation services to annotate publications.

Virtuoso and GraphDB with public SPARQL endpoints to query the COVID-19 Knowledge Graph.

Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula.

https://github.com/CoronaWhy/covid-19-infrastructure
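
As a sketch of how the sentence-level CORD-19 index might be queried; the host, index name and field name below are hypothetical, since the deck does not publish them.

  import requests

  ES = "http://elastic.coronawhy.org:9200"                  # hypothetical host
  INDEX = "cord19-sentences"                                # hypothetical sentence-level index

  query = {"query": {"match": {"sentence": "incubation period"}}, "size": 3}
  r = requests.post(f"{ES}/{INDEX}/_search", json=query)
  r.raise_for_status()
  for hit in r.json()["hits"]["hits"]:
      print(hit["_score"], hit["_source"].get("sentence", "")[:80])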


50 of 51


Source: CoronaWhy API built on FastAPI framework for Python
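
The endpoint below is a minimal FastAPI sketch of the pattern shown on this slide, not the actual CoronaWhy API; the route and parameters are hypothetical.

  from fastapi import FastAPI

  app = FastAPI(title="CoronaWhy API (illustrative sketch)")

  @app.get("/search")
  def search(q: str, limit: int = 10):
      # A real endpoint would query the Elasticsearch/Dataverse backends;
      # here we only echo the parameters to show the FastAPI pattern.
      return {"query": q, "limit": limit, "results": []}

  # Run locally with: uvicorn main:app --reload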

51 of 51

Thank you! Questions?

@Slava Tykhonov on CoronaWhy Slack

vyacheslav.tykhonov@dans.knaw.nl

www.coronawhy.org

www.dans.knaw.nl/en
