1 of 50

CoronaWhy

Fight against COVID-19

Slava Tykhonov

Senior Information Scientist (DANS-KNAW)

04.06.2020

2 of 50

COVID-19 Coronavirus Europe datasets

I started collecting data in mid-March and created the EU COVID-19 Data Hub on Harvard Dataverse to make all datasets persistent and FAIR.

All datasets are archived and updated daily.

More than 2k downloads in total!

https://dataverse.harvard.edu/dataverse/covid-19-eu

That’s a lot of data. How do we process and analyse it?

3 of 50

My motivation

4 of 50

7 weeks in complete lockdown in Spain

Resistiré (I will resist)

5 of 50

About CoronaWhy

1066 people registered in the organization, more than 300 actively contributing!

6 of 50

COVID-19 Open Research Dataset Challenge (CORD-19)

It all started with this:

“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 52,000 scholarly articles, including 41,000 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)

7 of 50

CoronaWhy Community Tasks (March-April)

Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or the survival outcome of the infection
Task-Ties explores transmission, incubation and environmental stability
Named Entity Recognition across the entire corpus of CORD-19 papers with full text
Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review

I’m managing Labs and Common Data and Services for the whole community.

8 of 50

PowerBI Dashboards

Created by Mike Honey (Australia)

9 of 50

You might think it’s like that...

Credits: wonderfulengineering.com

10 of 50

But in reality, leading 1k people can look like...

11 of 50

Some rules of High Scale Open Source project management

We’re running like mad in all directions at the same time and are stuck in a perfect state of entropy. Nobody knows what will happen next
Forget all of your previous experience, nothing really works here
People with different motivations acting as volunteers don’t like to be pushed; a “war” can start at any time
Most people act like lone wolves; the most advanced join teams
Great leaders with interesting ideas attract and recruit people from other teams, sometimes causing an internal HR crisis
Almost all issues with international relations can be solved by finding a Commons. If people aren’t able to find a Commons, they fall apart.
If people have free time and are ready to contribute, we should do it right now!

12 of 50

Valve Corporation as the most similar example

“The change in Valve's approach has also been attributed to its use of a flat organization structure that was adopted as the company expanded. When founded, Valve used a hierarchical structure more typical of other development firms, driven by the nature of physical game releases through publishers that required tasks to be completed by given deadlines.^[35] However, as Valve became its own publisher via Steam, it transitioned to a looser, flat structure, which was formally in place by 2012.^[36]^[37] Outside of executive management, Valve does not have bosses, and the company used an open allocation system, allowing employees to move between departments at will.^[38]^[39] This approach allows employees to work on whatever interests them, but requires them to take ownership of their product and mistakes they may make, according to Newell. Newell recognized that this structure works well for some but that "there are plenty of great developers for whom this is a terrible place to work".^[35] Many outside observers believe the lack of organization structure has led to frequent cancellations of potential games as it can be difficult to convince other employees to work on such titles.^[40]^[41]^[42]”

(from Wikipedia)

13 of 50

Rules of people “management” in the Open Source

People want to be directed and supported, not managed

Volunteer work is treated as a self-test: for example, can a Junior Data Scientist perform as a Senior, a postgraduate student lead a project, a Senior become a Mentor?

We’re trying to support all volunteers as a kind of advisory board and by sharing available resources!

People come and go; let them contribute their best to the Open community

CoronaWhy is seen as a huge incubator; we don’t want to kill any ideas, even those considered “crazy” by most people

14 of 50

Lessons for creating good Open Source software

Every good work of software starts by scratching a developer's personal itch.
Good programmers know what to write. Great ones know what to rewrite (and reuse).
Plan to throw one [version] away; you will, anyhow. (Copied from Frederick Brooks' The Mythical Man-Month)
...
Release early. Release often. And listen to your customers.
Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.
Smart data structures and dumb code works a lot better than the other way around.
….
Often, the most striking and innovative solutions come from realizing that your concept of the problem was wrong.
Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected.
When writing gateway software of any kind, take pains to disturb the data stream as little as possible—and never throw away information unless the recipient forces you to!
...
To solve an interesting problem, start by finding a problem that is interesting to you.
Provided the development coordinator has a communications medium at least as good as the Internet, and knows how to lead without coercion, many heads are inevitably better than one.

The Cathedral and the Bazaar

Benevolent Dictator For Life?

15 of 50

Slack people land

16 of 50

Slack channels intersection

17 of 50

CoronaWhy Skills

18 of 50

CRM as a service (vtiger)

19 of 50

Skills management at CoronaWhy community

All registration data is anonymized and available in the CRM in compliance with GDPR; people’s skills can be selected in the corresponding fields

Volunteers can be contacted by task leaders and invited to participate based on their skills

All tasks and activities should be transparent and clear, with time zones aligned

20 of 50

Looking for Commons

Merce Crosas, “Harvard Data Commons”

21 of 50

CoronaWhy Dataverse

CoronaWhy Dataverse is a central integration point for all teams and individual contributors

Supports both manual and automatic data upload, integrated with Jupyter Notebooks through the Data Access API (pyDataverse)

Data points synchronized with services

Everyone working in the teams can get acknowledgement

We don’t steal others’ work: credit is given to the author(s) by asking them to put their names in Dataverse, GitHub, docs and articles!

URL: datasets.coronawhy.org
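A minimal sketch of how a notebook could address a file through the Dataverse Data Access API, which serves files under `/api/access/datafile/{id}`. It only builds the URL with the standard library and does not use pyDataverse itself; HTTPS for the host on this slide and the file id 42 are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Installation named on the slide; HTTPS is an assumption.
BASE_URL = "https://datasets.coronawhy.org"


def datafile_url(file_id, base_url=BASE_URL):
    """Return the Data Access API URL for one file, given its numeric id."""
    file_id = int(file_id)
    if file_id <= 0:
        raise ValueError("file id must be a positive integer")
    return urljoin(base_url + "/", f"api/access/datafile/{file_id}")


# 42 is a made-up placeholder id, not a real file in the repository.
example_url = datafile_url(42)
print(example_url)
```

The resulting URL can be passed to any HTTP client or to a dataframe reader that accepts URLs.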

22 of 50

CoronaWhy Data Lake

Source: LinkedIn

23 of 50

COVID-19 data files verification

We verify every file by importing its contents into a dataframe.

All column names (variables) extracted from tabular data are available as labels in the file metadata
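The verification idea can be sketched roughly as follows. This is not the CoronaWhy implementation: it swaps the dataframe import for the standard library `csv` module so it runs anywhere, and the sample rows are made up. A file passes if every row has as many fields as the header, and the header names become the labels.

```python
import csv
import io


def column_labels(text, delimiter=","):
    """Parse delimited text; return header names if every row fits the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        raise ValueError("file is empty")
    header = [name.strip() for name in rows[0]]
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {line_no} has {len(row)} fields, expected {len(header)}"
            )
    return header


# Made-up sample content, for illustration only.
sample = "date,country,cases\n2020-03-15,XX,10\n2020-03-16,XX,12\n"
labels = column_labels(sample)
print(labels)
```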

We’ve enabled Dataverse data previewers to browse through the content of files without downloading them!

We’re starting internal challenges to build ML models for metadata classification

24 of 50

COVID-19 Data Crowdsourcing

The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.

We’re approaching GitHub repository owners by creating issues in their repos and inviting them to help us.

More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.

Bottom-up data collection works!

25 of 50

CoronaWhy Common Research and Data Infrastructure

The data preprocessing pipeline is implemented on a Jupyter Notebook Docker image with extra modules.

Dataverse is the data repository that stores data from automatic and curated workflows.

Elasticsearch holds CORD-19 indexes at section and sentence level with spaCy enrichments. Other indexes: MeSH, Geonames, GRID.

A Hypothesis annotation service is running and allows annotating CORD-19 papers.

Virtuoso triplestore with a public SPARQL endpoint to query the COVID-19 Knowledge Graph

Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula

https://github.com/CoronaWhy/covid-19-infrastructure
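For the public SPARQL endpoint, a query can be sent as a plain HTTP GET with a `query` parameter, as defined by the SPARQL 1.1 Protocol. The sketch below only builds such a request URL; the endpoint address is a hypothetical placeholder, and the query is a generic "first ten triples" probe rather than anything specific to the COVID-19 Knowledge Graph.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "https://example.org/sparql"

# Generic probe query: return any ten triples from the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build a SPARQL 1.1 Protocol 'query via GET' request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = sparql_get_url(ENDPOINT, QUERY)
# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(url)
```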

26 of 50

Building a horizontal platform to serve vertical teams

Source: CoronaWhy organization

27 of 50

Key points of CoronaWhy infrastructure management

Some basic agreements:

all experimental services have a “labs” subdomain name; we can decide to shut them down if nobody is interested. New service candidates should be up and running asap and tested by users
production services are deployed on Kubernetes with CI/CD (GCP and AWS)
the infrastructure team investigates the maturity of possible services, adding CI/CD integrations if necessary
every member of the CoronaWhy community can get access to a VM sandbox on request, build their own application or tool, and suggest it as a service to the infrastructure team

28 of 50

CoronaWhy Collab service

We’re trying to encourage open collaboration in teams using our services. It seems to be a good solution for educating and training all team members, and for onboarding newcomers.

People can choose the tasks they prefer and join the team!

29 of 50

CORD-19 preprocessing pipeline

Based on Allen AI spaCy pipeline and models for scientific and biomedical documents

https://github.com/allenai/scispacy

A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors

Demo https://scispacy.apps.allenai.org

30 of 50

Elasticsearch index on CORD-19 sentences

The data preprocessing pipeline performs entity linking, named entity recognition and dependency parsing. Check it now!
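Searching a sentence-level index could look roughly like the request body below, which uses Elasticsearch's `match_phrase` query. The field name `sentence` and the search phrase are assumptions for illustration; the actual index mapping is not shown on the slide.

```python
import json


def sentence_query(phrase, size=5, field="sentence"):
    """Return an Elasticsearch request body that matches an exact phrase."""
    # 'field' is a hypothetical field name; adjust it to the real mapping.
    return {"size": size, "query": {"match_phrase": {field: phrase}}}


body = sentence_query("incubation period")
print(json.dumps(body, indent=2))
```

The body would be sent as JSON to the `_search` endpoint of the index being queried.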

31 of 50

Common Problems

The preprocessing pipeline can extract only some entities from text, not relations
The quality of entity recognition isn’t perfect or reliable; it depends on the trained model(s) and the number of entities in the vocabulary
NLP pipelines are very slow and require a lot of computation
Disambiguation of entities isn’t possible without keeping the context and relations between all words in the text
Building a high-quality Knowledge Graph is an essential part of the whole pipeline; the idea is to put all statements in triples
Every statement should be verified by human experts, keeping all provenance information, in order to produce trusted Knowledge
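One possible shape for a statement that carries its provenance and stays untrusted until a human expert signs it off is sketched below. The class, field names and values are hypothetical and chosen for illustration; they are not taken from the CoronaWhy code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str                         # provenance: where the statement was found
    extracted_by: str                   # provenance: pipeline or model that produced it
    verified_by: Optional[str] = None   # human expert; None until reviewed

    def verify(self, expert):
        """Record the expert who checked the statement."""
        self.verified_by = expert

    @property
    def trusted(self):
        return self.verified_by is not None


def trusted_triples(statements):
    """Keep only expert-verified statements, reduced to plain triples."""
    return [(s.subject, s.predicate, s.obj) for s in statements if s.trusted]


# Placeholder values, for illustration only.
stmt = Statement("entity:A", "increases", "entity:B",
                 source="paper:placeholder", extracted_by="nlp-pipeline")
before = trusted_triples([stmt])
stmt.verify("expert-1")
after = trusted_triples([stmt])
```

Keeping provenance on the statement itself means a triple can always be traced back to its source and reviewer.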

32 of 50

Building bridges between communities

The biological community works on different datasets than researchers from the socio-economic field

Computer scientists usually build tools and dashboards without input from other communities just to analyze available data

The scientometrics community has its own ideas about the importance and ranking of COVID-19 papers

Do we really have a Commons here?

33 of 50

Hypothesis as a peer review service

The AI pipeline performs domain-specific entity extraction and ranks relevant CORD-19 papers.
Automatically extracted entities and statements will be added; important fragments should be highlighted.
Human annotators should verify results and validate all statements.

34 of 50

Doccano as a service

“doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.”

It was initially developed in Japan and is now helping to create COVID-19 related annotations as a collaborative effort of humans and AI. CoronaWhy Labs has added Machine Learning models trained on biological ontologies to label entities in the CORD-19 collection.

Project GitHub: https://github.com/doccano/doccano

35 of 50

Doccano annotation with Machine Learning

36 of 50

Biological Expression Language (BEL)

37 of 50

BEL example in notebook

Maintained by Charles Tapley Hoyt

ORCID: 0000-0003-4423-4370

(from BEL tutorial)

GRB2 binds SHC

p(hgnc:4566 ! GRB2)

r(hgnc:4566 ! GRB2)

g(hgnc:4566 ! GRB2)

p(hgnc:10840 ! SHC)

complex(p(hgnc:4566 ! GRB2), p(hgnc:10840 ! SHC))

p(hgnc:8614 ! PAWR)

p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases bp(go:0006915 ! "apoptotic process")

p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases p(hgnc:9588 ! PTEN)

p(hgnc:9588 ! PTEN) increases bp(go:0006915 ! "apoptotic process")

p(hgnc:8614 ! PAWR, pmod(Ph)) =| complex(p(hgnc:8614 ! PAWR), p(fplx:14_3_3))
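The relational BEL statements above read as subject, relation, object. The sketch below splits such a line into those three parts by looking for a relation keyword outside any parentheses or quotes. It is a simplified illustration, not a BEL parser: it handles only the relation keywords listed in `RELATIONS` and does not validate the terms.

```python
# Subset of BEL relations with their short forms; not the full language.
RELATIONS = {
    "increases", "decreases", "directlyIncreases", "directlyDecreases",
    "->", "-|", "=>", "=|",
}


def split_bel_statement(statement):
    """Split 'subject relation object' at the top-level relation keyword."""
    depth, in_quotes = 0, False
    tokens, current = [], []
    for ch in statement.strip():
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")":
            depth -= 1
        # A space separates tokens only outside parentheses and quotes.
        if ch == " " and depth == 0 and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    for i, token in enumerate(tokens):
        if token in RELATIONS:
            return " ".join(tokens[:i]), token, " ".join(tokens[i + 1:])
    raise ValueError("no supported relation found")


triple = split_bel_statement(
    'p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases '
    'bp(go:0006915 ! "apoptotic process")'
)
print(triple)
```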

38 of 50

INDRA statements

INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.

http://www.indra.bio

INDRA produces triples!

39 of 50

Advanced Text Analysis

Source: D.Shlepakov, How to Build Knowledge Graph, notebook

We need a good understanding of all domain-specific texts to make the right statements:

40 of 50

Building Knowledge Graph

We’re collecting all possible COVID-19 data and archiving it in our Dataverse
Looking for various related controlled vocabularies and ontologies
Building and reusing conversion pipelines to get all data values in RDF format

The ultimate goal is to automate Knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning.
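As a small illustration of a conversion step, the sketch below turns one tabular record into RDF serialized as N-Triples, one triple per column. The subject IRI, the vocabulary namespace and the sample values are hypothetical; a real pipeline would map columns to terms from the controlled vocabularies and ontologies mentioned above.

```python
def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(subject_iri, row, vocab):
    """Emit one N-Triples line per column: <subject> <vocab + column> "value" ."""
    lines = []
    for column, value in row.items():
        literal = escape_literal(str(value))
        lines.append(f'<{subject_iri}> <{vocab}{column}> "{literal}" .')
    return lines


# Hypothetical IRIs and made-up values, for illustration only.
record = {"country": "XX", "cases": 10}
ntriples = row_to_ntriples(
    "https://example.org/record/1", record, "https://example.org/vocab/"
)
print("\n".join(ntriples))
```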

41 of 50

CoronaWhy building blocks

42 of 50

Using altmetrics for ranking relevant COVID-19 papers

43 of 50

CORD-19 Altmetrics integration

Jupyter notebook with altmetrics such as citation counts, social media coverage, etc.

Altmetric data harvested by DOI and archived in Dataverse

Data points ingested and integrated in MongoDB service

Provides very easy-to-use access to altmetrics via notebooks

The pipeline can keep data up-to-date
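Keeping harvested records up to date can be sketched as an upsert keyed by DOI: a newer harvest replaces the stored record, an older one is ignored. A plain dictionary stands in for the MongoDB collection here, and the field names (`doi`, `harvested`, `score`), DOIs and values are hypothetical.

```python
def upsert_by_doi(store, record):
    """Insert a harvested record, or replace the stored one if this harvest is newer."""
    key = record["doi"].lower()   # DOIs are case-insensitive
    current = store.get(key)
    # ISO dates (YYYY-MM-DD) compare correctly as strings.
    if current is None or record["harvested"] >= current["harvested"]:
        store[key] = record
    return store


# Placeholder DOI and made-up values, for illustration only.
store = {}
upsert_by_doi(store, {"doi": "10.1234/EXAMPLE", "harvested": "2020-05-01", "score": 3})
upsert_by_doi(store, {"doi": "10.1234/example", "harvested": "2020-06-01", "score": 7})
upsert_by_doi(store, {"doi": "10.1234/example", "harvested": "2020-04-01", "score": 1})
print(store)
```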

44 of 50

Use case: COVID-19 Dutch papers affiliations

Source: CORD-19 collection without institution affiliation labels

Process: ML model linking affiliations to the GRID database available in our infrastructure

Result: affiliations matched for most papers; the dataset is enriched with location, lon-lat and gridID.

It’s published on Dataverse.

Plan: scale up this workflow and build a service to get affiliation labels in real time
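To illustrate the linkage step only, the sketch below matches a raw affiliation string against a small registry. It swaps the ML model for simple string similarity from the standard library `difflib` module, so it is a stand-in and not the method used in this use case. The registry is a toy dictionary and the identifiers are placeholders, not real GRID IDs.

```python
import difflib


def match_affiliation(raw_name, registry, cutoff=0.6):
    """Return (registry name, identifier) for the closest entry, or None."""
    by_lower = {name.lower(): name for name in registry}
    hits = difflib.get_close_matches(
        raw_name.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not hits:
        return None
    name = by_lower[hits[0]]
    return name, registry[name]


# Toy registry with placeholder identifiers, for illustration only.
registry = {
    "University of Amsterdam": "grid-id-placeholder-1",
    "Utrecht University": "grid-id-placeholder-2",
}
match = match_affiliation("Univ. of Amsterdam", registry)
no_match = match_affiliation("zzzz", registry)
print(match, no_match)
```

The `cutoff` parameter trades recall for precision: raising it rejects weaker matches and leaves those affiliations unlabeled.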

45 of 50

Huge impact of CoronaWhy activities on society

People from all time zones are present in this radically transparent community

Quick and amazing exchange of ideas and knowledge sharing between more than 1k people with very different backgrounds (developers, scientists, doctors, epidemiologists, students, and various subject matter experts)

The community’s creations attract the best talent, as CoronaWhy provides an Open COVID-19 research platform with some dedicated resources and experts

Community members are contributing to COVID-19 related discussions worldwide and suggesting new ideas, sometimes award-winning

Anybody can join, introduce themselves and get onboarding instructions immediately

46 of 50

Building an Operating System for Open Science

CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks like cancer research

All services are built from Open Source components

Data is processed and published in a FAIR way; provenance information is part of our Data Lake

Data evaluation and credibility are the top priority; we’re providing tools for the expert community to verify our datasets

The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research