CoronaWhy
Fight against COVID-19
Slava Tykhonov
Senior Information Scientist (DANS-KNAW)
04.06.2020
COVID-19 Coronavirus Europe datasets
I started collecting data in mid-March and created the EU COVID-19 Data Hub on Harvard Dataverse to make all datasets persistent and FAIR.
All datasets are archived and updated daily.
More than 2k downloads in total!
https://dataverse.harvard.edu/dataverse/covid-19-eu
That’s a lot of data. How do we process and analyse it?
My motivation
7 weeks in complete lockdown in Spain
About CoronaWhy
1066 people registered in the organization, more than 300 actively contributing!
COVID-19 Open Research Dataset Challenge (CORD-19)
It all started from this:
“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 52,000 scholarly articles, including 41,000 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)
CoronaWhy Community Tasks (March-April)
I’m managing Labs and Common Data and Services for the whole community.
PowerBI Dashboards
Created by Mike Honey (Australia)
You might think it’s like that...
Credits: wonderfulengineering.com
But in reality, leading 1k people can look like...
Some rules of large-scale Open Source project management
Valve Corporation as the most similar example
“The change in Valve's approach has also been attributed to its use of a flat organization structure that was adopted as the company expanded. When founded, Valve used a hierarchical structure more typical of other development firms, driven by the nature of physical game releases through publishers that required tasks to be completed by given deadlines.[35] However, as Valve became its own publisher via Steam, it transitioned to a looser, flat structure, which was formally in place by 2012.[36][37] Outside of executive management, Valve does not have bosses, and the company used an open allocation system, allowing employees to move between departments at will.[38][39] This approach allows employees to work on whatever interests them, but requires them to take ownership of their product and mistakes they may make, according to Newell. Newell recognized that this structure works well for some but that "there are plenty of great developers for whom this is a terrible place to work".[35] Many outside observers believe the lack of organization structure has led to frequent cancellations of potential games as it can be difficult to convince other employees to work on such titles.[40][41][42]”
(from Wikipedia)
Rules of people “management” in Open Source
People want to be directed and supported, not managed
Volunteering is seen as a self-test: can a Junior Data Scientist perform as a Senior, a postgraduate student lead a project, a Senior become a Mentor?
We try to support all volunteers, acting as a kind of advisory board and sharing available resources!
People come and go; let them contribute their best to the Open community
CoronaWhy is seen as a huge incubator: we don’t want to kill any ideas, even those considered “crazy” by most people
Lessons for creating good Open Source software
Benevolent Dictator For Life?
Slack people land
Slack channels intersection
CoronaWhy Skills
CRM as a service (vtiger)
Skills management at CoronaWhy community
All registration data is anonymized and available in the CRM in compliance with GDPR; people’s skills can be selected in the corresponding fields
Volunteers can be contacted by task leaders and invited to participate based on their skills
All tasks and activities should be transparent and clear, with time zones aligned
Looking for Commons
Merce Crosas, “Harvard Data Commons”
CoronaWhy Dataverse
CoronaWhy Dataverse is a central integration point for all teams and individual contributors
Both manual and automatic data upload, integrated with Jupyter Notebooks through the Data Access API (pyDataverse)
Data points synchronized with services
Everyone working in the teams can get acknowledgement
We don’t steal others’ work: credit is given to the author(s) by asking them to put their names in Dataverse, GitHub, docs and articles!
URL: datasets.coronawhy.org
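As a rough illustration of the notebook integration described above, the sketch below builds the request URLs a notebook might use against this Dataverse installation. It uses only the Python standard library instead of pyDataverse; the endpoint paths follow the Dataverse Search API and Data Access API, while the search term and file id are placeholders invented for the example.

```python
from urllib.parse import urlencode

# Base URL taken from the slide; the queries and ids used below are placeholders.
BASE_URL = "https://datasets.coronawhy.org"


def search_url(query, result_type="dataset", per_page=10):
    """Build a Dataverse Search API URL for the given query string."""
    params = urlencode({"q": query, "type": result_type, "per_page": per_page})
    return f"{BASE_URL}/api/search?{params}"


def datafile_url(file_id):
    """Build a Data Access API URL that downloads one file by its numeric id."""
    return f"{BASE_URL}/api/access/datafile/{file_id}"


if __name__ == "__main__":
    print(search_url("covid-19 netherlands"))
    print(datafile_url(42))  # 42 is a hypothetical file id
```

The resulting URLs can be fetched with any HTTP client from inside a notebook; restricted files would additionally need an API token.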
CoronaWhy Data Lake
Source: LinkedIn
COVID-19 data files verification
We verify every file by importing its contents into a dataframe.
All column names (variables) extracted from tabular data are available as labels in the file metadata
We’ve enabled Dataverse data previewers to browse through the content of files without downloading them!
We’re starting internal challenges to build ML models for metadata classification
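The verification step above can be sketched roughly as follows. This is a simplified stand-in, not the team's actual code: it uses the standard csv module in place of a dataframe library so that it has no dependencies, and the sample rows are invented.

```python
import csv
import io


def verify_tabular_file(text, delimiter=","):
    """Parse CSV text and report whether it loads cleanly, plus its column labels.

    Returns a dict with the column names (usable as metadata labels), the
    number of data rows, and a list of problems found while reading.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    try:
        header = next(reader)
    except StopIteration:
        return {"ok": False, "labels": [], "rows": 0, "problems": ["empty file"]}

    labels = [name.strip() for name in header]
    problems = []
    if any(not name for name in labels):
        problems.append("blank column name in header")
    if len(set(labels)) != len(labels):
        problems.append("duplicate column names")

    rows = 0
    # Row 1 is the header, so data rows are counted from 2.
    for row_no, row in enumerate(reader, start=2):
        rows += 1
        if len(row) != len(labels):
            problems.append(
                f"row {row_no}: expected {len(labels)} fields, got {len(row)}"
            )

    return {"ok": not problems, "labels": labels, "rows": rows, "problems": problems}


if __name__ == "__main__":
    sample = "date,country,cases\n2020-03-15,NL,10\n2020-03-16,NL,12\n"  # invented data
    print(verify_tabular_file(sample))
```

The returned labels are what would be attached to the file metadata; a file with problems would be flagged for manual review rather than published as-is.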
COVID-19 Data Crowdsourcing
The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.
We approach GitHub repository owners by creating issues in their repos and inviting them to help us.
More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.
Bottom-up data collection works!
CoronaWhy Common Research and Data Infrastructure
Data preprocessing pipeline implemented on the Jupyter Notebook Docker image with extra modules.
Dataverse as the data repository to store data from automatic and curated workflows.
Elasticsearch holds CORD-19 indexes at section and sentence level with spaCy enrichments. Other indexes: MeSH, Geonames, GRID.
A Hypothesis annotation service is running and allows annotating CORD-19 papers.
Virtuoso triplestore with a public SPARQL endpoint to query the COVID-19 Knowledge Graph
Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula
https://github.com/CoronaWhy/covid-19-infrastructure
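To give an idea of how the public SPARQL endpoint listed above could be queried from a notebook, here is a minimal sketch that prepares a SPARQL Protocol request with the standard library. The endpoint address is a placeholder (the slides do not state it), and the query is a generic "list a few triples" example rather than one tied to the actual Knowledge Graph schema.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the slides do not give the SPARQL endpoint address.
ENDPOINT = "https://example.org/sparql"


def build_select_query(limit=10):
    """Return a generic SPARQL query listing a few triples from the graph."""
    return f"SELECT ?s ?p ?o WHERE {{ ?s ?p ?o }} LIMIT {int(limit)}"


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a SPARQL Protocol GET request asking for JSON."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_request(build_select_query(5))
    print(request.full_url)
```

Passing the prepared request to urllib.request.urlopen would execute it against a real endpoint and return the result bindings as JSON.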
Building a horizontal platform to serve vertical teams
Source: CoronaWhy organization
Key points of CoronaWhy infrastructure management
Some basic agreements:
CoronaWhy Collab service
We’re trying to encourage Open collaboration in teams using our services. It seems to be a good solution for educating and training all team members and for onboarding newcomers.
People can choose the tasks they prefer and join the team!
CORD-19 preprocessing pipeline
Based on the Allen AI spaCy pipeline and models for scientific and biomedical documents
https://github.com/allenai/scispacy
A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors
Elasticsearch index on CORD-19 sentences
The data preprocessing pipeline performs entity linking, named entity recognition and dependency parsing. Check it now!
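A query against the sentence-level index might look roughly like the sketch below, which only builds the Elasticsearch query DSL body. The field names "sentence" and "entities.label" are assumptions made for illustration; the real index mapping is not shown in the slides and may differ.

```python
import json


def sentence_query(text, entity_label=None, size=5):
    """Build an Elasticsearch query DSL body for a sentence-level index.

    The field names "sentence" and "entities.label" are assumptions made for
    this sketch; the real index mapping may differ.
    """
    must = [{"match": {"sentence": text}}]
    filters = []
    if entity_label is not None:
        # Restrict hits to sentences carrying an entity with this label.
        filters.append({"term": {"entities.label": entity_label}})
    return {"size": size, "query": {"bool": {"must": must, "filter": filters}}}


if __name__ == "__main__":
    body = sentence_query("incubation period", entity_label="DISEASE")
    print(json.dumps(body, indent=2))
```

The body can be sent to the index's _search endpoint with any Elasticsearch client or plain HTTP POST.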
Common Problems
Building bridges between communities
The biological community works on different datasets than researchers from the socio-economic field
Computer scientists usually build tools and dashboards without input from other communities just to analyze available data
The scientometrics community has its own ideas about the importance and ranking of COVID-19 papers
Do we really have a Commons here?
Hypothesis as a peer review service
Doccano as a service
“doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.”
It was initially developed in Japan and now helps to create COVID-19 related annotations as a collaborative effort of humans and AI. CoronaWhy Labs has added Machine Learning models trained on biological ontologies to label entities in the CORD-19 collection.
Project GitHub: https://github.com/doccano/doccano
Doccano annotation with Machine Learning
Biological Expression Language (BEL)
BEL example in notebook
Maintained by Charles Tapley Hoyt
ORCID: 0000-0003-4423-4370
(from BEL tutorial)
GRB2 bind SHC
p(hgnc:4566 ! GRB2)
r(hgnc:4566 ! GRB2)
g(hgnc:4566 ! GRB2)
p(hgnc:10840 ! SHC)
complex(p(hgnc:4566 ! GRB2), p(hgnc:10840 ! SHC))
p(hgnc:8614 ! PAWR)
p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases bp(go:0006915 ! "apoptotic process")
p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases p(hgnc:9588 ! PTEN)
p(hgnc:9588 ! PTEN) increases bp(go:0006915 ! "apoptotic process")
p(hgnc:8614 ! PAWR, pmod(Ph)) =| complex(p(hgnc:8614 ! PAWR), p(fplx:14_3_3))
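To show how BEL statements like the ones above decompose into subject, relation and object, here is a small illustrative splitter. It is not the PyBEL parser and covers only a handful of BEL relations; it simply looks for the first relation keyword that sits outside parentheses and quotes.

```python
# A small subset of BEL relations, in long and short form.
RELATIONS = (
    "increases", "decreases", "directlyIncreases", "directlyDecreases",
    "->", "-|", "=>", "=|",
)


def top_level_tokens(statement):
    """Split on whitespace, but only outside parentheses and double quotes."""
    tokens, current, depth, in_quotes = [], [], 0, False
    for ch in statement:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0 and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def split_bel_statement(statement):
    """Return (subject, relation, object), or (term, None, None) for a bare term."""
    tokens = top_level_tokens(statement)
    for index, token in enumerate(tokens):
        if token in RELATIONS:
            subject = " ".join(tokens[:index])
            obj = " ".join(tokens[index + 1:])
            return subject, token, obj
    return statement.strip(), None, None


if __name__ == "__main__":
    line = 'p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases bp(go:0006915 ! "apoptotic process")'
    print(split_bel_statement(line))
```

Each statement with a relation yields one subject-relation-object triple, which is the shape a knowledge graph needs.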
INDRA statements
INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.
INDRA produces triples!
Advanced Text Analysis
Source: D.Shlepakov, How to Build Knowledge Graph, notebook
We need a good understanding of all domain-specific texts to make the right statements:
Building Knowledge Graph
The ultimate goal is to automate knowledge extraction using the latest developments in Artificial Intelligence and Deep Learning.
CoronaWhy building blocks
Using altmetrics for ranking relevant COVID-19 papers
CORD-19 Altmetrics integration
Jupyter notebook with altmetrics such as citation counts, social media coverage, etc.
Altmetric data harvested by DOI and archived in Dataverse
Data points ingested and integrated in MongoDB service
Easy-to-use access to altmetrics is provided via notebooks
The pipeline can keep data up-to-date
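The DOI-based integration described above amounts to a join between paper metadata and harvested altmetric records. The sketch below shows one way to do that join and a simple ranking in plain Python; the record fields ("doi", "score") and the DOIs are hypothetical and do not reflect the actual Altmetric payload or MongoDB schema.

```python
DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def normalize_doi(doi):
    """Lower-case a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def attach_altmetrics(papers, altmetric_records):
    """Return copies of the papers with an 'altmetric_score' field added.

    Papers without a matching altmetric record get a score of None.
    """
    by_doi = {
        normalize_doi(rec["doi"]): rec for rec in altmetric_records if rec.get("doi")
    }
    enriched = []
    for paper in papers:
        record = by_doi.get(normalize_doi(paper.get("doi", "")), {})
        enriched.append({**paper, "altmetric_score": record.get("score")})
    return enriched


def rank_by_score(papers):
    """Sort papers by score, highest first; papers without a score go last."""
    return sorted(
        papers,
        key=lambda p: (p["altmetric_score"] is None, -(p["altmetric_score"] or 0)),
    )


if __name__ == "__main__":
    papers = [{"title": "A", "doi": "10.1234/a"}, {"title": "B", "doi": "10.1234/b"}]
    records = [{"doi": "https://doi.org/10.1234/B", "score": 12}]  # invented values
    print(rank_by_score(attach_altmetrics(papers, records)))
```

Re-running the join whenever new altmetric records are harvested is what keeps the enriched dataset up to date.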
Use case: COVID-19 Dutch papers affiliations
Source: CORD-19 collection without institutional affiliation labels
Process: an ML model links affiliations to the GRID database available in our infrastructure
Result: affiliations matched for most papers; the dataset is enriched with location, lon-lat and gridID.
It’s published on Dataverse.
Plan: scale up this workflow and build a service to get affiliation labels in real time
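The slides do not describe the ML model used for the GRID linkage, so the sketch below swaps in simple fuzzy string matching from the standard library to illustrate the shape of the workflow: take a raw affiliation string, find the closest institution name, and return its identifier and coordinates. The institution names, ids and coordinates are placeholders, not real GRID entries.

```python
import difflib

# Tiny stand-in for the GRID index; names, ids and coordinates are placeholders.
GRID_SAMPLE = {
    "Example University of Technology": {"grid_id": "grid.example.1", "lat": 0.0, "lon": 0.0},
    "Sample Medical Center": {"grid_id": "grid.example.2", "lat": 1.0, "lon": 1.0},
}


def match_affiliation(raw, grid=GRID_SAMPLE, cutoff=0.6):
    """Return the best fuzzy match for a raw affiliation string, or None.

    Fuzzy matching via difflib is a stand-in for the ML model mentioned above.
    """
    names = {name.lower(): name for name in grid}
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(names), n=1, cutoff=cutoff
    )
    if not hits:
        return None
    name = names[hits[0]]
    return {"name": name, **grid[name]}


if __name__ == "__main__":
    print(match_affiliation("Example Univ. of Technology"))
```

A real-time service would wrap a function like this behind an HTTP endpoint and query the full GRID index instead of an in-memory sample.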
Huge impact of CoronaWhy activities on society
People from all time zones are present in this radically transparent community
Quick and amazing exchange of ideas and knowledge between more than 1k people with very different backgrounds (developers, scientists, doctors, epidemiologists, students, and various subject matter experts)
The community’s creations attract the best talent, as CoronaWhy provides an Open COVID-19 research platform with dedicated resources and experts
Community members contribute to COVID-19 related discussions worldwide and suggest new ideas, some of them award-winning
Anybody can join, introduce themselves and get onboarding instructions immediately
Building an Operating System for Open Science
CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks like cancer research
All services are built from Open Source components
Data is processed and published in a FAIR way; provenance information is part of our Data Lake
Data evaluation and credibility are the top priority: we provide tools for the expert community to verify our datasets
The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research
CoronaWhy is changing the world’s Data Landscape
National Institutes of Health Webinar
Some references to articles and notebooks
I’m an AI researcher and here’s how I fight corona by Artur Kiulian
Exploration of Document Clustering with SPECTER Embeddings by Brandon Eychaner
COVID-19 Research Papers Geolocation by Ishan Sharma
How to access Elasticsearch and Dataverse, notebook
CoronaWhy Elasticsearch Tutorial notebook
How to Create Knowledge Graph, notebook
Dataverse Colab Connect, notebook
GitHub dataset sync with Dataverse, notebook
Thank you! Questions?
@Slava Tykhonov on CoronaWhy Slack
vyacheslav.tykhonov@dans.knaw.nl