1 of 50

CoronaWhy

Fight against COVID-19

Slava Tykhonov

Senior Information Scientist (DANS-KNAW)

04.06.2020

2 of 50

COVID-19 Coronarius Europe datasets

I started collecting data in mid-March and created the EU COVID-19 Data Hub on Harvard Dataverse to make all datasets persistent and FAIR.

All datasets are archived and updated daily.

More than 2k downloads in total!

https://dataverse.harvard.edu/dataverse/covid-19-eu

That’s a lot of data. How do we process and analyse it?

3 of 50

My motivation

4 of 50

7 weeks in complete lockdown in Spain

5 of 50

About CoronaWhy

1066 people registered in the organization, more than 300 actively contributing!

6 of 50

COVID-19 Open Research Dataset Challenge (CORD-19)

It all started with this:

“In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 52,000 scholarly articles, including 41,000 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)

7 of 50

CoronaWhy Community Tasks (March-April)

  1. Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or the survival outcome of the infection
  2. Task-Ties to explore transmission, incubation and environment stability
  3. Named Entity Recognition across the entire corpus of CORD-19 papers with full text
  4. Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
  5. COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review

I’m managing Labs and Common Data and Services for the whole community.

8 of 50

PowerBI Dashboards

Created by Mike Honey (Australia)

9 of 50

You might think it’s like that...

Credits: wonderfulengineering.com

10 of 50

But in reality, leading 1k people can look like...

11 of 50

Some rules of High Scale Open Source project management

  1. We’re running like mad in all directions at the same time and are stuck in a perfect state of entropy. Nobody knows what will happen next
  2. Forget all of your previous experience, nothing really works here
  3. People with different motivations acting as volunteers don’t like to be pushed; the “war” can start anytime
  4. Most people act like lone wolves; the most advanced join teams
  5. Great leaders with interesting ideas attract and hire people from other teams, sometimes causing an internal HR crisis
  6. Almost all issues with international relations can be solved by finding a Commons. If people aren’t able to find a Commons, they fall apart.
  7. If people have free time and are ready to contribute, we should do it right now!

12 of 50

Valve Corporation as the most similar example

“The change in Valve's approach has also been attributed to its use of a flat organization structure that was adopted as the company expanded. When founded, Valve used a hierarchical structure more typical of other development firms, driven by the nature of physical game releases through publishers that required tasks to be completed by given deadlines.[35] However, as Valve became its own publisher via Steam, it transitioned to a looser, flat structure, which was formally in place by 2012.[36][37] Outside of executive management, Valve does not have bosses, and the company used an open allocation system, allowing employees to move between departments at will.[38][39] This approach allows employees to work on whatever interests them, but requires them to take ownership of their product and mistakes they may make, according to Newell. Newell recognized that this structure works well for some but that "there are plenty of great developers for whom this is a terrible place to work".[35] Many outside observers believe the lack of organization structure has led to frequent cancellations of potential games as it can be difficult to convince other employees to work on such titles.[40][41][42]”

(from Wikipedia)

13 of 50

Rules of people “management” in the Open Source

People want to be directed and supported, not managed

Volunteering is seen as a self-test: for example, can a Junior Data Scientist perform as a Senior, a postgraduate student lead a project, or a Senior become a Mentor?

We’re trying to support all volunteers as a kind of advisory board and by sharing available resources!

People come and go; let them contribute their best to the Open community

CoronaWhy is seen as a huge incubator; we don’t want to kill any ideas, even ones that most people consider “crazy”

14 of 50

Lessons for creating good Open Source software

  • Every good work of software starts by scratching a developer's personal itch.
  • Good programmers know what to write. Great ones know what to rewrite (and reuse).
  • Plan to throw one [version] away; you will, anyhow. (Copied from Frederick Brooks' The Mythical Man-Month)
  • ...
  • Release early. Release often. And listen to your customers.
  • Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.
  • Smart data structures and dumb code works a lot better than the other way around.
  • ...
  • Often, the most striking and innovative solutions come from realizing that your concept of the problem was wrong.
  • Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected.
  • When writing gateway software of any kind, take pains to disturb the data stream as little as possible—and never throw away information unless the recipient forces you to!
  • ...
  • To solve an interesting problem, start by finding a problem that is interesting to you.
  • Provided the development coordinator has a communications medium at least as good as the Internet, and knows how to lead without coercion, many heads are inevitably better than one.

Benevolent Dictator For Life?

15 of 50

Slack people land

16 of 50

Slack channels intersection

17 of 50

CoronaWhy Skills

18 of 50

CRM as a service (vtiger)

19 of 50

Skills management at CoronaWhy community

All registration data is anonymized and available in the CRM in compliance with GDPR; people’s skills can be selected in the corresponding fields

Volunteers can be contacted by task leaders and invited to participate based on their skills

All tasks and activities should be transparent and clear, with time zones aligned

20 of 50

Looking for Commons

Merce Crosas, “Harvard Data Commons”

21 of 50

CoronaWhy Dataverse

CoronaWhy Dataverse is a central integration point for all teams and individual contributors

Both manual and automatic data upload, integrated with Jupyter Notebooks via the Data Access API (pyDataverse)

Data points synchronized with services

All people that are working in the teams can get acknowledgement

We don’t steal others’ work: credit is given to the author(s) by asking them to put their names in Dataverse, GitHub, docs, and articles!

URL: datasets.coronawhy.org
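
A minimal sketch of how a notebook could reach files in a Dataverse installation over plain HTTP. It uses the standard Dataverse Data Access and native API routes rather than pyDataverse itself, and it assumes the host from this slide is served over https. The file id and DOI passed in are whatever the caller supplies; none are taken from the real repository.

```python
import urllib.request
from urllib.parse import urlencode

# Host from the slide; the https scheme is an assumption.
BASE_URL = "https://datasets.coronawhy.org"


def datafile_url(file_id, base_url=BASE_URL):
    """Build the Data Access API URL for one file, addressed by its numeric id."""
    return f"{base_url}/api/access/datafile/{int(file_id)}"


def dataset_metadata_url(persistent_id, base_url=BASE_URL):
    """Build the native API URL that returns dataset metadata for a DOI or handle."""
    query = urlencode({"persistentId": persistent_id})
    return f"{base_url}/api/datasets/:persistentId/?{query}"


def download(file_id, api_token=None, base_url=BASE_URL):
    """Fetch the raw bytes of a file; a token is only needed for restricted files."""
    request = urllib.request.Request(datafile_url(file_id, base_url))
    if api_token:
        request.add_header("X-Dataverse-key", api_token)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

In a notebook, the bytes returned by download() can be handed straight to a CSV or JSON parser.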

22 of 50

CoronaWhy Data Lake

Source: LinkedIn

23 of 50

COVID-19 data files verification

We verify every file by importing its contents into a dataframe.

All column names (variables) extracted from tabular data are available as labels in the file metadata

We’ve enabled Dataverse data previewers to browse through the content of files without download!

We’re starting internal challenges to build ML models for the metadata classification
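
A rough sketch of the kind of check described on this slide: load a tabular file, fail loudly if it is malformed, and return the column names that could become metadata labels. It uses only the standard csv module; in a notebook the same check would more likely be a pandas read_csv call. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def verify_tabular(text, delimiter=","):
    """Parse CSV text and return (column_names, row_count).

    Raises ValueError if the file is empty, has an unnamed column,
    or contains a record whose field count differs from the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    try:
        header = next(reader)
    except StopIteration:
        raise ValueError("file is empty") from None
    columns = [name.strip() for name in header]
    if any(not name for name in columns):
        raise ValueError("header contains an empty column name")
    row_count = 0
    for record_number, row in enumerate(reader, start=2):
        if not row:
            continue  # the csv module yields [] for blank lines
        if len(row) != len(columns):
            raise ValueError(
                f"record {record_number}: expected {len(columns)} fields, got {len(row)}"
            )
        row_count += 1
    return columns, row_count


# Made-up sample rows, not real case counts.
sample = "date,country,cases\n2020-03-15,XX,1\n2020-03-16,XX,2\n"
labels, rows = verify_tabular(sample)
```

The returned labels list is what would be attached to the file as metadata.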

24 of 50

COVID-19 Data Crowdsourcing

The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.

We’re approaching GitHub owners by creating issues in their repos and inviting them to help us.

More than 20% of data owners are joining the CoronaWhy community or are interested in curating their datasets.

Bottom-up data collection works!

25 of 50

CoronaWhy Common Research and Data Infrastructure

Data preprocessing pipeline implemented in a Jupyter notebook Docker image with extra modules.

Dataverse as data repository to store data from automatic and curated workflows.

Elasticsearch has CORD-19 indexes at section and sentence level with spaCy enrichments. Other indexes: MeSH, Geonames, GRID.

A running Hypothesis annotation service allows annotating CORD-19 papers.

Virtuoso triplestore with public SPARQL Endpoint to query COVID-19 Knowledge Graph

Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula

https://github.com/CoronaWhy/covid-19-infrastructure
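
A small sketch of querying a public SPARQL endpoint such as the Virtuoso one mentioned above, following the SPARQL 1.1 Protocol (query passed as a GET parameter, JSON results requested through the Accept header). The endpoint URL below is a placeholder, not the real CoronaWhy address, and the query is a generic "first ten triples" probe rather than anything specific to the Knowledge Graph schema.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Placeholder endpoint; substitute the real SPARQL endpoint URL.
ENDPOINT = "https://example.org/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request(query, endpoint=ENDPOINT):
    """Prepare a GET request that asks the endpoint for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query.strip()})}"
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        results = json.load(response)
    return results["results"]["bindings"]
```

Calling run_query(QUERY, endpoint=...) with a live endpoint returns one dictionary per result row, keyed by variable name.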

26 of 50

Building a horizontal platform to serve vertical teams

Source: CoronaWhy organization

27 of 50

Key points of CoronaWhy infrastructure management

Some basic agreements:

  • all experimental services have a “labs” subdomain name; we can decide to shut them down if nobody is interested. New service candidates should be up and running asap and tested by users
  • production services are deployed on Kubernetes with CI/CD (GCP and Amazon AWS)
  • the infrastructure team investigates the maturity of possible services, adding CI/CD integrations if necessary
  • every member of the CoronaWhy community can get access to a VM sandbox on request, build their own application or tool, and suggest it as a service to the infrastructure team

28 of 50

CoronaWhy Collab service

We’re trying to encourage Open collaboration in teams using our services. It seems to be a good solution for educating and training all team members, and for onboarding newcomers.

People can choose which tasks they prefer and join the team!

29 of 50

CORD-19 preprocessing pipeline

Based on the Allen AI spaCy pipeline and models for scientific and biomedical documents

https://github.com/allenai/scispacy

A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors

30 of 50

Elasticsearch index on CORD-19 sentences

The data preprocessing pipeline does entity linking, named entity recognition, and dependency parsing. Check it now!
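
A sketch of what a search against a sentence-level index could look like, written as an Elasticsearch Query DSL body. The field names "sentence" and "entities" are assumptions for illustration; the real CORD-19 index mapping may differ. The body would be sent as JSON to the _search route of the index.

```python
import json


def sentence_query(text, entity=None, size=10):
    """Build a bool query: full-text match on the sentence, optional exact entity filter."""
    must = [{"match": {"sentence": text}}]  # "sentence" is a hypothetical field name
    filters = [{"term": {"entities": entity}}] if entity else []  # hypothetical field
    return {"size": size, "query": {"bool": {"must": must, "filter": filters}}}


body = sentence_query("incubation period", entity="SARS-CoV-2")
print(json.dumps(body, indent=2))
```

Keeping the entity condition in filter rather than must means it narrows the results without affecting the relevance score.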

31 of 50

Common Problems

  • The preprocessing pipeline can extract only some entities from text, not relations
  • The quality of entity recognition isn’t perfect or reliable; it depends on the trained model(s) and the number of entities in the vocabulary
  • NLP pipelines are very slow and require a lot of computation
  • disambiguation of entities isn’t possible without keeping the context and relations between all words in the text
  • building a high quality Knowledge Graph is an essential part of the whole pipeline; the idea is to put all statements in triples
  • every statement should be verified by human experts, keeping all the provenance information, in order to produce trusted Knowledge
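
The last two points can be sketched as a data structure: a triple that carries its provenance and only counts as trusted once a human expert has signed it off. The class and field names here are illustrative assumptions, not the actual CoronaWhy schema, and the example values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    source_paper: str                  # identifier of the paper the statement came from
    sentence: str                      # evidence text the statement was extracted from
    extracted_by: str                  # name of the model or pipeline
    verified_by: Optional[str] = None  # set once a human expert confirms it

    @property
    def trusted(self):
        """A statement is trusted only after human verification."""
        return self.verified_by is not None

    def verify(self, expert):
        """Record which expert validated the statement."""
        self.verified_by = expert


def trusted_triples(statements):
    """Return only the verified statements as plain (s, p, o) triples."""
    return [(s.subject, s.predicate, s.obj) for s in statements if s.trusted]


# Placeholder values for illustration only.
draft = Statement("entity-A", "associated_with", "entity-B",
                  source_paper="paper-0001", sentence="...", extracted_by="ner-pipeline")
before = trusted_triples([draft])
draft.verify("expert-1")
after = trusted_triples([draft])
```

Unverified statements stay in the store with their provenance, so an expert can still review them later.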

32 of 50

Building bridges between communities

The biological community works on different datasets than researchers from the socio-economic field

Computer scientists usually build tools and dashboards without input from other communities just to analyze available data

The scientometrics community has its own ideas about the importance and ranking of COVID-19 papers

Do we really have a Commons here?

33 of 50

Hypothesis as a peer review service

  1. The AI pipeline does domain-specific entity extraction and ranking of relevant CORD-19 papers.
  2. Automatically extracted entities and statements will be added; important fragments should be highlighted.
  3. Human annotators should verify results and validate all statements.

34 of 50

Doccano as a service

“doccano is an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.”

It was initially developed in Japan and is now helping to create COVID-19 related annotations as a collaborative effort of humans and AI. CoronaWhy Labs has added Machine Learning models trained on biological ontologies to label entities in the CORD-19 collection.

Project github: https://github.com/doccano/doccano

35 of 50

Doccano annotation with Machine Learning

36 of 50

Biological Expression Language (BEL)

37 of 50

BEL example in notebook

Maintained by Charles Tapley Hoyt

ORCID: 0000-0003-4423-4370

(from BEL tutorial)

GRB2 bind SHC

p(hgnc:4566 ! GRB2)

r(hgnc:4566 ! GRB2)

g(hgnc:4566 ! GRB2)

p(hgnc:10840 ! SHC)

complex(p(hgnc:4566 ! GRB2), p(hgnc:10840 ! SHC))

p(hgnc:8614 ! PAWR)

p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases bp(go:0006915 ! "apoptotic process")

p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases p(hgnc:9588 ! PTEN)

p(hgnc:9588 ! PTEN) increases bp(go:0006915 ! "apoptotic process")

p(hgnc:8614 ! PAWR, pmod(Ph)) =| complex(p(hgnc:8614 ! PAWR), p(fplx:14_3_3))
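
To show how statements like the ones above map onto subject, relation and object, here is a deliberately naive splitter. It is not a BEL parser (PyBEL is the real tool for that); it only looks for a relation keyword that sits outside all parentheses and quotes. The function name and the relation list, which covers just a few BEL relations and their short forms (for example =| for directlyDecreases), are choices made for this sketch.

```python
# A few BEL relations and short forms; the full language defines more.
RELATIONS = {"increases", "decreases", "->", "-|", "=>", "=|"}


def split_bel(statement):
    """Split a BEL line into (subject, relation, object) at the top-level relation.

    Returns (term, None, None) when the line is a single term with no relation.
    """
    depth = 0
    in_quotes = False
    tokens, current = [], []
    for char in statement.strip():
        if char == '"':
            in_quotes = not in_quotes
        elif not in_quotes and char == "(":
            depth += 1
        elif not in_quotes and char == ")":
            depth -= 1
        # Only spaces outside parentheses and quotes separate tokens.
        if char == " " and depth == 0 and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        tokens.append("".join(current))
    for index, token in enumerate(tokens):
        if token in RELATIONS:
            return " ".join(tokens[:index]), token, " ".join(tokens[index + 1:])
    return " ".join(tokens), None, None


triple = split_bel(
    'p(hgnc:8614 ! PAWR, pmod(Ph, Thr, 163)) increases bp(go:0006915 ! "apoptotic process")'
)
```

Here triple holds the phosphorylated PAWR term, the relation increases, and the apoptotic process term, which is already the shape of a Knowledge Graph edge.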

38 of 50

INDRA statements

INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system interfacing with NLP systems and databases to collect knowledge, and through a process of assembly, produce causal graphs and dynamical models.

http://www.indra.bio

INDRA produces triples!

39 of 50

Advanced Text Analysis

We need a good understanding of all domain-specific texts to make the right statements:

40 of 50

Building Knowledge Graph

  • We’re collecting all possible COVID-19 data and archiving it in our Dataverse
  • Looking for various related controlled vocabularies and ontologies
  • Building and reusing conversion pipelines to get all data values in RDF format

The ultimate goal is to automate the process of Knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning.
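
A minimal sketch of the conversion step: turning one row of tabular data into RDF, serialized as N-Triples. The namespace below is a placeholder, and the "record" and "property" path segments are naming choices made for this example; a real pipeline would map columns onto terms from the controlled vocabularies and ontologies mentioned above, probably with a library such as rdflib.

```python
from urllib.parse import quote

# Hypothetical namespace, used only for this illustration.
BASE = "https://example.org/covid/"


def iri(path, base=BASE):
    """Wrap a percent-encoded path under the base namespace in angle brackets."""
    return f"<{base}{quote(str(path))}>"


def literal(value):
    """Serialize a value as a plain N-Triples string literal with escaping."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_ntriples(row_id, row):
    """Emit one triple per column: record IRI, property IRI, literal value."""
    subject = iri(f"record/{row_id}")
    return [
        f"{subject} {iri('property/' + column)} {literal(value)} ."
        for column, value in row.items()
    ]


lines = row_to_ntriples(1, {"country": "ES"})
```

Each returned line is a complete N-Triples statement, ready for loading into a triplestore.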

41 of 50

CoronaWhy building blocks

42 of 50

Using altmetrics for ranking relevant COVID-19 papers

43 of 50

CORD-19 Altmetrics integration

Jupyter notebook with altmetrics such as citation counts, social media coverage, etc.

Altmetric data harvested by DOI and archived in Dataverse

Data points ingested and integrated in MongoDB service

Provides very easy-to-use access to altmetrics via notebooks

The pipeline can keep data up-to-date

44 of 50

Use case: COVID-19 Dutch papers affiliations

Source: CORD-19 collection without institutional affiliation labels

Process: an ML model links papers to the GRID database available in our infrastructure

Result: affiliations matched for most papers; the dataset is enriched with location, lon-lat, and gridID.

It’s published on Dataverse.

Plan: scale up this workflow and build a service to get affiliation labels in real time
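
To illustrate the linkage step, here is a simple stand-in that swaps the ML model for fuzzy string matching from the standard library (difflib). It is not the method used in the project. The mini-registry stands in for GRID, and its ids are placeholders, not real GRID identifiers.

```python
import difflib

# Hypothetical mini-registry standing in for GRID; the ids are placeholders.
REGISTRY = {
    "Delft University of Technology": "grid.example.1",
    "Leiden University": "grid.example.2",
    "Utrecht University": "grid.example.3",
}


def match_affiliation(raw_name, registry=REGISTRY, cutoff=0.6):
    """Return (canonical_name, registry_id) for the closest name, or None below the cutoff."""
    by_lowercase = {name.lower(): name for name in registry}
    hits = difflib.get_close_matches(
        raw_name.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    if not hits:
        return None
    name = by_lowercase[hits[0]]
    return name, registry[name]


match = match_affiliation("Delft Univ. of Technology")
```

Raising the cutoff trades recall for precision; a trained model replaces this heuristic once labelled matches are available.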

45 of 50

Huge impact of CoronaWhy activities on the society

People from all time zones are present in this radically transparent community

Quick and amazing exchange of ideas and sharing of knowledge between more than 1k people with really different backgrounds (developers, scientists, doctors, epidemiologists, students, and various subject matter experts)

Creations of the community attract the best talent, as CoronaWhy provides an Open COVID-19 research platform with some dedicated resources and experts

Community members are contributing to COVID-19 related discussions worldwide and suggesting new ideas, sometimes award-winning

Anybody can join, introduce themselves and get onboarding instructions immediately

46 of 50

Building an Operating System for Open Science

CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks like cancer research

All services are built from Open Source components

Data is processed and published in a FAIR way; provenance information is part of our Data Lake

Data evaluation and credibility are the top priority; we’re providing tools for the expert community for the verification of our datasets

The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research

47 of 50

CoronaWhy is changing the world’s Data Landscape

48 of 50

National Institutes of Health Webinar

49 of 50

Some references to articles and notebooks

50 of 50

Thank you! Questions?

@Slava Tykhonov on CoronaWhy Slack

vyacheslav.tykhonov@dans.knaw.nl
