Generation and Application of Biomedical Knowledge Graphs
PhD Colloquium of
Charles Tapley Hoyt
ORCID: 0000-0003-4423-4370
Presented December 3rd, 2019 at the University of Bonn
First Reviewer: Prof. Dr. Martin Hofmann-Apitius
Second Reviewer: Prof. Dr. Andreas Weber
Committee Member (related field): Prof. Dr. Thomas Schultz
Committee Member (outside field): Prof. Dr. Diana Imhof
Knowledge Graphs for Storage and Integration
Scannell, J. W., et al. (2012). Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3), 191–200.
Biomedical Knowledge Graphs (KGs)
Systems Biology Markup Language (SBML)
Biological Pathways Exchange (BioPAX)
Biological Expression Language (BEL)
Resource Description Framework (RDF)
Web Ontology Language (OWL)
Different formalisms for KGs
Example BEL Statement
Subject: a(CHEBI:corticosteroid)
Predicate: decreases
Object: bp(MESH:"Oxidative Stress")
Each term has the form Type(Namespace:Name), where Namespace:Name is the identifier.
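As a minimal sketch of how the statement above decomposes, the following Python splits a simple BEL statement into subject, predicate, and object, and each term into Type, Namespace, and Name. This is not PyBEL's parser: the regular expressions, the `Term` tuple, and the short list of relations are assumptions made for illustration, and they only cover simple terms of the form Type(Namespace:Name).

```python
import re
from typing import NamedTuple, Tuple


class Term(NamedTuple):
    function: str   # term type, e.g. "a" (abundance) or "bp" (biological process)
    namespace: str  # e.g. "CHEBI" or "MESH"
    name: str       # e.g. "corticosteroid"


# A small, illustrative subset of BEL relations (not the full language).
RELATIONS = ("increases", "decreases", "directlyIncreases",
             "directlyDecreases", "association")

# Matches Type(Namespace:Name) or Type(Namespace:"Quoted Name").
TERM_RE = re.compile(r'^(\w+)\((\w+):(?:"([^"]+)"|([^\s")]+))\)$')
STATEMENT_RE = re.compile(
    r'^(?P<subject>.+?\))\s+(?P<predicate>' + "|".join(RELATIONS)
    + r')\s+(?P<object>.+)$'
)


def parse_term(text: str) -> Term:
    """Parse one simple BEL term into its type, namespace, and name."""
    match = TERM_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple BEL term: {text!r}")
    function, namespace, quoted, bare = match.groups()
    return Term(function, namespace, quoted if quoted is not None else bare)


def parse_statement(text: str) -> Tuple[Term, str, Term]:
    """Split a statement into (subject, predicate, object)."""
    match = STATEMENT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple BEL statement: {text!r}")
    return (parse_term(match["subject"]), match["predicate"],
            parse_term(match["object"]))


subject, predicate, obj = parse_statement(
    'a(CHEBI:corticosteroid) decreases bp(MESH:"Oxidative Stress")'
)
print(subject, predicate, obj)
```

Quoted names are needed whenever a name contains spaces, which is why the object keeps its double quotes in the source statement.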
Causal Modeling with BEL
Petri Nets 1
Influence Maps, Differential Equation Models 3
Causal Networks 2 (e.g., BEL)
Higher Granularity
Goals of this Thesis
Part 1: Generation
Development of PyBEL
Parser and
Validator
Network Data Structure
Data Converter
Database
External Data (BEL Script, etc.)
Visualize
Hoyt, C. T., et al. (2017). PyBEL: a computational framework for Biological Expression Language. Bioinformatics, 34(4), 703–704.
Domingo-Fernández, D., et al. (2017). Multimodal mechanistic signatures for neurodegenerative diseases (NeuroMMSig): a web server for mechanism enrichment. Bioinformatics, 33(22), 3679–3681.
Curation in the Cloud
Hoyt, C. T., et al. (2019). Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database, 2019.
Manual Curation is Unsustainable
Reference: https://www.ncbi.nlm.nih.gov/pubmed/?term=chemistry accessed on 2019-11-03
Automatic Extraction with INDRA
Gyori, B. M., et al. (2017). From word models to executable models of signaling networks using automated assembly. Molecular Systems Biology, 13(11), 954.
Text
Rational Curation
Hoyt, C. T., et al. (2019). Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database, 2019.
Integrating Biological Data Sources in Bio2BEL
Reproducible, automated, reliable acquisition and transformation of biological data sources to BEL
Hoyt, C. T., et al. (2019). Integration of Structured Biological Data Sources using Biological Expression Language. bioRxiv, 631812.
... and getting bigger
Visualization and Exploration in BEL Commons
Hoyt, C. T., et al. (2018). BEL Commons: an environment for exploration and analysis of networks encoded in Biological Expression Language. Database, 2018(3), 1–11.
Part 2: Applications
What can we do with these knowledge graphs?
Drug Discovery and Development
Scannell, J. W., Blanckley, A., Boldon, H., & Warrington, B. (2012). Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3), 191–200.
The Challenge: Unraveling the Triangle
Drug Discovery/Drug Repositioning
Proteochemometrics
Target Prioritization
Network Representation Learning (NRL)
Intuition from the DeepWalk Algorithm
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. https://doi.org/10.1145/2623330.2623732
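The intuition can be sketched in a few lines: DeepWalk samples truncated, uniform random walks starting from every node and treats each walk as a "sentence" for a skip-gram model. The toy graph, the function names, and the parameter values below are illustrative assumptions, and the skip-gram training step is omitted.

```python
import random


def random_walk(adjacency, start, walk_length, rng):
    """Walk from `start`, choosing a uniformly random neighbor at each step."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(adjacency, number_walks, walk_length, seed=0):
    """Produce `number_walks` walks per node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(number_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(adjacency, node, walk_length, rng))
    return walks


# Hypothetical undirected toy graph as an adjacency list.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
walks = generate_walks(toy_graph, number_walks=8, walk_length=8)
print(walks[0])
```

Nodes that co-occur within a window of the same walk end up with similar embeddings once the walks are fed to a skip-gram model.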
Emig's Approach to Target Prioritization
Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4).
Gene Features (manually selected)
Gene Labels (Disease-Associated Genes)
Logistic Regression Classifier
Protein-Protein Interaction Network
Differential Gene Expression
GuiltyTargets
Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161
Gene Features (learned with NRL)
Gene Labels (Disease-Associated Genes)
Logistic Regression Classifier
Protein-Protein Interaction Network
Differential Gene Expression
Emig vs. GuiltyTargets Evaluation (AUC-ROC)
Disease | Emig (published) | Emig (redo) | GuiltyTargets (ours) | GuiltyTargets (randomized) |
Acute myeloid leukemia | .82 | .82 | .93 | .50 |
Hepatocellular carcinoma | .80 | .73 | .94 | .51 |
Idiopathic pulmonary fibrosis | .88 | .55 | .92 | .50 |
Liver cirrhosis | .67 | .55 | .94 | .51 |
Multiple sclerosis | .72 | .76 | .93 | .50 |
Alzheimer's disease | - | .63 | .94 | .50 |
... | ... | ... | ... | ... |
Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161
Disclaimer: as of the week of this defense, we have new results that differ slightly but do not contradict the conclusions drawn from this table.
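To make the metric in the table concrete, here is a minimal sketch of AUC-ROC computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The function name and toy inputs are assumptions for illustration; this is not the evaluation code behind the table.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example: one of the two positives is ranked below a negative.
print(auc_roc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

Under this definition, a classifier that ranks examples at random is expected to score about .50, which matches the randomized column.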
Limitations of GuiltyTargets
Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161
Side Effect Network (SEffNet)
Training Edges
Node and Edge Features (learned with NRL)
Testing Edges + Negative Sampling
Logistic Regression Classifier
Heterogeneous Network
Aldisi, R. ... & Hoyt, C.T. (2020) Applications of Network Representation Learning for Drug Repositioning in Parkinson's Disease. Manuscript in preparation
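The link-prediction setup outlined on this slide can be sketched as follows: node embeddings are combined into edge features, known edges are labeled positive, and randomly sampled non-edges are labeled negative before a classifier is trained. The Hadamard (element-wise) product is one common way to build edge features and is an assumption here, as are the toy embeddings, node names, and function names; the slide does not state which operator SEffNet uses.

```python
import random
from itertools import combinations


def hadamard(u, v):
    """Edge feature as the element-wise product of two node embeddings."""
    return [a * b for a, b in zip(u, v)]


def sample_negative_edges(nodes, positive_edges, k, seed=0):
    """Sample k node pairs that are not known edges (treated as negatives)."""
    rng = random.Random(seed)
    known = {frozenset(edge) for edge in positive_edges}
    candidates = [
        pair for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) not in known
    ]
    if k > len(candidates):
        raise ValueError("not enough non-edges to sample from")
    return rng.sample(candidates, k)


# Hypothetical 2-dimensional embeddings, as if already learned with NRL.
embeddings = {
    "drug_a": [0.9, 0.1],
    "drug_b": [0.2, 0.8],
    "effect_x": [0.8, 0.2],
    "effect_y": [0.1, 0.9],
}
test_edges = [("drug_a", "effect_x"), ("drug_b", "effect_y")]
# Simplification: negatives are drawn from all node pairs, regardless of type.
negative_edges = sample_negative_edges(embeddings, test_edges, k=len(test_edges))

features = [hadamard(embeddings[u], embeddings[v])
            for u, v in test_edges + negative_edges]
labels = [1] * len(test_edges) + [0] * len(negative_edges)
print(list(zip(labels, features)))
```

The resulting `features` and `labels` are what a logistic regression classifier would be trained and evaluated on.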
Side Effect Network (SEffNet)
UMLS | Name | LOR |
C0006384 | Bundle branch block | .000 |
C0575090 | Balance disorder | .000 |
C0878544 | Cardiomyopathy | .001 |
C0233794 | Memory impairment | .001 |
C0004239 | Atrial flutter | .001 |
C0160390 | Liver injury | .001 |
C0020676 | Hypothyroidism | .001 |
C0002884 | Hypochromic anaemia | .001 |
C0034069 | Pulmonary fibrosis | .001 |
C0233477 | Dysphoria | .001 |
Aldisi, R. ... & Hoyt, C.T. (2020) Applications of Network Representation Learning for Drug Repositioning in Parkinson's Disease. Manuscript in preparation
Ongoing and Next Steps
Himmelstein, D. S., et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. ELife, 6.
Impact
*Interest and ongoing adoption in both academia and industry
Code and Data Availability
PyBEL: https://github.com/pybel/pybel
Re-curation Workflow: https://github.com/pybel/pybel-git
Enrichment Workflow: https://github.com/bel-enrichment
Bio2BEL: https://github.com/bio2bel
BEL Commons: https://github.com/bel-commons
GuiltyTargets: https://github.com/guiltytargets
SEffNet: https://github.com/seffnet
Lessons Learned in Communication
Acknowledgements
Supervision
Committee
Master's Students
Özlem Muslu, Rana Aldisi, Lingling Xu, Maurici Pio de Lacerda, Vinay Bharadhwaj
Coworkers
Collaborators
Special Thanks
Family
Friends
PhD Crew, Girgit Crew, FFF Crew, ESN Bonn
Projects and Funding
AETIONOMY (IMI): https://www.aetionomy.eu/
B-IT Foundation: http://www.b-it-center.de/
Cytoscape Consortium: https://cytoscape.org/
The Human Brain Pharmacome (Fraunhofer): https://pharmacome.scai.fraunhofer.de/
Fraunhofer Center for Machine Learning: https://www.cit.fraunhofer.de/de/zentren/maschinelles-lernen.html
Ph.D. Publications
Other Publications
Extras
Re-curation
Re-curation Workflow
Re-curation Results on NeuroMMSig
Significant effort by many people to ensure the syntactic and semantic quality of BEL
Manual curation added substantial, high-quality, novel biological content.
Manual Curation
CONSO
CONIB
Curators | 10 |
Full Text Articles | 353 |
Time (in Months) | 24 |
Authors | 1969 |
Nodes | 5862 |
Edges | 20860 |
Text Mining
Biology Has Intricate Relationships
Types
Modes
Directionality
Polarity
Contact
States
Relation Extraction Methods
Manual
Automatic
Semi-automatic
Rule-based Extraction in REACH
Valenzuela-Escárcega, et al. Large-scale automated machine reading discovers new cancer-driving mechanisms. Database (2018) Vol. 2018: article ID bay098; doi:10.1093/database/bay098
Rational Enrichment
Priority Subgraphs
Discussion with Fraunhofer IME and Hugo Geerts (In Silico Biosciences) led to a prioritization of signatures for re-curation
Criteria
Top 4 Subgraphs
BEL Commons
Emig's Approach
Emig's Data Sources
Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4).
Emig's Local Network Features
Interconnectivity
Neighborhood Scoring
Reference: Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4). https://doi.org/10.1371/journal.pone.0060618
Emig's Global Network Features
Random Walks and Network Propagation
Reference: Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4). https://doi.org/10.1371/journal.pone.0060618
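As a generic illustration of network propagation, the sketch below implements a random walk with restart: at each step, probability mass spreads evenly to neighbors, and a fixed fraction returns to the seed nodes. This is a textbook formulation and not necessarily the exact variant used by Emig et al.; the toy graph, the restart probability, and all names are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence."""
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                # Isolated node: return its mass to the seeds.
                for m in nodes:
                    nxt[m] += (1.0 - restart) * p[n] * p0[m]
                continue
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical path graph A - B - C - D with A as the only seed.
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
rwr_scores = random_walk_with_restart(path_graph, seeds={"A"})
print(rwr_scores)
```

Nodes closer to the seed receive higher scores, which is the property that makes propagation useful as a global network feature.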
GuiltyTargets
Natural Language Model: Skip-Gram
A language model that maximizes the co-occurrence probability of words in the same window
Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.
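A small sketch of what "words in the same window" means in practice: skip-gram training data consists of (center, context) pairs drawn from a sliding window over each sentence. The function name and the example sentence are assumptions for illustration; the model that learns from these pairs is omitted.

```python
def skipgram_pairs(tokens, window):
    """List every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


print(skipgram_pairs(["corticosteroid", "decreases", "oxidative", "stress"], window=1))
```

In network representation learning, the tokens are the nodes of a random walk rather than the words of a sentence.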
GAT2VEC
Extends DeepWalk to handle nodes' labels
Sheikh, N., Kefato, Z., & Montresor, A. (2018). Gat2Vec: Representation Learning for Attributed Graphs. Computing, 1–23. https://doi.org/10.1007/s00607-018-0622-9
GAT2VEC Algorithm
Given: an attributed graph G = (V, E, F), attributes A, and attribution function F: V → 2^A (the power set of A)
Sheikh, N., Kefato, Z., & Montresor, A. (2018). Gat2Vec: Representation Learning for Attributed Graphs. Computing, 1–23. https://doi.org/10.1007/s00607-018-0622-9
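One way to picture how attributes enter the walks, as a rough sketch: the attribution function F defines a bipartite graph between nodes and attributes, and an attribute walk hops node → attribute → node while recording only the nodes. The code below follows that reading of the Gat2Vec paper and is an assumption rather than the authors' implementation; the gene and attribute names are hypothetical.

```python
import random


def invert(attributes_of):
    """Turn F: node -> set of attributes into attribute -> set of nodes."""
    nodes_with = {}
    for node, attrs in attributes_of.items():
        for attr in attrs:
            nodes_with.setdefault(attr, set()).add(node)
    return nodes_with


def attribute_walk(attributes_of, nodes_with, start, walk_length, rng):
    """Walk over the node-attribute bipartite graph, keeping only nodes."""
    walk = [start]
    while len(walk) < walk_length:
        attrs = attributes_of[walk[-1]]
        if not attrs:  # node without attributes: nothing to hop through
            break
        attr = rng.choice(sorted(attrs))
        walk.append(rng.choice(sorted(nodes_with[attr])))
    return walk


# Hypothetical attribution function F (e.g. direction of differential expression).
F = {
    "gene_1": {"up"},
    "gene_2": {"up", "down"},
    "gene_3": {"down"},
    "gene_4": set(),
}
nodes_with = invert(F)
rng = random.Random(0)
attribute_walks = [attribute_walk(F, nodes_with, n, 5, rng)
                   for n in sorted(F) if F[n]]
print(attribute_walks)
```

These attribute walks would be combined with ordinary structural walks over E before skip-gram training, so that nodes sharing attributes are embedded close together even when they are not directly connected.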
GuiltyTargets Workflow
Random walks and GAT2VEC
Disease-specific differential gene expression
Emig, D., et al. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4).
Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161
GuiltyTargets Predictions for AD
Symbol | Protein Type/Class | Score |
CHRNB4 | Nicotinic acetylcholine receptor subunit | .700 |
ITPR1 | IP3 receptor | .689 |
GLRA2 | Ligand-gated chloride channel subunit | .619 |
COMT | Catechol-O-methyltransferase | .587 |
GRIK2 | Ionotropic glutamate receptor subunit | .587 |
CHRM4 | Muscarinic acetylcholine receptor | .586 |
CHRFAM7A | Nicotinic acetylcholine receptor | .557 |
HTR7 | Serotonin receptor 7 | .532 |
KCNK3 | Potassium K+ channels | .523 |
.... | ... | ... |
Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161
SEffNet
Learning on SEffNet
Node Embeddings
Edge Embeddings
Downstream Task
Node2vec
Return parameter (p): controls the likelihood of immediately revisiting the previous node in a walk (a higher p makes returning less likely)
In/out parameter (q): controls whether the walk stays close to the previous node (q > 1) or moves outward to more distant nodes (q < 1)
Reference: Grover, A., & Leskovec, J. (2016). Node2Vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855–864). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939754
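The effect of p and q is easiest to see in the second-order transition rule from the node2vec paper: when the walk has just moved from `previous` to `current`, each candidate next node gets weight 1/p if it is `previous`, 1 if it is also a neighbor of `previous`, and 1/q otherwise. The sketch below assumes an unweighted, undirected graph; the toy graph and function name are illustrative, and the parameter values are the ones reported for node2vec in the hyperparameter table.

```python
def node2vec_transition(adjacency, previous, current, p, q):
    """Normalized probabilities for the next step of a node2vec walk."""
    weights = {}
    for candidate in adjacency[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p   # return to the previous node
        elif candidate in adjacency[previous]:
            weights[candidate] = 1.0       # stay at distance 1 from previous
        else:
            weights[candidate] = 1.0 / q   # move outward, away from previous
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Hypothetical toy graph; the walk just moved from "t" to "v".
graph = {
    "t": {"v", "x1"},
    "v": {"t", "x1", "x2"},
    "x1": {"t", "v"},
    "x2": {"v"},
}
probabilities = node2vec_transition(graph, previous="t", current="v", p=2.3, q=1.9)
print(probabilities)
```

With both p and q above 1, the walk is biased toward nodes near the previous node ("x1") over both backtracking ("t") and moving outward ("x2").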
Edge2vec
Reference: Gao, Z., Fu, G., Ouyang, C., Tsutsui, S., Liu, X., Yang, J., … Ding, Y. (2018). edge2vec: Representation learning using edge semantics for biomedical knowledge discovery. Retrieved from http://arxiv.org/abs/1809.02269
Method | Parameters and Values | Mean AUC-ROC | Mean AUC-PR | Mean MCC |
node2vec | Dimensions = 300; Walk length = 8; Number of walks = 8; Window size = 4; Return parameter (p) = 2.3; In/out parameter (q) = 1.9 | .977 | .981 | .877 |
DeepWalk | Dimensions = 300; Walk length = 8; Number of walks = 8; Window size = 2 | .969 | .974 | .846 |
HOPE | Dimensions = 300 | .937 | .962 | .842 |
GraRep | Dimensions = 300; k-step = 3 | .977 | .981 | .866 |
LINE | Dimensions = 300; Proximity order = 3; Epochs = 5 | .979 | .983 | .867 |
SDNE | Proximity balance (a) = 0.128; Reconstruction weight (b) = 14; Epochs = 25 | .927 | .949 | .648 |
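For readers less familiar with the last column, the Matthews correlation coefficient (MCC) summarizes all four cells of a binary confusion matrix in a single value between -1 and 1. The sketch below uses the standard formula; the function name is an assumption, and returning 0 when the denominator vanishes is a common convention rather than part of the definition.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


print(matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 1]))
```

Unlike AUC-ROC and AUC-PR, MCC is computed on thresholded predictions, so a model can rank well yet still show a lower MCC, as the SDNE row illustrates.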
Robustness of SEffNet