1 of 73

Generation and Application of Biomedical Knowledge Graphs

PhD Colloquium of Charles Tapley Hoyt

ORCID: 0000-0003-4423-4370

Presented December 3rd, 2019 at the University of Bonn

First reviewer: Prof. Dr. Martin Hofmann-Apitius

Second reviewer: Prof. Dr. Andreas Weber

Member from a related field: Prof. Dr. Thomas Schultz

Member from an unrelated field: Prof. Dr. Diana Imhof

2 of 73

Knowledge Graphs for Storage and Integration

Scannell, J. W., et al. (2012). Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3), 191–200.

3 of 73

Knowledge Graphs for Storage and Integration

4 of 73

Biomedical Knowledge Graphs (KGs)

  • Enable formalization of knowledge (as triples)
  • Goals of biomedical KGs
    • Information retrieval, exploration, and visualization
    • Reason over experimental and clinical observations
    • Propose new experiments
  • Comparison of BEL
    • Provenance
    • Experimental and biological contextualization
    • Close to mechanistic

Different formalisms for KGs

  • Systems Biology Markup Language (SBML)
  • Biological Pathways Exchange (BioPAX)
  • Biological Expression Language (BEL)
  • Resource Description Framework (RDF)
  • Web Ontology Language (OWL)

5 of 73

Example BEL Statement

Subject | Predicate | Object

a(CHEBI:corticosteroid) | decreases | bp(MESH:"Oxidative Stress")

Term labels: Type | Namespace | Name

Identifier
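
The example statement above can be split into a subject-predicate-object triple whose terms follow the Type(Namespace:Name) pattern. The following is a minimal sketch, not PyBEL's parser: the regular expression, the `Term` class, and both function names are hypothetical, and only the simple term shape shown on this slide is handled.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for simple terms of the form Type(Namespace:Name),
# where Name may be wrapped in double quotes if it contains spaces.
TERM_RE = re.compile(
    r'^(?P<type>\w+)\((?P<namespace>\w+):(?P<name>"[^"]+"|[^\s")]+)\)$'
)


@dataclass(frozen=True)
class Term:
    type: str
    namespace: str
    name: str


def parse_term(text: str) -> Term:
    """Parse one simple term such as a(CHEBI:corticosteroid)."""
    match = TERM_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple BEL term: {text!r}")
    return Term(match["type"], match["namespace"], match["name"].strip('"'))


def parse_statement(statement: str, predicates=("decreases", "increases")):
    """Split a statement into (subject, predicate, object)."""
    for predicate in predicates:
        subject, separator, obj = statement.partition(f" {predicate} ")
        if separator:
            return parse_term(subject), predicate, parse_term(obj)
    raise ValueError(f"no known predicate in: {statement!r}")


triple = parse_statement(
    'a(CHEBI:corticosteroid) decreases bp(MESH:"Oxidative Stress")'
)
print(triple)
```

Real BEL also has nested terms, modifiers, and many more relations, which this sketch deliberately ignores.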

6 of 73

Causal Modeling with BEL

  • Petri Nets [1]
  • Causal Networks (e.g., BEL) [2]
  • Influence Maps, Differential Equation Models [3]

Higher Granularity

Image sources:

  1. https://upload.wikimedia.org/wikipedia/commons/f/fe/Detailed_petri_net.png
  2. https://neurommsig.scai.fraunhofer.de/
  3. Lopez, C. F., et al. (2013). Programming biological models in Python using PySB. Molecular Systems Biology, 9(646), 646.

7 of 73

Goals of this Thesis

  1. Generation. Improve methods for curation and semantic data integration to generate highly granular biomedical knowledge graphs
  2. Application. Develop novel methods for using prior biomedical knowledge to propose new biological hypotheses
    1. Investigate the aetiology of disease (target prioritization)
    2. Understand drugs' mechanisms of action (drug repositioning)

8 of 73

Part 1: Generation

9 of 73

10 of 73

Development of PyBEL

  • Previously curated neurodegenerative disease (NDD) KGs from NeuroMMSig
  • Lacked stable open source software to:
    • Check syntactic and semantic correctness
    • Write new algorithms
    • Re-implement previous algorithms
    • Analyze new data

Parser and Validator

Network Data Structure

Data Converter

Database

External Data (BEL Script, etc.)

Visualize

Hoyt, C. T., et al. (2017). PyBEL: a computational framework for Biological Expression Language. Bioinformatics, 34(4), 703–704.

Domingo-Fernández, D., et al. (2017). Multimodal mechanistic signatures for neurodegenerative diseases (NeuroMMSig): a web server for mechanism enrichment. Bioinformatics, 33(22), 3679–3681.

11 of 73

Curation in the Cloud

  • Technologies for checking syntax/semantics
    • Version control system
    • Continuous integration
  • New workflow for achieving curator agreement
  • Proof of concept: re-curated NDD KGs
  • Built Curation of Neurodegeneration Supporting Ontology (CONSO; unpublished)

Hoyt, C. T., et al. (2019). Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database, 2019.

12 of 73

Manual Curation is Unsustainable

  • Led curation of 353 full-text articles
    • Curation of Neurodegeneration in BEL (CONIB; unpublished)
  • There is too much literature: it almost doubles every ten years (example: chemistry in PubMed)
  • Need automatic computer assistance
    • Natural language processing / Text mining

Reference: https://www.ncbi.nlm.nih.gov/pubmed/?term=chemistry accessed on 2019-11-03

13 of 73

Automatic Extraction with INDRA

  • INDRA (Integrated Network and Dynamical Reasoning Assembler)
  • Developed open source at Harvard
  • Combine several text mining systems
  • Runs on huge corpus
  • Calculates confidences for all triples

Gyori, B. M., et al. (2017). From word models to executable models of signaling networks using automated assembly. Molecular Systems Biology, 13(11), 954.

14 of 73

Rational Curation

  • Large-scale reading with INDRA
  • Export as BEL for curation via PyBEL
  • Proof-of-concept: applied to enrich NDD KGs

Hoyt, C. T., et al. (2019). Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database, 2019.

15 of 73

Integrating Biological Data Sources in Bio2BEL

Reproducible, automated, reliable acquisition and transformation of biological data sources to BEL

Hoyt, C. T., et al. (2019). Integration of Structured Biological Data Sources using Biological Expression Language. bioRxiv, 631812.

16 of 73

Integrating Biological Data Sources in Bio2BEL

Hoyt, C. T., et al. (2019). Integration of Structured Biological Data Sources using Biological Expression Language. bioRxiv, 631812.

... and getting bigger

17 of 73

Visualization and Exploration in BEL Commons

  • An environment for summarizing, exploring, and analyzing BEL KGs
  • Available on web at https://bel-commons.scai.fraunhofer.de

Hoyt, C. T., et al. (2018). BEL Commons: an environment for exploration and analysis of networks encoded in Biological Expression Language. Database, 2018(3), 1–11.

18 of 73

Part 2: Applications

What can we do with these knowledge graphs?

19 of 73

20 of 73

21 of 73

Drug Discovery and Development

  • Problems: long and expensive
  • Failure in Phase II from toxicity
  • Failure in Phase III from lack of efficacy

Scannell, J. W., Blanckley, A., Boldon, H., & Warrington, B. (2012). Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3), 191–200.

22 of 73

The Challenge: Unraveling the Triangle

Drug Discovery / Drug Repositioning

Proteochemometrics

Target Prioritization

23 of 73

The Challenge: Unraveling the Triangle

Drug Discovery / Drug Repositioning

Target Prioritization

Proteochemometrics

24 of 73

Network Representation Learning (NRL)

  • Generate dense embeddings for vertices in a KG
  • Main tasks:
    • Clustering / Downstream Machine Learning
    • Link Prediction
    • Entity Disambiguation

25 of 73

Intuition from the DeepWalk Algorithm

  1. Generate random walks starting at each node
  2. Train a Skip-Gram language model using nodes as words and walks as sentences
  3. Output a vector for each node for downstream machine learning
  4. *GAT2VEC additionally includes node labels (e.g., differential gene expression)
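
The first two steps above can be sketched with the standard library alone. This is an illustrative sketch, not the DeepWalk reference implementation: the toy graph, the function names, and all parameter values are assumptions, and the code stops at the (center, context) pairs that a Skip-Gram model would be trained on.

```python
import random


def random_walks(graph, walks_per_node, walk_length, rng):
    """Generate uniform random walks starting at every node."""
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


def skip_gram_pairs(walk, window):
    """Treat a walk as a sentence and list (center, context) node pairs."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical toy graph given as an adjacency list.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
walks = random_walks(graph, walks_per_node=2, walk_length=5, rng=random.Random(0))
pairs = [pair for walk in walks for pair in skip_gram_pairs(walk, window=2)]
```

In practice the walks would be handed to an existing word2vec implementation to obtain the node vectors of step 3.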

Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online Learning of Social Representations. https://doi.org/10.1145/2623330.2623732

26 of 73

Emig's Approach to Target Prioritization

Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4).

Gene Features (manually selected)

Gene Labels

(Disease Associated Genes)

Logistic Regression Classifier

Protein-Protein Interaction Network

Differential Gene Expression

27 of 73

GuiltyTargets

Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161

Gene Features (learned with NRL)

Gene Labels

(Disease Associated Genes)

Logistic Regression Classifier

Protein-Protein Interaction Network

Differential Gene Expression

28 of 73

Emig vs. GuiltyTargets Evaluation (AUC-ROC)

Disease | Emig (published) | Emig (redo) | GuiltyTargets (ours) | GuiltyTargets (randomized)
Acute myeloid leukemia | .82 | .82 | .93 | .50
Hepatocellular carcinoma | .80 | .73 | .94 | .51
Idiopathic pulmonary fibrosis | .88 | .55 | .92 | .50
Liver cirrhosis | .67 | .55 | .94 | .51
Multiple sclerosis | .72 | .76 | .93 | .50
Alzheimer's disease | - | .63 | .94 | .50
... | ... | ... | ... | ...

Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161

***Disclaimer: we have new results as of the week of this defense that are slightly different, but don't contradict the conclusions from this table

29 of 73

Limitations of GuiltyTargets

  • Built on homogeneous PPI network
    • Could use more fine-grained networks generated in Part 1
  • Limited novelty due to guilt-by-association
  • No edge weights
  • Single task machine learning

Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161

30 of 73

Side Effect Network (SEffNet)

  • KG with drugs, targets, diseases, and side effects
  • Investigated performance benefits of edge weights
  • Multi-task machine learning

Training Edges

Node and Edge Features (learned with NRL)

Testing Edges + Negative Sampling

Logistic Regression Classifier

Heterogeneous Network
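
The slide does not state how negative edges were sampled for testing. One common choice, shown here purely as an assumption, is to draw node pairs uniformly at random and keep those that are not known edges. The node names and the function name are hypothetical, and the graph is treated as undirected without self-loops.

```python
import random


def sample_negative_edges(nodes, positive_edges, n_samples, rng):
    """Sample undirected node pairs that are not among the known edges."""
    positives = {frozenset(edge) for edge in positive_edges}
    nodes = sorted(nodes)
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    if n_samples > max_negatives:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        pair = frozenset(rng.sample(nodes, 2))  # two distinct nodes
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Hypothetical toy network with drugs, a target, and a side effect.
nodes = ["drug1", "drug2", "target1", "effect1"]
positive_edges = [("drug1", "target1"), ("drug1", "effect1"), ("drug2", "target1")]
negative_edges = sample_negative_edges(nodes, positive_edges, 3, random.Random(0))
```

Positive test edges and sampled negatives together would form the labeled set for the classifier.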

Aldisi, R. ... & Hoyt, C.T. (2020) Applications of Network Representation Learning for Drug Repositioning in Parkinson's Disease. Manuscript in preparation

31 of 73

Side Effect Network (SEffNet)

UMLS | Name | LOR
C0006384 | Bundle branch block | .000
C0575090 | Balance disorder | .000
C0878544 | Cardiomyopathy | .001
C0233794 | Memory impairment | .001
C0004239 | Atrial flutter | .001
C0160390 | Liver injury | .001
C0020676 | Hypothyroidism | .001
C0002884 | Hypochromic anaemia | .001
C0034069 | Pulmonary fibrosis | .001
C0233477 | Dysphoria | .001

Aldisi, R. ... & Hoyt, C.T. (2020) Applications of Network Representation Learning for Drug Repositioning in Parkinson's Disease. Manuscript in preparation

32 of 73

Ongoing and Next Steps

  • Edge prediction in networks containing drugs, side-effects, and targets to identify drugs' mechanisms of action
  • Comparison of NRL to drug repositioning with engineered features from Hetionet [Himmelstein, et al. (2017)]
  • Multitask target prioritization with Hetionet
  • Incorporation of literals in knowledge graphs such as chemical fingerprints, differential gene expression values, clinical variables, etc.
  • Benchmark more learned representations versus engineered features
  • Extend approaches to clinical modalities (embed patients for predictive modeling)

Himmelstein, D. S., et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. ELife, 6.

33 of 73

Impact

  1. Development of open-source BEL ecosystem
    1. BEL compiler and tools (PyBEL)*
    2. Integration tools (Bio2BEL)*
    3. Interactive web application (BEL Commons)*
  2. Curation
    • New curation strategy (re-curation workflow and new curation guidelines)
    • Semi-automated curation workflow with open source text mining tools
  3. Applications
    • Network representation learning for hypothesis generation (GuiltyTargets)*
    • Comparative mechanism enrichment algorithm (EpiCom; not presented)
    • Automated simulation (BEL2ABM; not presented)

*Interest and ongoing adoption in both academia and industry

34 of 73

Code and Data Availability

35 of 73

Lessons Learned in Communication

  • We haven't cured any disease or elucidated unknown side effects! We presented tools for generating hypotheses
  • Collaboration with experimentalists and clinicians is key
  • Overselling what's possible will decrease our ability to have impact (even if the market is lucrative)

36 of 73

Acknowledgements

Supervision

  • Prof. Dr. Martin Hofmann-Apitius
  • Prof. Dr. Holger Fröhlich

Committee

  • Prof. Dr. Andreas Weber
  • Prof. Dr. Diana Imhof
  • Prof. Dr. Thomas Schultz

Master's Students

Özlem Muslu, Rana Aldisi, Lingling Xu, Maurici Pio de Lacerda, Vinay Bharadhwaj

Coworkers

  • Fraunhofer SCAI.Bio
  • Fraunhofer SCAI.IT
  • Fraunhofer FIT
  • Fraunhofer IME

Collaborators

  • Scott Colby (Stanford)
  • Dr. Dexter Pratt (UCSD/Cytoscape)
  • Dr. John Bachman and Dr. Ben Gyori (Harvard)
  • Max Berrendorf (LMU Munich)
  • Laurent Vermue (Technical University of Denmark)
  • Dr. Dénes Türei, Nicolàs Palacio-Escat, and Prof. Dr. Julio Saez-Rodriguez (University of Heidelberg/EMBL)

Special Thanks

  • Daniel Domingo Fernández
  • Mehdi Ali (he's a good guy)
  • Emperor André Gemünd

Family

Friends

PhD Crew, Girgit Crew, FFF Crew, ESN Bonn

37 of 73

Projects and Funding

AETIONOMY (IMI): https://www.aetionomy.eu/

B-IT Foundation: http://www.b-it-center.de/

Cytoscape Consortium: https://cytoscape.org/

The Human Brain Pharmacome (Fraunhofer): https://pharmacome.scai.fraunhofer.de/

Fraunhofer Center for Machine Learning: https://www.cit.fraunhofer.de/de/zentren/maschinelles-lernen.html

38 of 73

Ph.D. Publications

  1. Hoyt, C. T., Konotopez, A., & Ebeling, C. (2017). PyBEL: a computational framework for Biological Expression Language. Bioinformatics (Oxford, England), 34(4), 703–704. https://doi.org/10.1093/bioinformatics/btx660
  2. Hoyt, C. T., Domingo-Fernández, D., & Hofmann-Apitius, M. (2018). BEL Commons: an environment for exploration and analysis of networks encoded in Biological Expression Language. Database, 2018(3), 1–11. https://doi.org/10.1093/database/bay126
  3. Hoyt, C. T., et al. (2019). Re-curation and rational enrichment of knowledge graphs in Biological Expression Language. Database, 2019(1). https://doi.org/10.1093/database/baz068
  4. Hoyt, C. T., et al. (2019). Integration of Structured Biological Data Sources using Biological Expression Language. bioRxiv, 631812. https://doi.org/10.1101/631812
  5. Gündel, M., Hoyt, C. T., & Hofmann-Apitius, M. (2018). BEL2ABM: Agent-based simulation of static models in Biological Expression Language. Bioinformatics (Oxford, England), 34(13), 2316–2318. https://doi.org/10.1093/bioinformatics/bty107
  6. Hoyt, C. T., et al. (2018). A systematic approach for identifying shared mechanisms in epilepsy and its comorbidities. Database, 2018(1). https://doi.org/10.1093/database/bay050
  7. Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161. https://doi.org/10.1101/521161

39 of 73

Other Publications

  1. Bradford, R., Sturm, T., Weber, A., Davenport, J. H., England, M., Errami, H., Gerdt, V., Grigoriev, D., Hoyt, C. T., Košta, M., & Radulescu, O. (2017). A Case Study on the Parametric Occurrence of Multiple Steady States. In Proceedings of the 2017 ACM on International Symposium on Symbolic and Algebraic Computation - ISSAC ’17 (Vol. Part F1293, pp. 45–52). New York, New York, USA: ACM Press.
  2. Domingo-Fernández, D., Hoyt, C. T., Alvarez, C. B., Marin-Llao, J., & Hofmann-Apitius, M. (2018). ComPath: an ecosystem for exploring, analyzing, and curating mappings across pathway databases. Npj Systems Biology and Applications, 5(1), 3.
  3. Domingo-Fernández, D., Mubeen, S., Marín-Llaó, J., Hoyt, C. T., & Hofmann-Apitius, M. (2019). PathMe: merging and exploring mechanistic pathway knowledge. BMC Bioinformatics, 20(1), 243.
  4. Ali, M., Hoyt, C. T., Domingo-Fernández, D., Lehmann, J., & Jabeen, H. (2019). BioKEEN: A library for learning and evaluating biological knowledge graph embeddings. Bioinformatics (Oxford, England).
  5. Bradford, R., Davenport, J. H., England, M., Errami, H., Gerdt, V., Grigoriev, D., Hoyt, C. T., Kosta, M., Radulescu, O., Sturm, T., & Weber, A. (2019). Identifying the Parametric Occurrence of Multiple Steady States for some Biological Networks. arXiv, 1902.04882
  6. Brito, E., Georgiev, B., Domingo-Fernández, D., Hoyt, C. T., & Bauckhage, C. (2019) RatVec: A General Approach for Low-dimensional Distributed Vector Representations via Domain-specific Rational Kernels. In Proceedings of ECML PKDD

  7. Ali, M., Jabeen, H., Hoyt, C. T., & Lehmann, J. (2019). The KEEN Universe: An Ecosystem for Knowledge Graph Embeddings with a Focus on Reproducibility and Transferability. In Proceedings of ISWC.
  8. Ali, M., Domingo-Fernández, D., Hoyt, C. T., & Lehmann, J. (2019). Predicting Missing Links Using PyKEEN. In Proceedings of ISWC.
  9. Karki, K., Kodamullil, A. T., Hoyt, C. T., & Hofmann-Apitius, M. (2019). Quantifying mechanisms in neurodegenerative diseases (NDDs) using candidate mechanism perturbation amplitude (CMPA) algorithm. BMC Bioinformatics, 20(1), 494.
  10. Emon, M. A., Domingo-Fernández, D., Hoyt, C. T., & Hofmann-Apitius, M. (2019). PS4DR: a multimodal workflow for identification and prioritization of drugs based on pathway signatures. BMC Bioinformatics, submitted.
  11. Mubeen, S., Hoyt, C. T., Gemünd, A., Hofmann-Apitius, M., Fröhlich, H., & Domingo-Fernández, D. (2019). The Impact of Pathway Database Choice on Statistical Enrichment Analysis and Predictive Modeling. Frontiers in Genetics.
  12. Humayun, F., Domingo-Fernández, D., George, A. A. P., Hopp, M.-T., Syllwasschy, B. F., Detzel, M. S., Hoyt, C. T., Hofmann-Apitius, M., & Imhof, D. (2019). A computational approach for mapping heme biology in the context of hemolytic disorders. bioRxiv, 804906.

40 of 73

Extras

41 of 73

Re-curation

42 of 73

Re-curation Workflow

  1. (Team) Handle names
    • Normalize entities from custom namespaces that were never checked
    • Update all namespaces (only a small manual effort)
  2. (Team) Checking correctness
    • High Confidence: a statement can be asserted from the given evidence
    • Medium Confidence: a statement is incorrect or incomplete. Update accordingly.
    • Low Confidence: the meaning of a statement is unclear and should be discussed in a group
    • No Confidence: the statement is not supported by the evidence and cannot be fixed. Delete.
  3. (Leaders) Finalize curation
    • Read all statements with Medium and High confidence annotations and assign Very High where appropriate

43 of 73

Re-curation Results on NeuroMMSig

Significant effort by many people to assure BEL syntactic and semantic quality

Manual curation efforts added huge, high-quality biological novelty.

44 of 73

Manual Curation

45 of 73

CONSO

  • Curation of Neurodegeneration Supporting Ontology (CONSO)
  • Generated during re-curation of NeuroMMSig (and later full-text curation)
  • 31 classes, 366 entities, 449 relationships, 1366 synonyms, and 271 cross references

46 of 73

CONIB

  • Curation of Neurodegeneration in BEL (CONIB)
  • Mechanisms underpinning tauopathies including:
    • Tau modification and hyper-modification
    • Tau aggregation
    • Proteostasis
    • Nicotinic receptor signalling
  • Related diseases and indications (ALS, MS, Huntington's, etc.)

Curators: 10

Full Text Articles: 353

Time (in Months): 24

Authors: 1969

Nodes: 5862

Edges: 20860

47 of 73

Text Mining

48 of 73

Biology Has Intricate Relationships

Types

  • Causal
  • Correlative
  • Associative
  • Ontological

Modes

  • Activities
  • Abundances
  • Efflux

Directionality

  • Uni-directional
  • Bi-directional (symmetric)

Polarity

  • Increase
  • Decrease
  • Unknown
  • None

Contact

  • Direct
  • Indirect

States

  • Experimental context
  • Subcellular location
  • PTMs
  • Fusions
  • Mutations
  • Pre/post-conditions

49 of 73

Relation Extraction Methods

Manual

  • Curation at highest granularity

Automatic

  • Rule-based
    • REACH (REading and Assembling Contextual and Holistic mechanisms from text)
  • Natural Language Processing (NLP)
    • TRIPS
  • Machine Learning / Deep Learning
    • Turku Event Extraction System (TEES)
    • BELIEF (BEL Information Extraction Workflow)

Semi-automatic

  • However good automatic methods become, manual curation will remain necessary

50 of 73

Rule-based Extraction in REACH

Valenzuela-Escárcega, et al. Large-scale automated machine reading discovers new cancer-driving mechanisms. Database (2018) Vol. 2018: article ID bay098; doi:10.1093/database/bay098

51 of 73

Rational Enrichment

52 of 73

Priority Subgraphs

Discussion with Fraunhofer IME and Hugo Geerts (In Silico Biosciences) led to a prioritization of signatures for re-curation

Criteria

  • Novelty
  • Druggability
  • Assay-ability
  • Expert resource availability and advice

Top 4 Subgraphs

  • Tau protein subgraph
  • Inflammatory response subgraph
  • Insulin signal transduction
  • GSK3 subgraph & DKK1 subgraph

53 of 73

BEL Commons

54 of 73

55 of 73

56 of 73

57 of 73

58 of 73

Emig's Approach

59 of 73

Emig's Data Sources

  • Protein-protein interaction (PPI) networks
    • HIPPIE
    • STRING
  • Disease-specific differential gene expression profiles
    • GEO
    • ArrayExpress
  • Fingerprinting genes
    • Local (neighborhood scoring, interconnectivity)
    • Global (random walk, network propagation)
  • Disease-target associations
    • Integrity (http://integrity.thomson-pharma.com)
    • Therapeutic Target Database
  • Positive-unlabelled learning
    • Logistic regression
    • Cross validation

Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4).

60 of 73

Emig's Local Network Features

Interconnectivity

Neighborhood Scoring

Reference: Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4). https://doi.org/10.1371/journal.pone.0060618

61 of 73

Emig's Global Network Features

Random Walks and Network Propagation
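
Network propagation is often implemented as a random walk with restart from a set of seed nodes. The sketch below uses that generic formulation as an assumption; it is not necessarily the exact feature definition used by Emig et al. The function name, the restart probability, and the toy path graph are hypothetical, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(graph, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence."""
    nodes = list(graph)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Each node spreads its remaining mass evenly over its neighbors.
            share = (1.0 - restart) * p[u] / len(graph[u])
            for v in graph[u]:
                nxt[v] += share
        if sum(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical path graph A - B - C - D with a single seed node A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(graph, seeds={"A"})
```

Nodes closer to the seed receive higher scores, which is what makes the result usable as a global network feature.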

Reference: Emig, D., Ivliev, A., Pustovalova, O., Lancashire, L., Bureeva, S., Nikolsky, Y., & Bessarabova, M. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4). https://doi.org/10.1371/journal.pone.0060618

62 of 73

GuiltyTargets

63 of 73

Natural Language Model: Skip-Gram

A language model that maximizes the co-occurrence probability of words in the same window
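
In the usual formulation, this objective is the average log-probability of context words within a window of size c around each word in a sequence w_1, ..., w_T, with the probability given by a softmax over the vocabulary. The symbols (T, c, W, v, v') are introduced here for illustration and do not appear on the slide.

```latex
\max_{\theta} \; \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

Here v and v' are the input and output vectors of a word and W is the vocabulary size; the learned input vectors are the embeddings.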

Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space, 1–12.

64 of 73

GAT2VEC

Extends DeepWalk to handle nodes' labels

Sheikh, N., Kefato, Z., & Montresor, A. (2018). Gat2Vec: Representation Learning for Attributed Graphs. Computing, 1–23. https://doi.org/10.1007/s00607-018-0622-9

65 of 73

GAT2VEC Algorithm

Given: an attributed graph G = (V, E, F), attributes A, and attribution function F: V → 2^A

  1. Append attribute edges {(v, a) | v ∈ V, a ∈ F(v)} to G
  2. Generate γ random walks starting at each v ∈ V
  3. Filter all a ∈ A from the walks
  4. Train a Skip-Gram model as previously in DeepWalk
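
Steps 1 to 3 as listed on this slide can be sketched as follows. This follows the slide's simplified description rather than the published GAT2VEC implementation; the function name, the `("attr", name)` encoding of attribute nodes, and the toy data are assumptions.

```python
import random


def gat2vec_walks(edges, attributes, walks_per_node, walk_length, rng):
    """Walk over the graph augmented with attribute nodes, then filter them out."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, []).append(u)
    # Step 1: connect each vertex v to a node for every attribute a in F(v).
    attribute_nodes = set()
    for vertex, attrs in attributes.items():
        for attr in attrs:
            attr_node = ("attr", attr)
            attribute_nodes.add(attr_node)
            graph.setdefault(vertex, []).append(attr_node)
            graph.setdefault(attr_node, []).append(vertex)
    vertices = [n for n in graph if n not in attribute_nodes]
    walks = []
    # Step 2: generate random walks starting at each vertex.
    for start in vertices:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            # Step 3: filter attribute nodes from the walk.
            walks.append([n for n in walk if n not in attribute_nodes])
    return walks


# Hypothetical genes g1..g3; "up" marks differentially expressed genes.
walks = gat2vec_walks(
    edges=[("g1", "g2"), ("g2", "g3")],
    attributes={"g1": ["up"], "g3": ["up"]},
    walks_per_node=3,
    walk_length=6,
    rng=random.Random(1),
)
```

After filtering, vertices that share an attribute can appear next to each other in a walk even when no edge connects them, which is how labels influence the embeddings.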

Sheikh, N., Kefato, Z., & Montresor, A. (2018). Gat2Vec: Representation Learning for Attributed Graphs. Computing, 1–23. https://doi.org/10.1007/s00607-018-0622-9

66 of 73

GuiltyTargets Workflow

Random walks and GAT2VEC

Disease-specific differential gene expression

Emig, D., et al. (2013). Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLoS ONE, 8(4).

Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161

67 of 73

GuiltyTargets Predictions for AD

Symbol | Protein Type/Class | Score
CHRNB4 | Nicotinic acetylcholine receptor subunit | .700
ITPR1 | IP3 receptor | .689
GLRA2 | Ligand-gated chloride channel subunit | .619
COMT | Catechol-O-methyltransferase | .587
GRIK2 | Ionotropic glutamate receptor subunit | .587
CHRM4 | Muscarinic acetylcholine receptor | .586
CHRFAM7A | Nicotinic acetylcholine receptor | .557
HTR7 | Serotonin receptor 7 | .532
KCNK3 | Potassium K+ channels | .523
... | ... | ...

Muslu, Ö., Hoyt, C. T., Hofmann-Apitius, M., & Fröhlich, H. (2019). GuiltyTargets: Prioritization of Novel Therapeutic Targets with Deep Network Representation Learning. bioRxiv, 521161

68 of 73

SEffNet

69 of 73

Learning on SEffNet

Node Embeddings

  • DeepWalk
  • Node2Vec
  • LINE
  • SDNE
  • HOPE
  • GraRep

Edge Embeddings

  • Concatenation
  • Hadamard

Downstream Task

  • Edge classification with logistic regression
  • Cross validation schema
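
The two edge embedding operators listed above combine the vectors of an edge's two endpoints. A minimal sketch with plain lists (the function names and example vectors are illustrative):

```python
def concatenate(u, v):
    """Edge embedding as the concatenation of both node vectors."""
    return list(u) + list(v)


def hadamard(u, v):
    """Edge embedding as the element-wise product of both node vectors."""
    if len(u) != len(v):
        raise ValueError("node vectors must have the same dimension")
    return [a * b for a, b in zip(u, v)]


source, target = [1.0, 2.0], [3.0, 4.0]
print(concatenate(source, target))  # [1.0, 2.0, 3.0, 4.0]
print(hadamard(source, target))     # [3.0, 8.0]
```

Concatenation doubles the dimension and depends on node order, whereas the Hadamard product keeps the dimension and is symmetric.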

70 of 73

Node2vec

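Return parameter (p): controls how likely a walk is to immediately revisit the previous node (a higher p makes returning less likely)

In/out parameter (q): controls whether a walk stays close to the previous node (q > 1) or moves farther away from it (q < 1)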
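
The effect of p and q can be made concrete with the second-order transition weights from the node2vec paper: 1/p for returning to the previous node, 1 for a neighbor of the previous node, and 1/q for any other neighbor. The sketch below assumes an unweighted graph; the function name and the toy graph are hypothetical.

```python
def node2vec_step_probabilities(graph, previous, current, p, q):
    """Normalized probabilities for the next node after the step previous -> current."""
    weights = {}
    for candidate in graph[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # return to the previous node
        elif candidate in graph[previous]:
            weights[candidate] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[candidate] = 1.0 / q  # move farther away
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Hypothetical graph: the walk has just moved from t to v.
graph = {"t": ["v", "x1"], "v": ["t", "x1", "x2"], "x1": ["t", "v"], "x2": ["v"]}
probabilities = node2vec_step_probabilities(graph, previous="t", current="v", p=2.0, q=0.5)
```

With p = 2 and q = 0.5 the walk favors the unexplored node x2 over returning to t.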

Reference: Grover, A., & Leskovec, J. (2016). Node2Vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855–864). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939754

71 of 73

Edge2vec

  • Calculate edge-edge conditional probabilities
  • Weight walks with a transition probability matrix
  • Encodes heterogeneous edges
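
Weighting a walk with an edge-type transition matrix can be illustrated as below. This is only a sketch of the second bullet: the matrix values are made up, how edge2vec estimates them is not shown, and the function name and edge types are hypothetical. Each target node is assumed to be reachable by exactly one outgoing edge.

```python
def edge_type_step_probabilities(outgoing, previous_edge_type, transition):
    """Bias the next step by the transition weight between edge types."""
    weights = [transition[previous_edge_type][edge_type] for _, edge_type in outgoing]
    total = sum(weights)
    if total == 0:
        raise ValueError("no outgoing edge type is reachable")
    return {target: weight / total for (target, _), weight in zip(outgoing, weights)}


# Made-up transition weights between two hypothetical edge types.
transition = {
    "treats": {"treats": 0.2, "causes": 0.8},
    "causes": {"treats": 0.5, "causes": 0.5},
}
# Outgoing edges of the current node as (target node, edge type).
outgoing = [("disease1", "treats"), ("side_effect1", "causes")]
probabilities = edge_type_step_probabilities(outgoing, "treats", transition)
```

A walk that arrived over a "treats" edge is here four times as likely to continue over the "causes" edge.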

Reference: Gao, Z., Fu, G., Ouyang, C., Tsutsui, S., Liu, X., Yang, J., … Ding, Y. (2018). edge2vec: Representation learning using edge semantics for biomedical knowledge discovery. Retrieved from http://arxiv.org/abs/1809.02269

72 of 73

Method | Parameters | Mean AUC-ROC | Mean AUC-PR | Mean MCC
node2vec | Dimensions = 300; Walk length = 8; Number of walks = 8; Window size = 4; Return parameter (p) = 2.3; In/out parameter (q) = 1.9 | .977 | .981 | .877
DeepWalk | Dimensions = 300; Walk length = 8; Number of walks = 8; Window size = 2 | .969 | .974 | .846
HOPE | Dimensions = 300 | .937 | .962 | .842
GraRep | Dimensions = 300; k-step = 3 | .977 | .981 | .866
LINE | Dimensions = 300; Proximity order = 3; Epochs = 5 | .979 | .983 | .867
SDNE | Proximity balance (a) = 0.128; Reconstruction weight (b) = 14; Epochs = 25 | .927 | .949 | .648
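
For readers unfamiliar with the reported metrics, AUC-ROC and the Matthews correlation coefficient (MCC) can be computed from first principles as sketched below. The labels and scores are made-up toy values, not results from this work, and the 0.5 decision threshold is an assumption.

```python
import math


def auc_roc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def mcc(labels, predictions):
    """Matthews correlation coefficient for binary labels and predictions."""
    tp = sum(1 for y, z in zip(labels, predictions) if y == 1 and z == 1)
    tn = sum(1 for y, z in zip(labels, predictions) if y == 0 and z == 0)
    fp = sum(1 for y, z in zip(labels, predictions) if y == 0 and z == 1)
    fn = sum(1 for y, z in zip(labels, predictions) if y == 1 and z == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(auc_roc(labels, scores))   # 0.75
print(mcc(labels, predictions))  # 0.0
```

AUC-ROC is threshold-free, while MCC depends on the chosen decision threshold, which is one reason the two can diverge, as in the SDNE row.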

73 of 73

Robustness of SEffNet
