1 of 58

Graph diffusion for network–based applications

Nov 8th, 2022

BMI/CS 775 Computational Network Biology�Fall 2022

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

Original slides created by Prof. Sushmita Roy

2 of 58

Topics in this section

  • Graph-based approaches for gene prioritization
  • Graph diffusion to interpret sequence variants
  • Combinatorial graph algorithms for subnetwork selection
  • Multi-omic data integration

3 of 58

Goals for today

  • Gene prioritization
    • Supervised methods
    • Unsupervised methods
  • GeneWanderer: Walking the Interactome for Prioritization of Candidate Disease Genes
    • Example of a supervised, graph diffusion-based approach
  • Other applications of random walks on graphs

4 of 58

Gene prioritization

  • Gene prioritization: rank the most important genes for a particular process or system through integrative computational analysis of public and private genomic data.
  • Many of the approaches developed were originally for finding the important gene(s) from a linkage study
    • A genomic locus associated with a disease that could have hundreds of genes
  • Now used to prioritize genes identified from high-throughput omics experiments

Computational tools for prioritizing candidate genes: boosting disease gene discovery Moreau and Tranchevent, Nature Review Genetics 2012

5 of 58

Why gene prioritization?

  • Identification of genes associated with diseases (and other complex traits) is important for gaining a molecular understanding of a disease
  • Linkage analysis or omics-based measurements identify many candidate genes
    • Can narrow down a region of the genome that might be associated with a disease, or obtain a gene set
    • Too many genes for follow-up analysis
    • Don’t have the resolution to pinpoint specific genes

6 of 58

Linkage disequilibrium

Single nucleotide variants along the genome

Heatmap of correlations

7 of 58

Why gene prioritization?

  • Gene prioritization has many applications
    • What genes control a particular process?
    • What genes are affected in a specific disease?
    • What genes must be tested first to best complete a model of the system?
    • What genes (regulators) establish a particular cell type or fate?

8 of 58

Overview of approaches for gene prioritization

  • Non-network based methods
    • Similarity in different gene-level features can be used to prioritize new genes
      • Sequence, expression, and Gene Ontology
  • Network-based methods
    • Supervised methods: need a small number of seed or known genes
    • Unsupervised methods
    • Network model-based methods

9 of 58

Network-based prioritization

  • Can we use interaction networks to prioritize what genes might be important?
  • Guilt-by-association principle
  • Suppose we have candidate genes we know are important
    • Can we identify additional genes that are also important based on their proximity in the network?

10 of 58

Supervised network-based gene prioritization task definition

  • Given
    • List of candidate genes
    • Background interaction network among genes
    • List of known genes (seed genes) for a particular process or disease
  • Do
    • Rank candidates genes based on their association to the known genes

11 of 58

Supervised network-based gene prioritization

Slide credit: Debbie Chasman

Seed genes

(known genes)

12 of 58

Neighborhoods on graphs

  • Network-based methods rely on defining neighborhoods of genes
  • Neighborhoods of genes known for a disease can be used to prioritize additional genes
  • Local neighborhood
    • Immediate neighbor (direct interaction)
    • Shortest path
  • Global neighborhood
    • Based on more global measures of distance between nodes
  • We have seen some ideas of neighborhoods on graphs in graph clustering

13 of 58

Local network neighborhood-based prioritization

  • Available in many public software and tools
    • MouseNET, GIANT, STRING
  • Applied for filling holes in a pathway

Known members

Novel members

Predictions of novel pathway components from MouseNET

14 of 58

Local network prioritization fails

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease gene we are trying to find

False positives

15 of 58

Local network prioritization fails

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease genes we are trying to find

False positives

16 of 58

Desired global network prioritization

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease genes we are trying to find

Successfully rank the other

disease genes above other

genes

17 of 58

Global network neighborhood-based prioritization

  • Global network neighborhood-based methods make use of the entire graph to define the similarity between two genes
  • Given a graph connecting nodes, global similarity can be defined using
    • Random walk
    • Diffusion kernel
    • Laplacian kernel
    • Heat kernel

18 of 58

Goals for today

  • Gene prioritization
    • Supervised methods
    • Unsupervised methods
  • GeneWanderer: Walking the Interactome for Prioritization of Candidate Disease Genes
    • Example of a supervised, graph diffusion-based approach
  • Other applications of random walks on graphs

19 of 58

Motivation of GeneWanderer

  • Many diseases for which we do not know the causal genes
  • Previous approaches have used protein-protein interaction networks, but in a limited way
    • Only local similarity was used
  • Can we use global graph distance?
  • Does global distance help and improve current prioritization techniques?

S. Köhler, S. Bauer, D. Horn, and P. N. Robinson, "Walking the interactome for prioritization of candidate disease genes." American Journal of Human Genetics 2008

20 of 58

Overview of GeneWanderer

21 of 58

Motivation of using global distance

  • Global similarity is more sensitive and different in each case
  • In contrast, local shortest-path similarity is the same for all pairs
  • Direct interactions will never select y as a candidate

Known gene: x

Candidate gene: y

22 of 58

Global graph distances used in GeneWanderer

  • Random walk with restarts

  • Diffusion kernel

23 of 58

Random walk on graphs

  • A random walk is defined as a probabilistic traversal of a graph
  • At each time step, the walker transitions randomly to one of the neighbors of the current node
  • Typically can start anywhere on the graph
  • But can put priors on the walk based on what we know about candidate genes
  • The random walk converges to some distribution of the number of times a gene is visited
  • The distribution can be used to rank genes

24 of 58

Random walk on graphs

  • Random walk requires transition probabilities
  • Transition probability is obtained by dividing each entry in adjacency matrix by the row sum or column sum
  • One step of the random walk corresponds to one matrix multiplication of the transition matrix
  • Suppose we started at node a, what is the probability of reaching other nodes after t steps?
  • tth power of the transition matrix tell us the probability of reaching node j from node i after t steps on the random walk

25 of 58

Transition matrix of a graph

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

1/3

0

0

0

0.5

0

0.5

1

0.5

0

1/3

0

0

0

0

1/3

0

0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

0.5

0

0

0

1/3

0

1/3

1/3

0.5

0

0.5

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

Row-normalized adjacency matrix

26 of 58

Random Walk with Restarts (RWR)

  • Allow the random walker to restart occasionally

  • W is the column normalized adjacency matrix
    • Original adjacency matrix can be weighted
  • r controls how often we restart the random walk
  • p0 was set such that the random walk would start at any of the known genes with equal probability
  • Rank candidate genes based on the pt+1 when pt and pt+1 are really close
  • In other words, this means that pt+1 is the steady state distribution

Prior probability of starting at a node

27 of 58

RWR continued

  • Intuitively, this approach gives the probability of reaching a specific node, given that a random walk starts at a given node, while taking all intermediate paths into account
  • By making the random walk start from known disease genes we obtain a global proximity measure from these genes

28 of 58

RWR example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

1/3

0

0

0

0.5

0

0.5

1

0.5

0

1/3

0

0

0

0

1/3

0

0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

W: Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

29 of 58

RWR example continued

  • Assume that a and b are “known genes”
  • p0: [0.5 0.5 0 0 0]
  • Let r=0.7

a

b

c

d

e

Iteration 0

0.5

0.5

0

0

0

Iteration 1

a

b

c

d

e

0.4250

0.4250

0.0750

0.0750

0

Iteration 2

a

b

c

d

e

0.4250

0.4213

0.0750

0.0713

0.0075

Iteration 6

(convergence)

a

b

c

d

e

0.4239

0.4212

0.0761

0.0712

0.0076

30 of 58

RWR example continued

  • Assume that a and b are “known genes”
  • p0: [0.5 0.5 0 0 0]
  • Let r=0.5

a

b

c

d

e

Iteration 0

0.5

0.5

0

0

0

Iteration 1

a

b

c

d

e

0.3750

0.3750

0.1250

0.1250

0

Iteration 2

a

b

c

d

e

0.3750

0.3646

0.1250

0.1146

0.0208

Iteration 8

(convergence)

a

b

c

d

e

0.3696

0.3641

0.1304

0.1141

0.0217

31 of 58

Effect of restart probability

r=0.5

Iteration 8

(convergence)

a

b

c

d

e

0.3696

0.3641

0.1304

0.1141

0.0217

Iteration 6

(convergence)

a

b

c

d

e

0.4239

0.4212

0.0761

0.0712

0.0076

r=0.7

32 of 58

Top Hat question

33 of 58

Global graph distances used in GeneWanderer

  • Random walk with restarts

  • Diffusion kernel

34 of 58

Kernel

  • A kernel k is a function that maps pairs of objects to a real-valued space
  • It enables one to define very general similarity functions between pairs of objects
  • Let denote a set of objects of a particular type
    • For example, the set of images or a set of trees
  • A kernel k is defined as

35 of 58

Diffusion kernel

  • Diffusion kernel K of a graph is defined a function of the graph Laplacian L
    • K= exp(-βL)
  • Graph Laplacian L=D-A
    • D is the diagonal degree of matrix
    • A is the adjacency matrix
    • The rank of a gene j is given by a score depending upon its proximity from other disease genes

Kondor and Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces”, ICML 2002

36 of 58

Diffusion kernel

  • The diffusion kernel gives a ranking of a new gene using the sum of a global distance measure from all known genes of a disease
  • Specifically the jth column corresponds to the stationary distribution of a random walk that started at node j

37 of 58

Diffusion kernel

  •  

Gärtner et al., “On Graph Kernels: Hardness Results and Efficient Alternatives”, 2003

 

Kondor and Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces”, ICML 2002

38 of 58

Diffusion kernel example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

 

 

39 of 58

Diffusion kernel example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0.47

0.21

0.09

0.21

0.02

0.21

0.46

0.18

0.09

0.05

0.09

0.18

0.34

0.18

0.21

0.21

0.09

0.18

0.46

0.05

0.02

0.05

0.21

0.05

0.67

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

 

 

40 of 58

Diffusion kernel example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0.32

0.24

0.15

0.24

0.06

0.24

0.3

0.19

0.17

0.11

0.15

0.19

0.23

0.19

0.24

0.24

0.17

0.19

0.3

0.11

0.06

0.11

0.24

0.11

0.49

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

 

 

41 of 58

Diffusion kernel example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0.24

0.22

0.19

0.22

0.13

0.22

0.22

0.19

0.2

0.16

0.19

0.19

0.21

0.19

0.22

0.22

0.2

0.19

0.22

0.16

0.13

0.16

0.22

0.16

0.33

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

 

 

42 of 58

Diffusion kernel example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

 

 

43 of 58

Diffusion kernel example

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

a

b

c

d

e

a

b

c

d

e

 

0.24

0.22

0.19

0.22

0.13

0.22

0.22

0.19

0.2

0.16

0.19

0.19

0.21

0.19

0.22

0.22

0.2

0.19

0.22

0.16

0.13

0.16

0.22

0.16

0.33

a

b

c

d

e

a

b

c

d

e

 

0.32

0.24

0.15

0.24

0.06

0.24

0.3

0.19

0.17

0.11

0.15

0.19

0.23

0.19

0.24

0.24

0.17

0.19

0.3

0.11

0.06

0.11

0.24

0.11

0.49

a

b

c

d

e

a

b

c

d

e

 

0.47

0.21

0.09

0.21

0.02

0.21

0.46

0.18

0.09

0.05

0.09

0.18

0.34

0.18

0.21

0.21

0.09

0.18

0.46

0.05

0.02

0.05

0.21

0.05

0.67

a

b

c

d

e

a

b

c

d

e

 

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

a

b

c

d

e

a

b

c

d

e

 

a

b

c

d

e

44 of 58

Competing methods

  • Direct interactions (DI)
    • A gene is predicted as a disease gene for disease j if it is a direct neighbor of a gene associated with a known gene of disease j
  • Shortest path (SP)
    • Rank gene based on the single shortest path to any known disease gene for disease j
  • ENDEAVOUR
    • Integrates gene expression, protein domain information, literature annotation
    • Based on an ensemble method, each component of the ensemble corresponds to a dataset
  • PROSPECTR
    • Sequence-based features using an alternating decision tree to output a likelihood of a gene to belong to a disease

45 of 58

Dataset

  • Online Mendelian Inheritance in Man (OMIM)
    • Online Catalog of Human Genes and Genetic Disorders
  • Literature search to find genes clearly associated with a disease
  • 110 different disease-gene families spanning different types of diseases:
    • 86 genetically heterogeneous disorders: mutations in distinct genes are associated with similar or even indistinguishable phenotypes
    • 12 cancer syndromes: genes associated with hereditary cancer, increased risk, or somatic mutation in a given cancer type
    • 12 complex (polygenic) disorders: influenced by variation in multiple genes
  • Total of 783 genes with 665 distinct genes
    • Some genes are associated with multiple diseases
    • Each family has 7 genes on average, largest family has 41 genes, smallest had 3 genes
  • Network of 35,910 human interactions + 38,975 from other organisms
    • Include experimentally verified and predicted interactions

46 of 58

OMIM

47 of 58

Evaluation strategy

  • For each disease gene, define an artificial linkage interval by selecting 100 closest genes on the same chromosome
  • Evaluate performance based on leave one out for each disease gene family
  • Two assessment measures
    • An enrichment score: Compare the predicted rank to random rank
      • Let grank be the rank of a particular gene
      • Enrichment score is defined as 50/grank
      • If a true disease gene is ranked the highest, it has an enrichment score of 50
      • For all other genes, we have assigned a rank of 100.
    • Receiver Operator Curves (ROC)
      • Plot True Positive Rate (TPR) versus False Positive Rate (FPR) and compute Area Under the ROC (AUROC)

48 of 58

Global graph-based measures have greater score enrichment compared to local measures

Mean enrichment scores of true genes

Diffusion kernel was not as useful for cancer diseases

A and B differ in how ties are handled

49 of 58

ROC analysis further supports enrichment analysis

50 of 58

RWR outperforms existing local network based methods on seven disease genes that do not have the literature bias

51 of 58

Shortest path fails but RWR successfully identifies disease genes

False positives from SP and DI

Dense network with multiple alternate paths is used by RWR to correctly rank genes

For bare lymphocyte syndrome type 1

52 of 58

RWR can identify genes based on indirect paths

  1. There are no direct connections between disease genes, so DI will not work
  2. Shortest path has high false positives (yellow)
  3. RWR correctly identifies new genes (red)

53 of 58

Summary of GeneWanderer results

  • Global similarity measured using both random walk and diffusion kernels are vastly superior than local distance measures
  • The authors generated a test set using a linkage based analysis
    • This is more realistic and similar to existing linkage-based approaches to find genes
    • Others have used a random set of test genes, which might not be correct because similar genes tend to cluster on chromosomes
  • How to define disease gene families?
    • This method needs known genes, although not many. Was shown to work for as small as 3 genes
    • Existing approaches like ENDEAVOUR may not work

54 of 58

Concluding remarks

  • Phenotypically similar diseases likely represent perturbations to the same subnetwork or are topologically close.
  • This approach works for situations where there are some known genes that can be used to rank other genes with respect to them
    • But this may not be the case always. How to extend?
  • Current interaction networks are incomplete
  • Each disease was considered one at a time
    • Can we share information between diseases to better prioritize genes?
  • Other approaches go beyond genes: prioritizing GWAS sequence variants (NetWAS from GIANT http://giant.princeton.edu )

55 of 58

Other applications of random walks

  • Hi-C matrix denoising
  • Single cell RNA-seq data denoising

56 of 58

Using random walks for denoising Hi-C data

  • Nodes in the graph represent genomic regions
  • Edge weight is the counts of interactions
  • Denoising is done by taking powers of the row sum-based transition matrix
  • This gives you the probability of reaching region j from region i in t steps for a random walk originating on region i

57 of 58

Using random walks to smooth scRNA-seq data

van Dijk et al. Cell 2018

Euclidean distance

Kernel is adaptive

Each cell has at most k neighbors, to allow most of the Gaussian kernel to be covered.

Make Affinity matrix symmetric

58 of 58

Smoothing using MAGIC enhances scRNA-seq data signal