1 of 59

Graph diffusion for network-based applications

Nov 6th, 2025

Original slides created by Prof. Sushmita Roy

BMI/CS 775 Computational Network Biology�Fall 2025

Anthony Gitter

https://bmi775.sites.wisc.edu

These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Sushmita Roy and Anthony Gitter

2 of 59

Topics in this section

  • Graph-based approaches for gene prioritization
  • Graph diffusion to interpret sequence variants
  • Combinatorial graph algorithms for subnetwork selection
  • Multi-omic data integration
  • Network integration with graph neural networks

3 of 59

Goals for today

  • Gene prioritization
    • Supervised methods
    • Unsupervised methods
  • GeneWanderer: Walking the Interactome for Prioritization of Candidate Disease Genes
    • Example of a supervised, graph diffusion-based approach
  • Other applications of random walks on graphs

4 of 59

Gene prioritization

  • Gene prioritization: rank the most important genes for a particular process or system through integrative computational analysis of public and private genomic data.
  • Many of the approaches developed were originally for finding the important gene(s) from a linkage study
    • A genomic locus associated with a disease that could have hundreds of genes
    • Don’t have the resolution to pinpoint specific genes
  • Now used to prioritize genes identified from high-throughput omics experiments

Computational tools for prioritizing candidate genes: boosting disease gene discovery Moreau and Tranchevent, Nature Review Genetics 2012

5 of 59

Why gene prioritization?

  • Identification of genes associated with diseases (and other complex traits) is important for gaining a molecular understanding of a disease
  • Linkage analysis or omics-based measurements identify many candidate genes
    • Too many genes for follow-up analysis

6 of 59

Linkage disequilibrium

Single nucleotide variants along the genome

Heatmap of correlations

7 of 59

Why gene prioritization?

  • Gene prioritization has many applications
    • What genes control a particular process?
    • What genes are affected in a specific disease?
    • What genes with sequence variation are pathogenic?
    • What genes must be tested first to best complete a model of the system?
    • What genes (regulators) establish a particular cell type or fate?

8 of 59

Overview of approaches for gene prioritization

  • Non-network based methods
    • Similarity in different gene-level features can be used to prioritize new genes
      • Sequence, expression, Gene Ontology, structure
  • Network-based methods
    • Supervised methods: need a small number of seed or known genes
    • Unsupervised methods
    • Network model-based methods

9 of 59

Network-based prioritization

  • Can we use interaction networks to prioritize what genes might be important?
  • Guilt-by-association principle
  • Suppose we have candidate genes we know are important
    • Can we identify additional genes that are also important based on their proximity in the network?

10 of 59

Supervised network-based gene prioritization task definition

  • Given
    • List of candidate genes
    • Background interaction network among genes
    • List of known genes (seed genes) for a particular process or disease
  • Do
    • Rank candidates genes based on their association to the known genes

11 of 59

Supervised network-based gene prioritization

Slide credit: Debbie Chasman

Seed genes

(known genes)

12 of 59

Neighborhoods on graphs

  • Network-based methods rely on defining neighborhoods of genes
  • Neighborhoods of genes known for a disease can be used to prioritize additional genes
  • Local neighborhood
    • Immediate neighbor (direct interaction)
    • Shortest path
  • Global neighborhood
    • Based on more global measures of distance between nodes
  • We have seen related ideas in node2vec and graph clustering

13 of 59

Local network neighborhood-based prioritization

  • Available in many public software and tools
    • MouseNET, GIANT, STRING
  • Applied for filling holes in a pathway

Known members

Novel members

Predictions of novel pathway components from MouseNET

14 of 59

Local network prioritization fails

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease gene we are trying to find

False positives

15 of 59

Local network prioritization fails

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease genes we are trying to find

False positives

16 of 59

Desired global network prioritization

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease genes we are trying to find

Successfully rank the other

disease genes above other

genes

17 of 59

Global network neighborhood-based prioritization

  • Global network neighborhood-based methods make use of the entire graph to define the similarity between two genes
  • Given a graph connecting nodes, global similarity can be defined using
    • Random walk
    • Diffusion kernel
    • Laplacian kernel
    • Heat kernel

18 of 59

Goals for today

  • Gene prioritization
    • Supervised methods
    • Unsupervised methods
  • GeneWanderer: Walking the Interactome for Prioritization of Candidate Disease Genes
    • Example of a supervised, graph diffusion-based approach
  • Other applications of random walks on graphs

19 of 59

Motivation of GeneWanderer

  • Many diseases for which we do not know the causal genes
  • Previous approaches have used protein-protein interaction networks, but in a limited way
    • Only local similarity was used
  • Can we use global graph distance?
  • Does global distance help and improve current prioritization techniques?

S. Köhler, S. Bauer, D. Horn, and P. N. Robinson, "Walking the interactome for prioritization of candidate disease genes." American Journal of Human Genetics 2008

20 of 59

Overview of GeneWanderer

21 of 59

Motivation of using global distance

  • Global similarity is more sensitive and different in each case
  • In contrast, local shortest-path similarity is the same for all pairs
  • Direct interactions will never select y as a candidate

Known gene: x

Candidate gene: y

22 of 59

Global graph distances used in GeneWanderer

  • Random walk with restarts

  • Diffusion kernel

23 of 59

Random walk on graphs

  • A random walk is defined as a probabilistic traversal of a graph
  • At each time step, the walker transitions randomly to one of the neighbors of the current node
  • Typically can start anywhere on the graph
  • But can put priors on the walk based on what we know about candidate genes
  • The random walk converges to some distribution of the number of times a gene is visited
  • The distribution can be used to rank genes

24 of 59

Random walk on graphs

  • Random walk requires transition probabilities
  • Transition probability is obtained by dividing each entry in adjacency matrix by the row sum or column sum
  • One step of the random walk corresponds to one matrix multiplication of the transition matrix
  • Suppose we started at node a, what is the probability of reaching other nodes after t steps?
  • tth power of the transition matrix tell us the probability of reaching node j from node i after t steps on the random walk

25 of 59

Transition matrix of a graph

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

1/3

0

0

0

0.5

0

0.5

1

0.5

0

1/3

0

0

0

0

1/3

0

0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

0.5

0

0

0

1/3

0

1/3

1/3

0.5

0

0.5

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

Row-normalized adjacency matrix

26 of 59

Transition matrix of a graph

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

1/3

0

0

0

0.5

0

0.5

1

0.5

0

1/3

0

0

0

0

1/3

0

0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

0.5

0

0

0

1/3

0

1/3

1/3

0.5

0

0.5

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

Row-normalized adjacency matrix

27 of 59

Random Walk with Restarts (RWR)

  • Allow the random walker to restart occasionally

  • W is the column normalized adjacency matrix
    • Original adjacency matrix can be weighted
  • r controls how often we restart the random walk
  • p0 was set such that the random walk would start at any of the known genes with equal probability
  • Rank candidate genes based on the pt+1 when pt and pt+1 are really close
  • pt+1 is the steady state distribution

Prior probability of starting at a node

28 of 59

RWR continued

  • Gives the probability of reaching a specific node, given that a random walk starts at a specific node, while taking all intermediate paths into account
  • By making the random walk start from known disease genes we obtain a global proximity measure from these genes

29 of 59

RWR example

0

1

0

1

0

1

0

1

0

0

0

1

0

1

1

1

0

1

0

0

0

0

1

0

0

a

b

c

d

e

a

b

c

d

e

0

0.5

0

0.5

0

0.5

0

1/3

0

0

0

0.5

0

0.5

1

0.5

0

1/3

0

0

0

0

1/3

0

0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

W: Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

30 of 59

RWR example continued

  • Assume that a and b are “known genes”
  • p0: [0.5 0.5 0 0 0]
  • Let r=0.7

a

b

c

d

e

Iteration 0

0.5

0.5

0

0

0

Iteration 1

a

b

c

d

e

0.4250

0.4250

0.0750

0.0750

0

Iteration 2

a

b

c

d

e

0.4250

0.4213

0.0750

0.0713

0.0075

Iteration 6

(convergence)

a

b

c

d

e

0.4239

0.4212

0.0761

0.0712

0.0076

31 of 59

RWR example continued

  • Assume that a and b are “known genes”
  • p0: [0.5 0.5 0 0 0]
  • Let r=0.5

a

b

c

d

e

Iteration 0

0.5

0.5

0

0

0

Iteration 1

a

b

c

d

e

0.3750

0.3750

0.1250

0.1250

0

Iteration 2

a

b

c

d

e

0.3750

0.3646

0.1250

0.1146

0.0208

Iteration 8

(convergence)

a

b

c

d

e

0.3696

0.3641

0.1304

0.1141

0.0217

32 of 59

Effect of restart probability

r=0.5

Iteration 8

(convergence)

a

b

c

d

e

0.3696

0.3641

0.1304

0.1141

0.0217

Iteration 6

(convergence)

a

b

c

d

e

0.4239

0.4212

0.0761

0.0712

0.0076

r=0.7

33 of 59

Top Hat question

34 of 59

Global graph distances used in GeneWanderer

  • Random walk with restarts

  • Diffusion kernel

35 of 59

Kernel

  • A kernel k is a function that maps pairs of objects to a real-valued space
  • It enables one to define very general similarity functions between pairs of objects
  • Let denote a set of objects of a particular type
    • For example, the set of images or a set of trees
  • A kernel k is defined as

36 of 59

Diffusion kernel

  • Diffusion kernel K of a graph is defined a function of the graph Laplacian L
    • K= exp(-βL)
  • Graph Laplacian L=D-A
    • D is the diagonal degree of matrix
    • A is the adjacency matrix
    • The rank of a gene j is given by a score depending upon its proximity from other known disease genes (seeds)

Kondor and Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces”, ICML 2002

37 of 59

Diffusion kernel

  • The diffusion kernel gives a ranking of a new gene using the sum of a global distance measure from all known genes of a disease
  • Specifically the jth column corresponds to the stationary distribution of a random walk that started at node j

38 of 59

Diffusion kernel

  •  

Gärtner et al., “On Graph Kernels: Hardness Results and Efficient Alternatives”, 2003

 

Kondor and Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces”, ICML 2002

39 of 59

Diffusion kernel example

2

-1

0

-1

0

-1

2

-1

0

0

0

-1

3

-1

-1

-1

0

-1

2

0

0

0

-1

0

1

a

b

c

d

e

a

b

c

d

e

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

a

b

c

d

e

 

 

See .ipynb or .html for calculations

40 of 59

Diffusion kernel example

a

b

c

d

e

a

b

c

d

e

0.47

0.21

0.09

0.21

0.02

0.21

0.46

0.18

0.09

0.05

0.09

0.18

0.34

0.18

0.21

0.21

0.09

0.18

0.46

0.05

0.02

0.05

0.21

0.05

0.67

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

a

b

c

d

e

 

 

2

-1

0

-1

0

-1

2

-1

0

0

0

-1

3

-1

-1

-1

0

-1

2

0

0

0

-1

0

1

See .ipynb or .html for calculations

41 of 59

Diffusion kernel example

a

b

c

d

e

a

b

c

d

e

0.32

0.24

0.15

0.24

0.06

0.24

0.3

0.19

0.17

0.11

0.15

0.19

0.23

0.19

0.24

0.24

0.17

0.19

0.3

0.11

0.06

0.11

0.24

0.11

0.49

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

a

b

c

d

e

 

 

2

-1

0

-1

0

-1

2

-1

0

0

0

-1

3

-1

-1

-1

0

-1

2

0

0

0

-1

0

1

See .ipynb or .html for calculations

42 of 59

Diffusion kernel example

a

b

c

d

e

a

b

c

d

e

0.24

0.22

0.19

0.22

0.13

0.22

0.22

0.19

0.2

0.16

0.19

0.19

0.21

0.19

0.22

0.22

0.2

0.19

0.22

0.16

0.13

0.16

0.22

0.16

0.33

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

a

b

c

d

e

 

 

2

-1

0

-1

0

-1

2

-1

0

0

0

-1

3

-1

-1

-1

0

-1

2

0

0

0

-1

0

1

See .ipynb or .html for calculations

43 of 59

Diffusion kernel example

a

b

c

d

e

a

b

c

d

e

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

a

b

c

d

e

 

 

2

-1

0

-1

0

-1

2

-1

0

0

0

-1

3

-1

-1

-1

0

-1

2

0

0

0

-1

0

1

See .ipynb or .html for calculations

44 of 59

Diffusion kernel example

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

0.2

a

b

c

d

e

a

b

c

d

e

 

0.24

0.22

0.19

0.22

0.13

0.22

0.22

0.19

0.2

0.16

0.19

0.19

0.21

0.19

0.22

0.22

0.2

0.19

0.22

0.16

0.13

0.16

0.22

0.16

0.33

a

b

c

d

e

a

b

c

d

e

 

0.32

0.24

0.15

0.24

0.06

0.24

0.3

0.19

0.17

0.11

0.15

0.19

0.23

0.19

0.24

0.24

0.17

0.19

0.3

0.11

0.06

0.11

0.24

0.11

0.49

a

b

c

d

e

a

b

c

d

e

 

0.47

0.21

0.09

0.21

0.02

0.21

0.46

0.18

0.09

0.05

0.09

0.18

0.34

0.18

0.21

0.21

0.09

0.18

0.46

0.05

0.02

0.05

0.21

0.05

0.67

a

b

c

d

e

a

b

c

d

e

 

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

a

b

c

d

e

a

b

c

d

e

 

a

b

c

d

e

45 of 59

Competing methods

  • Direct interactions (DI)
    • A gene is predicted as a disease gene for disease j if it is a direct neighbor of a gene associated with a known gene of disease j
  • Shortest path (SP)
    • Rank gene based on the single shortest path to any known disease gene for disease j
  • ENDEAVOUR
    • Integrates gene expression, protein domain information, literature annotation
    • Based on an ensemble method, each component of the ensemble corresponds to a dataset
  • PROSPECTR
    • Sequence-based features using an alternating decision tree to output a likelihood of a gene to belong to a disease

46 of 59

Dataset

  • Online Mendelian Inheritance in Man (OMIM)
    • Online Catalog of Human Genes and Genetic Disorders
  • Literature search to find genes clearly associated with a disease
  • 110 different disease-gene families spanning different types of diseases:
    • 86 genetically heterogeneous disorders: mutations in distinct genes are associated with similar or even indistinguishable phenotypes
    • 12 cancer syndromes: genes associated with hereditary cancer, increased risk, or somatic mutation in a given cancer type
    • 12 complex (polygenic) disorders: influenced by variation in multiple genes
  • Total of 783 genes with 665 distinct genes
    • Some genes are associated with multiple diseases
    • Each family has 7 genes on average, largest family has 41 genes, smallest had 3 genes
  • Network of 35,910 human interactions + 38,975 from other organisms
    • Include experimentally verified and predicted interactions

47 of 59

OMIM

48 of 59

Evaluation strategy

  • For each disease gene, define an artificial linkage interval by selecting 100 closest genes on the same chromosome
  • Evaluate performance based on leave one out for each disease gene family
  • Two assessment measures
    • An enrichment score: Compare the predicted rank to random rank
      • Let grank be the rank of a particular gene
      • Enrichment score is defined as 50/grank
      • If a true disease gene is ranked the highest, it has an enrichment score of 50
      • For all other genes, we have assigned a rank of 100.
    • Receiver Operator Curves (ROC)
      • Plot True Positive Rate (TPR) versus False Positive Rate (FPR) and compute Area Under the ROC (AUROC)

49 of 59

Global graph-based measures have greater score enrichment compared to local measures

Mean enrichment scores of true genes

Diffusion kernel was not as useful for cancer diseases

A and B differ in how ties are handled

50 of 59

ROC analysis further supports enrichment analysis

51 of 59

RWR outperforms existing local network based methods on seven disease genes that do not have the literature bias

52 of 59

Shortest path fails but RWR successfully identifies disease genes

False positives from SP and DI

Dense network with multiple alternate paths is used by RWR to correctly rank genes

For bare lymphocyte syndrome type 1

53 of 59

RWR can identify genes based on indirect paths

  1. There are no direct connections between disease genes, so DI will not work
  2. Shortest path has high false positives (yellow)
  3. RWR correctly identifies new genes (red)

54 of 59

Summary of GeneWanderer results

  • Global similarity measured using both random walk and diffusion kernels are vastly superior than local distance measures
  • The authors generated a test set using a linkage based analysis
    • This is more realistic and similar to existing linkage-based approaches to find genes
    • Others have used a random set of test genes, which might not be correct because similar genes tend to cluster on chromosomes
  • How to define disease gene families?
    • This method needs known genes, although not many. Was shown to work for as small as 3 genes
    • Existing approaches like ENDEAVOUR may not work

55 of 59

Concluding remarks

  • Phenotypically similar diseases likely represent perturbations to the same subnetwork or are topologically close.
  • This approach works for situations where there are some known genes that can be used to rank other genes with respect to them
    • But this may not be the case always. How to extend?
  • Current interaction networks are incomplete
  • Each disease was considered one at a time
    • Can we share information between diseases to better prioritize genes?
  • Other approaches go beyond genes: prioritizing GWAS sequence variants (NetWAS from GIANT http://giant.princeton.edu )

56 of 59

Other applications of random walks

  • Hi-C matrix denoising
  • Single cell RNA-seq data denoising

57 of 59

Using random walks for denoising Hi-C data

  • Nodes in the graph represent genomic regions
  • Edge weight is the counts of interactions
  • Denoising is done by taking powers of the row sum-based transition matrix
  • This gives you the probability of reaching region j from region i in t steps for a random walk originating on region i

58 of 59

Using random walks to smooth scRNA-seq data

van Dijk et al. Cell 2018

Euclidean distance

Kernel is adaptive

Each cell has at most k neighbors, to allow most of the Gaussian kernel to be covered.

Make Affinity matrix symmetric

59 of 59

Smoothing using MAGIC enhances scRNA-seq data signal