1 of 58

Graph diffusion for network–based applications

Nov 8^th, 2022

BMI/CS 775 Computational Network Biology�Fall 2022

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

Original slides created by Prof. Sushmita Roy

2 of 58

Topics in this section

Graph-based approaches for gene prioritization
Graph diffusion to interpret sequence variants
Combinatorial graph algorithms for subnetwork selection
Multi-omic data integration

3 of 58

Goals for today

Gene prioritization

Supervised methods
Unsupervised methods

GeneWanderer: Walking the Interactome for Prioritization of Candidate Disease Genes

Example of a supervised, graph diffusion-based approach

Other applications of random walks on graphs

4 of 58

Gene prioritization

Gene prioritization: rank the most important genes for a particular process or system through integrative computational analysis of public and private genomic data.
Many of the approaches developed were originally for finding the important gene(s) from a linkage study

A genomic locus associated with a disease that could have hundreds of genes

Now used to prioritize genes identified from high-throughput omics experiments

Computational tools for prioritizing candidate genes: boosting disease gene discovery Moreau and Tranchevent, Nature Review Genetics 2012

5 of 58

Why gene prioritization?

Identification of genes associated with diseases (and other complex traits) is important for gaining a molecular understanding of a disease
Linkage analysis or omics-based measurements identify many candidate genes

Can narrow down a region of the genome that might be associated with a disease, or obtain a gene set
Too many genes for follow-up analysis
Don’t have the resolution to pinpoint specific genes

6 of 58

Linkage disequilibrium

Single nucleotide variants along the genome

Heatmap of correlations

Image from Tulsyan et al . PLOS ONE 2013

7 of 58

Why gene prioritization?

Gene prioritization has many applications

What genes control a particular process?
What genes are affected in a specific disease?
What genes must be tested first to best complete a model of the system?
What genes (regulators) establish a particular cell type or fate?

8 of 58

Overview of approaches for gene prioritization

Non-network based methods

Similarity in different gene-level features can be used to prioritize new genes

Sequence, expression, and Gene Ontology

Network-based methods

Supervised methods: need a small number of seed or known genes
Unsupervised methods
Network model-based methods

9 of 58

Network-based prioritization

Can we use interaction networks to prioritize what genes might be important?
Guilt-by-association principle
Suppose we have candidate genes we know are important

Can we identify additional genes that are also important based on their proximity in the network?

10 of 58

Supervised network-based gene prioritization task definition

Given

List of candidate genes
Background interaction network among genes
List of known genes (seed genes) for a particular process or disease

Do

Rank candidates genes based on their association to the known genes

11 of 58

Supervised network-based gene prioritization

Slide credit: Debbie Chasman

Seed genes

(known genes)

12 of 58

Neighborhoods on graphs

Network-based methods rely on defining neighborhoods of genes
Neighborhoods of genes known for a disease can be used to prioritize additional genes
Local neighborhood

Immediate neighbor (direct interaction)
Shortest path

Global neighborhood

Based on more global measures of distance between nodes

We have seen some ideas of neighborhoods on graphs in graph clustering

13 of 58

Local network neighborhood-based prioritization

Available in many public software and tools

MouseNET, GIANT, STRING

Applied for filling holes in a pathway

Known members

Novel members

Predictions of novel pathway components from MouseNET

14 of 58

Local network prioritization fails

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease gene we are trying to find

False positives

15 of 58

Local network prioritization fails

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease genes we are trying to find

False positives

16 of 58

Desired global network prioritization

Cowen et al., Nature Review Genetics 2017

Known disease gene

New disease genes we are trying to find

Successfully rank the other

disease genes above other

genes

17 of 58

Global network neighborhood-based prioritization

Global network neighborhood-based methods make use of the entire graph to define the similarity between two genes
Given a graph connecting nodes, global similarity can be defined using

Random walk
Diffusion kernel
Laplacian kernel
Heat kernel

18 of 58

Goals for today

Gene prioritization

Supervised methods
Unsupervised methods

GeneWanderer: Walking the Interactome for Prioritization of Candidate Disease Genes

Example of a supervised, graph diffusion-based approach

Other applications of random walks on graphs

19 of 58

Motivation of GeneWanderer

Many diseases for which we do not know the causal genes
Previous approaches have used protein-protein interaction networks, but in a limited way

Only local similarity was used

Can we use global graph distance?
Does global distance help and improve current prioritization techniques?

S. Köhler, S. Bauer, D. Horn, and P. N. Robinson, "Walking the interactome for prioritization of candidate disease genes." American Journal of Human Genetics 2008

20 of 58

Overview of GeneWanderer

21 of 58

Motivation of using global distance

Global similarity is more sensitive and different in each case
In contrast, local shortest-path similarity is the same for all pairs
Direct interactions will never select y as a candidate

Known gene: x

Candidate gene: y

22 of 58

Global graph distances used in GeneWanderer

Random walk with restarts

Diffusion kernel

23 of 58

Random walk on graphs

A random walk is defined as a probabilistic traversal of a graph
At each time step, the walker transitions randomly to one of the neighbors of the current node
Typically can start anywhere on the graph
But can put priors on the walk based on what we know about candidate genes
The random walk converges to some distribution of the number of times a gene is visited
The distribution can be used to rank genes

24 of 58

Random walk on graphs

Random walk requires transition probabilities
Transition probability is obtained by dividing each entry in adjacency matrix by the row sum or column sum
One step of the random walk corresponds to one matrix multiplication of the transition matrix
Suppose we started at node a, what is the probability of reaching other nodes after t steps?
t^th power of the transition matrix tell us the probability of reaching node j from node i after t steps on the random walk

25 of 58

Transition matrix of a graph

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

0	0.5	0	0.5	0
0.5	0	1/3	0	0
0	0.5	0	0.5	1
0.5	0	1/3	0	0
0	0	1/3	0	0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

0	0.5	0	0.5	0
0.5	0	0.5	0	0
0	1/3	0	1/3	1/3
0.5	0	0.5	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

Row-normalized adjacency matrix

26 of 58

Random Walk with Restarts (RWR)

Allow the random walker to restart occasionally

W is the column normalized adjacency matrix

Original adjacency matrix can be weighted

r controls how often we restart the random walk
p⁰ was set such that the random walk would start at any of the known genes with equal probability
Rank candidate genes based on the p^t+1 when p^t and p^t+1 are really close
In other words, this means that p^t+1 is the steady state distribution

Prior probability of starting at a node

27 of 58

RWR continued

Intuitively, this approach gives the probability of reaching a specific node, given that a random walk starts at a given node, while taking all intermediate paths into account
By making the random walk start from known disease genes we obtain a global proximity measure from these genes

28 of 58

RWR example

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

0	0.5	0	0.5	0
0.5	0	1/3	0	0
0	0.5	0	0.5	1
0.5	0	1/3	0	0
0	0	1/3	0	0

a

b

c

d

e

a

b

c

d

e

A: Adjacency matrix

W: Column-normalized adjacency matrix

Network adapted from IsoRank paper

a

b

c

d

e

29 of 58

RWR example continued

Assume that a and b are “known genes”
p₀: [0.5 0.5 0 0 0]
Let r=0.7

a

b

c

d

e

Iteration 0

0.5

0

Iteration 1

a

b

c

d

e

0.4250

0.0750

0

Iteration 2

a

b

c

d

e

0.4250

0.4213

0.0750

0.0713

0.0075

Iteration 6

(convergence)

a

b

c

d

e

0.4239

0.4212

0.0761

0.0712

0.0076

30 of 58

RWR example continued

Assume that a and b are “known genes”
p₀: [0.5 0.5 0 0 0]
Let r=0.5

a

b

c

d

e

Iteration 0

0.5

0

Iteration 1

a

b

c

d

e

0.3750

0.1250

0

Iteration 2

a

b

c

d

e

0.3750

0.3646

0.1250

0.1146

0.0208

Iteration 8

(convergence)

a

b

c

d

e

0.3696

0.3641

0.1304

0.1141

0.0217

31 of 58

Effect of restart probability

r=0.5

Iteration 8

(convergence)

a

b

c

d

e

0.3696

0.3641

0.1304

0.1141

0.0217

Iteration 6

(convergence)

a

b

c

d

e

0.4239

0.4212

0.0761

0.0712

0.0076

r=0.7

32 of 58

33 of 58

Global graph distances used in GeneWanderer

Random walk with restarts

Diffusion kernel

34 of 58

Kernel

A kernel k is a function that maps pairs of objects to a real-valued space
It enables one to define very general similarity functions between pairs of objects
Let denote a set of objects of a particular type

For example, the set of images or a set of trees

A kernel k is defined as

35 of 58

Diffusion kernel

Diffusion kernel K of a graph is defined a function of the graph Laplacian L

K= exp(-βL)

Graph Laplacian L=D-A

D is the diagonal degree of matrix
A is the adjacency matrix
The rank of a gene j is given by a score depending upon its proximity from other disease genes

Kondor and Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces”, ICML 2002

36 of 58

Diffusion kernel

The diffusion kernel gives a ranking of a new gene using the sum of a global distance measure from all known genes of a disease
Specifically the j^th column corresponds to the stationary distribution of a random walk that started at node j

37 of 58

Diffusion kernel

Gärtner et al., “On Graph Kernels: Hardness Results and Efficient Alternatives”, 2003

Kondor and Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces”, ICML 2002

38 of 58

Diffusion kernel example

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

1	0	0	0	0
0	1	0	0	0
0	0	1	0	0
0	0	0	1	0
0	0	0	0	1

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

39 of 58

Diffusion kernel example

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

0.47	0.21	0.09	0.21	0.02
0.21	0.46	0.18	0.09	0.05
0.09	0.18	0.34	0.18	0.21
0.21	0.09	0.18	0.46	0.05
0.02	0.05	0.21	0.05	0.67

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

40 of 58

Diffusion kernel example

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

0.32	0.24	0.15	0.24	0.06
0.24	0.3	0.19	0.17	0.11
0.15	0.19	0.23	0.19	0.24
0.24	0.17	0.19	0.3	0.11
0.06	0.11	0.24	0.11	0.49

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

41 of 58

Diffusion kernel example

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

0.24	0.22	0.19	0.22	0.13
0.22	0.22	0.19	0.2	0.16
0.19	0.19	0.21	0.19	0.22
0.22	0.2	0.19	0.22	0.16
0.13	0.16	0.22	0.16	0.33

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

42 of 58

Diffusion kernel example

0	1	0	1	0
1	0	1	0	0
0	1	0	1	1
1	0	1	0	0
0	0	1	0	0

a

b

c

d

e

a

b

c

d

e

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

a

b

c

d

e

a

b

c

d

e

L: Laplacian matrix

Network adapted from IsoRank paper

a

b

c

d

e

43 of 58

Diffusion kernel example

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

a

b

c

d

e

a

b

c

d

e

0.24	0.22	0.19	0.22	0.13
0.22	0.22	0.19	0.2	0.16
0.19	0.19	0.21	0.19	0.22
0.22	0.2	0.19	0.22	0.16
0.13	0.16	0.22	0.16	0.33

a

b

c

d

e

a

b

c

d

e

0.32	0.24	0.15	0.24	0.06
0.24	0.3	0.19	0.17	0.11
0.15	0.19	0.23	0.19	0.24
0.24	0.17	0.19	0.3	0.11
0.06	0.11	0.24	0.11	0.49

a

b

c

d

e

a

b

c

d

e

0.47	0.21	0.09	0.21	0.02
0.21	0.46	0.18	0.09	0.05
0.09	0.18	0.34	0.18	0.21
0.21	0.09	0.18	0.46	0.05
0.02	0.05	0.21	0.05	0.67

a

b

c

d

e

a

b

c

d

e

1	0	0	0	0
0	1	0	0	0
0	0	1	0	0
0	0	0	1	0
0	0	0	0	1

a

b

c

d

e

a

b

c

d

e

a

b

c

d

e

44 of 58

Competing methods

Direct interactions (DI)

A gene is predicted as a disease gene for disease j if it is a direct neighbor of a gene associated with a known gene of disease j

Shortest path (SP)

Rank gene based on the single shortest path to any known disease gene for disease j

ENDEAVOUR

Integrates gene expression, protein domain information, literature annotation
Based on an ensemble method, each component of the ensemble corresponds to a dataset

PROSPECTR

Sequence-based features using an alternating decision tree to output a likelihood of a gene to belong to a disease

45 of 58

Dataset

Online Mendelian Inheritance in Man (OMIM)

Online Catalog of Human Genes and Genetic Disorders

Literature search to find genes clearly associated with a disease
110 different disease-gene families spanning different types of diseases:

86 genetically heterogeneous disorders: mutations in distinct genes are associated with similar or even indistinguishable phenotypes
12 cancer syndromes: genes associated with hereditary cancer, increased risk, or somatic mutation in a given cancer type
12 complex (polygenic) disorders: influenced by variation in multiple genes

Total of 783 genes with 665 distinct genes

Some genes are associated with multiple diseases
Each family has 7 genes on average, largest family has 41 genes, smallest had 3 genes

Network of 35,910 human interactions + 38,975 from other organisms

Include experimentally verified and predicted interactions

46 of 58

OMIM

47 of 58

Evaluation strategy

For each disease gene, define an artificial linkage interval by selecting 100 closest genes on the same chromosome
Evaluate performance based on leave one out for each disease gene family
Two assessment measures

An enrichment score: Compare the predicted rank to random rank

Let g_rankbe the rank of a particular gene
Enrichment score is defined as 50/g_rank
If a true disease gene is ranked the highest, it has an enrichment score of 50
For all other genes, we have assigned a rank of 100.

Receiver Operator Curves (ROC)

Plot True Positive Rate (TPR) versus False Positive Rate (FPR) and compute Area Under the ROC (AUROC)

48 of 58

Global graph-based measures have greater score enrichment compared to local measures

Mean enrichment scores of true genes

Diffusion kernel was not as useful for cancer diseases

A and B differ in how ties are handled

49 of 58

ROC analysis further supports enrichment analysis

50 of 58

RWR outperforms existing local network based methods on seven disease genes that do not have the literature bias

51 of 58

Shortest path fails but RWR successfully identifies disease genes

False positives from SP and DI

Dense network with multiple alternate paths is used by RWR to correctly rank genes

For bare lymphocyte syndrome type 1

52 of 58

RWR can identify genes based on indirect paths

There are no direct connections between disease genes, so DI will not work
Shortest path has high false positives (yellow)
RWR correctly identifies new genes (red)

53 of 58

Summary of GeneWanderer results

Global similarity measured using both random walk and diffusion kernels are vastly superior than local distance measures
The authors generated a test set using a linkage based analysis

This is more realistic and similar to existing linkage-based approaches to find genes
Others have used a random set of test genes, which might not be correct because similar genes tend to cluster on chromosomes

How to define disease gene families?

This method needs known genes, although not many. Was shown to work for as small as 3 genes
Existing approaches like ENDEAVOUR may not work

54 of 58

Concluding remarks

Phenotypically similar diseases likely represent perturbations to the same subnetwork or are topologically close.
This approach works for situations where there are some known genes that can be used to rank other genes with respect to them

But this may not be the case always. How to extend?

Current interaction networks are incomplete
Each disease was considered one at a time

Can we share information between diseases to better prioritize genes?

Other approaches go beyond genes: prioritizing GWAS sequence variants (NetWAS from GIANT http://giant.princeton.edu )

55 of 58

Other applications of random walks

Hi-C matrix denoising
Single cell RNA-seq data denoising

56 of 58

Using random walks for denoising Hi-C data

Nodes in the graph represent genomic regions
Edge weight is the counts of interactions
Denoising is done by taking powers of the row sum-based transition matrix
This gives you the probability of reaching region j from region i in t steps for a random walk originating on region i

Ursu et al. Bioinformatics 2018

57 of 58

Using random walks to smooth scRNA-seq data

van Dijk et al. Cell 2018

Euclidean distance

Kernel is adaptive

Each cell has at most k neighbors, to allow most of the Gaussian kernel to be covered.

Make Affinity matrix symmetric

58 of 58

Smoothing using MAGIC enhances scRNA-seq data signal

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2

0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2
0.2	0.2	0.2	0.2	0.2