1 of 62

Comparison and alignment of networks

Oct 31^st, Nov 5^th 2024

BMI/CS 775 Computational Network Biology�Fall 2024

Sushmita Roy

https://compnetbiocourse.discovery.wisc.edu

2 of 62

Plan for the semester

2

When	What
Week 2-Week 4	Representing and learning networks from data
Week 5-Week 7	Supervised deep learning in network biology
Week 8-Week 9	Graph topology and modules
Week 9-Week 10	Graph alignment and comparison
Week 11-Week 13	Network-based integration and interpretation
Week 14-Week 15	Project presentations

3 of 62

Plan for this section

Global alignment of protein-interaction networks (Oct 31^st)

Matrix factorization: FUSE

Graph-based alignment for single cell omic datasets (Nov 5^th)

4 of 62

Goals for today

Introduction to the network alignment problem
Aligning networks across species
Global network alignment

Matrix factorization
FUSE: A matrix factorization method for network alignment

5 of 62

Broad definition of graph alignment

6 of 62

Applications of network alignment

Alignment of scRNA-seq datasets

Alignment of molecular networks

PathBLAST
IsoRank
FUSE

SCANORAMA
CONOS
LIGER
Seurat anchoring

7 of 62

Distinguishing the two alignment tasks

Network alignment across species

We also have a phylogenetic tree (information of how organisms are related)
Nodes correspond to genes/proteins
Node correspondence relies heavily on sequence similarity of nodes across species

Network alignment across datasets

We might not have a known hierarchy/relationship across datasets
Nodes correspond to cells
Node correspondence relies on similarity of measured gene expression in each cell

8 of 62

Goals for today

Introduction to the network alignment problem
Aligning networks across species
Global network alignment

Matrix factorization
FUSE: A matrix factorization method for network alignment

9 of 62

How are these organisms related?

Toh et al, Nature, 2011

10 of 62

Zoonomia project

A phylogenetic tree for ~200 mammals

11 of 62

Organisms can be compared at multiple levels

Comparison at the sequence level

Sequence alignment
Phylogenetic tree construction

Comparison at the expression level

Comparison at the network level

12 of 62

Some terminology

Homology

Two sequences are said to be homologous if they are derived from a common ancestral sequence

Orthology

Two proteins in two organisms are said to be orthologs of each other if they are related by a common ancestor

Paralogy

Two proteins that are related by a duplication event within a species

Match, mismatch, gaps

Terms used in sequence alignment

BLAST

A software program used to align two molecular sequences that provides a statistical score (BLAST E-value) used to assess the quality of the alignment

13 of 62

How do sequences change between organisms?

Substitutions

Deletions

Insertions

T H I S S E Q U E N C E

T H A T S E Q U E N C E

Sequence 2

Sequence 1

T H I S I S A S E Q U E N C E

T H I S _ _ _S E Q U E N C E

Sequence 2

Sequence 1

mismatch

gap

_ _ _ _S E Q U E N C E

T H I S S E Q U E N C E

Sequence 2

Sequence 1

gap

14 of 62

Alignment problems

Sequence alignment

Given the genomic sequence of two species, find the differences and similarities of the sequence
Aims to find a correspondence between the positions of two sequences while minimizing the number of substitutions and gaps

Network alignment

Given molecular interaction networks from different organisms find the differences and similarities between them at the subnetwork level
Aims to find a correspondence between the positions of two networks while minimizing the number of substitutions and gaps on the network

15 of 62

Sequence Alignment Examples

T H I S S E Q U E N C E

T H A T S E Q U E N C E

T H + + S E Q U E N C E

Sequence 2

Sequence 1

T H I S - - - S E Q U E N C E

T H A T I S A S E Q U E N C E

T H + + S E Q U E N C E

Sequence 2

Sequence 1

16 of 62

Gaps and mismatches in network alignment

Positions

The nodes in the graph

Mismatch

Occurs when aligned proteins in the network alignment do not share sequence homology
This boils down to pair (A,B) in one species aligned to pair (a,b) in another species with unaligned intermediate nodes.

Gap

Occurs when a protein interaction in one path skips over a protein in the other network

17 of 62

Network alignment example

The network alignment problem of finding conserved paths is equivalent to finding a high scoring path in a new network called the “network alignment” graph.

Each vertex in the alignment graph is represented by a pair of vertices and a corresponding numeric score (BLAST E-value) that specifies the sequence similarity

Each edge corresponds to to “interactions” in the original graph

18 of 62

Why is network alignment across species important?

Important from an evolutionary perspective

Are interactions of proteins with similar sequence conserved?
How do networks evolve?
Is there a minimal set of interactions common to all species?

Refine existing interaction networks

19 of 62

What makes network alignment difficult?

The set of genes/proteins between species are not the same
The correspondence between genes of one species and the genes of another species is not one-to-one

Although many algorithms assume one-to-one mapping

Underlying networks might be noisy and/or incomplete

20 of 62

Different network alignment problems

Local network alignment: Find locally similar subnetworks

Network query: Find instances of a small subnetwork in a larger network

Global network alignment: Align all nodes in one network to all nodes in the second network

Nir Atias and Roded Sharan, May 2012, ACM Communications

21 of 62

Different types of network alignment problems

Problem type	Description	Methods
Local alignment	Align small parts of the network	PathBLAST, LocalAli
Global alignment	Align the entire network	FUSE, IsoRank
Pairwise alignment	Align two networks	PathBLAST
Multiple network alignment	Align more than two networks	FUSE, IsoRank, LocalAli
NetworkQuery	Search for a small network in a larger network	NetGrep, QNet, QPath

Adapted from Nir Atias and Roded Sharan, May 2012, ACM Communications

22 of 62

Goals for today

Introduction to the network alignment problem
Aligning networks across species
Global network alignment

Matrix factorization
FUSE: A matrix factorization method for network alignment

23 of 62

Matrix factorization

A popular data analysis technique used for high-dimensional datasets
Decomposition and factorization used interchangeably
Many applications:

visualization, pattern extraction, interpretation and imputation, smoothing

Stein-O’Brien et al, Trends in Genetics 2018

R

W

H

24 of 62

Many different variants of MF

Singular value decomposition

Penalized matrix factorization

Non-negative matrix factorization

Matrix tri-factorization

25 of 62

Non-negative matrix factorization (NMF)

25

Minimize

Lee and Seung Adv. Neur. In. 2001

•

Slide credit Erika Da-Inn Lee

26 of 62

Non-negative matrix factorization (NMF)

26

5			5
4		4	4		5	4
4		4	4		3	4
3	1	4	3		2	4
	4	2		3		1
	4			4	5
	3				4
	2			3	2

X = ℝ^n×m

Slide credit Erika Da-Inn Lee

27 of 62

Non-negative matrix factorization (NMF)

27

5			5
4		4	4		5	4
4		4	4		3	4
3	1	4	3		2	4
	4	2		3		1
	4			4	5
	3				4
	2			3	2

X = ℝ^n×m

W = ℝ^n×k

H = ℝ^k×m

k << n, m

Slide credit Erika Da-Inn Lee

28 of 62

NMF can be used for bi-clustering

H = ℝ^k×m

W = ℝ^n×k

Row clusters

Column clusters

29 of 62

Incorporating prior knowledge to constrain the factorization

29

5			5
4		4	4		5	4
4		4	4		3	4
3	1	4	3		2	4
	4	2		3		1
	4			4
	3
	2			3	2

Slide credit Erika Da-Inn Lee

30 of 62

Graph regularization to incorporate domain prior knowledge

30

Minimize

3	-1	-1	-1	-1	0	0
-1	4	-1	-1	-1	0	0
-1	-1	3	-1	-1	0	0
-1	-1	-1	4	-1	0	0
-1	-1	-1	-1	4	0	0
0	0	0	0	0	1	-1
0	0	0	0	0	-1	1

Slide credit Erika Da-Inn Lee

31 of 62

Non-negative matrix tri-factorization (NMTF)

Minimize

•

k₁

k₂

32 of 62

NMTF for biclustering

Like NMF, NMTF can be used for clustering of row and column entities
These clusters can be similarly constrained as in NMF with a row/column-specific graph

33 of 62

Cluster indicator matrix

Let k be the total number of clusters
Let G be a cluster indicator matrix

an nXk matrix which specifies the cluster ID for each entity

Objects

Clusters

Example for 5 objects in 2 clusters

34 of 62

NMTF with constraints

Semi-Supervised Clustering via Matrix Factorization, Wang et al, 2008

Two types of objects: people and movies

Movies can be grouped based on actors, characters, titles

People can be grouped based on their hobbies and jobs

R₁₂

X₁

X₂

Blue: Must link

Red: Cannot link

35 of 62

NMTF for constrained bi-clustering

For two entity types 1 and 2

Trade-off between maintaining intra-type constraints and estimating R_ij

36 of 62

Goals for today

Introduction to the network alignment problem
Aligning networks across species
Global network alignment

Matrix factorization
FUSE: A matrix factorization method for network alignment

37 of 62

Motivation of FUSE

How to do multiple global network alignment?
In existing approaches the sequence-based node mapping is local, that is one pair at a time.
Can we improve this mapping by using protein-protein interaction networks in each species?

38 of 62

FUSE multiple network alignment

Given

Protein-protein interaction networks for k species
Pairwise sequence similarities for pairs of proteins from each pair of species

Do

Find a global one-to-one mapping between network nodes

39 of 62

FUSE Notation

N_i=(V_i, E_i) denotes the vertex and edge set for a PPI network for species i
Let n_i denote the number of proteins in species i
Let R_ij denote an n_iXn_j sequence similarity matrix (E-values) of proteins from species i and j
Let G_i and G_j specify the cluster assignments of proteins in i and j
Let S_ij is a k_iXk_j lower-dimensional approximation of R_ij

where k_i<<n_i and k_j<<n_j
k_i and k_j can be thought of as the number of independent groups in N_iand N_j

40 of 62

Overview of FUSE

Fuse sequence similarities and network wiring patterns over all proteins in all PPI networks being aligned

Create a Global Multiple Network Alignment

41 of 62

Fuse step

Based on Non-negative matrix tri-factorization (NMTF)
Derives functional scores between pairs of proteins using sequence and network information for k species

42 of 62

NMTF for Global network alignment

Each entity type is a species
Each entity is a protein from a species
Constraints are specified by the protein-protein interaction networks N_i
R_ijis the pairwise functional similarity between proteins of species i and j

43 of 62

NMTF for k entity types

For entities of k different types, we have k different constraint graphs
We can write the objective as

This objective can be solved using an iterative multiplicative update algorithm from Wang 2008

44 of 62

Pictorial illustration for five species

45 of 62

Minimizing the objective NMTF

We need to find S_ij, and G_i for all 1≤i,j≤k entity types
An iterative algorithm is used that updates each entity one at a time
Updates are obtained by deriving the objective (while accounting for the non-negativity constraints) with respect to S_ij and G_i respectively

46 of 62

Overview of FUSE

Fuse sequence similarities and network wiring patterns over all proteins in all PPI networks being aligned

Create a Global Multiple Network Alignment

47 of 62

Global network alignment step

Find a one-to-one Global Multiple Network Alignment

Create a k-partite graph, where k is the number of species
Finding approximately maximum weight k-partite matching

48 of 62

Create a k-partite weighted graph

Recompute the new similarity based on sequence and network

Select top 5% of the associations of each protein in a species
Add back entries that were set to 0 by NMTF but have sequence similarity using a weighted sum of sequence and NMTF-based similarity

This produces a weighted k-partite graph

49 of 62

A 3-partite weighted graph

Species 1

Species 2

Species 3

a

b

c

α

β

γ

δ

W

X

Y

Z

50 of 62

Algorithm to find the best matching

Matching between two node sets is defined as a one-to-one mapping

Weighted k-partite matching for k>2 is NP-hard

Need a heuristic approach

51 of 62

Heuristic algorithm to find a maximum k-partite matching

K-partite graph

52 of 62

Graph merge step

Let our k-partite graph be

Let F_ij be a matching between nodes in V_i and V_j, where u_i in V_i is mapped to v_j in V_j
Create new vertices V_ij from the matching, each vertex represented by a pair of nodes one from each graph

This step is like creating an alignment graph!

53 of 62

Bi-partite matching example

Species i

a

b

α

β

γ

δ

c

Species j

Species i

a

b

α

β

γ

δ

Species j

Find a (maximal) matching (F_ij)

c

54 of 62

Graph merge from matching

a

b

α

β

γ

δ

c

Merged graph of species 1, 2

The merged nodes inherit the edges of the constituent nodes

New graph with all species

a

b

c

α

β

γ

δ

W

X

Y

Z

a

b

α

β

γ

δ

c

W

X

Y

Z

Note, this is not a matching

55 of 62

Results

Dataset: Protein-protein interaction networks for five species

Human (H. sapiens), mouse (M. musculus), fly (D. melanogaster), worm (C. elegans), yeast (S. cerevisiae)

Experiments

Assess the inferred functional orthologies based on similarity in annotation
Compare against other methods

56 of 62

Statistics of PPI networks used

Species

57 of 62

NMTF induces new and reconstructs existing associations between proteins

Apply PCA to estimate k_i, the number of clusters/factors for each species

k₁= 80 (human); k₂ = 90 (yeast); k₃= 80 (fly); k₄= 70 (mouse) and k₅=50 (worm)

1,477, 372 interactions based on sequence
5% edges inferred corresponds to 19, 175, 378

Covers 60% of the sequence-only edges
What happens to the 40% edges?

Compare the reconstructed (60%), predicted and non-reconstructed (40%) pairs
Count the number of sequence-similar neighbors in each network

Pairs with reconstructed similarities or new similarities are connected to many more similar neighbors (20.4 on average)
Pairs with new similarities are also connected to neighbors with high sequence similarity (12.1)
Pairs that are not reconstructed have much lower sequence similarity in their neighborhood (8.6).

58 of 62

Do the new similarities make sense?

Compute the cumulative number of associations between annotated proteins and the percentage of them sharing GO term (Biological Process (BP) and Molecular Function (MF) annotations separately).

59 of 62

Comparison against other algorithms

Algorithms compared

Beams
Smetana
CSRW
NH
IsoRank
NetCoffee

Evaluation metrics

Coverage

Good clusters: cover all five PPI networks
Bad clusters: cover less than five PPIs
Computed at the cluster and protein level

Functional consistency

An annotated cluster is consistent if all of its annotated proteins have at least one common annotation

Did not finish in time

60 of 62

FUSE produces the largest number of good clusters

Fraction of blue is highest for clusters and proteins for FUSE.

Good: inclusion of as many species as possible

61 of 62

FUSE produces functionally consistent clusters

A cluster is said to be functionally consistent, if all its annotated proteins have at least one GO term in common.

62 of 62

References

N. Atias and R. Sharan, "Comparative analysis of protein networks: Hard problems, practical solutions," Commun. ACM, vol. 55, no. 5, pp. 88-97, May 2012. [Online]. Available: http://dx.doi.org/10.1145/2160718.2160738
B. P. Kelley, R. Sharan, R. M. Karp, T. Sittler, D. E. Root, B. R. Stockwell, and T. Ideker, "Conserved pathways within bacteria and yeast as revealed by global protein network alignment," Proceedings of the National Academy of Sciences, vol. 100, no. 20, pp. 11 394-11 399, Sep. 2003. [Online]. Available: http://dx.doi.org/10.1073/pnas.1534710100
R. Sharan, T. Ideker, B. Kelley, R. Shamir, and R. M. Karp, "Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data." Journal of computational biology : a journal of computational molecular cell biology, vol. 12, no. 6, pp. 835-846, Jul. 2005. [Online]. Available: http://dx.doi.org/10.1089/cmb.2005.12.835
Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences. 2008;105(35):12763-12768. doi:10.1073/pnas.0806627105
Gligorijević, V., Malod-Dognin, N. & Pržulj, N. Fuse: multiple network alignment via data fusion. Bioinformatics 32, 1195–1203 (2016).

3	-1	-1	-1	-1	0	0
-1	4	-1	-1	-1	0	0
-1	-1	3	-1	-1	0	0
-1	-1	-1	4	-1	0	0
-1	-1	-1	-1	4	0	0
0	0	0	0	0	1	-1
0	0	0	0	0	-1	1

3	-1	-1	-1	-1	0	0
-1	4	-1	-1	-1	0	0
-1	-1	3	-1	-1	0	0
-1	-1	-1	4	-1	0	0
-1	-1	-1	-1	4	0	0
0	0	0	0	0	1	-1
0	0	0	0	0	-1	1

3	-1	-1	-1	-1	0	0
-1	4	-1	-1	-1	0	0
-1	-1	3	-1	-1	0	0
-1	-1	-1	4	-1	0	0
-1	-1	-1	-1	4	0	0
0	0	0	0	0	1	-1
0	0	0	0	0	-1	1