1 of 62

Comparison and alignment of networks

Oct 31st, Nov 5th 2024

BMI/CS 775 Computational Network Biology�Fall 2024

Sushmita Roy

https://compnetbiocourse.discovery.wisc.edu

2 of 62

Plan for the semester

2

When

What

Week 2-Week 4

Representing and learning networks from data

Week 5-Week 7

Supervised deep learning in network biology

Week 8-Week 9

Graph topology and modules

Week 9-Week 10

Graph alignment and comparison

Week 11-Week 13

Network-based integration and interpretation

Week 14-Week 15

Project presentations

3 of 62

Plan for this section

  • Global alignment of protein-interaction networks (Oct 31st)
    • Matrix factorization: FUSE

  • Graph-based alignment for single cell omic datasets (Nov 5th)

4 of 62

Goals for today

  • Introduction to the network alignment problem
  • Aligning networks across species
  • Global network alignment
    • Matrix factorization
    • FUSE: A matrix factorization method for network alignment

5 of 62

Broad definition of graph alignment

  •  

6 of 62

Applications of network alignment

Alignment of scRNA-seq datasets

Alignment of molecular networks

  • PathBLAST
  • IsoRank
  • FUSE
  • SCANORAMA
  • CONOS
  • LIGER
  • Seurat anchoring

7 of 62

Distinguishing the two alignment tasks

  • Network alignment across species
    • We also have a phylogenetic tree (information of how organisms are related)
    • Nodes correspond to genes/proteins
    • Node correspondence relies heavily on sequence similarity of nodes across species
  • Network alignment across datasets
    • We might not have a known hierarchy/relationship across datasets
    • Nodes correspond to cells
    • Node correspondence relies on similarity of measured gene expression in each cell

8 of 62

Goals for today

  • Introduction to the network alignment problem
  • Aligning networks across species
  • Global network alignment
    • Matrix factorization
    • FUSE: A matrix factorization method for network alignment

9 of 62

How are these organisms related?

Toh et al, Nature, 2011

10 of 62

Zoonomia project

A phylogenetic tree for ~200 mammals

11 of 62

Organisms can be compared at multiple levels

  • Comparison at the sequence level
    • Sequence alignment
    • Phylogenetic tree construction
  • Comparison at the expression level

  • Comparison at the network level

12 of 62

Some terminology

  • Homology
    • Two sequences are said to be homologous if they are derived from a common ancestral sequence
  • Orthology
    • Two proteins in two organisms are said to be orthologs of each other if they are related by a common ancestor
  • Paralogy
    • Two proteins that are related by a duplication event within a species
  • Match, mismatch, gaps
    • Terms used in sequence alignment
  • BLAST
    • A software program used to align two molecular sequences that provides a statistical score (BLAST E-value) used to assess the quality of the alignment

13 of 62

How do sequences change between organisms?

  • Substitutions

  • Deletions

  • Insertions

T H I S S E Q U E N C E

T H A T S E Q U E N C E

Sequence 2

Sequence 1

T H I S I S A S E Q U E N C E

T H I S _ _ _S E Q U E N C E

Sequence 2

Sequence 1

mismatch

gap

_ _ _ _S E Q U E N C E

T H I S S E Q U E N C E

Sequence 2

Sequence 1

gap

14 of 62

Alignment problems

  • Sequence alignment
    • Given the genomic sequence of two species, find the differences and similarities of the sequence
    • Aims to find a correspondence between the positions of two sequences while minimizing the number of substitutions and gaps
  • Network alignment
    • Given molecular interaction networks from different organisms find the differences and similarities between them at the subnetwork level
    • Aims to find a correspondence between the positions of two networks while minimizing the number of substitutions and gaps on the network

15 of 62

Sequence Alignment Examples

T H I S S E Q U E N C E

T H A T S E Q U E N C E

T H + + S E Q U E N C E

Sequence 2

Sequence 1

T H I S - - - S E Q U E N C E

T H A T I S A S E Q U E N C E

T H + + S E Q U E N C E

Sequence 2

Sequence 1

16 of 62

Gaps and mismatches in network alignment

  • Positions
    • The nodes in the graph
  • Mismatch
    • Occurs when aligned proteins in the network alignment do not share sequence homology
    • This boils down to pair (A,B) in one species aligned to pair (a,b) in another species with unaligned intermediate nodes.
  • Gap
    • Occurs when a protein interaction in one path skips over a protein in the other network

17 of 62

Network alignment example

The network alignment problem of finding conserved paths is equivalent to finding a high scoring path in a new network called the “network alignment” graph.

Each vertex in the alignment graph is represented by a pair of vertices and a corresponding numeric score (BLAST E-value) that specifies the sequence similarity

Each edge corresponds to to “interactions” in the original graph

18 of 62

Why is network alignment across species important?

  • Important from an evolutionary perspective
    • Are interactions of proteins with similar sequence conserved?
    • How do networks evolve?
    • Is there a minimal set of interactions common to all species?
  • Refine existing interaction networks

19 of 62

What makes network alignment difficult?

  • The set of genes/proteins between species are not the same
  • The correspondence between genes of one species and the genes of another species is not one-to-one
    • Although many algorithms assume one-to-one mapping
  • Underlying networks might be noisy and/or incomplete

20 of 62

Different network alignment problems

Local network alignment: Find locally similar subnetworks

Network query: Find instances of a small subnetwork in a larger network

Global network alignment: Align all nodes in one network to all nodes in the second network

Nir Atias and Roded Sharan, May 2012, ACM Communications

21 of 62

Different types of network alignment problems

Problem type

Description

Methods

Local alignment

Align small parts of the network

PathBLAST, LocalAli

Global alignment

Align the entire network

FUSE, IsoRank

Pairwise alignment

Align two networks

PathBLAST

Multiple network alignment

Align more than two networks

FUSE, IsoRank, LocalAli

NetworkQuery

Search for a small network in a larger network

NetGrep, QNet, QPath

Adapted from Nir Atias and Roded Sharan, May 2012, ACM Communications

22 of 62

Goals for today

  • Introduction to the network alignment problem
  • Aligning networks across species
  • Global network alignment
    • Matrix factorization
    • FUSE: A matrix factorization method for network alignment

23 of 62

Matrix factorization

  • A popular data analysis technique used for high-dimensional datasets
  • Decomposition and factorization used interchangeably
  • Many applications:
    • visualization, pattern extraction, interpretation and imputation, smoothing

Stein-O’Brien et al, Trends in Genetics 2018

R

W

H

24 of 62

Many different variants of MF

  • Singular value decomposition

  • Penalized matrix factorization

  • Non-negative matrix factorization

  • Matrix tri-factorization

25 of 62

Non-negative matrix factorization (NMF)

25

Minimize

Lee and Seung Adv. Neur. In. 2001

 

 

Slide credit Erika Da-Inn Lee

26 of 62

Non-negative matrix factorization (NMF)

26

5

5

4

4

4

5

4

4

4

4

3

4

3

1

4

3

2

4

4

2

3

1

4

4

5

3

4

2

3

2

X = ℝn×m

Slide credit Erika Da-Inn Lee

27 of 62

Non-negative matrix factorization (NMF)

27

5

5

4

4

4

5

4

4

4

4

3

4

3

1

4

3

2

4

4

2

3

1

4

4

5

3

4

2

3

2

X = ℝn×m

W = ℝn×k

H = ℝk×m

k << n, m

Slide credit Erika Da-Inn Lee

28 of 62

NMF can be used for bi-clustering

H = ℝk×m

W = ℝn×k

Row clusters

Column clusters

29 of 62

Incorporating prior knowledge to constrain the factorization

29

5

5

4

4

4

5

4

4

4

4

3

4

3

1

4

3

2

4

4

2

3

1

4

4

3

2

3

2

Slide credit Erika Da-Inn Lee

30 of 62

Graph regularization to incorporate domain prior knowledge

30

Minimize

 

3

-1

-1

-1

-1

0

0

-1

4

-1

-1

-1

0

0

-1

-1

3

-1

-1

0

0

-1

-1

-1

4

-1

0

0

-1

-1

-1

-1

4

0

0

0

0

0

0

0

1

-1

0

0

0

0

0

-1

1

Slide credit Erika Da-Inn Lee

31 of 62

Non-negative matrix tri-factorization (NMTF)

Minimize

 

 

k1

k2

32 of 62

NMTF for biclustering

  • Like NMF, NMTF can be used for clustering of row and column entities
  • These clusters can be similarly constrained as in NMF with a row/column-specific graph

33 of 62

Cluster indicator matrix

  • Let k be the total number of clusters
  • Let G be a cluster indicator matrix
    • an nXk matrix which specifies the cluster ID for each entity

Objects

Clusters

Example for 5 objects in 2 clusters

34 of 62

NMTF with constraints

Semi-Supervised Clustering via Matrix Factorization, Wang et al, 2008

Two types of objects: people and movies

Movies can be grouped based on actors, characters, titles

People can be grouped based on their hobbies and jobs

R12

X1

X2

Blue: Must link

Red: Cannot link

35 of 62

NMTF for constrained bi-clustering

  • For two entity types 1 and 2

Trade-off between maintaining intra-type constraints and estimating Rij

 

36 of 62

Goals for today

  • Introduction to the network alignment problem
  • Aligning networks across species
  • Global network alignment
    • Matrix factorization
    • FUSE: A matrix factorization method for network alignment

37 of 62

Motivation of FUSE

  • How to do multiple global network alignment?
  • In existing approaches the sequence-based node mapping is local, that is one pair at a time.
  • Can we improve this mapping by using protein-protein interaction networks in each species?

38 of 62

FUSE multiple network alignment

  • Given
    • Protein-protein interaction networks for k species
    • Pairwise sequence similarities for pairs of proteins from each pair of species
  • Do
    • Find a global one-to-one mapping between network nodes

39 of 62

FUSE Notation

  • Ni =(Vi, Ei) denotes the vertex and edge set for a PPI network for species i
  • Let ni denote the number of proteins in species i
  • Let Rij denote an niXnj sequence similarity matrix (E-values) of proteins from species i and j
  • Let Gi and Gj specify the cluster assignments of proteins in i and j
  • Let Sij is a kiXkj lower-dimensional approximation of Rij
    • where ki<<ni and kj<<nj
    • ki and kj can be thought of as the number of independent groups in Ni and Nj

40 of 62

Overview of FUSE

  • Fuse sequence similarities and network wiring patterns over all proteins in all PPI networks being aligned

  • Create a Global Multiple Network Alignment

41 of 62

Fuse step

  • Based on Non-negative matrix tri-factorization (NMTF)
  • Derives functional scores between pairs of proteins using sequence and network information for k species

42 of 62

NMTF for Global network alignment

  • Each entity type is a species
  • Each entity is a protein from a species
  • Constraints are specified by the protein-protein interaction networks Ni
  • Rij is the pairwise functional similarity between proteins of species i and j

43 of 62

NMTF for k entity types

  • For entities of k different types, we have k different constraint graphs
  • We can write the objective as

This objective can be solved using an iterative multiplicative update algorithm from Wang 2008

44 of 62

Pictorial illustration for five species

45 of 62

Minimizing the objective NMTF

  • We need to find Sij, and Gi for all 1≤i,j≤k entity types
  • An iterative algorithm is used that updates each entity one at a time
  • Updates are obtained by deriving the objective (while accounting for the non-negativity constraints) with respect to Sij and Gi respectively

46 of 62

Overview of FUSE

  • Fuse sequence similarities and network wiring patterns over all proteins in all PPI networks being aligned

  • Create a Global Multiple Network Alignment

47 of 62

Global network alignment step

  • Find a one-to-one Global Multiple Network Alignment
    • Create a k-partite graph, where k is the number of species
    • Finding approximately maximum weight k-partite matching

48 of 62

Create a k-partite weighted graph

  • Recompute the new similarity based on sequence and network

  • Select top 5% of the associations of each protein in a species
  • Add back entries that were set to 0 by NMTF but have sequence similarity using a weighted sum of sequence and NMTF-based similarity

  • This produces a weighted k-partite graph

49 of 62

A 3-partite weighted graph

Species 1

Species 2

Species 3

a

b

c

α

β

γ

δ

W

X

Y

Z

50 of 62

Algorithm to find the best matching

  • Matching between two node sets is defined as a one-to-one mapping

  • Weighted k-partite matching for k>2 is NP-hard

  • Need a heuristic approach

51 of 62

Heuristic algorithm to find a maximum k-partite matching

K-partite graph

52 of 62

Graph merge step

  • Let our k-partite graph be

  • Let Fij be a matching between nodes in Vi and Vj, where ui in Vi is mapped to vj in Vj
  • Create new vertices Vij from the matching, each vertex represented by a pair of nodes one from each graph
    • This step is like creating an alignment graph!

53 of 62

Bi-partite matching example

Species i

a

b

α

β

γ

δ

c

Species j

Species i

a

b

α

β

γ

δ

Species j

Find a (maximal) matching (Fij)

c

54 of 62

Graph merge from matching

a

b

α

β

γ

δ

c

Merged graph of species 1, 2

The merged nodes inherit the edges of the constituent nodes

New graph with all species

a

b

c

α

β

γ

δ

W

X

Y

Z

a

b

α

β

γ

δ

c

W

X

Y

Z

Note, this is not a matching

55 of 62

Results

  • Dataset: Protein-protein interaction networks for five species
    • Human (H. sapiens), mouse (M. musculus), fly (D. melanogaster), worm (C. elegans), yeast (S. cerevisiae)
  • Experiments
    • Assess the inferred functional orthologies based on similarity in annotation
    • Compare against other methods

56 of 62

Statistics of PPI networks used

Species

57 of 62

NMTF induces new and reconstructs existing associations between proteins

  • Apply PCA to estimate ki, the number of clusters/factors for each species
    • k1= 80 (human); k2 = 90 (yeast); k3= 80 (fly); k4= 70 (mouse) and k5=50 (worm)
  • 1,477, 372 interactions based on sequence
  • 5% edges inferred corresponds to 19, 175, 378
    • Covers 60% of the sequence-only edges
    • What happens to the 40% edges?
  • Compare the reconstructed (60%), predicted and non-reconstructed (40%) pairs
  • Count the number of sequence-similar neighbors in each network
    • Pairs with reconstructed similarities or new similarities are connected to many more similar neighbors (20.4 on average)
    • Pairs with new similarities are also connected to neighbors with high sequence similarity (12.1)
    • Pairs that are not reconstructed have much lower sequence similarity in their neighborhood (8.6).

58 of 62

Do the new similarities make sense?

Compute the cumulative number of associations between annotated proteins and the percentage of them sharing GO term (Biological Process (BP) and Molecular Function (MF) annotations separately).

59 of 62

Comparison against other algorithms

  • Algorithms compared
    • Beams
    • Smetana
    • CSRW
    • NH
    • IsoRank
    • NetCoffee
  • Evaluation metrics
    • Coverage
      • Good clusters: cover all five PPI networks
      • Bad clusters: cover less than five PPIs
      • Computed at the cluster and protein level
    • Functional consistency
      • An annotated cluster is consistent if all of its annotated proteins have at least one common annotation

Did not finish in time

60 of 62

FUSE produces the largest number of good clusters

Fraction of blue is highest for clusters and proteins for FUSE.

Good: inclusion of as many species as possible

61 of 62

FUSE produces functionally consistent clusters

A cluster is said to be functionally consistent, if all its annotated proteins have at least one GO term in common.

62 of 62

References

  • N. Atias and R. Sharan, "Comparative analysis of protein networks: Hard problems, practical solutions," Commun. ACM, vol. 55, no. 5, pp. 88-97, May 2012. [Online]. Available: http://dx.doi.org/10.1145/2160718.2160738
  • B. P. Kelley, R. Sharan, R. M. Karp, T. Sittler, D. E. Root, B. R. Stockwell, and T. Ideker, "Conserved pathways within bacteria and yeast as revealed by global protein network alignment," Proceedings of the National Academy of Sciences, vol. 100, no. 20, pp. 11 394-11 399, Sep. 2003. [Online]. Available: http://dx.doi.org/10.1073/pnas.1534710100
  • R. Sharan, T. Ideker, B. Kelley, R. Shamir, and R. M. Karp, "Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data." Journal of computational biology : a journal of computational molecular cell biology, vol. 12, no. 6, pp. 835-846, Jul. 2005. [Online]. Available: http://dx.doi.org/10.1089/cmb.2005.12.835
  • Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences. 2008;105(35):12763-12768. doi:10.1073/pnas.0806627105
  • Gligorijević, V., Malod-Dognin, N. & Pržulj, N. Fuse: multiple network alignment via data fusion. Bioinformatics 32, 1195–1203 (2016).