1 of 67

Topological properties of graphs

Nov 4th 2021

BMI 826-23 Computational Network Biology�Fall 2021

Sushmita Roy

https://compnetbiocourse.discovery.wisc.edu

2 of 67

Overview of lecture topics

Biological problem

    • Mapping regulatory network structure
    • Dynamics and context specificity of networks
    • Modularity in biological networks
    • Comparison of biological networks
    • Identification of important genes
    • Integrating different types of molecular genomic data
    • Predicting protein interfaces

Computational approaches

    • Probabilistic graphical models
    • Graph structure learning
    • Multiple network learning
    • Graph clustering
    • Graph alignment
    • Diffusion on graphs
    • Deep learning

Course material is organized by the biological problem and computational approaches to address the problem

3 of 67

Topics in this section

  • Topological properties of networks (Today)

  • Modules and graph clustering (Today)

  • Algorithms for graph clustering (Nov 9th, 11th)

  • Dynamic module identification (Nov 11th)

4 of 67

Goals for today

  • Commonly measured network properties
    • Degree distribution
    • Average shortest path length
    • Network motifs
    • Modularity of biological networks

  • Algorithms to find modules on graphs
    • Clustering
    • Girvan Newman algorithm

5 of 67

Why should we care about network measures?

  • Are there specific properties that characterize real world networks?
  • How can we generate networks with such properties?

Probably the most important discovery of network theory was the realization that despite the remarkable diversity of networks in nature, their architecture is governed by a few simple principles that are common to most networks of major scientific and technological interestBarabasi and Oltvai, Nature Genetics Review 2004

6 of 67

Commonly measured network properties

  • Degree distribution
  • Average shortest path length
  • Network motifs
  • Modularity of biological networks

7 of 67

Recall node degree

  • Undirected network
    • Degree, k: Number of neighbors of a node
  • Directed network
    • In degree, kin: Number of incoming edges
    • Out degree, kout: Number of outgoing edges

C

B

A

D

G

H

E

F

  • In degree of B is 1
  • Out degree of A is 2

8 of 67

Average degree

  • Consider an undirected network with N nodes
  • Let ki denote the degree of node i
  • Average degree is

9 of 67

Degree distribution

  • P(k) the probability that a node has k edges
  • Different networks can have different degree distributions
  • A fundamental property that can be used to characterize a network

10 of 67

Different degree distributions

  • Poisson distribution
    • The mean is a good representation of ki of all nodes
    • Networks that have a Poisson degree distribution are called Erdös Renyi or random networks

  • Power law distribution
    • Also called scale free
    • There is no “typical” node that captures the degree of nodes.

11 of 67

Poisson distribution

  • A discrete distribution

  • The Poisson is parameterized by which can be easily estimated by maximum likelihood

k

P(X=k)

12 of 67

Power law distribution

  • Used to capture the degree distribution of most real networks

  • Typical value of is between 2 and 3.

  • MLE exists but is more complicated
    • See Power-Law Distributions in Empirical Data. Clauset, Shalizi and Newman, 2009 for details

P(k)

k

Barabasi and Oltvai, Nature Genetics Review 2004

13 of 67

Erdös Renyi random graphs

  • Dates back to 1960 due to two mathematicians Paul Erdös and Alfred Renyi.
  • Provides a probabilistic model to generate a graph
  • Starts with N nodes and connects two nodes with probability p
  • Node degrees follow a Poisson distribution
  • Tail falls off exponentially, suggesting that nodes with degrees different from the mean are very rare

14 of 67

Scale free networks

  • Degree distribution is captured by a power law distribution
  • There is no “typical” node that describes the degree of all other nodes
  • Such networks are ubiquitous in nature

15 of 67

Poisson versus Scale free

Barabasi & Oltvai 2004, Nature Genetics Review

16 of 67

Yeast protein interaction network is believed to be scale free

  • “Whereas most proteins participate in only a few interactions, a few participate in dozens”

  • Such high degree nodes are called hubs

Barabasi & Oltvai 2004, Nature Genetics Review

17 of 67

Degree of a node is correlated to functional importance of a node

  • Hubs maybe associated with important biological function
  • Red nodes on deletion cause the organism to die
  • Red nodes also among the nodes with the highest degree

18 of 67

Yeast PPI network degree distribution

5,905 nodes

86,286 edges

19 of 67

Origin of scale free networks

  • Scale free networks are ubiquitous is nature
  • How do such networks form?
  • Such networks are the result of two processes
    • a growth process where new nodes join the network over an extended period of time
      • Think about how the internet has grown
    • preferential attachment: new nodes tend to connect to nodes with many neighbors
      • Rich get richer.

Barabasi & Oltvai 2004, Nature Genetics Review

20 of 67

Growth and preferential attachment in scale free networks

A new node (red) is more likely to connect to node 1

than 2

Barabasi & Oltvai 2004, Nature Genetics Review

21 of 67

Commonly measured network properties

  • Degree distribution
  • Average shortest path length
  • Network motifs
  • Modularity of biological networks

22 of 67

Paths on a graph

C

B

A

D

G

H

E

Path: a set of connected edges from one node to another

Paths from B to D

F

There are two paths from B-D: B->A->D and B->G->H->A->D

23 of 67

Path lengths

  • The shortest path length between two nodes A and B:
    • The smallest number of edges that need to be traversed to get from A to B
  • Mean path length is the average of all shortest path lengths
  • Diameter of a graph is the longest of all shortest paths in the network

24 of 67

Scale-free networks tend to be ultra-small

  • Two nodes on the network are connected by a small number of edges
  • Average path length is proportional to log(log(N)), where N is the number of nodes in the network
  • In a random network (Erdös Renyi network) the average path length is proportional to log(N)

25 of 67

Yeast protein interaction network

Avg. path length: 2.6

Diameter: 6

Nodes: 5,905

Edges: 86,286

26 of 67

Commonly measured network properties

  • Degree distribution
  • Average shortest path length
  • Network motifs
  • Modularity of biological networks

27 of 67

Network motifs

  • Network motifs:
    • small recurring subgraphs that occur much more than a randomized network
  • Often called “building blocks” of complex networks
  • Biological networks tend to exhibit certain types of network motifs
  • Some motifs are associated with specific network dynamics

Milo Science 2002

28 of 67

Network motifs of size 3 in a directed network

Additional possibilities if we consider the sign of an edge

Milo Science 2002

29 of 67

Different types of feed-forward motifs

Coherent: sign of the direct path is the same as the indirect path

Incoherent: sign of the direct path and indirect path are not the same

Alon 2007, Nature Review Genetics

These appear much more in transcriptional networks than other FF motifs

30 of 67

Network motifs are associated with dynamics

A feed-forward motif with an AND gate can encode a sign sensitive delay

Alon 2007, Nature Review Genetics

31 of 67

Network motifs found in many complex networks

The occurrence of the feed-forward loop in both networks suggests a fundamental similarity in the design on these networks

Milo et al., Science 2002

32 of 67

Structural common motifs seen in the yeast regulatory network

Lee et.al. 2002, Mangan & Alon, 2003

Auto-regulation

Multi-component

Feed-forward loop

Single Input

Multi Input

Regulatory Chain

Feed-forward loops involved in speeding up in response of target gene

33 of 67

How to find network motifs?

  • Given an input network G we need to address two problems
    • Subgraph enumeration: Find which subgraphs occur in G and how many times
    • Significance of the number of occurrences: Compare to the number of occurrences of subgraphs in randomized networks
  • Software to find network motifs

Wernicke 2005: http://theinf1.informatik.uni-jena.de/publications/network-motifs-wabi05.pdf

34 of 67

Algorithm for enumerating subgraphs: Edge sampling

N(V’): set of neighbors of all vertices in V’

Kashtan et al., Bioinformatics 2004

35 of 67

Generating a randomized network

  • While an Erdös Renyi network is random, it does not have the same degree distribution as a given network
  • How to generate a randomized network with the same degree distribution?

36 of 67

Strategy to generate randomized networks (Switching method)

1

2

3

4

5

1

2

3

4

5

Select two edges connecting four vertices and swap the end points.

Repeat.

37 of 67

Assessing the significance of a network motif

The motif (red dashed edges) occurs much more frequently in the real network than

in any randomized network

38 of 67

Commonly measured network properties

  • Degree distribution
  • Average shortest path length
  • Network motifs
  • Modularity of biological networks

39 of 67

Modularity in networks

  • Modularity “refers to a group of physically or functionally linked nodes that work together to achieve a distinct function” -- Barabasi & Oltvai
  • Modularity is an important principle of biological systems
  • Genes tend to interact with a select set of other genes exhibiting a clustering of interactions
  • Module detection can help to
    • Understand the organizational properties of the network
    • Can be used to predict function of genes based on their grouping behavior

40 of 67

A modular network

Module 1

Module 2

Module 3

M. E. J. Newman, Modularity and community structure in networks, PNAS 2006

41 of 67

Different types of network modules

  • Topological modules
    • Defined solely based on the graph connectivity of nodes
  • Functional modules
    • Based on graph connectivity & other node attributes
    • Can be further grouped into
      • Active modules
      • Integrative modules
      • Disease modules

Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo, Nature Review Genetics 2011, Mitra et al., Nature Review Genetics 2014

42 of 67

Studying modularity in biological systems

  • Given a network is it modular?
    • Clustering coefficient
    • Q Modularity: measures the relative density of edges between and within the groups
  • Given a network what are the modules in the network?

43 of 67

Clustering coefficient

  • Measure of transitivity in the network that asks
    • If A is connected to B, and B is connected to C, how often is A connected to C?
  • Clustering coefficient Ci for each node i is

  • ki Degree of node i
  • ni is the number of edges among neighbors of i
  • Average clustering coefficient gives a measure of “modularity” of the network

A

B

C

?

44 of 67

Clustering coefficient example

A

C

B

G

D

45 of 67

Q measure for modularity

  • Suppose the nodes in the graph belong to K groups (communities)
  • Modularity (Q) can be assessed as follows:
    • difference between within group (community) connections and expected connections within a group
  • This measure assess how good a particular grouping is

K: number of groups

eij: Fraction of total edges that link nodes in group i to group j

Detecting community structure in networks. M.E. J Newman

Fraction of edges within group i

Fraction of edges without regard to community structure

46 of 67

Summary of topological properties of networks

  • Given a network, its topology can be characterized using different measures
    • Degree distribution
    • Average path length
    • Clustering coefficient (also used to measure modularity of a network)
  • Degree distribution can be
    • Poisson
    • Power law or scale free
  • Network motifs
    • Building blocks of complex networks
    • Over-represented subgraphs of specific types
  • Network modularity
    • Clustering coefficient
    • Q modularity

47 of 67

Goals for today

  • Commonly measured network properties
    • Degree distribution
    • Average shortest path length
    • Network motifs
    • Modularity of biological networks
  • Algorithms to find modules on graphs
    • Clustering
    • Girvan Newman algorithm

48 of 67

Given a graph, what are the modules

  • Given a graph find the modules
    • Modules are represented by densely connected subgraphs
  • The graph can be partitioned into modules using “Graph clustering” or “Graph partitioning” or “Community Detection”
  • Clusters, modules are also called “communities”

49 of 67

Problem definition of (topological) module detection

  • Given:
    • A graph
    • A measure of cluster quality
  • Do
    • Partition the vertices into groups such that they form densely connected subgraphs (e.g. as measured by the cluster quality)

50 of 67

Common graph clustering algorithms

  • Hierarchical or flat clustering using a notion of similarity between nodes
  • Girvan-Newman algorithm
  • Hierarchical Agglomerative clustering
  • Spectral clustering
  • Stochastic block models
  • Markov clustering algorithm
  • Affinity propagation

51 of 67

Clustering

  • Types of clustering
    • Flat clustering
      • K-means
      • Gaussian mixture models
    • Hierarchical clustering
  • Clustering algorithms differ in the distance measure used to group objects together

52 of 67

Task definition: clustering objects

  • Given: attributes for a set of objects we wish to cluster
  • Do: organize objects into groups such that
    • Objects in the same cluster are highly similar to each other
    • Objects from different clusters have low similarity to each other

53 of 67

Flat clustering

  • Cluster objects into K clusters
  • K : number of clusters is a user defined argument
  • Two example algorithms
    • K-means
    • Gaussian mixture model-based clustering

54 of 67

Hierarchical clustering

  • Hierarchical clustering is a widely used clustering technique
  • Instead of the number of clusters, it requires us to specify how much dissimilarity we will tolerate between groups
  • The hierarchical clustering is represented by a tree structure called a dendrogram

Slides adapted from Prof. Mark Craven; BMI 576

55 of 67

Types of hierarchical clustering strategies

  • Agglomerative (bottom-up)
    • Start from the individual objects
    • Group objects or clusters of objects
  • Divisive (or top-down)
    • Start from all objects in a single cluster
    • Break down each cluster into smaller clusters

56 of 67

Hierarchical clustering

leaves represent objects to be clustered (e.g. genes or samples)

height of bar indicates

degree of distance

within cluster

distance scale

0

Slides from Prof. Mark Craven

57 of 67

Flat clustering from a hierarchical clustering

  • We can always generate a flat clustering from a hierarchical clustering by “cutting” the tree at some distance threshold

cutting here results

in 2 clusters

cutting here results

in 4 clusters

Slides from Prof. Mark Craven

58 of 67

Hierarchical clustering to find modules on graphs

  • What is a good measure of similarity to cluster nodes on a graph?
  • One approach is to use local topological overlap tij

N1(i): the set of immediate neighbors of i

aij: corresponding entry in the adjacency matrix of the graph

Yip and Horvath, BMC Bioinformatics 2007; WGCNA: an R package for weighted correlation network analysis. Peter Langfelder and Steve Horvath

59 of 67

Community structure in a graph

  • A graph that has a grouping (community) structure is going to have few intercommunity edges.
  • Community structure can be revealed by removing such intercommunity edges

Detecting community structure in networks, M. E. J. Newman

Communities

Intercommunity edges

60 of 67

Girvan-Newman algorithm

  • General idea: “If two communities are joined by only a few inter-community edges, then all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges.”
  • Community structure can be revealed by removing edges that with high betweenness
  • Algorithm is based on a divisive clustering idea

M. E. J. Newman and M. Girvan. Finding and evaluating community structure

61 of 67

Betweenness of an edge

  • Betweenness of an edge e is defined as the number of shortest paths that include e
  • Edges that lie between communities tend to have high betweenness

62 of 67

Girvan-Newman algorithm

  • Initialize
    • Compute betweenness for all edges
  • Repeat until convergence criteria
    1. Remove the edge with the highest betweenness
    2. Recompute betweenness of remaining edges

  • Convergence criteria can be
    • No more edges
    • Desired modularity

M. E. J. Newman and M. Girvan. Finding and evaluating community structure

63 of 67

Girvan-Newman algorithm as a hierarchical clustering algorithm

  • One can view this algorithm as a top-down (divisive) hierarchical clustering algorithm
  • The root of the dendrogram groups all nodes into one community
  • Each branch of the tree represents the order of splitting the network as edges are removed

Graph vertices

64 of 67

Zachary’s karate club study

  • A well-known example of a social network with community structure
  • Dataset collected by Wayne Zachary over 2 years who observed social interactions among members of a karate club
  • Due to a dispute the club split into two factions
  • Can a graph clustering/module detection algorithm predict the factions?

Each node is an individual and edges represent social interactions among individuals. The shape and colors represent different groups.

65 of 67

Girvan Newman on Zachary’s karate club dataset

Each node is an individual and edges represent social interactions among individuals. The shape and colors represent different groups.

Node grouping based on betweenness

Modularity

66 of 67

Take away points

  • Biological networks are modular
  • Modules can be topological or functional
  • Modularity can be measured using
    • Clustering coefficient
    • Q measure
  • We have seen one example of topological clustering algorithms
    • Girvan-Newman algorithm
      • based on edge-betweenness
      • Can be viewed as top-down/divisive clustering algorithm

67 of 67

References

  • A.-L. Barabasi and Z. N. Oltvai, "Network biology: understanding the cell's functional organization," Nature Reviews Genetics, vol. 5, no. 2, pp. 101-113, Feb. 2004. [Online]. Available: http://dx.doi.org/10.1038/nrg1272
  • R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: Simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824-827, Oct. 2002. [Online]. Available: http://dx.doi.org/10.1126/science.298.5594.824
  • U. Alon, "Network motifs: theory and experimental approaches," Nature Reviews Genetics, vol. 8, no. 6, pp. 450-461, Jun. 2007. [Online]. Available: http://dx.doi.org/10.1038/nrg2102
  • S. Wernicke and F. Rasche, "FANMOD: a tool for fast network motif detection," Bioinformatics, vol. 22, no. 9, pp. 1152-1153, May 2006. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btl038
  • M. E. J. Newman and M. Girvan. “Finding and evaluating community structure in networks”, 2003, 10.1103/PhysRevE.69.026113
  • M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences, vol. 103, no. 23, pp. 8577-8582, Jun. 2006. [Online]. Available: http://dx.doi.org/10.1073/pnas.0601602103
  • A. Clauset, C. R. Shalizi, and M. E. J. Newman, "Power-law distributions in empirical data," SIAM Review, vol. 51, no. 4, pp. 661-703, Feb. 2009. [Online]. Available: http://dx.doi.org/10.1137/070710111