Network-based integration with many samples
Nov 25th, 2025
Original slides created by Prof. Sushmita Roy
These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Sushmita Roy and Anthony Gitter
Topics in this section
Biological data is of many different types
Image credit: TCGA, Gligorevic et al., Proteomics 2015
Today: many samples within a cancer type
Data integration with many samples
Challenges in clustering with multiple data types
Clustering idea 1: joint similarity measure
Feature values for two samples, drawn from two data types (protein activity and DNA methylation):
Sample 1 | Sample 2 |
1 | 1 |
3 | 9 |
4 | 4 |
1 | 4 |
0 | 8 |
4 | 0 |
1 | 3 |
8 | 2 |
3 | 1 |
1 | 3 |
5 | 9 |
9 | 7 |
4 | 8 |
2 | 0 |
Vastly different number of features per data type
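A minimal sketch of why a joint distance over concatenated features can be dominated by the larger data type. The feature counts (5 protein features, 500 methylation features) and the random values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: few protein features, many methylation features.
n_protein, n_methylation = 5, 500
protein = rng.normal(size=(2, n_protein))          # rows = Sample 1, Sample 2
methylation = rng.normal(size=(2, n_methylation))

# Joint representation: concatenate the features of each sample.
joint = np.hstack([protein, methylation])

# Squared Euclidean distance splits into one term per data type.
d2_protein = np.sum((protein[0] - protein[1]) ** 2)
d2_methylation = np.sum((methylation[0] - methylation[1]) ** 2)
d2_joint = np.sum((joint[0] - joint[1]) ** 2)

# Fraction of the joint distance explained by methylation alone.
share = d2_methylation / d2_joint
print(f"methylation share of joint distance: {share:.2f}")
```

With features on a comparable scale, the data type with more features contributes most of the joint distance, so the smaller data type has little influence on the clustering.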
Clustering idea 1: joint similarity measure
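Data type | Sample 1 | Sample 2 |
Gene expression | 1 | 1 |
Gene expression | 3 | 9 |
Gene expression | 4 | 4 |
Gene expression | 1 | 4 |
Gene expression | 0 | 8 |
Gene expression | 4 | 0 |
Gene expression | 1 | 3 |
DNA mutations | 0 | 0 |
DNA mutations | 0 | 0 |
DNA mutations | 0 | 0 |
DNA mutations | 0 | 0 |
DNA mutations | 1 | 0 |
DNA mutations | 0 | 0 |
DNA mutations | 0 | 1 |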
Different data types have different values
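A short worked example using the values shown above. It assumes the first seven rows are gene expression and the last seven are DNA mutations; the split is inferred from the binary values, not stated on the slide.

```python
import numpy as np

# Values copied from the example above (assumed split: 7 expression rows,
# then 7 mutation rows). Rows are Sample 1 and Sample 2.
expression = np.array([[1, 3, 4, 1, 0, 4, 1],
                       [1, 9, 4, 4, 8, 0, 3]])
mutations = np.array([[0, 0, 0, 0, 1, 0, 0],
                      [0, 0, 0, 0, 0, 0, 1]])

# Contribution of each data type to the squared Euclidean distance.
d2_expression = int(np.sum((expression[0] - expression[1]) ** 2))
d2_mutations = int(np.sum((mutations[0] - mutations[1]) ** 2))
print(d2_expression, d2_mutations)  # 129 2
```

The two samples share no mutations at all, yet the mutation data contributes only 2 of the 131 units of joint squared distance because binary values are small compared with expression values.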
Clustering idea 2: ensemble clustering
Gene expression:
Sample 1 | Sample 2 |
1 | 1 |
3 | 9 |
4 | 4 |
1 | 4 |
0 | 8 |
4 | 0 |
1 | 3 |
DNA mutations:
Sample 1 | Sample 2 |
0 | 0 |
0 | 0 |
0 | 0 |
0 | 0 |
1 | 0 |
0 | 0 |
0 | 1 |
Expression clusters + Mutation clusters → Merged clusters
No opportunity to share information across data types or smooth noisy data
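A minimal sketch of the merging step in ensemble clustering. The slides do not say how the per-data-type clusters are merged; intersecting them is one simple rule assumed here, and `merge_clusterings` and the label vectors are hypothetical.

```python
def merge_clusterings(*label_lists):
    """Merge per-data-type cluster labels by intersection.

    Two samples share a merged cluster only if they share a cluster
    in every data type.
    """
    merged_ids = {}
    merged = []
    for combo in zip(*label_lists):
        # Assign a new merged id the first time a label combination appears.
        if combo not in merged_ids:
            merged_ids[combo] = len(merged_ids)
        merged.append(merged_ids[combo])
    return merged


# Hypothetical cluster labels for five samples.
expression_clusters = [0, 0, 1, 1, 1]
mutation_clusters = [0, 1, 1, 1, 0]
merged_clusters = merge_clusterings(expression_clusters, mutation_clusters)
print(merged_clusters)  # [0, 1, 2, 2, 3]
```

Each data type is clustered in isolation before this step, so a noisy value in one data type cannot be corrected by the other.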
Similarity Network Fusion
Similarity Network Fusion motivating example
A few samples have noisy values in each data type
Clustering by a single data type places some points in the wrong group
Similarity Network Fusion concept
Cancer genomics application: samples = patients = nodes
Edges = similarities
Similarity Network Fusion algorithm
Important assumption
Defining a similarity graph over patient samples
Euclidean distance squared
Hyperparameter
Scaling term
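The three labels above annotate a similarity kernel whose equation did not survive in this text. A sketch, assuming the scaled exponential kernel of the original SNF method, W(i,j) = exp(-rho(x_i, x_j)^2 / (mu * eps_ij)), where rho is Euclidean distance, mu is the hyperparameter, and eps_ij is the scaling term. The function name, `k`, and the default values are illustrative choices.

```python
import numpy as np


def affinity_matrix(X, k=3, mu=0.5):
    """Scaled exponential similarity between samples (rows of X)."""
    # Pairwise Euclidean distances rho(x_i, x_j).
    diff = X[:, None, :] - X[None, :, :]
    rho = np.sqrt(np.sum(diff ** 2, axis=-1))
    # Mean distance from each sample to its k nearest other samples
    # (column 0 of the sorted rows is the zero distance to itself).
    mean_knn = np.sort(rho, axis=1)[:, 1:k + 1].mean(axis=1)
    # Scaling term: average of the two neighborhood scales and rho_ij.
    eps = (mean_knn[:, None] + mean_knn[None, :] + rho) / 3.0
    eps = np.maximum(eps, 1e-12)  # guard against duplicate samples
    return np.exp(-(rho ** 2) / (mu * eps))


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # hypothetical data: 6 samples, 4 features
W = affinity_matrix(X)
```

The scaling term adapts the kernel width to the local density around each pair of samples, so one value of mu can serve data types with different scales.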
Creating a fused matrix
Assumes that the local similarities are the most reliable
Neighbors include the node itself
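A sketch of the two matrices derived from a similarity matrix W, assuming the normalizations of the original SNF method: a full matrix P with 1/2 on the diagonal, and a sparse local matrix S restricted to each sample's k nearest neighbors. Function names and the example W are hypothetical.

```python
import numpy as np


def full_kernel(W):
    """Row-normalized matrix P: 1/2 on the diagonal, 1/2 spread off it."""
    W = np.asarray(W, dtype=float)
    off_diag = W - np.diag(np.diag(W))
    P = off_diag / (2.0 * off_diag.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, k=3):
    """Sparse matrix S keeping only each sample's k most similar samples."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    # Indices of the k largest similarities per row. The node itself is
    # included, assuming its self-similarity is the row maximum.
    nearest = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, nearest] = W[rows, nearest]
    return S / S.sum(axis=1, keepdims=True)


# Hypothetical similarity matrix with two groups of two samples.
W = np.array([[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.9],
              [0.2, 0.1, 0.9, 1.0]])
P = full_kernel(W)
S = local_kernel(W, k=2)
```

Setting non-neighbor entries of S to zero encodes the assumption that local similarities are the most reliable.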
Iterate for fusion
Iteration with m=2 data types
For iteration t+1
Update similarity matrix of data type 1 using weight matrix from data type 2 and vice-versa
Renormalize P(v) after the update
Update similarity matrix of data type 1
Update similarity matrix of data type 2
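The update for m=2 can be sketched as follows, assuming the original SNF form P1(t+1) = S1 × P2(t) × S1^T and P2(t+1) = S2 × P1(t) × S2^T. The renormalization shown (1/2 on the diagonal, 1/2 spread across the rest of each row) and the number of iterations are assumptions; the random inputs and the dense stand-in for the local matrices are for illustration only.

```python
import numpy as np


def renormalize(P):
    """Rescale so each row has 1/2 on the diagonal and 1/2 spread off it."""
    off_diag = P - np.diag(np.diag(P))
    P = off_diag / (2.0 * off_diag.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def fuse_two(P1, P2, S1, S2, n_iter=20):
    """Cross-update two similarity matrices, then average them."""
    for _ in range(n_iter):
        # Both updates use the matrices from iteration t.
        P1_next = S1 @ P2 @ S1.T
        P2_next = S2 @ P1 @ S2.T
        P1, P2 = renormalize(P1_next), renormalize(P2_next)
    return (P1 + P2) / 2.0


def random_similarity(rng, n):
    """Hypothetical symmetric similarity matrix with unit diagonal."""
    A = rng.random((n, n))
    W = (A + A.T) / 2.0
    np.fill_diagonal(W, 1.0)
    return W


rng = np.random.default_rng(0)
W1, W2 = random_similarity(rng, 5), random_similarity(rng, 5)
P1, P2 = renormalize(W1), renormalize(W2)
# Dense row-normalized stand-in for the k-nearest-neighbor matrices.
S1 = W1 / W1.sum(axis=1, keepdims=True)
S2 = W2 / W2.sum(axis=1, keepdims=True)
fused = fuse_two(P1, P2, S1, S2)
```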
What is going on in the iteration step
We are updating the similarity matrix using the most confident common neighbors of i and j
Neighbor of i
Neighbor of j
Extending to m>2 data types
Average over all other data types
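A sketch of the generalization to m>2, assuming each data type v is updated as P(v) = S(v) × (average of P(k) over all k ≠ v) × S(v)^T. The function names, the fixed number of iterations, and the random inputs are hypothetical.

```python
import numpy as np


def renormalize(P):
    """Rescale so each row has 1/2 on the diagonal and 1/2 spread off it."""
    off_diag = P - np.diag(np.diag(P))
    P = off_diag / (2.0 * off_diag.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def fuse_many(P_list, S_list, n_iter=20):
    """Update each data type from the average of all other data types."""
    m = len(P_list)
    for _ in range(n_iter):
        total = sum(P_list)  # sum over all data types at iteration t
        P_list = [
            renormalize(S @ ((total - P) / (m - 1)) @ S.T)
            for P, S in zip(P_list, S_list)
        ]
    # Final fused matrix: average over data types.
    return sum(P_list) / m


# Three hypothetical data types over the same six samples.
rng = np.random.default_rng(1)
W_list = []
for _ in range(3):
    A = rng.random((6, 6))
    W = (A + A.T) / 2.0
    np.fill_diagonal(W, 1.0)
    W_list.append(W)
P_list = [renormalize(W) for W in W_list]
S_list = [W / W.sum(axis=1, keepdims=True) for W in W_list]  # dense stand-in
fused = fuse_many(P_list, S_list)
```

With m=2 the average over the other data types is just the single other matrix, so this reduces to the two-data-type update.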
SNF termination
Top Hat question
Application of SNF to Glioblastoma
SNF application to GBM identifies 3 subtypes
DNA methylation
mRNA expression
miRNA expression
Evaluating cancer subtype clusters
Kaplan–Meier curves
Image from Ruben Van Paemel
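A minimal sketch of the Kaplan–Meier estimator behind such curves. The follow-up times and event indicators are hypothetical; an event value of 0 marks a censored patient, whose outcome was not observed.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return event times and the estimated survival probability after each.

    events[i] is 1 if death was observed at times[i], 0 if censored.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                    # still under observation
        deaths = np.sum((times == t) & (events == 1))   # observed deaths at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Hypothetical cohort of five patients.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
event_times, survival = kaplan_meier(times, events)
print(event_times, survival)  # survival is approximately 0.8, 0.6, 0.3
```

Censored patients count toward the number at risk until they leave the study but never trigger a drop in the curve.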
Validation of SNF identified GBM subtypes
Subtypes are associated with different survival outcomes.
Blue curve (subtype 3): patients with a more favorable prognosis.
Red: temozolomide-treated.
Black: untreated.
Application of SNF to five cancers
Reasonable survival differences in subtypes in each cancer type
Survival differences are more significant than when clustering individual data types
Survival differences are more significant, and silhouette scores better, than with naïvely merged data or iCluster
iCluster: learns sparse linear mapping to shared latent space
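A sketch of the silhouette score used to compare clusterings. It assumes a precomputed sample-sample distance matrix; the points and labels are hypothetical, and the slides do not specify how distances were derived from the fused similarities.

```python
import numpy as np


def silhouette_scores(D, labels):
    """Per-sample silhouette from a distance matrix D and cluster labels."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: score 0 by convention
        a = D[i, same].mean()  # mean distance within own cluster
        b = min(D[i, labels == c].mean()  # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Hypothetical samples on a line, forming two tight groups.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
good = silhouette_scores(D, [0, 0, 1, 1])
bad = silhouette_scores(D, [0, 1, 0, 1])
print(good.mean(), bad.mean())
```

Scores near 1 indicate samples much closer to their own cluster than to any other; negative scores indicate samples that sit closer to a different cluster.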
Similarity network fusion summary
Network-based data integration summary
Problem | Goal | Input | Algorithm | Output |
Prioritizing candidate disease genes | Rank candidates using global network relationships | PPI network, known disease genes, candidates | Random walk with restart or graph diffusion | Ranked list of candidates |
Identifying disease gene subnetworks | Find connected subnetworks with mutated genes that span many patients | PPI network, mutations per patient | Heat kernel diffusion, identify subnetworks, assess significance | Subnetworks, significance, patients with mutations in the subnetworks |
Relating multiple omic measurements of one process | Select edges connecting important nodes of multiple types | PPI network, edge costs, node scores | Prize-collecting Steiner forest | Selected edges and nodes |
Clustering samples using multiple omic measurements | Jointly use all data types to inform the clustering | Multiple types of omic data for each sample | Form similarity matrices, update iteratively across data types, spectral clustering | Consensus sample-sample similarities and sample clusters |
Combining complementary graphs | Learn a consensus node embedding from all graphs | Multiple biological networks, optional node labels | Multi-graph autoencoder | Node embeddings |
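The spectral clustering step named in the sample-clustering row can be sketched for the two-cluster case. This is a simplified stand-in that splits samples by the sign of the second eigenvector of the normalized graph Laplacian; finding more than two clusters would require clustering several eigenvectors. The similarity matrix is hypothetical.

```python
import numpy as np


def spectral_bipartition(W):
    """Split samples into two clusters from a similarity matrix W."""
    W = np.asarray(W, dtype=float)
    W = (W + W.T) / 2.0  # enforce symmetry
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L)
    return (eigvecs[:, 1] > 0).astype(int)


# Hypothetical fused similarities: samples 0-2 and 3-5 form two groups.
W = np.full((6, 6), 0.05)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 1.0)
labels = spectral_bipartition(W)
print(labels)
```

Which group receives label 0 is arbitrary, because the sign of an eigenvector is not determined.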