1 of 34

Network-based integration with many samples

Nov 25th, 2025

BMI/CS 775 Computational Network Biology, Fall 2025

Anthony Gitter

https://bmi775.sites.wisc.edu

Original slides created by Prof. Sushmita Roy

These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Sushmita Roy and Anthony Gitter

2 of 34

Topics in this section

Graph-based approaches for gene prioritization
Graph diffusion to interpret sequence variants
Combinatorial graph algorithms for subnetwork selection
Multi-omic data integration
Network integration with graph neural networks

3 of 34

Biological data is of many different types

Image credit: TCGA, Gligorevic et al., Proteomics 2015

Today: many samples within a cancer type

4 of 34

Data integration with many samples

The goal of the prize-collecting Steiner forest was to select network edges that connect omic measurements in one or a few samples

Now we have many samples, e.g. 100s of tumor samples
Multiple types of omic data in each sample

Gene expression
miRNA expression
DNA methylation

Goal: cluster the samples and take advantage of all types of omic data

5 of 34

Challenges in clustering with multiple data types

Data challenges

Noisy samples
Fewer samples than variables
Complementary nature of the data

How might we approach this?

7 of 34

Clustering idea 1: joint similarity measure

Could define a single similarity score that includes multiple data types
Potential problems:

[Figure: toy feature matrix with columns Sample 1 and Sample 2; rows are grouped into protein activity features and DNA methylation features]

Vastly different number of features per data type
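The effect of unequal feature counts can be illustrated with a small sketch. The data here are hypothetical (2 protein activity features and 1,000 methylation features, invented for illustration), and squared Euclidean distance on the concatenated features stands in for a generic joint similarity score.

```python
import numpy as np

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

rng = np.random.default_rng(0)

# Hypothetical toy data: the two samples differ strongly in protein activity
# but only by random noise in DNA methylation.
protein_1 = np.array([0.0, 0.0])
protein_2 = np.array([5.0, 5.0])
methyl_1 = rng.normal(size=1000)
methyl_2 = rng.normal(size=1000)

d_protein = squared_distance(protein_1, protein_2)
d_methyl = squared_distance(methyl_1, methyl_2)

# Distance on the concatenated features is the sum of the per-type distances,
# so the data type with more features contributes most of the total.
protein_share = d_protein / (d_protein + d_methyl)
print(f"protein share of joint distance: {protein_share:.3f}")
```

Even though protein activity carries the only real difference between the samples, it accounts for a small fraction of the joint distance.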

8 of 34

Clustering idea 1: joint similarity measure

Could define a single similarity score that includes multiple data types
Potential problems:

[Figure: toy feature matrix with columns Sample 1 and Sample 2; gene expression rows hold count-like values (0 to 9) and DNA mutation rows hold binary values (0 or 1)]

Different data types have different values

9 of 34

Clustering idea 2: ensemble clustering

Cluster each data type separately, then merge the clusters
Pan-cancer Cluster-Of-Cluster-Assignments (COCA)

Hoadley et al. Cell 2014

Limitations:

[Figure: the gene expression and DNA mutation matrices for Sample 1 and Sample 2 are clustered separately into expression clusters and mutation clusters, which are then combined into merged clusters]

No opportunity to share information across data types or smooth noisy data

10 of 34

Similarity Network Fusion

Main ideas:

Construct sample-sample similarities within each data type
Share information across data types to update similarities
Converge on consensus similarities
Cluster using the consensus similarities

Wang et al. Nature Methods 2014

11 of 34

Similarity Network Fusion motivating example

Wang et al. Nature Methods 2014

A few samples have noisy values in each data type

Clustering by a single data type places some points in the wrong group

12 of 34

Similarity Network Fusion concept

Wang et al. Nature Methods 2014

Cancer genomics application: samples = patients = nodes

Edges = similarities

13 of 34

Similarity Network Fusion algorithm

14 of 34

Important assumption

15 of 34

Defining a similarity graph over patient samples

Euclidean distance squared

Hyperparameter

Scaling term
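The three labels above annotate the scaled exponential similarity kernel from Wang et al. A reconstruction in the paper's notation follows, where the symbols are taken from the paper rather than from this slide text: \rho is the Euclidean distance between patients, \mu is the hyperparameter, and \varepsilon_{i,j} is the scaling term.

```latex
W(i,j) = \exp\left( -\frac{\rho^{2}(x_i, x_j)}{\mu \, \varepsilon_{i,j}} \right)
```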

16 of 34

Defining a similarity graph over patient samples

Scaling term helps normalize the Euclidean distance

Average of the distance between each node and its neighborhood
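A minimal NumPy sketch of this kernel follows, assuming the scaling term is the average of three quantities: the mean distance from i to its k nearest neighbors, the same for j, and the distance between i and j. The function name `affinity_matrix`, the defaults `k=3` and `mu=0.5`, and the choice to exclude the sample itself from its neighborhood when computing the scaling term are assumptions for illustration, not values taken from the slides.

```python
import numpy as np

def affinity_matrix(X, k=3, mu=0.5):
    """Scaled exponential similarity between the samples in the rows of X."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances rho(x_i, x_j)
    diff = X[:, None, :] - X[None, :, :]
    rho = np.sqrt(np.sum(diff ** 2, axis=-1))
    # Mean distance from each sample to its k nearest neighbors.
    # Column 0 of the sorted distances is the self-distance (0), so skip it.
    sorted_rho = np.sort(rho, axis=1)
    mean_nn = sorted_rho[:, 1:k + 1].mean(axis=1)
    # Scaling term eps_ij = (mean_nn_i + mean_nn_j + rho_ij) / 3
    eps = (mean_nn[:, None] + mean_nn[None, :] + rho) / 3.0
    return np.exp(-rho ** 2 / (mu * eps))

# Hypothetical toy data: two tight pairs of samples that are far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W = affinity_matrix(X, k=1)
```

Because the scaling term grows with the local distances, pairs in sparse regions of the data are not penalized as heavily as they would be with a single global bandwidth.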

17 of 34

Creating a fused matrix

Assumes that the local similarities are the most reliable

Neighbors include the node itself
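In Wang et al., two normalized matrices are derived from each similarity matrix W: a full matrix P over all pairs of patients and a sparse local matrix S restricted to each patient's nearest neighbors. The sketch below is one way to compute them. The function names are hypothetical, and it assumes that the diagonal of W holds the largest value in each row, so that each node is counted among its own k neighbors.

```python
import numpy as np

def full_kernel(W):
    """P: off-diagonal entries normalized to sum to 1/2, diagonal fixed at 1/2."""
    off = np.array(W, dtype=float)
    np.fill_diagonal(off, 0.0)
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """S: keep the k largest similarities in each row, then row-normalize."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    # Column indices of the k most similar samples in each row
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

# Hypothetical 3-patient similarity matrix
W = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
P = full_kernel(W)
S = local_kernel(W, k=2)
```

Setting similarities outside the neighborhood to zero encodes the assumption that local similarities are the most reliable.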

18 of 34

Iterate for fusion

19 of 34

Iteration with m=2 data types

For iteration t+1

Update similarity matrix of data type 1 using weight matrix from data type 2 and vice-versa

Renormalize P^(v) after the update

Update similarity matrix of data type 1

Update similarity matrix of data type 2
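The two updates can be written as follows, using the notation of Wang et al.: S^{(v)} is the local neighbor matrix for data type v and P^{(v)}_t is the similarity matrix for data type v at iteration t. These symbols are assumed from the paper, since the slide equations did not survive in this text.

```latex
\begin{aligned}
P^{(1)}_{t+1} &= S^{(1)} \, P^{(2)}_{t} \, \left(S^{(1)}\right)^{\top} \\
P^{(2)}_{t+1} &= S^{(2)} \, P^{(1)}_{t} \, \left(S^{(2)}\right)^{\top}
\end{aligned}
```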

20 of 34

What is going on in the iteration step

We are updating the similarity matrix using the most confident common neighbors of i and j

Neighbor of i

Neighbor of j
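Expanding the matrix product entry by entry makes the role of the neighbors explicit. In this sketch of the data type 1 update, N_i denotes the nearest neighbors of patient i, and the notation is again assumed from Wang et al.

```latex
P^{(1)}_{t+1}(i,j) = \sum_{k \in N_i} \sum_{l \in N_j} S^{(1)}(i,k) \, S^{(1)}(j,l) \, P^{(2)}_{t}(k,l)
```

Patients i and j become more similar in data type 1 when their neighbors k and l are similar to each other in data type 2.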

21 of 34

Extending to m>2 data types

Average over all other data types
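A compact sketch of the iteration for m data types follows. The function names are hypothetical, and the renormalization step is assumed to be a simple row normalization, which may differ from what published SNF implementations do. The last two lines average the per-type matrices and symmetrize the result.

```python
import numpy as np

def row_normalize(M):
    """Scale each row to sum to 1 (assumed form of the renormalization step)."""
    return M / M.sum(axis=1, keepdims=True)

def fuse(P_list, S_list, n_iter=20):
    """Cross-diffuse m >= 2 similarity matrices P using local matrices S."""
    m = len(P_list)
    P = [np.asarray(p, dtype=float) for p in P_list]
    S = [np.asarray(s, dtype=float) for s in S_list]
    for _ in range(n_iter):
        total = sum(P)
        P_next = []
        for v in range(m):
            # Average of the current matrices of all other data types
            others = (total - P[v]) / (m - 1)
            P_next.append(row_normalize(S[v] @ others @ S[v].T))
        P = P_next
    # Average over data types, then symmetrize
    fused = sum(P) / m
    return (fused + fused.T) / 2.0
```

With m = 2 the average over the other data types reduces to the single other matrix, which recovers the pairwise update.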

22 of 34

SNF termination

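After repeating the iterative updates for t steps, the final similarity matrix is the average of the data-type-specific matrices: $P^{(c)} = \frac{1}{m} \sum_{v=1}^{m} P^{(v)}_{t}$

This fused matrix is what is clustered using spectral clustering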
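To give a sense of how spectral clustering uses a similarity matrix, here is a simplified two-cluster sketch. It splits the samples by the sign of the second eigenvector of the normalized graph Laplacian. This is not the full multi-cluster procedure, which typically runs k-means on several eigenvectors, and the function name is hypothetical.

```python
import numpy as np

def spectral_bipartition(W):
    """Split samples into two groups using the normalized graph Laplacian."""
    W = np.asarray(W, dtype=float)
    W = (W + W.T) / 2.0                      # enforce symmetry
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order
    _, eigvecs = np.linalg.eigh(L)
    # The second eigenvector, rescaled by D^{-1/2}, separates the two groups
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)

# Hypothetical fused similarities: patients 0-1 and 2-3 form two groups
W_fused = np.array([[1.00, 0.90, 0.05, 0.05],
                    [0.90, 1.00, 0.05, 0.05],
                    [0.05, 0.05, 1.00, 0.90],
                    [0.05, 0.05, 0.90, 1.00]])
labels = spectral_bipartition(W_fused)
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful.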

23 of 34

24 of 34

Application of SNF to Glioblastoma

Different data types give contradictory information about subtypes
Glioblastoma (GBM) dataset
Three data types among 215 patients

DNA methylation (1,491 genes)
mRNA (12,042 genes)
miRNA (534 miRNAs)

25 of 34

SNF application to GBM identifies 3 subtypes

DNA methylation

mRNA expression

miRNA expression

Wang et al. Nature Methods 2014

26 of 34

Evaluating cancer subtype clusters

Silhouette score for cluster coherence
Cox log-rank test

Assess if the overall survival of the patient subtypes is different
Can visualize with Kaplan–Meier curve
Statistical test assesses the difference
Accounts for right censoring: we do not know how long every patient survives
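The following sketch shows how right censoring enters a Kaplan–Meier estimate. Censored patients count as at risk until their last follow-up time but never count as deaths. The function name and the toy follow-up data are hypothetical.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return the distinct death times and the survival estimate after each.

    times  : follow-up time for each patient
    events : 1 if death was observed, 0 if the patient was right-censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # still under observation at t
        deaths = np.sum((times == t) & events)    # observed deaths at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical follow-up data: the third and fifth patients are censored
t_event, surv = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
# surv is [0.8, 0.6, 0.3] at times [2, 3, 5]
```

Computing one such curve per subtype gives the curves that the log-rank test compares.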

27 of 34

Kaplan–Meier curves

Image from Ruben Van Paemel

28 of 34

Validation of SNF identified GBM subtypes

Subtypes are associated with different survival outcomes.

The blue curve (subtype 3) corresponds to patients with a more favorable prognosis.

Red: temozolomide treated

Black: untreated.

Wang et al. Nature Methods 2014

29 of 34

Application of SNF to five cancers

Expand to four additional cancer types:

breast invasive carcinoma (BIC), kidney renal clear cell carcinoma (KRCCC), lung squamous cell carcinoma (LSCC) and colon adenocarcinoma (COAD)

92 to 215 samples

30 of 34

Application of SNF to five cancers

Wang et al. Nature Methods 2014

Reasonable survival differences in subtypes in each cancer type

31 of 34

Application of SNF to five cancers

Wang et al. Nature Methods 2014

Survival differences more significant than clustering individual data types

32 of 34

Application of SNF to five cancers

Wang et al. Nature Methods 2014

Survival differences more significant and better silhouette scores than naïvely merging data or iCluster

iCluster: learns sparse linear mapping to shared latent space

33 of 34

Similarity network fusion summary

Synchronizing data-type-specific similarity matrices can

Resolve noise and contradictions in each data type
Make data types directly comparable

Improves upon clustering a single data type and upon previous consensus clustering approaches
Does not natively support missing data

34 of 34

Network-based data integration summary

Problem	Goal	Input	Algorithm	Output
Prioritizing candidate disease genes	Rank candidates using global network relationships	PPI network, known disease genes, candidates	Random walk with restart or graph diffusion	Ranked list of candidates
Identifying disease gene subnetworks	Find connected subnetworks with mutated genes that span many patients	PPI network, mutations per patient	Heat kernel diffusion, identify subnetworks, assess significance	Subnetworks, significance, patients with mutations in the subnetworks
Relating multiple omic measurements of one process	Select edges connecting important nodes of multiple types	PPI network, edge costs, node scores	Prize-collecting Steiner forest	Selected edges and nodes
Clustering samples using multiple omic measurements	Jointly use all data types to inform the clustering	Multiple types of omic data for each sample	Form similarity matrices, update iteratively across data types, spectral clustering	Consensus sample-sample similarities and sample clusters
Combining complementary graphs	Learn a consensus node embedding from all graphs	Multiple biological networks, optional node labels	Multi-graph autoencoder	Node embeddings