1 of 34

Network-based integration with many samples

Nov 25th, 2025

BMI/CS 775 Computational Network Biology, Fall 2025

Anthony Gitter

https://bmi775.sites.wisc.edu

Original slides created by Prof. Sushmita Roy

These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Sushmita Roy and Anthony Gitter

2 of 34

Topics in this section

  • Graph-based approaches for gene prioritization
  • Graph diffusion to interpret sequence variants
  • Combinatorial graph algorithms for subnetwork selection
  • Multi-omic data integration
  • Network integration with graph neural networks

3 of 34

Biological data is of many different types

Image credit: TCGA, Gligorevic et al., Proteomics 2015

Today: many samples within a cancer type

4 of 34

Data integration with many samples

  • The goal of the prize-collecting Steiner forest was to select network edges that connect omic measurements in one or a few samples

  • Now we have many samples, e.g., hundreds of tumor samples
  • Multiple types of omic data in each sample
    • Gene expression
    • miRNA expression
    • DNA methylation

  • Goal: cluster the samples and take advantage of all types of omic data

5 of 34

Challenges in clustering with multiple data types

  • Data challenges
    • Noisy samples
    • Fewer samples than variables
    • Complementary nature of the data

  • How might we approach this?

6 of 34

Clustering idea 1: joint similarity measure

  • Could define a single similarity score that includes multiple data types
  • Potential problems:

7 of 34

Clustering idea 1: joint similarity measure

  • Could define a single similarity score that includes multiple data types
  • Potential problem: vastly different number of features per data type

Figure: feature vectors for Sample 1 and Sample 2 in two data types (protein activity and DNA methylation)
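
A minimal sketch of this problem, using made-up numbers rather than the values from the slide: when two data types are concatenated into one vector, the type with more features can contribute most of the Euclidean distance even if the smaller type disagrees on every feature. The feature counts, values, and variable names below are hypothetical.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy data (not the values from the slide)
protein_s1 = [1.0, 0.0, 1.0]      # 3 protein activity features
protein_s2 = [0.0, 1.0, 0.0]      # sample 2 differs in every protein feature
methyl_s1 = [0.5] * 30            # 30 DNA methylation features
methyl_s2 = [0.0] * 30            # small difference of 0.5 in each feature

d_protein = sq_euclidean(protein_s1, protein_s2)
d_methyl = sq_euclidean(methyl_s1, methyl_s2)
d_joint = sq_euclidean(protein_s1 + methyl_s1, protein_s2 + methyl_s2)

# Fraction of the joint distance contributed by each data type
share_protein = d_protein / d_joint
share_methyl = d_methyl / d_joint
print(f"protein: {share_protein:.2f}, methylation: {share_methyl:.2f}")
```

In this toy case the methylation features account for about 71% of the joint distance, even though the two samples disagree completely on protein activity.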

8 of 34

Clustering idea 1: joint similarity measure

  • Could define a single similarity score that includes multiple data types
  • Potential problem: different data types have different values

Figure: feature vectors for Sample 1 and Sample 2 in two data types (gene expression with integer-valued levels, DNA mutations with mostly zero binary indicators)

9 of 34

Clustering idea 2: ensemble clustering

  • Cluster each data type separately, then merge the clusters
  • Pan-cancer Cluster of Cluster Assignments (COCA)
    • Hoadley et al. Cell 2014
  • Limitation: no opportunity to share information across data types or smooth noisy data

Figure: Sample 1 and Sample 2 are clustered separately on gene expression and on DNA mutations, and the expression clusters and mutation clusters are then combined into merged clusters

10 of 34

Similarity Network Fusion

  • Main ideas:
    • Construct sample-sample similarities within each data type
    • Share information across data types to update similarities
    • Converge on consensus similarities
    • Cluster using the consensus similarities

11 of 34

Similarity Network Fusion motivating example

A few samples have noisy values in each data type

Clustering by a single data type places some points in the wrong group

12 of 34

Similarity Network Fusion concept

Cancer genomics application: samples = patients = nodes

Edges = similarities

13 of 34

Similarity Network Fusion algorithm

  •  

14 of 34

Important assumption

  •  

15 of 34

Defining a similarity graph over patient samples

  •  

Euclidean distance squared

Hyperparameter

Scaling term

16 of 34

Defining a similarity graph over patient samples

  • Scaling term helps normalize the Euclidean distance

  • Average of the distance between each node and its neighborhood
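
A minimal sketch of how such a scaled similarity could be computed for one data type, assuming the published SNF kernel W(i, j) = exp(-rho(x_i, x_j)^2 / (mu * eps_ij)). Here rho is the Euclidean distance, mu is the hyperparameter, and the scaling term eps_ij averages the mean neighbor distance of i, the mean neighbor distance of j, and rho(x_i, x_j). The function names, the toy data, and the defaults k=2 and mu=0.5 are illustrative assumptions, not values from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance rho(a, b) between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scaled_similarity(X, k=2, mu=0.5):
    """W(i, j) = exp(-rho(x_i, x_j)^2 / (mu * eps_ij)) for all sample pairs.

    X is a list of samples (rows) from ONE data type. k (neighborhood size)
    and mu (hyperparameter) are illustrative defaults, not slide values.
    Requires 1 <= k <= len(X) - 1 and assumes every eps_ij is positive.
    """
    n = len(X)
    dist = [[euclidean(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Average distance between each sample and its k nearest other samples
    mean_nn = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        mean_nn.append(sum(others[:k]) / k)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Scaling term: local neighborhood distances plus the pair distance
            eps = (mean_nn[i] + mean_nn[j] + dist[i][j]) / 3.0
            W[i][j] = math.exp(-dist[i][j] ** 2 / (mu * eps))
    return W

# Hypothetical toy data: two tight pairs of samples that are far apart
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
W = scaled_similarity(X)
```

In this toy example the two samples within each tight pair receive a much higher similarity than samples from different pairs, and each sample has similarity 1 with itself.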

17 of 34

Creating a fused matrix

  •  

Assumes that the local similarities are the most reliable

Neighbors include the node itself
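
The two matrices that SNF derives from a similarity matrix W can be sketched as follows, assuming the published SNF definitions: a full normalized matrix P with 1/2 on the diagonal, and a local matrix S that keeps only each sample's nearest neighbors (including the sample itself) and sets all other entries to zero. The function names, the toy W, and k=2 are hypothetical choices for illustration.

```python
def full_kernel(W):
    """P(i, j) = W(i, j) / (2 * sum_{k != i} W(i, k)) for j != i; P(i, i) = 1/2."""
    n = len(W)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diag = sum(W[i][k] for k in range(n) if k != i)
        for j in range(n):
            P[i][j] = 0.5 if i == j else W[i][j] / (2.0 * off_diag)
    return P

def local_kernel(W, k=2):
    """S(i, j) = W(i, j) / sum_{l in N_i} W(i, l) if j is in N_i, else 0.

    N_i holds sample i itself plus its k - 1 most similar other samples.
    """
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: W[i][j], reverse=True)
        neighbors = [i] + others[:k - 1]
        total = sum(W[i][j] for j in neighbors)
        for j in neighbors:
            S[i][j] = W[i][j] / total
    return S

# Hypothetical symmetric similarity matrix for four samples
W = [[1.0, 0.8, 0.1, 0.2],
     [0.8, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.7],
     [0.2, 0.1, 0.7, 1.0]]
P = full_kernel(W)
S = local_kernel(W, k=2)
```

Both matrices have rows that sum to 1. S discards the small, less reliable similarities to distant samples, which reflects the assumption that local similarities are the most reliable.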

18 of 34

Iterate for fusion

  •  

19 of 34

Iteration with m=2 data types

For iteration t+1

Update similarity matrix of data type 1 using weight matrix from data type 2 and vice-versa

Renormalize P(v) after the update

Update similarity matrix of data type 1

Update similarity matrix of data type 2
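
A minimal sketch of the two-data-type iteration, assuming the published SNF update in matrix form: the new P for data type 1 is S1 x P2 x S1^T, and the new P for data type 2 is S2 x P1 x S2^T, both computed from the previous iteration. The plain row normalization below is a simplifying assumption standing in for the renormalization step, and the toy matrices and the function name fuse_two are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def row_normalize(P):
    """Rescale each row to sum to 1 (an assumed form of renormalization)."""
    out = []
    for row in P:
        total = sum(row)
        out.append([v / total for v in row])
    return out

def fuse_two(S1, S2, P1, P2, n_iter=10):
    """Cross-update the similarity matrices of two data types."""
    for _ in range(n_iter):
        # Both updates use the matrices from the previous iteration
        new_P1 = matmul(matmul(S1, P2), transpose(S1))
        new_P2 = matmul(matmul(S2, P1), transpose(S2))
        P1, P2 = row_normalize(new_P1), row_normalize(new_P2)
    return P1, P2

# Hypothetical local (S) and full (P) matrices for three samples
S1 = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.0, 0.3, 0.7]]
S2 = [[0.7, 0.0, 0.3], [0.0, 0.7, 0.3], [0.3, 0.0, 0.7]]
P1 = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.2, 0.3, 0.5]]
P2 = [[0.5, 0.1, 0.4], [0.15, 0.5, 0.35], [0.3, 0.2, 0.5]]

fused1, fused2 = fuse_two(S1, S2, P1, P2)
```

The local matrices S1 and S2 stay fixed throughout; only the full matrices are exchanged and updated.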

20 of 34

What is going on in the iteration step

We are updating the similarity matrix using the most confident common neighbors of i and j

Neighbor of i

Neighbor of j
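
Written out entry by entry, the matrix update for data type 1 becomes a sum over neighbor pairs. This is a reconstruction based on the published SNF update rule, with assumed notation: S^{(1)} is the local, neighbor-restricted similarity matrix of data type 1, P^{(2)}_t is the current similarity matrix of data type 2, and N_i is the neighbor set of sample i.

```latex
P^{(1)}_{t+1}(i,j) = \sum_{k \in N_i} \sum_{l \in N_j} S^{(1)}(i,k) \, S^{(1)}(j,l) \, P^{(2)}_t(k,l)
```

Samples i and j become more similar in data type 1 when a neighbor k of i and a neighbor l of j are similar to each other in data type 2.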

21 of 34

Extending to m>2 data types

Average over all other data types
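
A minimal sketch of one update round for m > 2 data types, assuming the published SNF rule in which data type v is updated with the average of the P matrices of all other data types: S_v x mean(P_k for k != v) x S_v^T. The function names and toy matrices are hypothetical, and renormalization after the update is omitted here.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def average_others(P_list, v):
    """Element-wise mean of all matrices in P_list except the one at index v."""
    others = [P for idx, P in enumerate(P_list) if idx != v]
    n = len(others[0])
    return [[sum(P[i][j] for P in others) / len(others) for j in range(n)]
            for i in range(n)]

def update_all(S_list, P_list):
    """One update round: every data type uses the other types' average."""
    updated = []
    for v, S in enumerate(S_list):
        avg = average_others(P_list, v)
        updated.append(matmul(matmul(S, avg), transpose(S)))
    return updated

# Hypothetical matrices for two samples and three data types
S_list = [[[0.6, 0.4], [0.4, 0.6]],
          [[0.7, 0.3], [0.3, 0.7]],
          [[0.5, 0.5], [0.5, 0.5]]]
P_list = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 1.0], [1.0, 0.0]],
          [[0.5, 0.5], [0.5, 0.5]]]

updated = update_all(S_list, P_list)
```

With m = 2 the average contains a single matrix, so this reduces to the two-data-type update.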

22 of 34

SNF termination

  • After repeating the iterative updates for t steps, final similarity matrix is

  • This is what is clustered using spectral clustering
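
One way to write the final matrix, assuming the published SNF formulation and the notation P^{(v)}_t for the similarity matrix of data type v after t iterations, is the average over all m data types:

```latex
P = \frac{1}{m} \sum_{v=1}^{m} P^{(v)}_t
```

For m = 2 this is simply the mean of the two updated matrices.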

23 of 34

Top Hat question

24 of 34

Application of SNF to Glioblastoma

  • Different data types give contradictory information about subtypes
  • Glioblastoma (GBM) dataset
  • Three data types among 215 patients
      • DNA methylation (1,491 genes)
      • mRNA (12,042 genes)
      • miRNA (534 miRNAs)

25 of 34

SNF application to GBM identifies 3 subtypes

DNA methylation

mRNA expression

miRNA expression

26 of 34

Evaluating cancer subtype clusters

  • Silhouette score for cluster coherence
  • Cox log-rank test
    • Assesses whether overall survival differs between the patient subtypes
    • Can visualize with Kaplan–Meier curves
    • Statistical test assesses whether the difference is significant
    • Accounts for right censoring: we do not know how long every patient survives
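
A minimal sketch of the Kaplan–Meier estimate, to show how right-censored patients enter the calculation: they count as at risk until their last follow-up time but never count as deaths. This covers only the survival curve, not the log-rank test itself. The function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each time where at least one death was observed.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if right-censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up times (e.g., months) for five patients
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]    # patients 2 and 5 are right-censored
curve = kaplan_meier(times, events)
```

In this toy example the estimated survival drops at times 2, 3, and 5, to roughly 0.8, 0.6, and 0.3. The censored patient at time 3 still counts toward the four patients at risk at that time, and the censored patient at time 8 causes no drop.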

27 of 34

Kaplan–Meier curves

Image from Ruben Van Paemel

28 of 34

Validation of SNF identified GBM subtypes

Subtypes are associated with different survival outcomes.

Blue curve (subtype 3): patients with a more favorable prognosis.

Red: temozolomide treated

Black: untreated.

29 of 34

Application of SNF to five cancers

  • Expand to four additional cancer types:
    • breast invasive carcinoma (BIC), kidney renal clear cell carcinoma (KRCCC), lung squamous cell carcinoma (LSCC) and colon adenocarcinoma (COAD)
  • 92 to 215 samples per cancer type

30 of 34

Application of SNF to five cancers

Reasonable survival differences in subtypes in each cancer type

31 of 34

Application of SNF to five cancers

Survival differences are more significant than when clustering individual data types

32 of 34

Application of SNF to five cancers

Survival differences are more significant and silhouette scores are better than with naïvely merged data or iCluster

iCluster: learns a sparse linear mapping to a shared latent space

33 of 34

Similarity network fusion summary

  • Synchronizing data-type-specific similarity matrices can
    • Resolve noise and contradictions in each data type
    • Make data types directly comparable
  • Improves upon single-data-type clustering and previous consensus clustering approaches
  • Does not natively support missing data

34 of 34

Network-based data integration summary

  • Problem: Prioritizing candidate disease genes
    • Goal: Rank candidates using global network relationships
    • Input: PPI network, known disease genes, candidates
    • Algorithm: Random walk with restart or graph diffusion
    • Output: Ranked list of candidates

  • Problem: Identifying disease gene subnetworks
    • Goal: Find connected subnetworks with mutated genes that span many patients
    • Input: PPI network, mutations per patient
    • Algorithm: Heat kernel diffusion, identify subnetworks, assess significance
    • Output: Subnetworks, significance, patients with mutations in the subnetworks

  • Problem: Relating multiple omic measurements of one process
    • Goal: Select edges connecting important nodes of multiple types
    • Input: PPI network, edge costs, node scores
    • Algorithm: Prize-collecting Steiner forest
    • Output: Selected edges and nodes

  • Problem: Clustering samples using multiple omic measurements
    • Goal: Jointly use all data types to inform the clustering
    • Input: Multiple types of omic data for each sample
    • Algorithm: Form similarity matrices, update iteratively across data types, spectral clustering
    • Output: Consensus sample-sample similarities and sample clusters

  • Problem: Combining complementary graphs
    • Goal: Learn a consensus node embedding from all graphs
    • Input: Multiple biological networks, optional node labels
    • Algorithm: Multi-graph autoencoder
    • Output: Node embeddings