1 of 32

Network-based integration of multiple networks

Nov 26th, 2024

BMI/CS 775 Computational Network Biology, Fall 2024

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

2 of 32

Topics in this section

  • Graph-based approaches for gene prioritization
  • Graph diffusion to interpret sequence variants
  • Combinatorial graph algorithms for subnetwork selection
  • Multi-omic data integration
  • Network integration with graph neural networks

3 of 32

Why BIONIC?

4 of 32

Appeal of representation learning on graphs

  • Already seen power of graph representation learning
    • Discover modules
    • Predict gene and protein function
    • Prioritize experiments

  • Typically used a single graph
    • For example, protein-protein interactions

5 of 32

Recurring theme: unique information captured in different omics assays

Image: TCGA, Gligorijević et al., Proteomics 2015

Have discussed this in terms of node information

Also true for edge information

6 of 32

Three types of biological networks

Images: Fout et al. NIPS 2017, Jung & Choi 2013, van Leeuwen et al. 2017

  • Protein-protein interaction
  • Gene co-expression
  • Genetic interaction

7 of 32

BIONIC goal

How can we perform graph representation learning across diverse biological networks?

Improve node embeddings

8 of 32

BIONIC

SNF (Bo Wang):
  • Input: sample-sample networks
  • Output: clustered samples

BIONIC:
  • Input: gene-gene networks
  • Output: node (gene) embeddings, used to cluster genes and predict gene function

Both share information across disjoint networks.

9 of 32

BIONIC versus SNF coexpression

Image: SNF co-expression and BIONIC co-expression gene similarity matrices (genes × genes), derived from the input gene similarity networks

10 of 32

Genetic interactions

A genetic interaction occurs when the phenotype of a double mutant is unexpected given the phenotypes of the individual single mutants
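One common formalization of "unexpected" is the multiplicative null model (an assumption for illustration, not necessarily the slide's exact definition): the expected double-mutant fitness is the product of the single-mutant fitnesses, and the interaction score is the deviation from that expectation.

```python
def interaction_score(w_a, w_b, w_ab):
    """Genetic interaction score under a multiplicative null model:
    the expected double-mutant fitness is the product of the two
    single-mutant fitnesses; the score is the observed deviation."""
    return w_ab - w_a * w_b

# Synthetic sick/lethal pair: the double mutant is far worse than expected
epsilon = interaction_score(0.8, 0.9, 0.3)  # 0.3 - 0.72 = -0.42
```

A negative score indicates a synthetic sick or lethal interaction; a positive score indicates suppression or epistasis.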

11 of 32

Restating the BIONIC goal

Input: multiple biological networks

Optional node labels

Output: informative node embeddings

12 of 32

Top Hat question

13 of 32

BIONIC algorithm overview

14 of 32

BIONIC algorithm overview

Multiple input networks

15 of 32

BIONIC algorithm overview

Pass each graph through multiple layers of a graph attention network

Combine the graph-specific node embeddings into one embedding per node

16 of 32

BIONIC algorithm overview

Reconstruct a single graph from the combined node embeddings

17 of 32

BIONIC algorithm overview

Combine graph reconstruction error with respect to each original graph

18 of 32

Learning the node embeddings

  • Important customizations to learn node embeddings across graphs

  • Residual connections sum all graph convolutional network (GCN) outputs
  • Supports any number of GCN layers, here 3
  • Use graph attention networks
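The residual scheme can be sketched in miniature: instead of keeping only the final layer's output, sum every layer's output. This toy treats each "layer" as a plain function on a feature vector (an illustrative stand-in; the real layers are graph attention layers operating on all nodes).

```python
def encode_with_residuals(features, layers):
    """Residual-style encoder: apply the layers in sequence, but return
    the sum of every layer's output rather than only the last output."""
    total = None
    h = features
    for layer in layers:
        h = layer(h)
        total = list(h) if total is None else [a + b for a, b in zip(total, h)]
    return total

def double(v):
    """Stand-in for a GAT layer: doubles each feature."""
    return [2 * x for x in v]

out = encode_with_residuals([1.0], [double, double, double])
# Layer outputs are [2.0], [4.0], [8.0]; their sum is [14.0]
```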

19 of 32

Learning the node embeddings

  • Stochastic mask to combine graph-specific node embeddings

The combined embedding can be written as a masked, scaled sum over the input graphs:

  z_i = Σ_j c_j · m_ij · h_ij

  c_j : learned scaling parameter
  h_ij : graph-specific embedding of node i from graph j
  m_ij : mask value for node i in graph j
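A minimal sketch of this combination step, assuming the combined embedding is a scaled, masked sum of the graph-specific embeddings (consistent with the terms on this slide, though the exact BIONIC formulation may differ):

```python
def combine_embeddings(embs, masks, scales):
    """Combine graph-specific node embeddings into one embedding per node.
    embs[j][i]  : embedding of node i learned from graph j
    masks[j][i] : mask value for node i in graph j (0 if the node is
                  absent; sampled stochastically during training)
    scales[j]   : learned scaling parameter for graph j
    """
    n_nodes, dim = len(embs[0]), len(embs[0][0])
    combined = [[0.0] * dim for _ in range(n_nodes)]
    for j in range(len(embs)):
        for i in range(n_nodes):
            w = scales[j] * masks[j][i]
            for d in range(dim):
                combined[i][d] += w * embs[j][i][d]
    return combined
```

Masking lets a node missing from one network still receive an embedding from the networks that do contain it.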

20 of 32

Graph reconstruction error

  • A linear layer reduces the dimension of the combined embeddings
  • Reconstruct a single graph from the reduced node embeddings
  • Loss function sums the reconstruction error over each original graph, using a binary mask vector of the nodes present in graph j
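A sketch of the masked reconstruction loss, assuming the single reconstructed graph is the matrix of inner products of the reduced node features (a standard autoencoder choice; treat this as an illustration rather than BIONIC's exact loss):

```python
def reconstruction_loss(features, adjacencies, node_masks):
    """Sum of squared reconstruction errors against each original graph.
    features    : reduced node embeddings (n x d)
    adjacencies : one n x n adjacency (or weight) matrix per input graph
    node_masks  : per-graph binary vectors; only pairs of nodes present
                  in a graph contribute to that graph's error
    """
    n, d = len(features), len(features[0])
    # Reconstructed graph: inner products of node feature vectors
    recon = [[sum(features[u][k] * features[v][k] for k in range(d))
              for v in range(n)] for u in range(n)]
    loss = 0.0
    for adj, mask in zip(adjacencies, node_masks):
        for u in range(n):
            for v in range(n):
                if mask[u] and mask[v]:
                    loss += (adj[u][v] - recon[u][v]) ** 2
    return loss
```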

21 of 32

Extension to semi-supervised learning

  • If we have partial node labels, could use those to improve node representation learning
    • For example, gene functions

  • A hybrid between the unsupervised and supervised graph learning approaches we have seen previously

22 of 32

Use node embeddings to predict labels

  • Predict labels for all genes: multiply the node embeddings by a weight matrix, then apply a sigmoid function
  • Label loss compares predictions to the true node labels, restricted by a binary label mask; the loss function sums over nodes and graphs
  • Combined loss: graph reconstruction loss plus the label-prediction loss
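A sketch with a single binary label per gene, assuming the combined loss adds a masked cross-entropy term to the reconstruction loss (the real model predicts many functional categories at once with a weight matrix; this is a one-label illustration):

```python
import math

def predict_label(embedding, weights):
    """Predicted label probability: sigmoid of a linear score."""
    score = sum(w * e for w, e in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-score))

def combined_loss(recon_loss, embeddings, weights, labels, label_mask, lam=1.0):
    """Reconstruction loss plus cross-entropy over labeled nodes only;
    the binary label mask zeroes out unlabeled nodes."""
    cls = 0.0
    for emb, y, m in zip(embeddings, labels, label_mask):
        if m:
            p = predict_label(emb, weights)
            cls -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return recon_loss + lam * cls
```

With no labeled nodes the classification term vanishes and the model reduces to the unsupervised autoencoder.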

23 of 32

BIONIC implementation details

  • Node features are initially one-hot encodings
  • Optional network batching for scalability
    • Sample subsets of the networks during each training step
    • Constant memory footprint
  • Node sampling during training
  • Can scale to many networks and many nodes
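Network batching can be sketched as drawing a fixed-size subset of the input networks at each training step, so per-step memory depends on the batch size rather than the total network count (names here are illustrative):

```python
import random

def sample_network_batch(networks, batch_size, rng=None):
    """Pick a random subset of the input networks for one training step.
    Per-step memory scales with batch_size, not the number of networks."""
    rng = rng or random.Random()
    return rng.sample(networks, min(batch_size, len(networks)))

# One step might train on, e.g., 2 of the available networks
batch = sample_network_batch(["ppi", "coexpr", "genetic"], 2)
```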

24 of 32

Evaluating BIONIC

  • Many evaluations; we’ll focus on:
    • Predicting protein complexes
    • Semi-supervised learning
    • Scalability

  • They also
    • Compared to many strong baseline algorithms
    • Generated extensive chemical-genetic screening data
      • Prediction setting is complicated, not fully prospective
      • Statistically significant but no baseline algorithms for comparison

25 of 32

Predicting protein modules

Intrinsic: use node embeddings directly

Extrinsic: train SVM on node embeddings
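Intrinsic evaluation works on the embedding space directly, for example calling a gene pair co-complex when the cosine similarity of their embeddings clears a threshold (the threshold and pairing rule here are illustrative, not the paper's exact protocol):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two node embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def co_complex_pairs(embeddings, threshold=0.9):
    """Predict co-complex gene pairs directly from the embeddings."""
    names = sorted(embeddings)
    return {(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold}
```

The extrinsic alternative instead feeds the embeddings to a supervised classifier (e.g. an SVM) trained on known module memberships.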

26 of 32

Predicting LSM2-7 protein complex

Complex interactions missing from individual networks

27 of 32

Predicting LSM2-7 protein complex

Complex interactions missing from individual networks

Wait… really?

28 of 32

Semi-supervised learning

GeneMANIA: label propagation on association networks

29 of 32

BIONIC scalability

30 of 32

BIONIC scalability

31 of 32

BIONIC summary

  • Integrates multiple networks to learn improved node embeddings
  • Based on graph neural network ideas with customizations
    • Semi-supervised learning
    • Masking to combine embeddings across graphs
    • Sampling during training for scalability
  • Follows machine learning best practices in training, tuning, and evaluation

32 of 32

Network-based data integration summary

Problem: Prioritizing candidate disease genes
  Goal: Rank candidates using global network relationships
  Input: PPI network, known disease genes, candidates
  Algorithm: Random walk with restart or graph diffusion
  Output: Ranked list of candidates

Problem: Identifying disease gene subnetworks
  Goal: Find connected subnetworks with mutated genes that span many patients
  Input: PPI network, mutations per patient
  Algorithm: Heat kernel diffusion, identify subnetworks, assess significance
  Output: Subnetworks, significance, patients with mutations in the subnetworks

Problem: Relating multiple omic measurements of one process
  Goal: Select edges connecting important nodes of multiple types
  Input: PPI network, edge costs, node scores
  Algorithm: Prize-collecting Steiner forest
  Output: Selected edges and nodes

Problem: Clustering samples using multiple omic measurements
  Goal: Jointly use all data types to inform the clustering
  Input: Multiple types of omic data for each sample
  Algorithm: Form similarity matrices, update iteratively across data types, spectral clustering
  Output: Consensus sample-sample similarities and sample clusters

Problem: Combining complementary graphs
  Goal: Learn a consensus node embedding from all graphs
  Input: Multiple biological networks, optional node labels
  Algorithm: Multi-graph autoencoder
  Output: Node embeddings
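As a concrete instance of the first problem above, a minimal random walk with restart on a dense adjacency matrix (a pure-Python sketch; real implementations use sparse linear algebra and a convergence test rather than a fixed iteration count):

```python
def random_walk_with_restart(adj, seeds, restart=0.3, iters=200):
    """Rank nodes by the stationary probability of a walker that, at each
    step, restarts at the known disease genes with probability `restart`
    and otherwise moves to a neighbor chosen by edge weight."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Restart distribution: uniform over the seed (known disease) genes
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = list(p0)
    for _ in range(iters):
        nxt = [restart * p0[v] for v in range(n)]
        for u in range(n):
            if deg[u]:
                share = (1.0 - restart) * p[u] / deg[u]
                for v in range(n):
                    if adj[u][v]:
                        nxt[v] += share * adj[u][v]
        p = nxt
    return p

# Path graph 0 - 1 - 2 with known disease gene 0:
# candidates closer to the seed receive higher stationary probability
```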