1 of 32

Network-based integration of multiple networks

Nov 26th, 2024

BMI/CS 775 Computational Network Biology, Fall 2024

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

2 of 32

Topics in this section

  • Graph-based approaches for gene prioritization
  • Graph diffusion to interpret sequence variants
  • Combinatorial graph algorithms for subnetwork selection
  • Multi-omic data integration
  • Network integration with graph neural networks

3 of 32

Why BIONIC?

4 of 32

Appeal of representation learning on graphs

  • Already seen power of graph representation learning
    • Discover modules
    • Predict gene and protein function
    • Prioritize experiments

  • Typically used a single graph
    • For example, protein-protein interactions

5 of 32

Recurring theme: unique information captured in different omics assays

Image: TCGA, Gligorijević et al., Proteomics 2015

Have discussed this in terms of node information

Also true for edge information

6 of 32

Three types of biological networks

Images: Fout et al. NIPS 2017, Jung & Choi 2013, van Leeuwen et al. 2017

  • Protein-protein interaction
  • Gene co-expression
  • Genetic interaction

7 of 32

BIONIC goal

How can we perform graph representation learning across diverse biological networks?

Improve node embeddings

8 of 32

BIONIC

SNF (Bo Wang):
  • Input: sample-sample networks
  • Output: clustered samples

BIONIC:
  • Input: gene-gene networks
  • Output: node (gene) embeddings, used to cluster genes and predict gene function

Both share information across disjoint networks.

9 of 32

BIONIC versus SNF coexpression

Image: SNF co-expression and BIONIC co-expression gene similarity matrices (genes × genes), derived from the input gene similarity networks

10 of 32

Genetic interactions

A genetic interaction occurs when the phenotype of a double mutant is unexpected given the phenotypes of the individual single mutants
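One common formalization of "unexpected" is the multiplicative null model (an assumption for illustration, not necessarily the slide's exact definition): the expected double-mutant fitness is the product of the single-mutant fitnesses, and the interaction score is the deviation from that expectation.

```python
def interaction_score(w_a, w_b, w_ab):
    """Genetic interaction score under a multiplicative null model:
    the expected double-mutant fitness is the product of the two
    single-mutant fitnesses; the score is the observed deviation."""
    return w_ab - w_a * w_b

# Synthetic sick/lethal pair: the double mutant is far worse than expected
epsilon = interaction_score(0.8, 0.9, 0.3)  # 0.3 - 0.72 = -0.42
```

A negative score indicates a synthetic sick or lethal interaction; a positive score indicates suppression or epistasis.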

11 of 32

Restating the BIONIC goal

Input: multiple biological networks

Optional node labels

Output: informative node embeddings

12 of 32

Top Hat question

13 of 32

BIONIC algorithm overview

14 of 32

BIONIC algorithm overview

Multiple input networks

15 of 32

BIONIC algorithm overview

Pass each graph through multiple layers of a graph attention network

Combine the graph-specific node embeddings into one embedding per node

16 of 32

BIONIC algorithm overview

Reconstruct a single graph from the combined node embeddings

17 of 32

BIONIC algorithm overview

Combine graph reconstruction error with respect to each original graph

18 of 32

Learning the node embeddings

  • Important customizations to learn node embeddings across graphs

  • Residual connections sum all graph convolutional network (GCN) outputs
  • Supports any number of GCN layers, here 3
  • Use graph attention networks
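The residual scheme can be sketched in miniature: instead of keeping only the final layer's output, sum every layer's output. This toy treats each "layer" as a plain function on a feature vector (an illustrative stand-in; the real layers are graph attention layers operating on all nodes).

```python
def encode_with_residuals(features, layers):
    """Residual-style encoder: apply the layers in sequence, but return
    the sum of every layer's output rather than only the last output."""
    total = None
    h = features
    for layer in layers:
        h = layer(h)
        total = list(h) if total is None else [a + b for a, b in zip(total, h)]
    return total

def double(v):
    """Stand-in for a GAT layer: doubles each feature."""
    return [2 * x for x in v]

out = encode_with_residuals([1.0], [double, double, double])
# Layer outputs are [2.0], [4.0], [8.0]; their sum is [14.0]
```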

19 of 32

Learning the node embeddings

  • Stochastic mask to combine graph-specific node embeddings

The combined embedding can be written as a masked, scaled sum over the input graphs:

  z_i = Σ_j c_j · m_ij · h_ij

  c_j : learned scaling parameter
  h_ij : graph-specific embedding of node i from graph j
  m_ij : mask value for node i in graph j
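A minimal sketch of this combination step, assuming the combined embedding is a scaled, masked sum of the graph-specific embeddings (consistent with the terms on this slide, though the exact BIONIC formulation may differ):

```python
def combine_embeddings(embs, masks, scales):
    """Combine graph-specific node embeddings into one embedding per node.
    embs[j][i]  : embedding of node i learned from graph j
    masks[j][i] : mask value for node i in graph j (0 if the node is
                  absent; sampled stochastically during training)
    scales[j]   : learned scaling parameter for graph j
    """
    n_nodes, dim = len(embs[0]), len(embs[0][0])
    combined = [[0.0] * dim for _ in range(n_nodes)]
    for j in range(len(embs)):
        for i in range(n_nodes):
            w = scales[j] * masks[j][i]
            for d in range(dim):
                combined[i][d] += w * embs[j][i][d]
    return combined
```

Masking lets a node missing from one network still receive an embedding from the networks that do contain it.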

20 of 32

Graph reconstruction error

  • A linear layer reduces the dimension of the combined embeddings
  • Reconstruct a single graph from the reduced node embeddings
  • Loss function sums the reconstruction error over each original graph, using a binary mask vector of the nodes present in graph j
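A sketch of the masked reconstruction loss, assuming the single reconstructed graph is the matrix of inner products of the reduced node features (a standard autoencoder choice; treat this as an illustration rather than BIONIC's exact loss):

```python
def reconstruction_loss(features, adjacencies, node_masks):
    """Sum of squared reconstruction errors against each original graph.
    features    : reduced node embeddings (n x d)
    adjacencies : one n x n adjacency (or weight) matrix per input graph
    node_masks  : per-graph binary vectors; only pairs of nodes present
                  in a graph contribute to that graph's error
    """
    n, d = len(features), len(features[0])
    # Reconstructed graph: inner products of node feature vectors
    recon = [[sum(features[u][k] * features[v][k] for k in range(d))
              for v in range(n)] for u in range(n)]
    loss = 0.0
    for adj, mask in zip(adjacencies, node_masks):
        for u in range(n):
            for v in range(n):
                if mask[u] and mask[v]:
                    loss += (adj[u][v] - recon[u][v]) ** 2
    return loss
```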

21 of 32

Extension to semi-supervised learning

  • If we have partial node labels, could use those to improve node representation learning
    • For example, gene functions

  • A hybrid between the unsupervised and supervised graph learning approaches we have seen previously

22 of 32

Use node embeddings to predict labels

  • Predict labels for all genes: multiply the node embeddings by a weight matrix, then apply a sigmoid function
  • Label loss compares predictions to the true node labels, restricted by a binary label mask; the loss function sums over nodes and graphs
  • Combined loss: graph reconstruction loss plus the label-prediction loss
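A sketch with a single binary label per gene, assuming the combined loss adds a masked cross-entropy term to the reconstruction loss (the real model predicts many functional categories at once with a weight matrix; this is a one-label illustration):

```python
import math

def predict_label(embedding, weights):
    """Predicted label probability: sigmoid of a linear score."""
    score = sum(w * e for w, e in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-score))

def combined_loss(recon_loss, embeddings, weights, labels, label_mask, lam=1.0):
    """Reconstruction loss plus cross-entropy over labeled nodes only;
    the binary label mask zeroes out unlabeled nodes."""
    cls = 0.0
    for emb, y, m in zip(embeddings, labels, label_mask):
        if m:
            p = predict_label(emb, weights)
            cls -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return recon_loss + lam * cls
```

With no labeled nodes the classification term vanishes and the model reduces to the unsupervised autoencoder.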

23 of 32

BIONIC implementation details

  • Node features are initially one-hot encodings
  • Optional network batching for scalability
    • Sample subsets of the networks during each training step
    • Constant memory footprint
  • Node sampling during training
  • Can scale to many networks and many nodes
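Network batching can be sketched as drawing a fixed-size subset of the input networks at each training step, so per-step memory depends on the batch size rather than the total network count (names here are illustrative):

```python
import random

def sample_network_batch(networks, batch_size, rng=None):
    """Pick a random subset of the input networks for one training step.
    Per-step memory scales with batch_size, not the number of networks."""
    rng = rng or random.Random()
    return rng.sample(networks, min(batch_size, len(networks)))

# One step might train on, e.g., 2 of the available networks
batch = sample_network_batch(["ppi", "coexpr", "genetic"], 2)
```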

24 of 32

Evaluating BIONIC

  • Many evaluations; we’ll focus on:
    • Predicting protein complexes
    • Semi-supervised learning
    • Scalability

  • They also
    • Compared to many strong baseline algorithms
    • Generated extensive chemical-genetic screening data
      • Prediction setting is complicated, not fully prospective
      • Statistically significant but no baseline algorithms for comparison

25 of 32

Predicting protein modules

Intrinsic: use node embeddings directly

Extrinsic: train SVM on node embeddings
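Intrinsic evaluation works on the embedding space directly, for example calling a gene pair co-complex when the cosine similarity of their embeddings clears a threshold (the threshold and pairing rule here are illustrative, not the paper's exact protocol):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two node embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def co_complex_pairs(embeddings, threshold=0.9):
    """Predict co-complex gene pairs directly from the embeddings."""
    names = sorted(embeddings)
    return {(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold}
```

The extrinsic alternative instead feeds the embeddings to a supervised classifier (e.g. an SVM) trained on known module memberships.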

26 of 32

Predicting LSM2-7 protein complex

Complex interactions missing from individual networks

27 of 32

Predicting LSM2-7 protein complex

Complex interactions missing from individual networks

Wait… really?

28 of 32

Semi-supervised learning

GeneMANIA: label propagation on association networks

29 of 32

BIONIC scalability

30 of 32

BIONIC scalability

31 of 32

BIONIC summary

  • Integrates multiple networks to learn improved node embeddings
  • Based on graph neural network ideas with customizations
    • Semi-supervised learning
    • Masking to combine embeddings across graphs
    • Sampling during training for scalability
  • Follows machine learning best practices in training, tuning, and evaluation

32 of 32

Network-based data integration summary

Problem: Prioritizing candidate disease genes
  Goal: Rank candidates using global network relationships
  Input: PPI network, known disease genes, candidates
  Algorithm: Random walk with restart or graph diffusion
  Output: Ranked list of candidates

Problem: Identifying disease gene subnetworks
  Goal: Find connected subnetworks with mutated genes that span many patients
  Input: PPI network, mutations per patient
  Algorithm: Heat kernel diffusion, identify subnetworks, assess significance
  Output: Subnetworks, significance, patients with mutations in the subnetworks

Problem: Relating multiple omic measurements of one process
  Goal: Select edges connecting important nodes of multiple types
  Input: PPI network, edge costs, node scores
  Algorithm: Prize-collecting Steiner forest
  Output: Selected edges and nodes

Problem: Clustering samples using multiple omic measurements
  Goal: Jointly use all data types to inform the clustering
  Input: Multiple types of omic data for each sample
  Algorithm: Form similarity matrices, update iteratively across data types, spectral clustering
  Output: Consensus sample-sample similarities and sample clusters

Problem: Combining complementary graphs
  Goal: Learn a consensus node embedding from all graphs
  Input: Multiple biological networks, optional node labels
  Algorithm: Multi-graph autoencoder
  Output: Node embeddings
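As a concrete instance of the first problem above, a minimal random walk with restart on a dense adjacency matrix (a pure-Python sketch; real implementations use sparse linear algebra and a convergence test rather than a fixed iteration count):

```python
def random_walk_with_restart(adj, seeds, restart=0.3, iters=200):
    """Rank nodes by the stationary probability of a walker that, at each
    step, restarts at the known disease genes with probability `restart`
    and otherwise moves to a neighbor chosen by edge weight."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Restart distribution: uniform over the seed (known disease) genes
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = list(p0)
    for _ in range(iters):
        nxt = [restart * p0[v] for v in range(n)]
        for u in range(n):
            if deg[u]:
                share = (1.0 - restart) * p[u] / deg[u]
                for v in range(n):
                    if adj[u][v]:
                        nxt[v] += share * adj[u][v]
        p = nxt
    return p

# Path graph 0 - 1 - 2 with known disease gene 0:
# candidates closer to the seed receive higher stationary probability
```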