1 of 47

Network-based integration with few samples

Nov 19^th, 2024

BMI/CS 775 Computational Network Biology�Fall 2024

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

Includes slides created by Prof. Sushmita Roy

2 of 47

Topics in this section

Graph-based approaches for gene prioritization
Graph diffusion to interpret sequence variants
Combinatorial graph algorithms for subnetwork selection
Multi-omic data integration
Network integration with graph neural networks

3 of 47

Biological data is of many different types

Image credit: TCGA, Gligorevic et al., Proteomics 2015

4 of 47

Need for systematic approaches for data integration

Approach to integrate data types depends on the goal and the types of data available
Considerations

Number of samples per data type
Available background knowledge
Types of measurements

Gene sets versus quantitative profiles

5 of 47

Network-based approaches for integrating data

Network inference-based

Learn mixed graphical models where different variable types (different probability distribution families) represent different omic data types

Diffusion-based

Similarity Network Fusion (Wang et al., Nature Methods 2014)
MASHUP (Cho et al., Cell Systems 2016)
GeneMania (Mostafavi et al, Genome Biology 2008)

Subnetwork selection-based methods (our focus today)

Still applicable for a small number of samples
Prize-collecting Steiner tree
Integer linear program
Min cost flow
Weighted k shortest path

6 of 47

Subnetwork selection problem definition

Given

Omic data for one or more data types
Background network (e.g. protein-protein, protein-DNA, protein-metabolite interactions)

Find a sparse subnetwork that connects many of the omic measurements

Sometimes called “pathway reconstruction”

7 of 47

Running example: host cell responses in KSHV and mock infections

RNA-seq

Proteomics

Phosphoproteomics

Sychev et al. PLoS Pathogens 2017

What happens after viral infection?

8 of 47

Variety of subnetwork selection methods

Main difference is the objective function that determines which edges are selected
Prize collecting Steiner tree

Connect high scoring nodes while minimizing edge costs

Integer linear program

Most flexible, customize objective and constraints

Min cost flow

Push flow from sources to targets while respecting edge constraints

Weighted k shortest path

Identify multiple short weighted source-target paths

9 of 47

Steiner tree problem definition

Given

Terminal nodes S that must be connected, a subset of V
Edge-weighted graph G={V, E, w}

Return the minimal cost connected subnetwork containing all nodes in S

10 of 47

Steiner tree observations

Returned subnetwork potentially includes intermediate nodes
This problem is NP-complete

Similar to minimum spanning tree but more complex because we can choose the intermediate nodes

Optimal solution is always a tree structured graph

Exactly one path between any pair of nodes
Why is this?

11 of 47

Steiner tree examples

Terminals

S={A, B, C}

Input

Solution

12 of 47

Prize-collecting Steiner tree objective function

Generalize by adding scores (prizes) to nodes
No longer have to include all nodes in S
Maximize selected prizes while minimizing selected edge costs

Solve using variety of optimization techniques

Integer linear programming (Ljubic et al., 2006)
Belief propagation (Bailly-Bechet et al., 2011 PNAS)
Divide-and-conquer heuristic (Akhmedov et al., 2017 PLOS Computational Biology)

Where do the prizes and edge costs come from?

13 of 47

Score for proteins in KSHV infection

Proteomics: small p-value high score

RNA-seq: infer transcription factor (TF) activities

Mock KSHV

High TF score

Low TF score

14 of 47

PPI network costs for edges

http://wodaklab.org/iRefWeb

SLN1-YPD1:

Cost = 1 – confidence score

15 of 47

Visualizing the Steiner tree problem

Protein-protein interactions

KSHV change

No change

Designate special nodes based on KSHV scores
Assign costs to edges based on reliability
Collect special nodes while minimizing edge costs

These slides derived from examples by Ernest Fraenkel

16 of 47

Visualizing the Steiner tree problem

Node prize�doesn’t justify the �edge costs

Steiner node

KSHV change

No data

Designate special nodes based on KSHV scores
Assign costs to edges based on reliability
Collect special nodes while minimizing edge costs

17 of 47

Prize-collecting Steiner forest objective

3. Collect prizes while minimizing edge costs

Controls tradeoff between prizes and costs

Prizes of excluded vertices (nodes)

Costs of included edges

Minimize

Find forest (subnetwork)

Forest is a collection of disconnected trees

Tuncbag et al. Journal of Computational Biology 2013

18 of 47

19 of 47

Constructing a probabilistic approximation algorithm

msgsteiner algorithm Bailly-Bechet et al. PNAS 2011

How many possible subgraphs?

How can we perform inference in that state space?

20 of 47

Transform global tree constraint into local constraints

Add Root node (R) and Null node (*)

21 of 47

Transform global tree constraint into local constraints

Create two variables for every node i:

pa_i – parent node

d_i – depth from root

22 of 47

Translate omitted prize cost into edge cost

Valid tree

pa_i = *

d_i = ∞

c_i,* = λ p(v_i)

Node excluded from tree incurs cost equal to its prize

23 of 47

Probabilistic optimization

Boltzmann-Gibbs distribution

H is the “energy” of the state

Statistical mechanics: cavity method�Probabilistic graphical models: belief propagation

24 of 47

Iterative optimization with belief propagation

Node i receives messages from neighbors

Node i updates state of pa_i and d_i

J. Ortiz, T. Evans, A. Davison. A visual introduction to Gaussian Belief Propagation, 2021. https://gaussianbp.github.io/

25 of 47

Combining forests into pathways

…

26 of 47

Latent KSHV “pathway”

27 of 47

Zooming in on regions of interest

Peroxisome subnetwork

28 of 47

Peroxisome validation

Peroxisome: organelle involved in metabolism, very long chain fatty acids

Peroxisome counts increase upon infection

Knocking down peroxisomal enzymes increases cell death only in KSHV-infected cells

29 of 47

Integer linear programs (ILPs) for subnetwork selection

Most flexible approach
Customize a linear objective function and constraints
Can formulate prize collecting Steiner forest as an integer linear program

Examples:

SPINE (Ourfali et al., 2007 Bioinformatics)
Chasman et al., 2014 PLOS Computational Biology
CARNIVAL (Liu et al., 2019 npj Systems Biology and Applications)

30 of 47

ILP for viral replication subnetwork

Chasman et al., 2014 PLOS Computational Biology

31 of 47

Key idea in ILPs

Define free variables that control which elements are in the subnetwork

Is this path included or excluded?
Is this edge included or excluded?
Is this node included or excluded?

Define constraints that give valid subnetworks

If an edge is included, both nodes must be included

Define objective function use solve to fix the free variables

32 of 47

Variables in viral replication ILP

Chasman et al., 2014 PLOS Computational Biology

Set up for complex reasoning about directionality of paths and signs of effects (increase or decrease in viral replication)

33 of 47

Variables in viral replication ILP

Chasman et al., 2014 PLOS Computational Biology

Is path number 9 selected?

Is edge IV selected? Is it activating?

Is node C selected? Is it up during replication?

Is edge I selected? Activating? Directed?

Is edge III selected? Is it inhibiting?

34 of 47

Viral replication ILP objective

Maximize score of selected nodes that are not experimental hits

N nodes in the network

N^H subset of nodes that are hits

score(n) diffusion kernel score from hits

y_n binary variable, node present in any path

35 of 47

Viral replication ILP constraints

Limit number of edges connecting to virus node

If a path is selected, all edges must be selected (similar for node and edge selection)

Edges are activating or inhibiting

Others not shown

E^X edges connecting to virus

x_e edge selected

γ constant

P(e) all paths for edge

σ_p enumerated path

a_e edge is activating

h_e edge is inhibiting

36 of 47

Information flow problem definition

Given

Two node sets: sources and sinks (targets)
Directed network with edge weights and edge capacities
Amount of flow to transmit

Find the subnetwork that transmits the specified flow between the two node sets at the minimum cost

37 of 47

Information flow between sink to source nodes

Source nodes

Sink (target) nodes

Intermediate nodes

Transmit flow

38 of 47

Information flow-based methods

Used for integrating different types of data, as well as for examining perturbations and their effect
Integration of different types of “omics” data

Min cost flow (ResponseNet, Yeger-Lotem et al. 2008)

Details at the end of the slides

39 of 47

Network-based interpretation and integration summary

Subnetwork selection can integrate different data types and prioritize observations that are better connected to others
Filters experimental noise, nominates unobserved connective nodes
Objective function and constraints dictate

Available solvers
Computational complexity
Topology of the identified subnetworks

Work well even with only a few samples
Focus on selecting nodes and edges

Recall HotNet only identifies connected nodes, not edges in the original background network

40 of 47

Web and software resources

PathLinker (k shortest paths)

https://github.com/Murali-group/PathLinker

HotNet2

http://compbio.cs.brown.edu/projects/hotnet2/

ResponseNet (flow network)

http://netbio.med.ad.bgu.ac.il/respnet/
https://github.com/gitter-lab/min-cost-flow (reimplementation)

OmicsIntegrator (PCSF)

http://fraenkel-nsf.csbi.mit.edu/omicsintegrator/

Signaling Pathway Reconstruction Analysis Streamliner

https://github.com/Reed-CompBio/spras

41 of 47

Information flow details

42 of 47

Information flow notation

A flow network is defined as directed graph G=(V,E), with capacities for each edge

V: vertex set
E: edge set

s: source node
t: sink node
c(u,v)>0: Capacity of edge (u,v)

43 of 47

Flow in a graph G

A flow in G is defined by a function f that has the following properties for each edge (u,v):

The value of a flow is defined as

Capacity constraint

Conservation of flow

44 of 47

An example flow network

v₃

v₄

v₁

v₂

v₃

v₄

v₁

v₂

11(16)

12(12)

8(13)

1(4)

4(9)

11(14)

7(7)

15(20)

4(4)

Flow network G

A flow of 19 on G

Only positive flows are shown

(10)

45 of 47

Max-flow problem

Given

A flow network G, source s and sink t

find a flow f with maximum value

Ford-Fulkerson algorithm

46 of 47

Variation: Min cost flow

Often the question is not to maximize flow, but to find the most efficient/least expensive away of doing this
In addition to the flow, there is also a cost associated with each edge

For example, the cost might be inversely proportional to the edge confidence

So we would try to maximize the overall flow at the smallest cost

47 of 47

Min cost flow

Define cost of each edge as a(u,v)
Overall cost:

Minimize cost while transmitting flow as follows:

This idea was used in ResponseNet tool

E. Yeger-Lotem, L. Riva, L. J. J. Su, A. D. Gitler, A. G. Cashikar, O. D. King, P. K. Auluck, M. L. Geddie, J. S. Valastyan, D. R. Karger, S. Lindquist, and E. Fraenkel, "Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity." Nature Genetics, vol. 41, no. 3, pp. 316-323, Mar. 2009.