1 of 47

Network-based integration with few samples

Nov 19th, 2024

BMI/CS 775 Computational Network Biology�Fall 2024

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

Includes slides created by Prof. Sushmita Roy

2 of 47

Topics in this section

  • Graph-based approaches for gene prioritization
  • Graph diffusion to interpret sequence variants
  • Combinatorial graph algorithms for subnetwork selection
  • Multi-omic data integration
  • Network integration with graph neural networks

3 of 47

Biological data is of many different types

Image credit: TCGA, Gligorevic et al., Proteomics 2015

4 of 47

Need for systematic approaches for data integration

  • Approach to integrate data types depends on the goal and the types of data available
  • Considerations
    • Number of samples per data type
    • Available background knowledge
    • Types of measurements
      • Gene sets versus quantitative profiles

5 of 47

Network-based approaches for integrating data

  • Network inference-based
    • Learn mixed graphical models where different variable types (different probability distribution families) represent different omic data types
  • Diffusion-based
    • Similarity Network Fusion (Wang et al., Nature Methods 2014)
    • MASHUP (Cho et al., Cell Systems 2016)
    • GeneMania (Mostafavi et al, Genome Biology 2008)
  • Subnetwork selection-based methods (our focus today)
    • Still applicable for a small number of samples
    • Prize-collecting Steiner tree
    • Integer linear program
    • Min cost flow
    • Weighted k shortest path

6 of 47

Subnetwork selection problem definition

  • Given
    • Omic data for one or more data types
    • Background network (e.g. protein-protein, protein-DNA, protein-metabolite interactions)
  • Do
    • Find a sparse subnetwork that connects many of the omic measurements

  • Sometimes called “pathway reconstruction”

7 of 47

Running example: host cell responses in KSHV and mock infections

RNA-seq

Proteomics

Phosphoproteomics

What happens after viral infection?

8 of 47

Variety of subnetwork selection methods

  • Main difference is the objective function that determines which edges are selected
  • Prize collecting Steiner tree
    • Connect high scoring nodes while minimizing edge costs
  • Integer linear program
    • Most flexible, customize objective and constraints
  • Min cost flow
    • Push flow from sources to targets while respecting edge constraints
  • Weighted k shortest path
    • Identify multiple short weighted source-target paths

9 of 47

Steiner tree problem definition

  • Given
    • Terminal nodes S that must be connected, a subset of V
    • Edge-weighted graph G={V, E, w}
  • Do
    • Return the minimal cost connected subnetwork containing all nodes in S

10 of 47

Steiner tree observations

  • Returned subnetwork potentially includes intermediate nodes
  • This problem is NP-complete
    • Similar to minimum spanning tree but more complex because we can choose the intermediate nodes
  • Optimal solution is always a tree structured graph
    • Exactly one path between any pair of nodes
    • Why is this?

11 of 47

Steiner tree examples

A

B

C

E

Terminals

S={A, B, C}

A

B

C

E

D

G

F

Input

Solution

1

1

1

1

1

2

2

2

12 of 47

Prize-collecting Steiner tree objective function

  • Generalize by adding scores (prizes) to nodes
  • No longer have to include all nodes in S
  • Maximize selected prizes while minimizing selected edge costs

  • Solve using variety of optimization techniques
    • Integer linear programming (Ljubic et al., 2006)
    • Belief propagation (Bailly-Bechet et al., 2011 PNAS)
    • Divide-and-conquer heuristic (Akhmedov et al., 2017 PLOS Computational Biology)

  • Where do the prizes and edge costs come from?

13 of 47

Score for proteins in KSHV infection

  • Proteomics: small p-value high score

  • RNA-seq: infer transcription factor (TF) activities

Mock KSHV

Mock KSHV

High TF score

Low TF score

14 of 47

PPI network costs for edges

SLN1-YPD1:

Cost = 1 – confidence score

15 of 47

Visualizing the Steiner tree problem

Protein-protein interactions

KSHV change

No change

  1. Designate special nodes based on KSHV scores
  2. Assign costs to edges based on reliability
  3. Collect special nodes while minimizing edge costs

These slides derived from examples by Ernest Fraenkel

16 of 47

Visualizing the Steiner tree problem

Node prize�doesn’t justify the �edge costs

Steiner node

KSHV change

No data

  1. Designate special nodes based on KSHV scores
  2. Assign costs to edges based on reliability
  3. Collect special nodes while minimizing edge costs

17 of 47

Prize-collecting Steiner forest objective

3. Collect prizes while minimizing edge costs

Controls tradeoff between prizes and costs

Prizes of excluded vertices (nodes)

Costs of included edges

Minimize

Find forest (subnetwork)

Forest is a collection of disconnected trees

18 of 47

Top Hat question

19 of 47

Constructing a probabilistic approximation algorithm

msgsteiner algorithm Bailly-Bechet et al. PNAS 2011

How many possible subgraphs?

How can we perform inference in that state space?

20 of 47

Transform global tree constraint into local constraints

R

Add Root node (R) and Null node (*)

*

21 of 47

Transform global tree constraint into local constraints

R

Create two variables for every node i:

pai – parent node

di – depth from root

*

22 of 47

Translate omitted prize cost into edge cost

R

Valid tree

pai = *

di = ∞

ci,* = λ p(vi)

Node excluded from tree incurs cost equal to its prize

*

23 of 47

Probabilistic optimization

Boltzmann-Gibbs distribution

H is the “energy” of the state

Statistical mechanics: cavity method�Probabilistic graphical models: belief propagation

24 of 47

Iterative optimization with belief propagation

Node i receives messages from neighbors

Node i updates state of pai and di

J. Ortiz, T. Evans, A. Davison. A visual introduction to Gaussian Belief Propagation, 2021. https://gaussianbp.github.io/

25 of 47

Combining forests into pathways

=

+

+

26 of 47

Latent KSHV “pathway”

27 of 47

Zooming in on regions of interest

Peroxisome subnetwork

28 of 47

Peroxisome validation

  • Peroxisome: organelle involved in metabolism, very long chain fatty acids

Peroxisome counts increase upon infection

Knocking down peroxisomal enzymes increases cell death only in KSHV-infected cells

29 of 47

Integer linear programs (ILPs) for subnetwork selection

  • Most flexible approach
  • Customize a linear objective function and constraints
  • Can formulate prize collecting Steiner forest as an integer linear program

  • Examples:
    • SPINE (Ourfali et al., 2007 Bioinformatics)
    • Chasman et al., 2014 PLOS Computational Biology
    • CARNIVAL (Liu et al., 2019 npj Systems Biology and Applications)

30 of 47

ILP for viral replication subnetwork

Chasman et al., 2014 PLOS Computational Biology

31 of 47

Key idea in ILPs

  • Define free variables that control which elements are in the subnetwork
    • Is this path included or excluded?
    • Is this edge included or excluded?
    • Is this node included or excluded?
  • Define constraints that give valid subnetworks
    • If an edge is included, both nodes must be included
  • Define objective function use solve to fix the free variables

32 of 47

Variables in viral replication ILP

Chasman et al., 2014 PLOS Computational Biology

Set up for complex reasoning about directionality of paths and signs of effects (increase or decrease in viral replication)

33 of 47

Variables in viral replication ILP

Chasman et al., 2014 PLOS Computational Biology

Is path number 9 selected?

Is edge IV selected? Is it activating?

Is node C selected? Is it up during replication?

Is edge I selected? Activating? Directed?

Is edge III selected? Is it inhibiting?

34 of 47

Viral replication ILP objective

  • Maximize score of selected nodes that are not experimental hits

N nodes in the network

NH subset of nodes that are hits

score(n) diffusion kernel score from hits

yn binary variable, node present in any path

35 of 47

Viral replication ILP constraints

  • Limit number of edges connecting to virus node

  • If a path is selected, all edges must be selected (similar for node and edge selection)

  • Edges are activating or inhibiting

  • Others not shown

EX edges connecting to virus

xe edge selected

γ constant

P(e) all paths for edge

σp enumerated path

ae edge is activating

he edge is inhibiting

36 of 47

Information flow problem definition

  • Given
    • Two node sets: sources and sinks (targets)
    • Directed network with edge weights and edge capacities
    • Amount of flow to transmit
  • Do
    • Find the subnetwork that transmits the specified flow between the two node sets at the minimum cost

37 of 47

Information flow between sink to source nodes

Source nodes

Sink (target) nodes

Intermediate nodes

Transmit flow

38 of 47

Information flow-based methods

  • Used for integrating different types of data, as well as for examining perturbations and their effect
  • Integration of different types of “omics” data
    • Min cost flow (ResponseNet, Yeger-Lotem et al. 2008)
  • Details at the end of the slides

39 of 47

Network-based interpretation and integration summary

  • Subnetwork selection can integrate different data types and prioritize observations that are better connected to others
  • Filters experimental noise, nominates unobserved connective nodes
  • Objective function and constraints dictate
    • Available solvers
    • Computational complexity
    • Topology of the identified subnetworks
  • Work well even with only a few samples
  • Focus on selecting nodes and edges
    • Recall HotNet only identifies connected nodes, not edges in the original background network

40 of 47

Web and software resources

41 of 47

Information flow details

42 of 47

Information flow notation

  • A flow network is defined as directed graph G=(V,E), with capacities for each edge
    • V: vertex set
    • E: edge set
  • s: source node
  • t: sink node
  • c(u,v)>0: Capacity of edge (u,v)

43 of 47

Flow in a graph G

  • A flow in G is defined by a function f that has the following properties for each edge (u,v):

  • The value of a flow is defined as

Capacity constraint

Conservation of flow

44 of 47

An example flow network

s

t

v3

v4

v1

v2

16

12

13

4

10

9

14

7

20

4

s

t

v3

v4

v1

v2

11(16)

12(12)

8(13)

1(4)

4(9)

11(14)

7(7)

15(20)

4(4)

Flow network G

A flow of 19 on G

Only positive flows are shown

(10)

45 of 47

Max-flow problem

  • Given
    • A flow network G, source s and sink t

  • Do
    • find a flow f with maximum value
  • How
    • Ford-Fulkerson algorithm

46 of 47

Variation: Min cost flow

  • Often the question is not to maximize flow, but to find the most efficient/least expensive away of doing this
  • In addition to the flow, there is also a cost associated with each edge
    • For example, the cost might be inversely proportional to the edge confidence
  • So we would try to maximize the overall flow at the smallest cost

47 of 47

Min cost flow

  • Define cost of each edge as a(u,v)
  • Overall cost:

  • Minimize cost while transmitting flow as follows:

  • This idea was used in ResponseNet tool
    • E. Yeger-Lotem, L. Riva, L. J. J. Su, A. D. Gitler, A. G. Cashikar, O. D. King, P. K. Auluck, M. L. Geddie, J. S. Valastyan, D. R. Karger, S. Lindquist, and E. Fraenkel, "Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity." Nature Genetics, vol. 41, no. 3, pp. 316-323, Mar. 2009.