1 of 72

Incorporating graph priors in Bayesian networks

Sep 28th 2021

BMI 826-23 Computational Network Biology, Fall 2021

Sushmita Roy

https://compnetbiocourse.discovery.wisc.edu

2 of 72

Plan for this section

  • Overview of network inference (Sep 21st)
  • Bayesian networks (Sep 21st, Sep 23rd)
  • Dependency networks (Sep 23rd)
  • Integrative inference of gene regulatory networks (Sep 28th, Sep 30th)
  • Inference of regulatory networks from single cell datasets (Oct 5th)

3 of 72

RECAP: Expression-based network inference methods

  • We have seen multiple types of algorithms to learn these networks
    • Per-gene methods (learn regulators for individual genes)
      • Sparse candidate, GENIE3, ARACNE, CLR
    • Per-module methods
      • Module networks: learn regulators for sets of genes/modules
      • Other implementations of module networks exist
  • Methods differ in
    • how they quantify dependence between genes
    • Higher-order or pairwise
    • Focus on structure or structure & parameters
  • Expression alone is not enough to infer the structure of the network
  • Integrative approaches that combine expression with other types of data are likely more successful

4 of 72

Plan for today

  • Overview of integrative network inference
  • Defining priors on graph structure
  • Dynamic Bayesian networks with MCMC search
  • Dynamic Bayesian networks applied to cancer signaling

5 of 72

Readings

  • Werhli, Adriano V., and Dirk Husmeier. 2007. “Reconstructing Gene Regulatory Networks with Bayesian Networks by Combining Expression Data with Multiple Sources of Prior Knowledge.” Statistical Applications in Genetics and Molecular Biology 6 (1). https://doi.org/10.2202/1544-6115.1282.

  • Hill, Steven M., Yiling Lu, Jennifer Molina, Laura M. Heiser, Paul T. Spellman, Terence P. Speed, Joe W. Gray, Gordon B. Mills, and Sach Mukherjee. 2012. “Bayesian Inference of Signaling Network Topology in a Cancer Cell Line.” Bioinformatics (Oxford, England) 28 (21): 2804–2810. https://doi.org/10.1093/bioinformatics/bts514.

6 of 72

Why prior-based structure learning?

  • Learning genome-scale networks is computationally challenging
    • The space of possible graphs is huge
    • Insufficient number of training examples to learn these networks reliably
    • Multiple equivalent models can be learned
  • One type of data (expression) might not inform us of all the regulatory edges
  • Priors can encode additional information to constrain the graph structure

7 of 72

Types of integrative inference frameworks

  • Supervised learning
    • Are integrative by design
    • Require examples of interactions and non-interactions
    • Train a classifier based on edge-specific features
    • Closely related to “Link prediction”
  • Unsupervised learning
    • Edge aggregation
    • Model-based learning

8 of 72

Unsupervised network inference

  • Do not assume the presence of a gold standard set of edges
  • Have been applied primarily for regulatory networks
  • Some approaches for integrative inference in regulatory networks
    • Inferelator (Greenfield et al., Bioinformatics 2013)
    • Lirnet (Lee et al., PLoS Computational Biology 2009)
    • Physical Module Networks (Novershtern et al., Bioinformatics 2011)
    • iRafNet (Petralia et al., 2015)
    • MERLIN+P (Fotuhi-Siahpirani & Roy, 2016)

9 of 72

Auxiliary data sources to serve as priors

  • ChIP-chip and ChIP-seq

  • Sequence-specific motifs in accessible regions of the genome, identified using DNase-seq

  • Factor knockout followed by whole-transcriptome profiling

  • Literature

ChIP-seq peaks

Image credit: Alireza Fotuhi Siahpirani

10 of 72

Classes of methods for incorporating priors

  • Parameter prior based approaches
      • Inferelator (Greenfield et al., Bioinformatics 2013)
      • Lirnet (Lee et al., PLoS Computational Biology 2009)
  • Structure prior based approaches
      • Dynamic Bayesian network (Hill et al., Bioinformatics, 2012, Werhli et al., 2007)
      • Physical module networks (Novershtern et al., Bioinformatics 2011)
      • MERLIN-P (Siahpirani et al., 2016)

11 of 72

Prior-based approaches for network inference

  • Given
    • Gene expression data and
    • Complementary data that supports the presence of an edge
      • Presence of a sequence motif on a gene promoter
      • ChIP-chip/seq binding of factor X on gene Y’s promoter
  • Do
    • Predict which regulators drive the expression of a target gene
  • How?
    • Place a prior on the graph where the prior is obtained from complementary data

12 of 72

Plan for today

  • Overview of integrative network inference
  • Defining priors on graph structure
  • Learning Bayesian networks with priors using Markov Chain Monte Carlo
  • Applications of Bayesian networks with priors
    • Inferring the yeast cell cycle network
    • Inferring cancer signaling

13 of 72

Bayesian formulation of network inference

Optimize posterior distribution of graph given data

P(G | D) ∝ P(D | G) × P(G)

posterior distribution ∝ data likelihood × model prior

[Figure: an algorithm takes the data as input and outputs a graph over regulators (e.g., X1, X2, X5) and targets (e.g., Y1, Y2).]

14 of 72

A few computational concepts

  • Energy of a graph and the Gibbs distribution
  • Dynamic Bayesian networks
  • Markov Chain Monte Carlo
  • Hyperparameters

15 of 72

Energy function of a network G

  • A function that measures agreement between a given graph G and prior knowledge
  • Allows one to incorporate both positive and negative prior knowledge

16 of 72

Energy function on a graph

  • A graph G is represented by a binary adjacency matrix
    • Gij = 0 if there is no edge from node i to node j
    • Gij = 1 if there is an edge from i to j
    • Gji = 1 if there is an edge from j to i
  • Encode a “prior” network B as follows:
    • Bij = 0.5 if we don’t have any prior knowledge
    • 0 ≤ Bij < 0.5 if we know that there is no edge
    • 0.5 < Bij ≤ 1 if we know there is an edge
  • Energy of G is E(G) = Σi,j |Bij − Gij|
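As a small sketch, the energy E(G) = Σi,j |Bij − Gij| can be computed directly from the two matrices; the prior matrix B and the two candidate graphs below are made-up examples:

```python
# Toy prior matrix B over 3 nodes: 0.5 = no prior knowledge,
# values near 1 (resp. 0) = prior evidence for (resp. against) an edge.
B = [[0.5, 0.9, 0.5],
     [0.1, 0.5, 0.7],
     [0.5, 0.5, 0.5]]

G_good = [[0, 1, 0],    # agrees with the prior where B is informative
          [0, 0, 1],
          [0, 0, 0]]
G_bad  = [[0, 0, 0],    # misses edges (0,1) and (1,2); adds edge (1,0)
          [1, 0, 0],
          [0, 0, 0]]

def energy(B, G):
    """E(G) = sum_ij |B_ij - G_ij|: mismatch between prior and graph."""
    n = len(B)
    return sum(abs(B[i][j] - G[i][j]) for i in range(n) for j in range(n))

print(energy(B, G_good))  # low energy: good agreement with the prior
print(energy(B, G_bad))   # higher energy: mismatch with the prior
```

Entries where Bij = 0.5 contribute 0.5 regardless of the graph, so only the informative prior entries change the ranking of graphs.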

17 of 72

Energy function of a graph

  • The energy E of a network G is defined as

    E(G) = Σi,j |Bij − Gij|

  • E(G) is 0 when there is perfect agreement between the prior knowledge B and G
  • The higher the energy of G, the greater the mismatch

18 of 72

Using the energy to define a prior distribution of a graph

  • A prior distribution for a graph G can be defined using E(G):

    P(G | β) = exp(−β E(G)) / Z(β)

  • This is also called a Gibbs distribution
  • β is the hyperparameter: a parameter of the prior distribution that controls the strength of the prior
  • Z(β) = ΣG exp(−β E(G)) is the partition function, which sums over all possible graphs

  • In general, the partition function is hard to compute
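For a graph small enough to enumerate, the Gibbs prior can be made fully explicit; a minimal sketch with a made-up 2-node prior matrix B (16 possible adjacency matrices):

```python
import itertools
import math

# Toy prior: some evidence for edge 0->1, against edge 1->0.
B = [[0.5, 0.8],
     [0.2, 0.5]]

def energy(G):
    return sum(abs(B[i][j] - G[i][j]) for i in range(2) for j in range(2))

def gibbs_prior(beta):
    """P(G | beta) = exp(-beta*E(G)) / Z(beta), by brute-force enumeration."""
    graphs = [[[a, b], [c, d]] for a, b, c, d in itertools.product([0, 1], repeat=4)]
    weights = [math.exp(-beta * energy(G)) for G in graphs]
    Z = sum(weights)                      # partition function: sum over all graphs
    return {tuple(map(tuple, G)): w / Z for G, w in zip(graphs, weights)}

prior = gibbs_prior(beta=2.0)
best = max(prior, key=prior.get)          # a highest-prior-probability graph
print(best, prior[best])
```

The highest-probability graphs contain the edge 0→1 and omit 1→0, exactly as B encodes; in realistic settings the graph space cannot be enumerated, which is why Z is hard to compute.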

19 of 72

Incorporating multiple sources of prior networks

  • Suppose we have two sources of prior knowledge
  • We can represent them as two prior networks B1 and B2
  • And define an energy of G with respect to each of them:

    E1(G) = Σi,j |B1,ij − Gij|  and  E2(G) = Σi,j |B2,ij − Gij|

20 of 72

Prior distribution incorporating multiple prior networks

  • The prior takes the form of another Gibbs distribution:

    P(G | β1, β2) = exp(−β1 E1(G) − β2 E2(G)) / Z(β1, β2)

  • This can be extended to more prior networks in the same way
  • The partition functions are in general hard to compute
  • However, for a particular class of BNs, these partition functions can be computed easily

21 of 72

Dynamic Bayesian networks

  • Bayesian networks that we have seen so far do not allow for cyclic dependencies
  • If we have time series data, we can overcome this limitation using a Dynamic Bayesian network

22 of 72

Dynamic Bayesian Nets (DBNs)

  • A DBN is a Bayesian Network for dynamic processes
  • Suppose we have a time course with T time points
  • Let X^t = {X1^t, …, Xp^t} denote the set of p random variables at time t
  • Let X = {X^1, …, X^T}
  • A DBN over these variables defines the joint distribution P(X)
  • A DBN, like a BN, has a directed acyclic graph G and parameters Θ
  • G typically specifies the dependencies between time points
    • In addition we need to specify dependence structure (if any) at the initial time point (t=1)

23 of 72

A DBN for p variables and T time points

[Figure: a DBN for p variables unrolled over T time points. Each column holds the variables at one time point (e.g., X^3 denotes the variables at t=3); edges point from variables at time t to variables at time t+1, with a separate dependency structure at the first time point (t=1).]
24 of 72

Stationary assumption in a Bayesian network

The stationarity assumption states that the dependency structure and parameters do not change with t

Due to this assumption, we only need to specify dependencies between two consecutive sets of variables (those at time t and those at time t+1)

[Figure: a two-slice template over variables X1..Xp at times t and t+1; under stationarity, this template determines the full DBN unrolled over t = 1..T.]
25 of 72

Computing the joint probability distribution in a DBN

The joint probability distribution can be factored into a product of conditional distributions across time and variables:

P(X) = P(X^1) × Π_{t=2..T} Π_{i=1..p} P(Xi^t | Pa(Xi^t))

where the parents Pa(Xi^t) of each Xi^t are defined by the graph G
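A minimal sketch of this factorization for p = 2 binary variables with first-order dependencies; here X1 at time t−1 is assumed to be the sole parent of both variables at time t, and all probability values are made up:

```python
import itertools

# P(X^1): initial joint distribution over (X1, X2)
p_init = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.2}
# P(X1^t = 1 | X1^{t-1}) and P(X2^t = 1 | X1^{t-1})
p_x1 = {0: 0.2, 1: 0.7}
p_x2 = {0: 0.5, 1: 0.9}

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(traj):
    """P(X) for a trajectory: list of (x1, x2) tuples, one per time point."""
    prob = p_init[traj[0]]
    for t in range(1, len(traj)):
        prev_x1 = traj[t - 1][0]
        prob *= bern(p_x1[prev_x1], traj[t][0]) * bern(p_x2[prev_x1], traj[t][1])
    return prob

# Sanity check: the joint sums to 1 over all trajectories of length T = 3
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
total = sum(joint(list(tr)) for tr in itertools.product(states, repeat=3))
print(total)
```

Because each factor is a proper conditional distribution, the product over time and variables is itself a proper joint distribution.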

26 of 72

The partition function for a prior over DBN

  • The energy for a DBN or a BN:

    E(G) = Σi,j |Bij − Gij|

  • Which can be re-written as a sum of local, per-node energies, since column j of G is determined by the parent set πj of node j:

    E(G) = Σj Ej(πj),  where  Ej(πj) = Σi |Bij − Gij|

27 of 72

The partition function for a DBN prior

  • Plugging this into the partition function, we get:

    Z(β) = ΣG exp(−β Σj Ej(πj))

  • Re-written as a sum over parent sets:

    Z(β) = Σπ1 … Σπp exp(−β Σj Ej(πj))

  • Because the exponential of a sum is a product of exponentials, we can rewrite the above as:

    Z(β) = Πj Σπj exp(−β Ej(πj))

28 of 72

The partition function for a DBN prior

  • The partition function of the DBN can be computed using local, easy to compute components
  • Each sum is over possible configurations of the parent set
  • If we restrict the number of parents to m, this is polynomial in p
  • If we allow edges only between time points, we don’t need to check for the DAG constraint
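The factorization can be checked numerically on a toy example (made-up prior matrix B, p = 3 candidate regulators per target, all parent sets allowed): the product of per-node sums equals the brute-force sum over all graphs.

```python
import itertools
import math

p = 3
beta = 1.0
B = [[0.9, 0.5, 0.2],
     [0.5, 0.7, 0.5],
     [0.1, 0.5, 0.8]]   # B[i][j]: prior belief in edge i -> j

def local_energy(j, parents):
    """E_j for target j given its parent set (a subset of range(p))."""
    return sum(abs(B[i][j] - (1 if i in parents else 0)) for i in range(p))

# Brute force: sum over all 2^(p*p) adjacency matrices
Z_brute = 0.0
for bits in itertools.product([0, 1], repeat=p * p):
    G = [bits[i * p:(i + 1) * p] for i in range(p)]
    E = sum(abs(B[i][j] - G[i][j]) for i in range(p) for j in range(p))
    Z_brute += math.exp(-beta * E)

# Factorized computation: one independent sum over parent sets per target
Z_fact = 1.0
for j in range(p):
    Z_fact *= sum(
        math.exp(-beta * local_energy(j, parents))
        for r in range(p + 1)
        for parents in itertools.combinations(range(p), r)
    )

print(Z_brute, Z_fact)  # the two computations agree
```

The brute-force sum touches 2^(p·p) graphs, while the factorized form touches only p × 2^p parent sets (polynomial in p once the parent-set size is capped), which is why this class of priors is tractable.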

29 of 72

Plan for today

  • Overview of integrative network inference
  • Defining priors on graph structure
  • Learning Bayesian networks with priors using Markov Chain Monte Carlo
  • Applications of Bayesian networks with priors
    • Inferring the yeast cell cycle network
    • Inferring cancer signaling

30 of 72

Markov Chain Monte Carlo (MCMC) sampling

  • We have looked at a greedy hill climbing algorithm to learn the structure of the graph
  • MCMC provides an alternative (non-greedy) way of finding the graph structure
  • The idea is to estimate the distribution, P(G|D), and draw “samples” of G from this distribution
  • MCMC is a general strategy to sample from a complex distribution
    • If we can sample from the distribution, we can also estimate specific properties of the distribution

31 of 72

MCMC for learning a graph structure

  • Recall the Bayesian framework for learning Bayesian networks
  • We wish to estimate P(G|D) and draw multiple G’s from this distribution
    • But this distribution is difficult to estimate directly
  • We will devise a Markov Chain such that its stationary distribution will be equal to P(G|D)
  • We will then use the Markov Chain to also draw potential G’s

32 of 72

Markov chain

  • A Markov chain is a probabilistic model for sequential observations where there is a dependency between the current and the previous state
  • It is defined by a graph of possible states and a transition probability matrix defining transitions between each pair of states
  • The states correspond to the possible assignments a variable can be in
  • One can think of a Markov chain as doing a random walk on a graph with nodes corresponding to each state

33 of 72

A very simple Markov chain

  • Suppose we have a time series measurement of a gene’s expression level
  • Let the gene’s expression be discretized to take three values: HIGH, MEDIUM, LOW
  • Let Xt denote the expression state of the gene at time t
  • The temporal nature of this data suggests Xt+1 depends on Xt
  • We can model the time series of gene expression states using a Markov chain

34 of 72

A very simple Markov chain

[Figure: a three-state Markov chain over the expression states high, medium, and low; arrows between states are labeled with transition probabilities (values such as 0.6 and 0.7 for self-transitions and 0.1–0.3 for transitions between states).]

P(Xt+1=high | Xt=low) = 0.1

We will use T(Xt+1 | Xt) to denote the transition probabilities; these define the transition probability matrix of the chain

35 of 72

Markov Chain and Stationary distributions

  • The stationary distribution is a fundamental property of a Markov chain
  • Stationary distribution of a Markov Chain specifies the probability of being in a state independent of the starting position
  • A Markov chain has a unique stationary distribution if it is:
    • Irreducible: every state can be reached from every other state with non-zero probability
    • Aperiodic: the chain does not cycle among states with a fixed period (non-zero self-transition probabilities suffice)
  • Not all Markov chains have a stationary distribution

36 of 72

Stationary distribution of a Markov chain

  • Given a Markov chain with transition probabilities T(Xi | Xk)
  • We define the probability distribution over states at step n+1 as:

    Pn+1(Xi) = Σk T(Xi | Xk) Pn(Xk)

  • As n tends to infinity, Pn converges to the stationary distribution (when one exists)
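This iteration can be sketched for a hypothetical 3-state chain over (high, medium, low); the transition matrix below is illustrative, not the one from the figure:

```python
# Rows index the from-state, columns the to-state; each row sums to 1.
T = [[0.6, 0.3, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.2, 0.7]]
states = ["high", "medium", "low"]

def step(P):
    """One application of P_{n+1}(x_i) = sum_k T(x_i | x_k) P_n(x_k)."""
    return [sum(T[k][i] * P[k] for k in range(3)) for i in range(3)]

P = [1.0, 0.0, 0.0]          # start entirely in "high"
for _ in range(200):
    P = step(P)

pi = P                        # (approximate) stationary distribution
print(dict(zip(states, pi)))
print(max(abs(a - b) for a, b in zip(pi, step(pi))))  # ~0: pi is stationary
```

Because this chain is irreducible and aperiodic, the same limit is reached from any starting distribution, which is what makes the starting graph irrelevant in MCMC after convergence.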

37 of 72

Markov Chains for Bayesian network structure learning

  • We need to devise a Markov chain over the space of possible graphs such that the stationary distribution of this Markov chain is P(Gi|D)

  • Let Gi denote a graph at step i and let Gk denote the graph at previous step k

  • We need to define the transition probability of going from Gk to Gi

38 of 72

How do we make sure we will draw from the right distribution?

  • That is, when the Markov chain has converged to its stationary distribution, how do we make sure that the stationary distribution is the posterior distribution?
  • If the transition probabilities satisfy a condition called “detailed balance”,

    P(Gi | D) T(Gk | Gi) = P(Gk | D) T(Gi | Gk)  for all pairs of graphs,

    then the stationary distribution of the chain is exactly the posterior P(G | D)

39 of 72

Markov Chains for Bayesian network structure learning

  • In practice, for us to set up a Markov chain for Bayesian network search, we need to propose a new structure, and accept it with some probability
  • Let Q(Gi|Gk) denote the proposal probability
    • This is dependent upon the local graph moves we allow
  • Let A(Gi|Gk) denote the acceptance probability:
    • This is designed in a way to make the jump to Gi proportional to how well Gi describes the data
  • The transition probability is T(Gi|Gk) = Q(Gi|Gk)A(Gi|Gk)
  • We will keep running the propose and accept steps of our chain until convergence

40 of 72

Acceptance probability

  • The acceptance probability is defined as

    A(Gi | Gk) = min { 1, [P(D | Gi) P(Gi) Q(Gk | Gi)] / [P(D | Gk) P(Gk) Q(Gi | Gk)] }

  • If the proposal distribution is symmetric, the Q terms cancel and the above simplifies to a ratio of posteriors (this is not the case for DAGs)

41 of 72

Metropolis Hastings algorithm

  • Start from an initial structure G0
  • Iterate for n = 1..N
    • Propose a new structure Gn from Gn−1 using Q(Gn | Gn−1)
    • Accept Gn with probability A(Gn | Gn−1); otherwise keep Gn−1
  • Discard an initial “burn-in” period to make sure the Markov chain has reached its stationary distribution
  • Using the remaining samples, estimate different features of the graph, or aggregate the different graphs
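The algorithm can be sketched on a toy target distribution: a Gibbs prior over 2×2 DBN-style adjacency matrices (edges only go forward in time, so every matrix is valid and single-edge flips form a symmetric proposal, making the Q terms cancel). The prior matrix B and β are made up; the MCMC edge frequencies are compared against exact enumeration:

```python
import itertools
import math
import random

B = [[0.9, 0.5], [0.1, 0.5]]   # hypothetical prior confidences per edge
beta = 1.5
random.seed(0)

def energy(G):
    return sum(abs(B[i][j] - G[i][j]) for i in range(2) for j in range(2))

def mh_sample(n_steps=200_000):
    """Metropolis-Hastings over graphs; returns empirical edge frequencies."""
    G = [[0, 0], [0, 0]]
    counts = [[0, 0], [0, 0]]
    for _ in range(n_steps):
        i, j = random.randrange(2), random.randrange(2)
        Gp = [row[:] for row in G]
        Gp[i][j] = 1 - Gp[i][j]          # symmetric single-edge flip proposal
        # Acceptance: proposal terms cancel, leaving min(1, exp(-beta * dE))
        if random.random() < min(1.0, math.exp(-beta * (energy(Gp) - energy(G)))):
            G = Gp
        for a in range(2):
            for b in range(2):
                counts[a][b] += G[a][b]
    return [[c / n_steps for c in row] for row in counts]

def exact_edge_probs():
    """Exact edge probabilities by brute-force enumeration (16 graphs)."""
    Z, probs = 0.0, [[0.0, 0.0], [0.0, 0.0]]
    for bits in itertools.product([0, 1], repeat=4):
        G = [[bits[0], bits[1]], [bits[2], bits[3]]]
        w = math.exp(-beta * energy(G))
        Z += w
        for a in range(2):
            for b in range(2):
                probs[a][b] += w * G[a][b]
    return [[x / Z for x in row] for row in probs]

print(mh_sample())
print(exact_edge_probs())
```

The sampler never needs the partition function Z, only the ratio of unnormalized probabilities, which is exactly why MCMC is attractive for distributions over graphs.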

42 of 72

Elementary proposal moves for DAGs

The proposal distribution is defined by the moves on the graph. The above example shows a scenario where we have two valid configurations, and a third invalid configuration.

Husmeier, 2005

43 of 72

Defining a proposal distribution from elementary moves

Notice that the neighborhoods of the two DAGs are not of the same size, so the proposal distribution is not symmetric

44 of 72

MCMC example

Husmeier 2005

45 of 72

A heuristic to check for MCMC convergence

46 of 72

MCMC for learning a graph prior and structure

  • Recall that our prior distribution over graphs has a hyperparameter β
  • Ideally, we would like to search over the space of priors and structures, that is, sample from P(G, β | D)

  • The proposal distribution and the acceptance probabilities need to be updated

47 of 72

MCMC over graph structure and parameters

  • We need two proposal distributions, one for the graph structure and one for the hyperparameter
  • Proposing new graph structures: Q(Gi | Gk), defined by the local graph moves as before
  • Proposing a new hyperparameter: R(βi | βk), e.g., sampling a new value from a small interval around the current one
  • Accordingly, we need to define new acceptance probabilities

48 of 72

MCMC over graph structure and hyperparameter

  • This proceeds in a similar way as before
  • We start with an initial configuration (G0, β0)
  • Repeat for n = 1..N steps
    • Given the current value of the hyperparameter β, propose Gn from Gn−1 and accept with probability

      A(Gn | Gn−1) = min { 1, [P(D | Gn) P(Gn | β) Q(Gn−1 | Gn)] / [P(D | Gn−1) P(Gn−1 | β) Q(Gn | Gn−1)] }

    • Given the current Gn, propose a new hyperparameter βn from βn−1 and accept with probability

      A(βn | βn−1) = min { 1, [P(Gn | βn) R(βn−1 | βn)] / [P(Gn | βn−1) R(βn | βn−1)] }

49 of 72

Plan for today

  • RECAP on Bayesian networks and priors
    • Detailed balance
    • Markov chain Monte Carlo
  • Applications of Bayesian networks with prior on expression data
    • Inferring the yeast cell cycle network
    • Inferring cancer signaling

50 of 72

Performance on real data

  • Two settings
    • Yeast cell cycle time series expression data
      • Two time course datasets were available
      • Two prior networks
    • RAF signaling pathway
      • One non-time course data
      • One prior network
  • Questions asked
    • Can different prior networks be distinguished?
    • Does the prior improve network inference?
    • Are the hyperparameters estimated accurately?

51 of 72

Inferred hyperparameters for the yeast cell cycle

The two prior networks are very similar

Red and blue show the trajectory of the hyperparameter values during the MCMC

Posterior probability of the hyperparameters: close to 0.

52 of 72

Using a slightly different prior

  • Use one of the expression datasets to learn a graph
  • Use this graph as one prior and combine with one of the other two binding network priors

53 of 72

Prior hyperparameters can be distinguished

Prior that is consistent with the data

54 of 72

Conclusions from the Yeast cell cycle study

  • None of the TF binding priors appear consistent with the data

  • More consistency is observed if a prior network is obtained from an expression dataset

55 of 72

Assessing on a well-studied gold standard network: Raf signaling pathway

11 phospho-proteins in all.

56 of 72

Results on RAF signaling

  • The data are not time course
  • However, the partition function computation provides a “tight” upper bound and can still be used
  • Compare against
    • Prior alone
    • Data alone

57 of 72

Prior helps!

58 of 72

Method can discriminate between true and random prior

59 of 72

Plan for today

  • RECAP on Bayesian networks and priors
    • Detailed balance
    • Markov chain Monte Carlo
  • Applications of Bayesian networks with prior on expression data
    • Inferring the yeast cell cycle network
    • Inferring cancer signaling

60 of 72

Bayesian Inference of Signaling Network Topology in a Cancer Cell Line (Hill et al 2012)

  • Protein signaling networks are important for many cellular diseases
    • The networks can differ between normal and disease cell types
  • But the structure of the network remains incomplete
  • Temporal activity of interesting proteins can be measured and used to infer the network structure
  • Build on prior knowledge of signaling networks to learn a better, predictive network

61 of 72

Applying DBNs to infer signaling network topology

Hill et al., Bioinformatics 2012

62 of 72

Application of DBNs to signaling networks

  • Dataset description
    • Phospho-protein levels of 20 proteins
    • Eight time points
    • Four growth conditions
  • Use known signaling network as a graph prior
  • Estimate CPDs as conditional regularized Gaussians
  • Assume a first-order Markov model
    • Xt depends on Xt−1

63 of 72

Integrating prior signaling network into the DBN

  • A Bayesian approach to graph learning: P(G | D) ∝ P(D | G) P(G) (data likelihood × graph prior)

  • The graph prior is encoded as P(G) ∝ exp(λ f(G)), where λ is the prior strength and f(G) is a feature of the graph

  • Here f(G) = −|E(G) \ E*|: the negative of the number of edges in the graph G, E(G), that are not in the prior edge set E*

  • This prior does not promote new edges, but penalizes edges that are not in the prior

Prior follows Mukherjee & Speed 2008
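A sketch of this prior; the protein names, edge sets, and λ value are illustrative, not taken from the paper:

```python
import math

def f(graph_edges, prior_edges):
    """f(G) = -|E(G) \\ E*|: negative count of edges of G not in the prior."""
    return -len(set(graph_edges) - set(prior_edges))

def unnormalized_prior(graph_edges, prior_edges, lam):
    """P(G) up to the normalizing constant: exp(lam * f(G))."""
    return math.exp(lam * f(graph_edges, prior_edges))

E_star = {("RAF", "MEK"), ("MEK", "MAPK")}                 # known edges E*
G1 = [("RAF", "MEK"), ("MEK", "MAPK")]                     # fully within E*
G2 = [("RAF", "MEK"), ("MEK", "MAPK"), ("AKT", "MEK")]     # one novel edge

print(unnormalized_prior(G1, E_star, lam=2.0))  # no penalty
print(unnormalized_prior(G2, E_star, lam=2.0))  # penalized by exp(-lam) per novel edge
```

Note the asymmetry: graphs that omit prior edges pay no cost under f, while each edge outside E* multiplies the prior by exp(−λ), so novel edges must earn their place through the likelihood.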

64 of 72

Calculating posterior probabilities of edges

  • For each edge e, we need to calculate the posterior probability

    P(e | D) = ΣG 1[e ∈ G] P(G | D)

  • This is intractable in general; this work makes some assumptions
    • Allow edges only forward in time
    • Assume P(G) factorizes over individual edges
    • Assume a node can have no more than dmax parents
    • Grid search over the hyper-parameters
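Under these assumptions (edges only forward in time, per-edge factorized prior, at most dmax parents), the edge posterior reduces to sums over bounded parent sets for each target. A toy sketch, with hypothetical parent-set scores standing in for P(D | G) P(G):

```python
import itertools

regulators = ["A", "B", "C"]
target = "Y"
dmax = 2
# Made-up score for each candidate parent set of the target
score = {(): 1.0, ("A",): 4.0, ("B",): 2.0, ("C",): 0.5,
         ("A", "B"): 6.0, ("A", "C"): 1.5, ("B", "C"): 1.0}

parent_sets = [ps for r in range(dmax + 1)
               for ps in itertools.combinations(regulators, r)]
Z = sum(score[ps] for ps in parent_sets)   # local normalizer for this target

def edge_posterior(reg):
    """P(reg -> target | D): total score of parent sets containing reg."""
    return sum(score[ps] for ps in parent_sets if reg in ps) / Z

for r in regulators:
    print(r, "->", target, edge_posterior(r))
```

Because the computation decomposes per target node, the sum is over at most Σ_{r≤dmax} C(p, r) parent sets rather than over all graphs.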

65 of 72

Inferred signaling network using a DBN

Results are not sensitive to prior values

Inferred signaling network

Collapsed network

DBN

66 of 72

Using the DBN to make predictions

  • Although many edges were expected, several edges were unexpected
  • Select novel edges based on posterior probability and test them based on inhibitors
  • For example, if an edge was observed from X to Y, inhibition of X should affect the value of Y if X is a causal regulator of Y
  • Example edge tested
    • MAPKp to STAT3p(S727) with high probability (0.98)
      • Apply MEKi, an inhibitor of MEK (the kinase that activates MAPK), and measure MAPKp and STAT3p post-inhibition
    • AKTp to p70S6Kp, AKTp to MEKp and AKTp to cJUNp

67 of 72

Experimental validation of links

Add MAPK inhibitor and measure MAPK and STAT3

Their success is measured by the difference in the levels of the targets as a function of the levels of the inhibitors

MAPK is significantly inhibited (P-value 5×10⁻⁴)

STAT3 is also inhibited (P-value 3.3×10⁻⁴)

68 of 72

Concluding remarks

  • Prior knowledge in a Bayesian network can be incorporated as energy functions and used to define a prior distribution
    • Extensible to multiple priors
  • Two ways to estimate graph priors
    • Markov Chain Monte Carlo (MCMC)
      • Enables us to search over the graph and hyperparameter space
      • Can distinguish between good and bad (inconsistent) priors
    • Empirical Bayes
  • Adding prior helped network structure learning for a small gold-standard network

69 of 72

References

  • Introduction to Learning Bayesian Networks from Data. Dirk Husmeier 2005
  • Reconstructing Gene Regulatory Networks with Bayesian Networks by Combining Expression Data with Multiple Sources of Prior Knowledge. Adriano V. Werhli and Dirk Husmeier 2007
  • Bayesian inference of signaling network topology in a cancer cell line. S. M. Hill, Y. Lu, J. Molina, L. M. Heiser, P. T. Spellman, T. P. Speed, J. W. Gray, G. B. Mills, and S. Mukherjee, Bioinformatics (Oxford, England), vol. 28, no. 21, pp. 2804-2810, Nov. 2012.

70 of 72

Supervised learning of interactions

Define: I12 = 1 if X1 interacts with Y2, and I12 = 0 otherwise. X1Y2.features: attributes of the pair (X1, Y2).

Given the probability of interaction, P(I12=1 | X1Y2.features), and the probability of no interaction, P(I12=0 | X1Y2.features), predict an edge between X1 and Y2 when the probability of interacting exceeds the probability of not interacting.

[Figure: a candidate pair (X1, Y2) is labeled I12 = 1 (interaction, “Yes”) or I12 = 0 (no interaction, “No”) by comparing the two probabilities.]

71 of 72

Supervised learning of interactions

A

B

C

D

E

F

….

Positive examples

….

G

H

I

J

K

L

Negative examples

FEATURE SET

Feature extraction

TRAINED CLASSIFIER

Training

E

A

G

L

?

?

Testing

E

A

G

L

Predicted edges

72 of 72

MouseNet: Supervised inference of a functional network

Guan et al., Plos computational biology 2008