1 of 69

Input Output HMMs for modeling network dynamics

Oct 12th and 14th, 2021

BMI 826-23 Computational Network Biology, Fall 2021

Anthony Gitter

https://compnetbiocourse.discovery.wisc.edu

The style of IOHMM is adapted from Prof. Mark Craven’s lectures on HMMs

Original IOHMM slides created by Prof. Sushmita Roy

2 of 69

Goals for today

  • What are Hidden Markov Models (HMMs)?
    • How do they relate to Dynamic Bayesian Networks?
  • What are Input-Output HMMs (IOHMMs)?
  • EM algorithm for learning IOHMMs
  • Application of IOHMMs to examine regulatory network dynamics

3 of 69

Motivation

  • Suppose we are given time series expression profiles
  • We wish to find key regulators that are associated with changes in expression levels over time
  • We have seen a simple approach to do this
    • Activity subgraph/skeleton network-based approaches
  • Can we more explicitly take time into account?

4 of 69

DREM: Dynamic Regulatory Events Miner

Ernst et al., 2007, Mol Sys Biol

5 of 69

Recall Markov chains

  • A Markov chain is a probabilistic model for sequential observations where there is a dependency between the current and the previous state
  • It is defined by a graph of possible states and a transition probability matrix defining transitions between each pair of states
  • The states correspond to the possible values a variable can take
  • One can think of a Markov chain as doing a random walk on a graph with nodes corresponding to each state
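The random-walk view above can be sketched in a few lines of Python. Only P(Xt+1=high | Xt=low) = 0.1 comes from the three-state example on the next slide; the remaining transition probabilities below are assumed values for illustration.

```python
# A random walk on a 3-state Markov chain (states: high, medium, low).
# Only P(next=high | current=low) = 0.1 is from the example; the other
# entries of this transition matrix are assumed for illustration.
import random

states = ["high", "medium", "low"]
# trans[s] = distribution over next states; each row sums to 1
trans = {
    "high":   {"high": 0.6, "medium": 0.2, "low": 0.2},
    "medium": {"high": 0.1, "medium": 0.7, "low": 0.2},
    "low":    {"high": 0.1, "medium": 0.3, "low": 0.6},
}

def walk(start, steps, rng=random):
    """Sample a state sequence by repeatedly drawing the next state."""
    seq = [start]
    for _ in range(steps):
        weights = [trans[seq[-1]][s] for s in states]
        seq.append(rng.choices(states, weights=weights)[0])
    return seq

print(walk("high", 10))
```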

6 of 69

A three state Markov chain

[Figure: a three-state Markov chain over the states high, medium, and low, with a transition probability labeling each edge]

P(Xt+1=high|Xt=low)=0.1

These define the transition probabilities

7 of 69

Notation for a Markov chain

  • States are numbered from 1 to K
    • xt: observed character at position t
  • Observed sequence of characters: x = x1 x2 .. xT

  • Transition probabilities: akl = P(Xt+1 = l | Xt = k)

8 of 69

Hidden Markov Models

  • Hidden Markov models are also probabilistic models used to model sequential data about a dynamical system
  • At each time point the system is in a hidden state that depends upon the previous states (history)
  • The observation sequence is the output of a hidden state
  • HMMs are defined by observation models and transition models

Murphy 2000

9 of 69

What does an HMM do?

  • Enables us to model observed sequences of characters generated by a hidden dynamic system
  • The system can exist in a fixed number of “hidden” states
  • The system probabilistically transitions between states and at each state it emits a symbol/character

10 of 69

Defining an HMM

  • States
  • Emission alphabet
  • Parameters
    • State transition probabilities for probabilistic transitions from state at time t to state at time t+1
    • Emission probabilities for probabilistically emitting symbols from a state

11 of 69

Notation

  • States are numbered from 1 to K
  • Observed sequence: x = x1 x2 .. xT
  • Hidden state sequence or path: π = π1 π2 .. πT
  • Transition probabilities: akl = P(πt = l | πt-1 = k)

  • Emission probabilities: ek(b) = P(xt = b | πt = k), the probability of emitting symbol b from state k

12 of 69

An HMM for an occasionally dishonest casino

Fair die:   P(1) = P(2) = .. = P(6) = 1/6
Loaded die: P(1) = P(2) = .. = P(5) = 1/10, P(6) = 1/2

Transitions: Fair→Fair 0.95, Fair→Loaded 0.05, Loaded→Loaded 0.9, Loaded→Fair 0.1

What is hidden?

What is observed?

13 of 69

An HMM for an occasionally dishonest casino

(Same model as on the previous slide)

What is hidden? Which die is rolled (Fair or Loaded)

What is observed? The number (1-6) on the die
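The casino's generative process can be simulated directly from these probabilities. A sketch in Python; the start state is not specified on the slide, so starting with the fair die is an assumption.

```python
# Sampling hidden states and rolls from the dishonest casino HMM.
# Transition and emission probabilities are from the slide; starting
# with the fair die is an assumption (the start state is not given).
import random

trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": [1/6] * 6, "L": [0.1] * 5 + [0.5]}   # P(roll = 1..6 | die)

def roll_sequence(T, rng=random):
    """Sample T hidden states (which die) and T observed rolls (1-6)."""
    states, rolls = [], []
    s = "F"                                        # assumed start state
    for _ in range(T):
        states.append(s)
        rolls.append(rng.choices(range(1, 7), weights=emit[s])[0])
        s = rng.choices(["F", "L"], weights=[trans[s]["F"], trans[s]["L"]])[0]
    return states, rolls

states, rolls = roll_sequence(20)
print(states)
print(rolls)
```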

14 of 69

Goals for today

  • What are Hidden Markov Models (HMMs)?
    • How do they relate to Dynamic Bayesian Networks?
  • What are Input-Output HMMs (IOHMMs)?
  • Learning problems of IOHMMs
  • Application of IOHMMs to examine regulatory network dynamics

15 of 69

Recall a DBN for p variables and T time points

[Figure: the DBN unrolled over time. Each time point t = 1, 2, .., T has p variables X1, X2, .., Xp; edges connect variables at time t-1 to variables at time t, and a separate dependency structure is specified at the first time point. X2 denotes the set of variables at time t=2.]

16 of 69

HMM represented as a DBN

DBN

  • A DBN could be used to represent the transition probabilities more compactly.

  • For example, consider the state variable to be D-dimensional, where each dimension has K possible values.
    • For example, we are tracking D objects and each object can have K possible settings
    • The state variable can have K^D possible values

  • An HMM will attempt to model the transition probabilities between all state combinations.

  • In other words, the DBN will look fully connected.

Kevin Murphy, 2000
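The blow-up above is easy to quantify. A back-of-the-envelope computation, with assumed values K = 3 and D = 10 and an assumed in-degree of 3 per variable in the factored DBN:

```python
# Size of the state space when one HMM "super-state" packs D variables,
# each with K values. K, D, and the in-degree of 3 are assumed values.
K, D = 3, 10
joint_states = K ** D                 # number of super-states
joint_table = joint_states ** 2       # entries in the full transition matrix
# Factored DBN: each variable depends on 3 variables at time t-1,
# so each CPT has K^3 parent configurations times K values.
factored = D * (K ** 3) * K
print(joint_table, factored)
```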

17 of 69

DBN version of the occasional dishonest casino

Transition probabilities:

        F     L
  F    0.95  0.05
  L    0.1   0.9

Emission probabilities:

        1     2     3     4     5     6
  F    1/6   1/6   1/6   1/6   1/6   1/6
  L    0.1   0.1   0.1   0.1   0.1   0.5

18 of 69

Three important questions in HMMs

  • What is the probability of a sequence from an HMM?
    • Forward algorithm
  • What is the most likely sequence of states for generating a sequence of observations
    • Viterbi algorithm
  • How can we learn an HMM from a set of sequences?
    • Forward-backward or Baum-Welch (an EM algorithm)

19 of 69

Computing the probability of a sequence from an HMM

P(x1..xT, π1..πT) = P(π1) eπ1(x1) ∏t=2..T aπt-1,πt eπt(xt)

  eπt(xt): emitting symbol xt
  aπt-1,πt: state transition between consecutive time points
20 of 69

Computing the probability of a sequence from an HMM

  • But we don’t know what the sequence of states (path) is
  • So we need to sum over all paths
  • The probability over all paths is:

      P(x1..xT) = Σπ P(x1..xT, π1..πT)    (sum over all paths π)

  • The forward algorithm gives an efficient way to compute this probability
  • It is based on the concept of dynamic programming
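The sum over paths and the forward recursion can be compared directly on the casino HMM: enumerating all 2^T paths explicitly gives the same probability as the O(T K^2) dynamic program. A sketch; the uniform start distribution and the example roll sequence are assumptions.

```python
# P(x) for the casino HMM two ways: brute-force sum over all 2^T paths,
# and the forward recursion. The uniform start distribution and the
# observed rolls are assumed for illustration.
from itertools import product

init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": [1/6] * 6, "L": [0.1] * 5 + [0.5]}

x = [6, 6, 1, 6]   # observed rolls (assumed example)

def brute_force(x):
    """Sum P(x, path) over every possible hidden state path."""
    total = 0.0
    for path in product("FL", repeat=len(x)):
        p = init[path[0]] * emit[path[0]][x[0] - 1]
        for t in range(1, len(x)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][x[t] - 1]
        total += p
    return total

def forward(x):
    """Dynamic program: f[k] = P(x_1..x_t, state_t = k)."""
    f = {k: init[k] * emit[k][x[0] - 1] for k in "FL"}
    for t in range(1, len(x)):
        f = {l: emit[l][x[t] - 1] * sum(f[k] * trans[k][l] for k in "FL")
             for l in "FL"}
    return sum(f.values())

print(brute_force(x), forward(x))
```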

21 of 69

Goals for today

  • What are Hidden Markov Models (HMMs)?
    • How do they relate to Dynamic Bayesian Networks?
  • What are Input-Output HMMs (IOHMMs)?
  • Learning problems of IOHMMs
  • Application of IOHMMs to examine regulatory network dynamics

22 of 69

Input output Hidden Markov Models (IOHMM)

  • As in the HMM we have
    • States, emissions and transitions
  • In addition we have a sequence of inputs
    • The transitions and emissions can depend on inputs (u1,..,uT)
  • In a way, IOHMMs map inputs to outputs
    • This is different from HMMs
  • HMMs aim to define P(x1..xT) while IOHMMs define P(x1..xT|u1..uT)

Bengio & Frasconi, IEEE Trans on Neural Networks 1996

23 of 69

Input output Hidden Markov Models (IOHMM)

Inputs

Outputs

Hidden states

Transitions and emissions are dependent upon a set of external stimuli

24 of 69

Formally defining an IOHMM

  • The set of K hidden states
  • Emission characters/symbols/values
  • Transition probabilities conditioned on the input
    • Unlike HMMs, where we had akl = P(πt = l | πt-1 = k)
    • Here we have P(πt = l | πt-1 = k, ut)
  • Similarly, the emission probabilities are conditioned on the input and the state: P(xt = b | πt = k, ut)

25 of 69

Three important questions in IOHMMs

  • What is the probability of a sequence from an IOHMM?
    • Forward algorithm
  • What is the most likely sequence of states for generating a sequence of observations
    • Viterbi algorithm
  • How can we learn an IOHMM from a set of sequences?
    • Forward-backward algorithm (an EM algorithm)

26 of 69

Computing the probability of a sequence from an IOHMM

P(x1..xT, π1..πT | u1..uT) = P(π1 | u1) eπ1(x1 | u1) ∏t=2..T P(πt | πt-1, ut) eπt(xt | ut)

  eπt(xt | ut): emitting symbol xt
  P(πt | πt-1, ut): state transition between consecutive time points

27 of 69

As in the case of HMMs

  • We would need to sum over the possible state configurations (paths):

      P(x1..xT | u1..uT) = Σπ P(x1..xT, π1..πT | u1..uT)    (sum over all paths π)

  • We will use the forward algorithm for this problem

28 of 69

How likely is a given sequence: Forward algorithm

  • Define fk(t) as the probability of observing x1..xt and ending in state k at time t, given inputs u1..ut:

      fk(t) = P(x1..xt, πt = k | u1..ut)

  • This can be written as follows

      fk(t) = ek(xt | ut) Σl fl(t-1) P(πt = k | πt-1 = l, ut)

29 of 69

Steps of the Forward algorithm

  • Initialization

      fk(1) = P(π1 = k | u1) ek(x1 | u1)

  • Recursion: for t=2 to T

      fk(t) = ek(xt | ut) Σl fl(t-1) P(πt = k | πt-1 = l, ut)

  • Termination

      P(x1..xT | u1..uT) = Σk fk(T)

30 of 69

Working through an example

  • Suppose we measured three reporter molecules whose values depend upon an input chemical stimulus and which of four possible hidden pathways is triggered.
  • Chemical stimulus: {0, 1}
  • Hidden Pathways: {A, B, C, D}
  • Reporter molecules: {r1, r2, r3}
  • Given a sequence of reporter molecule measurements, and chemical stimuli, we can infer different quantities associated with this sequence

31 of 69

Mapping to an IOHMM

Inputs

Outputs

Hidden states

We need to specify three CPTs

32 of 69

The CPTs that we will use

Initial state probabilities P(π1 | u1):

  u1      A     B     C     D
  0      0.5   0.5    0     0
  1      0.5   0.5    0     0

Transition probabilities P(πt | πt-1, ut):

  ut  πt-1     A     B     C     D
  0    A      0.6   0.2   0.2    0
  0    B      0.2   0.6    0    0.2
  0    C      0.1    0    0.8   0.1
  0    D       0    0.25  0.25  0.5
  1    A      0.8   0.1   0.1    0
  1    B      0.8   0.1    0    0.1
  1    C      0.1    0    0.8   0.1
  1    D       0    0.1   0.8   0.1

Emission probabilities P(xt | πt, ut):

  ut  πt      r1    r2    r3
  0    A     0.8   0.1   0.1
  0    B     0.2   0.6   0.2
  0    C     0.25  0.5   0.25
  0    D     0.2   0.2   0.6
  1    A     0.2   0.6   0.2
  1    B     0.25  0.5   0.25
  1    C     0.2   0.2   0.6
  1    D     0.5   0.25  0.25

Suppose we observed the following sequences

Input:  0  1  1  0
Output: r1 r1 r2 r3

How likely is this observation from our IOHMM?

33 of 69

Transition probabilities encode some independencies

The transition CPT P(πt | πt-1, ut) contains structural zeros, e.g. P(πt = D | πt-1 = A, ut = 0) = 0. These zero entries make certain transitions impossible, which restricts the state paths that can be reached.
34 of 69

Applying the forward algorithm

Input:  0  1  1  0
Output: r1 r1 r2 r3

Fill in fk(t) for each state k in {A, B, C, D} and each time point t = 1..4.

35 of 69

Working through some calculations: t=1

With Input: 0 1 1 0 and Output: r1 r1 r2 r3, using the CPTs from the earlier slide:

  fA(1) = P(π1 = A | u1 = 0) eA(r1 | u1 = 0) = 0.5 × 0.8 = 0.4
  fB(1) = P(π1 = B | u1 = 0) eB(r1 | u1 = 0) = 0.5 × 0.2 = 0.1
  fC(1) = 0 × 0.25 = 0
  fD(1) = 0 × 0.2  = 0

36 of 69

Working through some calculations: t=2

Continuing with u2 = 1 and x2 = r1, e.g. for state A (only A and B have nonzero forward values at t=1):

  fA(2) = eA(r1 | u2 = 1) [ fA(1) P(A | A, u2 = 1) + fB(1) P(A | B, u2 = 1) ]
        = 0.2 × (0.4 × 0.8 + 0.1 × 0.8) = 0.08

The remaining entries fB(2), fC(2), fD(2) are computed the same way, summing over all previous states.

Input:  0  1  1  0
Output: r1 r1 r2 r3

37 of 69

Result of applying the forward algorithm

Input:  0  1  1  0
Output: r1 r1 r2 r3

  fk(t)     t=1    t=2      t=3        t=4
  A         0.4    0.08     0.04488    0.0033861
  B         0.1    0.0125   0.033125   0.021860825
  C         0      0.008    0.00308    0.0028790625
  D         0      0.01     0.0007625  0.00438855
38 of 69

Learning an IOHMM from data

  • Given J paired sequences

  • Parameter estimation:
    • Learn the transition and emission probability distributions
    • This is very similar to what is done in HMMs
  • Structure learning:
    • Learn the number of states and the dependencies among the states
    • Because states are hidden variables and we do not know how many there are, this adds another level of complexity to learning
    • We will first assume that we know the number of states

39 of 69

The expectation maximization algorithm

  • Expectation Maximization (EM) is widely used when there are hidden variables
  • It is an iterative algorithm that maximizes the likelihood of the data
  • Each iteration is made up of two steps
    • Expectation step (E): estimate the expected values of hidden variables given the data and previous parameter settings
    • Maximization step (M): estimate the parameters using the expected counts

40 of 69

Learning without hidden information

  • Transition probabilities

      P(πt = l | πt-1 = k, ut = p) = nkl(p) / Σl' nkl'(p)

      nkl(p): number of transitions from state k to state l given input p

  • Emission probabilities

      ek(c | p) = mk(c, p) / Σc' mk(c', p)

      mk(c, p): number of times c is emitted from k given input p
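With fully observed state paths, both distributions reduce to normalized counts. A sketch; the toy sequences below are made up for illustration.

```python
# Maximum likelihood estimation of IOHMM parameters when state paths
# are fully observed: normalize transition and emission counts per
# input value. The toy sequences are made up for illustration.
from collections import Counter, defaultdict

def estimate(sequences):
    """Each sequence is (inputs u, states pi, outputs x), equal length."""
    t_counts, e_counts = Counter(), Counter()
    for u, pi, x in sequences:
        for t in range(len(x)):
            e_counts[(u[t], pi[t], x[t])] += 1           # emission given u[t]
            if t > 0:
                t_counts[(u[t], pi[t - 1], pi[t])] += 1  # transition given u[t]

    def normalize(counts):
        totals = defaultdict(float)
        for (u, k, _), n in counts.items():
            totals[(u, k)] += n
        return {key: n / totals[key[:2]] for key, n in counts.items()}

    return normalize(t_counts), normalize(e_counts)

seqs = [
    ([0, 1, 1], ["A", "B", "A"], ["r1", "r2", "r1"]),
    ([0, 0, 1], ["A", "A", "B"], ["r1", "r1", "r2"]),
    ([1, 1, 0], ["A", "A", "B"], ["r2", "r1", "r3"]),
]
trans, emit = estimate(seqs)
print(trans[(1, "A", "B")])   # n_AB(u=1) / sum_l n_Al(u=1)
```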

41 of 69

The expectation step

  • We need to know the probability of the symbol at t being produced by state i, given the entire observation and input sequence u1:T, x1:T
  • Given these we can compute our expected counts for state transitions, character emissions
  • We also need to know the probability of the observations at t and t+1 being produced by states i and j respectively, given the observation and input sequences

Bengio & Frasconi, IEEE Trans on Neural Networks 1996

42 of 69

Computing P(πt = k | x1..xT, u1..uT)

  • First we compute the probability of the entire observed sequence with the tth symbol being generated by state k:

      P(x1..xT, πt = k | u1..uT)

  • Then our quantity of interest is computed as

      P(πt = k | x1..xT, u1..uT) = P(x1..xT, πt = k | u1..uT) / P(x1..xT | u1..uT)

The denominator is obtained from the forward algorithm

43 of 69

Computing P(x1..xT, πt = k | u1..uT)

  • To compute

      P(x1..xT, πt = k | u1..uT) = fk(t) bk(t)

  • We need the forward and backward algorithms

Forward algorithm: fk(t) = P(x1..xt, πt = k | u1..ut)

Backward algorithm: bk(t) = P(xt+1..xT | πt = k, ut+1..uT)

44 of 69

Steps of the backward algorithm

  • Initialization (t=T)

      bk(T) = 1 for all k

  • Recursion (t=T-1 to 1)

      bk(t) = Σl P(πt+1 = l | πt = k, ut+1) el(xt+1 | ut+1) bl(t+1)

45 of 69

Trying out an example with backward algorithm

  • Again assume we have the same CPTs as those associated with the forward algorithm demo
  • Assume we observe the following

  • What are computations for the backward algorithm?

0 1 1

r1 r2 r2

Input:

Output:

46 of 69

Results from applying the backward algorithm

Input:  0  1  1
Output: r1 r2 r2

  bk(t)    t=1        t=2     t=3
  A        0.29665    0.55    1
  B        0.297625   0.555   1
  C        0.078075   0.245   1
  D        0.072825   0.235   1

(CPTs repeated from the forward algorithm example.)
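The backward recursion for this example can be sketched the same way as the forward pass, reusing the CPTs from the slides (the array encodings of states and symbols are my own choice):

```python
# Backward algorithm for the IOHMM example. CPTs are from the slides;
# index encodings (A-D -> 0-3, r1-r3 -> 0-2) are mine.
import numpy as np

# trans[u][k][l] = P(pi_{t+1} = l | pi_t = k, u_{t+1} = u)
trans = np.array([
    [[0.6, 0.2, 0.2, 0.0],
     [0.2, 0.6, 0.0, 0.2],
     [0.1, 0.0, 0.8, 0.1],
     [0.0, 0.25, 0.25, 0.5]],
    [[0.8, 0.1, 0.1, 0.0],
     [0.8, 0.1, 0.0, 0.1],
     [0.1, 0.0, 0.8, 0.1],
     [0.0, 0.1, 0.8, 0.1]],
])
# emit[u][k][b] = P(x_t = b | pi_t = k, u_t = u)
emit = np.array([
    [[0.8, 0.1, 0.1],
     [0.2, 0.6, 0.2],
     [0.25, 0.5, 0.25],
     [0.2, 0.2, 0.6]],
    [[0.2, 0.6, 0.2],
     [0.25, 0.5, 0.25],
     [0.2, 0.2, 0.6],
     [0.5, 0.25, 0.25]],
])

def backward(u, x):
    """b[t][k] = P(x_{t+1}..x_T | pi_t = k, u_{t+1}..u_T)."""
    b = np.ones((len(x), 4))                  # initialization: b_k(T) = 1
    for t in range(len(x) - 2, -1, -1):       # recursion, t = T-1 down to 1
        b[t] = trans[u[t + 1]] @ (emit[u[t + 1]][:, x[t + 1]] * b[t + 1])
    return b

u = [0, 1, 1]
x = [0, 1, 1]                                 # r1 r2 r2
b = backward(u, x)
print(b)                                      # rows: t = 1..3; cols: A, B, C, D
```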

47 of 69

Computing P(πt = k | x1..xT, u1..uT)

  • Using the forward and backward variables, this is computed as

      P(πt = k | x1..xT, u1..uT) = fk(t) bk(t) / Σl fl(t) bl(t)
48 of 69

Computing P(πt = k, πt+1 = l | x1..xT, u1..uT)

  • This is the probability of the symbols at t and t+1 being emitted from states k and l, given the entire observation sequence x1..xT and input sequence u1..uT:

      P(πt = k, πt+1 = l | x1..xT, u1..uT)
        = fk(t) P(πt+1 = l | πt = k, ut+1) el(xt+1 | ut+1) bl(t+1) / P(x1..xT | u1..uT)
49 of 69

Putting it all together

  • Assume we are given J training instances

  • Expectation step
    • Using current parameter values for each
      • Apply the forward and backward algorithms
      • Compute
        • expected number of transitions between all pairs of states
        • expected number of emissions for all states
  • Maximization step
    • Using current expected counts
      • Compute the transition and emission probabilities
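The E step quantities can be assembled from the forward and backward variables. A sketch on the earlier toy example; these posteriors are the usual γ (state) and ξ (transition) quantities, and the array encodings are my own.

```python
# E step of Baum-Welch for the toy IOHMM: state posteriors gamma[t][k]
# and transition posteriors xi[t][k][l]. CPTs are from the worked
# example; index encodings are mine.
import numpy as np

init = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0]])
trans = np.array([
    [[0.6, 0.2, 0.2, 0.0],
     [0.2, 0.6, 0.0, 0.2],
     [0.1, 0.0, 0.8, 0.1],
     [0.0, 0.25, 0.25, 0.5]],
    [[0.8, 0.1, 0.1, 0.0],
     [0.8, 0.1, 0.0, 0.1],
     [0.1, 0.0, 0.8, 0.1],
     [0.0, 0.1, 0.8, 0.1]],
])
emit = np.array([
    [[0.8, 0.1, 0.1],
     [0.2, 0.6, 0.2],
     [0.25, 0.5, 0.25],
     [0.2, 0.2, 0.6]],
    [[0.2, 0.6, 0.2],
     [0.25, 0.5, 0.25],
     [0.2, 0.2, 0.6],
     [0.5, 0.25, 0.25]],
])

def forward(u, x):
    f = np.zeros((len(x), 4))
    f[0] = init[u[0]] * emit[u[0]][:, x[0]]
    for t in range(1, len(x)):
        f[t] = emit[u[t]][:, x[t]] * (f[t - 1] @ trans[u[t]])
    return f

def backward(u, x):
    b = np.ones((len(x), 4))
    for t in range(len(x) - 2, -1, -1):
        b[t] = trans[u[t + 1]] @ (emit[u[t + 1]][:, x[t + 1]] * b[t + 1])
    return b

u, x = [0, 1, 1, 0], [0, 0, 1, 2]    # inputs 0 1 1 0, outputs r1 r1 r2 r3
f, b = forward(u, x), backward(u, x)
px = f[-1].sum()                     # P(x_1..x_T | u_1..u_T)
gamma = f * b / px                   # gamma[t][k] = P(pi_t = k | x, u)
xi = np.zeros((len(x) - 1, 4, 4))    # xi[t][k][l] = P(pi_t=k, pi_{t+1}=l | x, u)
for t in range(len(x) - 1):
    xi[t] = (f[t][:, None] * trans[u[t + 1]]
             * (emit[u[t + 1]][:, x[t + 1]] * b[t + 1])[None, :]) / px
print(gamma)
```

In the M step, the expected counts are obtained by summing γ and ξ over time points (and training sequences), grouped by input value, before normalizing.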

50 of 69

Baum-Welch one iteration

  • Let’s assume we have J=2 training instances

  • Each training example will contribute to the expected counts of transition and emission probabilities
  • Expectation step:
    • Compute the forward and backward variables for both training samples, for all time points

51 of 69

Baum-Welch one iteration M step

  • Suppose we are updating the transition probability from A to B given input u=1

Contribution from sample 1

Sample 2 will not contribute as there is no relevant configuration

52 of 69

Baum-Welch one iteration M step

  • Suppose we are updating the expected counts for observing r2 from state B given input u=1

Contribution from sample 1

Contribution from sample 2

53 of 69

Goals for today

  • What are Hidden Markov Models (HMMs)?
    • How do they relate to Dynamic Bayesian Networks?
  • What are Input-Output HMMs (IOHMMs)?
  • Learning problems of IOHMMs
  • Application of IOHMMs to examine regulatory network dynamics

54 of 69

Bifurcation events

  • Bifurcation events occur when sets of genes that have roughly the same expression level up to some time point diverge afterward


55 of 69

Dynamic Regulatory Events Miner (DREM)

  • Given
    • Gene expression time course
    • Static TF binding data or signaling networks
  • Do
    • Identify important regulators for interesting temporal changes

  • DREM is suited for short time courses
  • DREM is based on an Input-Output HMM

Ernst et al., 2007 Mol Sys Biol

56 of 69

DREM inputs and output

Bar-Joseph et al. 2012 Nat Rev Genet

[Figure: DREM inputs - static TF-gene binding interactions and a time series of gene expression (log2 fold change) measured at times t1, t2, .., tN]

57 of 69

DREM key idea

Ernst et al., 2007, Mol Sys Biol

58 of 69

IOHMM model in DREM

  • The output distributions were modeled as Gaussians
    • Enabled modeling continuous expression values
  • State transitions depended on static input and the current state
    • A binary classifier was trained in the M step for each state with two children, to discriminate between genes assigned to the bifurcating states

59 of 69

Defining the transition probability in DREM

  • DREM uses a binary classifier (logistic regression) to define transition probabilities
  • Assume we are in state h, which has two child states a and b

      P(πt+1 = a | πt = h, ui) = 1 / (1 + exp(-(w0 + Σj wj uij)))
      P(πt+1 = b | πt = h, ui) = 1 - P(πt+1 = a | πt = h, ui)

  • Retrain in the maximization step of Baum-Welch

ui: input associated with the ith gene, the collection of binding sites on gene i's promoter

w0, wj: state-specific parameters
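A minimal sketch of such a split classifier; the TF names, binding profile, and weights below are illustrative examples, not values from DREM:

```python
# DREM-style transition probability at a split: logistic regression
# over a gene's TF binding profile decides which child state the gene
# follows. TF names, binding vector, and weights are made up.
import numpy as np

def split_prob(binding, w, w0):
    """P(child a | parent h, binding profile) via logistic regression."""
    return 1.0 / (1.0 + np.exp(-(w0 + binding @ w)))

tfs = ["TF1", "TF2", "TF3"]               # hypothetical regulators
binding = np.array([1.0, 0.0, 1.0])       # gene i bound by TF1 and TF3
w = np.array([2.0, -1.0, 0.5])            # hypothetical state-specific weights
w0 = -1.0
p_a = split_prob(binding, w, w0)
print(p_a, 1 - p_a)                       # probabilities of the two children
```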

60 of 69

Structure learning in DREM

  • How do we find the tree structure?

  • Split genes into train and test
  • Initialize as linear chain
  • Iterate
    • Consider adding each possible child state, retrain
    • Consider removing each branch, retrain
    • Track best structure evaluated on test genes
  • Resplit genes and tune tree structure

61 of 69

Results

  • Application of DREM to yeast expression data
    • Amino acid (AA) starvation
    • One time point ChIP binding in AA starvation
  • Analysis of condition-specific binding
  • Application to multiple stress and normal conditions

62 of 69

DREM application in yeast amino acid starvation

DREM identified 11 paths and associated important AA-related TFs with each split

63 of 69

Does condition non-specific data help?

Yes, adding additional non-condition-specific data helped explain more splits or find more TFs per split

64 of 69

Validation of INO4 binding

  • INO4 was a novel prediction by the method
  • A small-scale experiment tested binding at 4 gene promoters after AA starvation
  • Genome-wide binding profiles of INO4 were measured in AA starvation and SCD to compare relative binding

65 of 69

Validation of INO4 binding

Genes associated with INO4 split profile

INO4 occupancy is much higher in AA starvation compared to normal (SCD)

More genes are bound genome-wide in AA starvation

Stronger binding in AA starvation of genes in this path

66 of 69

Does integration help?

  • Randomize the ChIP data and ask whether TFs enriched along paths are still identified
    • Fewer TFs were identified
  • Compare IOHMM vs HMM
    • Weaker enrichment of Gene Ontology processes in HMM paths compared to IOHMM paths

67 of 69

Take away points

  • Network dynamics can be defined in multiple ways
  • Skeleton network-based approaches
    • The universe of networks is fixed, nodes become on or off
    • Simple to implement, and does not need a lot of data
    • No assumption of how the network changes over time
    • No model of how the network changes over time
    • Requires the skeleton network to be complete

68 of 69

Take away points

  • Network dynamics can be defined in multiple ways
  • Dynamic Bayesian network
    • Can learn new edges
    • Describes how the system transitions from one state to another
    • Can incorporate prior knowledge
    • Assumes that the dependency between t-1 and t is the same for all time points (we saw how to relax this)
    • Requires a sufficient number of time points

69 of 69

Take away points

  • Network dynamics can be defined in multiple ways
  • IOHMMs (DREM approach)
    • Integrates static TF-DNA and dynamic gene expression responses
    • Works at the level of groups of genes
    • Does not assume TF expression controls TF activity
    • Focus on bifurcation points in the time course
    • Tree structure might be restrictive (although possible extensions are discussed)
    • Depends upon the completeness of the TF binding data