1 of 37

ME 5990: Introduction to Machine Learning

Recurrent Models Discussion

2 of 37

Outline

  • Recurrent models
    • Markov Model
    • Hidden Markov Model
  • Recurrent Neural Network
  • Long Short-Term Memory

3 of 37

Markov Chain

Andrey Andreyevich Markov (1856–1922)

Russian Mathematician

Professor, Saint Petersburg State University

Thomas Bayes (1701–1761)

English statistician, philosopher and Presbyterian minister

One known mathematics publication

Bayes’ Theorem was presented after his death by Richard Price

Richard Price (1723-1791)

British moral philosopher, nonconformist preacher and mathematician

Formally described Bayes’ theorem; believed Bayes’ theorem helped prove the existence of God

Image from Wikipedia.org

4 of 37

Markov Chain 101: State Transition

  • Suppose the weather can be either rainy or sunny; the following diagram depicts the transition from one day to the next:
    • A - Rainy
    • B - Sunny

  • What does this diagram mean?
    • P(A|A) = 0.3
    • P(B|A) = 0.7
    • P(A|B) = 0.8
    • P(B|B) = 0.2

[Diagram: two-state transition graph between A (Rainy) and B (Sunny)]

Transition table (rows: previous state; columns: next state):

              A      B
    A        0.3    0.7
    B        0.8    0.2
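A minimal sketch of this transition table as a matrix, assuming NumPy; rows are the previous state and columns the next state, so each row sums to 1:

```python
import numpy as np

# Rows: previous state, columns: next state. Order: [A (Rainy), B (Sunny)].
P = np.array([[0.3, 0.7],   # P(A|A), P(B|A)
              [0.8, 0.2]])  # P(A|B), P(B|B)

# Each row is a probability distribution over the next state.
assert np.allclose(P.sum(axis=1), 1.0)

# If today is Rainy (A), the distribution over tomorrow's weather is row 0.
print(P[0])  # [0.3 0.7]
```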

5 of 37

Markov Chain: forward propagation

  • If we know day 1 is rainy, what is the chance that day 2 is rainy? What about day 3 and day 4?

  • State transition: A-A-B, or A-B-B
  • P(A|A) = 0.3
  • P(B|A) = 0.7

[Diagram: Day 1 → Day 2 → Day 3 tree starting from A (Rainy); Day 2 is A (Rainy) with probability 0.3 or B (Sunny) with probability 0.7; path products 0.3*0.7 and 0.7*0.2]

6 of 37

Markov Chain: forward propagation

  • What is the probability that the 3rd day is sunny (B), given day 1 is rainy (A)?

  • Possible state transitions: A-A-B or A-B-B
  • Recall P(A|A) = 0.3, P(B|A) = 0.7, P(B|B) = 0.2
  • P(x3=B, x2=A | x1=A) = P(B|A) * P(A|A) = 0.7 * 0.3 = 0.21
  • P(x3=B, x2=B | x1=A) = P(B|B) * P(B|A) = 0.2 * 0.7 = 0.14
  • P(x3=B | x1=A) = 0.21 + 0.14 = 0.35

[Diagram: Day 1 → Day 2 → Day 3 tree starting from A (Rainy)]

7 of 37

Markov Chain: forward propagation

  • What if we ask for the probability that the third day is rainy (A)?

  • Possible state transitions: A-A-A or A-B-A
  • Recall P(A|A) = 0.3, P(B|A) = 0.7, P(A|B) = 0.8
  • P(x3=A, x2=A | x1=A) = P(A|A) * P(A|A) = 0.3 * 0.3 = 0.09
  • P(x3=A, x2=B | x1=A) = P(A|B) * P(B|A) = 0.8 * 0.7 = 0.56
  • P(x3=A | x1=A) = 0.09 + 0.56 = 0.65

[Diagram: Day 1 → Day 2 → Day 3 tree starting from A (Rainy); path products 0.3*0.3 via A and 0.7*0.8 via B]
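These path sums are exactly what repeated multiplication by the transition matrix computes. A minimal sketch, assuming NumPy, that reproduces the 0.65 / 0.35 split:

```python
import numpy as np

P = np.array([[0.3, 0.7],   # from A (Rainy):  P(A|A), P(B|A)
              [0.8, 0.2]])  # from B (Sunny):  P(A|B), P(B|B)

day1 = np.array([1.0, 0.0])   # day 1 is Rainy (A) with certainty
day2 = day1 @ P               # [0.3, 0.7]
day3 = day2 @ P               # sums over both Day-2 paths automatically

print(day3)  # [0.65 0.35] -> P(x3 = A) = 0.65, P(x3 = B) = 0.35
```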

8 of 37

Hidden Markov model

  • What if we cannot directly observe whether it is raining, and can only observe an “emission”?

  • We know:

[Diagram: HMM graphical model with hidden states x0, x1, x2, inputs u0, u1, u2, and observations (emissions) z0, z1, z2]

Transition table P(next state | previous state x, input u):

                         Rain    Sunny
  x = Rain, u = Rain      0.8     0.2
  x = Rain, u = Sun       0.4     0.6
  x = Sun,  u = Rain      0.7     0.3
  x = Sun,  u = Sun       0.6     0.4

Emission table P(observation | state):

           Cat Sleep   Cat Outdoor
  Rain        0.8          0.2
  Sun         0.2          0.8

9 of 37

Hidden Markov Model

  • Question:
  • By day 2 we know:
    • day 0 was rainy: x0 = rain
    • the forecasts for day 0 and day 1: u0 = rain, u1 = rain
  • We also observed:
    • the cat sleeping on day 1 (z1 = sleep) and on day 2 (z2 = sleep)
  • What is the chance that day 2 (x2) is rainy or sunny, given all of this information?

10 of 37

Problem formulation

  • Goal: P(x2 | x0, u0, u1, z1, z2) — the probability of the day-2 state given everything we know
  • This can be computed recursively (a Bayes filter), alternating two steps:
    • Predict: bel'(xt) = Σ_{x_{t-1}} P(xt | x_{t-1}, u_{t-1}) · bel(x_{t-1})
    • Update: bel(xt) = η · P(zt | xt) · bel'(xt), where η normalizes the belief

11 of 37

Hidden Markov Model

Transition table P(next state | previous state x, input u):

                       X = R   X = S
  x = R, u = R          0.8     0.2
  x = R, u = S          0.4     0.6
  x = S, u = R          0.7     0.3
  x = S, u = S          0.6     0.4

Emission table P(observation | state):

          Sleep (L)   Out (O)
  Rain       0.8        0.2
  Sun        0.2        0.8

[Diagram: HMM graphical model with hidden states x0, x1, x2, inputs u0, u1, u2, and observations z0, z1, z2]
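A minimal sketch of the predict/update filter applied to the question above (x0 = rain, u0 = u1 = rain, z1 = z2 = sleep), assuming NumPy and assuming that input u_t drives the transition from x_t to x_{t+1}:

```python
import numpy as np

# Transition P(x_t | x_{t-1}, u): rows = previous x, cols = next x, order [Rain, Sun].
T = {
    "Rain": np.array([[0.8, 0.2],    # x_{t-1} = Rain, u = Rain
                      [0.7, 0.3]]),  # x_{t-1} = Sun,  u = Rain
    "Sun":  np.array([[0.4, 0.6],    # x_{t-1} = Rain, u = Sun
                      [0.6, 0.4]]),  # x_{t-1} = Sun,  u = Sun
}
# Emission P(z | x): rows = state [Rain, Sun], cols = observation [Sleep, Out].
E = np.array([[0.8, 0.2],
              [0.2, 0.8]])
Z = {"Sleep": 0, "Out": 1}

def filter_step(bel, u, z):
    """One Bayes-filter step: predict with input u, then update with observation z."""
    pred = bel @ T[u]          # prediction: sum over previous states
    post = pred * E[:, Z[z]]   # update: weight by observation likelihood
    return post / post.sum()   # normalize

bel = np.array([1.0, 0.0])                # day 0: known to be Rain
bel = filter_step(bel, "Rain", "Sleep")   # forecast u0 = rain, cat sleeps (z1)
bel = filter_step(bel, "Rain", "Sleep")   # forecast u1 = rain, cat sleeps (z2)
print(bel)  # approximately [0.94, 0.06]: Rain is far more likely on day 2
```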

12 of 37

Change of story for robotics

  • In robotics localization
    • u: command
    • x: position
    • z: sensor observation

[Diagram: HMM graphical model with hidden states x0, x1, x2, commands u0, u1, u2, and sensor observations z0, z1, z2]

13 of 37

Extended Kalman Filter

  • EKF is an application of HMM

 

Red: expected position from the command (open-loop); Blue: actual position of the robot; Green: estimated position after EKF; Green circle: EKF probability distribution; Cyan dots: landmarks used for observation.

https://www.researchgate.net/figure/The-Extended-Kalman-Filter-EKF-in-action-a-Demo-using-the-EKF-algotithm-for-robot_fig2_348640995

14 of 37

Recurrent Neural Network

  • In the Markov Model or HMM, the transitions:
    • Between states are modeled using classic probability distributions (tables)
    • From inputs to states are modeled using classic probability distributions
    • From states to output emissions are modeled using classic probability distributions
  • Can we use a neural network to represent these transitions instead? (see the sketch below)
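A minimal sketch of such a learned transition, a vanilla RNN cell written in NumPy; the weight names (W_xh, W_hh) and sizes are illustrative, not from the slides:

```python
import numpy as np

def rnn_cell(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent transition: the new hidden state depends on the previous
    hidden state and the current input, a learned, continuous analogue of the
    HMM transition table."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)           # initial hidden state
x = np.eye(input_dim)[0]           # a one-hot input
h = rnn_cell(x, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)
```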

15 of 37

Example: word generation

  • Character-level Language Model
  • Assume our world has only 4 characters: [h, e, l, o]
  • We want to train a model such that, when we type “h”, the model will suggest “hello”

16 of 37

Example: word generation

  • Let’s simply use fully connected (FC) layers in this case

17 of 37

Example: word generation

  • Output layer

18 of 37

Example: word generation

  • Forward Propagation
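A minimal sketch of this forward propagation for the 4-character example, assuming NumPy; the weights are random and untrained here, so the predicted characters are arbitrary until training:

```python
import numpy as np

chars = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(chars)}
V, H = 4, 3                                  # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))    # input -> hidden (FC)
W_hh = rng.normal(scale=0.1, size=(H, H))    # hidden -> hidden (recurrent FC)
W_hy = rng.normal(scale=0.1, size=(V, H))    # hidden -> output scores (FC)

def one_hot(c):
    v = np.zeros(V)
    v[char_to_idx[c]] = 1.0
    return v

h = np.zeros(H)
for c in "hell":                             # inputs; targets would be "ello"
    h = np.tanh(W_xh @ one_hot(c) + W_hh @ h)   # hidden-state update
    y = W_hy @ h                                # output layer: one score per char
    p = np.exp(y) / np.exp(y).sum()             # softmax over [h, e, l, o]
    print(c, "->", chars[int(np.argmax(p))])
```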

19 of 37

Example: word generation

  • You can add layers at any of the transitions
    • For instance, you can add an “embedding FC layer”

20 of 37

Example: word generation

  • Training
    • Forward through the entire sequence to compute the loss
    • Then backward through the entire sequence to compute the gradients

21 of 37

Example: word generation

  • Training in practice: truncated
    • Run forward and backward through chunks of the sequence instead of the whole sequence

22 of 37

Example: word generation

  • Training
    • Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
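A sketch of this truncated backpropagation through time, assuming PyTorch and a stand-in random token sequence; the hidden state is carried across chunks but detached, so gradients only flow within the current chunk:

```python
import torch
import torch.nn as nn

V, H = 4, 16                                   # vocabulary size, hidden size
rnn = nn.RNN(input_size=V, hidden_size=H, batch_first=True)
head = nn.Linear(H, V)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

seq = torch.randint(0, V, (1, 1000))           # stand-in token sequence
x = torch.nn.functional.one_hot(seq, V).float()

chunk = 25                                     # truncation length
h = torch.zeros(1, 1, H)                       # hidden state carried across chunks
for start in range(0, seq.size(1) - chunk, chunk):
    xb = x[:, start:start + chunk]             # inputs for this chunk
    yb = seq[:, start + 1:start + chunk + 1]   # next-token targets
    out, h = rnn(xb, h)                        # forward through this chunk only
    loss = loss_fn(head(out).reshape(-1, V), yb.reshape(-1))
    opt.zero_grad()
    loss.backward()                            # backprop through this chunk only
    opt.step()
    h = h.detach()                             # keep the state, drop its gradient history
```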

23 of 37

Example: word generation

  • Training in practice
    • Truncated backpropagation through time

24 of 37

Example: image captioning

25 of 37

Example: image captioning

  • Image captioning

26 of 37

Example: image captioning

  • Start with a pre-trained CNN model for image recognition
  • Remove the last few layers

27 of 37

Example: image captioning

  • The image’s feature vector (from the CNN) becomes the start “hidden” state
  • A start token is given as the first input

28 of 37

Example: image captioning

  • Generation ends when an end token is emitted
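A minimal sketch of that generation loop, assuming PyTorch; the tiny vocabulary, layer names, and sizes are illustrative, and the weights are untrained, so the printed caption is arbitrary:

```python
import torch
import torch.nn as nn

vocab = ["<START>", "<END>", "a", "cat", "sits", "outside"]   # stand-in vocabulary
V, H, F = len(vocab), 32, 512

feat_to_h = nn.Linear(F, H)      # CNN feature -> initial hidden state
embed = nn.Embedding(V, H)       # token -> embedding
cell = nn.RNNCell(H, H)          # recurrent transition
head = nn.Linear(H, V)           # hidden -> word scores

img_feat = torch.randn(1, F)     # stand-in for the pre-trained CNN's output
h = torch.tanh(feat_to_h(img_feat))

token = torch.tensor([vocab.index("<START>")])
caption = []
for _ in range(10):                      # cap the caption length
    h = cell(embed(token), h)            # one recurrent step
    token = head(h).argmax(dim=-1)       # greedy next-word choice
    word = vocab[token.item()]
    if word == "<END>":                  # stop at the end token
        break
    caption.append(word)
print(caption)
```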

29 of 37

RNN variation

  • You don’t have to have an input and an output at every time step

30 of 37

RNN variation

  • You can arrange RNN inputs and outputs in whatever way fits the task

31 of 37

Long Short-Term Memory

  • RNNs have short memories
  • Example: “I grew up in France… I speak fluent French”
  • It is unlikely a plain RNN will be able to emit “French” based on the input “France”, which appeared many words earlier

32 of 37

Long Short-Term Memory

  • LSTMs were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in follow-up work
  • LSTMs are explicitly designed to avoid the long-term dependency problem
  • Regular RNN (for comparison):

 

 

33 of 37

Long Short-Term Memory

  • LSTM has a C (cell) state and an H (hidden) state
  • The C state preserves long-term information
  • The H state carries short-term information and is responsible for emission (not shown)

34 of 37

LSTM

  • C state flow: limited modification
    • Information from the inputs is only added or multiplied in

35 of 37

LSTM

  • H state flow: the major transformation in each cell
  • Responsible for emission
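A minimal sketch of a standard LSTM cell in NumPy, showing both flows: the C state is only multiplied (forget gate) and added to (input gate), while the H state is the gated output used for emission; the single stacked weight matrix is an illustrative layout:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x, h_prev] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g      # C state: only elementwise multiply and add
    h = o * np.tanh(c)          # H state: gated view of C, used for emission
    return h, c

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 8
W = rng.normal(scale=0.1, size=(4 * h_dim, x_dim + h_dim))
b = np.zeros(4 * h_dim)

h, c = np.zeros(h_dim), np.zeros(h_dim)
h, c = lstm_cell(np.eye(x_dim)[0], h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```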

36 of 37

LSTM

  • Many variations

37 of 37

Summary

  • Recurrent networks are suitable for models that need temporal or sequential inputs/outputs
    • Language Model
    • Robotics Model
  • Such a model takes a sequence of inputs and produces a sequence of outputs (emissions)
  • The hidden layer represents “hidden states” whose physical meaning humans cannot necessarily interpret