1 of 37

ME 5990: Introduction to Machine Learning

Recurrent Models Discussion

2 of 37

Outline

  • Recurrent models
    • Markov Model
    • Hidden Markov Model
  • Recurrent Neural Network
  • Long Short-Term Memory

3 of 37

Markov Chain

Andrey Andreyevich Markov (1856–1922)

Russian Mathematician

Professor, Saint Petersburg State University

Thomas Bayes (1701–1761)

English statistician, philosopher and Presbyterian minister

One known mathematics publication

Bayes’ Theorem was presented after his death by Richard Price

Richard Price (1723-1791)

British moral philosopher, nonconformist preacher and mathematician

Formally described Bayes’ theorem; believed Bayes’ theorem helped prove the existence of God

Image from Wikipedia.org

4 of 37

Markov Chain 101: State Transition

  • Suppose the weather can be either rainy or sunny; the following diagram depicts the transition from one day to the next:
    • A - Rainy
    • B - Sunny

  • What does this diagram mean?
    • P(A|A) = 0.3
    • P(B|A) = 0.7
    • P(A|B) = 0.8
    • P(B|B) = 0.2

[Diagram: two-state transition graph between A (Rainy) and B (Sunny)]

Transition table (rows: previous state; columns: next state):

              A      B
    A        0.3    0.7
    B        0.8    0.2
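A minimal sketch of this transition table as a matrix, assuming NumPy; rows are the previous state and columns the next state, so each row sums to 1:

```python
import numpy as np

# Rows: previous state, columns: next state. Order: [A (Rainy), B (Sunny)].
P = np.array([[0.3, 0.7],   # P(A|A), P(B|A)
              [0.8, 0.2]])  # P(A|B), P(B|B)

# Each row is a probability distribution over the next state.
assert np.allclose(P.sum(axis=1), 1.0)

# If today is Rainy (A), the distribution over tomorrow's weather is row 0.
print(P[0])  # [0.3 0.7]
```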

5 of 37

Markov Chain: forward propagation

  • If we know day 1 is rainy, what is the chance that day 2 is rainy? What about day 3 and day 4?

  • State transition: A-A-B, or A-B-B
  • P(A|A) = 0.3
  • P(B|A) = 0.7

[Diagram: Day 1 → Day 2 → Day 3 tree starting from A (Rainy); Day 2 is A (Rainy) with probability 0.3 or B (Sunny) with probability 0.7; path products 0.3*0.7 and 0.7*0.2]

6 of 37

Markov Chain: forward propagation

  • What is the probability that the 3rd day is sunny (B), given day 1 is rainy (A)?

  • Possible state transitions: A-A-B or A-B-B
  • Recall P(A|A) = 0.3, P(B|A) = 0.7, P(B|B) = 0.2
  • P(x3=B, x2=A | x1=A) = P(B|A) * P(A|A) = 0.7 * 0.3 = 0.21
  • P(x3=B, x2=B | x1=A) = P(B|B) * P(B|A) = 0.2 * 0.7 = 0.14
  • P(x3=B | x1=A) = 0.21 + 0.14 = 0.35

[Diagram: Day 1 → Day 2 → Day 3 tree starting from A (Rainy)]

7 of 37

Markov Chain: forward propagation

  • What if we ask for the probability that the third day is rainy (A)?

  • Possible state transitions: A-A-A or A-B-A
  • Recall P(A|A) = 0.3, P(B|A) = 0.7, P(A|B) = 0.8
  • P(x3=A, x2=A | x1=A) = P(A|A) * P(A|A) = 0.3 * 0.3 = 0.09
  • P(x3=A, x2=B | x1=A) = P(A|B) * P(B|A) = 0.8 * 0.7 = 0.56
  • P(x3=A | x1=A) = 0.09 + 0.56 = 0.65

[Diagram: Day 1 → Day 2 → Day 3 tree starting from A (Rainy); path products 0.3*0.3 via A and 0.7*0.8 via B]
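These path sums are exactly what repeated multiplication by the transition matrix computes. A minimal sketch, assuming NumPy, that reproduces the 0.65 / 0.35 split:

```python
import numpy as np

P = np.array([[0.3, 0.7],   # from A (Rainy):  P(A|A), P(B|A)
              [0.8, 0.2]])  # from B (Sunny):  P(A|B), P(B|B)

day1 = np.array([1.0, 0.0])   # day 1 is Rainy (A) with certainty
day2 = day1 @ P               # [0.3, 0.7]
day3 = day2 @ P               # sums over both Day-2 paths automatically

print(day3)  # [0.65 0.35] -> P(x3 = A) = 0.65, P(x3 = B) = 0.35
```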

8 of 37

Hidden Markov model

  • What if we cannot directly observe whether it is raining, and can only observe an “emission”?

  • We know:

[Diagram: HMM graphical model with hidden states x0, x1, x2, inputs u0, u1, u2, and observations (emissions) z0, z1, z2]

Transition table P(next state | previous state x, input u):

                         Rain    Sunny
  x = Rain, u = Rain      0.8     0.2
  x = Rain, u = Sun       0.4     0.6
  x = Sun,  u = Rain      0.7     0.3
  x = Sun,  u = Sun       0.6     0.4

Emission table P(observation | state):

           Cat Sleep   Cat Outdoor
  Rain        0.8          0.2
  Sun         0.2          0.8

9 of 37

Hidden Markov Model

  • Question:
  • By day 2 we know:
    • day 0 was rainy: x0 = rain
    • the forecasts for day 0 and day 1: u0 = rain, u1 = rain
  • We also observed:
    • the cat sleeping on day 1 (z1 = sleep) and on day 2 (z2 = sleep)
  • What is the chance that day 2 (x2) is rainy or sunny, given all of this information?

10 of 37

Problem formulation

  • Goal: P(x2 | x0, u0, u1, z1, z2) — the probability of the day-2 state given everything we know
  • This can be computed recursively (a Bayes filter), alternating two steps:
    • Predict: bel'(xt) = Σ_{x_{t-1}} P(xt | x_{t-1}, u_{t-1}) · bel(x_{t-1})
    • Update: bel(xt) = η · P(zt | xt) · bel'(xt), where η normalizes the belief

11 of 37

Hidden Markov Model

Transition table P(next state | previous state x, input u):

                       X = R   X = S
  x = R, u = R          0.8     0.2
  x = R, u = S          0.4     0.6
  x = S, u = R          0.7     0.3
  x = S, u = S          0.6     0.4

Emission table P(observation | state):

          Sleep (L)   Out (O)
  Rain       0.8        0.2
  Sun        0.2        0.8

[Diagram: HMM graphical model with hidden states x0, x1, x2, inputs u0, u1, u2, and observations z0, z1, z2]
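A minimal sketch of the predict/update filter applied to the question above (x0 = rain, u0 = u1 = rain, z1 = z2 = sleep), assuming NumPy and assuming that input u_t drives the transition from x_t to x_{t+1}:

```python
import numpy as np

# Transition P(x_t | x_{t-1}, u): rows = previous x, cols = next x, order [Rain, Sun].
T = {
    "Rain": np.array([[0.8, 0.2],    # x_{t-1} = Rain, u = Rain
                      [0.7, 0.3]]),  # x_{t-1} = Sun,  u = Rain
    "Sun":  np.array([[0.4, 0.6],    # x_{t-1} = Rain, u = Sun
                      [0.6, 0.4]]),  # x_{t-1} = Sun,  u = Sun
}
# Emission P(z | x): rows = state [Rain, Sun], cols = observation [Sleep, Out].
E = np.array([[0.8, 0.2],
              [0.2, 0.8]])
Z = {"Sleep": 0, "Out": 1}

def filter_step(bel, u, z):
    """One Bayes-filter step: predict with input u, then update with observation z."""
    pred = bel @ T[u]          # prediction: sum over previous states
    post = pred * E[:, Z[z]]   # update: weight by observation likelihood
    return post / post.sum()   # normalize

bel = np.array([1.0, 0.0])                # day 0: known to be Rain
bel = filter_step(bel, "Rain", "Sleep")   # forecast u0 = rain, cat sleeps (z1)
bel = filter_step(bel, "Rain", "Sleep")   # forecast u1 = rain, cat sleeps (z2)
print(bel)  # approximately [0.94, 0.06]: Rain is far more likely on day 2
```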

12 of 37

Change of story for robotics

  • In robotics localization
    • u: command
    • x: position
    • z: sensor observation

[Diagram: HMM graphical model with hidden states x0, x1, x2, commands u0, u1, u2, and sensor observations z0, z1, z2]

13 of 37

Extended Kalman Filter

  • EKF is an application of HMM

 

Red: expected position from the command (open-loop); Blue: actual position of the robot; Green: estimated position after EKF; Green circle: EKF probability distribution; Cyan dots: landmarks used for observation.

https://www.researchgate.net/figure/The-Extended-Kalman-Filter-EKF-in-action-a-Demo-using-the-EKF-algotithm-for-robot_fig2_348640995

14 of 37

Recurrent Neural Network

  • In the Markov Model or HMM, the transitions:
    • Between states are modeled using classic probability distributions (tables)
    • From inputs to states are modeled using classic probability distributions
    • From states to output emissions are modeled using classic probability distributions
  • Can we use a neural network to represent these transitions instead? (see the sketch below)
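A minimal sketch of such a learned transition, a vanilla RNN cell written in NumPy; the weight names (W_xh, W_hh) and sizes are illustrative, not from the slides:

```python
import numpy as np

def rnn_cell(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent transition: the new hidden state depends on the previous
    hidden state and the current input, a learned, continuous analogue of the
    HMM transition table."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)           # initial hidden state
x = np.eye(input_dim)[0]           # a one-hot input
h = rnn_cell(x, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)
```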

15 of 37

Example: word generation

  • Character-level Language Model
  • Assume our world has only 4 characters: [h, e, l, o]
  • We want to train a model such that, when we type “h”, the model will suggest “hello”

16 of 37

Example: word generation

  • Let’s simply use fully connected (FC) layers in this case

17 of 37

Example: word generation

  • Output layer

18 of 37

Example: word generation

  • Forward Propagation
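A minimal sketch of this forward propagation for the 4-character example, assuming NumPy; the weights are random and untrained here, so the predicted characters are arbitrary until training:

```python
import numpy as np

chars = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(chars)}
V, H = 4, 3                                  # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))    # input -> hidden (FC)
W_hh = rng.normal(scale=0.1, size=(H, H))    # hidden -> hidden (recurrent FC)
W_hy = rng.normal(scale=0.1, size=(V, H))    # hidden -> output scores (FC)

def one_hot(c):
    v = np.zeros(V)
    v[char_to_idx[c]] = 1.0
    return v

h = np.zeros(H)
for c in "hell":                             # inputs; targets would be "ello"
    h = np.tanh(W_xh @ one_hot(c) + W_hh @ h)   # hidden-state update
    y = W_hy @ h                                # output layer: one score per char
    p = np.exp(y) / np.exp(y).sum()             # softmax over [h, e, l, o]
    print(c, "->", chars[int(np.argmax(p))])
```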

19 of 37

Example: word generation

  • You can add layers at any of the transitions
    • For instance, you can add an “embedding FC layer”

20 of 37

Example: word generation

  • Training
    • Forward through the entire sequence to compute the loss
    • Then backward through the entire sequence to compute the gradients

21 of 37

Example: word generation

  • Training in practice: truncated
    • Run forward and backward through chunks of the sequence instead of the whole sequence

22 of 37

Example: word generation

  • Training
    • Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
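A sketch of this truncated backpropagation through time, assuming PyTorch and a stand-in random token sequence; the hidden state is carried across chunks but detached, so gradients only flow within the current chunk:

```python
import torch
import torch.nn as nn

V, H = 4, 16                                   # vocabulary size, hidden size
rnn = nn.RNN(input_size=V, hidden_size=H, batch_first=True)
head = nn.Linear(H, V)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

seq = torch.randint(0, V, (1, 1000))           # stand-in token sequence
x = torch.nn.functional.one_hot(seq, V).float()

chunk = 25                                     # truncation length
h = torch.zeros(1, 1, H)                       # hidden state carried across chunks
for start in range(0, seq.size(1) - chunk, chunk):
    xb = x[:, start:start + chunk]             # inputs for this chunk
    yb = seq[:, start + 1:start + chunk + 1]   # next-token targets
    out, h = rnn(xb, h)                        # forward through this chunk only
    loss = loss_fn(head(out).reshape(-1, V), yb.reshape(-1))
    opt.zero_grad()
    loss.backward()                            # backprop through this chunk only
    opt.step()
    h = h.detach()                             # keep the state, drop its gradient history
```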

23 of 37

Example: word generation

  • Training in practice
    • Truncated backpropagation through time

24 of 37

Example: image captioning

25 of 37

Example: image captioning

  • Image captioning

26 of 37

Example: image captioning

  • Start with a pre-trained CNN model for image recognition
  • Remove the last few layers

27 of 37

Example: image captioning

  • The image’s feature vector (from the CNN) becomes the start “hidden” state
  • A start token is given as the first input

28 of 37

Example: image captioning

  • Generation ends when an end token is emitted
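A minimal sketch of that generation loop, assuming PyTorch; the tiny vocabulary, layer names, and sizes are illustrative, and the weights are untrained, so the printed caption is arbitrary:

```python
import torch
import torch.nn as nn

vocab = ["<START>", "<END>", "a", "cat", "sits", "outside"]   # stand-in vocabulary
V, H, F = len(vocab), 32, 512

feat_to_h = nn.Linear(F, H)      # CNN feature -> initial hidden state
embed = nn.Embedding(V, H)       # token -> embedding
cell = nn.RNNCell(H, H)          # recurrent transition
head = nn.Linear(H, V)           # hidden -> word scores

img_feat = torch.randn(1, F)     # stand-in for the pre-trained CNN's output
h = torch.tanh(feat_to_h(img_feat))

token = torch.tensor([vocab.index("<START>")])
caption = []
for _ in range(10):                      # cap the caption length
    h = cell(embed(token), h)            # one recurrent step
    token = head(h).argmax(dim=-1)       # greedy next-word choice
    word = vocab[token.item()]
    if word == "<END>":                  # stop at the end token
        break
    caption.append(word)
print(caption)
```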

29 of 37

RNN variation

  • You don’t have to have an input and an output at every time step

30 of 37

RNN variation

  • You can arrange RNN inputs and outputs in whatever way fits the task

31 of 37

Long Short-Term Memory

  • RNNs have short memories
  • Example: “I grew up in France… I speak fluent French”
  • It is unlikely a plain RNN will be able to emit “French” based on the input “France”, which appeared many words earlier

32 of 37

Long Short-Term Memory

  • LSTMs were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in follow-up work
  • LSTMs are explicitly designed to avoid the long-term dependency problem
  • Regular RNN (for comparison):

 

 

33 of 37

Long Short-Term Memory

  • LSTM has a C (cell) state and an H (hidden) state
  • The C state preserves long-term information
  • The H state carries short-term information and is responsible for emission (not shown)

34 of 37

LSTM

  • C state flow: limited modification
    • Information from the inputs is only added or multiplied in

35 of 37

LSTM

  • H state flow: the major transformation in each cell
  • Responsible for emission
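A minimal sketch of a standard LSTM cell in NumPy, showing both flows: the C state is only multiplied (forget gate) and added to (input gate), while the H state is the gated output used for emission; the single stacked weight matrix is an illustrative layout:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x, h_prev] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c_prev + i * g      # C state: only elementwise multiply and add
    h = o * np.tanh(c)          # H state: gated view of C, used for emission
    return h, c

rng = np.random.default_rng(0)
x_dim, h_dim = 4, 8
W = rng.normal(scale=0.1, size=(4 * h_dim, x_dim + h_dim))
b = np.zeros(4 * h_dim)

h, c = np.zeros(h_dim), np.zeros(h_dim)
h, c = lstm_cell(np.eye(x_dim)[0], h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```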

36 of 37

LSTM

  • Many variations

37 of 37

Summary

  • Recurrent networks are suitable for models that need temporal or sequential inputs/outputs
    • Language Model
    • Robotics Model
  • Such a model takes a sequence of inputs and produces a sequence of outputs (emissions)
  • The hidden layer represents “hidden states” whose physical meaning humans cannot necessarily interpret