Lecture 4: Recurrent Neural Networks – Part 2
Sookyung Kim
sookim@ewha.ac.kr
Towards Modeling Longer Dependence
Gradient Flow Problem with Vanilla RNN
Let’s take a closer look inside the RNN cell. Backprop from ht to ht-1 multiplies by Whh.
[Figure: inside a vanilla RNN cell — ht-1 and xt are multiplied by Whh and Wxh, summed, and passed through tanh to give ht, from which the output yt is read.]
Gradient Flow Problem with Vanilla RNN
At each output yt, we compute the loss Lt, which backpropagates all the way back to the beginning.
[Figure: the RNN unrolled for three steps (x1, x2, x3 → h1, h2, h3 → y1, y2, y3) — every step reuses the same Whh and Wxh, so each loss Lt backpropagates through a chain of identical tanh/Whh blocks.]
Gradient Flow Problem with Vanilla RNN
Backpropagating through the unrolled chain multiplies one Jacobian per step:
∂hT/∂h1 = ∏(t=2…T) ∂ht/∂ht-1 = ∏(t=2…T) diag(tanh′(Whh ht-1 + Wxh xt)) · Whh
What does this formula mean?
tanh′ is almost always < 1 → vanishing gradients.
We iteratively multiply by the same matrix Whh: if its largest singular value is > 1, gradients explode; if it is < 1, gradients vanish.
→ Either way, gradients are hard to control.
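To see the singular-value argument numerically, here is a minimal NumPy sketch (not from the lecture; it uses an orthogonal Whh so that every singular value equals the chosen value, and it ignores the tanh′ factor, which would only shrink the gradient further):

import numpy as np

rng = np.random.default_rng(0)
H = 64    # hidden-state dimensionality (illustrative)
T = 50    # number of time steps the gradient flows back through

def grad_norm_after_T_steps(sigma):
    """Scale an orthogonal Whh so its largest singular value is `sigma`,
    then push a unit gradient vector back through T steps (one Whh^T per step)."""
    Q, _ = np.linalg.qr(rng.standard_normal((H, H)))  # random orthogonal matrix
    Whh = sigma * Q                                   # all singular values == sigma
    grad = rng.standard_normal(H)
    grad /= np.linalg.norm(grad)
    for _ in range(T):
        grad = Whh.T @ grad   # one factor of the repeated product above
    return np.linalg.norm(grad)

print(grad_norm_after_T_steps(0.9))   # ≈ 0.9**50 ≈ 0.005 -> vanishing
print(grad_norm_after_T_steps(1.1))   # ≈ 1.1**50 ≈ 117   -> exploding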
Notation
For simplicity, let’s draw a single “FC” (fully connected) box in place of the individual weight matrices.
[Figure: left, vanilla RNN with the old notation (Whh and Wxh applied to ht-1 and xt, summed, then tanh); right, the same cell with the new notation — ht-1 and xt feed one FC box followed by tanh, producing ht and yt.]
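In code, the FC box just concatenates ht-1 and xt and applies one affine map. A minimal NumPy sketch of a single vanilla-RNN step in this notation (all names and sizes below are illustrative, not from the slides):

import numpy as np

def rnn_step(h_prev, x, W, b):
    """One vanilla RNN step in the FC notation: the FC box is an
    affine map of the concatenated [h_{t-1}, x_t], followed by tanh."""
    z = np.concatenate([h_prev, x])   # input to the FC box
    return np.tanh(W @ z + b)         # h_t (y_t would be another FC on h_t)

H, D = 4, 3                                # hidden size, input size
rng = np.random.default_rng(0)
W = rng.standard_normal((H, H + D)) * 0.1  # plays the role of [Whh | Wxh]
b = np.zeros(H)
h = np.zeros(H)
for x in rng.standard_normal((5, D)):      # a length-5 input sequence
    h = rnn_step(h, x, W, b)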
Long Short Term Memory (LSTM)
Recall that the vanilla RNN suffers from vanishing gradients, because every backward pass goes through the FC layer (and a tanh).
[Figure: vanilla RNN cell in the new notation — xt and ht-1 enter an FC box, then tanh, producing ht and the output yt.]
Long Short Term Memory (LSTM)
To avoid this, we add a “highway” that detours around the FC layer, carried by a new set of hidden states called the cell state (ct). An additional tanh non-linearity is applied after the addition with ct-1.
[Figure: the cell-state highway — the FC+tanh output is added to ct-1 to give ct, which passes through another tanh to produce ht and yt.]
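As a sketch of this intermediate cell (not yet the full LSTM — the gates come next; names are illustrative), note that the additive highway lets the gradient reach ct-1 without passing through the FC layer:

import numpy as np

def cell_with_highway(h_prev, c_prev, x, W, b):
    """Intermediate cell from this slide: the FC+tanh output is *added* to
    c_{t-1} (the highway), and h_t is read out through another tanh."""
    z = np.concatenate([h_prev, x])
    c = c_prev + np.tanh(W @ z + b)   # additive path: d c_t / d c_{t-1} = I
    h = np.tanh(c)                    # h_t, also used for the output y_t
    return h, c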
Long Short Term Memory (LSTM)
Even though the cell state is meant for long-term memory, we still need a mechanism to control it. The forget gate is added for this purpose.
[Figure: adding the forget gate — a sigmoid (σ) over FC(ht-1, xt) multiplies ct-1 before the addition, deciding how much of the old cell state to keep.]
Long Short Term Memory (LSTM)
Similarly, on the input side we add the input gate to control how much of the new candidate flows into the cell state.
[Figure: forget and input gates — one σ gate scales ct-1 and another scales the tanh candidate from FC(ht-1, xt); the two are summed into ct.]
Long Short Term Memory (LSTM)
Lastly, we add the output gate to control how much of the cell state is written to the hidden state ht.
[Figure: the full LSTM cell — the forget gate scales ct-1, the input gate scales the tanh candidate, their sum is ct, and the output gate scales tanh(ct) to produce ht and yt.]
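Putting the three gates together, one LSTM step can be sketched in NumPy as follows (a separate weight matrix per gate, as in the diagram; real implementations such as Keras fuse them, but the arithmetic is the same; all names are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, Wf, Wi, Wo, Wg, bf, bi, bo, bg):
    """One LSTM step matching the diagram: forget, input, and output gates,
    plus the tanh candidate, all computed from [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)   # forget gate: how much of c_{t-1} to keep
    i = sigmoid(Wi @ z + bi)   # input gate: how much of the candidate to add
    o = sigmoid(Wo @ z + bo)   # output gate: how much of c_t to expose
    g = np.tanh(Wg @ z + bg)   # candidate cell values
    c = f * c_prev + i * g     # new cell state
    h = o * np.tanh(c)         # new hidden state (y_t is read from h_t)
    return h, c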
Long Short Term Memory (LSTM)
Overall, the input xt and the previous hidden state ht-1 determine the next cell and hidden states (ct, ht), as well as how much of the old values to keep and how much to overwrite with new values.
[Figure: the full LSTM cell again, with the input, forget, and output gates labeled.]
Long Short Term Memory (LSTM)
[Figure: the full LSTM cell with the gate activations labeled ft (forget), it (input), and ot (output).]
Long Short Term Memory (LSTM)
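Written out, the updates in the diagrams above are the standard LSTM equations (the weight and bias names Wf, bf, … are added here for clarity; [ht-1, xt] denotes concatenation and ⊙ element-wise multiplication):

ft = σ(Wf · [ht-1, xt] + bf)        forget gate
it = σ(Wi · [ht-1, xt] + bi)        input gate
ot = σ(Wo · [ht-1, xt] + bo)        output gate
gt = tanh(Wg · [ht-1, xt] + bg)     candidate cell values
ct = ft ⊙ ct-1 + it ⊙ gt
ht = ot ⊙ tanh(ct)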
Gated Recurrent Units (GRU)
[Figure: GRU cell — a reset gate rt and an update gate zt (both σ over FC(ht-1, xt)); the tanh candidate uses rt ⊙ ht-1, and ht is a convex combination of ht-1 and the candidate, weighted by zt.]
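For reference, the standard GRU update corresponding to this diagram (notation as above; some references swap the roles of zt and 1 − zt in the last line — the form below follows the Keras convention, where zt controls how much of ht-1 is carried over):

rt = σ(Wr · [ht-1, xt] + br)            reset gate
zt = σ(Wz · [ht-1, xt] + bz)            update gate
h̃t = tanh(Wh · [rt ⊙ ht-1, xt] + bh)    candidate
ht = zt ⊙ ht-1 + (1 − zt) ⊙ h̃t          convex combination of old state and candidate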
TensorFlow API: LSTM, GRU
The first argument, units, is the dimensionality of the hidden state.
tf.keras.layers.GRU(
    units, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, time_major=False,
    reset_after=True, **kwargs
)
tf.keras.layers.LSTM(
    units, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', unit_forget_bias=True,
    kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None,
    activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None,
    bias_constraint=None, dropout=0.0, recurrent_dropout=0.0,
    return_sequences=False, return_state=False, go_backwards=False, stateful=False,
    time_major=False, unroll=False, **kwargs
)
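A minimal usage sketch (shapes and hyperparameters are illustrative, not from the slides): return_sequences controls whether every ht or only the last one is returned, and return_state additionally returns the final states.

import tensorflow as tf

# Toy batch: 32 sequences, 10 time steps, 8 features per step.
x = tf.random.normal((32, 10, 8))

# LSTM with a 64-dimensional hidden state, keeping h_t for every step.
lstm = tf.keras.layers.LSTM(units=64, return_sequences=True)
print(lstm(x).shape)        # (32, 10, 64)

# GRU returning only the final hidden state.
gru = tf.keras.layers.GRU(units=64)
print(gru(x).shape)         # (32, 64)

# return_state=True also returns the final h_T and (for LSTM) c_T.
output, h_T, c_T = tf.keras.layers.LSTM(units=64, return_state=True)(x)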
Practical Guides