1 of 17

Lecture 4: Recurrent Neural Networks – Part 2
Sookyung Kim
sookim@ewha.ac.kr


2 of 17

Towards Modeling Longer Dependencies


3 of 17

Gradient Flow Problem with Vanilla RNN

Let’s take a closer look inside the RNN cell: backprop from ht to ht-1 multiplies the gradient by Whh.

[Diagram: inside a vanilla RNN cell. ht-1 is multiplied by Whh and xt by Wxh; the two are summed and passed through tanh to produce ht (and the output yt).]
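For reference, a standard form of the single-step recurrence behind this diagram and its Jacobian (the usual vanilla-RNN formulation, with notation matching the slides):

    h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
    \frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(\tanh'(W_{hh} h_{t-1} + W_{xh} x_t)\right) W_{hh}

Backpropagating through one step therefore multiplies the upstream gradient by a tanh' factor and by Whh.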


4 of 17

Gradient Flow Problem with Vanilla RNN

At each output yt, we compute the loss Lt, and it backpropagates all the way back to the beginning.

[Diagram: vanilla RNN unrolled for three time steps (h0 → h1 → h2 → h3), each step reusing the same Whh and Wxh; the gradient of each loss flows back through every earlier tanh and Whh multiplication.]


5 of 17

Gradient Flow Problem with Vanilla RNN

What does this formula mean?

tanh’ is almost always < 1 → vanishing gradients.

Backprop iteratively multiplies by the same matrix Whh:
if its largest singular value is > 1, gradients explode;
if its largest singular value is < 1, gradients vanish.

→ Hard to control gradients...
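A standard reconstruction of the formula the slide refers to, the backpropagation-through-time product for a vanilla RNN (assuming a loss L_T at step T propagated back to h_1):

    \frac{\partial L_T}{\partial h_1}
      = \frac{\partial L_T}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
      = \frac{\partial L_T}{\partial h_T} \prod_{t=2}^{T} \mathrm{diag}\!\left(\tanh'(W_{hh} h_{t-1} + W_{xh} x_t)\right) W_{hh}

Every factor contains a tanh' term (< 1) and the same matrix Whh, which is why the product tends to vanish or explode.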

  • Exploding gradients can be addressed by gradient clipping: scale the gradient down if its norm exceeds some threshold (see the sketch after this list).
  • Vanishing gradients, again? Hard to deal with without changing the architecture.
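A minimal sketch of gradient clipping in TensorFlow/Keras (the model, sizes, and loss below are hypothetical placeholders; clipnorm is the real Keras optimizer argument that does the clipping):

    import tensorflow as tf

    # Hypothetical toy model: a vanilla RNN (SimpleRNN) plus a classifier head.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 32)),                 # (time steps, features), hypothetical feature size
        tf.keras.layers.SimpleRNN(64),                    # vanilla RNN with 64 hidden units
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # clipnorm rescales any gradient whose norm exceeds 1.0,
    # which tames exploding gradients but does nothing for vanishing ones.
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')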


6 of 17

Notation

For simplicity, let’s use an “FC” (fully connected) box in place of each weight matrix.

[Diagram: vanilla RNN with old notation (explicit Whh and Wxh multiplications, sum, tanh) vs. vanilla RNN with new notation (a single FC box followed by tanh).]
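A minimal sketch of the “FC + tanh” view of one vanilla RNN step (the hidden size and function name are hypothetical; the FC box absorbs Whh, Wxh, and the bias into one dense layer acting on the concatenation [ht-1, xt]):

    import tensorflow as tf

    hidden_size = 64  # hypothetical

    # The "FC" box: a single dense layer on [h_{t-1}, x_t].
    fc = tf.keras.layers.Dense(hidden_size)

    def vanilla_rnn_step(h_prev, x_t):
        # h_t = tanh(FC([h_{t-1}, x_t]))
        return tf.tanh(fc(tf.concat([h_prev, x_t], axis=-1)))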


7 of 17

Long Short Term Memory (LSTM)

Recall that the vanilla RNN had a vanishing gradient problem, because every backward pass goes through the FC layer (and a tanh).

[Diagram: vanilla RNN cell in FC notation. ht-1 and xt enter the FC box, followed by tanh, producing ht and the output yt.]


8 of 17

Long Short Term Memory (LSTM)

To avoid this, we add a “highway” that detours around the FC layer, together with a new set of hidden states called the cell state (ct). An additional tanh non-linearity is applied after the addition with ct-1.

[Diagram: the tanh of the FC output is added to ct-1 on a separate “highway” to form ct; a second tanh of ct produces ht and the output yt.]
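Reading the diagram as equations (a hedged reconstruction of this intermediate, gate-less design, not yet the full LSTM):

    c_t = c_{t-1} + \tanh\left(\mathrm{FC}([h_{t-1}, x_t])\right)
    h_t = \tanh(c_t)

The addition gives gradients a path from ct back to ct-1 that does not pass through the FC layer.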


9 of 17

Long Short Term Memory (LSTM)

Even though the cell state is meant for long-term memory, we still need some mechanism to control it. The forget gate is added for this purpose.

[Diagram: the same cell, now with a forget gate: a second FC followed by a sigmoid (σ) whose output multiplies ct-1 before the addition.]


10 of 17

Long Short Term Memory (LSTM)

Similarly, on the input side we add an input gate to control the flow from the input.

[Diagram: an input gate (FC + σ) now multiplies the tanh candidate before it is added to the forget-gated ct-1.]


11 of 17

Long Short Term Memory (LSTM)

Lastly, we add an output gate to control how much of the cell state is written to the hidden state ht.

[Diagram: the full LSTM cell with forget, input, and output gates (each FC + σ); the output gate multiplies tanh(ct) to produce ht and the output yt.]


12 of 17

Long Short Term Memory (LSTM)

Overall, the input xt and the previous hidden state ht-1 determine the next cell and hidden states (ct, ht), as well as how much of the old values to keep and how much to update with new values.

[Diagram: the full LSTM cell again, with all three gates (forget, input, output) computed from ht-1 and xt.]


13 of 17

Long Short Term Memory (LSTM)

[Diagram: the full LSTM cell with the gates labeled ft (forget), it (input), and ot (output).]
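The standard LSTM equations corresponding to this diagram, covering the forget, input, and output gates introduced on the previous slides (the FC boxes correspond to the weight matrices below; biases included for completeness):

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)             % forget gate
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)             % input gate
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)             % output gate
    \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)      % candidate cell value
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    % cell state: keep old vs. add new
    h_t = o_t \odot \tanh(c_t)                         % hidden state / output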


14 of 17

Long Short Term Memory (LSTM)

  • With the cell state and the uninterrupted gradient highway, LSTM can preserve long-range information better than a vanilla RNN can.
    • If the forget gate = 1 and the input gate = 0, the cell state is preserved indefinitely. (If this behavior is needed, the model will learn it from the data.)
  • LSTM does NOT guarantee that there are no vanishing/exploding gradients, but it does give the model an easier way to learn long-distance dependencies.


15 of 17

Gated Recurrent Units (GRU)

  • Another idea similar to LSTM, helping RNNs model long-range dependencies.
    • No additional cell state as in LSTM.
    • Fewer parameters compared to LSTM.
    • Provides a gradient highway similar to LSTM's, using a convex combination of the previous hidden state and a new candidate computed from the input (see the equations after the diagram).

[Diagram: GRU cell with a reset gate rt and an update gate zt (each FC + σ); a tanh candidate is computed from xt and the reset-gated ht-1, and ht is a convex combination ("cnvx") of ht-1 and the candidate, controlled by zt.]
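A common formulation of the GRU equations matching the gates in the diagram (conventions differ on whether zt weights the old or the new state; one common version):

    r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)                       % reset gate
    z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)                       % update gate
    \tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)      % candidate state
    h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t        % convex combination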


16 of 17

TensorFlow API: LSTM, GRU

The first argument, units, is the dimensionality of the hidden state.

tf.keras.layers.GRU(
    units,
    activation='tanh',
    recurrent_activation='sigmoid',
    use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros',
    kernel_regularizer=None,
    recurrent_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    recurrent_constraint=None,
    bias_constraint=None,
    dropout=0.0,
    recurrent_dropout=0.0,
    return_sequences=False,
    return_state=False,
    go_backwards=False,
    stateful=False,
    unroll=False,
    time_major=False,
    reset_after=True,
    **kwargs
)

tf.keras.layers.LSTM(
    units,
    activation='tanh',
    recurrent_activation='sigmoid',
    use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros',
    unit_forget_bias=True,
    kernel_regularizer=None,
    recurrent_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    recurrent_constraint=None,
    bias_constraint=None,
    dropout=0.0,
    recurrent_dropout=0.0,
    return_sequences=False,
    return_state=False,
    go_backwards=False,
    stateful=False,
    time_major=False,
    unroll=False,
    **kwargs
)
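A minimal usage sketch of these layers (the vocabulary size, embedding size, and number of classes are hypothetical):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None,), dtype='int32'),                 # variable-length token ids
        tf.keras.layers.Embedding(input_dim=10000, output_dim=128),   # hypothetical vocab/embedding sizes
        tf.keras.layers.LSTM(64),                                     # units = 64 (hidden state dimensionality)
        # tf.keras.layers.GRU(64),                                    # drop-in alternative to the LSTM line above
        tf.keras.layers.Dense(2, activation='softmax'),               # hypothetical 2-class output
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')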


17 of 17

Practical Guides

  • LSTM is a good default choice for an RNN.
  • Consider variants like GRU if you want faster computation and fewer parameters.
  • For NLP-heavy problems, consider Transformers.
    • We will cover Transformers later in this course.
