1 of 17

Lecture 4: Recurrent Neural Networks – Part 2
Sookyung Kim
sookim@ewha.ac.kr


2 of 17

Towards Modeling Longer Dependencies


3 of 17

Gradient Flow Problem with Vanilla RNN

Let’s take a closer look inside the RNN cell: backprop from ht to ht-1 multiplies the gradient by Whh.

[Diagram: inside a vanilla RNN cell. ht-1 is multiplied by Whh and xt by Wxh; the two are summed and passed through tanh to produce ht (and the output yt).]
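For reference, a standard form of the single-step recurrence behind this diagram and its Jacobian (the usual vanilla-RNN formulation, with notation matching the slides):

    h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
    \frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(\tanh'(W_{hh} h_{t-1} + W_{xh} x_t)\right) W_{hh}

Backpropagating through one step therefore multiplies the upstream gradient by a tanh' factor and by Whh.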


4 of 17

Gradient Flow Problem with Vanilla RNN

At each output yt, we compute the loss Lt, and it backpropagates all the way back to the beginning.

[Diagram: vanilla RNN unrolled for three time steps (h0 → h1 → h2 → h3), each step reusing the same Whh and Wxh; the gradient of each loss flows back through every earlier tanh and Whh multiplication.]


5 of 17

Gradient Flow Problem with Vanilla RNN

What does this formula mean?

tanh’ is almost always < 1 → vanishing gradients.

Backprop iteratively multiplies by the same matrix Whh:
if its largest singular value is > 1, gradients explode;
if its largest singular value is < 1, gradients vanish.

→ Hard to control gradients...
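A standard reconstruction of the formula the slide refers to, the backpropagation-through-time product for a vanilla RNN (assuming a loss L_T at step T propagated back to h_1):

    \frac{\partial L_T}{\partial h_1}
      = \frac{\partial L_T}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
      = \frac{\partial L_T}{\partial h_T} \prod_{t=2}^{T} \mathrm{diag}\!\left(\tanh'(W_{hh} h_{t-1} + W_{xh} x_t)\right) W_{hh}

Every factor contains a tanh' term (< 1) and the same matrix Whh, which is why the product tends to vanish or explode.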

  • Exploding gradients can be addressed by gradient clipping: scale the gradient down if its norm exceeds some threshold (see the sketch after this list).
  • Vanishing gradients, again? Hard to deal with without changing the architecture.
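A minimal sketch of gradient clipping in TensorFlow/Keras (the model, sizes, and loss below are hypothetical placeholders; clipnorm is the real Keras optimizer argument that does the clipping):

    import tensorflow as tf

    # Hypothetical toy model: a vanilla RNN (SimpleRNN) plus a classifier head.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 32)),                 # (time steps, features), hypothetical feature size
        tf.keras.layers.SimpleRNN(64),                    # vanilla RNN with 64 hidden units
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # clipnorm rescales any gradient whose norm exceeds 1.0,
    # which tames exploding gradients but does nothing for vanishing ones.
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')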


6 of 17

Notation

For simplicity, let’s use an “FC” (fully connected) box in place of each weight matrix.

[Diagram: vanilla RNN with old notation (explicit Whh and Wxh multiplications, sum, tanh) vs. vanilla RNN with new notation (a single FC box followed by tanh).]
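A minimal sketch of the “FC + tanh” view of one vanilla RNN step (the hidden size and function name are hypothetical; the FC box absorbs Whh, Wxh, and the bias into one dense layer acting on the concatenation [ht-1, xt]):

    import tensorflow as tf

    hidden_size = 64  # hypothetical

    # The "FC" box: a single dense layer on [h_{t-1}, x_t].
    fc = tf.keras.layers.Dense(hidden_size)

    def vanilla_rnn_step(h_prev, x_t):
        # h_t = tanh(FC([h_{t-1}, x_t]))
        return tf.tanh(fc(tf.concat([h_prev, x_t], axis=-1)))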


7 of 17

Long Short Term Memory (LSTM)

Recall that the vanilla RNN had a vanishing gradient problem, because every backward pass goes through the FC layer (and a tanh).

[Diagram: vanilla RNN cell in FC notation. ht-1 and xt enter the FC box, followed by tanh, producing ht and the output yt.]


8 of 17

Long Short Term Memory (LSTM)

To avoid this, we add a “highway” that detours around the FC layer, together with a new set of hidden states called the cell state (ct). An additional tanh non-linearity is applied after the addition with ct-1.

[Diagram: the tanh of the FC output is added to ct-1 on a separate “highway” to form ct; a second tanh of ct produces ht and the output yt.]
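Reading the diagram as equations (a hedged reconstruction of this intermediate, gate-less design, not yet the full LSTM):

    c_t = c_{t-1} + \tanh\left(\mathrm{FC}([h_{t-1}, x_t])\right)
    h_t = \tanh(c_t)

The addition gives gradients a path from ct back to ct-1 that does not pass through the FC layer.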


9 of 17

Long Short Term Memory (LSTM)

Even though the cell state is meant for long-term memory, we still need some mechanism to control it. The forget gate is added for this purpose.

[Diagram: the same cell, now with a forget gate: a second FC followed by a sigmoid (σ) whose output multiplies ct-1 before the addition.]


10 of 17

Long Short Term Memory (LSTM)

Similarly, on the input side we add an input gate to control the flow from the input.

[Diagram: an input gate (FC + σ) now multiplies the tanh candidate before it is added to the forget-gated ct-1.]


11 of 17

Long Short Term Memory (LSTM)

Lastly, we add an output gate to control how much of the cell state is written to the hidden state ht.

[Diagram: the full LSTM cell with forget, input, and output gates (each FC + σ); the output gate multiplies tanh(ct) to produce ht and the output yt.]


12 of 17

Long Short Term Memory (LSTM)

Overall, the input xt and the previous hidden state ht-1 determine the next cell and hidden states (ct, ht), as well as how much of the old values to keep and how much to update with new values.

[Diagram: the full LSTM cell again, with all three gates (forget, input, output) computed from ht-1 and xt.]


13 of 17

Long Short Term Memory (LSTM)

[Diagram: the full LSTM cell with the gates labeled ft (forget), it (input), and ot (output).]
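The standard LSTM equations corresponding to this diagram, covering the forget, input, and output gates introduced on the previous slides (the FC boxes correspond to the weight matrices below; biases included for completeness):

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)             % forget gate
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)             % input gate
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)             % output gate
    \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)      % candidate cell value
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    % cell state: keep old vs. add new
    h_t = o_t \odot \tanh(c_t)                         % hidden state / output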


14 of 17

Long Short Term Memory (LSTM)

  • With the cell state and the uninterrupted gradient highway, LSTM can preserve long-range information better than a vanilla RNN can.
    • If the forget gate = 1 and the input gate = 0, the cell state is preserved indefinitely. (If this behavior is needed, the model will learn it from the data.)
  • LSTM does NOT guarantee that there are no vanishing/exploding gradients, but it does give the model an easier way to learn long-distance dependencies.


15 of 17

Gated Recurrent Units (GRU)

  • Another idea similar to LSTM, helping RNNs model long-range dependencies.
    • No additional cell state as in LSTM.
    • Fewer parameters compared to LSTM.
    • Provides a gradient highway similar to LSTM's, using a convex combination of the previous hidden state and a new candidate computed from the input (see the equations after the diagram).

[Diagram: GRU cell with a reset gate rt and an update gate zt (each FC + σ); a tanh candidate is computed from xt and the reset-gated ht-1, and ht is a convex combination ("cnvx") of ht-1 and the candidate, controlled by zt.]
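A common formulation of the GRU equations matching the gates in the diagram (conventions differ on whether zt weights the old or the new state; one common version):

    r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)                       % reset gate
    z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)                       % update gate
    \tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)      % candidate state
    h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t        % convex combination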


16 of 17

TensorFlow API: LSTM, GRU

The first argument, units, is the dimensionality of the hidden state.

tf.keras.layers.GRU(
    units,
    activation='tanh',
    recurrent_activation='sigmoid',
    use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros',
    kernel_regularizer=None,
    recurrent_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    recurrent_constraint=None,
    bias_constraint=None,
    dropout=0.0,
    recurrent_dropout=0.0,
    return_sequences=False,
    return_state=False,
    go_backwards=False,
    stateful=False,
    unroll=False,
    time_major=False,
    reset_after=True,
    **kwargs
)

tf.keras.layers.LSTM(
    units,
    activation='tanh',
    recurrent_activation='sigmoid',
    use_bias=True,
    kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros',
    unit_forget_bias=True,
    kernel_regularizer=None,
    recurrent_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    recurrent_constraint=None,
    bias_constraint=None,
    dropout=0.0,
    recurrent_dropout=0.0,
    return_sequences=False,
    return_state=False,
    go_backwards=False,
    stateful=False,
    time_major=False,
    unroll=False,
    **kwargs
)
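A minimal usage sketch of these layers (the vocabulary size, embedding size, and number of classes are hypothetical):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None,), dtype='int32'),                 # variable-length token ids
        tf.keras.layers.Embedding(input_dim=10000, output_dim=128),   # hypothetical vocab/embedding sizes
        tf.keras.layers.LSTM(64),                                     # units = 64 (hidden state dimensionality)
        # tf.keras.layers.GRU(64),                                    # drop-in alternative to the LSTM line above
        tf.keras.layers.Dense(2, activation='softmax'),               # hypothetical 2-class output
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')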


17 of 17

Practical Guides

  • LSTM is a good default choice for an RNN.
  • Consider variants like GRU if you want faster computation and fewer parameters.
  • For NLP-heavy problems, consider Transformers.
    • We will cover Transformers later in this course.
