1 of 37

Ahmad Kalhor-University of Tehran

1

Chapter 5

Recurrent Neural Networks

    • RNN (Recurrent Neural Network)
    • LSTM (Long Short-Term Memory)
    • GRU (Gated Recurrent Unit)

Memory Neural Networks

NNs which associate an input pattern (or a sequence of input patterns) to an output pattern (or a sequence of output patterns) via supervised learning.

* Such NNs form memories for pattern association.

Recurrent NN

[Figure: patterns x_1, x_2, x_3, …, x_N are taken sequentially (often through time); the network produces an output pattern y or a sequence of output patterns.]


Some Applications of RNNs

  1. Language Modeling and Generating Text


Given a sequence of words, we want to predict the probability of each word given the previous words. Language models let us measure how likely a sentence is, which is an important input for machine translation (since high-probability sentences are typically correct).
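In equation form, the probability a language model assigns to a sentence factorizes by the chain rule, one term per word given its predecessors:

```latex
P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P\!\left(w_i \mid w_1, \ldots, w_{i-1}\right)
```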

2. Machine Translation

Machine translation is similar to language modeling in that our input is a sequence of words in the source language (e.g., German). We want to output a sequence of words in the target language (e.g., English).


3. Speech Recognition

Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.

4. Generating Image Descriptions

  • Together with convolutional neural networks, RNNs have been used as part of models that generate descriptions for unlabeled images.


Challenges in learning RNNs

Memory NNs with sequenced input patterns face several difficulties:

  1. The sequenced input patterns may be mixed with disturbances and unrelated patterns.
  2. The features required to associate the true output(s) are hidden within the input sequence.
  3. There may be both short- and long-range dependencies among the sequenced input patterns.
  4. The sequenced input patterns may be presented in different modalities.

* RNNs form memories that are robust against distortion, disturbances, dimension variations of input patterns, and different sampling rates.


  • Some known recurrent NNs
    • Recurrent Neural Networks (RNNs)
    • Long Short-Term Memory (LSTM)
    • Gated Recurrent Unit (GRU)

  • Such networks define nonlinear, non-homogeneous difference equations. Solving these equations reveals the nonlinear functions that generate the output patterns from the implicit sequenced input patterns.
  • Recurrent structures provide plenty of memories, which are utilized in pattern associations between implicit inputs and target outputs.

  • Pattern association (memory) types:
    • One to one: many classification problems (fingerprints, biometric signals, signatures)
    • Many to one: voice, handwriting, …
    • One to many: image captioning
    • Many to many: machine translation, text to picture/film


Training RNNs


Back Propagation Through Time (BPTT)

Updating Rules

Loss function


A Typical simple RNN
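The slide's diagram is not reproduced in this text version; as a minimal sketch of the recurrence it depicts (the dimensions and random weights below are illustrative assumptions, not the lecture's values), the network computes h_t = tanh(U x_t + W h_{t-1}) and reads the output from the last hidden state:

```python
import numpy as np

def rnn_forward(xs, U, W, V, b_h, b_y):
    """Run a simple (Elman) RNN over a sequence of input vectors.

    h_t = tanh(U x_t + W h_{t-1} + b_h),  y = V h_T + b_y
    Returns the final output and all hidden states.
    """
    h = np.zeros(W.shape[0])            # h_0 = 0
    hs = []
    for x in xs:                        # patterns are taken sequentially
        h = np.tanh(U @ x + W @ h + b_h)
        hs.append(h)
    y = V @ h + b_y                     # output from the last hidden state
    return y, hs

# Hypothetical sizes: 3-dim inputs, 4-dim hidden state, 2-dim output.
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3)); W = rng.normal(size=(4, 4))
V = rng.normal(size=(2, 4)); b_h = np.zeros(4); b_y = np.zeros(2)
xs = [rng.normal(size=3) for _ in range(5)]   # x_1 ... x_5
y, hs = rnn_forward(xs, U, W, V, b_h, b_y)
```

Because the hidden state passes through tanh, every component of every h_t stays strictly inside (-1, 1).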


Back Propagation Through Time (BPTT)


Main limitation of RNNs


The derivative of the sigmoid function is less than one (at most 1/4).

In BPTT, when the RNN is unfolded many times, the back-propagated gradient coefficient vanishes for inputs taken at older time steps.

Learning long dependencies is therefore no longer effective.

Error gradients vanish exponentially quickly with the size of the time lag between important events.
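A toy scalar illustration of this effect (the constants below are arbitrary assumptions): each unfolded step multiplies the gradient by σ′(z_t)·w, and since σ′ ≤ 1/4, the product collapses after a few dozen steps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Scalar RNN h_t = sigmoid(w * h_{t-1} + x_t). The gradient of h_T with
# respect to h_0 is the product of per-step factors sigma'(z_t) * w.
# Since sigma'(z) <= 0.25, with |w| near 1 the product shrinks geometrically.
w = 1.0
h, grad = 0.5, 1.0
grads = []
for t in range(30):
    z = w * h + 0.1                              # constant input 0.1 (arbitrary)
    grad *= sigmoid(z) * (1 - sigmoid(z)) * w    # one chain-rule factor per step
    h = sigmoid(z)
    grads.append(grad)
# After 30 unfolded steps, grad has shrunk by many orders of magnitude.
```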


RNNs are good at forming short-term memory:

"The clouds are in the sky." (implicit input 🡪 target)

RNNs are not good at forming long-term memory:

"I grew up in France… … I speak fluent French."


RNN Extensions

  • Bidirectional RNNs are based on the idea that the output at time t may depend not only on the previous elements in the sequence, but also on future elements. For example, to predict a missing word in a sequence you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, and the output is computed from the hidden states of both.
  • Deep (bidirectional) RNNs are similar to bidirectional RNNs, except that we now have multiple layers per time step. In practice this gives higher learning capacity (but also requires a lot of training data).
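A minimal sketch of the bidirectional idea (random illustrative weights, not the lecture's notation): run one RNN left-to-right and one right-to-left, then concatenate the two hidden states at each time step:

```python
import numpy as np

def birnn_states(xs, U_f, W_f, U_b, W_b):
    """Hidden states of a bidirectional RNN: one pass left-to-right,
    one pass right-to-left, concatenated per time step."""
    def run(seq, U, W):
        h = np.zeros(W.shape[0]); out = []
        for x in seq:
            h = np.tanh(U @ x + W @ h)
            out.append(h)
        return out
    fwd = run(xs, U_f, W_f)                 # context from the past
    bwd = run(xs[::-1], U_b, W_b)[::-1]     # context from the future, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Hypothetical sizes: 3-dim inputs, 4-dim hidden state per direction.
rng = np.random.default_rng(1)
U_f = rng.normal(size=(4, 3)); W_f = rng.normal(size=(4, 4))
U_b = rng.normal(size=(4, 3)); W_b = rng.normal(size=(4, 4))
xs = [rng.normal(size=3) for _ in range(6)]
hs = birnn_states(xs, U_f, W_f, U_b, W_b)
```

Each combined state has twice the hidden dimension, since it stacks the forward and backward views of the same position.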


LSTM (Long Short-Term Memory), Hochreiter & Schmidhuber 1997


The repeating module in a standard RNN contains a single layer:

h_t = tanh(U x_t + W h_{t-1})    (h_t is the same as s_t)

The repeating module in an LSTM contains four interacting layers; the LSTM can read, write, and delete information from its memory.

[Figure: the repeating module in an RNN vs. the repeating module in an LSTM]


Some Concepts


Two key concepts in LSTMs:

1. Cell state

2. Gates: a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.


Step-by-Step LSTM Walk-Through


Step 1: decide what information we’re going to throw away from the cell state.
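In the usual notation (the slide's figure is not reproduced here), this forget gate looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}:

```latex
f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
```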


Step 2: decide what new information we’re going to store in the cell state.
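In the usual notation this step has two parts: an input-gate sigmoid layer deciding which values to update, and a tanh layer creating candidate values:

```latex
i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad
\tilde{C}_t = \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
```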


Step 3: drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
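In equation form, the old state is scaled by the forget gate and the gated candidates are added (⊙ denotes element-wise multiplication):

```latex
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```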


Step 4: decide what we’re going to output.
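In the usual notation, the output is a filtered version of the cell state:

```latex
o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh(C_t)
```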


Compact Form Equations
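The compact equations on this slide are not reproduced in this text version; a sketch of one LSTM step in the standard formulation (the sizes and random weights below are illustrative assumptions) is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    """One LSTM step in compact form:
       f = sigma(W_f [h,x] + b_f),  i = sigma(W_i [h,x] + b_i)
       C~ = tanh(W_C [h,x] + b_C),  C = f * C_prev + i * C~
       o = sigma(W_o [h,x] + b_o),  h = o * tanh(C)
    """
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"]) # candidate values
    C = f * C_prev + i * C_tilde               # new cell state
    o = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h = o * np.tanh(C)                         # new hidden state
    return h, C

# Hypothetical sizes: 3-dim input, 4-dim hidden/cell state.
rng = np.random.default_rng(2)
p = {k: rng.normal(size=(4, 7)) for k in ("W_f", "W_i", "W_C", "W_o")}
p.update({k: np.zeros(4) for k in ("b_f", "b_i", "b_C", "b_o")})
h, C = np.zeros(4), np.zeros(4)
for x in [rng.normal(size=3) for _ in range(5)]:
    h, C = lstm_step(x, h, C, p)
```

Since h = o ⊙ tanh(C) with both factors bounded, every component of h stays strictly inside (-1, 1), while the cell state C itself is unbounded.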



  • Parameters are updated by BPTT: all computed gradients of the parameters are added together through time.
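A minimal scalar sketch of this accumulation (the values below are illustrative, with a single loss at the last step): walking backward through the unfolded network, each time step's contribution to the weight gradients is added on:

```python
import numpy as np

def forward(w, u, xs):
    """Scalar RNN h_t = tanh(w h_{t-1} + u x_t); returns all states (h_0 = 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs, target):
    """Loss 0.5 (h_T - target)^2 at the last step; the gradients of w and u
    are ADDED together across all unfolded time steps."""
    hs = forward(w, u, xs)
    loss = 0.5 * (hs[-1] - target) ** 2
    dh = hs[-1] - target
    dw = du = 0.0
    for t in range(len(xs), 0, -1):        # walk back through time
        dz = dh * (1.0 - hs[t] ** 2)       # through the tanh nonlinearity
        dw += dz * hs[t - 1]               # step t's contribution to dw
        du += dz * xs[t - 1]               # step t's contribution to du
        dh = dz * w                        # pass the gradient to h_{t-1}
    return loss, dw, du

xs = [0.5, -1.0, 0.25, 0.8]
loss, dw, du = bptt(0.9, 0.4, xs, target=0.2)
```

The summed gradients agree with finite-difference estimates, which is a standard sanity check for a hand-written BPTT.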



LSTM Learning methods

  1. As with RNNs, BPTT can be performed to learn the weights and biases of an LSTM module. Unlike in standard RNNs, the error remains in the unit's memory.
  2. LSTMs can also be trained by a combination of artificial evolution for the weights to the hidden units and pseudo-inverse or support vector machines for the weights to the output units.
  3. In reinforcement-learning applications, LSTMs can be trained by policy gradient methods, evolution strategies, or genetic algorithms.



LSTM Forward and Backward Pass, Arun Mallya (derivation figures not reproduced in this text version)


Extended Versions


1. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections”: we let the gate layers look at the cell state.
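In a common rendering of this variant, each sigmoid gate additionally receives the cell state, with the output gate looking at the freshly updated C_t:

```latex
f_t = \sigma\!\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right),\quad
i_t = \sigma\!\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right),\quad
o_t = \sigma\!\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
```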


2. Another variation uses coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together: we only forget when we are going to input something in its place, and we only input new values to the state when we forget something older.
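In equation form, the input gate is replaced by the complement of the forget gate:

```latex
C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t
```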


3. A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho et al. (2014). It combines the forget and input gates into a single “update gate,” merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models and has been growing increasingly popular.
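A sketch of one GRU step under the common formulation from Cho et al. (the sizes and random weights below are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step: the update gate z merges the LSTM's forget/input gates,
    and the cell state is merged into the hidden state h."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])          # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])          # reset gate
    h_tilde = np.tanh(p["W_h"] @ np.concatenate([r * h_prev, x]) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_tilde        # interpolate old and new

# Hypothetical sizes: 3-dim input, 4-dim hidden state.
rng = np.random.default_rng(3)
p = {k: rng.normal(size=(4, 7)) for k in ("W_z", "W_r", "W_h")}
p.update({k: np.zeros(4) for k in ("b_z", "b_r", "b_h")})
h = np.zeros(4)
for x in [rng.normal(size=3) for _ in range(5)]:
    h = gru_step(x, h, p)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden state stays bounded, and a single gate z plays the roles of both forgetting and writing.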


End of Chapter 5

Thank you
