1 of 32

Long Short-Term Memory (LSTM)

2 of 32

RNN

An unrolled recurrent neural network

3 of 32

RNN

  • One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task
  • Sometimes, we only need to look at recent information to perform the present task
    • predict the last word in “the clouds are in the sky”
  • In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information

4 of 32

Long Distance Dependencies

  • It is very difficult to train vanilla RNNs (with tanh/sigmoid activations) to retain information over many time steps
    • i.e., when handling long temporal dependencies
  • This makes it very difficult for RNNs to learn long-distance dependencies, such as subject-verb agreement.


5 of 32

Vanishing/Exploding Gradient Problem

  • Backpropagated errors multiply at each layer, resulting in exponential
    • decay (if gradients < 1)
    • growth (if gradients > 1)
  • Makes it very difficult to train deep networks, or simple recurrent networks over many time steps.
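A toy calculation (illustrative per-step factors, not from any real network) shows how quickly repeated multiplication decays or grows over 100 time steps:

```python
# Toy illustration: backpropagating an error through T time steps
# multiplies it by a per-step gradient factor T times.
T = 100

vanished = 0.9 ** T   # per-step factor < 1: exponential decay
exploded = 1.1 ** T   # per-step factor > 1: exponential growth

print(f"0.9^{T} = {vanished:.2e}")  # ~2.66e-05, effectively zero
print(f"1.1^{T} = {exploded:.2e}")  # ~1.38e+04, blows up
```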

 

6 of 32

How to resolve Vanishing Gradient Problems?

  • Possible solutions
    • Activation functions
    • CNN: Residual networks [He et al., 2016]
    • RNN: LSTM (Long Short-Term Memory)

7 of 32

Solving Vanishing Gradient: Activation Functions

  • Use different activation functions that are not bounded:
    • Recent works largely use ReLU or their variants
    • No saturation, easy to optimize

8 of 32

Solving Vanishing Gradient: Residual Networks

  • Residual networks (ResNet [He et al., 2016])
    • Feed-forward NN with “shortcut connections”
    • Can preserve gradient flow throughout the entire depth of the network
    • Possible to train more than 100 layers by simply stacking residual blocks
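Following He et al. (2016), a residual block adds the input back onto the learned transformation, so gradients can flow through the identity path:

```latex
y = \mathcal{F}(x, \{W_i\}) + x
```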

9 of 32

Solving Vanishing Gradient: LSTM

  • LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)
    • Specially designed RNNs which can remember information for much longer periods
  • 3 main steps:
    • Forget irrelevant parts of previous state
    • Selectively update the cell state based on the new input
    • Selectively decide what part of the cell state to output as the new hidden state

10 of 32

Long Short-Term Memory (LSTM)

  • Core Idea: A memory cell which can maintain its state over time, consisting of an explicit memory (the cell state vector) and gating units which regulate the information flow into and out of the memory
  • LSTM networks add gating units to each memory cell:
    • Forget gate
    • Input gate
    • Output gate
  • Prevents vanishing/exploding gradient problem and allows network to retain state information over longer periods of time.

11 of 32

RNN

  • All recurrent neural networks have the form of a chain of repeating modules of neural network
  • In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer

12 of 32

LSTM Network Architecture

  • LSTMs also have this chain like structure, but the repeating module has a different structure
  • Instead of having a single neural network layer, there are four, interacting in a very special way
  • In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others
  • Pointwise operations include vector addition, multiplication, etc.
  • Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations

13 of 32

LSTM vs RNN

14 of 32

Cell State 

  • The key to LSTMs is the cell state
  • Maintains a vector Ct that is the same dimensionality as the hidden state, ht
  • It runs straight down the entire chain, with only some minor linear interactions
  • Information can be added or deleted from this state vector via the forget and input gates.

15 of 32

Gates

  • The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates
  • Gates control the flow of information to/from the memory
  • Composed of a sigmoid neural net layer and a pointwise multiplication operation
  • The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through
  • A value of zero means “let nothing through,” while a value of one means “let everything through!”
  • An LSTM has three of these gates, to protect and control the cell state

sigmoid neural net layer followed by pointwise multiplication operator

16 of 32

Forget Gate

  • Forget gate: Throw away the information that is not required by the cell state
  • Forget gate computes a 0-1 value using a logistic sigmoid applied to the current input, xt, and the previous hidden state, ht-1:
  • Multiplicatively combined with cell state, "forgetting" information where the gate outputs something close to 0.
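In the common notation (weights applied to the concatenation of ht-1 and xt, with a bias term), the forget gate can be written as:

```latex
f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)
```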

17 of 32

Input Gate

  • What we want to store in the cell state?
  • Two parts:
    • Sigmoid part (input gate layer): Identifies which entries in the cell state to update by computing 0-1 sigmoid output
    • tanh part (context gate layer): Generates new candidate values that should be added to the cell state
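In the same notation as the forget gate, the two parts can be written as:

```latex
i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big) \\
\tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big)
```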

18 of 32

Updating the Cell State

  • Update cell state
    • Forget: Multiply the output with previous cell state
    • Update: Add the candidate values to the cell state
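The two steps combine into a single update (⊙ denotes elementwise multiplication):

```latex
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```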

19 of 32

Output Gate

  • Output is based on the cell state
    • Sigmoid layer: To decide which parts we want to output
    • Tanh function: Cell state is scaled between -1 and 1
    • Multiply the output of sigmoid layer and tanh function to output only the parts that we decided in the sigmoid layer
  • Hidden state is updated based on a "filtered" version of the cell state, scaled to –1 to 1 using tanh.
  • Output gate computes a sigmoid function of the input and current hidden state to determine which elements of the cell state to "output".
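In the same notation as the other gates:

```latex
o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big) \\
h_t = o_t \odot \tanh(C_t)
```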

20 of 32

Overall LSTM Cell Architecture

21 of 32

Overall LSTM Computation
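A minimal NumPy sketch of one LSTM time step, using the weight naming from the slides (Wf, Wi, WC, Wo); the dimensions and random initialization are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step; each W acts on the concatenation [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_C @ z + b_C)      # candidate values
    c_t = f_t * c_prev + i_t * c_tilde    # updated cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# Illustrative sizes: input dim 3, hidden dim 4.
n_in, n_h = 3, 4
params = [rng.standard_normal((n_h, n_h + n_in)) * 0.1 for _ in range(4)]
biases = [np.zeros(n_h) for _ in range(4)]

h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):                        # run over a short input sequence
    x = rng.standard_normal(n_in)
    h, c = lstm_step(x, h, c, *params, *biases)

print(h.shape, c.shape)  # (4,) (4,)
```

Note that |h| is always below 1, since the output gate is in (0, 1) and tanh of the cell state is in (-1, 1).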

22 of 32

LSTM Training

  • Backpropagation Through Time (BPTT) is most common
  • What weights are learned?
    • The gates (forget/input/output) and the input tanh layer
    • Each cell has many parameters (Wf, Wi, WC, Wo)
  • Outputs depend on the task
    • Single output prediction for the whole sequence (text classification)
    • One output at each time step (sequence labeling)
  • Stochastic gradient descent (randomize order of examples in each epoch) with momentum (bias weight changes to continue in same direction as last update).
    • ADAM optimizer (Kingma & Ba, 2015)
  • Generally requires lots of training data and lots of compute time, exploiting GPU clusters.
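A sketch of the momentum update described above (the learning rate, momentum coefficient, and toy objective are illustrative, not from the slides):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """Classical momentum: bias the weight change to continue
    in the direction of the previous update."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2 (gradient 2w), starting from w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * w, v)

print(f"final w = {w:.4f}")  # w oscillates toward the minimum at 0
```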

23 of 32

LSTM Training

24 of 32

General Problems Solved with LSTMs

  • Sequence labeling
    • Train with supervised output at each time step computed using a single or multilayer network that maps the hidden state (ht) to an output vector (Ot).
  • Language modeling
    • Train to predict next input (Ot =It+1)
  • Sequence (e.g. text) classification
    • Train a single or multilayer network that maps the final hidden state (hn) to an output vector (O). 
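For illustration (W_out and b_out denote a hypothetical output layer, not parameters named in the slides), the two output mappings could be written as:

```latex
O_t = \mathrm{softmax}(W_{out}\, h_t + b_{out}) \quad \text{(one output per time step)} \\
O = \mathrm{softmax}(W_{out}\, h_n + b_{out}) \quad \text{(one output per sequence)}
```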

25 of 32

Sequence to Sequence Transduction (Mapping)

  • Encoder/Decoder framework maps one sequence to a "deep vector" then another LSTM maps this vector to an output sequence.
  • Train model "end to end" on I/O pairs of sequences.

26 of 32

Summary of LSTM Application Architectures

[Figure: example input/output architectures for image captioning, video activity recognition, text classification, video captioning, machine translation, POS tagging, and language modeling]

27 of 32

Successful Applications of LSTMs

  • Speech recognition: Language and acoustic modeling
  • Sequence labeling
  • Neural syntactic and semantic parsing
  • Image captioning: CNN output vector to sequence
  • Sequence to Sequence
    • Machine Translation (Sutskever, Vinyals, & Le, 2014)
    • Video Captioning (input sequence of CNN frame outputs)

28 of 32

Deep LSTMs

  • Deep LSTMs created by stacking multiple LSTM layers vertically, with the output sequence of one layer forming the input sequence of the next
  • Increases the number of parameters
    • but given sufficient data, performs significantly better than single-layer LSTMs (Graves et al. 2013)
  • Dropout usually applied only to non-recurrent edges, including between layers

29 of 32

Bi-directional LSTM (Bi-LSTM)

  • Data is processed in both directions with two separate hidden layers, which are then fed forward into the same output layer
  • Bidirectional RNNs can better exploit context in both directions,
    • The output may not only depend on the previous elements in the sequence, but also on future elements in the sequence
    • bidirectional LSTMs perform better than unidirectional ones in speech recognition (Graves et al. 2013)
  • It resembles two RNNs stacked on top of each other

[Figure: Bi-LSTM unrolled over inputs xt-1, xt, xt+1, with forward and backward hidden states combining into ht; outputs depend on both past and future elements]

30 of 32

Gated Recurrent Unit (GRU)

  • Alternative RNN to LSTM that uses fewer gates (Cho et al., 2014)
    • Combines the forget and input gates into a single “update” gate
    • Merges the cell state and hidden state, eliminating the separate cell state vector
  • This makes the GRU simpler than the LSTM. There are many other variants too.
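In the formulation of Cho et al. (2014) (bias terms omitted), the update gate zt, reset gate rt, and hidden state are:

```latex
z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big) \\
r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big) \\
\tilde{h}_t = \tanh\big(W \cdot [r_t \odot h_{t-1}, x_t]\big) \\
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```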

31 of 32

GRU vs. LSTM

  • GRUs also take xt and ht-1 as inputs
  • They perform some calculations and then pass along ht
  • What makes them different from LSTMs is that GRUs don't need a separate cell state to pass values along
  • The calculations within each iteration ensure that the ht values being passed along either retain a high amount of old information or are jump-started with a high amount of new information
  • GRUs have significantly fewer parameters and train faster
  • Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each works better on some problems

32 of 32

Conclusions

  • By adding “gates” to an RNN, we can prevent the vanishing/exploding gradient problem.
  • Trained LSTMs/GRUs can retain state information longer and handle long-distance dependencies.
  • Recent impressive results on a range of challenging NLP problems.