1 of 32

Long Short-Term Memory (LSTM)

2 of 32

RNN

An unrolled recurrent neural network

3 of 32

RNN

  • One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task
  • Sometimes, we only need to look at recent information to perform the present task
    • predict the last word in “the clouds are in the sky”
  • In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information

4 of 32

Long Distance Dependencies

  • It is very difficult to train vanilla RNNs (with tanh/sigmoid activations) to retain information over many time steps
    • i.e., when handling long temporal dependencies
  • This makes it very difficult for RNNs to learn long-distance dependencies, such as subject-verb agreement.


5 of 32

Vanishing/Exploding Gradient Problem

  • Backpropagated errors multiply at each layer, resulting in exponential
    • decay (if gradients < 1)
    • growth (if gradients > 1)
  • Makes it very difficult to train deep networks, or simple recurrent networks over many time steps.
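A toy calculation (illustrative per-step factors, not from any real network) shows how quickly repeated multiplication decays or grows over 100 time steps:

```python
# Toy illustration: backpropagating an error through T time steps
# multiplies it by a per-step gradient factor T times.
T = 100

vanished = 0.9 ** T   # per-step factor < 1: exponential decay
exploded = 1.1 ** T   # per-step factor > 1: exponential growth

print(f"0.9^{T} = {vanished:.2e}")  # ~2.66e-05, effectively zero
print(f"1.1^{T} = {exploded:.2e}")  # ~1.38e+04, blows up
```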

 

6 of 32

How to resolve Vanishing Gradient Problems?

  • Possible solutions
    • Activation functions
    • CNN: Residual networks [He et al., 2016]
    • RNN: LSTM (Long Short-Term Memory)

7 of 32

Solving Vanishing Gradient: Activation Functions

  • Use different activation functions that are not bounded:
    • Recent works largely use ReLU or their variants
    • No saturation, easy to optimize

8 of 32

Solving Vanishing Gradient: Residual Networks

  • Residual networks (ResNet [He et al., 2016])
    • Feed-forward NN with “shortcut connections”
    • Can preserve gradient flow throughout the entire depth of the network
    • Possible to train more than 100 layers by simply stacking residual blocks
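Following He et al. (2016), a residual block adds the input back onto the learned transformation, so gradients can flow through the identity path:

```latex
y = \mathcal{F}(x, \{W_i\}) + x
```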

9 of 32

Solving Vanishing Gradient: LSTM

  • LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)
    • Specially designed RNNs which can remember information for much longer periods
  • 3 main steps:
    • Forget irrelevant parts of previous state
    • Selectively update the cell state based on the new input
    • Selectively decide what part of the cell state to output as the new hidden state

10 of 32

Long Short-Term Memory (LSTM)

  • Core Idea: A memory cell which can maintain its state over time, consisting of an explicit memory (the cell state vector) and gating units which regulate the information flow into and out of the memory
  • LSTM networks add gating units to each memory cell:
    • Forget gate
    • Input gate
    • Output gate
  • Prevents vanishing/exploding gradient problem and allows network to retain state information over longer periods of time.

11 of 32

RNN

  • All recurrent neural networks have the form of a chain of repeating modules of neural network
  • In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer

12 of 32

LSTM Network Architecture

  • LSTMs also have this chain like structure, but the repeating module has a different structure
  • Instead of having a single neural network layer, there are four, interacting in a very special way
  • In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others
  • Pointwise operations include vector addition, multiplication, etc.
  • Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations

13 of 32

LSTM vs RNN

14 of 32

Cell State 

  • The key to LSTMs is the cell state
  • Maintains a vector Ct that is the same dimensionality as the hidden state, ht
  • It runs straight down the entire chain, with only some minor linear interactions
  • Information can be added or deleted from this state vector via the forget and input gates.

15 of 32

Gates

  • The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates
  • Gates control the flow of information to/from the memory
  • Composed of a sigmoid neural net layer and a pointwise multiplication operation
  • The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through
  • A value of zero means “let nothing through,” while a value of one means “let everything through!”
  • An LSTM has three of these gates, to protect and control the cell state

sigmoid neural net layer followed by pointwise multiplication operator

16 of 32

Forget Gate

  • Forget gate: Throw away the information that is not required by the cell state
  • Forget gate computes a 0-1 value using a logistic sigmoid applied to the current input, xt, and the previous hidden state, ht-1:
  • Multiplicatively combined with cell state, "forgetting" information where the gate outputs something close to 0.
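In the common notation (weights applied to the concatenation of ht-1 and xt, with a bias term), the forget gate can be written as:

```latex
f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)
```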

17 of 32

Input Gate

  • What we want to store in the cell state?
  • Two parts:
    • Sigmoid part (input gate layer): Identifies which entries in the cell state to update by computing 0-1 sigmoid output
    • tanh part (context gate layer): Generates new candidate values that should be added to the cell state
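In the same notation as the forget gate, the two parts can be written as:

```latex
i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big) \\
\tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big)
```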

18 of 32

Updating the Cell State

  • Update cell state
    • Forget: Multiply the output with previous cell state
    • Update: Add the candidate values to the cell state
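The two steps combine into a single update (⊙ denotes elementwise multiplication):

```latex
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```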

19 of 32

Output Gate

  • Output is based on the cell state
    • Sigmoid layer: To decide which parts we want to output
    • Tanh function: Cell state is scaled between -1 and 1
    • Multiply the output of sigmoid layer and tanh function to output only the parts that we decided in the sigmoid layer
  • Hidden state is updated based on a "filtered" version of the cell state, scaled to –1 to 1 using tanh.
  • Output gate computes a sigmoid function of the input and current hidden state to determine which elements of the cell state to "output".
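In the same notation as the other gates:

```latex
o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big) \\
h_t = o_t \odot \tanh(C_t)
```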

20 of 32

Overall LSTM Cell Architecture

21 of 32

Overall LSTM Computation
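A minimal NumPy sketch of one LSTM time step, using the weight naming from the slides (Wf, Wi, WC, Wo); the dimensions and random initialization are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step; each W acts on the concatenation [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    c_tilde = np.tanh(W_C @ z + b_C)      # candidate values
    c_t = f_t * c_prev + i_t * c_tilde    # updated cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# Illustrative sizes: input dim 3, hidden dim 4.
n_in, n_h = 3, 4
params = [rng.standard_normal((n_h, n_h + n_in)) * 0.1 for _ in range(4)]
biases = [np.zeros(n_h) for _ in range(4)]

h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):                        # run over a short input sequence
    x = rng.standard_normal(n_in)
    h, c = lstm_step(x, h, c, *params, *biases)

print(h.shape, c.shape)  # (4,) (4,)
```

Note that |h| is always below 1, since the output gate is in (0, 1) and tanh of the cell state is in (-1, 1).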

22 of 32

LSTM Training

  • Backpropagation Through Time (BPTT) is most common
  • What weights are learned?
    • The gates (forget/input/output) and the input tanh layer
    • Each cell has many parameters (Wf, Wi, WC, Wo)
  • Outputs depend on the task
    • Single output prediction for the whole sequence (text classification)
    • One output at each time step (sequence labeling)
  • Stochastic gradient descent (randomize order of examples in each epoch) with momentum (bias weight changes to continue in same direction as last update).
    • ADAM optimizer (Kingma & Ba, 2015)
  • Generally requires lots of training data and lots of compute time, exploiting GPU clusters.
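A sketch of the momentum update described above (the learning rate, momentum coefficient, and toy objective are illustrative, not from the slides):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """Classical momentum: bias the weight change to continue
    in the direction of the previous update."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2 (gradient 2w), starting from w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * w, v)

print(f"final w = {w:.4f}")  # w oscillates toward the minimum at 0
```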

23 of 32

LSTM Training

24 of 32

General Problems Solved with LSTMs

  • Sequence labeling
    • Train with supervised output at each time step computed using a single or multilayer network that maps the hidden state (ht) to an output vector (Ot).
  • Language modeling
    • Train to predict next input (Ot =It+1)
  • Sequence (e.g. text) classification
    • Train a single or multilayer network that maps the final hidden state (hn) to an output vector (O). 
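For illustration (W_out and b_out denote a hypothetical output layer, not parameters named in the slides), the two output mappings could be written as:

```latex
O_t = \mathrm{softmax}(W_{out}\, h_t + b_{out}) \quad \text{(one output per time step)} \\
O = \mathrm{softmax}(W_{out}\, h_n + b_{out}) \quad \text{(one output per sequence)}
```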

25 of 32

Sequence to Sequence Transduction (Mapping)

  • Encoder/Decoder framework maps one sequence to a "deep vector" then another LSTM maps this vector to an output sequence.
  • Train model "end to end" on I/O pairs of sequences.

26 of 32

Summary of LSTM Application Architectures

[Figure: example input/output architectures for image captioning, video activity recognition, text classification, video captioning, machine translation, POS tagging, and language modeling]

27 of 32

Successful Applications of LSTMs

  • Speech recognition: Language and acoustic modeling
  • Sequence labeling
  • Neural syntactic and semantic parsing
  • Image captioning: CNN output vector to sequence
  • Sequence to Sequence
    • Machine Translation (Sutskever, Vinyals, & Le, 2014)
    • Video Captioning (input sequence of CNN frame outputs)

28 of 32

Deep LSTMs

  • Deep LSTMs created by stacking multiple LSTM layers vertically, with the output sequence of one layer forming the input sequence of the next
  • Increases the number of parameters
    • but given sufficient data, performs significantly better than single-layer LSTMs (Graves et al. 2013)
  • Dropout usually applied only to non-recurrent edges, including between layers

29 of 32

Bi-directional LSTM (Bi-LSTM)

  • Data is processed in both directions with two separate hidden layers, which are then fed forward into the same output layer
  • Bidirectional RNNs can better exploit context in both directions,
    • The output may not only depend on the previous elements in the sequence, but also on future elements in the sequence
    • bidirectional LSTMs perform better than unidirectional ones in speech recognition (Graves et al. 2013)
  • It resembles two RNNs stacked on top of each other

[Figure: Bi-LSTM unrolled over inputs xt-1, xt, xt+1, with forward and backward hidden states combining into ht; outputs depend on both past and future elements]

30 of 32

Gated Recurrent Unit (GRU)

  • Alternative RNN to LSTM that uses fewer gates (Cho et al., 2014)
    • Combines the forget and input gates into a single “update” gate
    • Merges the cell state and hidden state, eliminating the separate cell state vector
  • This makes the GRU simpler than the LSTM. There are many other variants too.
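In the formulation of Cho et al. (2014) (bias terms omitted), the update gate zt, reset gate rt, and hidden state are:

```latex
z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big) \\
r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big) \\
\tilde{h}_t = \tanh\big(W \cdot [r_t \odot h_{t-1}, x_t]\big) \\
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```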

31 of 32

GRU vs. LSTM

  • GRUs also take xt and ht-1 as inputs
  • They perform some calculations and then pass along ht
  • What makes them different from LSTMs is that GRUs don't need a separate cell state to pass values along
  • The calculations within each iteration ensure that the ht values being passed along either retain a high amount of old information or are jump-started with a high amount of new information
  • GRUs have significantly fewer parameters and train faster
  • Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each works better on some problems

32 of 32

Conclusions

  • By adding “gates” to an RNN, we can prevent the vanishing/exploding gradient problem.
  • Trained LSTMs/GRUs can retain state information longer and handle long-distance dependencies.
  • Recent impressive results on a range of challenging NLP problems.