Sequences
Aram, Greg, Sami
CS 699. Representation Learning. Fall 2019
Agenda
Logistics
Projects
Agenda
Motivation
Why should you pay attention:
High-level Overview of RNNs
(I, like, this, movie)
positive
Sentiment Classifier
Variable-length output, a.k.a. Sequence-to-Sequence
POS Tagging
Translation
Each task has a different output type. Part of the model (the encoder) does not care about the output type; only the decoder does!
Classification: Fixed-length output
Output at every step [Same-length as input, like segmentation in vision]
Output at every step
Autoregressive
mujhe yah philm pasand hai (Hindi for "I like this movie")
(PN, V, A, N)
I like this movie
RNN Model Overview
Legend
(I, like, this, movie)
Word Embeddings
Zero vector
RNN Cell
State Vector
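A minimal sketch of the pieces in this legend, in numpy (all names and sizes here are illustrative, not from the slides): each word is looked up in an embedding table, the state vector starts as the zero vector, and the RNN cell combines the current input with the previous state.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "like": 1, "this": 2, "movie": 3}
d_emb, d_state = 8, 16

embed = rng.normal(size=(len(vocab), d_emb))      # word embeddings
W_x = 0.1 * rng.normal(size=(d_state, d_emb))     # input-to-state weights
W_h = 0.1 * rng.normal(size=(d_state, d_state))   # state-to-state weights
b = np.zeros(d_state)

def rnn_cell(x_t, h_prev):
    """One RNN cell: mix the current embedding with the previous state vector."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_state)                             # zero vector as the initial state
for word in ["I", "like", "this", "movie"]:
    h = rnn_cell(embed[vocab[word]], h)
# h is now a state vector summarizing the whole sentence.
```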
RNN Model Overview
(I, like, this, movie)
Zero vector
Encoder
RNN Model Overview
(I, like, this, movie)
Zero vector
Encoder
positive
Classification Decoder
Variable Length input → Fixed Length output
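In symbols (a sketch; W_out and b are illustrative classifier parameters and h_T is the encoder's final state), the fixed-length classification decoder reads only the last state:

```latex
\hat{y} = \mathrm{softmax}(W_{\mathrm{out}}\, h_T + b)
```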
RNN Model Overview
(I, like, this, movie)
Zero vector
Encoder
PN
V
A
N
POS Tagger (the same tagger is applied at every time step)
Variable Length input → Same Length output
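For same-length output, one tagging head with shared parameters (W_tag and b are illustrative names) is applied to the state at every time step:

```latex
\hat{y}_t = \mathrm{softmax}(W_{\mathrm{tag}}\, h_t + b), \qquad t = 1, \dots, T
```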
RNN Model Overview
(I, like, this, movie)
Zero vector
Encoder
Sentence in target language
… or conditioning information e.g. from another modality
Translation (variable-length) Decoder
Variable Length input → Variable Length output
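A minimal greedy-decoding sketch of such a variable-length (autoregressive) decoder. The names `decoder_step`, `start_id`, and `stop_id` are illustrative placeholders, not part of the slides:

```python
def greedy_decode(decoder_step, h_enc, start_id, stop_id, max_len=50):
    """Generate target tokens one at a time, feeding each prediction back in."""
    tokens, h, prev = [], h_enc, start_id   # decoder state initialized from the encoder
    for _ in range(max_len):
        logits, h = decoder_step(prev, h)   # one decoder step
        prev = int(logits.argmax())         # greedy choice (sampling also works)
        if prev == stop_id:                 # a special stop token ends generation
            break
        tokens.append(prev)
    return tokens
```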
RNNs are trained in batches!
If there is time and interest, we will come back to this. Copied from HW5 of CS 544, the master's NLP course.
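A minimal sketch of what batching entails (not the actual HW5 code): pad variable-length sequences into one array and keep a mask so padded positions can be ignored in the loss.

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Pad variable-length token-id lists into one array and build a mask."""
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True      # True = real token, False = padding
    return batch, mask

batch, mask = pad_batch([[4, 7, 2], [9, 1], [3, 5, 6, 8]])
# Per-token losses should be multiplied by the mask before averaging.
```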
Optional: Attention on Whiteboard!
[visual and temporal attention]
Agenda
Simple RNN
Predicts at every time step
Simple RNN Equations
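One standard way to write a simple (Elman) RNN that predicts at every time step; this notation is standard and may differ from the slide's own:

```latex
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad h_0 = \mathbf{0} \\
\hat{y}_t = \mathrm{softmax}(W_{hy} h_t + b_y)
```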
LSTMs [1 slide]
RNNs struggle to remember long sequences. The LSTM was proposed to address this: it has an explicit, differentiable mechanism to "remember" or "forget" information. (There is a good blog post on this.)
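For reference, the standard LSTM equations: the forget gate f_t, input gate i_t, and output gate o_t are the explicit, differentiable "remember/forget" mechanism ([.,.] denotes concatenation, ⊙ elementwise product):

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t = o_t \odot \tanh(c_t)
```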
RNNs conditioned on output rather than state
Strictly less powerful than the model on the previous slide
The edge h(t-1) → h(t) carries more information than o(t-1) → h(t)
o(t-1) is unlikely to contain the entire history, unless the user is certain that the output encodes it
According to the textbook, one advantage: you can use "teacher forcing" [next slide]
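The two design choices written as recurrences (f is the cell's update function; ŷ_{t-1} denotes the previous output):

```latex
\text{state-conditioned:}\quad h_t = f(h_{t-1}, x_t) \\
\text{output-conditioned:}\quad h_t = f(\hat{y}_{t-1}, x_t)
```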
Teacher Forcing
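A minimal sketch of teacher forcing during training: the decoder is fed the ground-truth previous token rather than its own prediction. `decoder_step` and `loss_fn` are illustrative placeholders.

```python
def teacher_forced_loss(decoder_step, loss_fn, h_enc, target_ids, start_id):
    """Sum of per-step losses when the gold previous token is always fed in."""
    h, total, prev = h_enc, 0.0, start_id
    for gold in target_ids:               # ground-truth output sequence
        logits, h = decoder_step(prev, h)
        total += loss_fn(logits, gold)
        prev = gold                       # teacher forcing: feed the gold token,
                                          # not the model's own argmax prediction
    return total
```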
Single Output
Side note: in the YouTube-8M paper [Abu-El-Haija et al., 2016], for video classification, we found it beneficial to add the video-level loss at more frames.
Another alternative for combining the sequence latents into a single prediction is "temporal attention": take a weighted average, with the weights coming from an "attention network".
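A minimal numpy sketch of that temporal-attention pooling: a small "attention network" (here just a single weight vector, an illustrative simplification) scores each time step, and the states are averaged with softmax weights.

```python
import numpy as np

def temporal_attention_pool(H, w_att):
    """H: (T, d) per-step states; w_att: (d,) attention-network parameters."""
    scores = H @ w_att                        # one scalar score per time step
    weights = np.exp(scores - scores.max())   # softmax over time steps
    weights /= weights.sum()
    return weights @ H                        # (d,) weighted average of the states

pooled = temporal_attention_pool(np.random.randn(4, 16), np.random.randn(16))
```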
Bidirectional RNN
Future utterances are useful for understanding the current one
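A minimal PyTorch sketch of a bidirectional recurrent encoder (sizes are illustrative); the left-to-right and right-to-left states are concatenated at every step.

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True,
              bidirectional=True)       # one pass forward in time, one backward

x = torch.randn(8, 10, 32)              # (batch, time, features)
outputs, _ = rnn(x)
print(outputs.shape)                    # (8, 10, 128): forward and backward states concatenated
```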
Gated Recurrent Units (GRU)
Half-way between RNN and LSTM
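For reference, the GRU equations: only two gates (update z_t and reset r_t) and no separate cell state, hence "half-way" in complexity.

```latex
z_t = \sigma(W_z [h_{t-1}, x_t]) \\
r_t = \sigma(W_r [h_{t-1}, x_t]) \\
\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t]) \\
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```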
Agenda
Backpropagation Through Time
[Figure: an RNN unrolled through time; at each step the input and previous state are concatenated, multiplied by the weights, and passed through σ.]
Backpropagation Through Time
[Figure: the same unrolled RNN diagram, continued across more time steps.]
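The key quantity in backpropagation through time, in the notation of the simple-RNN equations above: the gradient of the loss at step t with respect to an earlier state is a product of per-step Jacobians, which is also where the vanishing/exploding behaviour on the next slides comes from (a_i denotes the pre-activation at step i).

```latex
\frac{\partial \mathcal{L}_t}{\partial h_k}
  = \frac{\partial \mathcal{L}_t}{\partial h_t}
    \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}},
\qquad
\frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\big(\tanh'(a_i)\big)\, W_{hh}
```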
Vanishing/Exploding Gradients
Exploding/Vanishing Gradients
Is the contribution from early words larger than from later words? It depends on:
Researchers developed:
Vanishing & Exploding Gradients
Let γ be an upper bound on the derivative of the activation σ
Let the largest eigenvalue of W be λ1
If λ1γ > 1 : gradients can explode
If λ1γ < 1 : gradients will vanish
Proposed solution: gradient clipping (sketch below)
If you don't have a project yet and are into optimization: I have an idea!
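A minimal PyTorch sketch of gradient clipping inside a training step (the model and loss here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

outputs, _ = model(torch.randn(8, 10, 32))
loss = outputs.pow(2).mean()              # placeholder loss
loss.backward()

# Rescale gradients so their global norm is at most 1.0, then take the step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```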
Agenda
Contrast with FVSBN (Autoregressive Model)
Fully Visible Sigmoid Belief Net (see slides on AR models)
To output a variable-length sequence with an RNN, use a special "stop symbol" token. The symbol must be appended to all training examples.
Another option: dedicate a Bernoulli unit that indicates "stop", trained with cross-entropy. Useful e.g. when outputting a sequence of continuous numbers (see the sketch below).
RNN: the edges between the y's are removed; the hidden state can encode the past.
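A sketch of the second option: at each step the model emits a continuous value plus a Bernoulli "stop" probability, and generation halts when that unit fires. `step_fn` is an illustrative placeholder, not an API from the slides.

```python
def generate_continuous(step_fn, h0, max_len=100):
    """step_fn(h) -> (value, stop_prob, next_h); halts when the stop unit fires."""
    values, h = [], h0
    for _ in range(max_len):
        value, stop_prob, h = step_fn(h)
        values.append(value)
        if stop_prob > 0.5:        # the dedicated Bernoulli "stop" unit
            break
    return values
```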
Conditional Generation
vector-to-sequence
Case Study: Image Captioning
Show and Tell (2014) paper [link]
Can learn powerful representations using “Visual attention”
At every step, the decoder has access to the previous state plus the image features [as a list of cuboids], and uses attention over the cuboids.
Visual Results next slide
Sequence-to-Sequence
Input and output lengths can differ from each other
Training maximizes the conditional likelihood of the output given the input (spelled out below)
A stop symbol is inserted at the end of the input
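Spelled out, the objective maximized during training is the conditional log-likelihood of the output sequence given the input, factored autoregressively:

```latex
\max_\theta \sum_{(x, y)} \log p_\theta(y \mid x)
  = \sum_{(x, y)} \sum_{t=1}^{|y|} \log p_\theta\big(y_t \mid y_{<t},\, x\big)
```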
Agenda
Deep RNNs
Results: Deep RNNs
Negative log-likelihood on language modelling:
Negative log-likelihood on music modelling:
Large Number of Classes?
The output layer is often the size of the vocabulary (hundreds of thousands of words?). How can we build such an output layer?
Hierarchical Softmax (right figure)
The tree can be "random", or built using some ontology, e.g. WordNet.
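A minimal sketch of the idea: put the vocabulary at the leaves of a binary tree and model each word's probability as a product of left/right decisions along its path, so scoring one word costs O(log V) instead of O(V). The toy tree and parameter names below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_log_prob(h, path, node_vectors):
    """h: (d,) hidden state; path: (node_id, go_left) decisions from root to the word's leaf."""
    logp = 0.0
    for node_id, go_left in path:
        p_left = sigmoid(node_vectors[node_id] @ h)   # binary decision at this tree node
        logp += np.log(p_left if go_left else 1.0 - p_left)
    return logp

# Toy 4-word vocabulary: root (node 0) with two internal children (nodes 1 and 2).
node_vectors = np.random.randn(3, 16)
paths = {"the": [(0, True), (1, True)],  "cat": [(0, True), (1, False)],
         "sat": [(0, False), (2, True)], "mat": [(0, False), (2, False)]}
h = np.random.randn(16)
print({w: word_log_prob(h, p, node_vectors) for w, p in paths.items()})
# The four probabilities sum to 1 by construction.
```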
Transformer Models
Attention is All You Need: Model Architecture
Model illustration (http://jalammar.github.io/illustrated-transformer/)
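The core operation illustrated there, scaled dot-product attention from "Attention Is All You Need", fits in a few lines of numpy:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values -> (n, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                  # weighted sum of the values

out = scaled_dot_product_attention(np.random.randn(4, 8),
                                   np.random.randn(6, 8),
                                   np.random.randn(6, 16))   # shape (4, 16)
```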
Embed words or Characters?
[AAAI 2019]: they use a Transformer model with 64 layers
Thank you!
If there is interest and time, we will talk about coding RNNs
Office hours: Same time and place (Leavey 2-3:20 PM)