1 of 48

Sequences

Aram, Greg, Sami

CS 699. Representation Learning. Fall 2019

2 of 48

Agenda

  • Motivation & Applications
  • Overview of RNN Configurations
  • Vanishing/Exploding Gradient
  • RNNs as Generative Models
  • Deep RNNs and large number of classes


3 of 48

Logistics

  • HW: Who has made significant progress?
  • Beyond Assignments [i.e. projects]
    • Who does not have a partner? Who did not make an appointment with me?
    • Tentative Deadline: December 13.
      • Submit report: ~4 pages, single column (e.g. the NeurIPS style file), summarizing your work (application, dataset, model, objective, training) and your results (e.g. accuracy, log-likelihood, visualizations).
    • Met with a lot of groups! AMAZING IDEAS!!! They cover a range of:
      • Invertible functions
      • Multimodal data [graphs+text+language+time+audio+location]
      • Human behavior [Social Network, Eye Gaze]
      • Geometry and 3D Rendering
      • Learning to play Games by Modeling Physics
    • Do you want to hear about them?


4 of 48

Projects

  • Sign up to present
    • ~10-minute presentation + ~7-minute discussion/Q&A.
    • 5 or 6 groups per session.
    • No need to get “results” by the presentation.
    • Focus on:
      • The task & the datasets.
      • What some existing works are.
      • What your proposal is.
      • That’s it: Get feedback from others about potential ideas.


5 of 48

Agenda

  • Motivation & Applications
  • Overview of RNN Configurations
  • Vanishing/Exploding Gradient
  • RNNs as Generative Models
  • Deep RNNs and large number of classes


6 of 48

Motivation

Why should you pay attention:

  • We will go over Recurrent Neural Networks (RNNs), which can model sequences
  • Sequences are everywhere!
    • Language
    • Speech
    • Videos
    • RL: Trajectories / Planning
  • You can apply them to non-sequences:
    • GraphSAGE applies an RNN as a “graph pooling layer” (aggregating a node’s neighbors).
    • I’ve seen work where RNNs are trained on top of images (4 RNNs, one in each direction)


7 of 48

High-level Overview of RNNs

[Figure: the example input (I, like, this, movie) feeding three kinds of models.]

  • Sentiment Classifier → “positive”: classification, i.e. fixed-length output.
  • POS Tagging → (PN, V, A, N): an output at every step, same length as the input (like segmentation in vision).
  • Translation → “mujhe yah philm pasand hai”: variable-length output, aka sequence-to-sequence; the decoder is autoregressive.

Each task has a different output type: part of the model (the encoder) does not care about the output type; only the decoder does!

8 of 48

RNN Model Overview

[Figure: legend for the following diagrams, shown on the input (I, like, this, movie): word embeddings, zero vector (initial state), RNN cell, state vector.]

9 of 48

RNN Model Overview

[Figure: an RNN encoder reads (I, like, this, movie), starting from a zero initial state vector.]

10 of 48

RNN Model Overview

[Figure: the encoder reads (I, like, this, movie) from a zero initial state; a classification decoder maps the final state to “positive”.]

Variable-length input → fixed-length output

11 of 48

RNN Model Overview

[Figure: the encoder reads (I, like, this, movie) from a zero initial state; a POS-tagger decoder is applied at every step, producing PN, V, A, N.]

Variable-length input → same-length output

12 of 48

RNN Model Overview

[Figure: the encoder reads (I, like, this, movie) from a zero initial state; a translation (variable-length) decoder emits the sentence in the target language. The final encoder state can also carry conditioning information, e.g. from another modality.]

Variable-length input → variable-length output

13 of 48

RNNs are trained in batches!

If there is time and interest, we will come back to this. Copied from HW5 of CS544: Masters NLP course


14 of 48

Optional: Attention on Whiteboard!

[visual and temporal attention]


15 of 48

Agenda

  • Motivation & Applications
  • Overview of RNN Configurations
  • Vanishing/Exploding Gradient
  • RNNs as Generative Models
  • Deep RNNs and large number of classes


16 of 48

Simple RNN

Predicts at every time step


17 of 48

Simple RNN Equations
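The equations themselves did not survive the export; for reference, this is the standard simple-RNN formulation with a prediction at every step (notation follows the deeplearning book; the original slide's symbols may differ):

  h^{(t)} = \tanh( W h^{(t-1)} + U x^{(t)} + b )
  o^{(t)} = V h^{(t)} + c
  \hat{y}^{(t)} = \mathrm{softmax}( o^{(t)} )

Here W is the recurrent (state-to-state) matrix, U the input-to-state matrix, and V the state-to-output matrix.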


18 of 48

LSTMs [1 slide]

RNNs suffer from not remembering long sequences; the LSTM was proposed to address this. It has an explicit, differentiable mechanism to “remember” or “forget”. [Good blog]
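For reference (not on the original slide), the standard LSTM cell, whose input, forget, and output gates are that differentiable “remember/forget” mechanism:

  f^{(t)} = \sigma( W_f [h^{(t-1)}, x^{(t)}] + b_f )        (forget gate)
  i^{(t)} = \sigma( W_i [h^{(t-1)}, x^{(t)}] + b_i )        (input gate)
  o^{(t)} = \sigma( W_o [h^{(t-1)}, x^{(t)}] + b_o )        (output gate)
  \tilde{c}^{(t)} = \tanh( W_c [h^{(t-1)}, x^{(t)}] + b_c )
  c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}
  h^{(t)} = o^{(t)} \odot \tanh( c^{(t)} )

The additive update of the cell state c^{(t)} is what lets gradients flow across many steps.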


19 of 48

RNNs conditioned on output rather than state

Strictly less powerful than the model on the previous slide:

the edge h(t-1) → h(t) is more expressive than o(t-1) → h(t),

because o(t-1) is unlikely to contain the entire history, unless the user is certain that the output encodes it.

According to the textbook, one advantage: you can use “teacher forcing” [next slide].


20 of 48

Teacher Forcing

  • Advantage: allows parallel training.
    • Converges fast. You can also use it as pre-training.
  • Drawback: big mismatch between “training and testing”.
  • Mitigated with “Scheduled Sampling” [Bengio et al, 2015].
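A minimal PyTorch-style sketch of the difference (all names here, embed, cell, out, decode_loss, are illustrative, not from the slides): with sample_prob = 0 the decoder is always fed the ground-truth previous token (teacher forcing); with sample_prob > 0 it is sometimes fed its own prediction (scheduled sampling).

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, emb_dim, hid = 1000, 32, 64          # hypothetical sizes
embed = nn.Embedding(vocab, emb_dim)
cell = nn.LSTMCell(emb_dim, hid)
out = nn.Linear(hid, vocab)

def decode_loss(targets, h, c, sample_prob=0.0):
    # targets: (T,) ground-truth token ids; h, c: (1, hid) initial decoder state.
    # sample_prob = 0.0  -> pure teacher forcing.
    # sample_prob > 0.0  -> scheduled sampling [Bengio et al, 2015].
    inp = torch.zeros(1, dtype=torch.long)    # assume id 0 is a <start> token
    loss = 0.0
    for t in range(len(targets)):
        h, c = cell(embed(inp), (h, c))
        logits = out(h)                                    # (1, vocab)
        loss = loss + F.cross_entropy(logits, targets[t:t+1])
        if random.random() < sample_prob:
            inp = logits.argmax(dim=-1)                    # feed the model's own prediction
        else:
            inp = targets[t:t+1]                           # feed the ground truth (teacher forcing)
    return loss / len(targets)

With pure teacher forcing all decoder inputs are known in advance, so the whole sequence can be fed to the RNN in one call; that is the “allows parallel training” point above.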


21 of 48


22 of 48

Single Output

Side note: in the YouTube-8M paper [Abu-El-Haija et al, 2016], for video classification, we found it beneficial to add the video-level loss at more frames, not only the last one.

Another way to combine the per-step latents into a single prediction is “temporal attention”: take a weighted average, with the weights coming from an “attention network” (formula below).

[Ramanathan et al, CVPR 2016]
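Written out, the temporal-attention pooling mentioned above (symbols are mine): an attention network a(·) scores each step's latent h_t, the scores are softmax-normalized, and the single prediction is made from the weighted average:

  \alpha_t = \frac{ \exp( a(h_t) ) }{ \sum_{\tau=1}^{T} \exp( a(h_\tau) ) },
  \qquad \bar{h} = \sum_{t=1}^{T} \alpha_t h_t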


23 of 48

Bidirectional RNN

Future context (later words/utterances) is also useful for understanding the current step.
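A minimal formulation (assuming the usual concatenation of the two directions): run one RNN left-to-right and another right-to-left, then combine their states at each step:

  \overrightarrow{h}_t = f( \overrightarrow{h}_{t-1}, x_t ), \qquad
  \overleftarrow{h}_t = g( \overleftarrow{h}_{t+1}, x_t ), \qquad
  h_t = [\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,]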


24 of 48

Gated Recurrent Units (GRU)

Half-way between RNN and LSTM
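For reference (not on the original slide), the standard GRU update, with a reset gate r and an update gate z instead of the LSTM's three gates (conventions differ on whether z or 1-z multiplies the old state; this follows Cho et al, 2014):

  z^{(t)} = \sigma( W_z [h^{(t-1)}, x^{(t)}] )
  r^{(t)} = \sigma( W_r [h^{(t-1)}, x^{(t)}] )
  \tilde{h}^{(t)} = \tanh( W [\, r^{(t)} \odot h^{(t-1)},\, x^{(t)} \,] )
  h^{(t)} = (1 - z^{(t)}) \odot h^{(t-1)} + z^{(t)} \odot \tilde{h}^{(t)}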


25 of 48

Agenda

  • Motivation & Applications
  • Overview of RNN Configurations
  • Vanishing/Exploding Gradient
  • RNNs as Generative Models
  • Deep RNNs and large number of classes


26 of 48

Backpropagation Through Time

  • How to calculate ∂L/∂W?
  • Treat W as if it were different variables W(1), …, W(T), one copy per time step.

[Figure: the RNN unrolled through time; each cell concatenates (h(t-1), x(t)), multiplies by W, and applies the nonlinearity σ.]


27 of 48

Backpropagation Through Time

  • How to calculate ∂L/∂W?
  • Treat W as if it were different variables W(1), …, W(T).
  • Feed them the same value!
  • Calculate: ∂L/∂W = Σ_t ∂L/∂W(t).

[Figure: the same unrolled graph as on the previous slide; each cell concatenates (h(t-1), x(t)), multiplies by W, and applies σ.]
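A toy NumPy sketch of exactly this “treat W as copies W(1), …, W(T), then sum their gradients” computation (the loss L = 0.5 Σ_t ||h_t||^2 and all sizes are made up, just so there is something to differentiate):

import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
W = rng.normal(scale=0.1, size=(d_h, d_h))
U = rng.normal(scale=0.1, size=(d_h, d_in))
xs = rng.normal(size=(T, d_in))

# Forward pass: keep all states, since the backward pass needs them.
hs = [np.zeros(d_h)]
for t in range(T):
    hs.append(np.tanh(W @ hs[-1] + U @ xs[t]))

# Backward pass: dL/dW is the SUM over time of the gradients w.r.t. the copies W(t).
dW, dU = np.zeros_like(W), np.zeros_like(U)
dh = np.zeros(d_h)                       # gradient arriving from future time steps
for t in reversed(range(T)):
    dh = dh + hs[t + 1]                  # direct dL/dh_t from the toy loss
    da = dh * (1.0 - hs[t + 1] ** 2)     # back through tanh
    dW += np.outer(da, hs[t])            # contribution of copy W(t)
    dU += np.outer(da, xs[t])
    dh = W.T @ da                        # pass gradient to h_{t-1}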


28 of 48

Vanishing/Exploding Gradients

  • Which term contributes more to the sum?
  • Which terms dominate the gradient?
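What makes some terms dominate is the product of Jacobians connecting an early state to the loss at a later step (standard BPTT identity; for h^{(k)} = σ(W h^{(k-1)} + U x^{(k)} + b) with pre-activation a^{(k)}):

  \frac{\partial h^{(T)}}{\partial h^{(t)}}
  = \prod_{k=t+1}^{T} \frac{\partial h^{(k)}}{\partial h^{(k-1)}}
  = \prod_{k=t+1}^{T} \mathrm{diag}( \sigma'(a^{(k)}) )\, W

Its norm tends to shrink or grow roughly like (γ λ1)^{T-t}, which connects to the Pascanu et al condition two slides ahead.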


29 of 48

Exploding/Vanishing Gradients

Is the contribution from early words larger than from later words? It depends on:

  • The norm of x (i.e. activation)
  • The derivative of element-wise activation
  • Values of W (i.e. initialization)

Researchers developed:

  • Good initialization.
  • Good activation (e.g. tanh).
  • LayerNorm, BatchNorm, WeightNorm.
  • Residual (skip) Connections


30 of 48

Vanishing & Exploding Gradients

  • Pascanu et al 2012 On the difficulty of training RNNs

Let γ be an upper bound on the derivative of the activation σ.

Let the largest eigenvalue of W be λ1.

If λ1 γ > 1 : gradients can explode.

If λ1 γ < 1 : gradients will vanish.

Proposed solution: Gradient clipping
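A minimal sketch of clipping by global norm (function name and threshold are mine; frameworks ship equivalents):

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm.
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

Apply it to the gradients right before the parameter update.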

If you don’t have a project yet and are into optimization: I have an idea!


31 of 48

Agenda

  • Motivation & Applications
  • Overview RNN Configurations
  • Vanishing/Exploding Gradient
  • RNNs as Generative Models
  • Deep RNNs and large number of classes


32 of 48

Contrast with FVSBN (Autoregressive Model)

Fully Visible Sigmoid Belief Net (see slides on AR models)

To output a sequence via an RNN, use a “stop symbol” as a special token. The symbol must be appended to all training examples.

Another option: dedicate a Bernoulli unit that indicates “stop”, trained with cross-entropy. Useful e.g. when outputting a sequence of continuous numbers.

RNN: the edges between the y’s are removed; the hidden state can encode the past.
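A minimal sketch of generating with the stop symbol (everything here is a placeholder, in particular the step function that wraps the trained RNN cell and output softmax):

import numpy as np

def sample_sequence(step, h0, start_id, stop_id, max_len=50, seed=0):
    # step(h, token_id) -> (next_h, probs), with probs a distribution over the vocabulary.
    rng = np.random.default_rng(seed)
    h, tok, out = h0, start_id, []
    for _ in range(max_len):
        h, probs = step(h, tok)
        tok = int(rng.choice(len(probs), p=probs))
        if tok == stop_id:            # the special stop token ends generation
            break
        out.append(tok)
    return out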


33 of 48

Conditional Generation

vector-to-sequence


34 of 48

Case Study: Image Captioning

Show and Tell (2014) paper [link]

Can learn powerful representations using “visual attention”.


35 of 48

At every step, the decoder has access to the previous state plus the image features [as a list of cuboids], and uses attention over the cuboids.

Visual Results next slide


36 of 48


37 of 48

Sequence-to-Sequence

Input and output lengths can differ from each other.

Training maximizes the conditional log-likelihood of the output sequence given the input (written out below).

A stop symbol is inserted at the end of the input.
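Written out (standard sequence-to-sequence objective; the slide's own formula was lost in export): for input x_{1:T} and output y_{1:T'},

  \max_\theta \; \log P_\theta( y_1, \dots, y_{T'} \mid x_1, \dots, x_T )
  = \sum_{t=1}^{T'} \log P_\theta( y_t \mid y_{<t},\, x_1, \dots, x_T )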


38 of 48

Agenda

  • Motivation & Applications
  • Overview of RNN Configurations
  • Vanishing/Exploding Gradient
  • RNNs as Generative Models
  • Deep RNNs and large number of classes


39 of 48

Deep RNNs

  • “How to Construct Deep Recurrent Neural Networks”, Pascanu et al 2013
    • How can we make an RNN deep?
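One of the constructions discussed in the paper, and the most common in practice, is stacking: layer ℓ at time t takes the state of layer ℓ-1 at the same time step as its input,

  h^{(\ell)}_t = f( W^{(\ell)} h^{(\ell)}_{t-1} + U^{(\ell)} h^{(\ell-1)}_t ),
  \qquad h^{(0)}_t = x_t

(The paper also considers making the input-to-hidden, hidden-to-hidden, and hidden-to-output functions themselves deep.)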


40 of 48

Results: Deep RNNs

[Results tables: negative log-likelihood on language modelling and on music modelling.]


41 of 48

Large Number of Classes?

The output layer is often the size of the vocabulary (hundreds of thousands of classes). How can we afford such an output layer?

  • Hierarchical Softmax [paper, AISTATS 2005]
  • Sampled Softmax [paper, 2014]
  • Negative Sampling [like word2vec’s NEG objective; see the sketch after this list]
  • Long list on https://ruder.io/word-embeddings-softmax/ and also Chapter 12.4 of deeplearning book
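A minimal sketch of the negative-sampling option (names are illustrative; the objective is word2vec's NEG, used as a cheap stand-in for a full softmax over the vocabulary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_loss(h, out_emb, target_id, neg_ids):
    # h: (d,) hidden/context vector.  out_emb: (V, d) output embedding table.
    # target_id: the true next word.  neg_ids: K word ids sampled from a noise distribution.
    pos = np.log(sigmoid(out_emb[target_id] @ h))            # pull the true word's score up
    neg = np.sum(np.log(sigmoid(-(out_emb[neg_ids] @ h))))   # push the K sampled words down
    return -(pos + neg)                                      # cost is O(K) instead of O(V)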


42 of 48

Hierarchical Softmax (right figure)

Tree can be “random”, or using some ontology e.g. WordNet.
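The idea, written out (standard formulation; the tree in the slide's figure is what defines the paths): each word is a leaf, and its probability is the product of the branching decisions along its path,

  P(w \mid h) = \prod_{n \in \mathrm{path}(w)} \sigma( \pm\, v_n^\top h ),

with one sigmoid per internal node n (sign chosen by which child is taken). A balanced tree reduces the cost of computing one word's probability from O(|V|) to O(log |V|).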


43 of 48

Transformer Models

  • Introduced self-attention.
  • Achieve state-of-the-art on many tasks (e.g. translation, part-of-speech tagging).
  • Inspired GAT (Graph Attention Networks).
  • The basis of BERT [state-of-the-art embeddings].
  • Used for character-level modeling [the embedding dictionary contains only single characters].


44 of 48

  • Rather than propagating information forward in time,
  • process all words simultaneously in parallel, then let them all attend to each other (self-attention).
  • Encode position through a harmonic series so that words carry relative-distance information (formula below).
  • Attention visualization →
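The sinusoidal (“harmonic series”) position encoding from Vaswani et al, 2017, added to the word embeddings:

  PE_{(pos,\, 2i)} = \sin( pos / 10000^{2i/d_{\mathrm{model}}} ), \qquad
  PE_{(pos,\, 2i+1)} = \cos( pos / 10000^{2i/d_{\mathrm{model}}} )

For any fixed offset k, PE at position pos+k is a linear function of PE at position pos, which is what gives words relative-distance information.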


45 of 48

Attention is All You Need: Model Architecture


46 of 48


47 of 48

Embed words or Characters?

[AAAI 2019]: they use a Transformer model with 64 layers


48 of 48

Thank you!

If there is interest and time, we will talk about coding RNNs.

Office hours: same time and place (Leavey, 2-3:20 PM)
