1 of 38

Unsupervised Learning of Video Representations using LSTMs

Seyed Kamyar Seyed Ghasemipour

2 of 38

They’re important & they’re everywhere

  • Representations contain all the information we have about a piece of data
  • Success of methods highly dependent on the data representation used

3 of 38

NLP:

  • Word and Sentence Representations:

Word2Vec

Skip-Thought Vectors

4 of 38

Vision:

  • SIFT Features
  • HOG Features
  • SURF Features
  • “[A-Z]+ Features”

Visual Words

5 of 38

What makes a representation good?

6 of 38

What makes a representation good?

  • Representative of the data
  • Distributed representations
  • Hierarchy of features
  • Disentangle many factors of variation
  • Semi-supervised learning
  • ….

7 of 38

Representations for Videos

8 of 38

Motivation

  • Videos are a rich source of information
  • What constitutes an object: object boundaries, occlusion
  • Information regarding behaviour:
    • How objects, animals, humans behave and deform
  • Intuitive physics

9 of 38

Motivation

  • Videos are a rich source of information
  • What constitutes an object: object boundaries, occlusion
  • Information regarding behaviour:
    • How objects, animals, humans behave and deform
  • Intuitive physics

source: http://meowgifs.com/wp-content/uploads/2013/10/cat-jumps-into-bean-bag.gif

10 of 38

Motivation

  • Information regarding behaviour:
    • How objects, animals, humans behave and deform

[Figure: still frames labelled "Cat 1", "Cat 2", "Cat 3" and a video of Cat 1, contrasting Images vs. Video]

source: screenshots from https://www.youtube.com/watch?v=okOVxfuSYPk

11 of 38

Motivation

  • Videos are a rich source of information
  • What constitutes an object: object boundaries, occlusion
  • Information regarding behaviour:
    • How objects, animals, humans behave
  • Intuitive physics:
    • Understand that objects can be seen from different viewpoints

12 of 38

Related Work: Video (Language) Modelling: A Baseline for Generative Models of Natural Video (Ranzato et al., 2014)

  • Defining a good loss function is hard (more on this towards the end)
  • Quantize 8x8 patches using k-means clustering, 10K clusters
  • Feed model a patch and its neighborhood across time
  • Two Tasks:
    • Predicting future frames
    • Interpolating in-between frames

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
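A rough sketch of the quantization step described above, assuming grayscale frames and scikit-learn's MiniBatchKMeans; the function names and parameters below are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_patch_codebook(frames, patch=8, n_clusters=10000, seed=0):
    """Sketch: quantize 8x8 frame patches into a 10K-word codebook with k-means.

    frames: array of shape (N, H, W) holding grayscale video frames.
    """
    H, W = frames.shape[1:]
    patches = [f[i:i + patch, j:j + patch].ravel()
               for f in frames
               for i in range(0, H - patch + 1, patch)
               for j in range(0, W - patch + 1, patch)]
    patches = np.asarray(patches, dtype=np.float32)
    # MiniBatchKMeans keeps fitting 10K clusters over many patches tractable.
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096, random_state=seed)
    return km.fit(patches)

def quantize_patch(km, patch_pixels):
    """Map one 8x8 patch to its discrete codebook id (its 'visual word')."""
    return int(km.predict(patch_pixels.reshape(1, -1).astype(np.float32))[0])
```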

13 of 38

Related Work: Video (Language) Modelling: ...

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).

[Figures: model diagram; constant optical flow assumption]

14 of 38

Related Work: Video (Language) Modelling: ...

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).

[Figures: model diagram; linear interpolation of optical flow; linear interpolation in pixel space]

15 of 38

Representations for Videos

16 of 38

Model

  • Important Decision: Model Design Choices & Loss Function
  • LSTM Encoder-Decoder model
  • Output of the Model:
    • Reconstruction
    • Future Prediction
    • Both
  • Decoder:
    • Conditioned
    • Unconditioned
  • Input of the Model:
    • Image Patches
    • Extracted Features

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
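To make these choices concrete, here is a minimal PyTorch-style sketch of the composite variant (one encoder LSTM feeding unconditioned reconstruction and future-prediction decoders), assuming flattened frames or percepts as inputs; the layer sizes and names are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CompositeLSTMAutoencoder(nn.Module):
    """Encoder LSTM feeds its final state to two decoder LSTMs:
    one reconstructs the input (in reverse), one predicts future frames."""

    def __init__(self, frame_dim, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.pred_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_out = nn.Linear(hidden_dim, frame_dim)
        self.pred_out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, inputs, n_future):
        # inputs: (batch, T, frame_dim) of flattened frames or percepts.
        batch, T, frame_dim = inputs.shape
        _, (h, c) = self.encoder(inputs)          # the learned video representation

        # Unconditioned decoders: fed zeros at every step, the state carries the content.
        zeros_in = inputs.new_zeros(batch, T, frame_dim)
        recon_h, _ = self.recon_decoder(zeros_in, (h, c))
        recon = self.recon_out(recon_h)           # target: the input frames in reverse order

        zeros_fut = inputs.new_zeros(batch, n_future, frame_dim)
        pred_h, _ = self.pred_decoder(zeros_fut, (h, c))
        pred = self.pred_out(pred_h)               # target: the next n_future frames
        return recon, pred
```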

17 of 38

Model Design Choices: Output

Input Reconstruction

  • LSTM asked to reconstruct the input it received
  • Representation needs to remember information about:
    • Objects in the video
    • Background
    • Motion in the video
  • Con: It might tend to memorize the input

Both

  • Have the LSTM both reconstruct the input it received and predict future frames of the video
  • Hope is to relieve the potential cons of each approach

Future Prediction

  • LSTM asked to predict the future frames of the video sequence
  • Representation needs to remember information about:
    • Objects in the video
    • Background
    • Motion in the video
  • Con: It might tend to only care about the last few frames of the input

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
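For the "Both" option, a training step would simply add the two losses of the composite model sketched above; a minimal illustration, assuming `model`, `past_frames`, and `future_frames` are already defined and the pixel targets are binary:

```python
import torch.nn.functional as F

recon, pred = model(past_frames, n_future=future_frames.shape[1])
loss = (F.binary_cross_entropy_with_logits(recon, past_frames.flip(1))  # reconstruct the input, reversed in time
        + F.binary_cross_entropy_with_logits(pred, future_frames))      # predict the future frames
loss.backward()
```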

18 of 38

Model Design Choices: (Un)/Conditioned

Conditioned

  • “It allows the decoder to model multiple modes in the target sequence”
  • This is useful when predicting the future frames of the video
  • Given what has been seen so far, the future is still unpredictable

Unconditioned

  • The difference between consecutive frames is small
  • When conditioned on its previously generated frame, the model relies more on these short-range correlations
  • It may then neglect the long-term structure it actually needs

To condition, or not to condition, that is the question.
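A minimal sketch of what conditioning means at decode time, assuming `cell` is an `nn.LSTMCell(frame_dim, hidden_dim)` and `out_layer` an `nn.Linear(hidden_dim, frame_dim)`; this is an illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

def decode(cell, out_layer, h, c, first_input, n_steps, conditioned=True):
    """Roll out an LSTMCell decoder from an encoder state (h, c).

    conditioned=True  : each step is fed the frame generated at the previous step.
    conditioned=False : each step is fed a zero vector; only the state evolves.
    """
    frames, x = [], first_input
    for _ in range(n_steps):
        h, c = cell(x, (h, c))
        frame = torch.sigmoid(out_layer(h))        # e.g. logistic outputs for MNIST pixels
        frames.append(frame)
        x = frame if conditioned else torch.zeros_like(first_input)
    return torch.stack(frames, dim=1)              # (batch, n_steps, frame_dim)
```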

19 of 38

Evaluation

20 of 38

Qualitative Analysis: Moving MNIST

  • 2 digits moving inside a 64x64 image
  • Velocity and direction assigned uniformly at random
  • Duration: 20 frames
  • Encoder sees 10 frames
  • Model asked to:
    • Reconstruct 10 frames
    • Predict next 10 frames
  • Logistic output, Cross-Entropy Loss

source: https://github.com/emansim/unsupervised-videos
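A rough sketch of how such sequences can be generated (two digits bouncing inside a 64x64 canvas); the speed range and bounce logic are my assumptions about the standard setup, not the authors' exact generator:

```python
import numpy as np

def make_moving_mnist(digits, seq_len=20, size=64, max_speed=4, seed=None):
    """digits: list of two 28x28 arrays. Returns frames of shape (seq_len, size, size)."""
    rng = np.random.default_rng(seed)
    d = digits[0].shape[0]                               # digit side length (28)
    pos = rng.uniform(0, size - d, size=(len(digits), 2))
    theta = rng.uniform(0, 2 * np.pi, size=len(digits))  # direction, uniform at random
    speed = rng.uniform(2, max_speed, size=len(digits))  # speed, uniform at random (assumed range)
    vel = np.stack([np.cos(theta), np.sin(theta)], axis=1) * speed[:, None]

    frames = np.zeros((seq_len, size, size), dtype=np.float32)
    for t in range(seq_len):
        for k, digit in enumerate(digits):
            x, y = pos[k].astype(int)
            frames[t, y:y + d, x:x + d] = np.maximum(frames[t, y:y + d, x:x + d], digit)
        pos += vel
        # Bounce off the canvas edges.
        for k in range(len(digits)):
            for a in range(2):
                if pos[k, a] < 0 or pos[k, a] > size - d:
                    vel[k, a] *= -1
                    pos[k, a] = np.clip(pos[k, a], 0, size - d)
    return frames
```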

21 of 38

Qualitative Analysis: Moving MNIST

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

22 of 38

Qualitative Analysis: Natural Images

  • UCF-101 dataset:
    • 13,320 videos
    • Avg. Length: 6.2 seconds
    • 101 different action categories
    • 32 x 32 patches extracted from videos
  • Encoder sees 16 frames
  • Model asked to:
    • Reconstruct 16 frames
    • Predict 13 future frames
  • Linear output units, MSE Loss

source: https://github.com/emansim/unsupervised-videos

23 of 38

Qualitative Analysis: Natural Images

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

24 of 38

Qualitative Analysis: Generalization Over Time

  • MNIST 2 digits task
  • Training:
    • Encoder sees 10 frames
    • Model predicts 10 frames
  • Test Time:
    • Encoder sees 10 frames
    • Model predicts 100 frames

source: http://www.cs.toronto.edu/~nitish/unsupervised_video/
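Reusing the hypothetical `decode` helper from the earlier sketch, the test-time setup is just a longer unroll of the same trained decoder; `encode_first_ten` is a placeholder for whatever produces the encoder's final state:

```python
# Trained to predict 10 future frames; at test time, unroll the same decoder for 100.
h, c = encode_first_ten(video[:, :10])                    # placeholder encoder call
rollout = decode(cell, out_layer, h, c,
                 first_input=video[:, 9], n_steps=100, conditioned=True)
```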

25 of 38

Qualitative Analysis: Generalization Over Time

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

26 of 38

Qualitative Analysis: Out of Domain Inputs

  • MNIST 2 digits task
  • Test on frames that contain 1 or 3 digits instead

  • Authors’ Hypothesis:
    • Attention mechanism
    • Variable amounts of computation

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

27 of 38

Qualitative Analysis: Visualizing Features

  • MNIST 2 digits task
  • Visualizing the weights of the learned model

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
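One way such visualizations can be produced (an assumption on my part, not necessarily the authors' exact procedure): reshape rows of the encoder LSTM's input-to-hidden weight matrix back into 64x64 images, reusing the composite model sketched earlier:

```python
import matplotlib.pyplot as plt

# nn.LSTM stores its input-to-hidden weights as weight_ih_l0, shape (4*hidden_dim, frame_dim).
w = model.encoder.weight_ih_l0.detach().cpu().numpy()
plt.imshow(w[0].reshape(64, 64), cmap="gray")   # one row shown as a 64x64 input filter
plt.axis("off")
plt.show()
```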

28 of 38

Quantitative Analysis: Supervised Tasks

  • Task: Action Recognition
  • Datasets:
    • UCF-101: mentioned in previous slides
    • HMDB-51:
      • 5100 videos
      • 51 different action categories
      • Avg. Length: 3.2 seconds
  • Representation Learning:
    • Two layer composite model with 2048 hidden units
    • Dataset:
      • Sports-1M, 1 million YouTube Videos, 240x320 dimensions
      • Used a 300-hour subset of YouTube videos
      • Sampled 10 second clips at 30 fps
      • “Percepts”: Features from fully connected layer of a convnet
    • Model sees 16 frames
    • Reconstructs 16, Predicts 13

29 of 38

Quantitative Analysis: Supervised Tasks

1. Pretrain the composite LSTM model on unlabelled video (previous slide)

2. Initialize an LSTM classifier with the encoder weights

3. Compare to a randomly initialized LSTM classifier

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
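A minimal PyTorch-style sketch of that comparison, assuming `model` is the composite autoencoder from the earlier sketch after unsupervised training and `percept_dim` is the dimensionality of the convnet percepts; names and sizes are illustrative:

```python
import torch.nn as nn

class ActionClassifier(nn.Module):
    """LSTM over percepts, classifying the action from the last hidden state."""
    def __init__(self, percept_dim, hidden_dim, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(percept_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                       # x: (batch, T, percept_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

# Pretrained initialization: copy the unsupervised encoder's weights into the classifier LSTM.
clf_pretrained = ActionClassifier(percept_dim, hidden_dim=2048, n_classes=101)
clf_pretrained.lstm.load_state_dict(model.encoder.state_dict())

# Baseline: an identical classifier left at its random initialization.
clf_random = ActionClassifier(percept_dim, hidden_dim=2048, n_classes=101)
```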

30 of 38

Quantitative Analysis: Supervised Tasks

  • Improvements when there is not much labelled data
    • UCF-101: 1 labelled video per class, Boost from 29.6% to 34.3%
    • HMDB-51: 1 labelled video per class, Boost from 14.4% to 19.1%

  • UCF-101 full contains 70 videos per class

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

31 of 38

Quantitative Analysis: Other Action Recognizers

  • Only RGB:
    • 3D Convolutions
    • Long Term Recurrent CNNs
  • Only Flow Features:
    • 3D Convolutions
    • Long Term Recurrent CNNs
  • Both:
    • 3D Convolutions
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

32 of 38

Quantitative Analysis: Different Types of Models

  • Compare performance of different model design choices
  • Task: Predict future frames
    • MNIST 2 digits: Cross Entropy
    • UCF-101: SSD (sum of squared differences)

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
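As a small illustration of the two metrics, assuming predictions and targets are batches of flattened frames (probabilities in [0,1] for MNIST); this is a sketch, not the paper's evaluation code:

```python
import torch.nn.functional as F

def prediction_error(pred, target, dataset):
    """Average per-sequence prediction error used to compare the design choices (sketch)."""
    if dataset == "mnist":
        # Logistic outputs, binary pixels: cross-entropy over all pixels and frames.
        return F.binary_cross_entropy(pred, target, reduction="sum") / pred.shape[0]
    # Natural image patches, linear outputs: sum of squared differences (SSD).
    return ((pred - target) ** 2).sum() / pred.shape[0]
```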

33 of 38

Summary:

  • Learning Unsupervised Video Representations
  • LSTM Encoder-Decoder Model
    • Input Reconstruction
    • Future Prediction
  • Results:
    • Works well when there is significant structure
    • Performs poorly on natural videos

34 of 38

Future Directions:

  • Make prediction work for natural videos → full-sized, longer videos
    • Attention Mechanism
    • Applying the model convolutionally over entire frames and stack layers
    • Lower CNN Layer Features
  • Loss Function

source: http://previews.123rf.com/images/maxym/maxym1307/maxym130700657/20911654-American-Country-Road-Side-View--Stock-Photo.jpg

http://www.dzzyn.com/wp-content/uploads/2015/09/10-Athlete-PNG-Images-Free-Cutout-People-for-Architecture-Landscape-Interior-Renderings-Cyclist.png

35 of 38

Related Work: Video (Language) Modelling: A Baseline for Generative Models of Natural Video

  • SSD:
    • Not a good metric in pixel space
    • Not stable to small image deformations
    • Responds to uncertainty with linear blurring
  • Log-likelihood:
    • Reduces unsupervised learning to a density estimation problem
    • Estimating densities in very high dimensional spaces can be difficult
    • The distribution of natural images is highly concentrated and multimodal
  • Food for Thought: What would be better loss functions?
    • A low-level error measure (e.g., SSD after mean-pooling)
    • Combined with an error measure on the feature vector from a fully connected layer of a CNN

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
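A hedged sketch of that "food for thought" loss: SSD computed after mean-pooling, plus a distance between feature vectors from a frozen CNN; the pooling size, weighting, and feature extractor are all assumptions:

```python
import torch.nn.functional as F

def combined_loss(pred, target, feature_net, pool=4, alpha=1.0, beta=1.0):
    """pred, target: (batch, C, H, W) frames. feature_net: a frozen CNN that maps
    an image batch to feature vectors (e.g. the output of a fully connected layer)."""
    # Low-level term: SSD on mean-pooled frames, more tolerant of small deformations.
    low = ((F.avg_pool2d(pred, pool) - F.avg_pool2d(target, pool)) ** 2).sum(dim=(1, 2, 3))
    # High-level term: squared distance between CNN feature vectors.
    high = ((feature_net(pred) - feature_net(target)) ** 2).sum(dim=1)
    return (alpha * low + beta * high).mean()
```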

36 of 38

Future Directions: Loss Function

[Figure: two example predictions with SSD ≈ 713 and SSD ≈ 728]

37 of 38

Thanks for listening! :D

Questions?

38 of 38

Bibliography

  • Bengio, Yoshua, Aaron Courville, and Pierre Vincent. "Representation learning: A review and new perspectives." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.8 (2013): 1798-1828.
  • Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
  • Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).