1 of 38

Unsupervised Learning of Video Representations using LSTMs

Seyed Kamyar Seyed Ghasemipour

2 of 38

They’re important & they’re everywhere

  • Representations contain all the information we have about a piece of data
  • Success of methods highly dependent on the data representation used

3 of 38

NLP:

  • Word and Sentence Representations:

Word2Vec

Skip-Thought Vectors

4 of 38

Vision:

  • SIFT Features
  • HOG Features
  • SURF Features
  • “[A-Z]+ Features”

Visual Words

5 of 38

What makes a representation good?

6 of 38

What makes a representation good?

  • Representative of the data
  • Distributed representations
  • Hierarchy of features
  • Disentangle many factors of variation
  • Semi-supervised learning
  • ….

7 of 38

Representations for Videos

8 of 38

Motivation

  • Videos are a rich source of information
  • What constitutes an object: object boundaries, occlusion
  • Information regarding behaviour:
    • How objects, animals, humans behave and deform
  • Intuitive physics

9 of 38

Motivation

  • Videos are a rich source of information
  • What constitutes an object: object boundaries, occlusion
  • Information regarding behaviour:
    • How objects, animals, humans behave and deform
  • Intuitive physics

source: http://meowgifs.com/wp-content/uploads/2013/10/cat-jumps-into-bean-bag.gif

10 of 38

Motivation

  • Information regarding behaviour:
    • How objects, animals, humans behave and deform

[Figure: still frames labelled "Cat 1", "Cat 2", "Cat 3" and a video of Cat 1, contrasting Images vs. Video]

source: screenshots from https://www.youtube.com/watch?v=okOVxfuSYPk

11 of 38

Motivation

  • Videos are a rich source of information
  • What constitutes an object: object boundaries, occlusion
  • Information regarding behaviour:
    • How objects, animals, humans behave
  • Intuitive physics:
    • Understand that objects can be seen from different viewpoints

12 of 38

Related Work: Video (Language) Modelling: A Baseline for Generative Models of Natural Video (Ranzato et al., 2014)

  • Defining a good loss function is hard (more on this towards the end)
  • Quantize 8x8 patches using k-means clustering, 10K clusters
  • Feed model a patch and its neighborhood across time
  • Two Tasks:
    • Predicting future frames
    • Interpolating in-between frames

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
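A rough sketch of the quantization step described above, assuming grayscale frames and scikit-learn's MiniBatchKMeans; the function names and parameters below are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_patch_codebook(frames, patch=8, n_clusters=10000, seed=0):
    """Sketch: quantize 8x8 frame patches into a 10K-word codebook with k-means.

    frames: array of shape (N, H, W) holding grayscale video frames.
    """
    H, W = frames.shape[1:]
    patches = [f[i:i + patch, j:j + patch].ravel()
               for f in frames
               for i in range(0, H - patch + 1, patch)
               for j in range(0, W - patch + 1, patch)]
    patches = np.asarray(patches, dtype=np.float32)
    # MiniBatchKMeans keeps fitting 10K clusters over many patches tractable.
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096, random_state=seed)
    return km.fit(patches)

def quantize_patch(km, patch_pixels):
    """Map one 8x8 patch to its discrete codebook id (its 'visual word')."""
    return int(km.predict(patch_pixels.reshape(1, -1).astype(np.float32))[0])
```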

13 of 38

Related Work: Video (Language) Modelling: ...

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).

[Figures: model diagram; constant optical flow assumption]

14 of 38

Related Work: Video (Language) Modelling: ...

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).

[Figures: model diagram; linear interpolation of optical flow; linear interpolation in pixel space]

15 of 38

Representations for Videos

16 of 38

Model

  • Important Decision: Model Design Choices & Loss Function
  • LSTM Encoder-Decoder model
  • Output of the Model:
    • Reconstruction
    • Future Prediction
    • Both
  • Decoder:
    • Conditioned
    • Unconditioned
  • Input of the Model:
    • Image Patches
    • Extracted Features

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
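To make these choices concrete, here is a minimal PyTorch-style sketch of the composite variant (one encoder LSTM feeding unconditioned reconstruction and future-prediction decoders), assuming flattened frames or percepts as inputs; the layer sizes and names are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CompositeLSTMAutoencoder(nn.Module):
    """Encoder LSTM feeds its final state to two decoder LSTMs:
    one reconstructs the input (in reverse), one predicts future frames."""

    def __init__(self, frame_dim, hidden_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.pred_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_out = nn.Linear(hidden_dim, frame_dim)
        self.pred_out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, inputs, n_future):
        # inputs: (batch, T, frame_dim) of flattened frames or percepts.
        batch, T, frame_dim = inputs.shape
        _, (h, c) = self.encoder(inputs)          # the learned video representation

        # Unconditioned decoders: fed zeros at every step, the state carries the content.
        zeros_in = inputs.new_zeros(batch, T, frame_dim)
        recon_h, _ = self.recon_decoder(zeros_in, (h, c))
        recon = self.recon_out(recon_h)           # target: the input frames in reverse order

        zeros_fut = inputs.new_zeros(batch, n_future, frame_dim)
        pred_h, _ = self.pred_decoder(zeros_fut, (h, c))
        pred = self.pred_out(pred_h)               # target: the next n_future frames
        return recon, pred
```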

17 of 38

Model Design Choices: Output

Input Reconstruction

  • LSTM asked to reconstruct the input it received
  • Representation needs to remember information about:
    • Objects in the video
    • Background
    • Motion in the video
  • Con: It might tend to memorize the input

Both

  • Have the LSTM both reconstruct the input it received and predict future frames of the video
  • Hope is to relieve the potential cons of each approach

Future Prediction

  • LSTM asked to predict the future frames of the video sequence
  • Representation needs to remember information about:
    • Objects in the video
    • Background
    • Motion in the video
  • Con: It might tend to only care about the last few frames of the input

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
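For the "Both" option, a training step would simply add the two losses of the composite model sketched above; a minimal illustration, assuming `model`, `past_frames`, and `future_frames` are already defined and the pixel targets are binary:

```python
import torch.nn.functional as F

recon, pred = model(past_frames, n_future=future_frames.shape[1])
loss = (F.binary_cross_entropy_with_logits(recon, past_frames.flip(1))  # reconstruct the input, reversed in time
        + F.binary_cross_entropy_with_logits(pred, future_frames))      # predict the future frames
loss.backward()
```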

18 of 38

Model Design Choices: (Un)/Conditioned

Conditioned

  • “It allows the decoder to model multiple modes in the target sequence”
  • This is useful when predicting the future frames of the video
  • Given what has been seen so far, the future is still unpredictable

Unconditioned

  • The difference between consecutive frames is small
  • When conditioned on its previously generated frame, the model relies more on these short-range correlations
  • It may then neglect the long-term structure it actually needs

To condition, or not to condition, that is the question.
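A minimal sketch of what conditioning means at decode time, assuming `cell` is an `nn.LSTMCell(frame_dim, hidden_dim)` and `out_layer` an `nn.Linear(hidden_dim, frame_dim)`; this is an illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

def decode(cell, out_layer, h, c, first_input, n_steps, conditioned=True):
    """Roll out an LSTMCell decoder from an encoder state (h, c).

    conditioned=True  : each step is fed the frame generated at the previous step.
    conditioned=False : each step is fed a zero vector; only the state evolves.
    """
    frames, x = [], first_input
    for _ in range(n_steps):
        h, c = cell(x, (h, c))
        frame = torch.sigmoid(out_layer(h))        # e.g. logistic outputs for MNIST pixels
        frames.append(frame)
        x = frame if conditioned else torch.zeros_like(first_input)
    return torch.stack(frames, dim=1)              # (batch, n_steps, frame_dim)
```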

19 of 38

Evaluation

20 of 38

Qualitative Analysis: Moving MNIST

  • 2 digits moving inside a 64x64 image
  • Velocity and direction assigned uniformly at random
  • Duration: 20 frames
  • Encoder sees 10 frames
  • Model asked to:
    • Reconstruct 10 frames
    • Predict next 10 frames
  • Logistic output, Cross-Entropy Loss

source: https://github.com/emansim/unsupervised-videos
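A rough sketch of how such sequences can be generated (two digits bouncing inside a 64x64 canvas); the speed range and bounce logic are my assumptions about the standard setup, not the authors' exact generator:

```python
import numpy as np

def make_moving_mnist(digits, seq_len=20, size=64, max_speed=4, seed=None):
    """digits: list of two 28x28 arrays. Returns frames of shape (seq_len, size, size)."""
    rng = np.random.default_rng(seed)
    d = digits[0].shape[0]                               # digit side length (28)
    pos = rng.uniform(0, size - d, size=(len(digits), 2))
    theta = rng.uniform(0, 2 * np.pi, size=len(digits))  # direction, uniform at random
    speed = rng.uniform(2, max_speed, size=len(digits))  # speed, uniform at random (assumed range)
    vel = np.stack([np.cos(theta), np.sin(theta)], axis=1) * speed[:, None]

    frames = np.zeros((seq_len, size, size), dtype=np.float32)
    for t in range(seq_len):
        for k, digit in enumerate(digits):
            x, y = pos[k].astype(int)
            frames[t, y:y + d, x:x + d] = np.maximum(frames[t, y:y + d, x:x + d], digit)
        pos += vel
        # Bounce off the canvas edges.
        for k in range(len(digits)):
            for a in range(2):
                if pos[k, a] < 0 or pos[k, a] > size - d:
                    vel[k, a] *= -1
                    pos[k, a] = np.clip(pos[k, a], 0, size - d)
    return frames
```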

21 of 38

Qualitative Analysis: Moving MNIST

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

22 of 38

Qualitative Analysis: Natural Images

  • UCF-101 dataset:
    • 13,320 videos
    • Avg. Length: 6.2 seconds
    • 101 different action categories
    • 32 x 32 patches extracted from videos
  • Encoder sees 16 frames
  • Model asked to:
    • Reconstruct 16 frames
    • Predict 13 future frames
  • Linear output units, MSE Loss

source: https://github.com/emansim/unsupervised-videos

23 of 38

Qualitative Analysis: Natural Images

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

24 of 38

Qualitative Analysis: Generalization Over Time

  • MNIST 2 digits task
  • Training:
    • Encoder sees 10 frames
    • Model predicts 10 frames
  • Test Time:
    • Encoder sees 10 frames
    • Model predicts 100 frames

source: http://www.cs.toronto.edu/~nitish/unsupervised_video/
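Reusing the hypothetical `decode` helper from the earlier sketch, the test-time setup is just a longer unroll of the same trained decoder; `encode_first_ten` is a placeholder for whatever produces the encoder's final state:

```python
# Trained to predict 10 future frames; at test time, unroll the same decoder for 100.
h, c = encode_first_ten(video[:, :10])                    # placeholder encoder call
rollout = decode(cell, out_layer, h, c,
                 first_input=video[:, 9], n_steps=100, conditioned=True)
```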

25 of 38

Qualitative Analysis: Generalization Over Time

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

26 of 38

Qualitative Analysis: Out of Domain Inputs

  • MNIST 2 digits task
  • Test on frames that contain 1 or 3 digits instead

  • Authors’ Hypothesis:
    • Attention mechanism
    • Variable amounts of computation

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

27 of 38

Qualitative Analysis: Visualizing Features

  • MNIST 2 digits task
  • Visualizing the weights of the learned model

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
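One way such visualizations can be produced (an assumption on my part, not necessarily the authors' exact procedure): reshape rows of the encoder LSTM's input-to-hidden weight matrix back into 64x64 images, reusing the composite model sketched earlier:

```python
import matplotlib.pyplot as plt

# nn.LSTM stores its input-to-hidden weights as weight_ih_l0, shape (4*hidden_dim, frame_dim).
w = model.encoder.weight_ih_l0.detach().cpu().numpy()
plt.imshow(w[0].reshape(64, 64), cmap="gray")   # one row shown as a 64x64 input filter
plt.axis("off")
plt.show()
```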

28 of 38

Quantitative Analysis: Supervised Tasks

  • Task: Action Recognition
  • Datasets:
    • UCF-101: mentioned in previous slides
    • HMDB-51:
      • 5100 videos
      • 51 different action categories
      • Avg. Length: 3.2 seconds
  • Representation Learning:
    • Two layer composite model with 2048 hidden units
    • Dataset:
      • Sports-1M, 1 million YouTube Videos, 240x320 dimensions
      • Used a 300-hour subset of YouTube videos
      • Sampled 10 second clips at 30 fps
      • “Percepts”: Features from fully connected layer of a convnet
    • Model sees 16 frames
    • Reconstructs 16, Predicts 13

29 of 38

Quantitative Analysis: Supervised Tasks

1. Pretrain the composite LSTM model on unlabelled video (previous slide)

2. Initialize an LSTM classifier with the encoder weights

3. Compare to a randomly initialized LSTM classifier

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
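A minimal PyTorch-style sketch of that comparison, assuming `model` is the composite autoencoder from the earlier sketch after unsupervised training and `percept_dim` is the dimensionality of the convnet percepts; names and sizes are illustrative:

```python
import torch.nn as nn

class ActionClassifier(nn.Module):
    """LSTM over percepts, classifying the action from the last hidden state."""
    def __init__(self, percept_dim, hidden_dim, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(percept_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                       # x: (batch, T, percept_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

# Pretrained initialization: copy the unsupervised encoder's weights into the classifier LSTM.
clf_pretrained = ActionClassifier(percept_dim, hidden_dim=2048, n_classes=101)
clf_pretrained.lstm.load_state_dict(model.encoder.state_dict())

# Baseline: an identical classifier left at its random initialization.
clf_random = ActionClassifier(percept_dim, hidden_dim=2048, n_classes=101)
```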

30 of 38

Quantitative Analysis: Supervised Tasks

  • Improvements when there is not much labelled data
    • UCF-101: 1 labelled video per class, Boost from 29.6% to 34.3%
    • HMDB-51: 1 labelled video per class, Boost from 14.4% to 19.1%

  • UCF-101 full contains 70 videos per class

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

31 of 38

Quantitative Analysis: Other Action Recognizers

  • Only RGB:
    • 3D Convolutions
    • Long Term Recurrent CNNs
  • Only Flow Features:
    • 3D Convolutions
    • Long Term Recurrent CNNs
  • Both:
    • 3D Convolutions
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).

32 of 38

Quantitative Analysis: Different Types of Models

  • Compare performance of different model design choices
  • Task: Predict future frames
    • MNIST 2 digits: Cross Entropy
    • UCF-101: SSD (sum of squared differences)

source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
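As a small illustration of the two metrics, assuming predictions and targets are batches of flattened frames (probabilities in [0,1] for MNIST); this is a sketch, not the paper's evaluation code:

```python
import torch.nn.functional as F

def prediction_error(pred, target, dataset):
    """Average per-sequence prediction error used to compare the design choices (sketch)."""
    if dataset == "mnist":
        # Logistic outputs, binary pixels: cross-entropy over all pixels and frames.
        return F.binary_cross_entropy(pred, target, reduction="sum") / pred.shape[0]
    # Natural image patches, linear outputs: sum of squared differences (SSD).
    return ((pred - target) ** 2).sum() / pred.shape[0]
```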

33 of 38

Summary:

  • Learning Unsupervised Video Representations
  • LSTM Encoder-Decoder Model
    • Input Reconstruction
    • Future Prediction
  • Results:
    • Works well when there is significant structure
    • Performs poorly on natural videos

34 of 38

Future Directions:

  • Make prediction work for natural videos → full-sized, longer videos
    • Attention Mechanism
    • Applying the model convolutionally over entire frames and stack layers
    • Lower CNN Layer Features
  • Loss Function

source: http://previews.123rf.com/images/maxym/maxym1307/maxym130700657/20911654-American-Country-Road-Side-View--Stock-Photo.jpg

http://www.dzzyn.com/wp-content/uploads/2015/09/10-Athlete-PNG-Images-Free-Cutout-People-for-Architecture-Landscape-Interior-Renderings-Cyclist.png

35 of 38

Related Work: Video (Language) Modelling: A Baseline for Generative Models of Natural Video

  • SSD:
    • Not a good metric in pixel space
    • Not stable to small image deformations
    • Responds to uncertainty with linear blurring
  • Log-likelihood:
    • Reduces unsupervised learning to a density estimation problem
    • Estimating densities in very high dimensional spaces can be difficult
    • The distribution of natural images is highly concentrated and multimodal
  • Food for Thought: What would be better loss functions?
    • A low-level error measure (e.g., SSD after mean-pooling)
    • Combined with an error measure on the feature vector from a fully connected layer of a CNN

source: Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
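A hedged sketch of that "food for thought" loss: SSD computed after mean-pooling, plus a distance between feature vectors from a frozen CNN; the pooling size, weighting, and feature extractor are all assumptions:

```python
import torch.nn.functional as F

def combined_loss(pred, target, feature_net, pool=4, alpha=1.0, beta=1.0):
    """pred, target: (batch, C, H, W) frames. feature_net: a frozen CNN that maps
    an image batch to feature vectors (e.g. the output of a fully connected layer)."""
    # Low-level term: SSD on mean-pooled frames, more tolerant of small deformations.
    low = ((F.avg_pool2d(pred, pool) - F.avg_pool2d(target, pool)) ** 2).sum(dim=(1, 2, 3))
    # High-level term: squared distance between CNN feature vectors.
    high = ((feature_net(pred) - feature_net(target)) ** 2).sum(dim=1)
    return (alpha * low + beta * high).mean()
```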

36 of 38

Future Directions: Loss Function

[Figure: two example predictions with SSD ≈ 713 and SSD ≈ 728]

37 of 38

Thanks for listening! :D

Questions?

38 of 38

Bibliography

  • Bengio, Yoshua, Aaron Courville, and Pierre Vincent. "Representation learning: A review and new perspectives." Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.8 (2013): 1798-1828.
  • Ranzato, MarcAurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
  • Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).