Unsupervised Learning of Video Representations using LSTMs
Seyed Kamyar Seyed Ghasemipour
They’re important & they’re everywhere
NLP:
Word2Vec
Skip-Thought Vectors
Vision:
Visual Words
What makes a representation good?
Representations for Videos
Motivation
source: http://meowgifs.com/wp-content/uploads/2013/10/cat-jumps-into-bean-bag.gif
Motivation
Cat 1
Cat 2
Cat 3
Cat 1
source: screenshots from https://www.youtube.com/watch?v=okOVxfuSYPk
Images vs. Video
Motivation
source: http://www.dundjinni.com/forums/uploads/RavenStarhawke/F72_DraftHorse&Harness_2_RS-SR.png
source: http://www.gaddidekho.com/wp-content/uploads/2013/06/Bentley-GTC-V8-Top-Angle.jpg
Related Work: Video (Language) Modelling: A Baseline for Generative Models of Natural Videos (Ranzato et al., 2014)
source: Ranzato, Marc'Aurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
Related Work: Video (Language) Modelling: ...
source: Ranzato, Marc'Aurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
Model
Constant Optical Flow Assumption
Related Work: Video (Language) Modelling: ...
source: Ranzato, Marc'Aurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
Model
Linear Interp. of Optical Flow
Linear Interp. in Pixel Space
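To make the "Linear Interp. in Pixel Space" label concrete, here is a minimal sketch, assuming frames are NumPy arrays of pixel intensities. The function name `interpolate_pixels` and the toy frames are hypothetical and are not taken from Ranzato et al. Blending pixels directly cross-fades between the two frames instead of moving the object, which is what the toy example shows: the bright pixel never passes through the middle columns.

```python
import numpy as np

def interpolate_pixels(frame_a, frame_b, num_steps):
    """Blend linearly from frame_a to frame_b in pixel space (endpoints included).

    Hypothetical helper for illustration only.
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * frame_a + a * frame_b for a in alphas])

# Toy 1x5 "frames": a single bright pixel at column 0, then at column 4.
frame_a = np.zeros((1, 5))
frame_a[0, 0] = 1.0
frame_b = np.zeros((1, 5))
frame_b[0, 4] = 1.0

frames = interpolate_pixels(frame_a, frame_b, 3)
middle = frames[1]  # half of each endpoint; nothing at columns 1..3
```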
Representations for Videos
Model
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Model Design Choices: Output
Input Reconstruction
Both
Future Prediction
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
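The three output choices can be sketched as different training targets built from the same clip. This is a minimal sketch, assuming a video is an array whose first axis is time; `make_targets` and its parameters are hypothetical names. The reconstruction target is the input in reverse order, as described in the paper; "Both" simply trains against the two targets together.

```python
import numpy as np

def make_targets(video, num_input, num_future):
    """Split a clip into encoder inputs and the two decoder targets.

    Hypothetical helper: names and signature are illustrative assumptions.
    """
    inputs = video[:num_input]
    reconstruction = inputs[::-1]                         # input frames, reversed
    future = video[num_input:num_input + num_future]      # frames that follow the input
    return inputs, reconstruction, future

# Toy clip: six one-pixel frames with values 0..5.
video = np.arange(6).reshape(6, 1)
inputs, recon, future = make_targets(video, num_input=3, num_future=3)
```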
Model Design Choices: Conditioned vs. Unconditioned Decoder
Conditioned
Unconditioned
To condition, or not to condition, that is the question:
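The difference between the two decoders can be sketched as what gets fed in at each step. This is a minimal sketch, assuming a plain tanh recurrent cell in place of an LSTM to keep it short; `decode`, the weight names, and the sizes are all hypothetical. A conditioned decoder receives its own previous output frame as input, while an unconditioned decoder receives no frame input (zeros here) and must rely on its hidden state alone.

```python
import numpy as np

def decode(h0, W_h, W_x, W_o, num_steps, conditioned):
    """Unroll a toy recurrent decoder (plain tanh cell, NOT an LSTM).

    conditioned=True  : input at step t is the frame generated at step t-1.
    conditioned=False : input is always zeros.
    """
    h = h0
    prev = np.zeros(W_o.shape[0])  # no frame has been generated before step 0
    outputs = []
    for _ in range(num_steps):
        x = prev if conditioned else np.zeros_like(prev)
        h = np.tanh(W_h @ h + W_x @ x)
        prev = W_o @ h             # generated frame for this step
        outputs.append(prev)
    return np.stack(outputs)

rng = np.random.default_rng(0)
hidden, frame = 8, 4               # arbitrary toy sizes
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, frame))
W_o = rng.normal(scale=0.5, size=(frame, hidden))
h0 = rng.normal(size=hidden)       # stands in for the encoder's final state

cond = decode(h0, W_h, W_x, W_o, 5, conditioned=True)
uncond = decode(h0, W_h, W_x, W_o, 5, conditioned=False)
```

Both variants produce the same first frame, because nothing has been generated yet; they diverge from the second step on.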
Evaluation
Qualitative Analysis: Moving MNIST
source: https://github.com/emansim/unsupervised-videos
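A Moving-MNIST-style clip can be sketched as a patch bouncing around a canvas. This is a minimal sketch, not the code from the repository above: `bouncing_sequence` is a hypothetical name, it moves a single patch rather than two digits, and the patch is a placeholder array instead of a real MNIST digit. The defaults (20 frames, 64x64 canvas, 28x28 patch) are meant to mirror the dataset described in the paper, but treat them as assumptions.

```python
import numpy as np

def bouncing_sequence(patch, num_frames=20, canvas=64, speed=3.0, seed=0):
    """Render one square patch moving in a straight line and bouncing off the walls."""
    rng = np.random.default_rng(seed)
    size = patch.shape[0]
    limit = canvas - size                      # largest valid top-left coordinate
    pos = rng.uniform(0, limit, size=2)        # (row, col) of the top-left corner
    angle = rng.uniform(0, 2 * np.pi)
    vel = speed * np.array([np.sin(angle), np.cos(angle)])
    frames = np.zeros((num_frames, canvas, canvas), dtype=np.float32)
    for t in range(num_frames):
        r, c = np.round(pos).astype(int)
        frames[t, r:r + size, c:c + size] = patch
        pos = pos + vel
        for axis in range(2):                  # reflect off each wall
            if pos[axis] < 0:
                pos[axis] = -pos[axis]
                vel[axis] = -vel[axis]
            elif pos[axis] > limit:
                pos[axis] = 2 * limit - pos[axis]
                vel[axis] = -vel[axis]
    return frames

patch = np.ones((28, 28), dtype=np.float32)    # placeholder for a digit image
clip = bouncing_sequence(patch)
```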
Qualitative Analysis: Moving MNIST
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Qualitative Analysis: Natural Images
source: https://github.com/emansim/unsupervised-videos
Qualitative Analysis: Moving MNIST
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Qualitative Analysis: Generalization Over Time
source: http://www.cs.toronto.edu/~nitish/unsupervised_video/
Qualitative Analysis: Generalization Over Time
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Qualitative Analysis: Out of Domain Inputs
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Qualitative Analysis: Visualizing Features
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Quantitative Analysis: Supervised Tasks
2. Initialize an LSTM with encoder weights
3. Compare to randomly initialized LSTM
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Quantitative Analysis: Supervised Tasks
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Quantitative Analysis: Other Action Recognizers
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Quantitative Analysis: Different Types of Models
source: Srivastava, Nitish, Elman Mansimov, and Ruslan Salakhutdinov. "Unsupervised learning of video representations using LSTMs." arXiv preprint arXiv:1502.04681 (2015).
Summary:
Future Directions:
source: http://www.dzzyn.com/wp-content/uploads/2015/09/10-Athlete-PNG-Images-Free-Cutout-People-for-Architecture-Landscape-Interior-Renderings-Cyclist.png
Related Work: Video (Language) Modelling: A Baseline for Generative Models of Natural Videos
source: Ranzato, Marc'Aurelio, et al. "Video (language) modeling: a baseline for generative models of natural videos." arXiv preprint arXiv:1412.6604 (2014).
Future Directions: Loss Function
source: http://howset.com/media/cache/9a/7a/9a7a790f7cbe9e0f518a977e88504580.png
SSD: ~713
SSD: ~728
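SSD here is the sum of squared differences between two images. As a minimal sketch of why it can be a questionable loss, assume the toy 4x4 arrays below, which are invented for illustration and unrelated to the ~713 and ~728 values on the slide. A sharp prediction that is off by one pixel scores worse than a uniformly blurry one, even though the sharp one arguably looks closer to the target.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences between two equally shaped arrays."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(np.sum(diff ** 2))

target = np.zeros((4, 4))
target[1, 1] = 1.0                 # one bright pixel

shifted = np.zeros((4, 4))
shifted[1, 2] = 1.0                # same pixel, one column to the right

blurry = np.full((4, 4), 1 / 16)   # brightness smeared evenly over the image

ssd_shifted = ssd(target, shifted)  # 2.0
ssd_blurry = ssd(target, blurry)    # 0.9375
```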
Thanks for listening! :D
Questions?