Sign Language Recognition
Video Understanding through Deep Learning | Frederik Schorr | July 2018
Sign Language for Deaf People
Project Idea: Tablet that Translates Sign Language
My wife teaches with signs
Focus: Recognise Gestures = Video Classification
“Natural” Language: Speech Recognition = Solved
Sign Language: Understanding gestures too complex – until very recently!
Live demo: prototype recognising isolated gestures of ~5 sec
State-of-the-Art: Image Understanding Solved with Deep Learning
Transfer Learning!
Convolutional Neural Nets Learn Hierarchical Features
1000 object categories
1.2 million training images
Classification Error in 2017: 2.3%(1)
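Transfer learning from such a pre-trained image model takes only a few lines of code. A minimal sketch, assuming Keras (TensorFlow) and a placeholder image file:

```python
# Minimal sketch: reuse an ImageNet-pretrained CNN (MobileNet) to classify a single image.
# "example.jpg" is a placeholder file name, not part of the project.
import numpy as np
from tensorflow.keras.applications.mobilenet import (MobileNet, preprocess_input,
                                                     decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNet(weights="imagenet")  # pre-trained on the 1000 ImageNet categories

img = image.load_img("example.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

print(decode_predictions(model.predict(x), top=3)[0])  # top-3 labels with probabilities
```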
Video Adds Time Dimension ⇒ Much Harder!
(1) Figure from YouTube-8M: A Large-Scale Video Classification Benchmark, Google Research, Sep 2016. Enriched with additional video datasets.
Short-term: Detect “Flow” of Movement
Long-term: Remember Sequence (e.g. first lightning, then rain)
Required Large-Scale Video Datasets(1) only Emerging Since 2016
[Figure: timeline of datasets 2010–2017, including Kinetics (2017), ChaLearn249 (2016) and 20BN Jester (2017)]
Subset of 20 Italian sign language gestures used for this prototype
1st Prototype: ImageNet Transfer Learning + Memory Unit
[Diagram: 40 video frames (224x224) over time, each reduced to a 1024 x 1 feature vector]
Good result on proof-of-concept: ~70% accuracy on the UCF101(1) human action video dataset (videos of ~10 sec each)
But not satisfying on the short-term movements required for sign language: only ~40% accuracy on gestures from ChaLearn20(2) ...
🤔
(1) The UCF101 dataset consists of 101 human actions (e.g. badminton, applying lipstick) and 13,000 videos
(2) The ChaLearn dataset consists of 249 human gestures and 50,000 videos. For this prototype, 20 mostly Italian sign language gestures were used (3,200 videos)
[Diagram: video frames → CNN → CNN → CNN → LSTMs → gesture]
1. Slice video into 40 frames
2. Extract image “features” from the CNN “MobileNet” pretrained on ImageNet
3. Feed the image features (sequentially) into a Long Short-Term Memory (LSTM) network
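A minimal Keras sketch of this pipeline; the frame count (40), feature size (1024) and class count (20, ChaLearn20) follow the slides, while layer sizes and the optimizer are illustrative assumptions:

```python
# Sketch of the 1st prototype: per-frame MobileNet features fed into an LSTM classifier.
import numpy as np
from tensorflow.keras.applications.mobilenet import MobileNet, preprocess_input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_FRAMES, NUM_CLASSES = 40, 20

# Steps 1+2: MobileNet without its classification head -> one 1024-dim vector per frame
feature_extractor = MobileNet(weights="imagenet", include_top=False, pooling="avg",
                              input_shape=(224, 224, 3))

def video_to_features(frames):
    """frames: array of shape (40, 224, 224, 3) -> features of shape (40, 1024)."""
    return feature_extractor.predict(preprocess_input(frames.astype("float32")))

# Step 3: LSTM over the sequence of 40 feature vectors, softmax over the 20 gestures
classifier = Sequential([
    LSTM(256, input_shape=(NUM_FRAMES, 1024)),       # assumed hidden size
    Dense(NUM_CLASSES, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```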
1st Try: “Optical flow” Boosts Short-Term Movement Detection
Optical flow (instead of “normal” frames) increases accuracy on ChaLearn20 above 70% 😀
[Plot: ChaLearn20 accuracy over training epochs]
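Optical flow can be computed, for example, with OpenCV’s Farneback method; a minimal sketch (video path and parameter values are placeholders, not necessarily what the prototype uses):

```python
# Sketch: dense optical flow between consecutive frames with OpenCV (Farneback method).
import cv2

cap = cv2.VideoCapture("gesture.avi")                 # placeholder video path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # 2-channel flow field: horizontal and vertical displacement of every pixel
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, pyr_scale=0.5, levels=3,
                                        winsize=15, iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    flows.append(flow)
    prev_gray = gray
cap.release()
```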
1st Solution = State-of-the-Art: Pre-Trained 3D-CNN from DeepMind
[Diagram: 40 optical-flow frames (224x224) over time]
93% accuracy on ChaLearn20 gestures
(1) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, J Carreira, A Zisserman, Deepmind, 2017. Keras version available since 2018 https://github.com/dlpbc/keras-kinetics-i3d
(2) ChaLearn249 is a very hard challenge because the val/test sets are performed by “new” signers. The high score in the Summer 2016 competition was “only” 57% on test. My I3D network achieves 58% on the val set :)
[Diagram: optical flow → 3D-CNN → gesture]
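A hedged sketch of using the Keras I3D port cited in footnote (1) on a clip of optical-flow frames; the module name `i3d_inception`, the constructor `Inception_Inflated3d` and the weights identifier are assumptions based on the keras-kinetics-i3d repository and may differ between versions:

```python
# Sketch: Kinetics-pretrained I3D (Inflated 3D ConvNet) applied to 40 optical-flow frames.
import numpy as np
from i3d_inception import Inception_Inflated3d        # from the keras-kinetics-i3d repo

NUM_FRAMES = 40

# Flow stream: input shape (frames, height, width, 2) -- x/y displacement per pixel
i3d_flow = Inception_Inflated3d(include_top=True,
                                weights="flow_imagenet_and_kinetics",   # assumed identifier
                                input_shape=(NUM_FRAMES, 224, 224, 2),
                                classes=400)                            # Kinetics classes

clip = np.zeros((1, NUM_FRAMES, 224, 224, 2), dtype="float32")          # placeholder clip
kinetics_scores = i3d_flow.predict(clip)

# For ChaLearn20, the 400-class Kinetics head is replaced by a 20-class softmax and fine-tuned.
```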
Live-Demo of Sign Recognition Prototype
Sign Language Recognition Prototype: Screenshot
Sign Language Recognition Prototype: 1 min Video
Future: From Sign to Sentence
(1) Figure from Neural Sign Language Translation, Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, Richard Bowden, 2018. (2) Natural Language Processing
Anybody interested in making this happen?
Historically: Isolate signs in the video stream, then translate 1-by-1
New end-to-end approach: Sign Language Translation(1) merging video recognition with NLP(2)
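To illustrate the end-to-end idea only: a minimal Keras encoder-decoder sketch mapping a sequence of per-frame video features to a word sequence. Vocabulary size, feature dimension and layer widths are illustrative assumptions, not the architecture of the cited paper:

```python
# Sketch: sequence-to-sequence translation from video features to words (teacher forcing).
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding

FEAT_DIM, VOCAB, HIDDEN = 1024, 3000, 256              # assumed sizes

# Encoder: sequence of per-frame video features -> summary state
enc_in = Input(shape=(None, FEAT_DIM))
_, h, c = LSTM(HIDDEN, return_state=True)(enc_in)

# Decoder: previous words in, probability distribution over the next word out
dec_in = Input(shape=(None,))
dec_emb = Embedding(VOCAB, HIDDEN)(dec_in)
dec_out = LSTM(HIDDEN, return_sequences=True)(dec_emb, initial_state=[h, c])
words = Dense(VOCAB, activation="softmax")(dec_out)

translator = Model([enc_in, dec_in], words)
translator.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```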
(Almost) No Large-Scale Sign Language Datasets Available
Backup
Phoenix Weather by RWTH Aachen
ChaLearn by Barcelona University
LedaSila by Alpen Adria Universität
20BN Jester by twentybn.com
Sign language on public TV