1 of 13

Sign Language Recognition

Video Understanding through Deep Learning
Frederik Schorr | July 2018

Photo by rawpixel on Unsplash

2 of 13

Sign Language for Deaf People

  • ~1-3 deaf people per 1,000, i.e. ~700,000 - 2.1M in Europe
  • Different sign languages all over the world, each with its own grammar & syntax
  • Complex body movements; the “flow” is essential, not only static signs
  • Communication restricted to a small group of like-minded people, relatives and teachers

Project Idea: Tablet that Translates Sign Language

My wife teaches with signs

3 of 13

Focus: Recognise Gestures = Video Classification

“Natural” Language: Speech Recognition = Solved

Sign Language: Understanding gestures was too complex – until very recently!

Live demo: prototype recognising isolated gestures of ~5 sec

4 of 13

State-of-the-Art: Image Understanding Solved with Deep Learning

Transfer Learning!

  • Convolutional Neural Networks trained on ImageNet can be used to extract image features, e.g. eyes, corners, …
  • Can easily be re-used for any image classification task - not only the 1,000 standard categories
  • Enables extremely powerful transfer learning (analogous to the human brain) - see the sketch after this list
  • ⇒ Fuels the current AI hype
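A minimal sketch of this re-use idea, assuming Keras with a TensorFlow backend; MobileNet stands in for any ImageNet-pretrained CNN, and the random input frame is just a placeholder:

    # Re-use an ImageNet-pretrained MobileNet as a fixed feature extractor.
    import numpy as np
    from keras.applications.mobilenet import MobileNet, preprocess_input

    # include_top=False drops the 1000-class ImageNet head; pooling="avg" yields
    # one 1024-dimensional feature vector per 224x224 RGB image.
    feature_extractor = MobileNet(weights="imagenet", include_top=False,
                                  pooling="avg", input_shape=(224, 224, 3))

    frame = np.random.randint(0, 256, size=(1, 224, 224, 3)).astype("float32")  # placeholder frame
    features = feature_extractor.predict(preprocess_input(frame))
    print(features.shape)  # (1, 1024)

These per-frame feature vectors are exactly what the memory unit (LSTM) of the 1st prototype on slide 6 consumes.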

Convolutional Neural Nets Learn Hierarchical Features

ImageNet: 1,000 object categories, 1.2 million training images. Classification error in 2017: 2.3%(1)

5 of 13

Video Adds a Time Dimension ⇒ Much Harder!

(1) Figure from YouTube-8M: A Large-Scale Video Classification Benchmark, Google Research, Sep 2016. Enriched with additional video datasets.

Short-term: Detect “Flow” of Movement

Long-term: Remember Sequence (e.g. first lightning, then rain)

Required Large-Scale Video Datasets(1) only Emerging Since 2016

[Timeline of large-scale video datasets, 2010-2017, including Kinetics (2017), ChaLearn249 (2016) and 20BN Jester (2017)]
Subset of 20 Italian sign language gestures used for this prototype

6 of 13

1st Prototype: ImageNet Transfer Learning + Memory Unit

[Architecture diagram: 40 video frames (224x224) over time → CNN per frame → 1024x1 feature vector each → LSTMs → gesture]

Good result on the proof of concept: ~70% accuracy on the UCF101(1) human action video dataset (~10 sec per video)

But not satisfying on the short-term movements required for sign language: only ~40% accuracy on gestures from ChaLearn20(2) ...

🤔

(1) The UCF101 dataset consists of 101 human actions (e.g. badminton, applying lipstick) and 13,000 videos

(2) The ChaLearn dataset consists of 249 human gestures and 50,000 videos. For this prototype, 20 mostly Italian sign language gestures were used (3,200 videos)


1. Slice the video into 40 frames

2. Extract image “features” from the CNN “MobileNet”, pretrained on ImageNet

3. Feed the image features (sequentially) into a Long Short-Term Memory (LSTM) network to predict the gesture - a minimal sketch follows below
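A minimal sketch of step 3, assuming Keras; the layer sizes (256 LSTM units, 0.5 dropout) are illustrative values, not the prototype's actual configuration:

    # Classify a sequence of 40 MobileNet feature vectors (1024-dim each) with an LSTM head.
    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense

    NUM_FRAMES, NUM_FEATURES, NUM_CLASSES = 40, 1024, 20   # 20 ChaLearn gestures

    model = Sequential([
        LSTM(256, input_shape=(NUM_FRAMES, NUM_FEATURES)),  # reads the frame sequence
        Dropout(0.5),
        Dense(NUM_CLASSES, activation="softmax"),           # one probability per gesture
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_features, train_labels_onehot, ...)   # features precomputed per video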

7 of 13

1st Try: “Optical flow” Boosts Short-Term Movement Detection

Optical flow (instead of “normal” frames) increases accuracy on ChaLearn20 above 70% 😀

  • Calculated from 2 successive frames (see the sketch below)
  • Computationally very expensive!
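A minimal sketch of this step, assuming OpenCV's dense Farneback method; the prototype may use a different flow algorithm, and the file names are hypothetical placeholders:

    # Dense optical flow between two successive (grayscale) frames.
    import cv2
    import numpy as np

    prev = cv2.cvtColor(cv2.imread("frame_01.jpg"), cv2.COLOR_BGR2GRAY)  # hypothetical files
    curr = cv2.cvtColor(cv2.imread("frame_02.jpg"), cv2.COLOR_BGR2GRAY)

    # Positional args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags. Result: (height, width, 2) = x/y displacement per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    # Clip large displacements and rescale to [0, 255] so the two flow channels
    # can be fed to the network like ordinary image channels.
    flow = np.clip(flow, -15, 15)
    flow = ((flow + 15) / 30.0 * 255).astype(np.uint8)

Computing this for every pair of frames in every video is what makes the approach so expensive.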

[Training plot: accuracy over epochs]


8 of 13

1st Solution = State-of-the-Art: Pre-Trained 3D-CNN from Deepmind

[Architecture diagram: 40 optical-flow frames (224x224) over time → 3D-CNN → gesture]

93% accuracy on ChaLearn20 gestures(2)

  • 3D-CNNs add time as a 3rd dimension to “classical” 2D image CNNs ⇒ very complex models
  • In 2017 Deepmind published I3D(1), which is “inflated” from the famous 2D “Inception” CNN
  • I3D inherits pretraining from ImageNet + is pretrained on the Kinetics human action video dataset
  • Pretrained I3D can be used for transfer learning! (A toy sketch of the 3D-convolution idea follows below.)

(1) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, J Carreira, A Zisserman, Deepmind, 2017. Keras version available since 2018 https://github.com/dlpbc/keras-kinetics-i3d

(2) ChaLearn249 is a very hard challenge because the val/test sets are performed by “new” signers. The high score in the Summer 2016 competition was “only” 57% on the test set. My I3D network achieves 58% on the val set :)
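A toy sketch of the 3D-convolution idea only, assuming Keras - this is NOT the I3D architecture and the layer sizes are made up; it just shows how the kernel now also slides along the time axis:

    # Toy 3D-CNN (NOT I3D): convolution kernels span (time, height, width), so motion
    # patterns across frames are learned directly. Input: 40 optical-flow frames,
    # 224x224, 2 channels, as on this slide.
    from keras.models import Sequential
    from keras.layers import Conv3D, MaxPooling3D, GlobalAveragePooling3D, Dense

    model = Sequential([
        Conv3D(32, kernel_size=(3, 7, 7), strides=(1, 2, 2), activation="relu",
               input_shape=(40, 224, 224, 2)),   # (time, height, width, channels)
        MaxPooling3D(pool_size=(2, 2, 2)),
        Conv3D(64, kernel_size=(3, 3, 3), activation="relu"),
        GlobalAveragePooling3D(),
        Dense(20, activation="softmax"),          # 20 ChaLearn gestures
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

The prototype does not train such a network from scratch; it starts from the pretrained I3D (Keras port in footnote (1)) and fine-tunes it on the gesture data.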


9 of 13

Live-Demo of Sign Recognition Prototype

  • All training of the neural networks on the Amazon cloud with GPUs
  • Live demo on a MacBook Pro i5 (no GPU)
  • Using 20 Italian sign language gestures (3,200 videos) from the ChaLearn dataset
  • Prototype captures video (5 sec) from the attached webcam, displays the optical flow (~5 sec) and predicts the gesture with the I3D neural network (~5 sec) - see the capture sketch below
  • https://github.com/FrederikSchorr/sign-language
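A minimal sketch of the capture step, assuming OpenCV; this is a simplified stand-in for the real pipeline in the repository above:

    # Grab ~5 seconds of frames from the default webcam, resized to 224x224 for the
    # later optical-flow and I3D prediction steps.
    import time
    import cv2

    cap = cv2.VideoCapture(0)                # default webcam
    frames = []
    t_start = time.time()
    while time.time() - t_start < 5.0:       # record for ~5 seconds
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    print("captured %d frames" % len(frames))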

10 of 13

Sign Language Recognition Prototype: Screenshot

11 of 13

Sign Language Recognition Prototype: 1 min Video

12 of 13

Future: From Sign to Sentence

(1) Figure from Neural Sign Language Translation, Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, Richard Bowden, 2018. (2) Natural Language Processing

Anybody interested in making this happen?

Historically: isolate signs in the video stream, then translate them 1-by-1

New end-to-end approach: Sign Language Translation(1), merging video recognition with NLP(2)

13 of 13

(Almost) No Large-Scale Sign Language Datasets Available

Backup

Phoenix Weather by RWTH Aachen

  • Based on the 2009-2011 weather forecasts on German TV
  • Isolated gestures and whole annotated sentences
  • 7 signers, very homogeneous TV setting

ChaLearn by �Barcelona University

  • 249 (isolated) human gestures, 50,000 videos
  • 21 signers, heterogeneous background settings
  • Validation/test sets performed by “new” signers ⇒ very hard challenge
  • Subset of 20 Italian sign language gestures used for this prototype

LedaSila by �Alpen Adria Universität

  • Lexicon for Austrian sign language: 16,000 signs, 33,000 videos
  • Most words occur only once; only 10 words occur more than 20 times ⇒ not enough for neural network training

20BN Jester by twentybn.com

  • 27 human gestures, 150,000 videos
  • Similar to sign language gestures - but with very low complexity

Sign language on public TV

  • Several public TV stations offer sign language streams with subtitles
  • These could probably be scraped ⇒ separate project