1 of 13

Sign Language Recognition

Video Understanding through Deep Learning
Frederik Schorr | July 2018

Photo by rawpixel on Unsplash

2 of 13

Sign Language for Deaf People

  • ~1-3 deaf people per 1,000, i.e. ~700,000 - 2.1M in Europe
  • Different sign languages all over the world, each with its own grammar & syntax
  • Complex body movements; the “flow” is essential, not only static signs
  • Communication restricted to a small group of like-minded people, relatives and teachers

Project Idea: Tablet that Translates Sign Language

My wife teaches with signs

3 of 13

Focus: Recognise Gestures = Video Classification

“Natural” Language: Speech Recognition = Solved

Sign Language: Understanding gestures was too complex – until very recently!

Live demo: prototype recognising isolated gestures of ~5 sec

4 of 13

State-of-the-Art: Image Understanding Solved with Deep Learning

Transfer Learning!

  • Convolutional Neural Networks trained on ImageNet can be used to extract image features, e.g. eyes, corners, …
  • Can easily be re-used for any image classification task - not only the 1,000 standard categories
  • Enables extremely powerful transfer learning (analogous to the human brain) - see the sketch after this list
  • ⇒ Fuels the current AI hype
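A minimal sketch of this re-use idea, assuming Keras with a TensorFlow backend; MobileNet stands in for any ImageNet-pretrained CNN, and the random input frame is just a placeholder:

    # Re-use an ImageNet-pretrained MobileNet as a fixed feature extractor.
    import numpy as np
    from keras.applications.mobilenet import MobileNet, preprocess_input

    # include_top=False drops the 1000-class ImageNet head; pooling="avg" yields
    # one 1024-dimensional feature vector per 224x224 RGB image.
    feature_extractor = MobileNet(weights="imagenet", include_top=False,
                                  pooling="avg", input_shape=(224, 224, 3))

    frame = np.random.randint(0, 256, size=(1, 224, 224, 3)).astype("float32")  # placeholder frame
    features = feature_extractor.predict(preprocess_input(frame))
    print(features.shape)  # (1, 1024)

These per-frame feature vectors are exactly what the memory unit (LSTM) of the 1st prototype on slide 6 consumes.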

Convolutional Neural Nets Learn Hierarchical Features

ImageNet: 1,000 object categories, 1.2 million training images. Classification error in 2017: 2.3%(1)

5 of 13

Video Adds a Time Dimension ⇒ Much Harder!

(1) Figure from YouTube-8M: A Large-Scale Video Classification Benchmark, Google Research, Sep 2016. Enriched with additional video datasets.

Short-term: Detect “Flow” of Movement

Long-term: Remember Sequence (e.g. first lightning, then rain)

Required Large-Scale Video Datasets(1) only Emerging Since 2016

[Timeline of large-scale video datasets, 2010-2017, including Kinetics (2017), ChaLearn249 (2016) and 20BN Jester (2017)]
Subset of 20 Italian sign language gestures used for this prototype

6 of 13

1st Prototype: ImageNet Transfer Learning + Memory Unit

[Architecture diagram: 40 video frames (224x224) over time → CNN per frame → 1024x1 feature vector each → LSTMs → gesture]

Good result on the proof of concept: ~70% accuracy on the UCF101(1) human action video dataset (~10 sec per video)

But not satisfying on the short-term movements required for sign language: only ~40% accuracy on gestures from ChaLearn20(2) ...

🤔

(1) The UCF101 dataset consists of 101 human actions (e.g. badminton, applying lipstick) and 13,000 videos

(2) The ChaLearn dataset consists of 249 human gestures and 50,000 videos. For this prototype, 20 mostly Italian sign language gestures were used (3,200 videos)


1. Slice the video into 40 frames

2. Extract image “features” from the CNN “MobileNet”, pretrained on ImageNet

3. Feed the image features (sequentially) into a Long Short-Term Memory (LSTM) network to predict the gesture - a minimal sketch follows below
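A minimal sketch of step 3, assuming Keras; the layer sizes (256 LSTM units, 0.5 dropout) are illustrative values, not the prototype's actual configuration:

    # Classify a sequence of 40 MobileNet feature vectors (1024-dim each) with an LSTM head.
    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense

    NUM_FRAMES, NUM_FEATURES, NUM_CLASSES = 40, 1024, 20   # 20 ChaLearn gestures

    model = Sequential([
        LSTM(256, input_shape=(NUM_FRAMES, NUM_FEATURES)),  # reads the frame sequence
        Dropout(0.5),
        Dense(NUM_CLASSES, activation="softmax"),           # one probability per gesture
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_features, train_labels_onehot, ...)   # features precomputed per video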

7 of 13

1st Try: “Optical flow” Boosts Short-Term Movement Detection

Optical flow (instead of “normal” frames) increases accuracy on ChaLearn20 above 70% 😀

  • Calculated from 2 successive frames (see the sketch below)
  • Computationally very expensive!
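A minimal sketch of this step, assuming OpenCV's dense Farneback method; the prototype may use a different flow algorithm, and the file names are hypothetical placeholders:

    # Dense optical flow between two successive (grayscale) frames.
    import cv2
    import numpy as np

    prev = cv2.cvtColor(cv2.imread("frame_01.jpg"), cv2.COLOR_BGR2GRAY)  # hypothetical files
    curr = cv2.cvtColor(cv2.imread("frame_02.jpg"), cv2.COLOR_BGR2GRAY)

    # Positional args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags. Result: (height, width, 2) = x/y displacement per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    # Clip large displacements and rescale to [0, 255] so the two flow channels
    # can be fed to the network like ordinary image channels.
    flow = np.clip(flow, -15, 15)
    flow = ((flow + 15) / 30.0 * 255).astype(np.uint8)

Computing this for every pair of frames in every video is what makes the approach so expensive.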

[Training plot: accuracy over epochs]


8 of 13

1st Solution = State-of-the-Art: Pre-Trained 3D-CNN from Deepmind

[Architecture diagram: 40 optical-flow frames (224x224) over time → 3D-CNN → gesture]

93% accuracy on ChaLearn20 gestures(2)

  • 3D-CNNs add time as a 3rd dimension to “classical” 2D image CNNs ⇒ very complex models
  • In 2017 Deepmind published I3D(1), which is “inflated” from the famous 2D “Inception” CNN
  • I3D inherits pretraining from ImageNet + is pretrained on the Kinetics human action video dataset
  • Pretrained I3D can be used for transfer learning! (A toy sketch of the 3D-convolution idea follows below.)

(1) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, J Carreira, A Zisserman, Deepmind, 2017. Keras version available since 2018 https://github.com/dlpbc/keras-kinetics-i3d

(2) ChaLearn249 is a very hard challenge because the val/test sets are performed by “new” signers. The high score in the Summer 2016 competition was “only” 57% on the test set. My I3D network achieves 58% on the val set :)
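A toy sketch of the 3D-convolution idea only, assuming Keras - this is NOT the I3D architecture and the layer sizes are made up; it just shows how the kernel now also slides along the time axis:

    # Toy 3D-CNN (NOT I3D): convolution kernels span (time, height, width), so motion
    # patterns across frames are learned directly. Input: 40 optical-flow frames,
    # 224x224, 2 channels, as on this slide.
    from keras.models import Sequential
    from keras.layers import Conv3D, MaxPooling3D, GlobalAveragePooling3D, Dense

    model = Sequential([
        Conv3D(32, kernel_size=(3, 7, 7), strides=(1, 2, 2), activation="relu",
               input_shape=(40, 224, 224, 2)),   # (time, height, width, channels)
        MaxPooling3D(pool_size=(2, 2, 2)),
        Conv3D(64, kernel_size=(3, 3, 3), activation="relu"),
        GlobalAveragePooling3D(),
        Dense(20, activation="softmax"),          # 20 ChaLearn gestures
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

The prototype does not train such a network from scratch; it starts from the pretrained I3D (Keras port in footnote (1)) and fine-tunes it on the gesture data.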


9 of 13

Live-Demo of Sign Recognition Prototype

  • All training of the neural networks on the Amazon cloud with GPUs
  • Live demo on a MacBook Pro i5 (no GPU)
  • Using 20 Italian sign language gestures (3,200 videos) from the ChaLearn dataset
  • Prototype captures video (5 sec) from the attached webcam, displays the optical flow (~5 sec) and predicts the gesture with the I3D neural network (~5 sec) - see the capture sketch below
  • https://github.com/FrederikSchorr/sign-language
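A minimal sketch of the capture step, assuming OpenCV; this is a simplified stand-in for the real pipeline in the repository above:

    # Grab ~5 seconds of frames from the default webcam, resized to 224x224 for the
    # later optical-flow and I3D prediction steps.
    import time
    import cv2

    cap = cv2.VideoCapture(0)                # default webcam
    frames = []
    t_start = time.time()
    while time.time() - t_start < 5.0:       # record for ~5 seconds
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    print("captured %d frames" % len(frames))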

10 of 13

Sign Language Recognition Prototype: Screenshot

11 of 13

Sign Language Recognition Prototype: 1 min Video

12 of 13

Future: From Sign to Sentence

(1) Figure from Neural Sign Language Translation, Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, Richard Bowden, 2018. (2) Natural Language Processing

Anybody interested in making this happen?

Historically: isolate signs in the video stream, then translate them 1-by-1

New end-to-end approach: Sign Language Translation(1), merging video recognition with NLP(2)

13 of 13

(Almost) No Large-Scale Sign Language Datasets Available

Backup

Phoenix Weather by RWTH Aachen

  • Based on the 2009-2011 weather forecasts on German TV
  • Isolated gestures and whole annotated sentences
  • 7 signers, very homogeneous TV setting

ChaLearn by �Barcelona University

  • 249 (isolated) human gestures, 50,000 videos
  • 21 signers, heterogeneous background settings
  • Validation/test sets performed by “new” signers ⇒ very hard challenge
  • Subset of 20 Italian sign language gestures used for this prototype

LedaSila by �Alpen Adria Universität

  • Lexicon for Austrian sign language: 16,000 signs, 33,000 videos
  • Most words occur only once; only 10 words occur more than 20 times ⇒ not enough for neural network training

20BN Jester by twentybn.com

  • 27 human gestures, 150,000 videos
  • Similar to sign language gestures - but with very low complexity

Sign language on public TV

  • Several public TV stations offer sign language streams with subtitles
  • These could probably be scraped ⇒ separate project