1 of 34

Continuous Sign Language Recognition

Group 18 - 費群安, 馬莎琳, 齊婉平, 杜威

Department of Computer Science & Information Engineering,

National Central University, Taiwan.

2 of 34

Introduction

  • Sign language (SL) is an important means of communication for people with hearing loss (signers). However, a barrier remains when they communicate with hearing people (non-signers). SL is a complex language combining hand gestures, body movements, and facial expressions.
  • With today's image-capturing technology we can obtain not only the RGB image but also skeletal keypoints, giving us more information about a video.

3 of 34

Introduction & Motivation

  • Sign Language Recognition (SLR) is a particularly challenging problem:
    • Sign language requires both global body motion and delicate arm/hand gestures to express its meaning distinctly and accurately.
    • Similar gestures can carry different meanings depending on the number of repetitions.
    • Different signers perform sign language differently (e.g., speed, localisms, left- vs. right-handedness, body shape).
  • Hypothesis:
    • Skeleton-based methods can act as a strong complement to RGB / RGB-D based methods.
    • Combining skeletal and RGB methods may yield better results.
    • Different modalities contain different valuable information, and ensembling them typically improves overall performance (e.g., RGB + optical flow).

4 of 34

Contributions

  • Support SL users (hearing-impaired) and their community in communicating.
  • Proposed a novel approach that translates continuous gesture sequences into sentences using full-frame image and keypoint features.

5 of 34

Proposed Architecture

  • Spatial Module
    • Keypoint Features
    • Full-frame Feature
  • Temporal Module
    • Multi-feature Self-Attention Layer
  • Sequence Learning Module
    • Bidirectional Long Short-Term Memory (BiLSTM)
    • Connectionist Temporal Classification (CTC)
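
A minimal Keras sketch of how these modules could fit together (layer sizes, head count, and the fusion scheme are illustrative assumptions, not the exact configuration used here; MultiHeadAttention needs TF >= 2.4):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    num_glosses = 100  # placeholder gloss vocabulary size

    # Inputs: per-frame feature maps and keypoints; n (frame length) is variable.
    frames_in = layers.Input(shape=(None, 56, 56, 256), name="full_frame_features")
    keypts_in = layers.Input(shape=(None, 27, 3), name="keypoint_features")

    # Spatial module: reduce each modality to one vector per frame, then fuse.
    f = layers.TimeDistributed(layers.GlobalAveragePooling2D())(frames_in)
    k = layers.TimeDistributed(layers.Flatten())(keypts_in)
    x = layers.Dense(512, activation="relu")(layers.Concatenate()([f, k]))

    # Temporal module: multi-feature self-attention over the frame axis.
    x = layers.Add()([x, layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)])

    # Sequence learning module: BiLSTM, then per-frame gloss posteriors for CTC.
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    out = layers.Dense(num_glosses + 1, activation="softmax")(x)  # +1 for CTC blank

    model = Model([frames_in, keypts_in], out)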

6 of 34

Whole-body Pose Estimation

  • Traditional 2D human pose estimation:
    • Only 16 or 17 points
    • Does not include hand keypoints
  • Problems with a separate hand pose model:
    • The hand pose estimator cannot work without a detector.
    • The hand detector fails under motion blur / low resolution.
  • 133-point whole-body keypoints:
    • Face: 68 points
    • Body: 17 points
    • Hands: 42 points (21 per hand)
    • Feet: 6 points
  • Advantages of a whole-body keypoint estimator:
    • Consistent and faithful estimation of hand keypoints
    • Resistant to motion blur

7 of 34

Pros and Cons of Skeleton-based SLR

Pros:

• High accuracy.

• No background interference.

• Signer-invariant.

• Lightweight network, easy to train.

Cons:

• Finger keypoint estimation may be inaccurate.

Solution:

• Inaccurate keypoints can be corrected by another modality (full-frame).

[Figure: example frame where a finger wasn't captured by the keypoint estimator]

8 of 34

Dataset

9 of 34

Dataset

  • Chinese Sign Language Recognition Dataset
    • 1920x1080 resolution
    • 100 classes/sentences
    • 25,000 videos
    • RGB + depth from Kinect

10 of 34

Spatial Module

11 of 34

Pretrained Weights

12 of 34

Pixel-Map Weight Training

  • Let the backbone layers (VGG conv) warm up
  • Obtain good weights for the full-frame feature
  • Improve recognition capability

[Figure: (56, 56, 256) feature maps; 16 samples visualized]
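
The slides don't spell out the pixel-map target, so this is only a plausible sketch: assume the warm-up head regresses per-keypoint heat maps from a VGG16 backbone truncated at block3_conv3 (the head, loss, and K are hypothetical):

    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import VGG16

    # Backbone truncated where the (56, 56, 256) feature maps come out.
    vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
    backbone = Model(vgg.input, vgg.get_layer("block3_conv3").output)

    K = 27  # hypothetical: one pixel map per selected keypoint
    heatmaps = layers.Conv2D(K, 1, activation="sigmoid")(backbone.output)
    warmup_model = Model(vgg.input, heatmaps)
    warmup_model.compile(optimizer="adam", loss="mse")
    # After warm-up, the backbone weights are reused for full-frame features.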

13 of 34

Preprocessing

14 of 34

Full-frame Feature

[Figure: full-frame input, cropped & resized to (n, 224, 224, 3), mapped to feature maps of shape (n, 56, 56, 256); 64 samples visualized]

*n = frame length
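
A minimal sketch of this step, assuming the (56, 56, 256) maps are taken from a VGG16 backbone truncated at block3_conv3, which produces exactly that shape for 224x224 inputs:

    import numpy as np
    from tensorflow.keras import Model
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input

    vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
    extractor = Model(vgg.input, vgg.get_layer("block3_conv3").output)

    def extract_full_frame_features(frames):
        """frames: (n, 224, 224, 3) cropped & resized RGB, n = frame length."""
        x = preprocess_input(frames.astype(np.float32))
        return extractor.predict(x, verbose=0)  # -> (n, 56, 56, 256)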

15 of 34

Keypoint Feature

  • We utilize High-Resolution Net (HRNet-W48) to extract keypoints from each video frame
  • Obtain 133 keypoints in total from the full-model prediction
  • Select 27 important keypoints as our input (see the sketch below)

(n x 1 x 27 x 3)

*n = frame length
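
A sketch of the 133-to-27 selection, assuming HRNet-W48 outputs keypoints in COCO-WholeBody order (body 0-16, feet 17-22, face 23-90, left hand 91-111, right hand 112-132); the exact 27 indices are not listed in the slides, so the set below is a hypothetical placeholder:

    import numpy as np

    # Hypothetical subset: nose, shoulders, elbows, wrists, ten points per hand.
    SELECTED = np.array(
        [0, 5, 6, 7, 8, 9, 10]    # body joints (COCO order)
        + list(range(91, 101))     # left-hand subset (assumed)
        + list(range(112, 122))    # right-hand subset (assumed)
    )
    assert len(SELECTED) == 27

    def select_keypoints(wholebody):
        """wholebody: (n, 133, 3) HRNet output, (x, y, confidence) per keypoint."""
        return wholebody[:, SELECTED, :][:, np.newaxis, :, :]  # -> (n, 1, 27, 3)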

16 of 34

Model Input

[Figure: model inputs: full-frame feature maps (n, 56, 56, 256), 64 samples visualized, and keypoints (n, 1, 27, 3), visualized]

*n = frame length

17 of 34

Temporal Module

18 of 34

Sequence Learning Module

19 of 34

Sequence Learning Overview

  • With the proposed Spatial Module Features (SMF) and Temporal Module Features (TMF), the network can generate an inter-cue feature sequence.
  • A BiLSTM processes this sequence, and its output is fed into a CTC layer that maps it to the sign gloss/label sequence.

20 of 34

Bidirectional Long Short-Term Memory (BiLSTM)

  • Recurrent neural networks (RNNs) can use their internal state to model state transitions in a sequence of inputs. Hence, we use an RNN to map the spatial-temporal feature sequence to its sign gloss sequence.
  • In our method, we select the BiLSTM unit as the recurrent unit for its ability to model long-term dependencies.
  • A BiLSTM concatenates the forward and backward hidden states from bidirectional inputs.
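
A minimal Keras sketch of this stage (the feature and hidden sizes are illustrative assumptions):

    from tensorflow.keras import layers

    # Inter-cue feature sequence from the spatial-temporal modules: (batch, n, 512).
    seq = layers.Input(shape=(None, 512))

    # Forward and backward passes each yield a 256-d hidden state per frame;
    # concatenation gives an output of shape (batch, n, 512).
    h = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(seq)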

21 of 34

Connectionist Temporal Classification (CTC)

  • Mainly used to tackle the problem of mapping a video sequence to an ordered sign gloss/label sequence when the explicit alignment between them is unknown.
  • The main objective of CTC is to maximize the sum of the probabilities of all possible alignment paths between the input and the target sequence.
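
A sketch of this objective using Keras' built-in CTC cost (names and shapes are illustrative):

    import tensorflow as tf

    def ctc_loss(y_true, y_pred, input_len, label_len):
        """y_pred: (batch, n, vocab + 1) per-frame softmax including the blank;
        y_true: (batch, max_label_len) gloss indices; input_len, label_len:
        (batch, 1). ctc_batch_cost sums the probabilities of all alignment
        paths that collapse to the target gloss sequence."""
        return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)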


22 of 34

Training

23 of 34

Overview

  • Model implemented with Keras & TensorFlow 2.0
  • Data split into training & test sets (80/20):
    • 20,000 videos for training
    • 5,000 videos for testing
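
A sketch of such a split; scikit-learn's train_test_split is one common way to do it (the paths and labels below are placeholders, and this is not necessarily how the split was produced):

    from sklearn.model_selection import train_test_split

    video_paths = [f"videos/clip_{i:05d}.mp4" for i in range(25000)]  # placeholder
    labels = [i % 100 for i in range(25000)]                          # placeholder

    # 80/20 split: 20,000 training videos, 5,000 test videos.
    train_videos, test_videos, train_labels, test_labels = train_test_split(
        video_paths, labels, test_size=0.2, random_state=42)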

24 of 34

Result

25 of 34

Training Result

  • Without Attention
  • 9% Word Error Rate (WER)

26 of 34

Training Result

  • Loaded weights from the previous training
  • 3% WER

27 of 34

Comparison With Other Results

  • Our result currently peaks at ~3% WER
  • It may be possible to improve this further by tweaking & fine-tuning.

28 of 34

Current Problems for Future Work

  • Training takes a long time: ~1 day per epoch
  • Precomputed full-frame features take a lot of storage
    • ~5 TB of data extracted from the videos
  • Prediction from the saved model also takes a long time

29 of 34

Future Works

  • Change the backbone to ResNet (ongoing):
    • Faster to load
    • Faster training time
    • Suitable for end-to-end prediction
    • Comparable results to the VGG configuration
  • Add a graph modality
  • Evaluate the model on other continuous SL datasets:
    • Greek Sign Language
    • Phoenix 2014 (German)

[Figure: ResNet feature maps (n, 7, 7, 512); 64 samples visualized]

*n = frame length

30 of 34

Conclusion

In this project we proposed a novel approach that combines full-frame images with keypoint features to achieve better translation. Our approach combines SMF and TMF, a BiLSTM, and CTC. Using this approach we achieved a 3% WER, which is competitive with existing solutions.

31 of 34

Demo

32 of 34

33 of 34

Q & A ?

34 of 34

Thank You!

謝謝!