1 of 34

Continuous Sign Language Recognition

Group 18 - 費群安, 馬莎琳, 齊婉平, 杜威

Department of Computer Science & Information Engineering,

National Central University, Taiwan.

2 of 34

Introduction

  • Sign language (SL) is an important means of communication for people with hearing loss (signers). However, a barrier remains when they communicate with hearing people (non-signers). SL is a complex language combining hand gestures, body movements, and facial expressions.
  • With today's image-capturing technology we can obtain not only the RGB image but also skeletal keypoints, giving us more information about a video.

3 of 34

Introduction & Motivation

  • Sign Language Recognition (SLR) is a particularly challenging problem:
    • Sign language requires both global body motion and delicate arm/hand gestures to express its meaning distinctly and accurately.
    • Similar gestures can carry different meanings depending on the number of repetitions.
    • Different signers perform sign language differently (e.g., speed, localisms, left- vs. right-handedness, body shape).
  • Hypothesis:
    • Skeleton-based methods can act as a strong complement to RGB / RGB-D based methods.
    • Combining skeletal and RGB methods may yield better results.
    • Different modalities contain different valuable information, and ensembling them typically improves overall performance (e.g., RGB + optical flow).

4 of 34

Contributions

  • Support SL users (hearing-impaired) and their community in communicating.
  • Proposed a novel approach that translates continuous gesture sequences into sentences using full-frame image and keypoint features.

5 of 34

Proposed Architecture

  • Spatial Module
    • Keypoint Features
    • Full-frame Feature
  • Temporal Module
    • Multi-feature Self-Attention Layer
  • Sequence Learning Module
    • Bidirectional Long Short-Term Memory (BiLSTM)
    • Connectionist Temporal Classification (CTC)
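
A minimal Keras sketch of how these modules could fit together (layer sizes, head count, and the fusion scheme are illustrative assumptions, not the exact configuration used here; MultiHeadAttention needs TF >= 2.4):

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    num_glosses = 100  # placeholder gloss vocabulary size

    # Inputs: per-frame feature maps and keypoints; n (frame length) is variable.
    frames_in = layers.Input(shape=(None, 56, 56, 256), name="full_frame_features")
    keypts_in = layers.Input(shape=(None, 27, 3), name="keypoint_features")

    # Spatial module: reduce each modality to one vector per frame, then fuse.
    f = layers.TimeDistributed(layers.GlobalAveragePooling2D())(frames_in)
    k = layers.TimeDistributed(layers.Flatten())(keypts_in)
    x = layers.Dense(512, activation="relu")(layers.Concatenate()([f, k]))

    # Temporal module: multi-feature self-attention over the frame axis.
    x = layers.Add()([x, layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)])

    # Sequence learning module: BiLSTM, then per-frame gloss posteriors for CTC.
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    out = layers.Dense(num_glosses + 1, activation="softmax")(x)  # +1 for CTC blank

    model = Model([frames_in, keypts_in], out)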

6 of 34

Whole-body Pose Estimation

  • Traditional 2D human pose estimation:
    • Only 16 or 17 points
    • Does not include hand keypoints
  • Problems with a separate hand pose model:
    • The hand pose estimator cannot work without a detector.
    • The hand detector fails under motion blur / low resolution.
  • 133-point whole-body keypoints:
    • Face: 68 points
    • Body: 17 points
    • Hands: 42 points (21 per hand)
    • Feet: 6 points
  • Advantages of a whole-body keypoint estimator:
    • Consistent and faithful estimation of hand keypoints
    • Resistant to motion blur

7 of 34

Pros and Cons of Skeleton-based SLR

Pros:

• High accuracy.

• No background interference.

• Signer-invariant.

• Lightweight network, easy to train.

Cons:

• Finger keypoint estimation may be inaccurate.

Solution:

• Inaccurate keypoints can be corrected by another modality (full-frame).

[Figure: example frame where a finger wasn't captured by the keypoint estimator]

8 of 34

Dataset

9 of 34

Dataset

  • Chinese Sign Language Recognition Dataset
    • 1920x1080 resolution
    • 100 classes/sentences
    • 25,000 videos
    • RGB + depth from Kinect

10 of 34

Spatial Module

11 of 34

Pretrained Weights

12 of 34

Pixel-Map Weight Training

  • Let the backbone layers (VGG conv) warm up
  • Obtain good weights for the full-frame feature
  • Improve recognition capability

[Figure: (56, 56, 256) feature maps; 16 samples visualized]
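
The slides don't spell out the pixel-map target, so this is only a plausible sketch: assume the warm-up head regresses per-keypoint heat maps from a VGG16 backbone truncated at block3_conv3 (the head, loss, and K are hypothetical):

    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import VGG16

    # Backbone truncated where the (56, 56, 256) feature maps come out.
    vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
    backbone = Model(vgg.input, vgg.get_layer("block3_conv3").output)

    K = 27  # hypothetical: one pixel map per selected keypoint
    heatmaps = layers.Conv2D(K, 1, activation="sigmoid")(backbone.output)
    warmup_model = Model(vgg.input, heatmaps)
    warmup_model.compile(optimizer="adam", loss="mse")
    # After warm-up, the backbone weights are reused for full-frame features.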

13 of 34

Preprocessing

14 of 34

Full-frame Feature

[Figure: full-frame input, cropped & resized to (n, 224, 224, 3), mapped to feature maps of shape (n, 56, 56, 256); 64 samples visualized]

*n = frame length
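
A minimal sketch of this step, assuming the (56, 56, 256) maps are taken from a VGG16 backbone truncated at block3_conv3, which produces exactly that shape for 224x224 inputs:

    import numpy as np
    from tensorflow.keras import Model
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input

    vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
    extractor = Model(vgg.input, vgg.get_layer("block3_conv3").output)

    def extract_full_frame_features(frames):
        """frames: (n, 224, 224, 3) cropped & resized RGB, n = frame length."""
        x = preprocess_input(frames.astype(np.float32))
        return extractor.predict(x, verbose=0)  # -> (n, 56, 56, 256)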

15 of 34

Keypoint Feature

  • We utilize High-Resolution Net (HRNet-W48) to extract keypoints from each video frame
  • Obtain 133 keypoints in total from the full-model prediction
  • Select 27 important keypoints as our input (see the sketch below)

(n x 1 x 27 x 3)

*n = frame length
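
A sketch of the 133-to-27 selection, assuming HRNet-W48 outputs keypoints in COCO-WholeBody order (body 0-16, feet 17-22, face 23-90, left hand 91-111, right hand 112-132); the exact 27 indices are not listed in the slides, so the set below is a hypothetical placeholder:

    import numpy as np

    # Hypothetical subset: nose, shoulders, elbows, wrists, ten points per hand.
    SELECTED = np.array(
        [0, 5, 6, 7, 8, 9, 10]    # body joints (COCO order)
        + list(range(91, 101))     # left-hand subset (assumed)
        + list(range(112, 122))    # right-hand subset (assumed)
    )
    assert len(SELECTED) == 27

    def select_keypoints(wholebody):
        """wholebody: (n, 133, 3) HRNet output, (x, y, confidence) per keypoint."""
        return wholebody[:, SELECTED, :][:, np.newaxis, :, :]  # -> (n, 1, 27, 3)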

16 of 34

Model Input

[Figure: model inputs: full-frame feature maps (n, 56, 56, 256), 64 samples visualized, and keypoints (n, 1, 27, 3), visualized]

*n = frame length

17 of 34

Temporal Module

18 of 34

Sequence Learning Module

19 of 34

Sequence Learning Overview

  • With the proposed Spatial Module Features (SMF) and Temporal Module Features (TMF), the network can generate an inter-cue feature sequence.
  • A BiLSTM processes this sequence, and its output is fed into a CTC layer that maps it to the sign gloss/label sequence.

20 of 34

Bidirectional Long Short-Term Memory (BiLSTM)

  • Recurrent neural networks (RNNs) can use their internal state to model state transitions in a sequence of inputs. Hence, we use an RNN to map the spatial-temporal feature sequence to its sign gloss sequence.
  • In our method, we select the BiLSTM unit as the recurrent unit for its ability to model long-term dependencies.
  • A BiLSTM concatenates the forward and backward hidden states from bidirectional inputs.
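
A minimal Keras sketch of this stage (the feature and hidden sizes are illustrative assumptions):

    from tensorflow.keras import layers

    # Inter-cue feature sequence from the spatial-temporal modules: (batch, n, 512).
    seq = layers.Input(shape=(None, 512))

    # Forward and backward passes each yield a 256-d hidden state per frame;
    # concatenation gives an output of shape (batch, n, 512).
    h = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(seq)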

21 of 34

Connectionist Temporal Classification (CTC)

  • Mainly used to tackle the problem of mapping a video sequence to an ordered sign gloss/label sequence when the explicit alignment between them is unknown.
  • The main objective of CTC is to maximize the sum of the probabilities of all possible alignment paths between the input and the target sequence.
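
A sketch of this objective using Keras' built-in CTC cost (names and shapes are illustrative):

    import tensorflow as tf

    def ctc_loss(y_true, y_pred, input_len, label_len):
        """y_pred: (batch, n, vocab + 1) per-frame softmax including the blank;
        y_true: (batch, max_label_len) gloss indices; input_len, label_len:
        (batch, 1). ctc_batch_cost sums the probabilities of all alignment
        paths that collapse to the target gloss sequence."""
        return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)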


22 of 34

Training

23 of 34

Overview

  • Model implemented with Keras & TensorFlow 2.0
  • Data split into training & test sets (80/20):
    • 20,000 videos for training
    • 5,000 videos for testing
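
A sketch of such a split; scikit-learn's train_test_split is one common way to do it (the paths and labels below are placeholders, and this is not necessarily how the split was produced):

    from sklearn.model_selection import train_test_split

    video_paths = [f"videos/clip_{i:05d}.mp4" for i in range(25000)]  # placeholder
    labels = [i % 100 for i in range(25000)]                          # placeholder

    # 80/20 split: 20,000 training videos, 5,000 test videos.
    train_videos, test_videos, train_labels, test_labels = train_test_split(
        video_paths, labels, test_size=0.2, random_state=42)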

24 of 34

Result

25 of 34

Training Result

  • Without Attention
  • 9% Word Error Rate (WER)

26 of 34

Training Result

  • Loaded weights from the previous training
  • 3% WER

27 of 34

Comparison With Other Results

  • Our result currently peaks at ~3% WER
  • It may be possible to improve this further by tweaking & fine-tuning.

28 of 34

Current Problems for Future Work

  • Training takes a long time: ~1 day per epoch
  • Precomputed full-frame features take a lot of storage
    • ~5 TB of data extracted from the videos
  • Prediction from the saved model also takes a long time

29 of 34

Future Works

  • Change the backbone to ResNet (ongoing):
    • Faster to load
    • Faster training time
    • Suitable for end-to-end prediction
    • Comparable results to the VGG configuration
  • Add a graph modality
  • Evaluate the model on other continuous SL datasets:
    • Greek Sign Language
    • Phoenix 2014 (German)

[Figure: ResNet feature maps (n, 7, 7, 512); 64 samples visualized]

*n = frame length

30 of 34

Conclusion

In this project we proposed a novel approach that combines full-frame images with keypoint features to achieve better translation. Our approach combines SMF and TMF, a BiLSTM, and CTC. Using this approach we achieved a 3% WER, which is competitive with existing solutions.

31 of 34

Demo

32 of 34

33 of 34

Q & A ?

34 of 34

Thank You!

謝謝!