BerlinNLP: Mozilla’s Deep Speech
Outline
Part I Core Architecture
I Deep Speech Architecture
II CTC Algorithm
III Language Model
IV Performance
Part II Future Architectural Variants
I Network Variants
II CTC Variants
Part III Open Speech Corpora
I Open Speech Corpora
II Project Common Voice
Part IV Future Directions
Part I Core Architecture
I Deep Speech Architecture
Deep Speech Architecture: Overview
Input Features
Feedforward Layers
Bidirectional RNN Layer
Feedforward Layer
Softmax Layer
Deep Speech Architecture: Input Features
Mel-Frequency Cepstral Coefficients
• 16-bit audio input at 16 kHz
• 25 ms audio window every 10 ms
• 26 cepstral coefficients
• Stride of 2
• Context window width 9
• Data “whitened” before use
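One way to compute these features, sketched below assuming the python_speech_features package and a hypothetical sample.wav. The window, step, coefficient count, stride, and context values come from this slide; the whitening and context-windowing details are an illustrative reading, not Mozilla's exact pipeline.

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("sample.wav")        # 16-bit PCM at 16 kHz
features = mfcc(signal, samplerate=rate,
                winlen=0.025,                # 25 ms audio window
                winstep=0.01,                # every 10 ms
                numcep=26)                   # 26 cepstral coefficients

# "Whiten" the features: zero mean, unit variance per coefficient.
features = (features - features.mean(axis=0)) / features.std(axis=0)

# Context window of width 9 (4 past + current + 4 future frames),
# taking every 2nd frame (stride of 2).
context = 4
frames = [np.concatenate(features[i - context:i + context + 1])
          for i in range(context, len(features) - context, 2)]
```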
Deep Speech Architecture: Feedforward Layers
Feedforward Layers
• 3 layers
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
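The clipped ReLU used in these layers is simply min(max(x, 0), 20); a one-line NumPy sketch:

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    # ReLU with its output clipped at 20.
    return np.minimum(np.maximum(x, 0.0), clip)
```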
Deep Speech Architecture: Bidirectional RNN Layer
Bidirectional RNN Layer
• 1 layer
• Layer width 2048
• LSTM cells
• No clipping
• Dropout 0.20 to 0.30
Deep Speech Architecture: Feedforward Layer
Feedforward Layer
• 1 layer
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
Deep Speech Architecture: Softmax Layer
Softmax Layer
• L ≡ Alphabet
• Output width k ≡ |L| + 1
• One extra output for the “blank” label
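Putting the preceding slides together, a minimal tf.keras sketch of the full stack. Layer count, widths, clipping, and dropout come from the slides; the function name and everything else is illustrative, not Mozilla's actual code.

```python
import tensorflow as tf

def deep_speech_model(n_input, alphabet_size, width=2048, dropout=0.25):
    # k = |L| + 1: one extra output for the CTC "blank" label.
    k = alphabet_size + 1
    x = inputs = tf.keras.Input(shape=(None, n_input))  # (time, features)
    # Three feedforward layers: clipped ReLU, dropout.
    for _ in range(3):
        x = tf.keras.layers.Dense(width)(x)
        x = tf.keras.layers.ReLU(max_value=20.0)(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    # One bidirectional LSTM layer, no clipping.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(width, return_sequences=True))(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    # One more clipped-ReLU feedforward layer.
    x = tf.keras.layers.Dense(width)(x)
    x = tf.keras.layers.ReLU(max_value=20.0)(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    # Softmax over the alphabet plus blank.
    outputs = tf.keras.layers.Dense(k, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```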
II CTC Algorithm
CTC Algorithm: Path Probabilities
(Diagram: the softmax emits a 1-of-k distribution at each of the T time steps)
• L ≡ Alphabet
• k ≡ |L| + 1
• Extra “blank” label
• Path ≡ sequence of T characters
• π ∈ L'^T
• L' ≡ L ∪ {blank}
• T ≡ number of time steps
Path probability: p(π | x) = ∏_{t=1}^{T} y^t_{π_t}, where y^t_{π_t} is the softmax output for character π_t at time step t.
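A direct NumPy transcription of the path probability above:

```python
import numpy as np

def path_probability(y, path):
    # y: (T, k) softmax outputs; path: length-T sequence of label indices.
    # p(pi | x) = product over t of y[t, pi_t].
    T = len(path)
    return float(np.prod(y[np.arange(T), path]))
```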
CTC Algorithm: Label Probabilities
(Diagram: ℬ maps Paths in L'^T to Labels in L^{≤T})
Def: ℬ
• Remove repeated characters
• Remove blanks
e.g. ℬ(“—aa——ab”) = “aab”
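A minimal sketch of ℬ, using '-' for the blank:

```python
def collapse(path, blank="-"):
    # B: remove repeated characters, then remove blanks.
    # e.g. collapse("-aa--ab") -> "aab"
    out = []
    prev = None
    for c in path:
        if c != prev:
            out.append(c)
        prev = c
    return "".join(c for c in out if c != blank)
```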
CTC Algorithm: Label Probabilities
(Diagram: softmax outputs y^t_{π_t} over time steps 1…T)
Label probability: p(l | x) = Σ_{π ∈ ℬ⁻¹(l)} p(π | x)
Problem: the sum is big; the number of paths grows exponentially with T.
Solution: the forward-backward algorithm computes the sum efficiently by dynamic programming.
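A sketch of the forward half of the forward-backward algorithm (the α recursion of Graves et al., 2006). It is written in plain probability space for clarity; real implementations work in log space to avoid underflow.

```python
import numpy as np

def ctc_label_probability(y, label, blank=0):
    # y: (T, k) softmax outputs; label: list of label indices (no blanks).
    T = y.shape[0]
    # Extended label with blanks interleaved: -, l1, -, l2, ..., lU, -
    ext = [blank]
    for l in label:
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition allowed unless blank or repeated character.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    # p(l | x): paths may end in the last label or the trailing blank.
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
```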
III Language Model
Language Model: Definition
(Diagram: the language model assigns a probability p_LM(l_i) to each label sequence l_i)
Def: Language Model
• A probability distribution over sequences of characters
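An n-gram model is the usual choice here; Mozilla's DeepSpeech uses KenLM for this. A sketch, assuming a prebuilt model file lm.binary (a placeholder name):

```python
import kenlm  # Python bindings for the KenLM n-gram toolkit

# "lm.binary" stands in for a prebuilt KenLM model file.
model = kenlm.Model("lm.binary")

# score() returns the log10 probability of the sentence under the LM.
print(model.score("hello world", bos=True, eos=True))
```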
Language Model: Loss Function
(Diagram: softmax outputs y^t_{π_t} over time steps 1…T)
Loss Function Version 1.0
Q(l) = log p(l | x)
Loss Function Version 2.0
Q(l) = log p(l | x) + α log p_LM(l)
Loss Function Version 3.0
Q(l) = log p(l | x) + α log p_LM(l) + β word_count(l) + β' valid_word_count(l)
α = 2.15
β = -0.10
β' = 1.10
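Reading β and β' as word-count and valid-word-count weights (an assumption; the slide only lists the three values), the Version 3.0 score of a candidate transcript might look like:

```python
# Weights from the slide above.
ALPHA = 2.15        # language model weight
BETA = -0.10        # word count weight
BETA_PRIME = 1.10   # valid word count weight (assumed reading)

def decoder_score(log_p_ctc, log_p_lm, words, vocabulary):
    # Version 3.0 score for one candidate transcript; the word-count
    # term names are assumptions, only the weights come from the slide.
    word_count = len(words)
    valid_word_count = sum(1 for w in words if w in vocabulary)
    return (log_p_ctc
            + ALPHA * log_p_lm
            + BETA * word_count
            + BETA_PRIME * valid_word_count)
```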
IV Performance
Performance: WER
Training Data
• TED (Approx 200 hours)
• Fisher (Approx 2000 hours)
• Librivox (Approx 1000 hours)
On the Librivox clean test set: 6.48% WER
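Word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length; a self-contained sketch:

```python
def wer(reference, hypothesis):
    # Word error rate: word-level edit distance / reference word count.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```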
Part II Future Architectural Variants
I Network Variants
Network Variants: Deep Speech 2 Architecture
Input Features
Convolutional Layers
(Bidirectional) RNN Layer
Softmax Layer
CTC Layer
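A tf.keras sketch of the convolutional front end above, which replaces Deep Speech 1's feedforward layers; the filter shapes and strides are illustrative, not the exact Deep Speech 2 hyperparameters.

```python
import tensorflow as tf

def ds2_front_end(n_freq):
    # Convolutions over a (time, frequency, 1) spectrogram input.
    inputs = tf.keras.Input(shape=(None, n_freq, 1))
    x = tf.keras.layers.Conv2D(32, (11, 41), strides=(2, 2),
                               padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, (11, 21), strides=(1, 2),
                               padding="same", activation="relu")(x)
    return tf.keras.Model(inputs, x)
```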
II CTC Variants
CTC Variants: RNN Transducer
(Diagram: CTC path probability from softmax outputs y^t_{π_t}, t = 1…T)
(Diagram: the transducer combines the top-layer acoustic outputs h^(5)_1 … h^(5)_T with a prediction RNN)
• Path probability: as in CTC, built from per-time-step output probabilities
• Character probability: each output now also conditions on the characters emitted so far
• RNN probability: supplied by a prediction network, a character-level RNN run alongside the acoustic model
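In Graves's RNN transducer (2012) the per-step output distribution combines the acoustic (transcription) network output f_t with the prediction-network output g_u; a minimal sketch of that additive joint:

```python
import numpy as np

def transducer_distribution(f_t, g_u):
    # RNN transducer joint (Graves 2012): combine the acoustic output
    # f_t and the prediction-network output g_u additively, then
    # normalize: p(k | t, u) proportional to exp(f_t[k] + g_u[k]).
    logits = f_t + g_u
    e = np.exp(logits - logits.max())
    return e / e.sum()
```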
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: the Encoder (BiRNN) produces annotation vectors h_1 … h_T; the Attention Module combines them with the decoder hidden state s_{i-1} into a context vector c_i; the Decoder (RNN) emits the probability p of the next label l)
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: annotation vectors built from the forward and backward encoder states, e.g. h_1 = (h_1^f, h_2^f, h_3^f, h_1^b, h_2^b, h_3^b); example output path: a — — a b —)
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: the 1st context vector is computed with the 0th decoder hidden state, the 2nd context vector with the 1st decoder hidden state; example input: a a b c c c)
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: matrix of attention scores e_ij over decoder steps i and encoder steps j)
e_ij = a(s_{i-1}, h_j)
• s_{i-1} ≡ decoder hidden state
• h_j ≡ annotation vector
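The slide does not spell out the scoring function a; one standard choice is the additive form of Bahdanau et al. (2015), sketched here with hypothetical learned parameters W_s, W_h, and v:

```python
import numpy as np

def attention_context(s_prev, H, W_s, W_h, v):
    # Additive attention (Bahdanau et al. 2015):
    #   e_ij = a(s_{i-1}, h_j) = v . tanh(W_s s_{i-1} + W_h h_j)
    # H is the (T, d) matrix of annotation vectors h_1 ... h_T.
    e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])
    a = np.exp(e - e.max())
    a /= a.sum()          # attention weights over the T time steps
    return a @ H          # context vector c_i
```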
Part III Open Speech Corpora
I Open Speech Corpora
Open Speech Corpora: Open, Commercially Usable Corpora
Librivox
• 16-bit audio at 16 kHz
• 1000 hours of audio
• Read speech
• Clean subset
• Dirty subset
VoxForge
• 16-bit audio at 16 kHz
• 100 hours of audio
• Read speech
II Project Common Voice
Project Common Voice: Overview
Project Common Voice: Recording
Project Common Voice: Validating
Part IV Future Directions
Future Directions...
• Production-ready packaging
• Evaluating network variants
• Evaluating CTC variants
• Hyperparameter tuning
• Network quantization
• Other languages