BerlinNLP: Mozilla’s Deep Speech
Outline
Part I Core Architecture
I Deep Speech Architecture
II CTC Algorithm
III Language Model
IV Performance
Part II Future Architectural Variants
I Network Variants
II CTC Variants
Part III Open Speech Corpora
I Open Speech Corpora
II Project Common Voice
Part IV Future Directions
Part I Core Architecture
I Deep Speech Architecture
Deep Speech Architecture: Overview
Input Features
Feedforward Layers
Bidirectional RNN Layer
Feedforward Layer
Softmax Layer
Deep Speech Architecture: Input Features
Mel-Frequency Cepstral Coefficients
• 16-bit audio input at 16 kHz
• 25 ms audio window every 10 ms
• 26 cepstral coefficients
• Stride of 2
• Context window width 9
• Data “whitened” before use
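One way to compute these features, sketched below assuming the python_speech_features package and a hypothetical sample.wav. The window, step, coefficient count, stride, and context values come from this slide; the whitening and context-windowing details are an illustrative reading, not Mozilla's exact pipeline.

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("sample.wav")        # 16-bit PCM at 16 kHz
features = mfcc(signal, samplerate=rate,
                winlen=0.025,                # 25 ms audio window
                winstep=0.01,                # every 10 ms
                numcep=26)                   # 26 cepstral coefficients

# "Whiten" the features: zero mean, unit variance per coefficient.
features = (features - features.mean(axis=0)) / features.std(axis=0)

# Context window of width 9 (4 past + current + 4 future frames),
# taking every 2nd frame (stride of 2).
context = 4
frames = [np.concatenate(features[i - context:i + context + 1])
          for i in range(context, len(features) - context, 2)]
```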
Deep Speech Architecture: Feedforward Layers
Feedforward Layers
• 3 layers
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
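The clipped ReLU used in these layers is simply min(max(x, 0), 20); a one-line NumPy sketch:

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    # ReLU with its output clipped at 20.
    return np.minimum(np.maximum(x, 0.0), clip)
```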
Deep Speech Architecture: Bidirectional RNN Layer
Bidirectional RNN Layer
• 1 layer
• Layer width 2048
• LSTM cells
• No clipping
• Dropout 0.20 to 0.30
Deep Speech Architecture: Feedforward Layer
Feedforward Layer
• 1 layer
• Layer width 2048
• ReLU cells
• ReLU clipped at 20
• Dropout 0.20 to 0.30
Deep Speech Architecture: Softmax Layer
Softmax Layer
• L ≡ Alphabet
• Output width k ≡ |L| + 1
• One extra output for the “blank” label
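Putting the preceding slides together, a minimal tf.keras sketch of the full stack. Layer count, widths, clipping, and dropout come from the slides; the function name and everything else is illustrative, not Mozilla's actual code.

```python
import tensorflow as tf

def deep_speech_model(n_input, alphabet_size, width=2048, dropout=0.25):
    # k = |L| + 1: one extra output for the CTC "blank" label.
    k = alphabet_size + 1
    x = inputs = tf.keras.Input(shape=(None, n_input))  # (time, features)
    # Three feedforward layers: clipped ReLU, dropout.
    for _ in range(3):
        x = tf.keras.layers.Dense(width)(x)
        x = tf.keras.layers.ReLU(max_value=20.0)(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    # One bidirectional LSTM layer, no clipping.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(width, return_sequences=True))(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    # One more clipped-ReLU feedforward layer.
    x = tf.keras.layers.Dense(width)(x)
    x = tf.keras.layers.ReLU(max_value=20.0)(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    # Softmax over the alphabet plus blank.
    outputs = tf.keras.layers.Dense(k, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```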
II CTC Algorithm
CTC Algorithm: Path Probabilities
(Diagram: the softmax emits a 1-of-k distribution at each of the T time steps)
• L ≡ Alphabet
• k ≡ |L| + 1
• Extra “blank” label
• Path ≡ sequence of T characters
• π ∈ L'^T
• L' ≡ L ∪ {blank}
• T ≡ number of time steps
Path probability: p(π | x) = ∏_{t=1}^{T} y^t_{π_t}, where y^t_{π_t} is the softmax output for character π_t at time step t.
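A direct NumPy transcription of the path probability above:

```python
import numpy as np

def path_probability(y, path):
    # y: (T, k) softmax outputs; path: length-T sequence of label indices.
    # p(pi | x) = product over t of y[t, pi_t].
    T = len(path)
    return float(np.prod(y[np.arange(T), path]))
```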
CTC Algorithm: Label Probabilities
(Diagram: ℬ maps Paths in L'^T to Labels in L^{≤T})
Def: ℬ
• Remove repeated characters
• Remove blanks
e.g. ℬ(“—aa——ab”) = “aab”
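A minimal sketch of ℬ, using '-' for the blank:

```python
def collapse(path, blank="-"):
    # B: remove repeated characters, then remove blanks.
    # e.g. collapse("-aa--ab") -> "aab"
    out = []
    prev = None
    for c in path:
        if c != prev:
            out.append(c)
        prev = c
    return "".join(c for c in out if c != blank)
```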
CTC Algorithm: Label Probabilities
(Diagram: softmax outputs y^t_{π_t} over time steps 1…T)
Label probability: p(l | x) = Σ_{π ∈ ℬ⁻¹(l)} p(π | x)
Problem: the sum is big; the number of paths grows exponentially with T.
Solution: the forward-backward algorithm computes the sum efficiently by dynamic programming.
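A sketch of the forward half of the forward-backward algorithm (the α recursion of Graves et al., 2006). It is written in plain probability space for clarity; real implementations work in log space to avoid underflow.

```python
import numpy as np

def ctc_label_probability(y, label, blank=0):
    # y: (T, k) softmax outputs; label: list of label indices (no blanks).
    T = y.shape[0]
    # Extended label with blanks interleaved: -, l1, -, l2, ..., lU, -
    ext = [blank]
    for l in label:
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition allowed unless blank or repeated character.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    # p(l | x): paths may end in the last label or the trailing blank.
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
```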
III Language Model
Language Model: Definition
(Diagram: the language model assigns a probability p_LM(l_i) to each label sequence l_i)
Def: Language Model
• A probability distribution over sequences of characters
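An n-gram model is the usual choice here; Mozilla's DeepSpeech uses KenLM for this. A sketch, assuming a prebuilt model file lm.binary (a placeholder name):

```python
import kenlm  # Python bindings for the KenLM n-gram toolkit

# "lm.binary" stands in for a prebuilt KenLM model file.
model = kenlm.Model("lm.binary")

# score() returns the log10 probability of the sentence under the LM.
print(model.score("hello world", bos=True, eos=True))
```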
Language Model: Loss Function
(Diagram: softmax outputs y^t_{π_t} over time steps 1…T)
Loss Function Version 1.0
Q(l) = log p(l | x)
Loss Function Version 2.0
Q(l) = log p(l | x) + α log p_LM(l)
Loss Function Version 3.0
Q(l) = log p(l | x) + α log p_LM(l) + β word_count(l) + β' valid_word_count(l)
α = 2.15
β = -0.10
β' = 1.10
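Reading β and β' as word-count and valid-word-count weights (an assumption; the slide only lists the three values), the Version 3.0 score of a candidate transcript might look like:

```python
# Weights from the slide above.
ALPHA = 2.15        # language model weight
BETA = -0.10        # word count weight
BETA_PRIME = 1.10   # valid word count weight (assumed reading)

def decoder_score(log_p_ctc, log_p_lm, words, vocabulary):
    # Version 3.0 score for one candidate transcript; the word-count
    # term names are assumptions, only the weights come from the slide.
    word_count = len(words)
    valid_word_count = sum(1 for w in words if w in vocabulary)
    return (log_p_ctc
            + ALPHA * log_p_lm
            + BETA * word_count
            + BETA_PRIME * valid_word_count)
```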
IV Performance
Performance: WER
Training Data
• TED (Approx 200 hours)
• Fisher (Approx 2000 hours)
• Librivox (Approx 1000 hours)
On the Librivox clean test set: 6.48% WER
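Word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length; a self-contained sketch:

```python
def wer(reference, hypothesis):
    # Word error rate: word-level edit distance / reference word count.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```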
Part II Future Architectural Variants
I Network Variants
Network Variants: Deep Speech 2 Architecture
Input Features
Convolutional Layers
(Bidirectional) RNN Layer
Softmax Layer
CTC Layer
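A tf.keras sketch of the convolutional front end above, which replaces Deep Speech 1's feedforward layers; the filter shapes and strides are illustrative, not the exact Deep Speech 2 hyperparameters.

```python
import tensorflow as tf

def ds2_front_end(n_freq):
    # Convolutions over a (time, frequency, 1) spectrogram input.
    inputs = tf.keras.Input(shape=(None, n_freq, 1))
    x = tf.keras.layers.Conv2D(32, (11, 41), strides=(2, 2),
                               padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, (11, 21), strides=(1, 2),
                               padding="same", activation="relu")(x)
    return tf.keras.Model(inputs, x)
```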
II CTC Variants
CTC Variants: RNN Transducer
(Diagram: CTC path probability from softmax outputs y^t_{π_t}, t = 1…T)
(Diagram: the transducer combines the top-layer acoustic outputs h^(5)_1 … h^(5)_T with a prediction RNN)
• Path probability: as in CTC, built from per-time-step output probabilities
• Character probability: each output now also conditions on the characters emitted so far
• RNN probability: supplied by a prediction network, a character-level RNN run alongside the acoustic model
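In Graves's RNN transducer (2012) the per-step output distribution combines the acoustic (transcription) network output f_t with the prediction-network output g_u; a minimal sketch of that additive joint:

```python
import numpy as np

def transducer_distribution(f_t, g_u):
    # RNN transducer joint (Graves 2012): combine the acoustic output
    # f_t and the prediction-network output g_u additively, then
    # normalize: p(k | t, u) proportional to exp(f_t[k] + g_u[k]).
    logits = f_t + g_u
    e = np.exp(logits - logits.max())
    return e / e.sum()
```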
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: the Encoder (BiRNN) produces annotation vectors h_1 … h_T; the Attention Module combines them with the decoder hidden state s_{i-1} into a context vector c_i; the Decoder (RNN) emits the probability p of the next label l)
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: annotation vectors built from the forward and backward encoder states, e.g. h_1 = (h_1^f, h_2^f, h_3^f, h_1^b, h_2^b, h_3^b); example output path: a — — a b —)
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: the 1st context vector is computed with the 0th decoder hidden state, the 2nd context vector with the 1st decoder hidden state; example input: a a b c c c)
CTC Variants: Sequence-to-Sequence Model with Attention
(Diagram: matrix of attention scores e_ij over decoder steps i and encoder steps j)
e_ij = a(s_{i-1}, h_j)
• s_{i-1} ≡ decoder hidden state
• h_j ≡ annotation vector
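The slide does not spell out the scoring function a; one standard choice is the additive form of Bahdanau et al. (2015), sketched here with hypothetical learned parameters W_s, W_h, and v:

```python
import numpy as np

def attention_context(s_prev, H, W_s, W_h, v):
    # Additive attention (Bahdanau et al. 2015):
    #   e_ij = a(s_{i-1}, h_j) = v . tanh(W_s s_{i-1} + W_h h_j)
    # H is the (T, d) matrix of annotation vectors h_1 ... h_T.
    e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h) for h in H])
    a = np.exp(e - e.max())
    a /= a.sum()          # attention weights over the T time steps
    return a @ H          # context vector c_i
```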
Part III Open Speech Corpora
I Open Speech Corpora
Open Speech Corpora: Open, Commercially Usable Corpora
Librivox
• 16-bit audio at 16 kHz
• 1000 hours of audio
• Read speech
• Clean subset
• Dirty subset
VoxForge
• 16-bit audio at 16 kHz
• 100 hours of audio
• Read speech
II Project Common Voice
Project Common Voice: Overview
Project Common Voice: Recording
Project Common Voice: Validating
Part IV Future Directions
Future Directions...
• Production-ready packaging
• Evaluating network variants
• Evaluating CTC variants
• Hyperparameter tuning
• Network quantization
• Other languages