1 of 47

BerlinNLP: Mozilla’s Deep Speech

2 of 47

Outline

Part I Core Architecture

I Deep Speech Architecture

II CTC Algorithm

III Language Model

IV Performance

Part II Future Architectural Variants

I Network Variants

II CTC Variants

Part III Open Speech Corpora

I Open Speech Corpora

II Project Common Voice

Part IV Future Directions

3 of 47

Part I Core Architecture

4 of 47

I Deep Speech Architecture

5 of 47

Deep Speech Architecture: Overview

Input Features

Feedforward Layers

Bidirectional RNN Layer

Feedforward Layer

Softmax Layer


6 of 47

Deep Speech Architecture: Input Features

Mel-Frequency Cepstral Coefficients (MFCCs)

16 bit audio input at 16kHz

25ms audio window every 10ms

26 Cepstral Coefficients

Stride of 2

Context window width 9

Data “whitened” before use
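A minimal Python sketch of this front end, assuming the python_speech_features package; the stride-2 / context-window handling is one reading of the bullets above (here ±9 frames of context), not the project's exact code:

    # Sketch of the Deep Speech input pipeline described above (assumptions noted inline).
    import numpy as np
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc

    rate, signal = wav.read("sample.wav")        # assumes 16-bit PCM mono at 16 kHz

    # 26 cepstral coefficients from a 25 ms window every 10 ms
    features = mfcc(signal, samplerate=rate,
                    winlen=0.025, winstep=0.01, numcep=26, nfilt=26)

    # Keep every 2nd frame (stride 2) and stack +/-9 frames of context around it
    stride, context = 2, 9
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = np.array([padded[i:i + 2 * context + 1].flatten()
                        for i in range(0, len(features), stride)])

    # "Whiten": zero mean, unit variance per coefficient
    windows = (windows - windows.mean(axis=0)) / (windows.std(axis=0) + 1e-8)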


7 of 47

Deep Speech Architecture: Feedforward Layers

Feedforward Layers

3 layers

Layer width 2048

ReLU cells

ReLU clipped at 20

Dropout 0.20 to 0.30


8 of 47

Deep Speech Architecture: Bidirectional RNN Layer

Bidirectional RNN Layer

1 layer

Layer width 2048

LSTM cells

No clipping

Dropout 0.20 to 0.30


9 of 47

Deep Speech Architecture: Feedforward Layer

Feedforward Layer

1 layer

Layer width 2048

ReLU cells

ReLU clipped at 20

Dropout 0.20 to 0.30


10 of 47

Deep Speech Architecture: Softmax Layer

Softmax Layer

L ≡ Alphabet

Output width k ≡ |L| + 1

Extra output for the “blank” label
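Putting the last few slides together, a minimal Keras sketch of the stack; layer widths, clipping, and dropout follow the slides, while alphabet_size = 28 is only an illustrative |L| (letters, space, apostrophe):

    # Minimal sketch of the Deep Speech acoustic model described on the previous slides.
    # Assumes TensorFlow 2.x / Keras; input_dim is the width of one feature window.
    import tensorflow as tf

    def build_model(input_dim, alphabet_size=28):
        inputs = tf.keras.Input(shape=(None, input_dim))   # (batch, time, features)
        x = inputs
        for _ in range(3):                                  # 3 feedforward layers, width 2048
            x = tf.keras.layers.Dense(2048)(x)
            x = tf.keras.layers.ReLU(max_value=20.0)(x)     # clipped ReLU
            x = tf.keras.layers.Dropout(0.25)(x)            # dropout in the 0.20-0.30 range
        x = tf.keras.layers.Bidirectional(                  # 1 bidirectional recurrent layer
            tf.keras.layers.LSTM(2048, return_sequences=True))(x)
        x = tf.keras.layers.Dense(2048)(x)                  # 1 more feedforward layer
        x = tf.keras.layers.ReLU(max_value=20.0)(x)
        x = tf.keras.layers.Dropout(0.25)(x)
        outputs = tf.keras.layers.Dense(alphabet_size + 1,  # k = |L| + 1, including the blank
                                        activation="softmax")(x)
        return tf.keras.Model(inputs, outputs)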


11 of 47

II CTC Algorithm

12 of 47

CTC Algorithm: Path Probabilities

Per-frame SoftMax output: a 1-of-k distribution at every time step

L ≡ Alphabet

k ≡ |L| + 1

Extra output for the “blank” label

13 of 47

CTC Algorithm: Path Probabilities

Per-frame SoftMax output: a 1-of-k distribution at every time step

Path ≡ sequence of T characters

π ∈ L’^T

L’ ≡ L ∪ {blank}

T ≡ number of time steps

14 of 47

CTC Algorithm: Path Probabilities

Per-frame softmax outputs y over time steps t = 1 … T

Path probability: p(π | x) = ∏_{t=1}^{T} y^t_{π_t} = y^1_{π_1} · y^2_{π_2} · … · y^T_{π_T}
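As a toy illustration (the softmax values are invented), a path's probability is just the product of the softmax entries it selects:

    import numpy as np

    # Toy softmax outputs y[t, k] for T = 3 frames over the alphabet {a, b} plus blank "-"
    chars = ["a", "b", "-"]
    y = np.array([[0.6, 0.1, 0.3],
                  [0.2, 0.1, 0.7],
                  [0.5, 0.2, 0.3]])

    def path_probability(path):
        # p(pi | x) = product over t of y[t, pi_t]
        return np.prod([y[t, chars.index(c)] for t, c in enumerate(path)])

    print(path_probability(["a", "-", "a"]))   # 0.6 * 0.7 * 0.5 = 0.21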

15 of 47

CTC Algorithm: Label Probabilities

Paths: π ∈ L’^T

Labels: l ∈ L^{≤T}

Def: the map B from paths to labels

• Remove repeated characters

• Remove blanks

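A minimal sketch of this mapping, with "-" standing in for the blank:

    from itertools import groupby

    def collapse(path, blank="-"):
        # B: merge runs of repeated characters, then drop the blanks
        merged = [c for c, _ in groupby(path)]
        return [c for c in merged if c != blank]

    print(collapse(list("aa-ab-")))   # ['a', 'a', 'b']  -> "aab"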

17 of 47

CTC Algorithm: Label Probabilities

Per-frame softmax outputs y over time steps t = 1 … T

Label probability: p(l | x) = Σ_{π ∈ B⁻¹(l)} p(π | x) = Σ_{B(π) = l} ∏_{t=1}^{T} y^t_{π_t}

18 of 47

CTC Algorithm: Label Probabilities

Label probability: p(l | x) = Σ_{π ∈ B⁻¹(l)} ∏_{t=1}^{T} y^t_{π_t}

Problem: the sum runs over all of B⁻¹(l), and the number of paths grows exponentially with T
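To see why, a naive sketch that literally enumerates all |L’|^T paths, reusing collapse() and the toy y/chars from the sketches above; it is only usable for tiny T:

    from itertools import product
    import numpy as np

    def naive_label_probability(y, chars, label, blank="-"):
        # Enumerate every |L'|**T path and add up those whose collapse equals the label.
        total = 0.0
        for path in product(chars, repeat=len(y)):
            if collapse(list(path), blank) == list(label):
                total += np.prod([y[t, chars.index(c)] for t, c in enumerate(path)])
        return total

    print(naive_label_probability(y, chars, ["a"]))   # sums 6 of the 27 length-3 paths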

19 of 47

CTC Algorithm: Label Probabilities

Label probability: p(l | x) = Σ_{π ∈ B⁻¹(l)} ∏_{t=1}^{T} y^t_{π_t}

Solution: the forward-backward (dynamic programming) algorithm evaluates the sum efficiently
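A minimal sketch of the forward pass of that algorithm, working over the blank-interleaved label; it needs only O(T · |l|) work instead of summing |L’|^T terms:

    import numpy as np

    def ctc_forward(y, chars, label, blank="-"):
        # Extended label l': blanks interleaved around every character, e.g. "a" -> [-, a, -]
        # Assumes a non-empty label.
        ext = [blank]
        for c in label:
            ext += [c, blank]
        T, S = len(y), len(ext)
        idx = [chars.index(c) for c in ext]

        alpha = np.zeros((T, S))
        alpha[0, 0] = y[0, idx[0]]                 # start with a blank ...
        alpha[0, 1] = y[0, idx[1]]                 # ... or with the first character
        for t in range(1, T):
            for s in range(S):
                a = alpha[t - 1, s]                # stay on the same symbol
                if s > 0:
                    a += alpha[t - 1, s - 1]       # advance by one symbol
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    a += alpha[t - 1, s - 2]       # skip the blank between different characters
                alpha[t, s] = a * y[t, idx[s]]
        return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]   # end on last character or final blank

    # With the toy y/chars from the earlier sketch: ctc_forward(y, chars, ["a"]) == 0.375,
    # matching the naive enumeration above.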

20 of 47

III Language Model

21 of 47

Language Model: Definition

Def: Language Model ≡ a “probability distribution” over sequences of characters (labels)

Each label sequence l1, l2, l3, l4, … is assigned a probability pLM(l1), pLM(l2), pLM(l3), pLM(l4), …

22 of 47

Language Model: Loss Function

[Figure: per-frame softmax outputs y^t_{π_t} over time steps 1 … T]

Loss Function Version 1.0

Loss Function Version 2.0

Loss Function Version 3.0

23 of 47

Language Model: Loss Function

[Figure: per-frame softmax outputs y^t_{π_t} over time steps 1 … T]

Loss Function Version 3.0

α = 2.15

β = -0.10

β’ = 1.10
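The exact Version 3.0 formula is not reproduced on the slide; as a reference point, the Deep Speech paper ranks candidate transcriptions with Q(l) = log p_CTC(l | x) + α · log p_LM(l) + β · word_count(l). A sketch under that assumption (β’ is omitted because its role is not spelled out here; p_lm is a placeholder callable):

    import math

    ALPHA, BETA = 2.15, -0.10          # weights shown on the slide

    def decoding_score(log_p_ctc, transcript, p_lm):
        # p_lm: a callable returning the language-model probability of a character sequence.
        # Combined score used to rank beam-search candidates (illustrative helper only).
        return (log_p_ctc
                + ALPHA * math.log(p_lm(transcript))
                + BETA * len(transcript.split()))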

24 of 47

IV Performance

25 of 47

Performance: WER

Training Data

TED (Approx 200 hours)

Fisher (Approx 2000 hours)

Librivox (Approx 1000 hours)

26 of 47

Performance: WER

Training Data

TED (Approx 200 hours)

Fisher (Approx 2000 hours)

Librivox (Approx 1000 hours)

On the Librivox clean test set: 6.48% WER

27 of 47

Part II Future Architectural Variants

28 of 47

I Network Variants

29 of 47

Network Variants: Deep Speech 2 Architecture

Input Features

Convolutional Layers

(Bidirectional) RNN Layer

Softmax Layer

CTC Layer
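For comparison with the sketch earlier, a minimal Keras sketch of a Deep Speech 2 style stack; the layer counts and sizes are illustrative, not the paper's exact configuration:

    import tensorflow as tf

    def build_ds2_like(feature_dim, alphabet_size=28):
        inputs = tf.keras.Input(shape=(None, feature_dim))            # (time, features)
        x = tf.keras.layers.Conv1D(512, 11, strides=2, padding="same",
                                   activation="relu")(inputs)         # convolution in time
        x = tf.keras.layers.Conv1D(512, 11, strides=1, padding="same",
                                   activation="relu")(x)
        for _ in range(3):                                            # stacked (bi)directional RNNs
            x = tf.keras.layers.Bidirectional(
                tf.keras.layers.GRU(800, return_sequences=True))(x)
        outputs = tf.keras.layers.Dense(alphabet_size + 1,
                                        activation="softmax")(x)      # fed to the CTC loss
        return tf.keras.Model(inputs, outputs)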

30 of 47

II CTC Variants

31 of 47

CTC Variants: RNN Transducer

Per-frame softmax outputs y over time steps t = 1 … T

Path probability (as in CTC): p(π | x) = ∏_{t=1}^{T} y^t_{π_t}

32 of 47

CTC Variants: RNN Transducer

[Figure: top hidden-layer states h_1^(5), h_2^(5), …, h_T^(5) feeding the output distribution]

Path Probability

33 of 47

CTC Variants: RNN Transducer


Path Probability

Character Probability

34 of 47

CTC Variants: RNN Transducer


Path Probability

Character Probability

RNN Probability
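A minimal NumPy sketch of the transducer's per-character distribution in the form used by Graves (2012): the transcription-network output for frame t and the prediction-network output after u emitted characters are added and renormalized (the arrays below are stand-ins, not real network outputs):

    import numpy as np

    def rnnt_char_distribution(f_t, g_u):
        # f_t: transcription-network output for frame t (one value per character, blank included)
        # g_u: prediction-network output after u already-emitted characters
        logits = f_t + g_u                     # additive combination, as in Graves (2012)
        e = np.exp(logits - logits.max())      # numerically stable softmax
        return e / e.sum()

    # Stand-in outputs over {a, b, blank}
    print(rnnt_char_distribution(np.array([1.2, -0.3, 0.1]),
                                 np.array([0.4,  0.9, -1.0])))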

35 of 47

CTC Variants: Sequence-to-Sequence Model with Attention

[Figure: the Encoder (BiRNN) turns the input into annotation vectors h_1 … h_T; the Attention Module combines them with the decoder hidden state s_{i-1} into a context vector c_i; the Decoder (RNN) uses c_i to output p(l), the probability of the next label l]

36 of 47

CTC Variants: Sequence-to-Sequence Model with Attention

[Figure: each annotation vector is built from the forward and backward encoder states, e.g. h1 = (h1f, h2f, h3f, h1b, h2b, h3b) and h4 = (h1f, h2f, h3f, h1b, h2b, h3b), for the example sequence “a — — a b —”]

37 of 47

CTC Variants: Sequence-to-Sequence Model with Attention

[Figure: decoding steps for the example input “a a b c c c”; the 1st context vector is computed with the 0th (initial) decoder hidden state, the 2nd context vector with the 1st hidden state, and so on]

38 of 47

CTC Variants: Sequence-to-Sequence Model with Attention

[Figure: numeric example of the attention scores; each e_ij is computed from the decoder hidden state s_{i-1} and the annotation vector h_j]

  • Feedforward neural network
  • Input:
    • s_{i-1}: decoder hidden state before the i-th prediction
    • h_j: annotation for the j-th input character
  • Output:
    • e_ij: logit of the j-th annotation for the i-th prediction
  • α_ij: normalized weight of the j-th annotation for the i-th prediction

  • c_i: context vector, the weighted sum of the annotations
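A minimal NumPy sketch of this attention step (Bahdanau-style additive attention; the weight matrices are random stand-ins, only the shapes matter):

    import numpy as np

    rng = np.random.default_rng(0)
    T, enc_dim, dec_dim, att_dim = 6, 8, 10, 12
    H = rng.normal(size=(T, enc_dim))            # annotation vectors h_1 .. h_T
    s_prev = rng.normal(size=dec_dim)            # decoder hidden state s_{i-1}

    W_s = rng.normal(size=(att_dim, dec_dim))    # stand-in attention parameters
    W_h = rng.normal(size=(att_dim, enc_dim))
    v = rng.normal(size=att_dim)

    e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in H])   # logits e_ij
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()                    # weights alpha_ij
    c = alpha @ H                                                        # context vector c_i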

39 of 47

Part III Open Speech Corpora

40 of 47

I Open Speech Corpora

41 of 47

Open Speech Corpora: Open, Commercially Usable Corpora

Librivox

16-bit audio at 16 kHz

1000 hours of audio

Read speech

Clean subset

Dirty subset

VoxForge

16-bit audio at 16 kHz

100 hours of audio

Read speech

42 of 47

II Project Common Voice

43 of 47

Project Common Voice: Overview

44 of 47

Project Common Voice: Recording

45 of 47

Project Common Voice: Validating

46 of 47

Part IV Future Directions

47 of 47

Future Directions...

Production-Ready Packaging

Evaluating Network Variants

Evaluating CTC Variants

Hyperparameter Tuning

Network Quantization

Other Languages