Pyctcdecode & Speech2text decoding
Jeremy Lopez & Ray Grossman
Kensho Technologies
January 18, 2022
Who we are
Jeremy Lopez
Ray Grossman
Table of Contents - Part 1
A review of standard S2T architecture
Start with an audio sample
Perform preprocessing to get features for every n-millisecond chunk of audio
[Diagram: Audio → Preprocessing → Features]
Pass the generated per-timestep features through a neural net model
Get back per-timestep logit matrix
[Diagram: Features → Acoustic Model → Logits]
Logit matrix: Gives softmax logits predicting character probabilities for each slice
Logits represent character probabilities - with a twist
Pass the generated per-timestep logits to a language model/decoder
[Diagram: Logits → Language Model/Decoder → Text]
The decoder / language model evaluates paths through the logit matrix
A path through the logit matrix produces an output string
Path: Choose 1 character per time slice
[Diagram: example path spelling out "d e c o d i n g …"]
A simple logit-based score combines the probabilities of each character
[Diagram: the same path "d e c o d i n g …", scored by combining per-character probabilities]
(i: time index, j: character index)
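A minimal sketch of the logit-based path score described above: sum the log-probability of the chosen character at each time slice (index i over time, j over characters). The dict-of-dicts here is a hypothetical stand-in for a log-softmaxed logit matrix.

```python
import math

def path_score(log_probs, path):
    """Score a path by summing the per-time-slice log-probabilities
    of each chosen character."""
    return sum(log_probs[i][char] for i, char in enumerate(path))

# Toy example with two time slices
log_probs = [
    {"d": math.log(0.9), "e": math.log(0.1)},
    {"d": math.log(0.2), "e": math.log(0.8)},
]
score = path_score(log_probs, "de")  # log(0.9) + log(0.8)
```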
A review of CTC Encoding
CTC: Connectionist Temporal Classification
What are the challenges inherent in decoding the logits?
Need a way to collapse repeated characters without eliminating legitimate double letters
A review of CTC encoding - Pad token
Add a new character to our alphabet that represents a break between characters
Decoding CTC Output- Greedy solution
To decode CTC-encoded text greedily:
H H E E L L _ L L O   (most likely character per time slice)
H E L _ L O           (collapse repeated characters)
H E L L O             (remove pad tokens)
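The steps above can be sketched in a few lines of Python (using "_" as the pad/blank token):

```python
def greedy_ctc_decode(frames, blank="_"):
    """Greedy CTC decoding sketch: take the per-frame characters,
    collapse runs of repeats, then drop the pad token."""
    collapsed = []
    prev = None
    for ch in frames:
        if ch != prev:  # collapse repeats: "HH" -> "H"
            collapsed.append(ch)
        prev = ch
    return "".join(ch for ch in collapsed if ch != blank)

greedy_ctc_decode("HHEELL_LLO")  # -> "HELLO"
```

Note that the genuine double "L" survives only because a pad token separates the two runs of "L".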
This solution is often not optimal
Consider two things:
Decoding CTC Output- Language Model
Because of those problems, we need a way to score how likely a given block of text is, given:
This will help remove ‘improbable’ parts of the predicted text
N-gram language models solve this need
What is an n-gram model?
Used in conjunction with logits - can weight the input of the language model
[Diagram: Proposed text → Language Model → Likelihood]
"The angry brown cat" → 0.4321
"The angry cat brown" → 0.0321
The probability of each word in some text depends on the N-1 previous words
Example: Bigram language model
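A bigram model can be sketched as a maximum-likelihood estimate from counts, P(w | w_prev) = count(w_prev, w) / count(w_prev). (A sketch only: real n-gram models add smoothing for unseen n-grams.)

```python
from collections import Counter

def bigram_prob(tokens, w_prev, w):
    """Unsmoothed maximum-likelihood bigram estimate P(w | w_prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])  # only tokens that can start a bigram
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = "the angry brown cat saw the angry dog".split()
bigram_prob(tokens, "the", "angry")  # -> 1.0: "the" is always followed by "angry" here
```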
Can incorporate the language model with the logits to assist in decoding
Decoding CTC Output- Better Algorithms
Decoding CTC Output - Exact Solution With Language Model
So, given a language model, how do we decode logits?
We need an intermediate solution!
Decoding CTC Output- Beam Search
Beam Search: A Fast Approximate Solution
Step 1. Select the N best characters from the first time slice
[Diagram: the N best characters, e.g. "F" and "P", kept after t=1]
Step 2. Add a character from time slice 2 to the initial 1-character beams and score.
[Diagram: beams "F" and "P" each extended with candidate characters such as "h", "e", and "a" from t=2]
Step 2.5 Rescore text outputs with the language model. Prune the number of kept sequences to our desired beam width.
[Diagram: extended beams rescored with the language model and pruned back to the beam width]
Step 3. Continue by adding a third timestep, choosing the N best, and so on
[Diagram: surviving beams extended with characters from t=3, scored, and pruned again]
At a beam width of 1, this is identical to greedy decoding.
At a beam width of infinity, it is an exact solution.
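The extend-score-prune loop above can be sketched with logit-only scoring (a real decoder would also fold a language model score into the pruning step):

```python
import math

def beam_search_paths(log_probs, beam_width):
    """Approximate best paths through a logit matrix.
    log_probs: one dict of character -> log-probability per time slice."""
    beams = [("", 0.0)]  # (partial path, summed log-probability)
    for slice_probs in log_probs:
        # Extend every beam with every candidate character...
        candidates = [
            (path + ch, score + lp)
            for path, score in beams
            for ch, lp in slice_probs.items()
        ]
        # ...then prune back to the N best-scoring paths.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

log_probs = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.3), "b": math.log(0.7)},
]
best_path, best_score = beam_search_paths(log_probs, beam_width=1)[0]  # "ab"
```

With beam_width=1 only the locally best character survives each step, which is exactly greedy decoding; with beam_width equal to the number of full paths, every path is scored and the search is exact.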
Decoding CTC Output- Beam Search Scoring
To score a beam (i.e. a path), sum the log probabilities of the characters
To score text: sum over equivalent beams (all paths that collapse to the same text)
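This sum over equivalent beams can be shown by brute force: enumerate every path, keep those that collapse to the target text, and add up their probabilities. (Exponential in the number of time slices; real decoders accumulate this sum incrementally during the beam search.)

```python
import math
from itertools import product

def collapse(path, blank="_"):
    """CTC decoding rule: collapse repeats, then drop the pad token."""
    out, prev = [], None
    for ch in path:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

def text_prob(log_probs, text, alphabet):
    """Probability of a text = sum over every path that collapses to it."""
    total = 0.0
    for path in product(alphabet, repeat=len(log_probs)):
        if collapse(path) == text:
            total += math.exp(sum(log_probs[i][c] for i, c in enumerate(path)))
    return total

log_probs = [
    {"a": math.log(0.6), "_": math.log(0.4)},
    {"a": math.log(0.5), "_": math.log(0.5)},
]
text_prob(log_probs, "a", ("a", "_"))  # paths "aa", "a_", "_a": 0.3 + 0.3 + 0.2 = 0.8
```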
Conclusion - Part 1 / Intro - Part 2
Pyctcdecode
Pyctcdecode:
Demo code:
Pyctcdecode - features
Everything discussed in part 1 is implemented in pyctcdecode
Also offers …
Pyctcdecode - features: Hot words
Boosting “Hot” Words
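In pyctcdecode, hot words are passed at decode time (e.g. `decoder.decode(logits, hotwords=["kensho"], hotword_weight=10.0)`). Conceptually, the effect is to add a score bonus to beams containing the boosted words; the helper below is a sketch of that idea, not pyctcdecode's actual internals.

```python
def boosted_score(beam_score, beam_text, hotwords, hotword_weight=10.0):
    """Conceptual hotword boosting sketch: add a fixed bonus to a
    beam's log-score for each hot word it contains, nudging the
    search toward hypotheses with those words."""
    hotword_set = set(hotwords)
    bonus = sum(hotword_weight for word in beam_text.split() if word in hotword_set)
    return beam_score + bonus

boosted_score(-25.0, "thank you ivan", hotwords=["ivan"])  # -> -15.0
```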
Pyctcdecode - features: Speed
Beam pruning and caching allow fast performance, comparable to the C++ packages we previously used
Pyctcdecode - Potential features?
Transformer-based / neural language models are coming into prominence
Pyctcdecode - Getting Started
To start using pyctcdecode effectively, you need 4 things:
Pyctcdecode - Getting Started - 1. Dataset
SPGISpeech - 5000 hours of financial audio and associated transcripts
Pyctcdecode - Getting Started - 1. Dataset - SPGISpeech
Pyctcdecode - Getting Started - Acoustic model
Acoustic model - BPE or character based
Pyctcdecode - Getting Started - Language model
KenLM - offers support for n-gram language models
Currently only KenLM models are supported
Matching vocabulary sets is critical
Pyctcdecode - Getting Started - First decoder
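A minimal sketch of building a first decoder (assumes pyctcdecode is installed and a trained KenLM model exists at the placeholder path "lm.arpa"; the label list here is hypothetical and must match the acoustic model's output vocabulary):

```python
from pyctcdecode import build_ctcdecoder

# Hypothetical character vocabulary; order must match the acoustic
# model's output dimension, with "" as the CTC blank token.
labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",  # placeholder path to a KenLM model
    alpha=0.5,  # language model weight
    beta=1.0,   # word insertion bonus
)

# logits: numpy array of shape (time_steps, len(labels)) from the model
# text = decoder.decode(logits)
```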
Pyctcdecode - Getting Started - Examples
Samples:
All of the code used to produce the samples discussed in this presentation is available on GitHub:
https://github.com/rhgrossman/pyctcdecode_demo
Let’s start with a sample where the LM performs well.
Sample: 18e82b076319ac52f8a4b391ea345abd/129.wav
Ground truth: I also want to remind everyone that any forward-looking statements made during this call are subject to risks and uncertainties, the most important of which are described in our press release and SEC filings.
Greedy Decoding: i also want to remind everyone that any forward looking statements made during this call are subject to risks and uncertainties the most important of which are described in our press releafs and s c c filings
LM Decoding: i also want to remind everyone that any forward looking statements made during this call are subject to risks and uncertainties the most important of which are described in our press release and sec filings
Hopefully this gives you an idea of the types of errors LMs tend to correct
Now, for a sample where the LM performs less well…
Sample: c3567809c19ce6a800677544cd84f88b/4.wav
Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.
Greedy Decoding: thank you ivan i'm now going to look in a bit more detail at what is as ivan said a good set of results with better topline performance margin expansion and increased cashflow
LM Decoding: thank you even i'm now going to look in a bit more detail at what is as ivan said a good set of results with better top line performance margin expansion and increased cash flow
Despite these types of errors, language models are quite useful and improve WER even on extensively trained models
How can we address such errors in the LM?
Pyctcdecode - Getting Started - Tuning options
Let’s try hotword boosting on our previous sample
Sample: c3567809c19ce6a800677544cd84f88b/4.wav
Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.
Greedy Decoding: thank you ivan i'm now going to look in a bit more detail at what is as ivan said a good set of results with better topline performance margin expansion and increased cashflow
LM Decoding (hotword boosted): thank you ivan i 'm now going to look in a bit more detail at what is as ivan said a good set of results with better top line performance margin expansion and increased cash flow
Very useful, especially in many common transcription scenarios
Let’s look at our second method of improving language models - parameter tuning
Pyctcdecode offers two tunable parameters: alpha (the language model weight) and beta (a word insertion bonus)
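A sketch of how the two parameters act during scoring (the exact form inside pyctcdecode may differ):

```python
def fused_score(acoustic_logp, lm_logp, word_count, alpha=0.5, beta=1.0):
    """Conceptual shallow-fusion score: alpha scales the language
    model's influence relative to the acoustic model, while beta adds
    a per-word bonus that counteracts the LM's bias toward shorter
    outputs."""
    return acoustic_logp + alpha * lm_logp + beta * word_count

fused_score(-12.0, -4.0, 3)  # -12.0 + 0.5 * -4.0 + 1.0 * 3 = -11.0
```

Raising alpha trusts the LM more (it can fix spellings but also overrule correct acoustics, as in the "cranes"/"grains" example); raising beta favors longer transcripts.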
Keep in mind:
[Diagram: the training set is split into an acoustic + LM base training set and an LM parameter tuning holdout, alongside a validation set]
Sample: 538323cb37246bc97d1253b369a7414a/178.wav
Ground truth: whereas then in the industrial cranes and crane components business units,
Greedy Decoding: whereas tein a in the industrial grains and crain combonents business units
LM Decoding (alpha=.7, beta=3.0): whereas then in the industrial grains and grain components business units
LM Decoding (alpha=.5, beta=1.0): whereas then in the industrial cranes and grain components business units
Clearly, alpha and beta can have a significant impact
Pyctcdecode - Conclusions
Pyctcdecode - Questions? Comments?