1 of 107

Pyctcdecode & Speech2text decoding

Jeremy Lopez & Ray Grossman

Kensho Technologies

January 18, 2022

1

2 of 107

Who we are

2

Jeremy

Lopez

Ray

Grossman

3 of 107

Table of Contents

  1. Intro to CTC Decoding and beam search (Jeremy)
  2. Pyctcdecode walkthrough and examples (Ray)

3

4 of 107

Table of Contents - Part 1

  1. Overview of generic S2T pipelines
  2. Review of CTC encoding in S2T
    • Challenge of duplicate characters
    • Pad token
  3. Decoding CTC encoded output
    • Naive solution (Greedy Decoding)
    • Scoring text (Language model)
    • Optimal Solution
  4. Beam search decoding

4

5 of 107

A review of standard S2T architecture

5

6 of 107

A review of standard S2T architecture

6

Start with an audio sample

7 of 107

A review of standard S2T architecture

7

Perform preprocessing to get features for every n-millisecond chunk of audio

[Diagram: Audio → Preprocessing → Features]

8 of 107

A review of standard S2T architecture

8

Pass the generated per-timestep features through a neural network model

  • E.g., wav2vec, Citrinet

Get back a per-timestep logit matrix

[Diagram: Features → Acoustic Model → Logits]

9 of 107

A review of standard S2T architecture

9

Logit matrix: Gives softmax logits predicting character probabilities for each slice

10 of 107

A review of standard S2T architecture

10

Logits represent character probabilities - with a twist

  • Audio sliced into evenly spaced chunks of time
  • Softmax logits predicted for each slice
  • → Output length proportional to audio length, not text length

11 of 107

A review of standard S2T architecture

11

Pass the generated per-timestep logits to a language model/decoder

  • Returns actual text

[Diagram: Logits → Language Model/Decoder → Text]

12 of 107

A review of standard S2T architecture

12

The decoder / language model evaluates paths through the logit matrix

13 of 107

A review of standard S2T architecture

13

A path through the logit matrix produces an output string

Path: Choose 1 character per time slice

14 of 107

A review of standard S2T architecture

14

A path through the logit matrix produces an output string

[Figure: an example path through the logit matrix spelling out “decoding”]

15 of 107

A review of standard S2T architecture

15

A simple logit-based score combines the probabilities of each character

[Figure: the same “decoding” path, with each chosen character’s probability highlighted]

$\mathrm{score} = \sum_{i} \log p_{i,\,j_i}$

  • i: time index
  • j: character index (the character chosen at time i)

16 of 107

A review of CTC Encoding

16

17 of 107

A review of CTC Encoding

17

CTC: Connectionist Temporal Classification

18 of 107

A review of CTC Encoding

18

What are the challenges inherent in decoding the logits?

  1. Many more timesteps than predicted characters
  2. Duplicated characters
    1. E.g., “LL” in “hello”

19 of 107

A review of CTC Encoding

19

Need a way to collapse repeated characters without eliminating duplicate letters

  • The following should decode identically:

  1. H E L L O → H E L L O
  2. H H E E L L L L O O → H E L L O

20 of 107

A review of CTC encoding - Pad token

20

Add a new character to our alphabet that represents a break between characters

  • Often denoted “<pad>” or “_”

  • H E L _ L O → H E L _ L O
  • H H E E L L _ L L O → H E L _ L O

21 of 107

Decoding CTC Output- Greedy solution

21

22 of 107

Decoding CTC Output- Greedy solution

22

To decode CTC-encoded text greedily:

  1. Take the argmax of the character logits at each timestep
  2. Remove repeated characters
  3. Remove the pad (CTC blank) character

H H E E L L _ L L O → H E L _ L O → H E L L O
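In code, greedy decoding is only a few lines; a minimal sketch (the vocabulary and pad symbol below are illustrative, not the real model's):

```python
import numpy as np

# Minimal sketch of greedy CTC decoding: argmax -> collapse repeats -> drop pad.
VOCAB = ["_", "h", "e", "l", "o"]   # index 0 is the CTC pad/blank token (illustrative)
PAD = "_"

def greedy_decode(logits: np.ndarray) -> str:
    """logits: array of shape (timesteps, len(VOCAB))."""
    best = logits.argmax(axis=1)                      # 1. argmax of the character logits per timestep
    chars = [VOCAB[i] for i in best]
    collapsed = [c for i, c in enumerate(chars)       # 2. remove repeated characters
                 if i == 0 or c != chars[i - 1]]
    return "".join(c for c in collapsed if c != PAD)  # 3. remove the pad token
```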

23 of 107

Decoding CTC Output- Greedy solution

23

This solution is often not optimal

  • Homophones
    • E.g., bear/bare, knot/not
  • Misspellings
  • A plethora of other small errors
    • E.g., “f”, “ph”, “gh”, etc. can all produce the same sound
    • What if a model wants to output “phor” instead of “for”?

24 of 107

Decoding CTC Output- Greedy solution

24

Consider two things:

  1. Include more information about our language
  2. Use a better decoding strategy

25 of 107

Decoding CTC Output- Language Model

25

26 of 107

Decoding CTC Output- Language Model

26

Because of those problems, we need a way to score how likely a given block of text is, given:

  • Logits
  • A training text corpus

This will help remove ‘improbable’ parts of the predicted text

27 of 107

Decoding CTC Output- Language Model

27

N-gram language models solve this need

What is an n-gram model?

  • Probabilistic
  • Given a sequence of N-1 words, predicts the Nth word
  • Can be used to identify improbable sequences of text that need to be changed

Used in conjunction with the logits - the language model's contribution can be weighted

28 of 107

Decoding CTC Output- Language Model

28

Proposed text → Language Model → Likelihood

  • “The angry brown cat” → 0.4321
  • “The angry cat brown” → 0.0321

29 of 107

Decoding CTC Output- Language Model

29

The probability of a word in some text depends on the N-1 previous words

30 of 107

Decoding CTC Output- Language Model

30

Example: Bigram language model

  • text = “<start> it was the best of times <end>”
  • P(text) = P(it|<start>) * P(was|it) * P(the|was) * P(best|the) * P(of|best) * P(times|of) * P(<end>|times)
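As a toy illustration (raw counts with no smoothing or backoff - the real decoder uses KenLM, introduced later):

```python
from collections import Counter
import math

# Toy bigram model built from raw counts - purely illustrative.
corpus = "<start> it was the best of times <end> <start> it was the worst of times <end>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_logprob(sentence: str) -> float:
    """Sum of log P(w_n | w_{n-1}) over the sentence."""
    words = sentence.split()
    return sum(math.log(bigrams[(prev, cur)] / unigrams[prev])
               for prev, cur in zip(words, words[1:]))

print(bigram_logprob("<start> it was the best of times <end>"))
```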

31 of 107

Decoding CTC Output- Language Model

31

Can incorporate the language model with the logits to assist in decoding

  • When scoring a path through the logits, add the language model score:
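Roughly, and consistent with the alpha and beta parameters described later, the combined score is:

```latex
\mathrm{score}(\text{text}) \;=\; \log P_{\text{acoustic}}(\text{text} \mid \text{logits})
\;+\; \alpha \,\log P_{\text{LM}}(\text{text})
\;+\; \beta \,\lvert \text{text} \rvert
```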

32 of 107

Decoding CTC Output- Better Algorithms

32

33 of 107

Decoding CTC Output - Exact Solution With Language Model

33

So, given a language model, how do we decode logits?

  • True solution
    • Score every possible path through the logit matrix, with the addition of the LM
    • Combine scores of equivalent paths
    • Take the highest-scoring text

34 of 107

Decoding CTC Output- Language Model

34

The number of possible paths grows exponentially with the number of time slices, so the exact solution is intractable - we need an intermediate solution!

35 of 107

Decoding CTC Output- Beam Search

35

36 of 107

Decoding CTC Output- Beam Search

36

Beam Search: A Fast Approximate Solution

Step 1. Select the N best characters from the first time slice

[Diagram: beams after t=1: “F”, “P”]

37 of 107

Decoding CTC Output- Beam Search

37

Step 2. Add a character from time slice 2 to the initial 1-character beams and score.

[Diagram: the t=1 beams (“F”, “P”) each extended with candidate characters from t=2]

38 of 107

Decoding CTC Output- Beam Search

38

Step 2.5 Rescore text outputs with the language model. Prune the number of kept sequences to our desired beam width.

[Diagram: the extended beams rescored with the language model and pruned back to the beam width]

39 of 107

Decoding CTC Output- Beam Search

39

Step 3. Continue by adding a third timestep, choosing the N best, and so on.

[Diagram: the surviving beams extended again with candidate characters from t=3]

40 of 107

Decoding CTC Output- Beam Search

40

At a beam width of 1, we are identical to greedy decoding.

At a beam width of infinity, we have an exact solution
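Putting the steps together, a simplified sketch of the loop (VOCAB and score_lm are stand-ins; a real CTC decoder such as pyctcdecode also merges repeated characters and pad tokens and prunes unlikely beams):

```python
import math
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz _")   # illustrative character set

def score_lm(text: str) -> float:
    """Placeholder: a real decoder would query an n-gram LM here."""
    return 0.0

def beam_search(probs: np.ndarray, beam_width: int, alpha: float = 0.5) -> str:
    """probs: per-timestep character probabilities, shape (timesteps, len(VOCAB))."""
    beams = [("", 0.0)]                                 # (text so far, acoustic log score)
    for t in range(probs.shape[0]):
        candidates = []
        for text, ac_score in beams:
            for j, p in enumerate(probs[t]):            # extend every beam by one character
                candidates.append((text + VOCAB[j], ac_score + math.log(p)))
        # rank by acoustic score plus a weighted LM score, keep the best `beam_width` beams
        candidates.sort(key=lambda c: c[1] + alpha * score_lm(c[0]), reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]
```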

41 of 107

Decoding CTC Output- Beam Search Scoring

41

42 of 107

Decoding CTC Output- Beam Search Scoring

42

Beam Search Scoring

To score a beam (i.e., a path), sum the log probabilities of the characters

43 of 107

Decoding CTC Output- Beam Search Scoring

43

To score a text: sum over equivalent beams (all beams that collapse to the same text)
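Written out (a reconstruction of the standard CTC text probability, where collapse(π) removes repeats and pad tokens):

```latex
P(\text{text} \mid \text{logits}) \;=\; \sum_{\pi \,:\, \mathrm{collapse}(\pi) = \text{text}} \;\prod_{t} p_{t,\pi_t}
```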

44 of 107

Conclusion- Part 1 /Intro - Part 2

44

45 of 107

Conclusion- Part 1 /Intro - Part 2

45

  • CTC encoding is a common way for audio models to output character information
  • Pyctcdecode package
    • Decodes CTC-encoded logits
      • With language model assistance
      • Uses beam search to find the best text

46 of 107

Pyctcdecode

46

47 of 107

Pyctcdecode

47

Pyctcdecode:

  • https://github.com/kensho-technologies/pyctcdecode

Demo code:

  • https://github.com/rhgrossman/pyctcdecode_demo

48 of 107

Pyctcdecode

48

  • Developed at Kensho to decode CTC-encoded logits
  • Python!
  • Simple code, easy to extend with new features
  • Easy to use in common ML workflows

49 of 107

Pyctcdecode - features

49

50 of 107

Pyctcdecode - features

50

Everything discussed in Part 1 is implemented in pyctcdecode

  • Uses beam search to determine quasi-optimal output text
  • LM support with KenLM
  • Greedy decoding (no LM)
    • Beam width of 1; highest probability character at each timestep
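A quick sketch of both modes (the label list, KenLM path, and logits below are placeholders):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")                # must match the acoustic model's vocabulary
lm_decoder = build_ctcdecoder(labels, kenlm_model_path="my_lm.arpa")  # LM-assisted beam search
greedy_decoder = build_ctcdecoder(labels)                             # no language model

logits = np.random.rand(50, len(labels)).astype(np.float32)  # stand-in for acoustic model output
text = lm_decoder.decode(logits)                       # beam search with KenLM shallow fusion
greedy_text = greedy_decoder.decode(logits, beam_width=1)  # beam width of 1 -> greedy decoding
```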

51 of 107

Pyctcdecode - features

51

Also offers …

52 of 107

Pyctcdecode - features

52

53 of 107

Pyctcdecode - features: Hot words

53

Boosting “Hot” Words

  • Provide a list of “hot” words and boost the probability of each one
  • Useful when speech conditions might differ from model training
    • E.g., early 2020, when “coronavirus” became extremely commonplace
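A sketch of what hotword boosting looks like with pyctcdecode (the vocabulary, logits, and word list are illustrative):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")
decoder = build_ctcdecoder(labels)   # a KenLM model can also be passed via kenlm_model_path

logits = np.random.rand(60, len(labels)).astype(np.float32)  # stand-in for acoustic model output
text = decoder.decode(
    logits,
    hotwords=["coronavirus"],   # domain-specific words to boost
    hotword_weight=10.0,        # how strongly to boost them
)
```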

54 of 107

Pyctcdecode - features: Hot words

54

55 of 107

Pyctcdecode - features: Speed

55

Beam pruning and caching allow fast performance, comparable to the C++ packages we previously used

  • Reject beams with particularly low probabilities to improve performance
  • Cache partial results during the calculation to further improve performance

56 of 107

Pyctcdecode - features: Speed

56

57 of 107

Pyctcdecode - Potential features?

57

58 of 107

Pyctcdecode - Potential features?

58

Transformer-based / neural language models are coming into prominence

    • https://arxiv.org/pdf/2110.03326.pdf
    • They bring their own set of challenges
      • Must maintain an additional, complex model
      • Runtime is usually high

59 of 107

Pyctcdecode - Getting Started

59

60 of 107

Pyctcdecode - Getting Started

60

61 of 107

Pyctcdecode - Getting Started

61

To start using pyctcdecode effectively, you need 4 things:

  1. A text corpus to train the language/acoustic models
  2. An acoustic model - BPE- or character-based
    1. E.g., Hugging Face
  3. The vocab set for the above model
  4. KenLM - implements the n-gram model

62 of 107

Pyctcdecode - Getting Started - 1. Dataset

62

63 of 107

Pyctcdecode - Getting Started - 1. Dataset

63

SPGISpeech - 5000 hours of financial audio and associated transcripts

  • ~5x the size of LibriSpeech!
  • https://datasets.kensho.com/datasets/spgispeech
  • https://arxiv.org/abs/2104.02014

64 of 107

Pyctcdecode - Getting Started - 1. Dataset - SPGISpeech

64

SPGISpeech

  • Company earnings calls
  • Split into 5-15 second chunks to make training speech recognition models easy
  • Over 50,000 unique speakers
    • Variety of accents
  • Validation set of manual transcriptions
    • Polished orthography
    • Well normalized

65 of 107

Pyctcdecode - Getting Started - Acoustic model

65

66 of 107

Pyctcdecode - Getting Started - Acoustic model

66

Acoustic model - BPE- or character-based

  • Pyctcdecode handles BPE encoding
  • Need the vocabulary set - if BPE, you just need a list of the characters that existed in the text used to create the BPE
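For a Hugging Face CTC model, the vocabulary can be pulled from the tokenizer; a sketch (the checkpoint name is just an example):

```python
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
vocab = processor.tokenizer.get_vocab()
# sort by token id so the label list lines up with the model's logit columns
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
```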

67 of 107

Pyctcdecode - Getting Started - Acoustic model

67

68 of 107

Pyctcdecode - Getting Started - Language model

68

69 of 107

Pyctcdecode - Getting Started - Language model

69

KenLM - offers support for n-gram language models

  • python -m pip install pypi-kenlm

Currently, only KenLM models are supported

  • Feature requests are welcome if you need something else
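Once a KenLM model has been built from your corpus, querying it from Python is straightforward; a sketch (the .arpa path is a placeholder):

```python
import kenlm

# Models are typically built from a text corpus with KenLM's lmplz / build_binary tools.
lm = kenlm.Model("my_corpus_5gram.arpa")
print(lm.score("the angry brown cat", bos=True, eos=True))  # log10 probability
print(lm.score("the angry cat brown", bos=True, eos=True))  # a less likely word order scores lower
```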

70 of 107

Pyctcdecode - Getting Started - Language model

70

Matching vocabulary sets is critical

  • Even a slight mismatch can lead to very strange behavior
    • E.g., if the “-” character is not included in one:
      • cash-flow -> cashflow

71 of 107

Pyctcdecode - Getting Started - First decoder

71

72 of 107

Pyctcdecode - Getting Started - First decoder

72

73 of 107

Pyctcdecode - Getting Started - First decoder

73
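A minimal first decoder, tying the vocabulary, KenLM model, and tuning parameters together (the label list, .arpa path, and logits below are placeholders for the outputs of the earlier steps):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")   # from the acoustic model's vocabulary
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="my_corpus_5gram.arpa",  # omit to decode without an LM
    alpha=0.5,                                # LM weight during shallow fusion
    beta=1.0,                                 # length-score weight
)

logits = np.random.rand(100, len(labels)).astype(np.float32)  # acoustic model output, shape (time, vocab)
text = decoder.decode(logits)
```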

74 of 107

Pyctcdecode - Getting Started - Examples

74

75 of 107

Pyctcdecode - Getting Started - Examples

75

Samples:

  • Generated using the demo code

  • We tried to pick a spread of examples, from those where the LM helps a lot to those where it does not

  • Hopefully, these give you a good sense of how LMs work in practice and how to tune them

76 of 107

Pyctcdecode - Getting Started

76

All of the code used to produce the samples discussed in this presentation is available on GitHub:

https://github.com/rhgrossman/pyctcdecode_demo

77 of 107

Pyctcdecode - Getting Started - Examples

77

Let’s start with a sample where the LM performs well.

78 of 107

Pyctcdecode - Getting Started - Examples

78

Sample: 18e82b076319ac52f8a4b391ea345abd/129.wav

79 of 107

Pyctcdecode - Getting Started - Examples

79

Sample: 18e82b076319ac52f8a4b391ea345abd/129.wav

Ground truth: I also want to remind everyone that any forward-looking statements made during this call are subject to risks and uncertainties, the most important of which are described in our press release and SEC filings.

80 of 107

Pyctcdecode - Getting Started - Examples

80

Sample: 18e82b076319ac52f8a4b391ea345abd/129.wav

Ground truth: I also want to remind everyone that any forward-looking statements made during this call are subject to risks and uncertainties, the most important of which are described in our press release and SEC filings.

Greedy Decoding: i also want to remind everyone that any forward looking statements made during this call are subject to risks and uncertainties the most important of which are described in our press releafs and s c c filings

81 of 107

Pyctcdecode - Getting Started - Examples

81

Sample: 18e82b076319ac52f8a4b391ea345abd/129.wav

Ground truth: I also want to remind everyone that any forward-looking statements made during this call are subject to risks and uncertainties, the most important of which are described in our press release and SEC filings.

Greedy Decoding: i also want to remind everyone that any forward looking statements made during this call are subject to risks and uncertainties the most important of which are described in our press releafs and s c c filings

LM Decoding: i also want to remind everyone that any forward looking statements made during this call are subject to risks and uncertainties the most important of which are described in our press release and sec filings

82 of 107

Pyctcdecode - Getting Started - Examples

82

Hopefully this gives you an idea of the types of errors LMs tend to correct

83 of 107

Pyctcdecode - Getting Started - Examples

83

Now, for a sample where the LM performs less well…

84 of 107

Pyctcdecode - Getting Started - Examples

84

Sample: c3567809c19ce6a800677544cd84f88b/4.wav

85 of 107

Pyctcdecode - Getting Started - Examples

85

Sample: c3567809c19ce6a800677544cd84f88b/4.wav

Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.

86 of 107

Pyctcdecode - Getting Started - Examples

86

Sample: c3567809c19ce6a800677544cd84f88b/4.wav

Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.

Greedy Decoding: thank you ivan i'm now going to look in a bit more detail at what is as ivan said a good set of results with better topline performance margin expansion and increased cashflow

87 of 107

Pyctcdecode - Getting Started - Examples

87

Sample: c3567809c19ce6a800677544cd84f88b/4.wav

Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.

Greedy Decoding: thank you ivan i'm now going to look in a bit more detail at what is as ivan said a good set of results with better topline performance margin expansion and increased cashflow

LM Decoding: thank you even i'm now going to look in a bit more detail at what is as ivan said a good set of results with better top line performance margin expansion and increased cash flow

88 of 107

Pyctcdecode - Getting Started - Examples

88

Despite these types of errors, language models are quite useful and improve WER even on extensively trained models

  • 6.1% WER improvement on Kensho's internal val set
    • With a large acoustic model - 100k hours of acoustic training data

  • This doesn't change the fact that they still introduce errors

89 of 107

Pyctcdecode - Getting Started - Examples

89

How can we address such errors in the LM?

90 of 107

Pyctcdecode - Getting Started - Tuning options

90

  1. Boost hotwords for things that may be specific to our prediction domain
    1. E.g., names of people on a call
    2. Words like ‘covid’ that may be common now but absent from the training corpus
  2. Adjust parameters to increase or decrease the weight of the language model vs. the logits
    • Alpha and beta, in the package

91 of 107

Pyctcdecode - Getting Started - Examples

91

Let’s try hotword boosting on our previous sample

92 of 107

Pyctcdecode - Getting Started - Examples

92

93 of 107

Pyctcdecode - Getting Started - Examples

93

Sample: c3567809c19ce6a800677544cd84f88b/4.wav

Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.

Greedy Decoding: thank you ivan i'm now going to look in a bit more detail at what is as ivan said a good set of results with better topline performance margin expansion and increased cashflow

94 of 107

Pyctcdecode - Getting Started - Examples

94

Sample: c3567809c19ce6a800677544cd84f88b/4.wav

Ground truth: Thank you, Ivan. I'm now going to look in a bit more detail at what is, as Ivan said, a good set of results, with better top line performance, margin expansion and increased cash flow.

Greedy Decoding: thank you ivan i'm now going to look in a bit more detail at what is as ivan said a good set of results with better topline performance margin expansion and increased cashflow

LM Decoding (with hotword boosting): thank you ivan i 'm now going to look in a bit more detail at what is as ivan said a good set of results with better top line performance margin expansion and increased cash flow

95 of 107

Pyctcdecode - Getting Started - Tuning options

95

Very useful - especially in many common transcription scenarios

  • Calls with known speaker names

  • Adjusting your model to new language trends without fully retraining

  • Adjusting your model to a corpus not included in training

96 of 107

Pyctcdecode - Getting Started - Tuning Options

96

Let’s look at our second method of improving language models - parameter tuning

97 of 107

Pyctcdecode - Getting Started - Tuning Options

97

Pyctcdecode offers two tunable parameters

  • Alpha: weight for the language model during shallow fusion
  • Beta: constant weight for the length-score adjustment during scoring

98 of 107

Pyctcdecode - Getting Started - Tuning Options

98

Keep in mind:

  • These should be tuned on a holdout set that is NOT your validation set
    • If you tune these parameters on your validation set, your measured WER will be optimistically biased
      • Information has leaked into the LM tuning

  • As alpha & beta -> 0, the predictions converge to plain (LM-free) beam search decoding
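A grid-search sketch for tuning alpha and beta on the holdout set (labels, holdout_logits, and holdout_texts are assumed to come from the earlier steps; jiwer is one common WER implementation):

```python
import itertools
from jiwer import wer
from pyctcdecode import build_ctcdecoder

# `labels`: acoustic model vocabulary; `holdout_logits`: list of per-utterance logit matrices;
# `holdout_texts`: the matching reference transcripts - all assumed to exist already.
best = None
for alpha, beta in itertools.product([0.3, 0.5, 0.7, 0.9], [0.0, 1.0, 2.0, 3.0]):
    decoder = build_ctcdecoder(labels, kenlm_model_path="my_corpus_5gram.arpa",
                               alpha=alpha, beta=beta)
    hypotheses = [decoder.decode(logits) for logits in holdout_logits]
    score = wer(holdout_texts, hypotheses)
    if best is None or score < best[0]:
        best = (score, alpha, beta)

print(f"best WER {best[0]:.3f} at alpha={best[1]}, beta={best[2]}")
```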

99 of 107

Pyctcdecode - Getting Started - Tuning Options

99

[Diagram: the training set is split into an acoustic + LM base training set and an LM parameter tuning holdout; the validation set is kept separate]

100 of 107

Pyctcdecode - Getting Started - Examples

100

Sample: 538323cb37246bc97d1253b369a7414a/178.wav (comparing Greedy Decoding with LM decoding at alpha=0.7, beta=3.0 and alpha=0.5, beta=1.0)

101 of 107

Pyctcdecode - Getting Started - Examples

101

Sample: 538323cb37246bc97d1253b369a7414a/178.wav

Ground truth: whereas then in the industrial cranes and crane components business units,

102 of 107

Pyctcdecode - Getting Started - Examples

102

Sample: 538323cb37246bc97d1253b369a7414a/178.wav

Ground truth: whereas then in the industrial cranes and crane components business units,

Greedy Decoding: whereas tein a in the industrial grains and crain combonents business units

103 of 107

Pyctcdecode - Getting Started - Examples

103

Sample: 538323cb37246bc97d1253b369a7414a/178.wav

Ground truth: whereas then in the industrial cranes and crane components business units,

Greedy Decoding: whereas tein a in the industrial grains and crain combonents business units

LM Decoding (alpha=0.7, beta=3.0): whereas then in the industrial grains and grain components business units

104 of 107

Pyctcdecode - Getting Started - Examples

104

Sample: 538323cb37246bc97d1253b369a7414a/178.wav

Ground truth: whereas then in the industrial cranes and crane components business units,

Greedy Decoding: whereas tein a in the industrial grains and crain combonents business units

LM Decoding (alpha=0.7, beta=3.0): whereas then in the industrial grains and grain components business units

LM Decoding (alpha=0.5, beta=1.0): whereas then in the industrial cranes and grain components business units

105 of 107

Pyctcdecode - Getting Started - Tuning Options

105

Clearly, alpha and beta can have a significant impact

  • Again, ensure there is no leakage into the validation set

106 of 107

Pyctcdecode - Conclusions

106

  • Pyctcdecode efficiently decodes CTC-encoded logits
  • Python
  • Use it!

107 of 107

Pyctcdecode - Questions? Comments?

107