1 of 60

Recitation 9

(CTC Decoding and Beam Search)

Gabrial, Harini & Quentin

2 of 60

HW1P2 vs. HW3P2

HW1P2

Sequence Classification for phoneme recognition.
Time-synchronous outputs.

HW3P2

Sequence to Sequence with Order Synchrony
Training: Using CTC Loss to deal with…

The problem of alignment
The problem of repetition

Inference

Greedy Search
Beam Search

3 of 60

Training

4 of 60

Problem Set-up

Inputs and targets are not time-aligned

|Y| ≠ |X|
|Y| and |X| not proportional

However, they are order-aligned
We will compress predicted sequences
We need to be able to yield repeating outputs after compression

5 of 60

Model

x(T-1)

x(T)

x(0)

x(1)

…

y(0)

Model

y(1)

Model

y(T-1)

Model

y(T)

x(1)

Model

y(1)

x(1)

Model

y(1)

Someone says “zoo”

We want to transcribe “zoo” at these time steps

Z

O

…

Inputs

Probabilities

Target

(Training)

Two problems

(1) time alignment and (2) repetition

One Asset

(1) order alignment

6 of 60

Model

x(T-1)

x(T)

x(0)

x(1)

…

y(0)

Model

y(1)

Model

y(T-1)

Model

y(T)

x(1)

Model

y(1)

x(1)

Model

y(1)

Someone says “zoo”

We want to transcribe “zoo” at these time steps

Z

O

…

?

…

Inputs

Probabilities

Target

All time-aligned targets for Target

?

…

?

…

Use CTC to get time-aligned targets

(Training)

7 of 60

The problem of repetition…

8 of 60

(Training)

A note on repetition…

“zzzooo”

“zoo”

“zo”

“zzzzzzzzo”

“zo” (🙁)

9 of 60

(Training)

A note on repetition…

“zzzooo”

“zoo”

“zo”

“zzzzzzzzo”

“zo” (🙁)

“zzz%oo%o”

“zo%o”

“zo”

“zzzzzzzzo”

“zoo” (😀)

“zzzooo”

“zoo”

“zo”

zzzzzzzo”

“zo” (🙁)

To account for repetitions, introduce a “break” character (“%”)

10 of 60

The problem of alignment…

11 of 60

(Training)

A note on alignment…

“z%o%oo”

“z%oo%o”

…

“zooo%o”

“zoo”(🙁)

12 of 60

(Training)

A note on alignment…

“z%o%oo”

“z%oo%o”

…

“zooo%o”

“zoo”(🙁)

If we use a time-aligned target, we can compute a differentiable loss.

But which alignment should we choose?

13 of 60

We could use the Viterbi algorithm to find the most probable path and use that to calculate the loss. However, there is a better option…

Using the asset of order alignment, as well as some additional “rules”, we can actually use ALL POSSIBLE ALIGNMENTS and calculate an EXPECTED LOSS over them.

14 of 60

We could use the Viterbi algorithm to find the most probable path and use that to calculate the loss. However, there is a better option…

Using the asset of order alignment, as well as some additional “rules”, we can actually use ALL POSSIBLE ALIGNMENTS and calculate an EXPECTED LOSS over them.

1. Alignments b/t X and Y are monotonic (once you advance to the next input, you may only keep the corresponding output or advance to the next output)

2. Alignment of X to Y is many-to-one (there may be many input elements aligning to a single output element).

3. Length of Y ≤ Length of X

15 of 60

Given: The “break” character, order alignment, and some additional rules…

Task: Find possible alignments, their probabilities, then calculate a differentiable loss

We could use the Viterbi algorithm to find the most probable path and use that to calculate the loss. However, there is a better option…

Using the asset of order alignment, as well as some additional “rules”, we can actually use ALL POSSIBLE ALIGNMENTS and calculate an EXPECTED LOSS over them.

16 of 60

(Training)

We want to transcribe “zoo” at these time steps

Z

O

Target

y(0)

y(1)

y(T-1)

y(T)

y(1)

…

Probabilities

17 of 60

(Training)

We want to transcribe “zoo” at these time steps

Z

O

Target

P(“a”)

P(“b”)

.

P(“z”)

P(“a”)

P(“b”)

.

P(“z”)

P(“a”)

P(“b”)

.

P(“z”)

P(“a”)

P(“b”)

.

P(“z”)

P(“a”)

P(“b”)

.

P(“z”)

P(“a”)

P(“b”)

.

P(“z”)

…

Probabilities

18 of 60

(Training)

We want to transcribe “zoo” at these time steps

Z

O

Target

P(“%”)

P(“z”)

P(“%”)

P(“o”)

P(“%”)

P(“o”)

P(“%”)

P(“z”)

P(“%”)

P(“o”)

P(“%”)

P(“o”)

P(“%”)

P(“z”)

P(“%”)

P(“o”)

P(“%”)

P(“o”)

P(“%”)

P(“z”)

P(“%”)

P(“o”)

P(“%”)

P(“o”)

P(“%”)

P(“z”)

P(“%”)

P(“o”)

P(“%”)

P(“o”)

P(“%”)

P(“z”)

P(“%”)

P(“o”)

P(“%”)

P(“o”)

P(“%”)

Probabilities

Limit to the characters in the Target with breaks inserted

…

19 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Not all of these are “viable” paths/alignments. Which ones are?

(Training)

20 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

21 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

22 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

23 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

Dotted lines are paths that may seem viable at first, but are not!

(Training)

This path won’t work, because it would skip the “z” (not monotonic)

24 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

25 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

26 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

Dotted lines are paths that may seem viable at first, but are not!

(Training)

This path won’t work, because it would skip the first “o”

27 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

28 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

(Training)

29 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

Let’s start from the outer edges and work our way in…

This path won’t work because you won’t be able to get to the final character “in time” based on the sequence length.

Dotted lines are paths that may seem viable at first, but are not!

(Training)

30 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

These are the viable alignments

(skips bolded)

(Training)

31 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

These are the viable alignments

(Training)

32 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

At any given node, you can calculate the posterior probability of reaching that node at that time step using the product of the probability of reaching from the “forward” and “backward” passes

(See Lecture)

This is where dynamic programming comes in handy!

(Training)

33 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

We can calculate the probability of ALL ALIGNMENTS using the forward/backward method and use these to compute an EXPECTED LOSS.

(Training)

34 of 60

%

z

%

<start>

<end>

“zoo”

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

z

%

o

%

o

%

We can calculate the probability of ALL ALIGNMENTS using the forward/backward method and use these to compute an EXPECTED LOSS.

(Training)

NOW WE CAN TRAIN

(YAY!)

35 of 60

Training Procedure

With Connectionist Temporal Classification(CTC)

Define model (e.g., deep bidirectional LSTM)
Initialize network to output targets + “break” character
Pass training instances through network to obtain probability distribution over labels/symbols
Construct graph/table of “viable” alignments
Compute probabilities of alignments using Forward/Backward Algorithm (see Lecture)
Compute Expected Divergence over all alignments (see Lecture)
Propagate gradients backward and update parameters

36 of 60

Model

x(T-1)

x(T)

x(0)

x(1)

…

y(0)

Model

y(1)

Model

y(T-1)

Model

y(T)

x(1)

Model

y(1)

x(1)

Model

y(1)

Someone says “zoo”

We want to transcribe “zoo” at these time steps

Z

O

…

?

…

Inputs

Probabilities

Target

All time-aligned targets for Target

?

…

?

…

Use CTC to get time-aligned targets

(Training)

37 of 60

Inference

38 of 60

Q: What’s different at inference?

39 of 60

Q: What’s different at inference?

A: No target

A: No sequence “rules”

40 of 60

Q: What’s different at inference?

A: No target

A: No sequence “rules”

The “tree” is about to get very large :/ (gulp)

41 of 60

What are our options?

Greedy Search

Exhaustive Search (not really possible)

Beam Search

42 of 60

Inference

Greedy Search

Greedy Search is an easy-to-implement option for CTC decoding at inference time
Greedy Search simply selects the most probable time step at each time-step
Although this method is easy to implement and fast, it has the disadvantage of missing out on high-probability (score) overall paths due to it’s greedy search

�

43 of 60

Y_probs

Greedy Search

DECODED STRING

?

44 of 60

Y_probs

Greedy Search

A

45 of 60

Y_probs

Greedy Search

A

B

46 of 60

Y_probs

Greedy Search

A

B

A

47 of 60

Y_probs

Greedy Search

A

B

A

B

48 of 60

Y_probs

Greedy Search

DECODED STRING

A B A B

0.391*0.341*0.402*0.358 = 0.0191884642

49 of 60

Inference

Beam Search

To have better decoding than Greedy Search, but keep the method feasible at the same time, we can choose to “explore” top-k paths at each time-step
By exploring more than one most-probable output sequences at each time-step, we will reach a sub-optimal path that is likely to be better than the Greedy Search strategy
By limiting our exploration options to a specific Beam Width - k, we also ensure that the computation is tractable, as opposed to the Exhaustive Search strategy

�

50 of 60

Y_probs

Beam Search

DECODED STRING

?

Seq Len	4
Symbol set	{‘-’, ‘A’, ‘B’, ‘C’}
Beam Width	3

Parameters

51 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
-	1*0.140	0.140
A	1*0.391	0.391
B	1*0.197	0.197
C	1*0.271	0.271

Best Paths	Score
_	_
_	_
_	_

Old

52 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
-	1*0.140	0.140
A	1*0.391	0.391
B	1*0.197	0.197
C	1*0.271	0.271

Best Paths	Score
_	_
_	_
_	_

Best Paths	Score
A	0.391
B	0.197
C	0.271

Old

New

53 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
A-	0.391*0.257	0.10048700
B-	0.197*0.257	0.05062900
C-	0.271*0.257	0.06964700
AA -> A	0.391*0.096	0.03753600
AB	0.391*0.341	0.13333100
AC	0.391*0.305	0.11925500
BA	0.197*0.096	0.01891200
BB -> B	0.197*0.341	0.06717700
BC	0.197*0.305	0.06008500
CA	0.271*0.096	0.02601600
CB	0.271*0.341	0.09241100
CC -> C	0.271*0.305	0.08265500

Best Paths	Score
A	0.391
B	0.197
C	0.271

Old

54 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
A-	0.391*0.257	0.10048700
B-	0.197*0.257	0.05062900
C-	0.271*0.257	0.06964700
AA -> A	0.391*0.096	0.03753600
AB	0.391*0.341	0.13333100
AC	0.391*0.305	0.11925500
BA	0.197*0.096	0.01891200
BB -> B	0.197*0.341	0.06717700
BC	0.197*0.305	0.06008500
CA	0.271*0.096	0.02601600
CB	0.271*0.341	0.09241100
CC -> C	0.271*0.305	0.08265500

Best Paths	Score
A	0.391
B	0.197
C	0.271

Best Paths	Score
A-	0.10048700
AB	0.13333100
AC	0.11925500

Old

New

55 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
A–-	0.100487*0.248	0.0249207760
AB-	0.133331*0.248	0.0330660880
AC-	0.119255*0.248	0.0295752400
A-A -> AA	0.100487*0.402	0.0403957740
ABA	0.133331*0.402	0.0535990620
ACA	0.119255*0.402	0.0479405100
A-B -> AB	0.100487*0.267	0.0268300290
ABB -> AB	0.133331*0.267	0.0355993770
ACB	0.119255*0.267	0.0318410850
A-C -> AC	0.100487*0.083	0.0083404210
ABC	0.133331*0.083	0.0110664730
ACC -> AC	0.119255*0.083	0.0098981650

Best Paths	Score
A-	0.10048700
AB	0.13333100
AC	0.11925500

Old

56 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
A–-	0.100487*0.248	0.0249207760
AB-	0.133331*0.248	0.0330660880
AC-	0.119255*0.248	0.0295752400
A-A -> AA	0.100487*0.402	0.0403957740
ABA	0.133331*0.402	0.0535990620
ACA	0.119255*0.402	0.0479405100
A-B -> AB	0.100487*0.267	0.0268300290
ABB -> AB	0.133331*0.267	0.0355993770
ACB	0.119255*0.267	0.0318410850
A-C -> AC	0.100487*0.083	0.0083404210
ABC	0.133331*0.083	0.0110664730
ACC -> AC	0.119255*0.083	0.0098981650

Best Paths	Score
A-	0.10048700
AB	0.13333100
AC	0.11925500

Best Paths	Score
ABA	0.0535990620
ACA	0.0479405100
AB	0.0624294060

Old

New

57 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
ABA- -> ABA	0.0535990620*0.149	0.0079862602380
ACA- -> ACA	0.0479405100*0.149	0.0071431359900
AB- -> AB	0.0624294060*0.149	0.0093019814940
ABAA -> ABA	0.0535990620*0.336	0.0180092848320
ACAA -> ACA	0.0479405100*0.336	0.0161080113600
ABA	0.0624294060*0.336	0.0209762804160
ABAB	0.0535990620*0.358	0.0191884641960
ACAB	0.0479405100*0.358	0.0171627025800
ABB -> AB	0.0624294060*0.358	0.0223497273480
ABAC	0.0535990620*0.157	0.0084150527340
ACAC	0.0479405100*0.157	0.0075266600700
ABC	0.0624294060*0.157	0.0098014167420

Best Paths	Score
ABA	0.0535990620
ACA	0.0479405100
AB	0.0624294060

Old

58 of 60

Scores

Beam Search

Possible Paths	Calculate Score	Score
ABA- -> ABA	0.0535990620*0.149	0.0079862602380
ACA- -> ACA	0.0479405100*0.149	0.0071431359900
AB- -> AB	0.0624294060*0.149	0.0093019814940
ABAA -> ABA	0.0535990620*0.336	0.0180092848320
ACAA -> ACA	0.0479405100*0.336	0.0161080113600
ABA	0.0624294060*0.336	0.0209762804160
ABAB	0.0535990620*0.358	0.0191884641960
ACAB	0.0479405100*0.358	0.0171627025800
ABB -> AB	0.0624294060*0.358	0.0223497273480
ABAC	0.0535990620*0.157	0.0084150527340
ACAC	0.0479405100*0.157	0.0075266600700
ABC	0.0624294060*0.157	0.0098014167420

Best Paths	Score
ABA	0.0535990620
ACA	0.0479405100
AB	0.0624294060

Best Paths	Score
ABA	0.04697182548
AB	0.031651708842
ACA	0.02325114735

Old

New

59 of 60

Beam Search

60 of 60

Beam Search