Recitation 9
(CTC Decoding and Beam Search)
Gabrial, Harini & Quentin
HW1P2 vs. HW3P2
HW1P2
HW3P2
Training
Problem Set-up
Model
x(T-1)
x(T)
x(0)
x(1)
…
y(0)
Model
y(1)
Model
y(T-1)
Model
y(T)
x(1)
Model
y(1)
x(1)
Model
y(1)
Someone says “zoo”
We want to transcribe “zoo” at these time steps
Z
O
O
…
…
Inputs
Probabilities
Target
(Training)
Two problems
(1) time alignment and (2) repetition
One Asset
(1) order alignment
Model
x(T-1)
x(T)
x(0)
x(1)
…
y(0)
Model
y(1)
Model
y(T-1)
Model
y(T)
x(1)
Model
y(1)
x(1)
Model
y(1)
Someone says “zoo”
We want to transcribe “zoo” at these time steps
Z
O
O
…
…
?
?
?
?
?
?
…
Inputs
Probabilities
Target
All time-aligned targets for Target
?
?
?
?
?
?
…
?
?
?
?
?
?
…
Use CTC to get time-aligned targets
(Training)
The problem of repetition…
(Training)
A note on repetition…
“zzzooo”
“zoo”
“zo”
“zzzzzzzzo”
“zo” (🙁)
(Training)
A note on repetition…
“zzzooo”
“zoo”
“zo”
“zzzzzzzzo”
“zo” (🙁)
“zzz%oo%o”
“zo%o”
“zo”
“zzzzzzzzo”
“zoo” (😀)
“zzzooo”
“zoo”
“zo”
zzzzzzzo”
“zo” (🙁)
To account for repetitions, introduce a “break” character (“%”)
The problem of alignment…
(Training)
A note on alignment…
“z%o%oo”
“z%oo%o”
…
“zooo%o”
“zoo”(🙁)
(Training)
A note on alignment…
“z%o%oo”
“z%oo%o”
…
“zooo%o”
“zoo”(🙁)
If we use a time-aligned target, we can compute a differentiable loss.
But which alignment should we choose?
We could use the Viterbi algorithm to find the most probable path and use that to calculate the loss. However, there is a better option…
Using the asset of order alignment, as well as some additional “rules”, we can actually use ALL POSSIBLE ALIGNMENTS and calculate an EXPECTED LOSS over them.
We could use the Viterbi algorithm to find the most probable path and use that to calculate the loss. However, there is a better option…
Using the asset of order alignment, as well as some additional “rules”, we can actually use ALL POSSIBLE ALIGNMENTS and calculate an EXPECTED LOSS over them.
1. Alignments b/t X and Y are monotonic (once you advance to the next input, you may only keep the corresponding output or advance to the next output)
2. Alignment of X to Y is many-to-one (there may be many input elements aligning to a single output element).
3. Length of Y ≤ Length of X
Given: The “break” character, order alignment, and some additional rules…
Task: Find possible alignments, their probabilities, then calculate a differentiable loss
We could use the Viterbi algorithm to find the most probable path and use that to calculate the loss. However, there is a better option…
Using the asset of order alignment, as well as some additional “rules”, we can actually use ALL POSSIBLE ALIGNMENTS and calculate an EXPECTED LOSS over them.
(Training)
We want to transcribe “zoo” at these time steps
Z
O
O
Target
y(0)
y(1)
y(T-1)
y(T)
y(1)
y(1)
…
Probabilities
(Training)
We want to transcribe “zoo” at these time steps
Z
O
O
Target
P(“a”)
P(“b”)
.
.
.
P(“z”)
P(“a”)
P(“b”)
.
.
.
P(“z”)
P(“a”)
P(“b”)
.
.
.
P(“z”)
P(“a”)
P(“b”)
.
.
.
P(“z”)
P(“a”)
P(“b”)
.
.
.
P(“z”)
P(“a”)
P(“b”)
.
.
.
P(“z”)
…
Probabilities
(Training)
We want to transcribe “zoo” at these time steps
Z
O
O
Target
P(“%”)
P(“z”)
P(“%”)
P(“o”)
P(“%”)
P(“o”)
P(“%”)
P(“%”)
P(“z”)
P(“%”)
P(“o”)
P(“%”)
P(“o”)
P(“%”)
P(“%”)
P(“z”)
P(“%”)
P(“o”)
P(“%”)
P(“o”)
P(“%”)
P(“%”)
P(“z”)
P(“%”)
P(“o”)
P(“%”)
P(“o”)
P(“%”)
P(“%”)
P(“z”)
P(“%”)
P(“o”)
P(“%”)
P(“o”)
P(“%”)
P(“%”)
P(“z”)
P(“%”)
P(“o”)
P(“%”)
P(“o”)
P(“%”)
Probabilities
Limit to the characters in the Target with breaks inserted
…
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Not all of these are “viable” paths/alignments. Which ones are?
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
Dotted lines are paths that may seem viable at first, but are not!
(Training)
This path won’t work, because it would skip the “z” (not monotonic)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
Dotted lines are paths that may seem viable at first, but are not!
(Training)
This path won’t work, because it would skip the first “o”
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
Let’s start from the outer edges and work our way in…
This path won’t work because you won’t be able to get to the final character “in time” based on the sequence length.
Dotted lines are paths that may seem viable at first, but are not!
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
These are the viable alignments
(skips bolded)
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
These are the viable alignments
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
At any given node, you can calculate the posterior probability of reaching that node at that time step using the product of the probability of reaching from the “forward” and “backward” passes
(See Lecture)
This is where dynamic programming comes in handy!
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
We can calculate the probability of ALL ALIGNMENTS using the forward/backward method and use these to compute an EXPECTED LOSS.
(Training)
%
z
%
<start>
<end>
“zoo”
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
%
z
%
o
%
o
%
We can calculate the probability of ALL ALIGNMENTS using the forward/backward method and use these to compute an EXPECTED LOSS.
(Training)
NOW WE CAN TRAIN
(YAY!)
Training Procedure
With Connectionist Temporal Classification(CTC)
Model
x(T-1)
x(T)
x(0)
x(1)
…
y(0)
Model
y(1)
Model
y(T-1)
Model
y(T)
x(1)
Model
y(1)
x(1)
Model
y(1)
Someone says “zoo”
We want to transcribe “zoo” at these time steps
Z
O
O
…
…
?
?
?
?
?
?
…
Inputs
Probabilities
Target
All time-aligned targets for Target
?
?
?
?
?
?
…
?
?
?
?
?
?
…
Use CTC to get time-aligned targets
(Training)
Inference
Q: What’s different at inference?
Q: What’s different at inference?
A: No target
A: No sequence “rules”
Q: What’s different at inference?
A: No target
A: No sequence “rules”
The “tree” is about to get very large :/ (gulp)
What are our options?
Inference
Greedy Search
�
Y_probs
Greedy Search
DECODED STRING
?
Y_probs
Greedy Search
A
Y_probs
Greedy Search
A
B
Y_probs
Greedy Search
A
B
A
Y_probs
Greedy Search
A
B
A
B
Y_probs
Greedy Search
DECODED STRING
A B A B
0.391*0.341*0.402*0.358 = 0.0191884642
Inference
Beam Search
�
Y_probs
Beam Search
DECODED STRING
?
Seq Len | 4 |
Symbol set | {‘-’, ‘A’, ‘B’, ‘C’} |
Beam Width | 3 |
Parameters
Scores
Beam Search
Possible Paths | Calculate Score | Score |
- | 1*0.140 | 0.140 |
A | 1*0.391 | 0.391 |
B | 1*0.197 | 0.197 |
C | 1*0.271 | 0.271 |
Best Paths | Score |
_ | _ |
_ | _ |
_ | _ |
Old
Scores
Beam Search
Possible Paths | Calculate Score | Score |
- | 1*0.140 | 0.140 |
A | 1*0.391 | 0.391 |
B | 1*0.197 | 0.197 |
C | 1*0.271 | 0.271 |
Best Paths | Score |
_ | _ |
_ | _ |
_ | _ |
Best Paths | Score |
A | 0.391 |
B | 0.197 |
C | 0.271 |
Old
New
Scores
Beam Search
Possible Paths | Calculate Score | Score |
A- | 0.391*0.257 | 0.10048700 |
B- | 0.197*0.257 | 0.05062900 |
C- | 0.271*0.257 | 0.06964700 |
AA -> A | 0.391*0.096 | 0.03753600 |
AB | 0.391*0.341 | 0.13333100 |
AC | 0.391*0.305 | 0.11925500 |
BA | 0.197*0.096 | 0.01891200 |
BB -> B | 0.197*0.341 | 0.06717700 |
BC | 0.197*0.305 | 0.06008500 |
CA | 0.271*0.096 | 0.02601600 |
CB | 0.271*0.341 | 0.09241100 |
CC -> C | 0.271*0.305 | 0.08265500 |
Best Paths | Score |
A | 0.391 |
B | 0.197 |
C | 0.271 |
Old
Scores
Beam Search
Possible Paths | Calculate Score | Score |
A- | 0.391*0.257 | 0.10048700 |
B- | 0.197*0.257 | 0.05062900 |
C- | 0.271*0.257 | 0.06964700 |
AA -> A | 0.391*0.096 | 0.03753600 |
AB | 0.391*0.341 | 0.13333100 |
AC | 0.391*0.305 | 0.11925500 |
BA | 0.197*0.096 | 0.01891200 |
BB -> B | 0.197*0.341 | 0.06717700 |
BC | 0.197*0.305 | 0.06008500 |
CA | 0.271*0.096 | 0.02601600 |
CB | 0.271*0.341 | 0.09241100 |
CC -> C | 0.271*0.305 | 0.08265500 |
Best Paths | Score |
A | 0.391 |
B | 0.197 |
C | 0.271 |
Best Paths | Score |
A- | 0.10048700 |
AB | 0.13333100 |
AC | 0.11925500 |
Old
New
Scores
Beam Search
Possible Paths | Calculate Score | Score |
A–- | 0.100487*0.248 | 0.0249207760 |
AB- | 0.133331*0.248 | 0.0330660880 |
AC- | 0.119255*0.248 | 0.0295752400 |
A-A -> AA | 0.100487*0.402 | 0.0403957740 |
ABA | 0.133331*0.402 | 0.0535990620 |
ACA | 0.119255*0.402 | 0.0479405100 |
A-B -> AB | 0.100487*0.267 | 0.0268300290 |
ABB -> AB | 0.133331*0.267 | 0.0355993770 |
ACB | 0.119255*0.267 | 0.0318410850 |
A-C -> AC | 0.100487*0.083 | 0.0083404210 |
ABC | 0.133331*0.083 | 0.0110664730 |
ACC -> AC | 0.119255*0.083 | 0.0098981650 |
Best Paths | Score |
A- | 0.10048700 |
AB | 0.13333100 |
AC | 0.11925500 |
Old
Scores
Beam Search
Possible Paths | Calculate Score | Score |
A–- | 0.100487*0.248 | 0.0249207760 |
AB- | 0.133331*0.248 | 0.0330660880 |
AC- | 0.119255*0.248 | 0.0295752400 |
A-A -> AA | 0.100487*0.402 | 0.0403957740 |
ABA | 0.133331*0.402 | 0.0535990620 |
ACA | 0.119255*0.402 | 0.0479405100 |
A-B -> AB | 0.100487*0.267 | 0.0268300290 |
ABB -> AB | 0.133331*0.267 | 0.0355993770 |
ACB | 0.119255*0.267 | 0.0318410850 |
A-C -> AC | 0.100487*0.083 | 0.0083404210 |
ABC | 0.133331*0.083 | 0.0110664730 |
ACC -> AC | 0.119255*0.083 | 0.0098981650 |
Best Paths | Score |
A- | 0.10048700 |
AB | 0.13333100 |
AC | 0.11925500 |
Best Paths | Score |
ABA | 0.0535990620 |
ACA | 0.0479405100 |
AB | 0.0624294060 |
Old
New
Scores
Beam Search
Possible Paths | Calculate Score | Score |
ABA- -> ABA | 0.0535990620*0.149 | 0.0079862602380 |
ACA- -> ACA | 0.0479405100*0.149 | 0.0071431359900 |
AB- -> AB | 0.0624294060*0.149 | 0.0093019814940 |
ABAA -> ABA | 0.0535990620*0.336 | 0.0180092848320 |
ACAA -> ACA | 0.0479405100*0.336 | 0.0161080113600 |
ABA | 0.0624294060*0.336 | 0.0209762804160 |
ABAB | 0.0535990620*0.358 | 0.0191884641960 |
ACAB | 0.0479405100*0.358 | 0.0171627025800 |
ABB -> AB | 0.0624294060*0.358 | 0.0223497273480 |
ABAC | 0.0535990620*0.157 | 0.0084150527340 |
ACAC | 0.0479405100*0.157 | 0.0075266600700 |
ABC | 0.0624294060*0.157 | 0.0098014167420 |
Best Paths | Score |
ABA | 0.0535990620 |
ACA | 0.0479405100 |
AB | 0.0624294060 |
Old
Scores
Beam Search
Possible Paths | Calculate Score | Score |
ABA- -> ABA | 0.0535990620*0.149 | 0.0079862602380 |
ACA- -> ACA | 0.0479405100*0.149 | 0.0071431359900 |
AB- -> AB | 0.0624294060*0.149 | 0.0093019814940 |
ABAA -> ABA | 0.0535990620*0.336 | 0.0180092848320 |
ACAA -> ACA | 0.0479405100*0.336 | 0.0161080113600 |
ABA | 0.0624294060*0.336 | 0.0209762804160 |
ABAB | 0.0535990620*0.358 | 0.0191884641960 |
ACAB | 0.0479405100*0.358 | 0.0171627025800 |
ABB -> AB | 0.0624294060*0.358 | 0.0223497273480 |
ABAC | 0.0535990620*0.157 | 0.0084150527340 |
ACAC | 0.0479405100*0.157 | 0.0075266600700 |
ABC | 0.0624294060*0.157 | 0.0098014167420 |
Best Paths | Score |
ABA | 0.0535990620 |
ACA | 0.0479405100 |
AB | 0.0624294060 |
Best Paths | Score |
ABA | 0.04697182548 |
AB | 0.031651708842 |
ACA | 0.02325114735 |
Old
New
Beam Search
Beam Search