How to build an end-to-end recognition system: best practices
Review from deepsystems.ai
Task Description
Labelling unsegmented sequence data, i.e. the training data is not pre-segmented.
Example: speech recognition. Input: sound wave of me saying “apple” -> Neural Network -> Output: “apple”
Example: image OCR. Input: image with text -> Neural Network -> Output: “apple”
We cannot pre-segment the input data because:
[Figure: the word “apple” segmented into the characters “a”, “p”, “p”, “l”, “e”, shown for two inputs]
Links
Alex Graves. CTC Loss: http://www.cs.toronto.edu/~graves/icml_2006.pdf
Keras example: image_ocr.ipynb
Big picture
Image OCR: model architecture
High-level overview
input image -> CNN feature extraction -> Image features -> LSTM Net -> Decoding algorithm -> “apple”
Image OCR: model architecture
Detailed overview
input image: 64*128*3
CNN feature extraction: 4*8*4
Reshape: 16*8
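The reshape step can be sketched in plain Python. This is a minimal sketch under assumptions the slides do not state: the 4*8*4 feature map is taken to be laid out as height*width*channels, and each of the 8 image columns becomes one vector of length 16 (4 rows * 4 channels). The function name `feature_map_to_sequence` and the dummy values are hypothetical.

```python
def feature_map_to_sequence(feature_map):
    """Turn an H*W*C feature map into W column vectors of length H*C."""
    height = len(feature_map)
    width = len(feature_map[0])
    sequence = []
    for w in range(width):
        # Concatenate the channel vectors of every row in column w.
        column = []
        for h in range(height):
            column.extend(feature_map[h][w])
        sequence.append(column)
    return sequence


# Dummy 4*8*4 feature map (assumed layout: height, width, channels).
H, W, C = 4, 8, 4
feature_map = [
    [[h * 100 + w * 10 + c for c in range(C)] for w in range(W)]
    for h in range(H)
]

sequence = feature_map_to_sequence(feature_map)
print(len(sequence), len(sequence[0]))  # 8 16
```

Reading the image left to right, each of the 8 vectors then corresponds to one step of the recurrent part of the network.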
lstm: 16*1
FC+SM: 6*1
This is the probability distribution over the alphabet symbols at time1 (time1 is the first lstm step).
In our simple example: Alphabet={“a”, “e”, “l”, “p”, “z”, “-”}
|Alphabet| = 6
“-” is a special symbol (blank) that we should always add to the alphabet. What it is used for will become clear in the decoding step.
The 6*1 vector holds one probability per alphabet symbol:
- probability of observing “a” at time1
- probability of observing “e” at time1
- probability of observing “l” at time1
- probability of observing “p” at time1
- probability of observing “z” at time1
- probability of observing “-” (blank) at time1
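To make the 6*1 output concrete, here is a small sketch of the last step, assuming that FC+SM stands for a fully connected layer followed by a softmax. The raw scores are made-up numbers standing in for the fully connected layer's output; only the alphabet comes from the slides.

```python
import math

ALPHABET = ["a", "e", "l", "p", "z", "-"]  # "-" is the blank symbol


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores from the FC layer at time1 (made-up values).
scores_time1 = [2.0, 0.1, -1.0, 0.5, -2.0, 0.3]
probs_time1 = softmax(scores_time1)

for symbol, prob in zip(ALPHABET, probs_time1):
    print(f"P({symbol!r} at time1) = {prob:.3f}")
```

With these scores the largest probability falls on “a”, and the six values sum to 1, which is what makes the output a probability distribution.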
lstm (16*1) and FC+SM (6*1) are repeated for each of the 8 time steps, so the network produces 8 outputs of size 6*1.
We have 8 network outputs at different times that are conditionally independent given the input.
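Conditional independence has a concrete consequence: the probability of a whole path (one symbol per time step) is the product of the per-step probabilities. The symbols below are an assumption, since the slides do not introduce any; they follow the notation of the Graves paper listed under Links, with $y^{t}_{k}$ the probability of symbol $k$ at time $t$ and $\pi$ a path of length $T$.

```latex
p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t}, \qquad T = 8
```

Each factor is one entry of one of the eight 6*1 output vectors.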
Note: We designed this simplified neural network to have 8 outputs. This means that we cannot recognize more than 8 characters per image.
In practice, the number of outputs can reach 32, 64 or more. The choice depends on the specific task.
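The limit of 8 is an upper bound. Because decoding merges repeated symbols, two identical neighbouring characters need a blank between them, so some words use more outputs than they have letters. The sketch below counts the minimum number of outputs a word needs; the helper name `min_time_steps` is hypothetical and the example words are chosen only to fit the toy alphabet.

```python
def min_time_steps(label):
    """Smallest number of outputs needed to emit `label` under CTC rules."""
    # One step per character, plus one blank between each pair of
    # adjacent identical characters.
    repeats = sum(1 for prev, cur in zip(label, label[1:]) if prev == cur)
    return len(label) + repeats


for word in ["apple", "zeal", "peel"]:
    print(word, min_time_steps(word))
# apple 6
# zeal 4
# peel 5
```

So “apple” fits into 8 outputs with two steps to spare, while a word with 8 letters fits only if no two neighbouring letters are the same.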
Decoding algorithm: 8 outputs of 6*1 -> “apple”
How does the decoding algorithm work?
Image OCR: model architecture
Decoding algorithm
For each of the 8 outputs (6*1), find the most probable symbol: “a”, “p”, “-”, “p”, “l”, “-”, “e”, “e”
Concatenating these symbols gives the path “ap-pl-ee”.
Remove repeated symbols (collapse each run of the same symbol into one): “ap-pl-ee” -> “ap-pl-e”
Remove “blanks”: “ap-pl-e” -> “apple”
Let’s define this block of operations as the map function B, which simply removes repeated symbols from the path and then removes “blanks”.
The “best path decoding” algorithm is defined by this sequence of operations (it is the most popular one, very simple and easy to implement).
Note: there are a few other decoding algorithms in the literature.
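The steps above translate almost directly into code. The following is a minimal sketch of best path decoding, not the implementation from the Keras example; the helper `one_hot_like` and its peak value of 0.9 are hypothetical and only serve to build toy network outputs.

```python
from itertools import groupby

ALPHABET = ["a", "e", "l", "p", "z", "-"]
BLANK = "-"


def collapse(path):
    """Map function B: merge runs of repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)


def best_path_decode(outputs):
    """Pick the most probable symbol at each time step, then apply B."""
    path = [ALPHABET[max(range(len(p)), key=p.__getitem__)] for p in outputs]
    return "".join(path), collapse(path)


def one_hot_like(symbol, peak=0.9):
    """Hypothetical helper: a 6-value distribution peaked at `symbol`."""
    rest = (1.0 - peak) / (len(ALPHABET) - 1)
    return [peak if s == symbol else rest for s in ALPHABET]


outputs = [one_hot_like(s) for s in "ap-pl-ee"]
path, text = best_path_decode(outputs)
print(path, "->", text)  # ap-pl-ee -> apple
```

The example also shows what the blank is for: `collapse("apple")` returns `"aple"`, because without a blank between them the two “p” symbols are merged into one.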
Image OCR: model architecture
Training: CTC Loss
During training, the 8 outputs (6*1 each) and the ground truth labeling “apple” are fed into the CTC Loss.
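As an illustration of what the CTC loss measures, the sketch below computes it by brute force: it sums the probabilities of every path that the map function B turns into the ground truth, then takes the negative logarithm. This enumeration is only practical for tiny examples; the Graves paper listed under Links computes the same sum efficiently with a forward-backward algorithm. The three-symbol alphabet and the output values are made up to keep the enumeration small.

```python
import math
from itertools import groupby, product

ALPHABET = ["a", "b", "-"]  # tiny assumed alphabet; "-" is the blank
BLANK = "-"


def collapse(path):
    """Map function B: merge repeated symbols, then remove blanks."""
    merged = [s for s, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)


def ctc_loss_brute_force(outputs, label):
    """-ln of the summed probability of all paths that B maps to `label`."""
    total = 0.0
    for path in product(range(len(ALPHABET)), repeat=len(outputs)):
        if collapse(ALPHABET[k] for k in path) == label:
            prob = 1.0
            for t, k in enumerate(path):
                prob *= outputs[t][k]  # per-step outputs are independent
            total += prob
    return -math.log(total)


# Made-up network outputs: 3 time steps, probabilities over ["a", "b", "-"].
outputs = [
    [0.6, 0.1, 0.3],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
loss = ctc_loss_brute_force(outputs, "ab")
print(loss)
```

For this toy input, five paths collapse to “ab” (“aab”, “abb”, “a-b”, “ab-”, “-ab”) and their probabilities sum to 0.486, so the loss is -ln(0.486). Training lowers the loss by raising the total probability of those paths.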
Thank you