1 of 54

How to build an end-to-end recognition system: best practices

“apple”

Review from deepsystems.ai

2 of 54

Task Description

Labelling unsegmented sequence data, i.e. the training data is not pre-segmented.

Example: speech recognition
Input: sound wave of me saying “apple” → Neural Network → Output: “apple”

Example: image OCR
Input: image with text → Neural Network → Output: “apple”

3 of 54

Task Description

We cannot pre-segment the input data because:

It is too time-consuming
It is too expensive
It is impossible in most cases

[Figure: segment labels “a”, “p”, “l”, “e”, shown twice]

4 of 54

Links

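Alex Graves et al. CTC Loss (“Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks”, ICML 2006): http://www.cs.toronto.edu/~graves/icml_2006.pdf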

Keras example: image_ocr.ipynb

5 of 54

Big picture

Google: voice search

Baidu: Deep Speech

Dropbox: document scanner

6 of 54

Image OCR: model architecture

High-level overview

7 of 54

input image → CNN feature extraction → Image features → LSTM Net → Decoding algorithm → “apple”

First, the image is fed to a CNN to extract image features. The next step is to apply a recurrent neural network to these features, followed by a special decoding algorithm. This decoding algorithm takes the LSTM outputs from each time step and produces the final labeling.

Yuri: Max, I have a question. Here we can see several popular deep learning components such as CNN and LSTM. But there is a well-known product called Tesseract OCR Engine with more than 15 years of history. Why can't we use it?

I suppose that everyone who has faced this task tried this product first. In our experience, this “out of the box” solution performed poorly on tasks such as number plate recognition and text recognition from images. Its accuracy on the number plate recognition task was around 10 percent, while our in-house solution gave us around 98 percent.

8 of 54

Image OCR: model architecture

Detailed overview

9 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → lstm → 16*1
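
The Reshape step can be made concrete with a small sketch. It rests on an interpretation of the diagram that the slides do not state explicitly: the 4*8*4 CNN output is assumed to be height*width*channels, and each of the 8 image columns is assumed to become one lstm time step with 4*4 = 16 features. The function name `feature_map_to_sequence` is hypothetical.

```python
# Sketch: turn a CNN feature map into a left-to-right sequence for the lstm.
# Assumption: the 4*8*4 tensor is laid out as [height][width][channels].

def feature_map_to_sequence(feature_map):
    """Return one flat feature vector per image column (time step)."""
    height = len(feature_map)
    width = len(feature_map[0])
    sequence = []
    for x in range(width):  # walk over the image columns from left to right
        column = []
        for y in range(height):
            column.extend(feature_map[y][x])  # append all channels of cell (y, x)
        sequence.append(column)
    return sequence


# Dummy 4*8*4 feature map: every cell stores its own (y, x, channel) index.
H, W, C = 4, 8, 4
feature_map = [[[(y, x, c) for c in range(C)] for x in range(W)] for y in range(H)]

sequence = feature_map_to_sequence(feature_map)
print(len(sequence), len(sequence[0]))  # 8 time steps, 16 features each
```

Under this layout the width of the feature map fixes the number of time steps, which is why the example network ends up with 8 outputs.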

30 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → lstm → 16*1 → FC+SM → 6*1

The 6*1 vector is the probability distribution over the alphabet symbols at time1 (time1 is the first lstm step).

In our simple example: Alphabet={“a”, “e”, “l”, “p”, “z”, “-”}

|Alphabet| = 6

“-” is a special symbol (blank) that must always be added to the alphabet. Its purpose will become clear later.

Entries of the 6*1 vector:
probability of observing “a” at time1
probability of observing “e” at time1
probability of observing “l” at time1
probability of observing “p” at time1
probability of observing “z” at time1
probability of observing “-” (blank) at time1
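
As a rough illustration of the SM (softmax) half of the FC+SM block, the sketch below turns six raw scores into the 6*1 probability vector for time1. The scores are made up for the example; in the real model they would come from the fully connected layer applied to the 16*1 lstm output.

```python
import math

ALPHABET = ["a", "e", "l", "p", "z", "-"]  # "-" is the blank symbol


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores, as if produced by the fully connected layer at time1.
scores_time1 = [4.0, 0.5, 0.1, 1.0, -2.0, 0.3]
probs_time1 = softmax(scores_time1)

for symbol, p in zip(ALPHABET, probs_time1):
    print(f"P({symbol!r} at time1) = {p:.3f}")
```

With these made-up scores the largest probability falls on “a”.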

36 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → lstm → 16*1 → FC+SM → 6*1

We have 8 network outputs, one per time step, that are conditionally independent.

Note: We designed this simplified neural network to have 8 outputs. It means that we cannot recognize more than 8 characters per image.

In practice, the number of outputs can reach 32, 64 or more. The choice depends on the specific task.
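
One practical consequence of conditional independence is that the probability of a whole sequence of symbols, one per time step, is simply the product of the per-step probabilities. The sketch below shows this with made-up network outputs; the helper names `path_probability` and `peaked` are hypothetical.

```python
ALPHABET = ["a", "e", "l", "p", "z", "-"]


def path_probability(path, step_probs):
    """Multiply the probability of each path symbol at its own time step."""
    assert len(path) == len(step_probs)
    prob = 1.0
    for symbol, probs in zip(path, step_probs):
        prob *= probs[ALPHABET.index(symbol)]
    return prob


def peaked(symbol, peak=0.9):
    """Made-up distribution: `peak` on one symbol, the rest spread evenly."""
    rest = (1.0 - peak) / (len(ALPHABET) - 1)
    return [peak if s == symbol else rest for s in ALPHABET]


# Eight made-up network outputs (one 6*1 vector per time step).
step_probs = [peaked(s) for s in "ap-pl-ee"]

p = path_probability("ap-pl-ee", step_probs)
print(p)  # 0.9 multiplied by itself 8 times
```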

In practice, the number of outputs can reach 32, 64 or more; the choice depends on the specific task. In production it is also better to use a multilayer bidirectional LSTM. This simple example, however, covers only the most important concepts.

Yuri: Listen, the number of outputs is fixed by design. Can we replace the LSTM with a simpler neural network, for example a CNN?

Yes, of course we can, but we will obtain lower accuracy. Using an LSTM gives us a few benefits. Firstly, the model “looks” at the image from left to right and it has memory, which allows it to take into account the entire sequence of symbols it has already “read”. Secondly, this recurrent model implicitly learns to model the language during training. When the model is not sure which symbol it should predict, this language knowledge helps it work out an answer.

39 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → lstm → 16*1 → FC+SM → 6*1 → Decoding algorithm → “apple”

How does the decoding algorithm work?

40 of 54

Image OCR: model architecture

Decoding algorithm

50 of 54

For each of the 8 network outputs (6*1), find the most probable symbol:

“a” “p” “-” “p” “l” “-” “e” “e”

“ap-pl-ee” → Remove repeated symbols → “ap-pl-e” → Remove “blanks” → “apple”

Let’s define this block of operations as a map function B that simply removes repeated symbols from the path and then removes the “blanks”.

The “best path decoding” algorithm is defined by this sequence of operations (it is the most popular choice, very simple and easy to implement).

Note: there are a few other algorithms in the literature.
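
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the network outputs are made up, and the helper names `best_path_decode` and `one_hot_like` are hypothetical. The map function keeps the name B used on the slide.

```python
from itertools import groupby

ALPHABET = ["a", "e", "l", "p", "z", "-"]
BLANK = "-"


def B(path):
    """Map a path to a labeling: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)


def best_path_decode(step_probs):
    """Pick the most probable symbol at every time step, then apply B."""
    path = "".join(ALPHABET[probs.index(max(probs))] for probs in step_probs)
    return path, B(path)


def one_hot_like(symbol):
    """Made-up 6*1 output that puts 0.9 on `symbol` and 0.02 elsewhere."""
    return [0.9 if s == symbol else 0.02 for s in ALPHABET]


step_probs = [one_hot_like(s) for s in "ap-pl-ee"]
path, text = best_path_decode(step_probs)
print(path, "->", text)  # ap-pl-ee -> apple
print(B("appl-ee"))      # aple: without a blank the two p's are merged
```

The second call shows what the blank is for: a doubled letter such as the “pp” in “apple” survives the "remove repeated symbols" step only if a blank separates the two occurrences in the path.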

52 of 54

Image OCR: model architecture

Training: CTC Loss

53 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → lstm → 16*1 → FC+SM → 6*1 → CTC Loss ← Ground truth labeling “apple”
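
The slides only name the CTC loss. Following the CTC paper listed under Links, the probability of a labeling is the sum of the probabilities of all paths that the map B (merge repeated symbols, then remove blanks) turns into that labeling, and the loss is the negative logarithm of that sum. The sketch below computes it by brute-force enumeration on a tiny made-up case; this is for illustration only, since real implementations use dynamic programming instead of enumerating paths. The function name `ctc_loss_brute_force` is hypothetical.

```python
import math
from itertools import groupby, product

BLANK = "-"


def B(path):
    """Merge repeated symbols, then drop blanks."""
    return "".join(s for s, _ in groupby(path) if s != BLANK)


def ctc_loss_brute_force(step_probs, alphabet, target):
    """-log of the total probability of all paths that B maps to `target`.

    Raises ValueError if no path can produce the target (total is 0).
    """
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(step_probs)):
        symbols = [alphabet[i] for i in path]
        if B(symbols) == target:
            prob = 1.0
            for t, i in enumerate(path):
                prob *= step_probs[t][i]
            total += prob
    return -math.log(total)


# Tiny made-up case: alphabet {"a", "-"}, 3 time steps, uniform outputs.
alphabet = ["a", BLANK]
step_probs = [[0.5, 0.5]] * 3
loss = ctc_loss_brute_force(step_probs, alphabet, "a")
print(loss)  # -log(6/8): 6 of the 8 possible paths collapse to "a"
```

Enumeration grows as |Alphabet| to the power of the number of time steps (6^8 paths for the example network), which is why it is only usable for toy cases like this one.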

54 of 54

Thank you

Our Website: deepsystems.ai

Products:
supervise.ly: Dataset management, annotation and preparation service
movix.ai: Interactive, LSTM-based movie recommender system

Outsource projects: Our team is looking for business partners to make exciting deep learning solutions.