1 of 54

How to build an end-to-end recognition system: best practices

“apple”

deepsystems.ai

Review from deepsystems.ai

2 of 54

Task Description

Labelling unsegmented sequence data, i.e. the training data is not pre-segmented.

Example: image OCR

Input: image with text → Neural Network → Output: “apple”

Example: speech recognition

Input: sound wave of me saying “apple” → Neural Network → Output: “apple”

3 of 54

Task Description

Labelling unsegmented sequence data, i.e. the training data is not pre-segmented.

We cannot pre-segment the input data because:

  • It is too time consuming
  • It is too expensive
  • It is impossible in most cases

“a” “p” “p” “l” “e”

“a” “p” “p” “l” “e”

4 of 54

Links

5 of 54

Big picture

6 of 54

Image OCR: model architecture

High-level overview

7 of 54

input image → CNN feature extraction → image features → LSTM Net → Decoding algorithm → “apple”

8 of 54

Image OCR: model architecture

Detailed overview

9 of 54

input image (64*128*3)

10 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4

11 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8
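The Reshape step can be sketched in NumPy. This is only a minimal sketch: it assumes the 4*8*4 feature map is laid out as height*width*channels and that each of the 8 width positions becomes one time step. The slides do not state the axis order, so the layout and the variable names below are assumptions.

```python
import numpy as np

# Assumed layout: (height=4, width=8, channels=4).
# Dummy values stand in for real CNN features.
features = np.arange(4 * 8 * 4, dtype=np.float32).reshape(4, 8, 4)

# Move width to the last axis, then merge height and channels into 16 features.
sequence = features.transpose(0, 2, 1).reshape(16, 8)

# Each column is the 16*1 input for one LSTM step.
steps = [sequence[:, t:t + 1] for t in range(sequence.shape[1])]
print(sequence.shape, len(steps), steps[0].shape)
```

With this layout, column t of the 16*8 matrix holds every feature computed at width position t, so reading the columns left to right corresponds to scanning the image left to right.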

20 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → 16*1 → lstm

21 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → 16*1 → lstm → FC+SM → 6*1

FC+SM is a fully connected layer followed by softmax. Its 6*1 output is the probability distribution over the alphabet symbols at time 1 (time 1 is the first LSTM step).

In our simple example: Alphabet = {“a”, “e”, “l”, “p”, “z”, “-”}

|Alphabet| = 6

“-” is a special symbol (blank) that we should always add to the alphabet. What it is used for will become clear later.

Each element of the 6*1 vector is the probability of one symbol:

  • probability of observing “a” at time 1
  • probability of observing “e” at time 1
  • probability of observing “l” at time 1
  • probability of observing “p” at time 1
  • probability of observing “z” at time 1
  • probability of observing “-” (blank) at time 1
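The FC+SM block can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the weights `W`, the bias `b`, and the LSTM output `h` are random placeholders, and the LSTM output size of 16 is hypothetical because the slides do not give it.

```python
import numpy as np

ALPHABET = ["a", "e", "l", "p", "z", "-"]

def softmax(z):
    """Turn raw scores into a probability distribution."""
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 16))   # hypothetical FC weights
b = np.zeros((6, 1))           # hypothetical FC bias
h = rng.normal(size=(16, 1))   # stand-in for the LSTM output at time 1

y = softmax(W @ h + b)         # 6*1 probability distribution over ALPHABET
for symbol, p in zip(ALPHABET, y[:, 0]):
    print(symbol, round(float(p), 3))
```

Whatever the weights are, the six values are positive and sum to 1, which is what lets us read them as per-symbol probabilities.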

31 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → 8 × (16*1 → lstm → FC+SM → 6*1)

We have 8 network outputs at different time steps that are conditionally independent given the input.

Note: we designed this simplified neural network to have 8 outputs. It means that we cannot recognize more than 8 characters per image.

In practice, the number of outputs can reach 32, 64 or more. The choice depends on the specific task.
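One consequence of conditional independence is that the probability of a whole sequence of per-step symbols (a path) is the product of the per-step probabilities. The sketch below illustrates this; the function name `path_probability` and the uniform example outputs are assumptions made for illustration, not part of the original slides.

```python
import numpy as np

ALPHABET = ["a", "e", "l", "p", "z", "-"]

def path_probability(probs, path):
    """probs: (num_steps, 6) array, row t is the distribution at step t."""
    indices = [ALPHABET.index(s) for s in path]
    # Pick the probability of the chosen symbol at every step and multiply.
    return float(np.prod(probs[np.arange(len(path)), indices]))

probs = np.full((8, 6), 1.0 / 6.0)  # uniform outputs, for illustration only
print(path_probability(probs, "ap-pl-ee"))
```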

37 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → 8 × (16*1 → lstm → FC+SM → 6*1) → Decoding algorithm → “apple”

How does the decoding algorithm work?

40 of 54

Image OCR: model architecture

Decoding algorithm

41 of 54

For each of the 8 outputs (6*1), find the most probable symbol:

  • time 1: “a”
  • time 2: “p”
  • time 3: “-”
  • time 4: “p”
  • time 5: “l”
  • time 6: “-”
  • time 7: “e”
  • time 8: “e”

Path: “ap-pl-ee”

Remove repeated symbols: “ap-pl-ee” → “ap-pl-e”

Remove “blanks”: “ap-pl-e” → “apple”

Let’s define this block of operations as a map function B that removes consecutive repeated symbols from the path and then removes “blanks”.

The “best path decoding” algorithm is defined by this sequence of operations. It is the most popular option, very simple, and easy to implement.

Note: there are a few other decoding algorithms in the literature.
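The steps above can be sketched in Python. This is a minimal sketch, not a reference implementation: the names `collapse` and `best_path_decode` are hypothetical, and the probabilities are made up so that the most probable symbols spell the path “ap-pl-ee” from the example.

```python
from itertools import groupby

import numpy as np

ALPHABET = ["a", "e", "l", "p", "z", "-"]
BLANK = "-"

def collapse(path):
    """Map function B: merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def best_path_decode(probs):
    """probs: (num_steps, 6) array, row t is the 6*1 output at step t."""
    path = "".join(ALPHABET[i] for i in probs.argmax(axis=1))
    return path, collapse(path)

# Made-up outputs: 0.9 for the intended symbol, 0.02 for each of the others.
target_path = "ap-pl-ee"
probs = np.full((len(target_path), len(ALPHABET)), 0.02)
for t, s in enumerate(target_path):
    probs[t, ALPHABET.index(s)] = 0.9

print(best_path_decode(probs))
```

The order of the two steps matters. The blank between the two “p” symbols keeps them from being merged, which is why the blank must be removed only after the repeats are collapsed.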

51 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → 8 × (16*1 → lstm → FC+SM → 6*1) → Decoding algorithm → “apple”

52 of 54

Image OCR: model architecture

Training: CTC Loss

53 of 54

input image (64*128*3) → CNN feature extraction → 4*8*4 → Reshape → 16*8 → 8 × (16*1 → lstm → FC+SM → 6*1) → CTC Loss

CTC Loss inputs: the 8 network outputs (6*1 each) and the ground truth labeling “apple”
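The slide only names the loss, so the following is a hedged sketch of how CTC loss is usually defined rather than the authors' code: the negative log of the total probability of all paths that collapse to the ground truth labeling, computed with the standard forward (dynamic programming) recursion. The function name `ctc_loss` and the uniform example outputs are assumptions.

```python
import math

import numpy as np

def ctc_loss(probs, label, alphabet, blank="-"):
    """Negative log-probability of `label` given per-step distributions.

    probs: (num_steps, len(alphabet)) array, each row sums to 1.
    Raises ValueError if the label cannot fit into num_steps.
    """
    blank_idx = alphabet.index(blank)
    # Extended label: a blank before, between and after the symbols.
    ext = [blank_idx]
    for ch in label:
        ext += [alphabet.index(ch), blank_idx]
    T, S = probs.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # The blank may be skipped only between two different symbols.
            if s >= 2 and ext[s] != blank_idx and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    likelihood = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -math.log(likelihood)

alphabet = ["a", "e", "l", "p", "z", "-"]
uniform = np.full((8, 6), 1.0 / 6.0)  # untrained-looking outputs, for illustration
print(ctc_loss(uniform, "apple", alphabet))
```

Because the loss sums over every valid path, training does not need to know which time step each character belongs to. That is what makes pre-segmented training data unnecessary.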

54 of 54

Thank you

Our Website:

Products:

Our team is looking for business partners to make exciting deep learning solutions.

Outsource projects:

  • Dataset management, annotation and preparation service
  • Interactive, LSTM-based movie recommender system

deepsystems.ai