1 of 38

Document and Content Analysis

Lecture 06 - Document Image Analysis

�Faisal Shafait

26.05.2011

2 of 38

How to convert a document image into editable text?

Optical Character Recognition (OCR)

We will learn how OCR works in the next four lectures!

3 of 38

How to convert a document image into editable text?

Optical Character Recognition (OCR)

We will learn how OCR works in the next four lectures!

4 of 38

How to convert a document image into editable text?

Optical Character Recognition (OCR)

We will learn how OCR works in the next four lectures!

5 of 38

A typical scanned book page

6 of 38

Character Recognition

  • Which character is this:
  • What is this: Withsha its

Isolated character recognition can be done as a standard pattern recognition problem, but a lot more needs to be done for a complete OCR system

7 of 38

Character Recognition

  • Which character is this:
  • What is this: Withsha its
  • Isolated character recognition can be done as a standard pattern recognition problem, but a lot more needs to be done for a complete OCR system

8 of 38

Flow chart for OCR

9 of 38

Binarization

  • Scanners capture a greyscale/color document
  • Most of the OCR systems work on binary images
  • Binarization is an important first step in most of the document analysis systems

10 of 38

Effect of binarization on OCR

11 of 38

Binarization algorithms

  • The goal of binarization algorithm is to define a threshold.
  • Two main classes:
    • Global binarization��
    • Local binarization

12 of 38

Global Binarization

  • Just set

13 of 38

Global Binarization

  • set

14 of 38

Otsu Global Thresholding

Let be the normalized histogram of the image

15 of 38

Otsu Global Thresholding

16 of 38

Local Adaptive Thresholding

  • Adapt to local variations in intensity by taking a window around each pixel

White (1983):

Niblack (1986):

Sauvola (2000):

17 of 38

Sauvola Local Thresholding

18 of 38

Local Vs Global Thresholding

  • Global Thresholding methods are:
    • Fast
    • Give good results when illumination over a page is uniform
    • Fail when there are local changes in illumination
  • Local Thresholding methods are:
    • Slow
    • Adapt to local changes in illumination
    • Perform well for both uniform and non-uniform illumination

19 of 38

Shafait Binarization (2008)

  • Use integral images for computing local thresholds
  • Local mean and variance can be computed in linear time
  • Same performance as local thresholding in time close to global thresholding

20 of 38

Connected Component Analysis

21 of 38

Connected Component Analysis

  • Scan the image row by row
  • When a black pixel is encountered, assign it a label:
    • If left neighbor pixel is white, a new label is assigned to the current black pixel
    • If left neighbor is black, its label is copied to the current pixel
  • If the upper neighbor pixel is black, merge the label of the current pixel and that of upper neighbor

22 of 38

Connected Component Analysis Example

23 of 38

Other Pre-processing Tasks

  • Orientation detection
  • Marginal noise removal
  • Skew correction

24 of 38

Orientation Detection

25 of 38

Marginal Noise Removal

26 of 38

Skew Correction

27 of 38

Page Segmentation

28 of 38

Incorrect Page Segmentation

29 of 38

Incorrect Page Segmentation

30 of 38

Page Segmentation Algorithms

  • Run-length Smearing Algorithm (1982)
  • Recursive X-Y Cuts (1984)
  • Whitespace Analysis (1994)
  • Docstrum (1993)
  • Voronoi (1998)
  • RAST (2002) by Thomas Breuel

31 of 38

Run-Length Smearing Algorithm

  • Works on binary image
  • White pixels represented by 0 and black by 1
  • A binary sequence x is changed into y:
    • 1's in x remain unchanged in y
    • 0's in x are changed to 1's in y if the number of adjacent 0's in x is less than or equal to a pre-defined threshold T.
  • This process is first repeated row-wise and then column-wise to get two distinct images
  • The two images are combined using AND op.

32 of 38

Run-Length Smearing Algorithm

  • A smooth final bitmap is obtained by again smearing in horizontal direction.
  • Connected components in the final bitmap correspond to segments in the image.

33 of 38

Run-Length Smearing Algorithm

(a) Original Image

(b) Horizontally Smeared Image

(c) Vertically Smeared Image

34 of 38

Run-Length Smearing Algorithm

(d) Final Image after Smoothing

(e) Identified text regions

35 of 38

Recursive X-Y Cut Algorithm

  • Recursive analysis of projection profiles
  • Projection profiles are obtained in two directions:
    • Horizontal: Project the image on the y-axis.
      • The length of the projection is equal to the height of the image
      • The value at each index of projection is equal to the number of black pixels in that row of the image
    • Vertical: Project the image on the x-axis.
      • The length of the projection is equal to the width of the image
      • The value at each index of projection is equal to the number of black pixels in that column of the image

36 of 38

Horizontal Projection

37 of 38

Recursive X-Y Cut Algorithm

  • Recursive analysis of projection profiles
  • Compute horizontal and vertical projection profiles of the image.
  • Compute largest (zero-)valleys in the horizontal and vertical projections
  • Split the image in the direction of larger valley into two images if
  • Stop when the image can not be split further

38 of 38

Things to remember

  • Otsu Thresholding
  • Sauvola Thresholding
  • Connected Component Analysis
  • Run-Length Smearing Algorithm
  • Recursive X-Y Cut Algorithm