History of OCR

Agenda

  • Walk through major inflection points in OCR history
    • Pipelines
    • Architectures
    • Data
  • Top models, libraries and companies
    • Open source
    • Commercial
  • Nemotron OCR
    • Pipeline breakdown

OCR - Optical Character Recognition

  • OCR is the process of turning pixels into machine-readable text
  • Used in many business contexts
    • Faxes sent as images and then turned back into text
    • Mining past information from historical documents
    • Great for automation and organizing information

OCR History - MNIST Origins

  • Character classification was the first piece to be meaningfully solved
  • Connected components analysis
  • Convolutional neural networks dominated

OCR History - Hand engineered forms

  • Classifying characters came first, but locating text was still very difficult
  • At this stage many systems bypassed the problem by using rigid forms with strictly specified text positions
  • Form Creation Guide

OCR History - Early Segmentation

  • Initial attempts at finding characters were crude and manually designed
  • MSER (Maximally Stable Extremal Regions)
    • Sweep binarization thresholds and see which connected components persist
    • Filter candidates based on hand-tuned criteria
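The threshold-sweep idea can be sketched with a toy connected-components counter (pure Python; the 4-connectivity choice and the helper names are illustrative, not any library's actual implementation):

```python
def count_components(img, thresh):
    """Count 4-connected components of pixels darker than thresh
    (one binarization step of the MSER threshold sweep)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] < thresh and not seen[y][x]:
                count += 1
                stack = [(y, x)]
                seen[y][x] = True
                while stack:  # flood fill this component
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] < thresh and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count

def sweep(img, thresholds):
    """MSER keeps regions that persist (stay stable) across many
    consecutive thresholds; here we just report the component count
    at each threshold of the sweep."""
    return [count_components(img, t) for t in thresholds]

# Two dark 2x2 blobs on a white background persist across the whole sweep
img = [[255] * 10 for _ in range(10)]
for y in range(2, 4):
    for x in range(2, 4):
        img[y][x] = 0
for y in range(6, 8):
    for x in range(6, 8):
        img[y][x] = 0
```

Real MSER also tracks each region's area across thresholds and keeps only those whose area changes slowly, then filters by shape criteria.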

OCR History - Tesseract

  • Started at HP Labs in the 1980s
  • Open-sourced in 2005
  • Sponsored by Google since 2006
  • Fully custom, hand-tuned pipeline

OCR History - ABBYY

  • Strong commercial counterpart to Tesseract, licensed in many major scanners
  • Introduced layout analysis, automatic form extraction
  • Strong multilingual support; even let users define their own alphabets and add characters and vocabulary
  • Uses ML but is still significantly hand-engineered

Research - Neural Networks

  • Around 2014, neural-network methods started winning out over hand-crafted approaches in virtually every aspect
  • Early systems used simple sliding-window text/no-text binary classifiers to find word regions

Research - Neural Networks

  • Eventually, true object-detection approaches like region proposal networks beat the more custom pipelines

Research - Early Data

  • Research moved from character recognition to word recognition
    • Very useful to have the extra context
    • Using likely co-occurring letters increased performance
  • Synthetic and scraped data drove the progress here

Research - Early Data

  • The first systems trained on this data had a 90k-word vocabulary
  • This limited the system: it could only output words from that vocabulary, not individual characters

Research - Early Data

  • Hand-labeled datasets were very limited and not large enough to train on

Modern Pipeline

  • Word Detector - Locates word bounding boxes
  • (Optional) Text rectification - Rotates or warps text to lie flat
  • Word Recognition - Turns bounding boxes into text
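As a sketch, the three stages above compose like this (the callables are hypothetical stand-ins for real models, not any particular library's API):

```python
def run_ocr(image, detect, recognize, rectify=None):
    """Sketch of the modern OCR pipeline. `detect`, `rectify` and
    `recognize` are hypothetical callables: detect(image) -> list of
    word boxes; rectify(image, box) -> flat crop; recognize(crop) -> text.
    Rectification is optional, matching the pipeline above."""
    words = []
    for box in detect(image):
        crop = rectify(image, box) if rectify else (image, box)
        words.append(recognize(crop))
    return words
```
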

Research - Improved Detectors

  • Sometimes bounding boxes are too rigid for text
  • Complex layouts and curved text cause problems
  • Some systems use segmentation methods that then get transformed for the recognizer

Research - Improved Recognizers

  • Instead of word-level classifiers, it became useful to output a sequence of characters. This made systems much more flexible and able to output arbitrary text
  • Convolutional Recurrent Neural Networks became a strong baseline
    • Sliding window of CNN creates a sequence of features passed to an RNN
  • Borrowed the connectionist temporal classification (CTC) loss from speech recognition

CTC Loss
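The decoding side of CTC can be sketched in a few lines (a greedy decode; the blank symbol and its `-` spelling are illustrative):

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Greedy CTC decode: merge runs of repeated symbols, then drop
    blanks. The blank lets the model separate genuine double letters
    from a single letter spread across several frames."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Per-frame outputs "hh-e-ll-lo" collapse to "hello"
```

The CTC loss itself marginalizes over every frame-level alignment that collapses to the target string, so no per-character position labels are needed at training time.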

Research - Transformers

  • Transformer models took over from CRNNs and CTC
  • TrOCR showed that full-line recognition could be done without CTC's alignment machinery, because the decoder has a more global view of the crop

Nemotron OCR

  • FOTS Detection
    • Pass the image through a pretrained backbone with a feature pyramid network design
    • Reduces image resolution 4x
    • In this reduced space, classify whether each pixel lies inside a text box
    • Simultaneously, for every pixel inside a box, regress the distances to the top, left, right and bottom edges plus the rotation
  • Using the regions from the detector, grid-resample to make them rectangular and correct their rotation
  • Instead of passing raw pixels to the recognizer, pass the reused features from the detector
    • This creates a shared representation from the image backbone and gives the recognizer a slightly more global view than the region's pixels alone
  • Relational model
    • Takes the recognized regions and their bounding-box coordinates, then predicts which text belongs to the same paragraph, which text belongs on the same line, and the reading order within each line
    • Considers each region's k=16 nearest neighbors
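A minimal sketch of how one pixel's regression targets could be decoded back into a rotated box, assuming the 4x-reduced grid described above; the function name, argument layout and corner convention are illustrative, not Nemotron's actual code:

```python
import math

def decode_pixel_box(px, py, top, left, bottom, right, theta, stride=4):
    """A pixel at (px, py) in the H/4 x W/4 map predicts its distances
    to the four box edges plus a rotation angle theta; recover the four
    rotated corners in full-resolution image coordinates."""
    cx, cy = px * stride, py * stride  # undo the 4x resolution reduction
    # corner offsets relative to this pixel, before rotation
    offsets = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cx + dx * cos_t - dy * sin_t,
             cy + dx * sin_t + dy * cos_t) for dx, dy in offsets]
```

In practice every positive pixel votes for a box this way, and overlapping votes are merged (e.g. with non-maximum suppression) into one detection per word.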

Nemotron Pipeline Diagram

Detection: Input Image (H x W x 3) → Feature Pyramid Network → Feature maps (H/4 x W/4 x feature size)

Pixel Classification

Feature maps (H/4 x W/4 x feature size) → Pixel class (inside a text box or not)

Box regression

Feature maps (H/4 x W/4 x feature size) → Pixelwise box regression

For every pixel predicted inside a text box, regress top, left, bottom, right and rotation

Angle Prediction

Feature maps (H/4 x W/4 x feature size) → Pixelwise box regression (the same per-pixel head also regresses the rotation angle)

Recognition Transformer

  • The recognition model is a 4-layer transformer encoder
  • A crop is taken from the detection model's features and resampled to be 8 feature pixels tall and sequence-length wide
  • Outputs one prediction per sequence position
  • If the word is shorter than the sequence length, the remaining positions predict null tokens
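Reading a word out of the fixed-length output is then just dropping the padding; a tiny sketch (the null token's spelling is an assumption):

```python
def read_out_word(per_position_symbols, null="<null>"):
    """The recognizer emits one symbol per sequence position; words
    shorter than the fixed sequence length are padded with null tokens,
    so the word is the non-null symbols in order."""
    return "".join(s for s in per_position_symbols if s != null)
```
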

Global Relational Model

Inputs: Feature maps (H/4 x W/4 x feature size), Recognition Features, Geometry Features

Relational Model: pairwise scoring over each region's k=16 nearest neighbors

Outputs: next word in line, line after, line confidence
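The k=16 neighborhood restriction might be sketched like this (pure Python; the names are illustrative, and a real system would use a spatial index rather than brute force):

```python
import math

def k_nearest_boxes(centers, k=16):
    """For each word-box center, return the indices of its k nearest
    neighbours by Euclidean distance. The relational model scores only
    these candidate pairs (next word in line, line after, line
    confidence) instead of all n^2 pairs."""
    neighbours = []
    for i, (xi, yi) in enumerate(centers):
        ranked = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        neighbours.append([j for _, j in ranked[:k]])
    return neighbours
```

Restricting pairwise predictions to nearby boxes keeps the relational stage cheap even on dense pages, since reading-order relations are overwhelmingly local.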

Vision Language Models

  • The first truly end-to-end approach
  • Image in, text out, with no intermediate goals
  • Many significant recent models follow this line of research
    • CLIP, GPT4o, Kosmos2.5, LLaVA, Donut
  • The model can optionally be prompted to trigger different behaviors
    • Answer a question about the document, OCR, analyze charts, translate, summarize
  • https://arxiv.org/pdf/2309.11419
  • https://h2o.ai/platform/mississippi/
  • https://github.com/opendatalab/OmniDocBench
  • The main drawback is that they are very slow
    • Even on the fastest GPUs, only ~3 pages per second
  • Very expensive in memory, growing with input resolution