History of OCR

Agenda

  • Walk through major inflection points in OCR history
    • Pipelines
    • Architectures
    • Data
  • Top models, libraries and companies
    • Open source
    • Commercial
  • Nemotron OCR
    • Pipeline breakdown

OCR - Optical Character Recognition

  • OCR is the process of turning pixels into machine-readable text
  • Used in many business contexts
    • Faxes sent as images and then turned back into text
    • Mining past information from historical documents
    • Great for automation and organizing information

OCR History - MNIST Origins

  • Character classification was the first piece to be meaningfully solved
  • Connected components analysis
  • Convolutional neural networks dominated

OCR History - Hand engineered forms

  • Classifying characters came first, but locating text was still very difficult
  • At this stage many systems bypassed the problem by using rigid forms with strictly specified text positions
  • Form Creation Guide

OCR History - Early Segmentation

  • Initial attempts at finding characters were crude and manually designed
  • MSER (Maximally Stable Extremal Regions)
    • Sweep binarization thresholds and see which connected components persist
    • Filter candidates based on hand-tuned criteria
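The threshold-sweep idea can be sketched with a toy connected-components counter (pure Python; the 4-connectivity choice and the helper names are illustrative, not any library's actual implementation):

```python
def count_components(img, thresh):
    """Count 4-connected components of pixels darker than thresh
    (one binarization step of the MSER threshold sweep)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] < thresh and not seen[y][x]:
                count += 1
                stack = [(y, x)]
                seen[y][x] = True
                while stack:  # flood fill this component
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] < thresh and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count

def sweep(img, thresholds):
    """MSER keeps regions that persist (stay stable) across many
    consecutive thresholds; here we just report the component count
    at each threshold of the sweep."""
    return [count_components(img, t) for t in thresholds]

# Two dark 2x2 blobs on a white background persist across the whole sweep
img = [[255] * 10 for _ in range(10)]
for y in range(2, 4):
    for x in range(2, 4):
        img[y][x] = 0
for y in range(6, 8):
    for x in range(6, 8):
        img[y][x] = 0
```

Real MSER also tracks each region's area across thresholds and keeps only those whose area changes slowly, then filters by shape criteria.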

OCR History - Tesseract

  • Started at HP Labs in the 1980s
  • Open-sourced in 2005
  • Sponsored by Google since 2006
  • Fully custom, hand-tuned pipeline

OCR History - ABBYY

  • Strong commercial counterpart to Tesseract, licensed in many major scanners
  • Introduced layout analysis, automatic form extraction
  • Strong multilingual support; even let users define their own alphabets and add characters and vocabulary
  • Uses ML but is still significantly hand-engineered

Research - Neural Networks

  • Around 2014, neural-network methods started winning out over hand-crafted approaches in virtually every aspect
  • Early systems used simple sliding-window text/no-text binary classifiers to find word regions

Research - Neural Networks

  • Eventually, true object-detection approaches like region proposal networks beat the more custom pipelines

Research - Early Data

  • Research moved from character recognition to word recognition
    • Very useful to have the extra context
    • Using likely co-occurring letters increased performance
  • Synthetic and scraped data drove the progress here

Research - Early Data

  • The first systems trained on this data had a 90k-word vocabulary
  • This limited the system: it could only output words from that vocabulary, not individual characters

Research - Early Data

  • Hand-labeled datasets were very limited and not large enough to train on

Modern Pipeline

  • Word Detector - Locates word bounding boxes
  • (Optional) Text rectification - Rotates or warps text to lie flat
  • Word Recognition - Turns bounding boxes into text
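As a sketch, the three stages above compose like this (the callables are hypothetical stand-ins for real models, not any particular library's API):

```python
def run_ocr(image, detect, recognize, rectify=None):
    """Sketch of the modern OCR pipeline. `detect`, `rectify` and
    `recognize` are hypothetical callables: detect(image) -> list of
    word boxes; rectify(image, box) -> flat crop; recognize(crop) -> text.
    Rectification is optional, matching the pipeline above."""
    words = []
    for box in detect(image):
        crop = rectify(image, box) if rectify else (image, box)
        words.append(recognize(crop))
    return words
```
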

Research - Improved Detectors

  • Sometimes bounding boxes are too rigid for text
  • Complex layouts and curved text cause problems
  • Some systems use segmentation methods that then get transformed for the recognizer

Research - Improved Recognizers

  • Instead of word-level classifiers, it became useful to output a sequence of characters. This made systems much more flexible and able to output arbitrary text
  • Convolutional Recurrent Neural Networks became a strong baseline
    • Sliding window of CNN creates a sequence of features passed to an RNN
  • Borrowed the connectionist temporal classification (CTC) loss from speech recognition

CTC Loss
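The decoding side of CTC can be sketched in a few lines (a greedy decode; the blank symbol and its `-` spelling are illustrative):

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Greedy CTC decode: merge runs of repeated symbols, then drop
    blanks. The blank lets the model separate genuine double letters
    from a single letter spread across several frames."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Per-frame outputs "hh-e-ll-lo" collapse to "hello"
```

The CTC loss itself marginalizes over every frame-level alignment that collapses to the target string, so no per-character position labels are needed at training time.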

Research - Transformers

  • Transformer models took over from CRNNs and CTC
  • TrOCR showed that full-line recognition could be done without CTC's alignment machinery, because the decoder has a more global view of the crop

Nemotron OCR

  • FOTS Detection
    • Pass the image through a pretrained backbone with a feature pyramid network design
    • Reduces image resolution 4x
    • In this reduced space, classify whether each pixel lies inside a text box
    • Simultaneously, for every pixel inside a box, regress the distances to the top, left, right and bottom edges plus the rotation
  • Using the regions from the detector, grid-resample to make them rectangular and correct their rotation
  • Instead of passing raw pixels to the recognizer, pass the reused features from the detector
    • This creates a shared representation from the image backbone and gives the recognizer a slightly more global view than the region's pixels alone
  • Relational model
    • Takes the recognized regions and their bounding-box coordinates, then predicts which text belongs to the same paragraph, which text belongs on the same line, and the reading order within each line
    • Considers each region's k=16 nearest neighbors
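A minimal sketch of how one pixel's regression targets could be decoded back into a rotated box, assuming the 4x-reduced grid described above; the function name, argument layout and corner convention are illustrative, not Nemotron's actual code:

```python
import math

def decode_pixel_box(px, py, top, left, bottom, right, theta, stride=4):
    """A pixel at (px, py) in the H/4 x W/4 map predicts its distances
    to the four box edges plus a rotation angle theta; recover the four
    rotated corners in full-resolution image coordinates."""
    cx, cy = px * stride, py * stride  # undo the 4x resolution reduction
    # corner offsets relative to this pixel, before rotation
    offsets = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cx + dx * cos_t - dy * sin_t,
             cy + dx * sin_t + dy * cos_t) for dx, dy in offsets]
```

In practice every positive pixel votes for a box this way, and overlapping votes are merged (e.g. with non-maximum suppression) into one detection per word.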

Nemotron Pipeline Diagram

Detection: Input Image (H x W x 3) → Feature Pyramid Network → Feature maps (H/4 x W/4 x feature size)

Pixel Classification

Feature maps (H/4 x W/4 x feature size) → Pixel class (inside a text box or not)

Box regression

Feature maps (H/4 x W/4 x feature size) → Pixelwise box regression

For every pixel predicted inside a text box, regress top, left, bottom, right and rotation

Angle Prediction

Feature maps (H/4 x W/4 x feature size) → Pixelwise box regression (the same per-pixel head also regresses the rotation angle)

Recognition Transformer

  • The recognition model is a 4-layer transformer encoder
  • A crop is taken from the detection model's features and resampled to be 8 feature pixels tall and sequence-length wide
  • Outputs one prediction per sequence position
  • If the word is shorter than the sequence length, the remaining positions predict null tokens
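Reading a word out of the fixed-length output is then just dropping the padding; a tiny sketch (the null token's spelling is an assumption):

```python
def read_out_word(per_position_symbols, null="<null>"):
    """The recognizer emits one symbol per sequence position; words
    shorter than the fixed sequence length are padded with null tokens,
    so the word is the non-null symbols in order."""
    return "".join(s for s in per_position_symbols if s != null)
```
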

Global Relational Model

Inputs: Feature maps (H/4 x W/4 x feature size), Recognition Features, Geometry Features

Relational Model: pairwise scoring over each region's k=16 nearest neighbors

Outputs: next word in line, line after, line confidence
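The k=16 neighborhood restriction might be sketched like this (pure Python; the names are illustrative, and a real system would use a spatial index rather than brute force):

```python
import math

def k_nearest_boxes(centers, k=16):
    """For each word-box center, return the indices of its k nearest
    neighbours by Euclidean distance. The relational model scores only
    these candidate pairs (next word in line, line after, line
    confidence) instead of all n^2 pairs."""
    neighbours = []
    for i, (xi, yi) in enumerate(centers):
        ranked = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        neighbours.append([j for _, j in ranked[:k]])
    return neighbours
```

Restricting pairwise predictions to nearby boxes keeps the relational stage cheap even on dense pages, since reading-order relations are overwhelmingly local.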

Vision Language Models

  • The first truly end-to-end approach
  • Image in, text out, with no intermediate goals
  • Many significant recent models follow this line of research
    • CLIP, GPT4o, Kosmos2.5, LLaVA, Donut
  • The model can optionally be prompted to trigger different behaviors
    • Answer a question about the document, OCR, analyze charts, translate, summarize
  • https://arxiv.org/pdf/2309.11419
  • https://h2o.ai/platform/mississippi/
  • https://github.com/opendatalab/OmniDocBench
  • The main drawback is that they are very slow
    • Even on the fastest GPUs, only ~3 pages per second
  • Very expensive in memory, growing with input resolution