1 of 21

A practical approach for deep learning based transcription of handwritten historical protocols

Lars Ailo Bongo, Bjørn-Richard Pedersen, Tim Alexander Teige, Nikita Shvetsov, Johan Ravn, Einar Holsbø, Trygve Andersen, Gunnar Thorvaldsen

Machine learning

SQL

Graph database

Conflict of interest: Medsensio AS, 3StepBio AS

2 of 21

We are transcribing the full-count 1950 Norwegian population census

Handwritten text in Norwegian

  • 801,000 questionnaires
  • 1.6M scans
  • 26 columns
  • 6.8M rows
  • Split into 177M images

Can be linked to other datasets

Number codes

Birthplaces

Person names

3 of 21

Handwritten digit recognition is a textbook example, so it should be easy to do?

4 of 21

Many cloud platforms provide handwritten text recognition services

5 of 21

We need a practical approach for transcription

  • Human level precision: 98-99%
  • Useful recall: 95%
  • Build on powerful services
    • SQL
    • Machine learning
    • Graph databases
  • Reduce operational overhead
  • Low cost
  • Secure data storage and processing

6 of 21

We evaluated four approaches for 3-digit code classification

Train model on labeled 3-digit dataset

Train model on labeled 3-digit dataset, split into single digits

Train model on labeled 3-digit dataset

Train using MNIST data

7 of 21

Deep learning: train a model using labeled images

39,592 manually labeled 3-digit codes

Training set: 85%

Test set: 15%

Deep learning model

([image], 944)
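The 85%/15% split above can be sketched as follows. This is a minimal sketch that assumes a random shuffle with a fixed seed; the slides do not say how the split was drawn, and `split_dataset` is a hypothetical helper name.

```python
import random

def split_dataset(items, train_fraction=0.85, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# 39,592 is the number of labeled codes from the slide;
# the integer IDs stand in for the actual labeled images.
train, test = split_dataset(range(39592))
print(len(train), len(test))
```

With these numbers the training set holds 33,653 codes and the test set 5,939.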

8 of 21

Deep learning: tune a model using labeled 3-digit images

39,592 manually transcribed 3-digit codes

Training set: 85%

Test set: 15%

Deep learning model

([image], ?)

(944, 0.96)

(944)
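One way to read the pair (944, 0.96) is as a predicted code together with the model's confidence in it. The sketch below assumes exactly that, and assumes the confidence comes from a softmax over per-class scores; the slides confirm neither. The labels and scores are made up for illustration.

```python
import math

def predict_with_confidence(scores, labels):
    """Return (label, confidence) for the highest-scoring class."""
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["941", "944", "949"]  # hypothetical subset of 3-digit codes
scores = [0.5, 4.0, 0.2]        # made-up raw model outputs
label, conf = predict_with_confidence(scores, labels)
print(label, round(conf, 2))
```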

9 of 21

Deep learning: train a 1-digit model using labeled 3-digit images

39,592 manually labeled 3-digit codes

Training set: 85%

Test set: 15%

Deep learning model

([image], 9)

Split the training images into 100,959 single-digit images

([image], 4)

([image], 4)
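Splitting a labeled 3-digit image into three labeled 1-digit images can be sketched as below. This sketch assumes the digits occupy three equal-width strips, which the slides do not state; `split_code_image` is a hypothetical helper and the image is a plain list of pixel rows.

```python
def split_code_image(image, label, n_digits=3):
    """Cut an image (list of pixel rows) into n_digits equal-width strips
    and pair each strip with the matching character of the label."""
    if len(label) != n_digits:
        raise ValueError("label length must match n_digits")
    width = len(image[0])
    step = width // n_digits
    pieces = []
    for i, digit in enumerate(label):
        start = i * step
        # The last strip absorbs any leftover columns.
        end = width if i == n_digits - 1 else start + step
        strip = [row[start:end] for row in image]
        pieces.append((strip, int(digit)))
    return pieces

image = [[0] * 9 for _ in range(4)]  # dummy 4x9 image
pieces = split_code_image(image, "944")
digits = [d for _, d in pieces]
print(digits)
```

Applied to the 33,653 codes that make up 85% of 39,592, this yields 33,653 × 3 = 100,959 single-digit images.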

10 of 21

We evaluated four approaches for 3-digit code classification

Train model on labeled 3-digit dataset

Train model on labeled 3-digit dataset, split into single digits

Train model on labeled 3-digit dataset

Train using MNIST data

Low accuracy

Low accuracy, high cost, still need to preprocess images

11 of 21

The average precision and recall are good for 1-digit and 3-digit classification

98% precision ⇒ 98% recall

99% precision ⇒ 95% recall

We can use the model in production

98% precision ⇒ 96% recall

99% precision ⇒ 94% recall

Promising potential for non-number images
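The trade-off between precision and recall can be explored by accepting only predictions above a confidence threshold. The sketch below is an assumption about how such figures could be computed, not the authors' evaluation code: precision is taken as the share of accepted predictions that are correct, and recall as the share of all images that are accepted and correct. The example data is made up.

```python
def precision_recall_at(predictions, threshold):
    """predictions: list of (confidence, is_correct) pairs."""
    accepted = [ok for conf, ok in predictions if conf >= threshold]
    correct = sum(accepted)
    precision = correct / len(accepted) if accepted else 1.0
    recall = correct / len(predictions)
    return precision, recall

def lowest_threshold_for(predictions, target_precision):
    """Return (threshold, precision, recall) for the lowest threshold
    that reaches the target precision, or None if none does."""
    for t in sorted({conf for conf, _ in predictions}):
        p, r = precision_recall_at(predictions, t)
        if p >= target_precision:
            return t, p, r
    return None

predictions = [(0.99, True), (0.97, True), (0.95, True),
               (0.80, False), (0.60, True), (0.55, False)]
result = lowest_threshold_for(predictions, 0.9)
print(result)
```

Predictions that fall below the chosen threshold would be left for manual transcription.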

12 of 21

It is cheap and fast to train a model using a PC or on a commercial cloud

  • 3-digit model: ~20 minutes on a gaming PC (with NVIDIA GTX1080)
  • 1-digit model: ~10 minutes

  • On Amazon Web Services the cost is about $10-20 for our models

13 of 21

But the codes are not evenly distributed

[Figure: cumulative distribution of occupation codes, ordered by frequency (most to least frequent)]

The 10 most frequent codes = 75% of all codes

The 81 most frequent codes = 95% of all codes
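Coverage figures like these can be computed from a list of transcribed codes. The following is a minimal sketch with toy data; `codes_needed_for_coverage` is a hypothetical helper, not part of the published code.

```python
from collections import Counter

def codes_needed_for_coverage(codes, target):
    """Return how many of the most frequent codes are needed to cover
    at least `target` (a fraction between 0 and 1) of all observations."""
    counts = Counter(codes)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return n
    return len(counts)

# Toy data: one dominant code, one common code, one rare code.
codes = ["111"] * 6 + ["222"] * 3 + ["333"] * 1
n_for_75 = codes_needed_for_coverage(codes, 0.75)
print(n_for_75)
```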

14 of 21

We can improve our preprocessing and better tune the model

| Issue | Solution |
|---|---|
| Conversion to black-and-white images loses information | Use grayscale images |
| Position of digits in 1-digit images varies | Center images |
| Digits are partially outside the cell box | Not solved |
| The model hyperparameters were not tuned | Use autotuning cloud services (AWS SageMaker) |
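Centering a digit can be sketched as moving the bounding box of the ink to the middle of the canvas. The slides do not say which centering method is used, so this is only one possible approach. The sketch assumes a grayscale image stored as a list of pixel rows with a known background value (0 here); real scans with a light background would need a different `background` value or an inversion step first.

```python
def center_digit(image, background=0):
    """Return a copy of `image` with the bounding box of all
    non-background pixels moved to the middle of the canvas."""
    height, width = len(image), len(image[0])
    ink = [(r, c) for r in range(height) for c in range(width)
           if image[r][c] != background]
    if not ink:
        return [row[:] for row in image]  # nothing to center
    top = min(r for r, _ in ink)
    bottom = max(r for r, _ in ink)
    left = min(c for _, c in ink)
    right = max(c for _, c in ink)
    # Row and column offsets that put the bounding box in the middle.
    dr = (height - (bottom - top + 1)) // 2 - top
    dc = (width - (right - left + 1)) // 2 - left
    centered = [[background] * width for _ in range(height)]
    for r, c in ink:
        centered[r + dr][c + dc] = image[r][c]  # grayscale values are kept
    return centered

image = [[0] * 5 for _ in range(5)]
image[0][0] = 255  # a single bright pixel in the top-left corner
centered = center_digit(image)
```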

15 of 21

We will use the annotated 3-digit numbers to train models for other code fields

We will train models with valid digits and known distribution using randomly selected digits from our training set

| Name | Number of digits | Valid range | Validation |
|---|---|---|---|
| Position in household | 1 | 1-3 | Central statistics |
| Resident | 1 | 1-5 | Central statistics |
| Marital status | 1 | 1-7 | Central statistics |
| Employment | 1 | 1-9 | |
| Second occupation | 3 | | |
| Lower education | 1 | 1-9 | Central statistics |
| Higher education | 3 | | Central statistics |
| Nationality | 2 | | |
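Building a training set for another code field from existing single-digit images can be sketched as weighted sampling restricted to the field's valid digits. In this sketch the valid range 1-3 is the one listed for position in household, while the frequencies, the image names, and the helper `sample_training_digits` are invented for illustration.

```python
import random

def sample_training_digits(examples, distribution, n, seed=0):
    """Draw n (image, digit) pairs whose digits follow `distribution`.

    examples: dict mapping digit -> list of images of that digit.
    distribution: dict mapping each valid digit -> relative frequency.
    """
    rng = random.Random(seed)
    digits = list(distribution)
    missing = [d for d in digits if not examples.get(d)]
    if missing:
        raise ValueError(f"no example images for digits: {missing}")
    weights = [distribution[d] for d in digits]
    chosen = rng.choices(digits, weights=weights, k=n)
    return [(rng.choice(examples[d]), d) for d in chosen]

# Placeholder image names stand in for real single-digit images.
examples = {d: [f"img_{d}_{i}" for i in range(5)] for d in range(10)}
distribution = {1: 0.3, 2: 0.3, 3: 0.4}  # made-up frequencies
sample = sample_training_digits(examples, distribution, 100)
```

Digits outside the valid range never appear in the sample, so a model trained on it cannot learn to predict them.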

16 of 21

We will evaluate the transferability of the 3-digit approach to handwritten birthplaces

Initial results using naive pre-processing ⇒ 92% accuracy for 25 most frequent classes

17 of 21

The dataset is too large for manual verification of the transcribed digits, but we can still verify the transcription

Strategy:

  1. Check that codes are valid
  2. Compare job distribution to known distribution of codes
  3. Use other fields (head of household) to verify transcribed codes

Technical needs:

  • Select fields
  • Calculate statistics
  • Keep track of verified and invalid fields
  • ⇒ Databases and SQL work well
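The first two steps of the strategy can be sketched with SQL run through Python's built-in `sqlite3` module. The table names, column names, status values, and sample rows below are all hypothetical; the slides do not describe the actual database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transcribed (
    cell_id INTEGER PRIMARY KEY,
    code    TEXT,
    status  TEXT DEFAULT 'unchecked')""")
conn.execute("CREATE TABLE valid_codes (code TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO valid_codes VALUES (?)",
                 [("111",), ("944",)])
conn.executemany("INSERT INTO transcribed (cell_id, code) VALUES (?, ?)",
                 [(1, "944"), (2, "111"), (3, "000"), (4, "944")])

# Step 1: check that codes are valid and keep track of the result.
conn.execute("""UPDATE transcribed SET status = CASE
    WHEN code IN (SELECT code FROM valid_codes) THEN 'valid'
    ELSE 'invalid' END""")

# Step 2: compute the observed share of each valid code, which can
# then be compared with a known distribution.
shares = conn.execute("""
    SELECT code,
           COUNT(*) * 1.0 /
           (SELECT COUNT(*) FROM transcribed WHERE status = 'valid')
    FROM transcribed
    WHERE status = 'valid'
    GROUP BY code
    ORDER BY code""").fetchall()

invalid = [row[0] for row in conn.execute(
    "SELECT cell_id FROM transcribed WHERE status = 'invalid'")]
print(shares, invalid)
```

Step 3, cross-checking against other fields such as head of household, would need joins against further tables and is not covered here.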

18 of 21

Model training is just a small part of the production pipeline

[Figure: production pipeline. Components shown: census table cell images, select training set, manual labeling (GUI), labeled training data, image preprocessing, model training and tuning, model, inference, transcribed data, transcription log, validation queries, validated transcribed data]

19 of 21

Our current production setup uses in-house servers

[Figure: in-house setup. Components shown: NFS filesystem (images), scripts, training dataset, verification database, research dataset, machine learning models, statistical analysis]

20 of 21

We plan to utilize cloud platform services

[Figure: planned cloud setup. Components shown: relational storage, SQL, managed machine learning, statistical analysis. New services: graph database, graph queries]

21 of 21

Lessons learned

  • Digit recognition is a solved problem...
  • ...but cannot be used out of the box

  • Image preprocessing is important
  • Easy and cheap to build your own model
  • Data validation strategy is important for production

  • The same approach can be applied to other digit codes and handwritten text

Contact information:

lars.ailo.bongo@uit.no

https://www.rhd.uit.no/

http://hdl.cs.uit.no/

Open access data:

https://doi.org/10.18710/OYIH83

Open source code:

https://github.com/uit-hdl/rhd-codes