1 of 21

A practical approach for deep learning based transcription of handwritten historical protocols

Lars Ailo Bongo, Bjørn-Richard Pedersen, Tim Alexander Teige, Nikita Shvetsov, Johan Ravn, Einar Holsbø, Trygve Andersen, Gunnar Thorvaldsen

Machine learning

SQL

Graph database

Conflict of interest: Medsensio AS, 3StepBio AS

2 of 21

We are transcribing the full-count 1950 Norwegian population census

Handwritten text in Norwegian

801,000 questionnaires
1.6M scans
26 columns
6.8M rows
Split into 177M images
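
The image count follows from the table dimensions. A quick check, assuming one cell image per row and column and using the rounded figures from this slide (exact counts are not given here):

```python
# Rounded figures from the slide; the exact counts are not given.
rows = 6_800_000      # census table rows
columns = 26          # columns per questionnaire

# Assumption: every row contributes one cell image per column.
cell_images = rows * columns
print(f"{cell_images / 1e6:.1f}M cell images")
```

The result rounds to the 177M images stated above.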

Can be linked to other datasets

Number codes

Birthplaces

Person names

3 of 21

Handwritten digit recognition is a textbook example, so it should be easy to do?

4 of 21

Many cloud platforms provide handwritten text recognition services

5 of 21

We need a practical approach for transcription

Human level precision: 98-99%
Useful recall: 95%

Build on powerful services

SQL
Machine learning
Graph databases

Reduce operational overhead
Low cost
Secure data storage and processing

6 of 21

We evaluated four approaches for 3-digit code classification

Train model on labeled 3-digit dataset

Train model on labeled 3-digit dataset, split into single digits

Train model on labeled 3-digit dataset

Train using MNIST data

7 of 21

Deep learning: train a model using labeled images

39,592 manually labeled 3-digit codes

Training set: 85%

Test set: 15%

Deep learning model

(image, 944)
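
A minimal sketch of the 85% / 15% split. The slides do not say how the split was drawn, so a seeded random shuffle is assumed; `split_train_test` and its parameters are hypothetical names, and the samples are stand-in indices rather than real (image, code) pairs:

```python
import random

def split_train_test(samples, train_fraction=0.85, seed=0):
    """Shuffle a copy of samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 39,592 labeled (image, code) pairs: indices only.
samples = range(39_592)
train, test = split_train_test(samples)
print(len(train), len(test))
```

Fixing the seed keeps the test set stable between training runs, so accuracy figures stay comparable.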

8 of 21

Deep learning: tune a model using labeled 3-digit images

39,592 manually transcribed 3-digit codes

Training set: 85%

Test set: 15%

Deep learning model

(image, ?)

(944, 0.96)

(944)

9 of 21

Deep learning: train a 1-digit model using labeled 3-digit images

39,592 manually labeled 3-digit codes

Training set: 85%

Test set: 15%

Deep learning model

(image, 9)

Split images into 100,959 single-digit images

(image, 4)
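
A sketch of turning one labeled 3-digit image into three labeled 1-digit images. The slides do not describe the splitting method, so a naive equal-width cut is assumed; `split_code_image` is a hypothetical helper and the image is a tiny dummy grid:

```python
def split_code_image(image, label):
    """Cut a 3-digit image into three equal-width parts, one per digit.

    image: list of pixel rows (lists of equal length)
    label: 3-character string such as "944"
    Returns a list of (digit_image, digit_label) pairs.
    """
    if len(label) != 3:
        raise ValueError("expected a 3-digit label")
    width = len(image[0])
    # Column boundaries of the three parts (assumption: equal widths).
    bounds = [round(i * width / 3) for i in range(4)]
    pairs = []
    for i, digit in enumerate(label):
        part = [row[bounds[i]:bounds[i + 1]] for row in image]
        pairs.append((part, int(digit)))
    return pairs

# Tiny 2x6 dummy image; real cell images are much larger.
image = [[1, 1, 2, 2, 3, 3],
         [1, 1, 2, 2, 3, 3]]
for part, digit in split_code_image(image, "944"):
    print(digit, part)
```

An equal-width cut fails when digits touch or are unevenly spaced, so real data would likely need a more careful segmentation step.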

10 of 21

We evaluated four approaches for 3-digit code classification

Train model on labeled 3-digit dataset

Train model on labeled 3-digit dataset, split into single digits

Train model on labeled 3-digit dataset

Train using MNIST data

Low accuracy

Low accuracy, high cost, still need to preprocess images

11 of 21

The average precision and recall are good for 1-digit and 3-digit classification

98% precision ⇒ 98% recall

99% precision ⇒ 95% recall

We can use the model in production

98% precision ⇒ 96% recall

99% precision ⇒ 94% recall

Promising potential for non-number images
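
One way to read the precision ⇒ recall pairs is as a confidence threshold trade-off. The sketch below assumes that predictions below a threshold are left for manual transcription, that precision is the share of accepted predictions that are correct, and that recall is the share of all images that are accepted and correct. These definitions, the function names, and the toy data are assumptions, not taken from the slides:

```python
def precision_recall_at(predictions, threshold):
    """predictions: list of (confidence, is_correct) pairs."""
    accepted = [ok for conf, ok in predictions if conf >= threshold]
    if not accepted:
        return 1.0, 0.0
    correct = sum(accepted)
    return correct / len(accepted), correct / len(predictions)

def lowest_threshold_for(predictions, target_precision):
    """Return (threshold, precision, recall) for the lowest threshold
    that reaches the target precision, or None if none does."""
    for threshold in sorted({conf for conf, _ in predictions}):
        precision, recall = precision_recall_at(predictions, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None

# Made-up toy predictions, not results from the census models.
toy = [(0.99, True), (0.96, True), (0.90, True), (0.80, False), (0.60, True)]
print(lowest_threshold_for(toy, 0.98))
```

Raising the target precision pushes the threshold up and the recall down, which matches the direction of the figures on this slide.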

12 of 21

It is cheap and fast to train a model using a PC or on a commercial cloud

3-digit model: ~20 minutes on a gaming PC (with NVIDIA GTX 1080)
1-digit model: ~10 minutes

On Amazon Web Services the cost is about $10-20 for our models

13 of 21

But the codes are not evenly distributed

Cumulative distribution

Occupation codes ordered by frequency (more → less)

The 10 most frequent codes = 75% of all codes

The 81 most frequent codes = 95% of all codes
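
The coverage figures can be computed from a list of transcribed codes. A small sketch, with a hypothetical helper name and a made-up sample instead of census data:

```python
from collections import Counter

def codes_needed_for_coverage(codes, coverage):
    """Number of most frequent codes whose counts reach the given
    share (0..1) of all code occurrences."""
    counts = Counter(codes)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return rank
    return len(counts)

# Made-up skewed sample, not census data.
codes = ["944"] * 6 + ["112"] * 2 + ["305", "871"]
print(codes_needed_for_coverage(codes, 0.75))
```

With such a skewed distribution, per-class accuracy on the most frequent codes dominates the overall result, while rare codes have few training examples.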

14 of 21

We can improve our preprocessing and better tune the model

Issue	Solution
Conversion to black-and-white images loses information	Use grayscale images
Position of digits in 1-digit images varies	Center images
Digits are partially outside the cell box	Not solved
The model hyperparameters were not tuned	Use autotuning cloud services (AWS SageMaker)
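
A sketch of the "center images" fix for 1-digit images. The actual centering method is not given in the slides; this version assumes a known background value and centers the bounding box of the non-background pixels, keeping grayscale intensities. `center_digit` is a hypothetical helper:

```python
def center_digit(image, background=0):
    """Shift the non-background pixels so their bounding box is centered."""
    height, width = len(image), len(image[0])
    ink = [(r, c) for r in range(height) for c in range(width)
           if image[r][c] != background]
    if not ink:
        return [row[:] for row in image]  # nothing to center
    top, bottom = min(r for r, _ in ink), max(r for r, _ in ink)
    left, right = min(c for _, c in ink), max(c for _, c in ink)
    # Offsets that move the bounding box to the middle of the canvas.
    dr = (height - (bottom - top + 1)) // 2 - top
    dc = (width - (right - left + 1)) // 2 - left
    centered = [[background] * width for _ in range(height)]
    for r, c in ink:
        centered[r + dr][c + dc] = image[r][c]
    return centered

# Toy 3x5 grayscale image with the "digit" stuck in the top-left corner.
image = [[7, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
print(center_digit(image))
```

Centering on the center of mass instead of the bounding box is a common alternative and may behave better for digits with long strokes.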

15 of 21

We will use the annotated 3-digit numbers to train models for other code fields

We will train models with valid digits and known distribution using randomly selected digits from our training set

Name	Number of digits	Valid range	Validation
Position in household	1	1-3	Central statistics
Resident	1	1-5	Central statistics
Marital status	1	1-7	Central statistics
Employment	1	1-9
Second occupation	3
Lower education	1	1-9	Central statistics
Higher education	3		Central statistics
Nationality	2
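
A sketch of building a training set for another code field from the existing single-digit images, restricted to the valid digits and drawn according to a known distribution. The helper name, the placeholder image names, and the weights are invented for illustration and are not census statistics:

```python
import random

def sample_training_digits(digit_images, weights, n, seed=0):
    """Draw n (image, digit) pairs following a target digit distribution.

    digit_images: dict mapping digit -> list of single-digit images
    weights: dict mapping each valid digit -> its known relative frequency
    """
    rng = random.Random(seed)
    valid = sorted(weights)
    digits = rng.choices(valid, weights=[weights[d] for d in valid], k=n)
    return [(rng.choice(digit_images[d]), d) for d in digits]

# Hypothetical field with valid range 1-3; the weights are invented.
digit_images = {d: [f"img_{d}_{i}" for i in range(5)] for d in range(10)}
weights = {1: 0.5, 2: 0.3, 3: 0.2}
sample = sample_training_digits(digit_images, weights, n=1000)
print(sorted({d for _, d in sample}))
```

Digits outside the valid range never enter the training set, so the resulting model cannot learn to predict them for that field.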

16 of 21

We will evaluate the transferability of the 3-digit approach to handwritten birthplaces

Initial results using naive preprocessing ⇒ 92% accuracy for the 25 most frequent classes

17 of 21

The dataset is too large for manual verification of the transcribed digits, but we can still verify the transcription

Strategy:

Check that codes are valid
Compare job distribution to known distribution of codes
Use other fields (head of household) to verify transcribed codes

Technical needs:

Select fields
Calculate statistics
Keep track of verified and invalid fields
⇒ Databases and SQL work well
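
A sketch of the first two strategy steps as SQL, run here through Python's built-in `sqlite3` module. The table names, column names, and rows are hypothetical and do not describe the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transcription (
        cell_id INTEGER PRIMARY KEY,
        field TEXT NOT NULL,
        code TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'unverified'
    )""")
conn.execute("CREATE TABLE valid_code (field TEXT, code TEXT)")

# Invented example rows.
conn.executemany("INSERT INTO valid_code VALUES (?, ?)",
                 [("occupation", "944"), ("occupation", "112")])
conn.executemany("INSERT INTO transcription (field, code) VALUES (?, ?)",
                 [("occupation", "944"), ("occupation", "944"),
                  ("occupation", "112"), ("occupation", "000")])

# Step 1: mark codes that do not appear in the list of valid codes.
conn.execute("""
    UPDATE transcription SET status = 'invalid'
    WHERE NOT EXISTS (SELECT 1 FROM valid_code v
                      WHERE v.field = transcription.field
                        AND v.code = transcription.code)""")

# Step 2: code distribution of the remaining rows, for comparison
# with a known distribution.
distribution = conn.execute("""
    SELECT code, COUNT(*) FROM transcription
    WHERE status != 'invalid'
    GROUP BY code ORDER BY COUNT(*) DESC, code""").fetchall()
invalid = conn.execute(
    "SELECT cell_id, code FROM transcription WHERE status = 'invalid'"
).fetchall()
print(distribution, invalid)
```

Keeping the status in the same table makes it straightforward to track which fields are verified, invalid, or still unchecked.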

18 of 21

Model training is just a small part of the production pipeline

[Diagram: production pipeline. Box labels: Census table cell images; Select training set; Manual labeling (GUI); Labeled training data; Image preprocessing; Model training and tuning; Model; Inference; Transcribed data; Transcription log; Validation queries; Validated transcribed data]

19 of 21

Our current production setup uses in-house servers

NFS filesystem: images

Scripts

Training dataset

Verification database

Research dataset

Machine learning models

Statistical analysis

20 of 21

We plan to utilize cloud platform services

Relational storage

SQL

Managed machine learning

Statistical analysis

New services

Graph database

Graph queries

21 of 21

Lessons learned

Digit recognition is a solved problem...
...but cannot be used out of the box

Image preprocessing is important
Easy and cheap to build own model
Data validation strategy is important for production

Same approach can be applied for other digit codes and handwritten text

Contact information:
lars.ailo.bongo@uit.no
https://www.rhd.uit.no/
http://hdl.cs.uit.no/

Open access data:
https://doi.org/10.18710/OYIH83

Open source code:
https://github.com/uit-hdl/rhd-codes