A practical approach for deep learning based transcription of handwritten historical protocols
Lars Ailo Bongo, Bjørn-Richard Pedersen, Tim Alexander Teige, Nikita Shvetsov, Johan Ravn, Einar Holsbø, Trygve Andersen, Gunnar Thorvaldsen
Keywords: Machine learning, SQL, Graph database
Conflict of interest: Medsensio AS, 3StepBio AS
We are transcribing the full count 1950 Norwegian population census
Handwritten text in Norwegian
Can be linked to other datasets
Number codes
Birthplaces
Person names
Handwritten digit recognition is a textbook machine learning example, so transcription should be easy, right?
Many cloud platforms provide handwritten text recognition services
We need a practical approach for transcription
We evaluated four approaches for 3-digit code classification:
1. Train a model on the labeled 3-digit dataset
2. Train a model on the labeled 3-digit dataset, split into single digits
3. Tune a model on the labeled 3-digit dataset
4. Train using MNIST data
Deep learning: train a model using labeled images
39,592 manually labeled 3-digit codes
Training set: 85%
Test set: 15%
Deep learning model
[Figure: example training pair, an image of the handwritten code 944 with label 944]
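The sketch below illustrates how such a model can be trained on the labeled images with an 85%/15% split; the file names, image size, and layer sizes are illustrative assumptions, not the architecture used on the poster.

```python
# Minimal sketch of approach 1, assuming the labeled cell images have been exported
# as fixed-size grayscale arrays; file names and the network are hypothetical.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

images = np.load("cell_images.npy")   # e.g. shape (39592, 64, 64, 1), values in [0, 1]
labels = np.load("cell_labels.npy")   # one integer class per distinct 3-digit code
num_classes = int(labels.max()) + 1

# 85% training / 15% test split, as on the poster
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.15, random_state=42)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```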
Deep learning: tune a model using labeled 3-digit images
39,592 manually transcribed 3-digit codes
Training set: 85%
Test set: 15%
Deep learning model
[Figure: the tuned model classifies an unlabeled cell image; e.g. a prediction (944, 0.96) is the code 944 with confidence 0.96, which is transcribed as 944]
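The poster does not state which pre-trained model or tuning service was used; the sketch below shows one way the tuning approach could look, using an ImageNet-pretrained backbone as a stand-in.

```python
# Sketch of tuning a pre-trained image model on the labeled 3-digit images.
# MobileNetV2 and the input size are only illustrations, not the poster's setup.
import tensorflow as tf

num_classes = 1000   # at most one class per possible 3-digit code (000-999)

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet")
base.trainable = False                    # first train only the new classifier head

inputs = tf.keras.Input(shape=(96, 96, 1))                    # grayscale cell image
x = tf.keras.layers.Concatenate()([inputs, inputs, inputs])   # repeat channel for the RGB backbone
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

# Then unfreeze the backbone and continue tuning with a low learning rate
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```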
Deep learning: train a 1-digit model using labeled 3-digit images
39,592 manually labeled 3-digit codes
Training set: 85%
Test set: 15%
Deep learning model
The 3-digit images in the training set are split into 100,959 1-digit images
[Figure: e.g. an image of the code 944 is split into three 1-digit images labeled 9, 4, and 4]
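A minimal sketch of the splitting step, assuming a naive equal-width split of each cell image; the real crops likely also need the preprocessing improvements listed later.

```python
# Split each labeled 3-digit cell image into three labeled 1-digit images.
import numpy as np

def split_into_digits(cell_image: np.ndarray, label: str):
    """Split a (H, W) image of a 3-digit code into three (crop, digit) pairs."""
    height, width = cell_image.shape
    third = width // 3
    pairs = []
    for i, digit in enumerate(label):              # e.g. label "944" -> digits 9, 4, 4
        crop = cell_image[:, i * third:(i + 1) * third]
        pairs.append((crop, int(digit)))
    return pairs

# Example: one labeled 3-digit image becomes three labeled 1-digit images
# pairs = split_into_digits(image_of_944, "944")
```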
We evaluated four approaches for 3-digit code classification:
1. Train a model on the labeled 3-digit dataset
2. Train a model on the labeled 3-digit dataset, split into single digits
3. Tune a model on the labeled 3-digit dataset
4. Train using MNIST data
Training with MNIST data gave low accuracy; tuning an existing model also gave low accuracy at a high cost, and the images still needed preprocessing.
The average precision and recall are good for both 1-digit and 3-digit classification:
98% precision ⇒ 98% recall, 99% precision ⇒ 95% recall: we can use the model in production
98% precision ⇒ 96% recall, 99% precision ⇒ 94% recall: promising potential for non-number images
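One common way to obtain such precision/recall pairs is to accept an automatic transcription only when the model's confidence exceeds a threshold. The poster does not state its exact definitions; the sketch below treats precision as correctness among accepted cells and recall as the share of all cells that receive a correct automatic transcription.

```python
# Precision/recall trade-off via a confidence threshold on the softmax output.
import numpy as np

def precision_recall_at_threshold(probs: np.ndarray, y_true: np.ndarray, threshold: float):
    """probs: (n, num_classes) softmax outputs; y_true: (n,) true class indices."""
    confidence = probs.max(axis=1)
    predicted = probs.argmax(axis=1)
    accepted = confidence >= threshold
    correct = predicted == y_true
    if accepted.sum() == 0:
        return 1.0, 0.0
    precision = correct[accepted].mean()        # correctness among accepted cells
    recall = (correct & accepted).mean()        # correctly auto-transcribed share of all cells
    return float(precision), float(recall)

# Sweeping the threshold gives pairs like "99% precision => 95% recall":
# for t in (0.5, 0.9, 0.99, 0.999):
#     print(t, precision_recall_at_threshold(test_probs, y_test, t))
```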
It is cheap and fast to train a model on a PC or in a commercial cloud
But the codes are not evenly distributed
[Figure: cumulative distribution of occupation codes ordered from most to least frequent; the 10 most frequent codes cover 75% of all codes, and the 81 most frequent cover 95%]
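A sketch of the frequency analysis behind the figure, assuming the labeled codes are available one per line in a text file (the file name is hypothetical).

```python
# Cumulative distribution of occupation code frequencies.
from collections import Counter
import numpy as np

codes = [line.strip() for line in open("occupation_codes.txt")]   # hypothetical input

counts = np.array(sorted(Counter(codes).values(), reverse=True))  # per-code frequencies
cumulative = np.cumsum(counts) / counts.sum()

print(f"10 most frequent codes cover {100 * cumulative[9]:.0f}% of all codes")
print(f"81 most frequent codes cover {100 * cumulative[80]:.0f}% of all codes")
```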
We can improve our preprocessing and better tune the model
Issue | Solution |
Conversion to black and white images loses information | Use grayscale images |
Position of digits in 1-digit images varies | Center the images |
Digits are partially outside cell box | Not solved |
The model hyperparameters were not tuned | Use autotuning cloud services (AWS SageMaker) |
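A sketch of the first two improvements (keep grayscale instead of thresholding to black and white, and center the digit by its center of mass); the noise threshold and output size are illustrative.

```python
# Grayscale preprocessing with center-of-mass centering, using NumPy and Pillow.
import numpy as np
from PIL import Image

def preprocess(path: str, size: int = 64) -> np.ndarray:
    gray = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    ink = 1.0 - gray                          # dark (ink) pixels get high values
    ink[ink < 0.2] = 0.0                      # suppress light background noise

    # Shift the crop so the ink's center of mass lies in the middle
    ys, xs = np.nonzero(ink)
    if len(ys) > 0:
        dy = ink.shape[0] // 2 - int(ys.mean())
        dx = ink.shape[1] // 2 - int(xs.mean())
        ink = np.roll(ink, (dy, dx), axis=(0, 1))

    resized = Image.fromarray((ink * 255).astype(np.uint8)).resize((size, size))
    return np.asarray(resized, dtype=np.float32) / 255.0
```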
We will use the annotated 3-digit numbers to train models for other code fields
We will train models for fields with known valid digits and distributions by composing codes from randomly selected digits in our training set (see the sketch after the table)
Name | Number of digits | Valid range | Validation |
Position in household | 1 | 1-3 | Central statistics |
Resident | 1 | 1-5 | Central statistics |
Marriage status | 1 | 1-7 | Central statistics |
Employment | 1 | 1-9 | |
Second occupation | 3 | | |
Lower education | 1 | 1-9 | Central statistics |
Higher education | 3 | | Central statistics |
Nationality | 2 | | |
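A sketch of how training examples for such a field could be composed from randomly selected 1-digit crops, given the field's valid range and a known code distribution; the helper function and its inputs are hypothetical.

```python
# Compose synthetic training images for a code field from existing 1-digit crops.
import numpy as np

rng = np.random.default_rng(42)

def synthesize_field_examples(digit_images, field_values, field_probs, n, width=1):
    """digit_images: dict digit -> list of 1-digit crops (same height);
    field_values/field_probs: valid codes and their known distribution;
    width: number of digits in the field."""
    examples = []
    for _ in range(n):
        value = rng.choice(field_values, p=field_probs)    # e.g. marriage status in 1-7
        digits = str(value).zfill(width)
        crops = [digit_images[int(d)][rng.integers(len(digit_images[int(d)]))]
                 for d in digits]
        examples.append((np.concatenate(crops, axis=1), value))  # stitch crops side by side
    return examples
```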
We will evaluate the transferability of the 3-digit approach to handwritten birthplaces
Initial results using naive preprocessing ⇒ 92% accuracy for the 25 most frequent classes
The dataset is too large for manual verification of the transcribed digits, but we can still verify the transcription
Strategy:
Technical needs:
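As one hedged illustration of how transcriptions could be verified without checking every cell manually, the sketch below flags out-of-range or low-confidence values for manual review; the column names, field names, and threshold are assumptions.

```python
# Flag transcribed values that fall outside the valid range or have low confidence.
import pandas as pd

VALID_RANGES = {"position_in_household": (1, 3), "resident": (1, 5),
                "marriage_status": (1, 7), "employment": (1, 9),
                "lower_education": (1, 9)}

def flag_for_review(df: pd.DataFrame, field: str, min_confidence: float = 0.95) -> pd.Series:
    """Return a boolean Series: True means the cell should go to manual review."""
    low, high = VALID_RANGES[field]
    out_of_range = ~df[field].between(low, high)
    low_confidence = df[field + "_confidence"] < min_confidence
    return out_of_range | low_confidence
```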
Model training is just a small part of the production pipeline
[Pipeline figure: census table cell images → select training set → manual labeling (GUI) → labeled training data → image preprocessing → model training and tuning → model. Census table cell images → inference with the trained model → transcribed data and transcription log → validation queries → validated transcribed data]
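A sketch of the inference stage of this pipeline, writing both the transcribed data and a transcription log with confidences for the later validation queries; the paths, model file, and columns are illustrative.

```python
# Classify each cell image, store the transcription, and log the confidence.
import csv
import glob
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("three_digit_model.keras")   # hypothetical model file

with open("transcribed.csv", "w", newline="") as out, \
     open("transcription_log.csv", "w", newline="") as log:
    out_w, log_w = csv.writer(out), csv.writer(log)
    out_w.writerow(["image", "code"])
    log_w.writerow(["image", "code", "confidence", "model"])
    for path in glob.glob("cells/*.png"):
        img = tf.keras.utils.load_img(path, color_mode="grayscale", target_size=(64, 64))
        x = tf.keras.utils.img_to_array(img)[None] / 255.0       # shape (1, 64, 64, 1)
        probs = model.predict(x, verbose=0)[0]
        code, conf = int(np.argmax(probs)), float(np.max(probs))
        out_w.writerow([path, code])
        log_w.writerow([path, code, conf, "three_digit_model"])
```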
Our current production setup uses in-house servers
[Figure: NFS filesystem with images, scripts, the training dataset, and machine learning models; a verification database; and a research dataset used for statistical analysis]
We plan to utilize cloud platform services
[Figure: relational storage queried with SQL, managed machine learning, statistical analysis, and new services such as a graph database with graph queries]
Lessons learned
Contact information: lars.ailo.bongo@uit.no, https://www.rhd.uit.no/
Open access data: https://doi.org/10.18710/OYIH83
Open source code: