1 of 13

Qurator.ai: Digital curation technologies for libraries

Clemens Neudecker, Mike Gerber, Kai Labusch, Felix Ostrowski, Vahid Rezanezhad, Robin Schaefer

ai4lam | FF21 | 15 March 2022

2 of 13

Berlin State Library (SBB)

Established 1661 in Berlin (Kingdom of Prussia)
Largest research library in Germany (25M+ media objects, ~2.5 petabytes of digital storage)
Forms part of the LAM legal entity Prussian Cultural Heritage Foundation (SPK)

https://staatsbibliothek-berlin.de/

In-house Digitization Center since 2007
Digital collections give access to ~20M pages of digitized documents (mostly public domain)

https://digital.staatsbibliothek-berlin.de/

3 of 13

Qurator.ai @ SBB

Qurator.ai – The platform for intelligent content solutions (BMBF, 2019 - 2022)
SBB is responsible for sub-project 10: “AI for digitized cultural heritage”
Our main goal: to improve the quality and efficiency of (digitized) document curation
Development of open source tools: https://github.com/qurator-spk
Publication of open datasets: https://zenodo.org/communities/stabi
Releases of trained models: https://qurator-data.de/
Interactive demos via SBB LAB: https://lab.sbb.berlin/

4 of 13

Image Preprocessing: Binarization

Binarization (i.e. the conversion of colour/grayscale images to black or white pixels) can increase the contrast between background (paper) and foreground (ink) and help remove defects, noise etc.
OCR engines use binarized images for text recognition
Training of an autoencoder model for document image binarization

https://github.com/qurator-spk/sbb_binarization
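
To make the idea of binarization concrete, here is a minimal sketch using classic Otsu thresholding on a tiny grayscale image. This is a simple global-threshold baseline, not the autoencoder model mentioned above; the function names and the sample image are illustrative assumptions.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # pixels with value <= t (ink side)
        if w_dark == 0:
            continue
        w_light = total - w_dark   # pixels with value > t (paper side)
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows, values 0-255) to pure black/white."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[255 if p > threshold else 0 for p in row] for row in image]


# Hypothetical 2x3 image: bright paper with two dark ink pixels.
sample = [[200, 210, 30],
          [205, 40, 220]]
print(binarize(sample))  # [[255, 255, 0], [255, 0, 255]]
```

A single global threshold tends to fail on uneven illumination or stains, which is one reason a learned model can be preferable for historical documents.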

5 of 13

Document Layout Analysis

High-quality analysis of document layout is key for subsequent text recognition
Training of multiple ResNet50-U-Net models for pixelwise segmentation
1st iteration (“pure” ML)

some problems with headings, drop capitals, reading order

2nd iteration (“hybrid” ML + heuristics)

additional heuristics with improvements for text lines and reading order detection

https://github.com/qurator-spk/eynollah

Text regions

Text lines
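
As an illustration of what a reading order heuristic can look like, the sketch below orders text region boxes column by column, then top to bottom. It is a deliberately naive example under assumed inputs (axis-aligned `(x, y, width, height)` boxes) and is not the heuristic implemented in eynollah.

```python
def reading_order(regions):
    """Order (x, y, w, h) boxes column by column, top to bottom within a column."""
    columns = []  # each entry: [right_edge_of_column, [boxes]]
    for box in sorted(regions, key=lambda b: b[0]):
        x, y, w, h = box
        if columns and x < columns[-1][0]:
            # Box starts left of the current column's right edge: same column.
            columns[-1][1].append(box)
            columns[-1][0] = max(columns[-1][0], x + w)
        else:
            columns.append([x + w, [box]])
    ordered = []
    for _, boxes in columns:
        ordered.extend(sorted(boxes, key=lambda b: b[1]))
    return ordered


# Hypothetical two-column page with two regions per column.
page = [(520, 100, 400, 300), (50, 450, 400, 300),
        (50, 100, 400, 300), (520, 450, 400, 300)]
print(reading_order(page))
```

In this sketch a heading that spans both columns would merge them into a single column, which hints at why headings make reading order detection harder.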

6 of 13

Image (Similarity) Search

Document layout analysis provides information about image content in the digitized documents
We extracted ~600,000 images from scanned documents
We trained an image classification model on the basis of ImageNet
Regions of interest (ROI) within an image are detected using YOLO v3
Approximate nearest neighbour search is used to find similar images
This offers an alternative way to search and browse the digitized collections

https://github.com/qurator-spk/sbb_images
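
The similarity search step can be sketched as follows. The example uses exact (brute-force) cosine similarity over a handful of made-up feature vectors; at the scale of ~600,000 images an approximate nearest neighbour index would replace the exhaustive loop. Image names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, index, k=2):
    """Return the names of the k indexed images closest to the query vector."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical 3-dimensional feature vectors (real embeddings are much larger).
index = {
    "map_01": [0.9, 0.1, 0.0],
    "map_02": [0.8, 0.2, 0.1],
    "portrait_01": [0.1, 0.9, 0.2],
    "ornament_01": [0.0, 0.2, 0.9],
}
print(most_similar([1.0, 0.1, 0.0], index))  # ['map_01', 'map_02']
```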

7 of 13

OCR / Text Recognition

OCR for historical documents is hard (old fonts, complex layouts, defects and damages, historical spelling)
Thanks to deep learning OCR (Calamari) and public ground truth (GT) datasets (GT4HistOCR), nearly error-free OCR is now possible!
A single (language-independent) OCR model can be used for both Fraktur and Antiqua (also mixed)
Initial evaluations show reductions of the character error rate (CER) from ~20% to ~2%

https://github.com/qurator-spk/ocrd_calamari
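
For readers unfamiliar with the metric: the character error rate is commonly computed as the edit (Levenshtein) distance between OCR output and ground truth, divided by the length of the ground truth. A minimal sketch, assuming this standard definition (the slides do not specify the exact evaluation tooling):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong character in a 16-character word.
print(character_error_rate("Staatsbibliothek", "Staatsbibliothet"))  # 0.0625
```

By the same arithmetic, going from ~20% to ~2% CER is a relative error reduction of about 90%.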

8 of 13

OCR Postcorrection

Even with highly accurate OCR, there remain a few recognition errors
Idea: train a machine translation model to “translate” OCR errors to correct words
Challenges:

retain historical spelling variants
avoid introducing new errors

Two-step model (seq2seq LSTM):

First, detect the parts of text with errors (this helps artificially increase the error density in the input for step two)
Second, translate (i.e. correct) the errors in the OCR text

Relative OCR accuracy improvement: 18%

https://github.com/qurator-spk/sbb_ocr_postcorrection
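
The effect of the detection step on error density can be shown with a little arithmetic. In the sketch below, the window sizes, error counts and detector flags are all made-up values; in the real pipeline the flags would come from the step-one detector model.

```python
def error_density(windows):
    """windows: list of (num_chars, num_errors). Returns errors per character."""
    chars = sum(n for n, _ in windows)
    errors = sum(e for _, e in windows)
    return errors / chars if chars else 0.0


def keep_flagged(windows, flags):
    """Keep only the windows that the (hypothetical) detector marked as erroneous."""
    return [w for w, flagged in zip(windows, flags) if flagged]


# Five 40-character windows; only two of them contain OCR errors.
windows = [(40, 0), (40, 2), (40, 0), (40, 0), (40, 1)]
flags = [False, True, False, False, True]  # assumed detector output

print(error_density(windows))                       # 0.015
print(error_density(keep_flagged(windows, flags)))  # 0.0375
```

With mostly clean text filtered out, the correction model sees a higher share of actual errors and has fewer opportunities to alter text that was already correct.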

9 of 13

Named Entity Recognition

Named Entity Recognition (NER) is used to identify proper names of persons, locations, organizations in unstructured text (here: OCR results)
Unsupervised pre-training of a BERT model on the digitized historical documents
Supervised training of the BERT model for NER with labeled German NER data
Results are state of the art, with an F1 score of 85.6%

https://github.com/qurator-spk/sbb_ner
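
As a reminder of what the F1 score measures, here is a minimal sketch of entity-level F1 with exact span and type matching. The matching scheme is an assumption; the slides do not state which evaluation protocol was used. Entities are represented as hypothetical `(start, end, type)` tuples.

```python
def entity_f1(gold, predicted):
    """Entity-level F1: an entity counts only if span and type match exactly."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: the second entity has the wrong type in the prediction.
gold = [(0, 1, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 1, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]
print(round(entity_f1(gold, predicted), 3))  # 0.667
```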

10 of 13

Named Entity Disambiguation and Linking

Entities recognized by NER can be ambiguous
Example: “Paris is in France” - Paris the city or Paris (Hilton) the person?
The correct entity must be determined from context
Establishing a knowledge base for comparison based on Wikidata/Wikipedia (harvesting of all articles for the corresponding categories)
Training of a “context-comparison” BERT embeddings model that decides for a given entity in the OCR text whether it is similar to a Wikipedia lemma
Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms

https://github.com/qurator-spk/sbb_ned
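
The context comparison idea can be sketched with a much simpler stand-in: instead of BERT embeddings, the example below compares bag-of-words vectors of the mention context and of short candidate descriptions. The candidate keys and descriptions are hypothetical placeholders, not real knowledge base entries.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def link_entity(context, candidates):
    """Pick the candidate whose description is most similar to the context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda key: cosine(ctx, bag_of_words(candidates[key])))


# Hypothetical candidate entries for the mention "Paris".
candidates = {
    "paris_city": "capital city of france on the river seine",
    "paris_hilton": "american media personality and businesswoman",
}
print(link_entity("Paris is in France", candidates))  # paris_city
```

Word overlap breaks down quickly on historical spelling and OCR noise, which is where contextual embeddings are expected to help.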

11 of 13

Data Annotation

neat (named entity annotation tool) for data annotation
Simple, browser-based JavaScript tool (no installation or admin rights required)
TSV (tab-separated values)

internal working format

Embed image snippets via IIIF Image API to support annotation
neat can also be used for OCR correction or transcription (e.g. to create GT)

https://github.com/qurator-spk/neat
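
To show how a TSV row and the IIIF Image API fit together, the sketch below reads one token with pixel coordinates and builds a region request following the IIIF URL pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The column names, base URL and image identifier are assumptions for illustration and may differ from what neat actually uses.

```python
import csv
import io


def iiif_region_url(base_url, identifier, left, top, width, height,
                    size="full", rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API URL for a rectangular region given in pixels."""
    region = f"{left},{top},{width},{height}"
    return f"{base_url}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical TSV content: one token with its bounding box on the page image.
sample_tsv = (
    "TOKEN\tNE-TAG\tleft\tright\ttop\tbottom\n"
    "Berlin\tB-LOC\t120\t260\t340\t380\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
row = rows[0]
url = iiif_region_url(
    "https://iiif.example.org/images",  # hypothetical IIIF server
    "page-0001",                        # hypothetical image identifier
    left=int(row["left"]),
    top=int(row["top"]),
    width=int(row["right"]) - int(row["left"]),
    height=int(row["bottom"]) - int(row["top"]),
)
print(url)
```

The size keyword `full` applies to version 2 of the Image API; version 3 uses `max` instead.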

12 of 13

Future Work

Follow-up project “Mensch - Maschine - Kultur” (2022 - 2024) with four sub-projects:

Multi-modal methods for document layout analysis and OCR for (historic) Asian languages (CrossAsia)
Image extraction, classification and analysis
Metadata enrichment and AI-supported cataloguing
Curation and ethical issues in cultural AI datasets

13 of 13

Thank you for your attention!

Questions?

Clemens Neudecker, Mike Gerber, Kai Labusch, Felix Ostrowski, Vahid Rezanezhad, Robin Schaefer

ai4lam | FF21 | 15 March 2022