1 of 13

Qurator.ai: Digital curation technologies for libraries

Clemens Neudecker, Mike Gerber, Kai Labusch, Felix Ostrowski, Vahid Rezanezhad, Robin Schaefer

ai4lam | FF21 | 15 March 2022

2 of 13

Berlin State Library (SBB)

  • Established 1661 in Berlin (Kingdom of Prussia)
  • Largest research library in Germany (25M+ media objects, ~2.5 petabytes of digital storage)
  • Forms part of the LAM legal entity Prussian Cultural Heritage Foundation (SPK)

https://staatsbibliothek-berlin.de/

  • In-house Digitization Center since 2007
  • Digital collections give access to ~20M pages of digitized documents (mostly in the public domain)

https://digital.staatsbibliothek-berlin.de/

3 of 13

Qurator.ai @ SBB

  • Qurator.ai – The platform for intelligent content solutions (BMBF, 2019 - 2022)
  • SBB is responsible for sub-project 10: “AI for digitized cultural heritage”
  • Our main goal: to improve the quality and efficiency of (digitized) document curation
  • Development of open source tools: https://github.com/qurator-spk
  • Publication of open datasets: https://zenodo.org/communities/stabi
  • Releases of trained models: https://qurator-data.de/
  • Interactive demos via SBB LAB: https://lab.sbb.berlin/

4 of 13

Image Preprocessing: Binarization

  • Binarization (i.e. the conversion of colour/grayscale images to black or white pixels) can increase the contrast between background (paper) and foreground (ink) and help remove defects, noise etc.
  • OCR engines use binarized images for text recognition
  • Training of autoencoder model for document image binarization

https://github.com/qurator-spk/sbb_binarization
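
To make the idea of binarization concrete, here is a minimal sketch of classical Otsu thresholding in plain Python. This is a swapped-in textbook baseline, not the autoencoder model trained in the project; the function names and the tiny sample image are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0..255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # pixels with value <= t (dark class)
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # pixels with value > t (bright class)
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map each pixel to 0 (ink) or 255 (paper) using a global threshold."""
    return [[255 if p > threshold else 0 for p in row] for row in image]


# Hypothetical 2x3 grayscale patch: dark ink (~30-40) on bright paper (~200-220).
image = [[30, 40, 200], [210, 35, 220]]
threshold = otsu_threshold([p for row in image for p in row])
binary = binarize(image, threshold)
```

A single global threshold like this tends to struggle with stains, bleed-through and uneven lighting, which is one motivation for learned, pixelwise approaches.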

5 of 13

Document Layout Analysis

  • High-quality analysis of document layout is key for subsequent text recognition
  • Training of multiple ResNet50-U-Net models for pixelwise segmentation
  • 1st iteration (“pure” ML)
    • some problems with headings, drop capitals, reading order
  • 2nd iteration (“hybrid” ML + heuristics)
    • additional heuristics with improvements for textlines and reading order detection

https://github.com/qurator-spk/eynollah

[Figure panels: Text regions | Text lines]

6 of 13

Image (Similarity) Search

  • Document layout analysis provides information about image content in the digitized documents
  • We extracted ~600,000 images from scanned documents
  • We trained an image classification model on the basis of ImageNet
  • Regions of interest (ROI) within an image are detected using YOLO v3
  • Approximate nearest neighbour search is used to find similar images
  • Alternative search and browse access to digitised collections

https://github.com/qurator-spk/sbb_images
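
The similarity search step can be sketched as ranking image feature vectors by cosine similarity. The sketch below uses exact brute-force search, which is what an approximate nearest neighbour index trades a little accuracy against for speed on large collections. The image names and three-dimensional vectors are hypothetical; real feature vectors from a classification model have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=2):
    """Return the names of the k vectors in `index` most similar to `query`."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical feature vectors for three extracted images.
index = {
    "portrait_a": [0.9, 0.1, 0.0],
    "portrait_b": [0.8, 0.2, 0.1],
    "map_a": [0.0, 0.1, 0.9],
}
neighbours = most_similar([1.0, 0.1, 0.0], index, k=2)
```

Brute force compares the query against every vector, so its cost grows linearly with the collection size; with ~600,000 images, an approximate index keeps interactive search responsive.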

7 of 13

OCR / Text Recognition

  • OCR for historical documents is hard (old fonts, complex layouts, defects and damages, historical spelling)
  • Thanks to deep learning OCR (Calamari) and public ground truth (GT) datasets (GT4HistOCR), nearly error-free OCR is now possible!
  • A single (language independent) OCR model can be used both for Fraktur and Antiqua (also mixed)
  • Initial evaluations show the character error rate (CER) dropping from ~20% to ~2%

https://github.com/qurator-spk/ocrd_calamari
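
For readers unfamiliar with the metric, the character error rate is commonly computed as the edit (Levenshtein) distance between OCR output and ground truth, divided by the length of the ground truth. The following is a minimal sketch under that assumption; the slides do not state which evaluation tool or normalisation was used, and the sample strings are invented.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of `hyp` measured against ground truth `ref`."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One substituted character in a 16-character word.
example_cer = cer("Staatsbibliothek", "Staatsbibliotbek")

# A drop from ~20% to ~2% CER corresponds to a relative error reduction of ~90%.
relative_reduction = (0.20 - 0.02) / 0.20
```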

8 of 13

OCR Postcorrection

  • Even with highly accurate OCR, there remain a few recognition errors
  • Idea: train a machine translation model to “translate” OCR errors to correct words
  • Challenges:
    • retain historical spelling variants
    • avoid introducing new errors
  • Two-step model (seq2seq LSTM):
    • First, detect the parts of text with errors (this helps artificially increase the error density in the input for step two)
    • Translate (i.e. correct) errors in the OCR text
  • Relative OCR accuracy improvement: 18%

https://github.com/qurator-spk/sbb_ocr_postcorrection

9 of 13

Named Entity Recognition

  • Named Entity Recognition (NER) is used to identify proper names of persons, locations, organizations in unstructured text (here: OCR results)
  • Unsupervised Pre-Training of BERT model on the digitized historical documents
  • Supervised Training of BERT model for NER with labeled data for German NER
  • Results are state of the art with an F1 score of 85.6%

https://github.com/qurator-spk/sbb_ner
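
As a reminder of what the F1 score measures, here is a minimal sketch that computes it as the harmonic mean of precision and recall over exactly matching entity spans. The slides do not say whether the reported figure is entity-level or token-level, so exact-match entity-level scoring is an assumption here, and the spans are invented.

```python
def f1_score(gold, predicted):
    """Entity-level F1: an entity counts only if span and type both match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical entities as (start_token, end_token, type).
gold = [(0, 1, "PER"), (4, 5, "LOC"), (7, 9, "ORG")]
predicted = [(0, 1, "PER"), (4, 5, "ORG"), (7, 9, "ORG")]  # one wrong type
score = f1_score(gold, predicted)
```

In this example two of three predictions are correct and two of three gold entities are found, so precision, recall and F1 all equal 2/3.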

10 of 13

Named Entity Disambiguation and Linking

  • Entities recognized by NER can be ambiguous
  • Example: “Paris is in France”: Paris the city or Paris (Hilton) the person?
  • Necessary to determine the correct entity by context
  • Establishing a knowledge base for comparison based on Wikidata/Wikipedia (harvesting of all articles for the corresponding categories)
  • Training of a “context-comparison” BERT embeddings model that decides for a given entity in the OCR text whether it is similar to a Wikipedia lemma
  • Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms

https://github.com/qurator-spk/sbb_ned

11 of 13

Data Annotation

  • neat (named entity annotation tool) for data annotation
  • Simple, browser-based JavaScript tool (no installation or admin rights required)
  • TSV (tab-separated values) as internal working format
  • Embed image snippets via IIIF Image API to support annotation
  • neat can also be used for OCR correction or transcription (e.g. to create GT)

https://github.com/qurator-spk/neat
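
The combination of a TSV working format and IIIF image snippets can be sketched as follows: read token rows with bounding boxes and build a IIIF Image API URL for each region. The URL pattern {identifier}/{region}/{size}/{rotation}/{quality}.{format} follows the IIIF Image API; the column names, base URL and identifier are hypothetical simplifications and not neat's actual schema.

```python
import csv
import io
from urllib.parse import quote


def iiif_region_url(base_url, identifier, left, right, top, bottom, size="max"):
    """Build a IIIF Image API URL for a rectangular region given in pixels.

    `size="max"` is valid in Image API 2.1 and 3.0; older 2.0 servers
    expect "full" instead.
    """
    region = f"{left},{top},{right - left},{bottom - top}"  # x,y,w,h
    return (f"{base_url.rstrip('/')}/{quote(identifier, safe='')}"
            f"/{region}/{size}/0/default.jpg")


def snippet_urls(tsv_text, base_url, identifier):
    """Return (token, snippet URL) pairs for every row of a TSV annotation file."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [
        (row["TOKEN"],
         iiif_region_url(base_url, identifier,
                         int(row["left"]), int(row["right"]),
                         int(row["top"]), int(row["bottom"])))
        for row in reader
    ]


# Hypothetical, simplified annotation file with one token and its bounding box.
SAMPLE_TSV = "TOKEN\tNE-TAG\tleft\tright\ttop\tbottom\nBerlin\tB-LOC\t100\t260\t50\t90\n"
urls = snippet_urls(SAMPLE_TSV, "https://example.org/iiif", "page-0001")
```

Because the snippet is addressed purely by URL, a browser-based tool can show the image region next to each token without storing or cropping images itself.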

12 of 13

Future Work

  • Follow-up project “Mensch - Maschine - Kultur” (2022 - 2024) with four sub-projects:

    • Multi-modal methods for document layout analysis and OCR for (historic) Asian languages (CrossAsia)
    • Image extraction, classification and analysis
    • Metadata enrichment and AI-supported cataloguing
    • Curation and ethical issues in cultural AI datasets

13 of 13

Thank you for your attention!

Questions?

Clemens Neudecker, Mike Gerber, Kai Labusch, Felix Ostrowski, Vahid Rezanezhad, Robin Schaefer

ai4lam | FF21 | 15 March 2022