1 of 10

Clemens Neudecker

Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Harmonizing Workflows in HTR/OCR Publication Pipelines of Textual Heritage

February 15, 2023, Berlin


Background and History

  • Many digitisation projects were funded by the DFG over the last decades
  • Typically, only scans were produced and images published, but no full text
  • The reason was that the quality of automated OCR for historical prints was still considered insufficient, or that OCR was simply too costly
  • DFG Workshop in March 2014 led to the launch of a coordinated initiative to develop OCR technologies for historical prints and to prepare the OCR processing of the VD projects
  • The VD projects (VD16, VD17, VD18) contain all prints from German-speaking countries from 1501-1800 and are the main target for OCR-D (according to the DFG)


Project Phases

  • Funded under the DFG programme “e-Research Technologies”, OCR-D was structured in 3 phases, with changing scopes and consortium partners:
  • Phase I (2014-2016) established a coordination project (HAB, BBAW, BSB → SBB), identified requirements, and prepared a call for proposals to develop solutions
  • Phase II (2017-2020) continued the coordination consortium (HAB, BBAW, SBB, +KIT), which formulated specifications and a reference implementation, and added another 8 module projects, each concerned with developing technical tools for a particular step in the OCR workflow
  • Phase III (2020-2023) continued the coordination consortium (HAB, BBAW, SBB, KIT → SUB-UGOE, GWDG), as well as 3 module projects and 4 new implementation projects that are piloting OCR-D in different real-life production scenarios
  • See also https://ocr-d.de/en/about


Principles

  • First and foremost, OCR-D is designed to be fully open, transparent and participative
  • All tools and technologies produced by OCR-D are open source (Apache Software License 2.0 to allow all reuse including commercial products and environments)
  • The guiding specifications and requirements as well as all software tools are fully open on GitHub so that everyone, also external to the project partners, can access all information, engage in the discussions and contribute to the development
  • Strive for maximum flexibility, configurability and interchangeability of components, so that the most suitable combination of processing steps and parameters can be tailored for a given input
  • We try to make all documentation available in German and English
  • See also https://ocr-d.de/


Components (1/3)

  • OCR-D is being developed based on a set of guiding technical specifications
    • METS was established as the main data exchange format: all digitisation projects funded by the DFG so far already have to provide METS according to common and established standards, as this is mandatory for the DFG-Viewer (see the DFG Digitisation Guidelines)
    • PAGE-XML was chosen as the main format for OCR and layout processing, since it can hold richer information than e.g. ALTO (but ALTO has been catching up in recent years)
    • Tools in OCR-D are individual command-line software modules that adhere to a common interface and are documented and described in a JSON file, preferably shipped as a Docker container
    • Python (>3.6) is the main programming language and the target system is Ubuntu Linux LTS (20.04)
    • Ground Truth data should follow a detailed (multi-level) set of guidelines and use PAGE-XML
    • A Web API, as a REST-based way to call and interact with tools, is currently work in progress
  • See also https://ocr-d.de/en/dev
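To illustrate the JSON tool description mentioned above, here is a minimal sketch in Python. The field names follow the convention of OCR-D's ocrd-tool.json files, but this stripped-down processor (name, parameters, defaults) is entirely hypothetical:

```python
import json

# Minimal sketch of an OCR-D-style tool description; the processor
# name and its parameter are invented for illustration only.
tool_description = {
    "executable": "ocrd-example-binarize",   # hypothetical processor name
    "description": "Binarize page images",
    "categories": ["Image preprocessing"],
    "steps": ["preprocessing/optimization/binarization"],
    "parameters": {
        "threshold": {
            "type": "number",
            "description": "Global binarization threshold",
            "default": 0.5,
        }
    },
}

# Serialize as it would appear in a JSON tool description file
print(json.dumps(tool_description, indent=2))
```

Because every tool ships such a machine-readable description, workflow engines can discover a processor's parameters and validate configurations without inspecting its code.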


Components (2/3)


Components (3/3)

  • OCR-D software modules can contain multiple Processors that can be arranged into workflows
  • A Processor in OCR-D performs one particular task in a workflow
  • There are numerous Processors available, including common open source OCR tools:
    • Image enhancement (binarization, cropping, denoising, deskewing, dewarping)
    • Font classification
    • Layout analysis (region and line segmentation, reading order detection)
    • Text recognition (Tesseract, Kraken, Calamari, OCRopus)
    • Post-correction (automated)
    • Evaluation (Text and Layout)
    • Format conversion (PAGE-XML, ALTO, hOCR, TEI, TSV)
    • Long-term preservation and archiving
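Conceptually, a workflow is just a sequence of Processors, each consuming the output of the previous step. The following Python sketch shows that idea with invented step names; it is not the actual OCR-D API, only a model of how processors compose:

```python
# Sketch of Processors chained into a workflow: each step consumes the
# previous step's result and produces a new one. All names here are
# illustrative, not the real OCR-D Python interface.

def binarize(pages):
    return [p + ":binarized" for p in pages]

def segment(pages):
    return [p + ":segmented" for p in pages]

def recognize(pages):
    return [p + ":recognized" for p in pages]

def run_workflow(pages, steps):
    """Apply each processing step in order, like an OCR-D workflow."""
    for step in steps:
        pages = step(pages)
    return pages

result = run_workflow(["page_0001"], [binarize, segment, recognize])
print(result)  # each page carries its full processing history
```

Because every Processor shares the same interface, steps can be reordered, swapped (e.g. a different text recognizer) or parameterized without changing the rest of the workflow.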


Ground Truth

  • All Ground Truth in OCR-D follows the GT Guidelines
  • Aletheia and neat are the preferred tools for annotation and transcription
  • Ground Truth uses the PAGE-XML format for text and layout description
  • There are three levels of detail in the transcription, from simple to fully featured
  • Transcriptions can use the full range of Unicode, including MUFI and other PUA code points
  • OCR-D maintains a list of encoding conventions for special characters
  • Ground Truth is published in a specified OCRD-ZIP format according to a BagIt profile
  • Ground Truth is semantically labelled according to a schema from PRImA
  • A Ground Truth repository gives access and provides search functionality
  • See also https://ocr-d.de/en/data
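As a rough illustration of what such Ground Truth looks like, the following Python sketch builds a tiny PAGE-XML fragment. The namespace and element names follow the PAGE schema; the image filename, dimensions, IDs and transcription text are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Minimal sketch of a PAGE-XML fragment as used for Ground Truth;
# element names follow the PAGE schema, the content is invented.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ET.register_namespace("", NS)

pcgts = ET.Element(f"{{{NS}}}PcGts")
page = ET.SubElement(pcgts, f"{{{NS}}}Page",
                     imageFilename="page_0001.tif",
                     imageWidth="2000", imageHeight="3000")
region = ET.SubElement(page, f"{{{NS}}}TextRegion", id="r1")
line = ET.SubElement(region, f"{{{NS}}}TextLine", id="r1_l1")
equiv = ET.SubElement(line, f"{{{NS}}}TextEquiv")
unicode_el = ET.SubElement(equiv, f"{{{NS}}}Unicode")
unicode_el.text = "Example transcription line"

xml_str = ET.tostring(pcgts, encoding="unicode")
print(xml_str)
```

The nesting (Page → TextRegion → TextLine → TextEquiv/Unicode) is what lets PAGE-XML carry both layout structure and transcriptions in one file.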


How to get involved

  • OCR-D offers a multitude of options for participation
    • Gitter/Matrix Chat: Participate in discussions and ask your own questions
    • GitHub: Follow the progress of development, raise issues or contribute code
    • Wiki: Read documentation and tutorials or add your own findings and experiments
    • OCR-D TechCall: Take part in technical discussions about OCR-D tools and issues; every second Wednesday, 2-3 pm (Berlin Time)
    • OCR-D GT-Call: Take part in discussions about Ground Truth and other data; every second Thursday, 1-2 pm (Berlin Time)
    • OCR-D Forum: Bring your questions and ideas and join discussions on using OCR(-D); every first Friday of the month, 10-11 am (Berlin Time)
  • See also https://ocr-d.de/en/community
  • Publications and Presentations on OCR-D: https://ocr-d.de/en/publications
  • OCR-D Technology Watch: https://www.zotero.org/groups/418719/ocr-d


Thank you for your attention!

Questions?