1 of 10

Clemens Neudecker

Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Harmonizing Workflows in HTR/OCR Publication Pipelines of Textual Heritage

February 15, 2023, Berlin


Background and History

  • Many digitisation projects were funded by the DFG over the last decades
  • Typically, only scans were produced and images published, but no full text
  • The reason was that the quality of automated OCR for historical prints was still considered insufficient, or that OCR was simply too costly
  • DFG Workshop in March 2014 led to the launch of a coordinated initiative to develop OCR technologies for historical prints and to prepare the OCR processing of the VD projects
  • The VD projects (VD16, VD17, VD18) contain all prints from German-speaking countries from 1501-1800 and are the main target for OCR-D (according to the DFG)


Project Phases

  • Funded under the DFG programme “e-Research Technologies”, OCR-D was structured in 3 phases, with changing scopes and consortium partners:
  • Phase I (2014-2016) established a coordination project (HAB, BBAW, BSB → SBB), identified requirements, and prepared a call for proposals to develop solutions
  • Phase II (2017-2020) continued the coordination consortium (HAB, BBAW, SBB, +KIT), which formulated specifications and a reference implementation, and added another 8 module projects, each concerned with developing technical tools for a particular step in the OCR workflow
  • Phase III (2020-2023) continued the coordination consortium (HAB, BBAW, SBB, KIT → SUB-UGOE, GWDG), as well as 3 module projects and 4 new implementation projects that are piloting OCR-D in different real-life production scenarios
  • See also https://ocr-d.de/en/about


Principles

  • First and foremost, OCR-D is designed to be fully open, transparent and participative
  • All tools and technologies produced by OCR-D are open source (Apache Software License 2.0 to allow all reuse including commercial products and environments)
  • The guiding specifications and requirements as well as all software tools are fully open on GitHub so that everyone, also external to the project partners, can access all information, engage in the discussions and contribute to the development
  • Strive for maximum flexibility, configurability and interchangeability of components, so that the most suitable combination of processing steps and parameters can be tailored for a given input
  • We try to make all documentation available in German and English
  • See also https://ocr-d.de/


Components (1/3)

  • OCR-D is being developed based on a set of guiding technical specifications
    • METS was established as the main data exchange format: all digitisation projects funded by the DFG so far already have to provide METS according to common and established standards, as this is mandatory for the DFG-Viewer (see the DFG Digitisation Guidelines)
    • PAGE-XML was chosen as the main format for OCR and layout processing, since it can hold richer information than e.g. ALTO (but ALTO has been catching up in recent years)
    • Tools in OCR-D are individual command-line software modules that adhere to a common interface and are documented and described in a JSON file, preferably shipped as a Docker container
    • Python (>3.6) is the main programming language and the target system is Ubuntu Linux LTS (20.04)
    • Ground Truth data should follow a detailed (multi-level) set of guidelines and use PAGE-XML
    • A Web API, as a REST-based way to call and interact with tools, is currently work in progress
  • See also https://ocr-d.de/en/dev
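To illustrate the JSON tool description mentioned above, here is a minimal sketch in Python. The field names follow the convention of OCR-D's ocrd-tool.json files, but this stripped-down processor (name, parameters, defaults) is entirely hypothetical:

```python
import json

# Minimal sketch of an OCR-D-style tool description; the processor
# name and its parameter are invented for illustration only.
tool_description = {
    "executable": "ocrd-example-binarize",   # hypothetical processor name
    "description": "Binarize page images",
    "categories": ["Image preprocessing"],
    "steps": ["preprocessing/optimization/binarization"],
    "parameters": {
        "threshold": {
            "type": "number",
            "description": "Global binarization threshold",
            "default": 0.5,
        }
    },
}

# Serialize as it would appear in a JSON tool description file
print(json.dumps(tool_description, indent=2))
```

Because every tool ships such a machine-readable description, workflow engines can discover a processor's parameters and validate configurations without inspecting its code.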


Components (2/3)


Components (3/3)

  • OCR-D software modules can contain multiple Processors that can be arranged into workflows
  • A Processor in OCR-D performs one particular task in a workflow
  • There are numerous Processors available, including common open source OCR tools:
    • Image enhancement (binarization, cropping, denoising, deskewing, dewarping)
    • Font classification
    • Layout analysis (region and line segmentation, reading order detection)
    • Text recognition (Tesseract, Kraken, Calamari, OCRopus)
    • Post-correction (automated)
    • Evaluation (Text and Layout)
    • Format conversion (PAGE-XML, ALTO, hOCR, TEI, TSV)
    • Long-term preservation and archiving
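Conceptually, a workflow is just a sequence of Processors, each consuming the output of the previous step. The following Python sketch shows that idea with invented step names; it is not the actual OCR-D API, only a model of how processors compose:

```python
# Sketch of Processors chained into a workflow: each step consumes the
# previous step's result and produces a new one. All names here are
# illustrative, not the real OCR-D Python interface.

def binarize(pages):
    return [p + ":binarized" for p in pages]

def segment(pages):
    return [p + ":segmented" for p in pages]

def recognize(pages):
    return [p + ":recognized" for p in pages]

def run_workflow(pages, steps):
    """Apply each processing step in order, like an OCR-D workflow."""
    for step in steps:
        pages = step(pages)
    return pages

result = run_workflow(["page_0001"], [binarize, segment, recognize])
print(result)  # each page carries its full processing history
```

Because every Processor shares the same interface, steps can be reordered, swapped (e.g. a different text recognizer) or parameterized without changing the rest of the workflow.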


Ground Truth

  • All Ground Truth in OCR-D follows the GT Guidelines
  • Aletheia and neat are the preferred tools for annotation and transcription
  • Ground Truth uses the PAGE-XML format for text and layout description
  • There are three levels of detail in the transcription, from simple to fully featured
  • Transcriptions can use the full range of Unicode, including MUFI and other PUA code points
  • OCR-D maintains a list of encoding conventions for special characters
  • Ground Truth is published in a specified OCRD-ZIP format according to a BagIt profile
  • Ground Truth is semantically labelled according to a schema from PRImA
  • A Ground Truth repository gives access and provides search functionality
  • See also https://ocr-d.de/en/data
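As a rough illustration of what such Ground Truth looks like, the following Python sketch builds a tiny PAGE-XML fragment. The namespace and element names follow the PAGE schema; the image filename, dimensions, IDs and transcription text are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Minimal sketch of a PAGE-XML fragment as used for Ground Truth;
# element names follow the PAGE schema, the content is invented.
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ET.register_namespace("", NS)

pcgts = ET.Element(f"{{{NS}}}PcGts")
page = ET.SubElement(pcgts, f"{{{NS}}}Page",
                     imageFilename="page_0001.tif",
                     imageWidth="2000", imageHeight="3000")
region = ET.SubElement(page, f"{{{NS}}}TextRegion", id="r1")
line = ET.SubElement(region, f"{{{NS}}}TextLine", id="r1_l1")
equiv = ET.SubElement(line, f"{{{NS}}}TextEquiv")
unicode_el = ET.SubElement(equiv, f"{{{NS}}}Unicode")
unicode_el.text = "Example transcription line"

xml_str = ET.tostring(pcgts, encoding="unicode")
print(xml_str)
```

The nesting (Page → TextRegion → TextLine → TextEquiv/Unicode) is what lets PAGE-XML carry both layout structure and transcriptions in one file.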


How to get involved

  • OCR-D offers a multitude of options for participation
    • Gitter/Matrix Chat: Participate in discussions and ask your own questions
    • GitHub: Follow the progress of development, raise issues or contribute code
    • Wiki: Read documentation and tutorials or add your own findings and experiments
    • OCR-D TechCall: Take part in technical discussions about OCR-D tools and issues; every second Wednesday, 2-3 pm (Berlin Time)
    • OCR-D GT-Call: Take part in discussions about Ground Truth and other data; every second Thursday, 1-2 pm (Berlin Time)
    • OCR-D Forum: Bring your questions and ideas and join discussions on using OCR(-D); every first Friday of the month, 10-11 am (Berlin Time)
  • See also https://ocr-d.de/en/community
  • Publications and Presentations on OCR-D: https://ocr-d.de/en/publications
  • OCR-D Technology Watch: https://www.zotero.org/groups/418719/ocr-d


Thank you for your attention!

Questions?