Digitisation and Computational Analysis of Newspapers: A Roadmap
Clemens Neudecker (@cneudecker)
Digital Methods in History and Economics
14 October 2021
In this talk I will look back at the last decade of newspaper digitisation and (computational) analysis following the launch of Europeana Newspapers, and summarize the main developments and achievements since then.
Thanks to advances in machine learning, the technical means for newspaper digitisation have improved significantly. Combined with the continued provision of large quantities of historical newspapers by libraries, and the keen interest of the research community in digital newspapers, this has opened up new opportunities, but also new challenges.
Against that background, this talk aims to formulate some key elements and next steps of a roadmap for the digitisation and computational analysis of newspapers.
The Past
© Elias Siebert: State of the Archive
⏑ — —— —1 für das Viertellarr 6—⏑
—Urhalt D— — —— — — — —
— —— — e—— — — — —W—elnWedchole ich noch
— — — — — —ige Unterſtuzung beim A—cch haln
FWBeſtellm Dele—al Tiageozahlung Siſtoriſche Lotizen der Borzein uee.
Tuſſche Krifß 7 —Defer brit ſollen in der Folge die intereſſet m
Den Begebenheiten, die ſich — er Begſrate
Der Churfürſ——bcchaupt uDann D—
ſchiedene Ditaſteri“ ffälziſche Laudrecht
—— —Toch jett biee—— — n h 2 —
Sthalt——7 — —— ⏑ — — — 1⏑4—
—— — — — — — — —ice ⏑ —
— 7 7— —— —— D — 2 Je ——— Vergnu⸗
Fer Dieſe En Namen Lectiia geführt, welch —D——
Von der alten—7 nigene 5 A — ꝓ— T—— — — E —C
F —eL wringen lein. E —uCſo ndegrindet tneückiche . “ — r—
Achdem ſein. Sohn ſich R n emderf igen Nnche ſer
—— — n h 2 — einheim iſt ſehr alt ud Eccheint ſchon in!
78 — — 8 Jahrhn dere Daß die Nömer hict cii
NCTuñg Chabt, iſt wohl Ebezweifeinn 2
Fer Dieſe En Namen Lectiia geführt, welch —D——
— n—StcuF — leLi ringen ſein. E—
ſ Adegrndet U di —— 22 — Wüctiche —
chdem ſein Sohn f— ————— — u —— — —
— —Weinhem n Eigenttu — — e —c
— —— —m 3ac— — —— 7 uch Aiſer Heinric——
— —— Uunätte auzulegen, me ————— — 4 ⏑ ñ—
— ſfalzgraf Ruprecht Ill. erlaubte 140
With deep learning, nearly error-free OCR results for historical newspapers are possible using ocrd_calamari with a model trained on the GT4HistOCR dataset!
Alas, state-of-the-art OCR engines require input that is already segmented into regions and text lines…
The Present
Oceanic Exchanges: https://oceanicexchanges.org/
impresso: https://impresso-project.ch/
NewsEye: https://www.newseye.eu/
More than a Feeling: https://media-sentiment.uni-leipzig.de/
SoNAR (IDH): https://sonar.fh-potsdam.de/
Siamese: http://lab.kb.nl/tool/siamese
CHRONReader: https://lab.kb.nl/tool/chronreader
Newspaper Navigator: https://news-navigator.labs.loc.gov/
Living with Machines: https://livingwithmachines.ac.uk/
Dagstuhl
The Future?
I. Newspapers as Data
Inspirational US project: “Collections as Data” (https://collectionsasdata.github.io/)
Principles:
Various data models and standards are currently in use for digital newspaper content, e.g. METS/ALTO, PAGE-XML, TEI, IIIF Manifest… Will we need all of them, or possibly even more?
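To make one of these formats concrete, here is a minimal sketch of reading the recognised text out of an ALTO file with the Python standard library. The XML fragment is hand-written and hypothetical: it omits coordinates and other attributes and has not been validated against the ALTO schema. The function name `alto_text_lines` is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written, hypothetical ALTO fragment for illustration only.
# Real files carry coordinates (HPOS, VPOS, WIDTH, HEIGHT) and more metadata.
ALTO_SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Historische"/>
            <SP/>
            <String CONTENT="Notizen"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="der"/>
            <SP/>
            <String CONTENT="Vorzeit"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def alto_text_lines(alto_xml: str) -> List[str]:
    """Return one string per ALTO TextLine, in document order."""
    root = ET.fromstring(alto_xml.strip())
    # Take the namespace from the root tag, so the namespace URI of a
    # particular ALTO version does not have to be hard-coded.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = []
    for text_line in root.iter(prefix + "TextLine"):
        # Each word is a String element; its text is in the CONTENT attribute.
        words = [s.get("CONTENT", "") for s in text_line.findall(prefix + "String")]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_text_lines(ALTO_SAMPLE):
        print(line)
```

PAGE-XML, TEI and IIIF each need their own reader of this kind, which is part of the practical cost behind the question above.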
II. Multimodal Models for Layout Analysis and OCR
To further develop and improve document image analysis and recognition for newspapers, we require robust machine learning tools and trained models that take the multimodal dimensions of newspaper content into account by combining
III. Open and Comprehensive Ground Truth Datasets
To improve and assess the quality of digitised newspaper data, we want datasets of historic newspaper images with layout GT that are
IV. Collaborative Annotation and Curation Environments
Crowdsourcing and open web-based tools for collaborative data curation with users or focus groups can support many of these tasks.
V. Evaluation Framework for OCR and Layout Analysis
Common evaluation methods and metrics for OCR do not capture the quality of the layout analysis (and may even lead to wrong results or misinterpretations).
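A small sketch of why this matters, using a made-up two-column page. It assumes that every character is recognised correctly but that layout analysis reads straight across both columns. The character error rate (CER) then reports errors although no character is wrong, while an order-independent bag-of-words comparison reports a perfect match and hides the layout problem. The example text and the function names are assumptions made for this illustration, not an established evaluation tool.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def bag_of_words_match(reference: str, hypothesis: str) -> bool:
    """True if both texts contain the same words with the same counts."""
    return Counter(reference.split()) == Counter(hypothesis.split())


# Made-up two-column page: the correct reading order is left column, then right.
left = ["the council met", "on monday"]
right = ["grain prices rose", "in the north"]
reference = " ".join(left + right)

# Hypothetical layout failure: lines are read across both columns.
hypothesis = " ".join(f"{l} {r}" for l, r in zip(left, right))

if __name__ == "__main__":
    print("CER:", round(cer(reference, hypothesis), 3))
    print("Bag of words identical:", bag_of_words_match(reference, hypothesis))
```

Neither number on its own tells us that text recognition was fine and segmentation was not, which is the gap a dedicated evaluation framework would need to close.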
Thank you for your attention! Questions?