1 of 30

Digitisation and Computational Analysis of Newspapers: A Roadmap

Clemens Neudecker (@cneudecker)

Digital Methods in History and Economics

14 October 2021

2 of 30

In this talk I will look back at the last decade of newspaper digitisation and (computational) analysis following the launch of Europeana Newspapers, and summarize the main developments and achievements since then.

Thanks to advances in machine learning, the technical means for newspaper digitisation have significantly improved. In combination with the continued provision of large quantities of historical newspapers by libraries, and the keen interest of the research community in digital newspapers, new opportunities but also challenges have opened up.

Against that background, this talk aims to formulate some key elements and next steps of a roadmap for the digitisation and computational analysis of newspapers.

3 of 30

The Past

4 of 30

© Elias Siebert: State of the Archive

5 of 30

6 of 30

7 of 30

8 of 30

[Facsimile: raw OCR transcript of a historical German newspaper page in Fraktur, riddled with recognition errors, e.g. “Der Churfürſ——bcchaupt uDann D—”, “ſchiedene Ditaſteri“ ffälziſche Laudrecht”, “ſfalzgraf Ruprecht Ill. erlaubte 140”]

9 of 30

With deep learning, nearly error-free OCR results for historical newspapers are possible using ocrd_calamari with a model trained on the GT4HistOCR dataset!

Alas, state-of-the-art OCR engines require pre-segmented regions and text lines…

10 of 30

11 of 30

12 of 30

The Present

13 of 30

Oceanic Exchanges
https://oceanicexchanges.org/

  • Transatlantic Challenge project with partners from USA, EU, UK, MX
  • Goal: Tracing Global Information Networks In Historical Newspaper Repositories, 1840-1914
  • Compilation of a huge cross-national newspaper corpus (~200M pages)
  • Detection of reprinted sections
  • Challenge: bad OCR quality and inconsistent metadata
  • Mapping of metadata: The Atlas of Digitised Newspapers (https://www.digitisednewspapers.net)
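
The reprint-detection step above can be illustrated with a toy n-gram "shingling" approach. This is a simplified sketch of the general technique, not the actual Oceanic Exchanges pipeline, and the sample texts are invented:

```python
# Sketch: detecting reprinted passages between two newspaper texts via
# word n-gram "shingles" and Jaccard overlap.

def shingles(text, n=5):
    """Return the set of overlapping word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def reprint_score(text_a, text_b, n=5):
    """Jaccard similarity of the two texts' n-gram sets (0.0 to 1.0)."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "the great fire destroyed half the city on tuesday night leaving hundreds homeless"
reprint = "the great fire destroyed half the city on tuesday night leaving many homeless"
unrelated = "wheat prices rose sharply at the exchange this morning amid heavy trading"

print(reprint_score(original, reprint))    # high overlap
print(reprint_score(original, unrelated))  # no shared 5-grams
```

At corpus scale (~200M pages), the same idea is applied with hashing and indexing rather than pairwise comparison, and bad OCR makes exact n-gram matching much harder — hence the challenge bullet above.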

14 of 30

impresso
https://impresso-project.ch/

  • SNF project incl. CH, LU, LI
  • Compilation of a multilingual news corpus
  • Content enrichment (named entities, topic modeling)
  • Design and implementation of an advanced GUI/app incl. various data visualizations

15 of 30

NewsEye
https://www.newseye.eu/

  • EU H2020 project with multilingual newspaper corpus
  • Improvement of text recognition and article separation
  • Semantic text enrichment (named entities, topics, etc.)
  • Personal Research Assistant

16 of 30

More than a Feeling
https://media-sentiment.uni-leipzig.de/

  • DFG Priority Program "SPP 1859 – Experience and Expectation"
  • Extraction and analysis of stock market data from historical newspapers
  • Creation of a daily sentiment index which can be used to measure investors’ expectations
  • Table detection and recognition
  • Text Mining for Sentiment
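
A lexicon-based daily sentiment index of the kind described above can be sketched as follows. The word lists are illustrative stand-ins, not the project's actual lexicon:

```python
# Sketch: a daily sentiment index over newspaper articles, computed as
# (positive - negative) / (positive + negative) word counts, in [-1, 1].
# The lexicons below are invented for illustration.

POSITIVE = {"gain", "rise", "profit", "confidence", "recovery"}
NEGATIVE = {"loss", "fall", "panic", "crisis", "decline"}

def daily_sentiment(articles):
    """Aggregate sentiment over all articles of one day."""
    pos = neg = 0
    for text in articles:
        for word in text.lower().split():
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    total = pos + neg
    return (pos - neg) / total if total else 0.0

day = ["Shares continue to rise as confidence returns",
       "A modest gain despite the recent crisis"]
print(daily_sentiment(day))  # (3 - 1) / 4 = 0.5
```

Plotted over time, such an index can serve as a proxy for investors' expectations, as in the project's stock-market use case.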

17 of 30

SoNAR (IDH)
https://sonar.fh-potsdam.de/

  • DFG “e-Research technologies” project
  • Extraction of named entities from digitised newspapers (OCR)
  • Linking of entities with knowledge base (Wikidata, GND)
  • Visualization and Exploration of Historical Social Network Graph
  • Use cases from digital history
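
The network-graph idea can be sketched by counting how often named entities co-occur in the same article. The entity lists here are invented; a real pipeline would take NER output and feed a proper graph library:

```python
# Sketch: building a historical social network from entity co-occurrence.
# Each edge weight counts the articles in which two entities appear together.

from collections import Counter
from itertools import combinations

def cooccurrence_graph(articles_entities):
    """Map each (entity, entity) pair to its co-occurrence count."""
    edges = Counter()
    for entities in articles_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

docs = [
    ["Bismarck", "Wilhelm I", "Berlin"],   # entities extracted per article
    ["Bismarck", "Berlin"],
    ["Wilhelm I", "Berlin"],
]
graph = cooccurrence_graph(docs)
print(graph[("Berlin", "Bismarck")])  # co-occur in two articles
```

Linking the entity strings to Wikidata or GND identifiers, as SoNAR does, disambiguates nodes before the graph is built.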

18 of 30

Siamese
http://lab.kb.nl/tool/siamese

  • KB Lab project
  • Extraction of visual content (ads) from newspapers
  • Clustering by similarity
  • Content-based information retrieval

CHRONReader
https://lab.kb.nl/tool/chronreader

  • KB Lab project
  • Find images in newspapers based on captions, time period or categories
  • Content-based information retrieval

19 of 30

Newspaper Navigator
https://news-navigator.labs.loc.gov/

  • Library of Congress Labs project by LoC innovator in residence Ben Lee
  • Extraction of visual content for 16,358,041 newspaper pages
  • Publication of source code and dataset for machine learning
  • Visual search interface

20 of 30

Living with Machines
https://livingwithmachines.ac.uk/

  • Major UK machine learning project
  • “rethinking the impact of technology on the lives of ordinary people during the Industrial Revolution”
  • Newspapers as source content
  • (Meta)Data Visualization (PressPicker)
  • Jupyter notebooks for working with geospatial information
  • Crowdsourcing

21 of 30

22 of 30

Dagstuhl

23 of 30

The Future?

24 of 30

Newspapers as Data

Inspirational US project: “Collections as Data” (https://collectionsasdata.github.io/)

Principles:

  • encourage computational use of digitized and born digital collections
  • guided by ongoing ethical commitments
  • lower barriers to use
  • collections designed for everyone serve no one
  • helps others find a path to doing the work
  • openly accessible by default
  • value interoperability
  • work transparently in order to develop trustworthy, long‑lived collections
  • describe the data considered in scope
  • an ongoing process that does not necessarily conclude with a final version

25 of 30

I. Data Formats, Interoperability and Standards

Various data models and standards are currently in use for digital newspaper content, e.g. METS/ALTO, PAGE-XML, TEI, IIIF Manifests… Will we need all of them, or possibly even more?

  • How to describe the materiality (e.g. size) of newspapers in digital form?
  • How to encode “deep” content structures, e.g. opinion vs. information?
  • Metadata should always include the provenance of processing (which OCR engine, in what version and configuration) for transparency
  • Guidelines are needed that aid in the use and application of such standards, based on real-world examples
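
The provenance point can be made concrete with a minimal processing record. The field names and version string below are illustrative, not a standardised schema:

```python
# Sketch: a minimal provenance record for one OCR run, of the kind that
# should accompany digitised newspaper metadata. Standards such as ALTO
# and PAGE-XML have dedicated elements for this kind of information.

import json

provenance = {
    "step": "text-recognition",
    "tool": "ocrd_calamari",                  # engine used
    "version": "1.0.5",                       # hypothetical version string
    "model": "GT4HistOCR",                    # model / training-data identifier
    "parameters": {"textequiv_level": "line"},  # illustrative configuration
    "timestamp": "2021-10-14T12:00:00Z",
}
print(json.dumps(provenance, indent=2))
```

With such a record attached, a researcher can judge whether two corpora were processed comparably before mining them together.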

26 of 30

II. Multimodal Models for Layout Analysis and OCR

To further develop and improve document image analysis and recognition for newspapers, we require robust machine learning tools and trained models that take the multimodal dimensions of newspaper content into account by combining

  • recent advances in AI and machine learning in
  • methods from computer vision with
  • natural language processing constrained by
  • heuristics and rule-based methods

27 of 30

III. Open and Comprehensive Ground Truth Datasets

To improve and assess the quality of digitised newspaper data, we want datasets of historic newspaper images with layout GT that are

  • sufficiently large, diverse and ideally representative of the wide variety of historical newspapers
  • annotated with granular, high-quality labels for layout classes, text content and semantic enrichments
  • openly available and free to use and reuse, with documentation and provenance
  • documented with regard to ethically problematic content and its implications

28 of 30

IV. Collaborative Annotation and Curation Environments

Crowdsourcing and open web-based tools for collaborative data curation with users or focus groups can support many of these tasks:

  • Correcting or transcribing the text
  • Correcting or marking the layout and structure
  • Annotating named entities (person names, locations, events etc.)
  • Tagging and annotating images or other graphical elements
  • Enriching geolocations with geo-coordinates
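
The geolocation task can be sketched with a tiny hard-coded gazetteer; real workflows would query a service such as GeoNames or Wikidata, and the place names and coordinates below are only examples:

```python
# Sketch: enriching location annotations with coordinates from a gazetteer.
# Unknown places are kept with coords=None so a curator can resolve them later.

GAZETTEER = {
    "Berlin": (52.52, 13.405),
    "Amsterdam": (52.37, 4.90),
}

def enrich(annotations):
    """Attach (lat, lon) to each location annotation when known."""
    return [{**a, "coords": GAZETTEER.get(a["name"])} for a in annotations]

locs = [{"name": "Berlin"}, {"name": "Weinheim"}]
print(enrich(locs))  # Berlin gets coordinates; Weinheim stays unresolved
```

Keeping the unresolved entries visible is exactly where a collaborative curation environment helps: users can disambiguate the places a gazetteer lookup cannot.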

29 of 30

V. Evaluation Framework for OCR and Layout Analysis

Common evaluation methods and metrics for OCR do not capture the quality of the layout analysis (and may even lead to wrong results or misinterpretations).

  • Especially for newspapers with their often complex, multi-column layouts, it is essential to also evaluate layout accuracy, not only OCR
  • Even different implementations of the same evaluation metric (e.g. character error rate, CER) can sometimes give surprisingly different results
  • For text and data mining purposes, evaluation of the reading order is paramount, but widely established methods and definitions are lacking
  • Many parameters must be considered (misses, merges, misclassifications, etc.); defining evaluation scenarios can help
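
The CER point can be made concrete with a minimal implementation: Levenshtein edit distance divided by the length of the ground truth. Even this sketch embodies one of several possible normalisation choices, which is one reason different implementations of "the same" metric disagree:

```python
# Sketch: character error rate (CER) = edit distance / ground-truth length.

def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth, ocr_output):
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)

print(cer("Churfürst", "Churfürſt"))  # one substitution in nine characters
```

Normalising by alignment length instead of ground-truth length, or folding Unicode variants (ſ vs. s) before comparison, already yields different numbers for the same OCR output — the kind of discrepancy an evaluation framework should pin down.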

30 of 30

Thank you for your attention!
Questions?

Clemens Neudecker (@cneudecker)

Digital Methods in History and Economics

14 October 2021