1 of 33

Bridging Literature and Information Science

2017/12/06, Digital Humanities Austria, University of Innsbruck

Phillip Ströbel, M.A.

@CLingophil

2 of 33

Bridges ...

3 of 33

Bridging Literature and Information Science

4 of 33

Why?

  • We love challenges!
  • Knowledge gain!
    • methods/techniques
    • social/cultural
  • Push the boundaries!

5 of 33

Agenda

  • Building bridges
  • Text+Berg
    • Processing
    • Applications
  • The impresso project
  • Outlook

6 of 33

Text+Berg project (www.textberg.ch)

  • Started 2008
    • And will finish in 2018 :-(
  • Digitisation of the Yearbooks of the Swiss Alpine Club
    • Scanning
    • OCR
    • Article segmentation
    • Tokenisation
    • Part-of-Speech tagging
      • incl. lemmatisation
    • Named entity recognition
    • Elliptic compound detection
    • Marking time expressions
    • Code-switching detection
    • Parsing
    • Sentence alignment
    • Word alignment
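As an illustration of one of the steps above, elliptic compounds such as "Berg- und Talfahrt" can be spotted with a simple pattern. This is only a minimal sketch, not the Text+Berg implementation: the regular expression, the function name `find_elliptic_compounds`, and the sample sentence are all assumptions made for the example.

```python
import re

# Assumed pattern (not from the project): a word ending in a hyphen,
# followed by a coordinating conjunction and a full word,
# e.g. "Berg- und Talfahrt".
ELLIPTIC = re.compile(r"\b(\w+)-\s+(und|oder)\s+(\w+)")

def find_elliptic_compounds(text):
    """Return (truncated part, conjunction, full word) triples found in text."""
    return [m.groups() for m in ELLIPTIC.finditer(text)]

# Invented sample sentence for illustration only.
sample = "Die Berg- und Talfahrt dauerte lange; Ski- oder Schneeschuhtouren sind beliebt."
print(find_elliptic_compounds(sample))
```

A real system would additionally have to reconstruct the full first compound ("Bergfahrt"), which needs compound splitting and is not attempted here.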

7 of 33

Some numbers

  • Current release spans 152 years (1864 - 2016)
  • > 100,000 pages
  • in a table:

lang   tokens (in millions)   types     types (lower)   lemma types   unk
de     22.81                  763,000   722,000         294,000       790,000
fr     14.68                  290,000   266,000         62,000        565,000
it     0.8                    62,000    57,000          22,000        61,000
rm     0.05                   13,000    12,000          -             -
en     0.03                   6,500     6,000           3,700         2,600
gsw    0.02                   4,700     4,500           -             -
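For readers unfamiliar with the column names, the following sketch shows how token, type, and lowercased type counts relate. It assumes a naive regex tokeniser and an invented sample sentence; the actual Text+Berg tokeniser and counting rules are not described on the slide and are likely different.

```python
import re

def corpus_counts(text):
    """Count tokens, distinct types, and distinct lowercased types."""
    # Naive tokenisation (assumption): every run of word characters is a token.
    tokens = re.findall(r"\w+", text)
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "types_lower": len({t.lower() for t in tokens}),
    }

# Invented sample text for illustration only.
sample = "Der Berg ruft. Der Gipfel und der Berg."
print(corpus_counts(sample))
```

In the sample, "Der" and "der" count as two types but as one lowercased type, which is why the "types (lower)" column is always at most as large as the "types" column.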

8 of 33

Processing pipeline - OLR & OCR

9 of 33

Processing pipeline - OCR → HTML

10 of 33

Processing steps - Tokenisation, PoS tagging, NER, Parsing

11 of 33

Code-Switching Detection

12 of 33

Applications - KOKOS

  • Kollaboratives Korrektursystem (collaborative correction system)
  • 21,247 pages with texts from 1864 - 1899 → 5.5 million tokens
  • 256,410 corrections within 7 months

13 of 33

GeoKOKOS

  • Same thing, but for named entities

14 of 33

Multilinguality (LINK)

15 of 33

How Language Shapes Geography

  • Thesis by Curdin Derungs (2014)

16 of 33

Discourse Semantics

  • Dissertation project by Patricia Scheurer
  • Tour reports as subcorpus
  • Examine which adjectives are used to describe mountains and how this has changed over the last 150 years
  • → find metaphors for mountains
    • e.g. “pyramid” as metaphor for mountain
    • 779 occurrences
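One way to approach the adjective question can be sketched as follows. This is an assumption-laden illustration, not the method of the dissertation project: it presumes text that is already lemmatised and PoS-tagged, uses hypothetical tag names ("ADJ", "NOUN"), only looks at adjectives directly preceding a target noun, and runs on invented toy data.

```python
from collections import Counter, defaultdict

def adjectives_by_decade(tagged_sentences, target_lemmas):
    """Count adjectives directly preceding a target noun, grouped by decade.

    tagged_sentences: iterable of (year, [(lemma, pos), ...]) pairs.
    """
    counts = defaultdict(Counter)
    for year, sentence in tagged_sentences:
        decade = year // 10 * 10
        # Walk over adjacent (word, next word) pairs in the sentence.
        for (adj, adj_pos), (noun, noun_pos) in zip(sentence, sentence[1:]):
            if adj_pos == "ADJ" and noun_pos == "NOUN" and noun in target_lemmas:
                counts[decade][adj] += 1
    return counts

# Invented toy data for illustration only.
toy = [
    (1871, [("der", "DET"), ("furchtbar", "ADJ"), ("Berg", "NOUN")]),
    (1875, [("ein", "DET"), ("furchtbar", "ADJ"), ("Gipfel", "NOUN")]),
    (1998, [("der", "DET"), ("schön", "ADJ"), ("Berg", "NOUN")]),
]
result = adjectives_by_decade(toy, {"Berg", "Gipfel"})
for decade in sorted(result):
    print(decade, result[decade].most_common())
```

Comparing the per-decade counters then gives a first, rough view of how descriptions shift over time.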

17 of 33

18 of 33

… well, not quite! Imagine you are a historian ...

  • … and that you have a research question!

19 of 33

The modern historian!

20 of 33

impresso

  • Media Monitoring of the Past: Mining 200 Years of Historical Newspapers
  • http://impresso-project.ch/

21 of 33

Consortium

  • 3 applicants (EPFL-DHLAB, C2DH, UZH-CL)
  • 8 associated partners (AEV, BnL, SNL, SWA, infoclio, UNIL, LeTemps, NZZ)
  • friends

22 of 33

Synergies

Designers & developers:

Mission: connect

Contribution: design and visualization expertise, interface development

Benefit: tangible products used by many people

Computational Linguists & Digital Humanists:

Mission: research

Contribution: research in NLP/DH, algorithms, tool implementation

Benefit: research

(Digital) Historians:

Mission: research

Contribution: research questions & methodology, needs, participation in co-design

Benefit: tools to support historical research

Journalists & Publishers

Mission: inform

Contribution: sources, newspaper expertise, journalist and user needs

Benefit: enriched sources, open tools

Archives & Libraries

Mission: preservation, valorisation

Contribution: sources, user needs

Benefit: enriched sources, open tools, support for prototype deployment

23 of 33

Objective 1: Historical media monitoring tool suite

How to adapt NLP tools to historical texts?

  • Development of multilingual and time-specific NLP components

OCR post-correction, lexical processing, named entity processing, topic modeling

  • Systematic performance assessment

summative and formative evaluation, shared task organized within NLP community

  • Building of a fully traceable and interoperable historical semantic knowledge base

semantically indexed, structured and linked data

24 of 33

Objective 2: Visualization interface and visual analytics

How to explore complex and vast amounts of data?

  • Visualization interface beyond keyword-based search, accommodating text analysis research tools and allowing users to use the system in a reflexive way
  • Principle of co-design: designers, historians and computational linguists will work in close collaboration

25 of 33

Objective 3: Digital history

Investigating the impact of new tooling on historical research and scholarship

  • Methodological and epistemological questions

source criticism & digital scholarship (how to handle digital biases)

  • Teaching digital history

usage of the developed tools in the classroom

  • Historical use case

resistance to the European idea

26 of 33

Some numbers ...

755

25,000,000

54 TB

so far ...

42

27 of 33

Languages, pages, words

Word estimate (~3,500 words per page):

de: 16 million pages → ~56 billion words

fr: 7.6 million pages → ~26 billion words
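The estimate is simple arithmetic and can be reproduced as below. The pairing of the page counts with the languages (16 million pages for de, 7.6 million for fr) is an assumption inferred from the word totals; the variable and function names are illustrative.

```python
WORDS_PER_PAGE = 3_500  # rough per-page estimate given on the slide

# Assumed pairing of page counts to languages, inferred from the totals.
pages = {"de": 16_000_000, "fr": 7_600_000}

def estimate_words(page_count, words_per_page=WORDS_PER_PAGE):
    """Estimate the number of words from a page count."""
    return page_count * words_per_page

estimates = {lang: estimate_words(n) for lang, n in pages.items()}
for lang, words in estimates.items():
    print(f"{lang}: ~{words / 1e9:.1f} billion words")
```

With these inputs the French figure comes out at about 26.6 billion, which the slide rounds to ~26 billion.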

28 of 33

OCR

Bur Befretttîtigkr frati. ©in befannter5ßrofeffor aus Bern mürbe gefragt, ob er unfere petitionitnterfcfjreibeitmoïïe. ©r antmortete:,,Sd) untergeidjne.BerffüchteratS eS je|tift, fann eS nach ©im füfjrungbeS grauenftintmrecfjtSnicfjt merben." B5ir netjmen an, ber gefefjrtefrerr, ein fc|arffinntgerSurift, tjabe in ©e« banfen tjiugugefügt:„B5er roeiß, ob bie B3eft burefj ben ©in« ftuß ber grau nidjt beffer mirb." ©ieferScann mitfunS affo ©efegenfjeitgeben gu geigen, WaS mir feiften fönnen. ©aS ift nicfjt afs reefjt unb billig. ©ie nteiftenScänner aber molten unS nidjt gur 5ßrobe gu« laffen, fonbern fjaften an ihrem altererbtenBorurteitfeft.

29 of 33

Geographical distribution (very preliminary)

[Map: counts per location, labels not recoverable: 20, 1, 28, 8, 8, 12, 2, 5, 55, 4, 5, 38, 19, 5, 1]

n=233

30 of 33

… and other interesting facts!

  • name changes
  • periodicity
  • geographic outreach
  • various political orientations

31 of 33

Processing

32 of 33

My Research within impresso

  • Topic modeling
    • dynamic
    • cross-lingual
    • relational

33 of 33

Outlook

  • Acquiring more relevant source texts (e.g. Wiener Diarium)
  • Foster interdisciplinarity
  • New way of working together with researchers in the humanities
  • Push boundaries!