Bridging Literature and Information Science
2017/12/06, Digital Humanities Austria, University of Innsbruck
Phillip Ströbel, M.A.
@CLingophil
Bridges ...
Why?
Agenda
Text+Berg project (www.textberg.ch)
Some numbers
| language | tokens (in millions) | types | types (lowercased) | lemma types | unk |
| de | 22.81 | 763,000 | 722,000 | 294,000 | 790,000 |
| fr | 14.68 | 290,000 | 266,000 | 62,000 | 565,000 |
| it | 0.8 | 62,000 | 57,000 | 22,000 | 61,000 |
| rm | 0.05 | 13,000 | 12,000 | - | - |
| en | 0.03 | 6,500 | 6,000 | 3,700 | 2,600 |
| gsw | 0.02 | 4,700 | 4,500 | - | - |
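The difference between the token, type and lowercased-type columns can be shown on a toy sentence. This is a minimal sketch that assumes a simple regex tokeniser; the tokenisation and counting rules actually used in Text+Berg are not stated here, and the function name `count_types` is hypothetical.

```python
import re


def count_types(text: str) -> dict:
    """Count tokens, types and lowercased types in a text.

    Assumption: a token is any run of word characters. The real
    pipeline may tokenise differently (e.g. hyphens, abbreviations).
    """
    tokens = re.findall(r"\w+", text)
    return {
        "tokens": len(tokens),                            # running words
        "types": len(set(tokens)),                        # distinct word forms
        "types_lower": len({t.lower() for t in tokens}),  # distinct forms after lowercasing
    }


sample = "Der Berg ruft. der Berg ist hoch, und der Gipfel ist nah."
counts = count_types(sample)
print(counts)
```

For the sample sentence this gives 12 tokens, 9 types and 8 lowercased types: "Der" and "der" count as two types but collapse into one after lowercasing, which is why the lowercased column in the table is always the smaller of the two.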
Processing pipeline - OLR & OCR
Processing pipeline - OCR → HTML
Processing steps - Tokenisation, PoS tagging, NER, Parsing
Code-Switching Detection
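The slide does not say how code-switching is detected in the project. Purely as an illustration of the idea, one simple baseline guesses the language of each sentence from stopword overlap and flags sentences that differ from the main language of the text. The stopword lists, function names and example sentences below are all hypothetical, and a real system would likely use a trained language identifier instead.

```python
import re

# Tiny illustrative stopword lists (assumption: far too small for real use).
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "wir", "auf"},
    "fr": {"le", "la", "les", "et", "est", "pas", "un", "nous", "sur"},
    "en": {"the", "and", "is", "not", "a", "we", "on", "of", "to"},
}


def guess_language(sentence):
    """Return the language whose stopwords occur most often, or None."""
    tokens = re.findall(r"\w+", sentence.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def find_code_switches(sentences, main_language):
    """Return (index, language) for sentences not in the main language."""
    switches = []
    for index, sentence in enumerate(sentences):
        lang = guess_language(sentence)
        if lang is not None and lang != main_language:
            switches.append((index, lang))
    return switches


sentences = [
    "Wir stiegen auf den Gipfel und die Sicht ist klar.",
    "The view is not to be missed.",
    "Der Abstieg ist nicht schwer.",
]
switches = find_code_switches(sentences, main_language="de")
print(switches)
```

Here only the second sentence is flagged, as English inside a German text. Sentence-level detection like this misses switches inside a sentence, so it should be read as a starting point rather than a description of the project's method.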
Applications - KOKOS
GeoKOKOS
Multilinguality (LINK)
How Language Shapes Geography
Discourse Semantics
… well, not quite! Imagine you are a historian ...
The modern historian!
impresso
Consortium
Synergies
Designers & developers:
Mission: connect
Contribution: design and visualization expertise, interface development
Benefit: tangible products used by many people
Computational Linguists & Digital Humanists:
Mission: research
Contribution: research in NLP/DH, algorithms, tool implementation
Benefit: research
(Digital) Historians:
Mission: research
Contribution: research questions & methodology, needs, participation in co-design
Benefit: tools to support historical research
Journalists & Publishers:
Mission: inform
Contribution: sources, newspaper expertise, journalist and user needs
Benefit: enriched sources, open tools
Archives & Libraries:
Mission: preservation, valorisation
Contribution: sources, user needs
Benefit: enriched sources, open tools, support for prototype deployment
Objective 1: Historical media monitoring tool suite
How to adapt NLP tools to historical texts?
OCR post-correction, lexical processing, named entity processing, topic modeling
summative and formative evaluation, shared task organized within NLP community
semantically indexed, structured and linked data
Objective 2: Visualization interface and visual analytics
How to explore complex and vast amounts of data?
Objective 3: Digital history
Investigating the impact of new tooling on historical research and scholarship
source criticism & digital scholarship (how to handle digital biases)
usage of the developed tools in the classroom
resistance to the European idea
Some numbers ...
755
25,000,000
54 TB
so far ...
42
Languages, pages, words
Word estimate (~3,500 words per page):
de → ~56 billion words
fr → ~26 billion words
fr: 7.6 mio pages
de: 16 mio pages
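The word estimates follow from multiplying page counts by the average of roughly 3,500 words per page. The sketch below reproduces that arithmetic; it reads the figures 16 mio and 7.6 mio as page counts for German and French respectively, which is an assumption that the stated word totals support. The function name `estimate_words` is hypothetical.

```python
WORDS_PER_PAGE = 3_500  # rough average stated on the slide

# Assumption: page counts per language, matched to the word estimates.
pages = {"de": 16_000_000, "fr": 7_600_000}


def estimate_words(page_count: int, words_per_page: int = WORDS_PER_PAGE) -> int:
    """Estimate the number of words from a page count."""
    return page_count * words_per_page


estimates = {lang: estimate_words(count) for lang, count in pages.items()}
for lang, words in estimates.items():
    print(f"{lang}: ~{words / 1e9:.1f} billion words")
```

This yields 56.0 billion words for German and 26.6 billion for French, in line with the rounded figures of ~56 and ~26 billion.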
OCR
Bur Befretttîtigkr frati. ©in befannter5ßrofeffor aus Bern mürbe gefragt, ob er unfere petitionitnterfcfjreibeitmoïïe. ©r antmortete:,,Sd) untergeidjne.BerffüchteratS eS je|tift, fann eS nach ©im füfjrungbeS grauenftintmrecfjtSnicfjt merben." B5ir netjmen an, ber gefefjrtefrerr, ein fc|arffinntgerSurift, tjabe in ©e« banfen tjiugugefügt:„B5er roeiß, ob bie B3eft burefj ben ©in« ftuß ber grau nidjt beffer mirb." ©ieferScann mitfunS affo ©efegenfjeitgeben gu geigen, WaS mir feiften fönnen. ©aS ift nicfjt afs reefjt unb billig. ©ie nteiftenScänner aber molten unS nidjt gur 5ßrobe gu« laffen, fonbern fjaften an ihrem altererbtenBorurteitfeft.
Geographical distribution (very preliminary)
[Map: counts per location, n=233]
… and other interesting facts!
Processing
My Research within impresso
Outlook