1 of 56

GLOBAL DIGITAL HUMANITIES

TEXT-MINING MULTILINGUAL LITERARY CORPORA

2 of 56

Computational Criticism has a language problem

Despite the ability of text mining techniques to analyze hundreds of thousands of texts, there still exists no reliable way to compare statistics between languages

From a literary perspective, this means that phenomena that take place across nations, or between languages, are never fully explored through computation

For example, the Enlightenment, Modernism, and the Renaissance are all aesthetic, political, and (for our purposes) literary movements that include objects in a number of (European) languages

Currently, no digital humanities project is capable of exploring any phenomenon in a global context: world literature remains impossible

3 of 56

Anglo-Centrism

Most corpora of texts are in English: this means that most computational critical work is also done in English, creating a global disparity of both resources and methods

Because most corpora are in English, most methods are developed primarily for English language texts, creating difficulties with NLP models and particularly non-English character sets.

Roopika Risam (2016) argues that non-English Digital Humanities (particularly in the Global South) suffers as a result of this deficit

For example, even the French Bibliothèque Nationale corpus numbers 120,000 volumes; the Literary Lab's collective English corpora alone number over 540,000 volumes

4 of 56

Anglo-Centrism: Hathi Trust

5 of 56

Global Digital Humanities: Aligning Results

This project seeks to find ways of aligning quantitative results across multi-lingual corpora

While individual words are irreducible in their given languages, the summary statistics generated by quantitative literary criticism can be made comparable

But we lack methods for aligning these results and making them compatible across corpora

6 of 56

Alignment is NOT translation

The act of translation does violence to the text by substituting words and concepts that may not be identical, or even meaningfully similar

Rybicki (2012) shows how translator signals in single author corpora can derail a stylometric analysis for authorship

Despite the improvements in computational translation, it is impossible to automate even a base level translation such that meaning, let alone the textual nuances, is retained

7 of 56

Alignment is NOT translation

The translation makes the text brutal, replacing words and phrases that are not necessarily identical or similar in meaning

Rybicki (2012) explains how interpreter signals in one author's authors can hamper the analysis of the standard measurement of composition

Despite improvements in the translation of calculations, it is not even possible to automate the basic translation, so the value, including text colors, is retained.

8 of 56

Global Collaborations

Over the past several years, the Literary Lab has joined with other Digital Humanities organizations to explore multi-lingual projects:

The unexpected turn: narrative analysis of early 20th-century short stories in newspapers and journals in Greek and English with Anastasia Natsina (University of Crete, Greece)

Sententious Sentences in English and German with Fotis Jannidis (Universität Würzburg, Germany)

Literature/Littérature with Alexandre Gefen and Marianne Reboul (CNRS, France)

9 of 56

Multilingual Alignment: 3 Experiments

Network Analysis

Topic Models

Word Embeddings / Vector Models

10 of 56

Dramatic Structure

1: Networks

11 of 56

Networks abstract structure from the language of the text

Dramatic networks connect characters (nodes) by how many lines of dialogue they exchange (edges)

These visualizations retain coherence across languages

We can also extract network statistics to compare across models
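
A minimal sketch of how such a network and its statistics might be computed, assuming a play has already been reduced to (speaker, addressee, lines exchanged) triples. The character names and line counts below are invented for illustration and are not taken from any edition of Amphitryon. Betweenness is the standard unweighted shortest-path measure (Brandes' algorithm); the slides do not say which variant or weighting was actually used.

```python
from collections import defaultdict, deque

def build_network(exchanges):
    """Undirected graph; edge weight = lines of dialogue exchanged."""
    graph = defaultdict(dict)
    for a, b, lines in exchanges:
        graph[a][b] = graph[a].get(b, 0) + lines
        graph[b][a] = graph[b].get(a, 0) + lines
    return graph

def density(graph):
    """Share of possible character pairs that actually exchange dialogue."""
    n = len(graph)
    edges = sum(len(neighbours) for neighbours in graph.values()) / 2
    return 0.0 if n < 2 else edges / (n * (n - 1) / 2)

def betweenness(graph):
    """Unweighted betweenness centrality (Brandes), normalized to [0, 1]."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                         # nodes in order of BFS discovery
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        sigma = {v: 0 for v in graph}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(graph)
    scale = (n - 1) * (n - 2)  # every pair is counted from both endpoints
    return {v: (c / scale if scale else 0.0) for v, c in score.items()}

# Hypothetical play with four characters.
exchanges = [
    ("Host", "Guest", 40),
    ("Host", "Servant", 25),
    ("Host", "Messenger", 10),
    ("Guest", "Servant", 5),
]
network = build_network(exchanges)
centrality = betweenness(network)
print(round(density(network), 3), round(centrality["Host"], 3))
```

Because both statistics depend only on who speaks to whom, they can be computed in the same way for an English or a German edition of the same play.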

Dramatic Structure

12 of 56

Centrality: Betweenness

13 of 56

Centrality: Betweenness

14 of 56

Amphitryon, Dryden (1690) – Betweenness Centrality

15 of 56

Amphitryon, Hawkesworth (1756) – Betweenness Centrality

16 of 56

Amphitryon, Kleist (1803) – Betweenness Centrality

17 of 56

Network Density of Amphitryon (Three Editions)

18 of 56

Topicity Between Languages

2: Topic Models

19 of 56

TWO TOPICS FROM A TOPIC MODEL OF ~200 WORKS OF SUSPENSE FICTION

20 of 56

Topic Model: The Ambassadors

21 of 56

Epistemological words

question

learning

understanding

secret

prove

Topic Model: The Ambassadors

22 of 56

Space/time words

reached

hotel

friend

evening

room

Topic Model: The Ambassadors

23 of 56

Topicity: Mono-Topical Paragraph

24 of 56

Topicity: Bi-Topical Paragraph

25 of 56

Topicity in Dickens vs Goethe

26 of 56

Topicity in Dickens vs Goethe: scaled

27 of 56

Topicity in Dickens vs Goethe

Dickens Mean Topicity: 2.55 (Scaled); 2.02 (Unscaled)

Goethe Mean Topicity: 1.08 (Scaled); 1.04 (Unscaled)

Confounding Effects:

Language baseline: does English have a higher topic density than German at all scales?
Period: Dickens is writing different kinds of novels than Goethe

A set of protocols or methods for aligning these results would allow us to interpret the difference.
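
The slides do not give a formal definition of topicity. One plausible reading, assumed here purely for illustration, is the number of topics whose share of a paragraph reaches some threshold, averaged over all paragraphs of a text. The threshold of 0.2 and the toy topic distributions below are hypothetical, and the scaling step mentioned in the figures above is not reproduced.

```python
def topicity(topic_shares, threshold=0.2):
    """Count the topics whose share of one paragraph is at least `threshold`."""
    return sum(1 for share in topic_shares if share >= threshold)

def mean_topicity(paragraphs, threshold=0.2):
    """Average topicity over a list of per-paragraph topic distributions."""
    counts = [topicity(shares, threshold) for shares in paragraphs]
    return sum(counts) / len(counts)

# Hypothetical per-paragraph topic shares from a four-topic model.
mono_topical = [0.85, 0.05, 0.05, 0.05]   # one dominant topic
bi_topical = [0.45, 0.40, 0.10, 0.05]     # two topics above the threshold

print(topicity(mono_topical), topicity(bi_topical))
print(mean_topicity([mono_topical, bi_topical]))
```

Under this reading, a mean close to 1 would indicate mostly mono-topical paragraphs, and a mean above 2 would indicate paragraphs that typically mix two or more topics.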

28 of 56

History of Literature / Histoire de Littérature

3: Vector Models

29 of 56

CURRENT CORPUS (FRENCH)

1840 titles of literary criticism (417 perfectly digitized, the rest not so well)
758 authors
Between 1707 and 1949
Texts chosen from a larger database of currently digitized literary criticism
All metadata verified and curated (which means many difficult decisions …)
419 978 033 tokens
29 000+ occurrences of “littérature”, more than 58 000 if we include typos and derived forms (e.g. “littéraires”)
0.0063% global frequency of “littérature” in our corpus compared to 0.0035% in Google Books (for 1800-1900): about 1.8 times as frequent
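
The comparison with Google Books is a simple ratio of relative frequencies. A small sketch of the arithmetic, using only the two percentages stated above (the function name is hypothetical):

```python
def frequency_ratio(corpus_pct, reference_pct):
    """How many times more frequent a term is in the corpus than in the reference."""
    return corpus_pct / reference_pct

# Percentages as reported on the slide.
ratio = frequency_ratio(0.0063, 0.0035)
print(round(ratio, 2))
```

The ratio comes out at about 1.8, which is where the "roughly twice as focused" reading comes from.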

30 of 56

CURRENT CORPUS (ENGLISH)

Based on ProQuest database “Historical Literary Criticism”
500 authors (as subject); 20,129 texts
Between 1503 and 1992
However: dating is a problem for some works as some dates correspond to the edition, rather than the original text – 7819 (39%) have an unreliable date
46 504 576 tokens
19 896 occurrences of “literature”, more than 44 000 if we include typos and derived forms (e.g. “literary”)

31 of 56

Corpus Statistics

32 of 56

Corpus Statistics

33 of 56

Corpus Statistics

34 of 56

KNOWN CORPUS BIASES

Mainly 19th century, for practical reasons
Some critical/pedagogical editions of older work and many extracts (French)
Critical excerpts from texts (English) / Only books and no press (French)
English corpus is solely author-focused: no texts on belles lettres in general
Too many books on “critique dramatique”
Discrete data: this means no continuous line plots, only scatter plots and trend curves

35 of 56

Global Vectors for Word Representation: GloVe

(Pennington, Socher and Manning, 2014)

Log-bilinear model with a weighted least-squares objective

Less computationally expensive than the neural nets associated with word2vec but similar results at high token counts (n>5M)

Allows for distance scores via cosine similarity as well as vector math on results
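
For reference, the weighted least-squares objective as given by Pennington, Socher and Manning (2014) can be written as follows. Here X_ij is the number of times word j occurs in the context of word i, w_i and w̃_j are the word and context vectors, b_i and b̃_j are bias terms, and V is the vocabulary size. The weighting function f caps the influence of very frequent pairs; the paper reports x_max = 100 and α = 3/4 as defaults, which are not necessarily the settings used for the models in this deck.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2},
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```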

Vector Models

36 of 56

All vectors calculated to 150 dimensions using 5-token skip-grams

Vector model uses shared contextual similarity to assign distance:

Two words “close” to each other may never appear in the same sentence.

Synonym/antonym relationships are captured by the model
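
A minimal sketch of how closeness between terms can be scored with cosine similarity. The three-dimensional vectors below are invented for illustration; the models described here use 150 dimensions, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest(term, vectors, k=3):
    """Return the k terms most similar to `term`, including the term itself."""
    target = vectors[term]
    scored = [(word, cosine(target, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, not taken from the trained models.
toy_vectors = {
    "literature": [0.9, 0.1, 0.3],
    "poetry": [0.8, 0.2, 0.4],
    "century": [0.1, 0.9, 0.2],
}

neighbours = closest("literature", toy_vectors)
for word, score in neighbours:
    print(word, round(score, 3))
```

The query term always ranks first with a score of 1, because a vector is maximally similar to itself.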

Vector Models

37 of 56

Closest_Words_English	En_Terms_Score	Closest_Words_French	Fr_Terms_Score
literature	1	littérature	1
modern	0.752334704	française	0.8171132
poetry	0.743134099	poésie	0.795609667
history	0.740588306	moderne	0.763611367
english	0.731722264	contemporaine	0.747738682
fiction	0.722667968	histoire	0.742839262
art	0.694622969	littéraire	0.727554381
literary	0.67289793	langue	0.707563985
american	0.657313228	allemande	0.690306253
german	0.651461481	philosophie	0.689387309
science	0.64739467	anglaise	0.679375838
especially	0.644995867	critique	0.678499313
england	0.63924782	france	0.6757804
french	0.63682407	époque	0.651479153
european	0.630711079	dramatique	0.650263723
writers	0.626208597	surtout	0.643750204
philosophy	0.6254931	poétique	0.64152213
studies	0.614163419	siècle	0.637453337
century	0.610898707	art	0.636201112

Terms closest to “literature/littérature”

38 of 56

Literature-Nationality	Lit-Nat_Score	Littérature-Nationalité	Litt-Nation_Score
literature	0.776436357	littérature	0.736426392
art	0.63578764	histoire	0.601281709
fiction	0.612817902	poésie	0.561197857
poetry	0.609578411	essais	0.558929646
history	0.606552679	livre	0.558598607
modern	0.59033175	ouvrages	0.555483853
english	0.586196852	siècle	0.553174961
especially	0.578234135	critique	0.548915533
works	0.575399373	moderne	0.547627737
books	0.572133177	surtout	0.543034782
since	0.561425571	littéraire	0.533972948
science	0.557590999	genre	0.533355506
studies	0.542464494	art	0.530875284
england	0.542227282	philosophie	0.528667599
now	0.532075515	chapitre	0.522507932
also	0.529803747	roman	0.508692744
philosophy	0.527389281	française	0.508401501
present	0.523795619	contemporaine	0.507069903
writing	0.519118682	temps	0.506582681

Terms closest to “literature/littérature – nationality/nationalité”
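
A sketch of the vector arithmetic behind a query such as "literature – nationality": subtract one term's vector from the other's, then rank the vocabulary by cosine similarity to the result. The vectors below are invented three-dimensional examples chosen so that the effect is visible; they are not values from the trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Toy vectors (hypothetical): the second dimension loosely stands for "nation".
literature = [0.9, 0.5, 0.3]
nationality = [0.1, 0.6, 0.0]
english = [0.5, 0.9, 0.1]
poetry = [0.8, 0.1, 0.4]

query = subtract(literature, nationality)

print("english:", round(cosine(literature, english), 3), "->", round(cosine(query, english), 3))
print("poetry: ", round(cosine(literature, poetry), 3), "->", round(cosine(query, poetry), 3))
```

In this toy case, removing the nationality component pushes the national term down the ranking and moves the genre term up, which is the kind of shift the two tables are meant to expose.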

39 of 56

Terms closest to “literature”

40 of 56

Terms closest to “literature” (detail)

41 of 56

Terms closest to “littérature”

42 of 56

Terms closest to “littérature” (detail)

43 of 56

Nationality as a function of history

44 of 56

Nationalité as a function of history

45 of 56

Art vs Science (English)

46 of 56

Art vs Science (French)

47 of 56

Arts vs Sciences (French)

48 of 56

Art-Science vector model

49 of 56

Arts-Sciences vector model

50 of 56

Philosophie vs Philosophy

51 of 56

Decreasing terms of function (English)

52 of 56

Increasing literary/criticism terms (English)

53 of 56

Decreasing disciplinary terms (French)

54 of 56

Increasing professional terms (French)

55 of 56

Romantic-Classical vector model

56 of 56

Romantique-Classique vector model