1 of 56

GLOBAL DIGITAL HUMANITIES

TEXT-MINING MULTILINGUAL LITERARY CORPORA

2 of 56

Computational Criticism has a language problem

  • Despite the ability of text mining techniques to analyze hundreds of thousands of texts, there still exists no reliable way to compare statistics between languages

  • From a literary perspective, this means that phenomena that take place across nations, or between languages, are never fully explored through computation

  • For example, the Enlightenment, Modernism, and the Renaissance are all aesthetic, political, and (for our purposes) literary movements that include objects in a number of (European) languages

  • Currently, no digital humanities project is capable of exploring any phenomenon in a global context: world literature remains impossible

3 of 56

Anglo-Centrism

  • Most corpora of texts are in English: this means that most computational critical work is also done in English, creating a global disparity in both resources and methods

  • Because most corpora are in English, most methods are developed primarily for English-language texts, creating difficulties with NLP models and particularly with non-English character sets

  • Roopika Risam (2016) argues that non-English Digital Humanities (particularly in the Global South) suffers as a result of this deficit

  • For example, even the French Bibliothèque Nationale corpus numbers 120,000 volumes; the Literary Lab's collective English corpora alone number over 540,000 volumes

4 of 56

Anglo-Centrism: Hathi Trust

5 of 56

Global Digital Humanities: Aligning Results

  • This project seeks to find ways of aligning quantitative results across multi-lingual corpora

  • While individual words are irreducible in their given languages, the summary statistics generated by quantitative literary criticism can be made comparable

  • But we lack methods for aligning these results and making them compatible across corpora

6 of 56

Alignment is NOT translation

  • The act of translation does violence to the text by substituting words and concepts that may not be identical, or even meaningfully similar

  • Rybicki (2012) shows how translator signals in single author corpora can derail a stylometric analysis for authorship

  • Despite the improvements in computational translation, it is impossible to automate even a base-level translation such that meaning, let alone textual nuance, is retained

7 of 56

Alignment is NOT translation

  • The translation makes the text brutal, replacing words and phrases that are not necessarily identical or similar in meaning

  • Rybicki (2012) explains how interpreter signals in one author's authors can hamper the analysis of the standard measurement of composition

  • Despite improvements in the translation of calculations, it is not even possible to automate the basic translation, so the value, including text colors, is retained.

8 of 56

Global Collaborations

Over the past several years, the Literary Lab has joined with other Digital Humanities organizations to explore multi-lingual projects:

  • The unexpected turn: narrative analysis of early 20th century short stories in newspapers and journals in Greek and English with Anastasia Natsina (University of Crete, Greece)

  • Sententious Sentences in English and German with Fotis Jannidis (Universität Würzburg, Germany)

  • Literature/Littérature with Alexandre Gefen and Marianne Reboul (CNRS, France)

9 of 56

Multilingual Alignment: 3 Experiments

  1. Network Analysis

  2. Topic Models

  3. Word Embeddings / Vector Models

10 of 56

Dramatic Structure

1: Networks

11 of 56

Dramatic Structure

  • Networks abstract structure from the language of the text

  • Dramatic networks connect characters (nodes) by how many lines of dialogue they exchange (edges)

  • These visualizations retain coherence across languages

  • We can also extract network statistics to compare across models
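The network construction described on this slide can be sketched as follows. This is a minimal illustration, not the project's pipeline: the speech sequence is invented, and "exchanging dialogue" is approximated (as an assumption) by two characters speaking consecutively within the same scene. The helper names `build_edges`, `adjacency`, `density`, and `betweenness` are hypothetical. Betweenness is computed with Brandes' algorithm on the unweighted graph.

```python
from collections import Counter, deque

# Invented speech order as (scene, speaker) pairs. The character names are
# borrowed from Amphitryon, but the sequence is illustrative, not real data.
SPEECHES = [
    (1, "Jupiter"), (1, "Mercury"), (1, "Jupiter"),
    (2, "Sosia"), (2, "Mercury"), (2, "Sosia"), (2, "Mercury"),
    (3, "Alcmena"), (3, "Jupiter"), (3, "Alcmena"),
    (4, "Amphitryon"), (4, "Sosia"), (4, "Amphitryon"),
]

def build_edges(speeches):
    """Weight each character pair by consecutive speeches within a scene."""
    edges = Counter()
    for (scene_a, a), (scene_b, b) in zip(speeches, speeches[1:]):
        if scene_a == scene_b and a != b:
            edges[frozenset((a, b))] += 1
    return edges

def adjacency(edges):
    """Turn the weighted edge list into an unweighted neighbour map."""
    adj = {}
    for pair in edges:
        a, b = tuple(pair)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def density(adj):
    """Share of possible character pairs that are actually connected."""
    n = len(adj)
    if n < 2:
        return 0.0
    degree_sum = sum(len(neighbours) for neighbours in adj.values())
    return degree_sum / (n * (n - 1))

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes, unweighted, undirected)."""
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        order, preds = [], {v: [] for v in adj}
        paths = dict.fromkeys(adj, 0)
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while order:  # accumulate dependencies from the farthest node back
            w = order.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}

EDGES = build_edges(SPEECHES)
ADJ = adjacency(EDGES)
SCORES = betweenness(ADJ)
print(density(ADJ), max(SCORES, key=SCORES.get))
```

In this toy sequence Mercury has the highest betweenness, because every shortest path between the Jupiter/Alcmena side and the Sosia/Amphitryon side runs through him. Neither statistic depends on the language of the dialogue, only on who speaks to whom.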

12 of 56

Centrality: Betweenness

13 of 56

Centrality: Betweenness

14 of 56

Amphitryon, Dryden (1690) – Betweenness Centrality

15 of 56

Amphitryon, Hawkesworth (1756) – Betweenness Centrality

16 of 56

Amphitryon, Kleist (1803) – Betweenness Centrality

17 of 56

Network Density of Amphitryon (Three Editions)

18 of 56

Topicity Between Languages

2: Topic Models

19 of 56

TWO TOPICS FROM A TOPIC MODEL OF ~200 WORKS OF SUSPENSE FICTION

20 of 56

Topic Model: The Ambassadors

21 of 56

Topic Model: The Ambassadors

Epistemological words: question, learning, understanding, secret, prove

22 of 56

Topic Model: The Ambassadors

Space/time words: reached, hotel, friend, evening, room

23 of 56

Topicity: Mono-Topical Paragraph

24 of 56

Topicity: Bi-Topical Paragraph

25 of 56

Topicity in Dickens vs Goethe

26 of 56

Topicity in Dickens vs Goethe: scaled

27 of 56

Topicity in Dickens vs Goethe

Dickens Mean Topicity: 2.55 (Scaled); 2.02 (Unscaled)

Goethe Mean Topicity: 1.08 (Scaled); 1.04 (Unscaled)

Confounding Effects:

  1. Language baseline: does English have a higher topic density than German at all scales?
  2. Period: Dickens is writing different kinds of novels than Goethe

A set of protocols or methods for aligning these results would allow us to interpret the difference.
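The slides do not define topicity formally. As a rough sketch, assuming it counts the topics whose share of a paragraph's topic distribution reaches some threshold (so a mono-topical paragraph scores 1 and a bi-topical one scores 2), the mean could be computed as below. The threshold of 0.2, the function names, and the topic shares are all invented for illustration. The "scaled" variant reported above is not reconstructed here.

```python
def topicity(topic_shares, threshold=0.2):
    """Number of topics whose share of the paragraph reaches the threshold."""
    return sum(1 for share in topic_shares if share >= threshold)

def mean_topicity(paragraphs, threshold=0.2):
    """Average topicity over a list of per-paragraph topic distributions."""
    counts = [topicity(shares, threshold) for shares in paragraphs]
    return sum(counts) / len(counts)

# Invented per-paragraph topic distributions (each sums to 1).
MONO_TOPICAL = [0.85, 0.05, 0.05, 0.05]
BI_TOPICAL = [0.45, 0.40, 0.10, 0.05]

print(topicity(MONO_TOPICAL), topicity(BI_TOPICAL))
print(mean_topicity([MONO_TOPICAL, BI_TOPICAL]))
```

Under this reading, a mean near 1 (as for Goethe) would indicate mostly mono-topical paragraphs, while a mean above 2 (as for Dickens) would indicate paragraphs that typically mix two or more topics.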

28 of 56

History of Literature / Histoire de Littérature

3: Vector Models

29 of 56

CURRENT CORPUS (FRENCH)

  • 1,840 titles of literary criticism (417 cleanly digitized, the rest less so)
  • 758 authors
  • Published between 1707 and 1949
  • Texts chosen from a larger database of currently digitized literary criticism
  • All metadata verified and curated (which means many difficult decisions …)
  • 419,978,033 tokens
  • 29,000+ occurrences of “littérature”, more than 58,000 if we include typos and derived forms (e.g. “littéraires”)
  • 0.0063% global frequency of “littérature” in our corpus compared to 0.0035% in Google Books (for 1800-1900): roughly twice the rate, so the corpus is more focused on the term
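The frequency comparison in the last bullet is simple arithmetic. A small sketch, assuming relative frequency means occurrences divided by total tokens, expressed as a percentage: the toy sentence and the function name are invented, and the two percentages are the ones quoted on the slide.

```python
def relative_frequency(tokens, term):
    """Occurrences of `term` as a percentage of all tokens."""
    return 100 * tokens.count(term) / len(tokens)

# Toy example: one occurrence among eight tokens.
toy_tokens = "la littérature est la mémoire de la langue".split()
toy_pct = relative_frequency(toy_tokens, "littérature")

# Percentages quoted on the slide: this corpus versus Google Books (1800-1900).
CORPUS_PCT = 0.0063
GOOGLE_BOOKS_PCT = 0.0035
ratio = CORPUS_PCT / GOOGLE_BOOKS_PCT

print(f"toy frequency: {toy_pct}%")
print(f"corpus / Google Books: {ratio:.1f}x")
```

The ratio of the two quoted figures comes out at about 1.8.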

30 of 56

CURRENT CORPUS (ENGLISH)

  • Based on ProQuest database “Historical Literary Criticism”
  • 500 authors (as subject); 20,129 texts
  • Between 1503 and 1992
  • However: dating is a problem for some works as some dates correspond to the edition, rather than the original text – 7819 (39%) have an unreliable date
  • 46,504,576 tokens
  • 19,896 occurrences of “literature”, more than 44,000 if we include typos and derived forms (e.g. “literary”)

31 of 56

Corpus Statistics

32 of 56

Corpus Statistics

33 of 56

Corpus Statistics

34 of 56

KNOWN CORPUS BIASES

  • Mainly 19th century, for practical reasons
  • Some critical/pedagogical editions of older works and many extracts (French)
  • Critical excerpts from texts (English) / Only books and no press (French)
  • English corpus is solely author-focused: no texts on belles lettres in general
  • Too many books on “critique dramatique” (drama criticism)
  • Discrete data: no continuous lines, only scatter plots and trend curves

35 of 56

Vector Models

Global Vectors for Word Representation: GloVe

(Pennington, Socher and Manning, 2014)

Log-bilinear model with a weighted least-squares objective

Less computationally expensive than the neural networks associated with word2vec, with similar results at high token counts (n > 5M)

Allows for distance scores via cosine similarity as well as vector arithmetic on results
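For reference, the weighted least-squares objective can be written out as it appears in the GloVe paper. The symbols are not defined on the slide: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that damps very frequent pairs. The paper reports $x_{\max} = 100$ and $\alpha = 3/4$; whether this project kept those defaults is not stated.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2},
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```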

36 of 56

Vector Models

All vectors calculated to 150 dimensions using 5-token skip-grams

Vector model uses shared contextual similarity to assign distance:

Two words “close” to each other may never appear in the same sentence.

Synonym/antonym relationships are captured by the model
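The point that two "close" words need never share a sentence can be illustrated with raw co-occurrence counts. This is only the intuition behind the model, not GloVe training: the sentences are invented, the window of 5 tokens mirrors the slide, and `cooccurrence` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# Invented sentences: "poem" and "verse" never occur together,
# but they are surrounded by the same words.
SENTENCES = [
    "the poet wrote a poem about love",
    "the poet wrote a verse about love",
    "the critic read a poem in paris",
    "the critic read a verse in paris",
    "the train left the station at noon",
]

def cooccurrence(sentences, window=5):
    """Count, for every word, the words within `window` tokens of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

COUNTS = cooccurrence(SENTENCES)
poem_verse = cosine(COUNTS["poem"], COUNTS["verse"])
poem_station = cosine(COUNTS["poem"], COUNTS["station"])
print(poem_verse, poem_station)
```

Here "verse" never appears in the context of "poem", yet the two have identical context profiles and therefore a much higher similarity than "poem" and "station", which share only the article "the".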

37 of 56

Terms closest to “literature/littérature”

Closest_Words_English  En_Terms_Score  Closest_Words_French  Fr_Terms_Score
literature             1               littérature           1
modern                 0.752334704     française             0.8171132
poetry                 0.743134099     poésie                0.795609667
history                0.740588306     moderne               0.763611367
english                0.731722264     contemporaine         0.747738682
fiction                0.722667968     histoire              0.742839262
art                    0.694622969     littéraire            0.727554381
literary               0.67289793      langue                0.707563985
american               0.657313228     allemande             0.690306253
german                 0.651461481     philosophie           0.689387309
science                0.64739467      anglaise              0.679375838
especially             0.644995867     critique              0.678499313
england                0.63924782      france                0.6757804
french                 0.63682407      époque                0.651479153
european               0.630711079     dramatique            0.650263723
writers                0.626208597     surtout               0.643750204
philosophy             0.6254931       poétique              0.64152213
studies                0.614163419     siècle                0.637453337
century                0.610898707     art                   0.636201112

38 of 56

Terms closest to “literature/littérature – nationality/nationalité”

Literature-Nationality  Lit-Nat_Score  Littérature-Nationalité  Litt-Nation_Score
literature              0.776436357    littérature              0.736426392
art                     0.63578764     histoire                 0.601281709
fiction                 0.612817902    poésie                   0.561197857
poetry                  0.609578411    essais                   0.558929646
history                 0.606552679    livre                    0.558598607
modern                  0.59033175     ouvrages                 0.555483853
english                 0.586196852    siècle                   0.553174961
especially              0.578234135    critique                 0.548915533
works                   0.575399373    moderne                  0.547627737
books                   0.572133177    surtout                  0.543034782
since                   0.561425571    littéraire               0.533972948
science                 0.557590999    genre                    0.533355506
studies                 0.542464494    art                      0.530875284
england                 0.542227282    philosophie              0.528667599
now                     0.532075515    chapitre                 0.522507932
also                    0.529803747    roman                    0.508692744
philosophy              0.527389281    française                0.508401501
present                 0.523795619    contemporaine            0.507069903
writing                 0.519118682    temps                    0.506582681
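A query of the form "literature minus nationality" can be sketched as vector subtraction followed by a cosine-similarity ranking. The three-dimensional vectors below are invented purely for illustration (the models described here use 150 dimensions), and `subtract`, `cosine`, and `nearest` are hypothetical helper names rather than part of any GloVe tooling.

```python
import math

# Invented toy vectors; real model vectors would be loaded from the trained model.
VECTORS = {
    "literature":  [0.9, 0.8, 0.1],
    "nationality": [0.1, 0.9, 0.0],
    "poetry":      [0.9, 0.2, 0.2],
    "english":     [0.2, 0.9, 0.1],
    "science":     [0.3, 0.1, 0.9],
}

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, vectors, k=3):
    """The k words whose vectors are most similar to the query vector."""
    ranked = sorted(vectors, key=lambda word: cosine(query, vectors[word]), reverse=True)
    return [(word, cosine(query, vectors[word])) for word in ranked[:k]]

QUERY = subtract(VECTORS["literature"], VECTORS["nationality"])
for word, score in nearest(QUERY, VECTORS):
    print(f"{word:12s} {score:.3f}")
```

In the toy data the second dimension stands in for a "national" component: subtracting it pushes "english" down the ranking and pulls "poetry" up, which is the kind of shift the table above reports for the real models.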

39 of 56

Terms closest to “literature”

40 of 56

Terms closest to “literature” (detail)

41 of 56

Terms closest to “littérature”

42 of 56

Terms closest to “littérature” (detail)

43 of 56

Nationality as a function of history

44 of 56

Nationalité as a function of history

45 of 56

Art vs Science (English)

46 of 56

Art vs Science (French)

47 of 56

Arts vs Sciences (French)

48 of 56

Art-Science vector model

49 of 56

Arts-Sciences vector model

50 of 56

Philosophie vs Philosophy

51 of 56

Decreasing terms of function (English)

52 of 56

Increasing literary/criticism terms (English)

53 of 56

Decreasing disciplinary terms (French)

54 of 56

Increasing professional terms (French)

55 of 56

Romantic-Classical vector model

56 of 56

Romantique-Classique vector model