1 of 32

A ‘hands-off’ approach to modelling constructional change with word embeddings

Lauren Fonteyn & Enrique Manjavacas

‘Modelling constructional variation and change’, 15-16 November 2021, Zurich

2 of 32

Introduction

  • Computational (i.e. distributional) approaches to semantic change have become very popular:
    • see, among many others: Sagi et al. (2011); Hamilton, Leskovec & Jurafsky (2016a, 2016b); Mitra, Mitra, Riedl, Biemann, Mukherjee & Goyal (2014); Bamler & Mandt (2017); Dubossarsky, Weinshall & Grossman (2017); Rosenfeld & Erk (2018); Rudolph & Blei (2018); Kutuzov, Øvrelid, Szymanski & Velldal (2018); Hu, Li & Liang (2019); Dubossarsky, Hengchen, Tahmasebi & Schlechtweg (2019); Tahmasebi, Borin & Jatowt (2019); Del Tredici, Fernández & Boleda (2019); Schlechtweg, McGillivray, Hengchen, Dubossarsky & Tahmasebi (2020); Giulianelli, Del Tredici & Fernández (2020); Bizzoni, Degaetano-Ortlieb, Fankhauser & Teich (2020); …

3 of 32

Introduction

  • Wide-scope studies:
    • not interested in a specific word or construction, but wish to quantify some aspect or ‘law’ of semantic change at large;
    • tend to approach semantic change in bulk (sample sizes: hundreds to thousands of linguistic items)

  • Narrow-scope, case-driven studies:
    • aim to tackle questions about the development of a specific construction (often during a specific time window);
    • relatively inflexible in terms of data;
    • specific phenomenon under scrutiny constitutes a complex challenge.

4 of 32

Aims

  • Delineating a step-wise procedure that:
    • minimizes manual work (‘hands off’)
      • Why?
        • enable automated annotation/analysis for case-driven research to
          • maximally exploit available data
          • maximize the data-driven character (see Perek 2016, 2018)

5 of 32

Outline

  1. Case study: the grammaticalization of to death
  2. Data & Distributed meaning representations (Word2Vec)
  3. Method:
    1. diachronic cluster analysis
    2. sentiment analysis
  4. Results
  5. Discussion
  6. Future work: MacBERTh

6 of 32

Case study: grammaticalization of to death ☠️

Grammaticalization of the phrasal expression to death (Hoeksema & Napoli 2008; Claridge 2011; Margerie 2011; Blanco Suárez 2017):

😵 she was by the godesse wounded to death. (EEBO, 1641)

🙁 That look of yours frightens me to death. (CLMET3.1, 1750)

🙂 I have a new toy and I’m tickled to death. (Gutenberg, 1913)

  • decrease in compositionality: literal death > endpoint, extreme
  • negative meaning of source persists
  • host-class expansion (increase in productivity/schematicity): diversification of collocate verbs (lexical & semantic)

7 of 32

Case study: why?

  • Realistic case study for which clear expectations have been formulated
  • Interesting challenge:
    • Low-frequency phenomenon: prior work indicated that it is difficult to support the development as an instance of grammaticalization (increase in lexical and semantic diversity of collocates) with quantitative evidence.
    • Questions regarding changes in lexical and semantic diversity of collocates lend themselves well to computational analysis, as demonstrated by Perek (2016, 2018).
    • Resolving potentially subjective decisions in annotation (e.g., Is put to death “irrelevant” (Margerie 2011)? Is yawn neutral (Blanco Suárez 2017) or negative?).

8 of 32

Data: corpora

  • Focus: development of to death from Early Modern English into the 20th century (1550 to 1949).
  • Corpora with such a long span tend to be rare and too small to be of use for ‘data-hungry’ models.
  • Solution: ‘patchwork corpus’
    • Early English Books Online (EEBO) + the Corpus of Late Modern English Texts (version 3.1; CLMET3.1) + the Evans Early American Imprints Collection (EVANS) + Eighteenth Century Collections Online (ECCO) + the Corpus of Historical American English (COHA) + Hansard corpus (Hansard).

9 of 32

Data: corpora

  • Pre-processing:
    • Language identification module in order to sort out foreign text:
      • Google's Compact Language Identifier (v3)
      • FastText Language Identification system (Grave 2017)
    • Tokenization and sentence-tokenization of the remaining text:
      • Punkt tokenizers, NLTK package (Bird, Klein & Loper 2009)
    • part-of-speech tagging:
      • Conditional Random Field (CRF) tagger implemented by the library PIE (Manjavacas, Kádár & Kestemont 2019)
      • trained on the PCEEME (corpus of Early Modern English letters; Nevalainen et al. 2006)
      • tagger obtained 96% accuracy on a held-out dataset

10 of 32

Data: to death

  • Attestations of to death retrieved from patchwork corpus and divided into 8 bins (50-year periods)
  • Additional filtering steps:
    • no author dominated more than 25% of the instances in a particular bin;
    • tokens capped at max. 800 per bin.

 

Token counts per corpus and 50-year bin:

| Corpus    | 1550 | 1600 | 1650 | 1700 | 1750 | 1800 | 1850 | 1900 |
|-----------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| CLMET3.1  |    – |    – |    – |   39 |   45 |  182 |  100 |   26 |
| COHA      |    – |    – |    – |    – |    – |  488 |  700 |  774 |
| ECCO      |    – |    – |    – |   78 |  395 |   12 |    – |    – |
| EEBO      |  800 |  800 |  794 |  413 |    – |    2 |    – |    – |
| EVANS     |    – |    – |    6 |  211 |  360 |  116 |    – |    – |
| Total     |  800 |  800 |  800 |  741 |  800 |  800 |  800 |  800 |
| Type freq |   87 |  101 |   93 |   95 |   97 |  150 |  131 |  135 |

11 of 32

Data: to death

  • Main interest: semantic diversity of collocate verbs (rather than to death itself)
  • Using POS tags to identify collocate verbs (within a window of 15 words).
  • 🚨 Manual correction 🚨
    • be + adjective or past participle (e.g. be frozen/sick to death)
    • cases where to death functioned as a prepositional modifier of a noun (e.g. on her way to death), fixed expressions (e.g. from birth to death, be nigh to death)
    • cases where the verb was illegible (e.g. And when my mother euen before my sighte, Was (-) to death; 1550, EEBO).

12 of 32

Data: to death

  • Distributed meaning representations or ‘word embeddings’ computed by word2vec algorithm (Mikolov et al. 2013) for the remaining dataset:
    • gensim library (Řehůřek & Sojka 2011)
    • skip-gram objective, optimized with learning rate of 0.025 over 5 epochs, frequency threshold > 50, window size = 20
  • Sanity check:
    • benchmark datasets of Present-day English word pairs (e.g. SimLex999)
    • examine nearest neighbours of a selection of verbs from our dataset based on cosine distance

13 of 32

Data: to death

Nearest neighbours (cosine similarity) of selected physical-action verbs (burn, stab, whip) and mental verbs (amuse, scare, vex):

| burn | stab | whip | amuse | scare | vex |
|---|---|---|---|---|---|
| beat (0.57) | strangle (0.59) | cudgel (0.69) | delude (0.73) | frighten (0.78) | afflict (0.72) |
| kill (0.57) | knife (0.59) | bludgeon (0.66) | flatter (0.63) | terrify (0.73) | perplex (0.72) |
| consume (0.56) | bleed (0.58) | lash (0.66) | perplex (0.61) | startle (0.67) | harass (0.71) |
| scorch (0.55) | slash (0.58) | kick (0.59) | terrify (0.60) | worry (0.55) | annoy (0.69) |
| shoot (0.55) | bang (0.56) | cuff (0.57) | frighten (0.60) | drive (0.54) | oppress (0.69) |
| spoil (0.53) | kill (0.55) | spur (0.57) | tickle (0.58) | sweep (0.52) | fret (0.67) |
| smother (0.53) | poison (0.55) | flog (0.56) | harass (0.54) | delude (0.51) | grieve (0.64) |
| smoke (0.53) | bite (0.55) | bang (0.55) | tire (0.54) | astonish (0.51) | terrify (0.61) |
| hunt (0.53) | cudgel (0.54) | goad (0.55) | annoy (0.52) | annoy (0.50) | pester (0.60) |
| hang (0.53) | prick (0.54) | scourge (0.54) | vex (0.51) | amuse (0.50) | worry (0.58) |

14 of 32

Method: diachronic cluster analysis

Assumption: clusters in distributional space reproduce semantic fields

Approach: monitor changes in the semantic space using clustering metrics

15 of 32

Method: diachronic cluster analysis

With increasing host-class expansion, we expect

  • Higher number of allowed semantic fields
  • Higher density in the semantic fields (due to productivity of the incoming fields)

16 of 32

Method: diachronic cluster analysis

Operationalization of change in terms of silhouette

Silhouette captures density from two angles:

  • a: average distance from a point to the other points in its own cluster (tightness)
  • b: average distance from the point to the points in the nearest other cluster (separation)
  • per-point silhouette: s = (b - a) / max(a, b)

17 of 32

Method: diachronic cluster analysis

What about number of clusters?

  • Notoriously difficult to set manually

Strategy: hands off!

  • Monitor the optimal number of clusters based on silhouette
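The ‘hands-off’ strategy above can be sketched as a scan over candidate cluster counts, keeping the k that maximizes the mean silhouette. The slides do not name a clustering algorithm; KMeans from scikit-learn is assumed here for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def optimal_k(vectors, k_range=range(2, 11), seed=0):
    """Return the cluster count (and score) that maximizes the silhouette."""
    best_k, best_sil = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        sil = silhouette_score(vectors, labels)  # mean silhouette over all points
        if sil > best_sil:
            best_k, best_sil = k, sil
    return best_k, best_sil
```

Applied per time bin to the collocate-verb embeddings, this yields the optimal number of clusters without any manual tuning.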

18 of 32

Method: diachronic cluster analysis

Mediation of type frequency

Collocate type frequency

  • 1550-1650: 134
  • 1650-1750: 141
  • 1750-1850: 185
  • 1850-1950: 205

Increased type frequency = ?

  • It is a measure of diversity
  • Due to other factors? (e.g. cultural change)

19 of 32

Method: diachronic cluster analysis

Fixing the shape of the semantic space …

  • number of clusters
  • cluster density

… the optimal number of clusters still depends on the sample size

20 of 32

Method: diachronic cluster analysis

[Diagram: Change → Type Frequency → Number of Clusters, with a question mark on the direct link between Change and Number of Clusters]

21 of 32

Method: diachronic cluster analysis

Bootstrap simulation of variation in type frequency

  • Multinomial samples of 500 token verbs from the empirical distribution
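A minimal sketch of this resampling step, assuming the empirical distribution is given as a dict of verb token counts (the helper name and interface are illustrative, not the authors’ code):

```python
import numpy as np


def bootstrap_samples(verb_counts, n_tokens=500, n_samples=1000, seed=0):
    """Draw multinomial token samples from the empirical verb distribution."""
    rng = np.random.default_rng(seed)
    verbs = list(verb_counts)
    probs = np.array([verb_counts[v] for v in verbs], dtype=float)
    probs /= probs.sum()  # empirical relative frequencies
    for _ in range(n_samples):
        counts = rng.multinomial(n_tokens, probs)
        # keep only verbs that actually occur in this sample
        yield {v: c for v, c in zip(verbs, counts) if c > 0}
```

Re-running the cluster analysis on each fixed-size sample decouples the cluster metrics from raw type-frequency differences between bins.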

22 of 32

Method: diachronic cluster analysis

The S-curve

Modelling choice: treat time as a monotonic effect

  • Strictly increasing (or decreasing)
  • Allows for changes in growth rate

Compare

  • Model with time as monotonic predictor
  • Model with time as linear predictor

Taken from “Studying the History of English”

https://www.uni-due.de/SHE/index.html
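The slides do not name the statistical framework used for the monotonic model (monotonic effects are often fit with Bayesian regression, e.g. brms in R). As a rough stand-in, the contrast between a monotonic and a linear fit of an S-shaped trend can be illustrated with scikit-learn, on invented proportions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

# Hypothetical S-shaped trend: proportion of non-literal uses per 50-year bin
periods = np.array([1550, 1600, 1650, 1700, 1750, 1800, 1850, 1900])
prop = np.array([0.02, 0.03, 0.05, 0.15, 0.40, 0.55, 0.60, 0.62])

linear = LinearRegression().fit(periods.reshape(-1, 1), prop)
mono = IsotonicRegression(increasing=True).fit(periods, prop)

# The monotonic fit can follow changes in growth rate; the linear fit cannot.
r2_linear = linear.score(periods.reshape(-1, 1), prop)
r2_mono = mono.score(periods, prop)
```

The monotonic fit is strictly increasing yet free to change its growth rate, which is exactly what the S-curve requires; a linear predictor forces a constant rate.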

23 of 32

Results: diachronic cluster analysis

24 of 32

Results: diachronic cluster analysis

25 of 32

Method: sentiment analysis

Hypothesis

  • With increasing grammaticalization, we expect increasingly positive polarity

Approach

  • Measure the sentiment of the newly occurring verbs in each period
  • Compare monotonic and linear predictors for time
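The slides do not specify which sentiment resource was used; a minimal sketch with an invented mini-lexicon (all polarity values hypothetical) shows the idea of scoring the newly occurring collocates per period:

```python
# Hypothetical mini-lexicon; the actual study would use a full sentiment lexicon.
polarity = {"kill": -0.8, "stab": -0.7, "frighten": -0.5,
            "tickle": 0.4, "amuse": 0.6}


def mean_polarity(verbs):
    """Average polarity of the verbs newly occurring in a period (0.0 if none scored)."""
    scored = [polarity[v] for v in verbs if v in polarity]
    return sum(scored) / len(scored) if scored else 0.0


early_new = ["kill", "stab", "frighten"]   # invented example periods
late_new = ["frighten", "tickle", "amuse"]
```

Under the hypothesis, `mean_polarity` of the incoming verbs should drift upward over the bins, which is then tested with the monotonic vs. linear model comparison.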

26 of 32

Method: sentiment analysis

27 of 32

Results: sentiment analysis

28 of 32

Results: sentiment analysis

29 of 32

Discussion

  • The results are in line with expectations:
    • the optimal number of verb clusters increases substantially over the course of the 18th century, when the meaning of to death expanded to non-literal, intensifying uses;
    • the predicted shift away from negative polarity also appears to be captured by the statistical model, albeit weakly.
    • BUT: the weak trend aligns well with the strong ‘persistence’ of the source semantics (e.g. Margerie 2011; Blanco Suárez 2017).

30 of 32

Discussion: challenges

  • Patchwork corpora can be a serious problem, but, currently, there may be no better alternative
    • Artefactual results are possible when the data is poorly balanced (also see Hengchen et al. 2021).
    • This is also true for large library dumps!
    • Introducing balance may not be possible and may impact sample size.
  • Results could in fact be improved (data filtering) or be more transparent (bootstrapped clustering) with more manual meddling.
    • to what extent is limiting manual involvement to an absolute minimum warranted in specific case-driven studies?
  • Polysemy/ambiguity:
    • ‘I think we should move on to another topic before we beat this one to death.’
    • ‘He promised he would love me to death’

31 of 32

Future work

  • From word type embeddings to word token embeddings for historical language varieties
  • We are currently training a BERT-model on historical English & Dutch
    • ‘MacBERTh’: pre-trained on English data, date range: 1500-1950
      • Manjavacas & Fonteyn (subm.) show the pre-trained model outperforms PDE BERT (Devlin et al. 2019) and ‘TuringBERT’ (Hosseini et al. 2021) on various ‘hands-off’ benchmark tasks
      • In the future: more realistic, semi-automated applications for historical linguistics
    • ‘Nameless Dutch model’: in preparation

32 of 32

Thank you!

Do check out all the exciting work we refer to in our reference list:

http://ceur-ws.org/Vol-2989/long_paper26.pdf