1 of 32

A ‘hands-off’ approach to modelling constructional change with word embeddings

Lauren Fonteyn & Enrique Manjavacas

‘Modelling constructional variation and change’, 15-16 November 2021, Zurich

2 of 32

Introduction

  • Computational (i.e. distributional) approaches to semantic change have become very popular:
    • see, among many others: Sagi et al. (2011); Hamilton, Leskovec & Jurafsky (2016a, 2016b); Mitra, Mitra, Riedl, Biemann, Mukherjee & Goyal (2014); Bamler & Mandt (2017); Dubossarsky, Weinshall & Grossman (2017); Rosenfeld & Erk (2018); Rudolph & Blei (2018); Kutuzov, Øvrelid, Szymanski & Velldal (2018); Hu, Li & Liang (2019); Dubossarsky, Hengchen, Tahmasebi & Schlechtweg (2019); Tahmasebi, Borin & Jatowt (2019); Del Tredici, Fernández & Boleda (2019); Schlechtweg, McGillivray, Hengchen, Dubossarsky & Tahmasebi (2020); Giulianelli, Del Tredici & Fernández (2020); Bizzoni, Degaetano-Ortlieb, Fankhauser & Teich (2020); …

3 of 32

Introduction

  • Wide-scope studies:
    • not interested in a specific word or construction, but wish to quantify some aspect or ‘law’ of semantic change at large;
    • tend to approach semantic change in bulk (sample sizes: hundreds to thousands of linguistic items)

  • Narrow-scope, case-driven studies:
    • aim to tackle questions about the development of a specific construction (often during a specific time window);
    • relatively inflexible in terms of data;
    • specific phenomenon under scrutiny constitutes a complex challenge.

4 of 32

Aims

  • Delineating a step-wise procedure that:
    • minimizes manual work (‘hands off’)
      • Why?
        • enable automated annotation/analysis for case-driven research to
          • maximally exploit available data
          • maximize the data-driven character (see Perek 2016, 2018)

5 of 32

Outline

  1. Case study: the grammaticalization of to death
  2. Data & Distributed meaning representations (Word2Vec)
  3. Method:
    1. diachronic cluster analysis
    2. sentiment analysis
  4. Results
  5. Discussion
  6. Future work: MacBERTh

6 of 32

Case study: grammaticalization of to death ☠️

Grammaticalization of the phrasal expression to death (Hoeksema & Napoli 2008; Claridge 2011; Margerie 2011; Blanco Suárez 2017):

😵 she was by the godesse wounded to death. (EEBO, 1641)

🙁 That look of yours frightens me to death. (CLMET3.1, 1750)

🙂 I have a new toy and I’m tickled to death. (Gutenberg, 1913)

  • decrease in compositionality: literal death > endpoint, extreme
  • negative meaning of source persists
  • host-class expansion (increase in productivity/schematicity): diversification of collocate verbs (lexical & semantic)

7 of 32

Case study: why?

  • Realistic case study for which clear expectations have been formulated
  • Interesting challenge:
    • Low-frequency phenomenon: prior work indicated that it is difficult to support the development as an instance of grammaticalization (increase in lexical and semantic diversity of collocates) with quantitative evidence.
    • Questions regarding changes in lexical and semantic diversity of collocates lend themselves well to computational analysis, as demonstrated by Perek (2016, 2018).
    • Resolving potentially subjective decisions in annotation (e.g., Is put to death “irrelevant” (Margerie 2011)? Is yawn neutral (Blanco Suárez 2017) or negative?).

8 of 32

Data: corpora

  • Focus: development of to death from Early Modern English into the 20th century (1550 to 1949).
  • Corpora with such a long span tend to be rare and too small to be of use for ‘data-hungry’ models.
  • Solution: ‘patchwork corpus’
    • Early English Books Online (EEBO) + the Corpus of Late Modern English Texts (version 3.1; CLMET3.1) + the Evans Early American Imprints Collection (EVANS) + Eighteenth Century Collections Online (ECCO) + the Corpus of Historical American English (COHA) + Hansard corpus (Hansard).

9 of 32

Data: corpora

  • Pre-processing:
    • Language identification module in order to sort out foreign text:
      • Google's Compact Language Identifier (v3)
      • FastText Language Identification system (Grave 2017)
    • Tokenization and sentence-tokenization of the remaining text:
      • Punkt tokenizers, NLTK package (Bird, Klein & Loper 2009)
    • part-of-speech tagging:
      • Conditional Random Field (CRF) tagger implemented by the library PIE (Manjavacas, Kádár & Kestemont 2019)
      • trained on the PCEEME (corpus of Early Modern English letters; Nevalainen et al. 2006)
      • tagger obtained 96% accuracy on a held-out dataset

10 of 32

Data: to death

  • Attestations of to death retrieved from patchwork corpus and divided into 8 bins (50-year periods)
  • Additional filtering steps:
    • no author dominated more than 25% of the instances in a particular bin;
    • tokens capped at max. 800 per bin.

 

Token counts per corpus and 50-year bin:

| Corpus    | 1550 | 1600 | 1650 | 1700 | 1750 | 1800 | 1850 | 1900 |
|-----------|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
| CLMET3.1  |    – |    – |    – |   39 |   45 |  182 |  100 |   26 |
| COHA      |    – |    – |    – |    – |    – |  488 |  700 |  774 |
| ECCO      |    – |    – |    – |   78 |  395 |   12 |    – |    – |
| EEBO      |  800 |  800 |  794 |  413 |    – |    2 |    – |    – |
| EVANS     |    – |    – |    6 |  211 |  360 |  116 |    – |    – |
| Total     |  800 |  800 |  800 |  741 |  800 |  800 |  800 |  800 |
| Type freq |   87 |  101 |   93 |   95 |   97 |  150 |  131 |  135 |

11 of 32

Data: to death

  • Main interest: semantic diversity of collocate verbs (rather than to death itself)
  • Using POS tags to identify collocate verbs (within a window of 15 words).
  • 🚨 Manual correction 🚨
    • be + adjective or past participle (e.g. be frozen/sick to death)
    • cases where to death functioned as a prepositional modifier of a noun (e.g. on her way to death), fixed expressions (e.g. from birth to death, be nigh to death)
    • cases where the verb was illegible (e.g. And when my mother euen before my sighte, Was (-) to death; 1550, EEBO).

12 of 32

Data: to death

  • Distributed meaning representations or ‘word embeddings’ computed by word2vec algorithm (Mikolov et al. 2013) for the remaining dataset:
    • gensim library (Řehůřek & Sojka 2011)
    • skip-gram objective, optimized with learning rate of 0.025 over 5 epochs, frequency threshold > 50, window size = 20
  • Sanity check:
    • benchmark datasets of Present-day English word pairs (e.g. SimLex999)
    • examine nearest neighbours of a selection of verbs from our dataset based on cosine distance

13 of 32

Data: to death

Nearest neighbours (cosine similarity) of selected physical-action verbs (burn, stab, whip) and mental verbs (amuse, scare, vex):

| burn | stab | whip | amuse | scare | vex |
|---|---|---|---|---|---|
| beat (0.57) | strangle (0.59) | cudgel (0.69) | delude (0.73) | frighten (0.78) | afflict (0.72) |
| kill (0.57) | knife (0.59) | bludgeon (0.66) | flatter (0.63) | terrify (0.73) | perplex (0.72) |
| consume (0.56) | bleed (0.58) | lash (0.66) | perplex (0.61) | startle (0.67) | harass (0.71) |
| scorch (0.55) | slash (0.58) | kick (0.59) | terrify (0.60) | worry (0.55) | annoy (0.69) |
| shoot (0.55) | bang (0.56) | cuff (0.57) | frighten (0.60) | drive (0.54) | oppress (0.69) |
| spoil (0.53) | kill (0.55) | spur (0.57) | tickle (0.58) | sweep (0.52) | fret (0.67) |
| smother (0.53) | poison (0.55) | flog (0.56) | harass (0.54) | delude (0.51) | grieve (0.64) |
| smoke (0.53) | bite (0.55) | bang (0.55) | tire (0.54) | astonish (0.51) | terrify (0.61) |
| hunt (0.53) | cudgel (0.54) | goad (0.55) | annoy (0.52) | annoy (0.50) | pester (0.60) |
| hang (0.53) | prick (0.54) | scourge (0.54) | vex (0.51) | amuse (0.50) | worry (0.58) |

14 of 32

Method: diachronic cluster analysis

Assumption: clusters in distributional space reproduce semantic fields

Approach: monitor changes in the semantic space using clustering metrics

15 of 32

Method: diachronic cluster analysis

With increasing host-class expansion, we expect

  • Higher number of allowed semantic fields
  • Higher density in the semantic fields (due to productivity of the incoming fields)

16 of 32

Method: diachronic cluster analysis

Operationalization of change in terms of silhouette

Silhouette captures density from two angles:

  • a: average distance from a point to the other points in its own cluster (tightness)
  • b: average distance from the point to the points in the nearest other cluster (separation)
  • per-point silhouette: s = (b - a) / max(a, b)

17 of 32

Method: diachronic cluster analysis

What about number of clusters?

  • Notoriously difficult to set manually

Strategy: hands off!

  • Monitor the optimal number of clusters based on silhouette
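The ‘hands-off’ strategy above can be sketched as a scan over candidate cluster counts, keeping the k that maximizes the mean silhouette. The slides do not name a clustering algorithm; KMeans from scikit-learn is assumed here for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def optimal_k(vectors, k_range=range(2, 11), seed=0):
    """Return the cluster count (and score) that maximizes the silhouette."""
    best_k, best_sil = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        sil = silhouette_score(vectors, labels)  # mean silhouette over all points
        if sil > best_sil:
            best_k, best_sil = k, sil
    return best_k, best_sil
```

Applied per time bin to the collocate-verb embeddings, this yields the optimal number of clusters without any manual tuning.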

18 of 32

Method: diachronic cluster analysis

Mediation of type frequency

Collocate type frequency

  • 1550-1650: 134
  • 1650-1750: 141
  • 1750-1850: 185
  • 1850-1950: 205

Increased type frequency = ?

  • It is a measure of diversity
  • Due to other factors? (e.g. cultural change)

19 of 32

Method: diachronic cluster analysis

Fixing the shape of the semantic space …

  • number of clusters
  • cluster density

… the optimal number of clusters still depends on the sample size

20 of 32

Method: diachronic cluster analysis

[Diagram: Change → Type Frequency → Number of Clusters, with a question mark on the direct link between Change and Number of Clusters]

21 of 32

Method: diachronic cluster analysis

Bootstrap simulation of variation in type frequency

  • Multinomial samples of 500 token verbs from the empirical distribution
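A minimal sketch of this resampling step, assuming the empirical distribution is given as a dict of verb token counts (the helper name and interface are illustrative, not the authors’ code):

```python
import numpy as np


def bootstrap_samples(verb_counts, n_tokens=500, n_samples=1000, seed=0):
    """Draw multinomial token samples from the empirical verb distribution."""
    rng = np.random.default_rng(seed)
    verbs = list(verb_counts)
    probs = np.array([verb_counts[v] for v in verbs], dtype=float)
    probs /= probs.sum()  # empirical relative frequencies
    for _ in range(n_samples):
        counts = rng.multinomial(n_tokens, probs)
        # keep only verbs that actually occur in this sample
        yield {v: c for v, c in zip(verbs, counts) if c > 0}
```

Re-running the cluster analysis on each fixed-size sample decouples the cluster metrics from raw type-frequency differences between bins.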

22 of 32

Method: diachronic cluster analysis

The S-curve

Modelling choice: treat time as a monotonic effect

  • Strictly increasing (or decreasing)
  • Allows for changes in growth rate

Compare

  • Model with time as monotonic predictor
  • Model with time as linear predictor

Taken from “Studying the History of English”

https://www.uni-due.de/SHE/index.html
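The slides do not name the statistical framework used for the monotonic model (monotonic effects are often fit with Bayesian regression, e.g. brms in R). As a rough stand-in, the contrast between a monotonic and a linear fit of an S-shaped trend can be illustrated with scikit-learn, on invented proportions:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

# Hypothetical S-shaped trend: proportion of non-literal uses per 50-year bin
periods = np.array([1550, 1600, 1650, 1700, 1750, 1800, 1850, 1900])
prop = np.array([0.02, 0.03, 0.05, 0.15, 0.40, 0.55, 0.60, 0.62])

linear = LinearRegression().fit(periods.reshape(-1, 1), prop)
mono = IsotonicRegression(increasing=True).fit(periods, prop)

# The monotonic fit can follow changes in growth rate; the linear fit cannot.
r2_linear = linear.score(periods.reshape(-1, 1), prop)
r2_mono = mono.score(periods, prop)
```

The monotonic fit is strictly increasing yet free to change its growth rate, which is exactly what the S-curve requires; a linear predictor forces a constant rate.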

23 of 32

Results: diachronic cluster analysis

24 of 32

Results: diachronic cluster analysis

25 of 32

Method: sentiment analysis

Hypothesis

  • With increasing grammaticalization, we expect increasingly positive polarity

Approach

  • Measure the sentiment of the newly occurring verbs in each period
  • Compare monotonic and linear predictors for time
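The slides do not specify which sentiment resource was used; a minimal sketch with an invented mini-lexicon (all polarity values hypothetical) shows the idea of scoring the newly occurring collocates per period:

```python
# Hypothetical mini-lexicon; the actual study would use a full sentiment lexicon.
polarity = {"kill": -0.8, "stab": -0.7, "frighten": -0.5,
            "tickle": 0.4, "amuse": 0.6}


def mean_polarity(verbs):
    """Average polarity of the verbs newly occurring in a period (0.0 if none scored)."""
    scored = [polarity[v] for v in verbs if v in polarity]
    return sum(scored) / len(scored) if scored else 0.0


early_new = ["kill", "stab", "frighten"]   # invented example periods
late_new = ["frighten", "tickle", "amuse"]
```

Under the hypothesis, `mean_polarity` of the incoming verbs should drift upward over the bins, which is then tested with the monotonic vs. linear model comparison.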

26 of 32

Method: sentiment analysis

27 of 32

Results: sentiment analysis

28 of 32

Results: sentiment analysis

29 of 32

Discussion

  • The results are in line with expectations:
    • the optimal number of verb clusters increases substantially over the course of the 18th century, when the meaning of to death expanded to non-literal, intensifying uses;
    • the predicted shift away from negative polarity also appears to be captured by the statistical model, albeit weakly.
    • BUT: the weak trend aligns well with the strong ‘persistence’ of the source semantics (e.g. Margerie 2011; Blanco Suárez 2017).

30 of 32

Discussion: challenges

  • Patchwork corpora can be a serious problem, but, currently, there may be no better alternative
    • Artefactual results are possible when the data is poorly balanced (also see Hengchen et al. 2021).
    • This is also true for large library dumps!
    • Introducing balance may not be possible and may impact sample size.
  • Results could in fact be improved (data filtering) or be more transparent (bootstrapped clustering) with more manual meddling.
    • to what extent is limiting manual involvement to an absolute minimum warranted in specific case-driven studies?
  • Polysemy/ambiguity:
    • ‘I think we should move on to another topic before we beat this one to death.’
    • ‘He promised he would love me to death’

31 of 32

Future work

  • From word type embeddings to word token embeddings for historical language varieties
  • We are currently training a BERT-model on historical English & Dutch
    • ‘MacBERTh’: pre-trained on English data, date range: 1500-1950
      • Manjavacas & Fonteyn (subm.) show the pre-trained model outperforms PDE BERT (Devlin et al. 2019) and ‘TuringBERT’ (Hosseini et al. 2021) on various ‘hands-off’ benchmark tasks
      • In the future: more realistic, semi-automated applications for historical linguistics
    • ‘Nameless Dutch model’: in preparation

32 of 32

Thank you!

Do check out all the exciting work we refer to in our reference list:

http://ceur-ws.org/Vol-2989/long_paper26.pdf