1 of 60

Unsupervised Learning of Morphology

MSc Defense

September 9, 2022

Müge Kural


2 of 60

Structure

  • Motivation and Background:
    • What is unsupervised learning?
    • What is morphology?
    • How do we evaluate an unsupervised morphology learner?
  • Models
    • Model architectures
  • Evaluations
    • Experiments & Results
    • Related Work
  • Conclusion
    • Discussion & Future work


3 of 60

Motivation & Background

4 of 60

What is unsupervised learning?

Training machine learning models without providing annotated examples,

then asking for tasks that can be solved within the data:

  • reduce the dimensionality of the data,
  • reconstruct the data from low-dimensional representations,
  • cluster the data, or
  • corrupt and then recover the data, etc.

To learn how the model deciphers the data:

  • which features are important to reconstruct the data,
  • which examples share the same features,
  • how these features can be used to generate that data.

5 of 60

What is morphology?

The study of word structure in language.

Main elements: morphemes

A morpheme is a minimal unit that contributes to the word's meaning.

e.g. three morphemes in untouchable: un-touch-able

Main rules:

  1. Morphotactics: how morphemes may combine, and in what order

     do + -able + -ity = doability, but not
     do + -ity + -able = *doityable

  2. Morphophonemics: sound changes in morphemes when they combine to form words:

     e.g. vowel harmony: gel+di+ler, oku+du+lar
     e.g. vowel deletion: hike + -ing -> hiking
     e.g. consonant assimilation: fıstık+çı

6 of 60

Earlier Studies

Two-level formalism (Koskenniemi, 1983)

Turkish Morphological Analyzer (Oflazer, 1993)

We need:

  • a lexicon containing the stems and affixes
  • morphotactics: the model of morpheme ordering
  • a set of rules (orthographic etc.): the model of changes that occur in a word when two morphemes combine (e.g. git+di -> gitti)

7 of 60


Turkish Morphological Analyzer (Oflazer, 1993)

8 of 60

Unsupervised morphology learners:

(1) Children: they first learn to analyze the structure of the words they hear, identify their stems and affixes, map consistent meanings onto them, and then begin to use those stems and affixes in new combinations.

(2) NLP models?

9 of 60

How do we evaluate an unsupervised morphology learner?

(Goldsmith, 2017):

  • What are the component morphemes of each word?
  • Are there alternative forms (allomorphs, e.g. -ler/-lar in Turkish) of any morphemes, and if so, under what conditions is each used?
  • Are there inflectional paradigms in the language?

10 of 60

Academic Competitions

  • MorphoChallenges [2005-2010]
    • (Surface-level) Morpheme Segmentation

  • SIGMORPHON Challenges [2016-]
    • Morphological (Re)-inflection
    • Unsupervised Paradigm Clustering
    • Unsupervised Paradigm Completion
    • (Canonical-level) Morpheme Segmentation


11 of 60

Our evaluations

  • MorphoChallenges [2005-2010]
    • (Surface-level) Morpheme Segmentation

  • SIGMORPHON Challenges [2016-]
    • Morphological (Re)-inflection
    • Unsupervised Paradigm Clustering
    • Unsupervised Paradigm Completion
    • (Canonical-level) Morpheme Segmentation

  • Probing for morphological features


12 of 60


Models

13 of 60

Models

Ours: Vector-Quantized Variational Autoencoder with multiple codebooks (VQVAE-MC)

Baselines:

  1. Character-Level Language Models
  2. Autoencoders
  3. Variational Autoencoders

14 of 60

Vector-Quantized Variational Autoencoders with multiple codebooks


15 of 60

VQVAE with multiple codebooks

Figure: the VQVAE-MC architecture and its three loss terms: (1) reconstruction, (2) codebook embeddings, (3) KL divergence.
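To make the quantization step concrete, below is a minimal sketch of vector quantization with multiple codebooks, assuming the encoder output is split evenly across the codebooks; the dimensions, codebook sizes, and straight-through details are illustrative assumptions, not the exact thesis configuration.

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Illustrative multi-codebook quantizer: the continuous encoder
    output is split into K segments, and each segment is snapped to its
    nearest entry in a separate codebook, so each codebook can
    specialize in one morphological feature (e.g. tense, person)."""

    def __init__(self, num_codebooks=8, entries=6, dim=4):
        super().__init__()
        self.dim = dim
        self.codebooks = nn.ParameterList(
            nn.Parameter(torch.randn(entries, dim)) for _ in range(num_codebooks)
        )

    def forward(self, z_c):
        # z_c: (batch, num_codebooks * dim) continuous representation
        quantized, indices = [], []
        for segment, book in zip(z_c.split(self.dim, dim=-1), self.codebooks):
            dists = torch.cdist(segment, book)      # distances to all entries
            idx = dists.argmin(dim=-1)              # nearest codebook entry
            q = book[idx]
            # straight-through estimator: copy gradients to the encoder
            quantized.append(segment + (q - segment).detach())
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```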

16 of 60

Character-Level Language Models

"How likely is the character sequence 'abc' in a language?"

The probability of each character is predicted conditioned on the preceding characters: p(abc) = p(a) · p(b|a) · p(c|ab).
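As a concrete illustration, a character-level LSTM language model can score a word as the sum of per-character log-probabilities; this is a minimal sketch with assumed hyperparameters, not the exact thesis model.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Minimal character-level LSTM language model (illustrative)."""

    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, char_ids):
        # char_ids: (1, T) ids for <s> c_1 ... c_n </s>
        states, _ = self.lstm(self.embed(char_ids[:, :-1]))
        log_probs = self.out(states).log_softmax(dim=-1)
        targets = char_ids[:, 1:].unsqueeze(-1)
        # sum_t log p(c_t | c_1 ... c_{t-1})
        return log_probs.gather(-1, targets).sum()
```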

17 of 60

Autoencoders

  • The encoder (a parametric function qϕ) encodes the observed data into a low-dimensional vector z.

  • The decoder (a parametric function pθ) reconstructs the data by conditioning on this vector.
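A minimal sketch of such a character-level autoencoder; the 32-dimensional bottleneck matches the feature vectors probed later, while the remaining dimensions are assumptions.

```python
import torch
import torch.nn as nn

class WordAutoencoder(nn.Module):
    """Illustrative character-level autoencoder."""

    def __init__(self, vocab_size, emb_dim=64, hidden_dim=256, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # q_phi
        self.to_z = nn.Linear(hidden_dim, z_dim)                       # bottleneck z
        self.decoder = nn.LSTM(emb_dim + z_dim, hidden_dim, batch_first=True)  # p_theta
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, chars):
        # encode the whole word into a single low-dimensional vector z
        _, (h, _) = self.encoder(self.embed(chars))
        z = self.to_z(h[-1])
        # decode by conditioning every step on z
        z_steps = z.unsqueeze(1).expand(-1, chars.size(1) - 1, -1)
        states, _ = self.decoder(
            torch.cat([self.embed(chars[:, :-1]), z_steps], dim=-1))
        return self.out(states)  # logits for the reconstruction loss
```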

18 of 60

Variational Autoencoders

  • The encoder (a parametric function qϕ) encodes the observed data into low-dimensional hidden vectors µ and σ.

  • z = µ(x) + σ(x) * ε, where ε ~ N(0, I) (the reparameterization trick).

  • The decoder (a parametric function pθ) reconstructs the data by conditioning on this vector.
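The latent layer is the only part that differs from the plain autoencoder; here is a sketch of the reparameterized sampling and the accompanying KL term (dimensions assumed, as above).

```python
import torch
import torch.nn as nn

class VAELatent(nn.Module):
    """Illustrative VAE latent layer: maps the encoder state to mu and
    sigma, samples z with the reparameterization trick, and returns the
    KL divergence from the N(0, I) prior."""

    def __init__(self, hidden_dim=256, z_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, z_dim)
        self.to_logvar = nn.Linear(hidden_dim, z_dim)

    def forward(self, enc_state):
        mu = self.to_mu(enc_state)
        logvar = self.to_logvar(enc_state)
        eps = torch.randn_like(mu)
        z = mu + (0.5 * logvar).exp() * eps   # z = mu(x) + sigma(x) * eps
        # KL(q(z|x) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
        return z, kl
```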

19 of 60


Evaluations: Experiments & Results

20 of 60

Probing for Morphological Features

21 of 60

Probing

Evaluation Tasks: Probing

Adding a classifier over the model's representations to determine whether these representations can be classified the same way humans classify the words.
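A sketch of such a (linear) probe: the pretrained model stays frozen and only a linear classifier is trained on its representations. The 32 dimensions and 5 tense classes follow the setup described later; the training-loop names are assumptions.

```python
import torch
import torch.nn as nn

# Linear probe: 32-dim frozen representations -> 5 tense classes
probe = nn.Linear(32, 5)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(features, labels):
    """features: (batch, 32) representations from the frozen model;
    labels: (batch,) gold morphological classes."""
    logits = probe(features.detach())   # .detach(): no gradient reaches the model
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```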

22 of 60

Probing: AE

Evaluation Tasks: Probing

We probe the feature vectors, which are 32-dimensional reductions of the encoder outputs.

23 of 60

Probing: VAE

Evaluation Tasks: Probing

We probe the mean vectors, which are 32-dimensional reductions of the encoder outputs.

24 of 60

Probing: VQVAE

Evaluation Tasks: Probing

We probe:

  • the continuous mean vectors, which are 32-dimensional reductions of the encoder outputs,

  • and the concatenation of codebook outputs (288-dimensional).

25 of 60

Tense probes - Dataset


Evaluation Tasks: Probing

3696 Turkish verbs with five tense classes: Aorist, Progressive1, Past, Future, and Narrative.

e.g. gelmiş -> Narr

Table: data statistics for pretraining and probing.

26 of 60

Tense probes - Results


Evaluation Tasks: Probing

Probing accuracies of the models. The baseline is the percentage of the majority tense (Progressive1) in the dataset. While both AE and VAE representations are classified well, the VQVAE continuous latent vector is classified only slightly better than the baseline, indicating that the morphological features are primarily represented in the quantized codebook variables, as we intended.

27 of 60

Tense probes - Visualization


Evaluation Tasks: Probing

Tense probing scores of VAE representations

28 of 60

Person probes - Dataset


Evaluation Tasks: Probing

3696 Turkish verbs with six person classes: A1sg, A2sg, A3sg, A1pl, A2pl, and A3pl.

e.g. gelmiş -> A3sg

Table: data statistics for pretraining and probing.

29 of 60

Person probes - Results


Evaluation Tasks: Probing

Probing accuracies of the models. The baseline is the percentage of the majority person (A3sg) in the dataset. While both AE and VAE representations are classified well, the VQVAE continuous latent vector is classified only slightly better than the baseline, indicating that the morphological features are primarily represented in the quantized codebook variables, as we intended.

30 of 60

Person probes - Visualization


Evaluation Tasks: Probing

Person probing scores of VQVAE. Left: probes on the quantized latent representations. Right: probes on the continuous representations. While the quantized latent representations cluster well, the continuous representations are more mixed.

31 of 60

More probes for VQVAE - Results


Evaluation Tasks: Probing

zc(x): continuous latent representation

zq(x): quantized latent representation

32 of 60

More probes for VQVAE - Results


Evaluation Tasks: Probing

Word categorizations based on suffixes

33 of 60


Morphological Segmentation

34 of 60

Morphological Segmentation

Identifying a word's morphemes

singing -> sing+ing

geliyordu -> gel+iyor+du


35 of 60

Models

Generative Models: VAE and CharLM


36 of 60

Heuristic


e.g. word: <s>yolda</s>

<s>y</s>

<s>yo</s>

<s>yol</s>

<s>yold</s>

<s>yolda</s>

Generative models assign low probabilities to invalid/incomplete words and high probabilities to valid/complete words.

37 of 60

Heuristic

For a word w, define its subwords (prefixes) w_0, w_1, ..., w_t.

Detect a morpheme boundary after w_i IF:

p(w_{i-1}) < p(w_i) > p(w_{i+1}) AND length(w_i) > 2

(a code sketch follows the example below)


e.g. word: <s>yolda</s>

<s>y</s>

<s>yo</s>

<s>yol</s>

<s>yold</s>

<s>yolda</s>
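A sketch of this heuristic as code, assuming a `log_prob` function that returns a generative model's score for a (possibly partial) word; the model itself is not shown here.

```python
def segment(word, log_prob):
    """Place a boundary after prefix w_i when its model score is a local
    maximum among the prefix scores and the prefix is longer than 2."""
    prefixes = [word[:i] for i in range(1, len(word) + 1)]
    scores = [log_prob(p) for p in prefixes]
    boundaries = [
        len(prefixes[i])
        for i in range(1, len(scores) - 1)
        if scores[i - 1] < scores[i] > scores[i + 1] and len(prefixes[i]) > 2
    ]
    # split the word at the detected boundaries
    cuts = [0] + boundaries + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]

# e.g. segment("yolda", model_log_prob) might yield ["yol", "da"]
```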

38 of 60


Evaluation Tasks: Morphological Segmentation

To apply our heuristic, we need to obtain subword likelihoods from the models (VAE and CharLM).

39 of 60

Dataset


Morpho-Challenge 2010 Turkish dataset

40 of 60

Results


Morphological segmentation results on the test set. *: our models.

41 of 60

Error Analysis

Under-segmentation cases: 39.5% of errors.

Over-segmentation cases: 43% of errors.

Both under- and over-segmentation: 14% of errors.

42 of 60

Error Analysis: Causes

Under-segmentation: consecutive valid words; this also prevents identifying one-letter morphemes.

e.g. azaltması -> azal-t-ma-sı, azalt-ma-sı

Under-segmentation cases form 39.5% of errors.

43 of 60

Error Analysis: Causes

Over-segmentation: the model finds a valid word, but not at the correct boundary for the word.

Over-segmentation cases form 43% of errors.

44 of 60

Error Analysis: Causes


Phonological alterations

e.g. keşfinin -> keşf-in-in, keşf-i-nin, keşfi-nin

e.g. tarağına -> tarağ-ı-na, tarağ-ın-a, tarağın-a

45 of 60


Errors among different models

46 of 60


Morphological Segmentation

Related Work

47 of 60

MorphAgram

Based on Probabilistic Context-Free Grammars: G = {N, Σ, S, R, θ}

48 of 60

Figure: the morphological grammar of MorphAgram (Pr+St+Su: Prefix + Stem + Suffix).

49 of 60


Morphological Reinflection

50 of 60

Morphological Reinflection

The relation between a word's inflections: how much a model can generalize the inflection rules of words.

gözlükçüsüne + pos=N,tense=PST,per=1,num=SG,evid=NFH -> gözlükçüymüşüz

Fig: the relatedness of inflected forms of the Spanish verbs hablar ‘speak’ and caminar ‘walk’ [1]. [1] https://aclanthology.org/W16-2002.pdf

51 of 60

Model

K codebooks: one per morphological feature,

each with N entries: one per class of that feature.

e.g. a codebook for the Tense feature with 5 entries (Future, Past, Progressive, Aorist, Narrative); see the sketch below.
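As an illustration, a hypothetical configuration in code; the tense and person class inventories are the ones used in our probing datasets, and the remaining features are elided.

```python
# One codebook per morphological feature; one entry per class of that feature.
CODEBOOKS = {
    "tense":  ["Aorist", "Progressive1", "Past", "Future", "Narrative"],
    "person": ["A1sg", "A2sg", "A3sg", "A1pl", "A2pl", "A3pl"],
    # ... one codebook per remaining morphological feature
}

K = len(CODEBOOKS)                                   # number of codebooks
N = {feat: len(classes) for feat, classes in CODEBOOKS.items()}  # entries per codebook
```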

52 of 60


Semi-supervision for codebook selection

Evaluation Tasks: Morphological Reinflection: Model

53 of 60

Results


Evaluation Tasks: Morphological Reinflection: Model

54 of 60

Error Analysis


Evaluation Tasks: Morphological Reinflection: Model

Wrong lemma + correct suffix inflection

55 of 60

Error Analysis


Evaluation Tasks: Morphological Reinflection

Correct lemma + wrong suffix inflection

56 of 60

Error Analysis


Evaluation Tasks: Morphological Reinflection: Model

The supervised-only model produces erroneous reinflections because it does not apply phonological rules.

The semi-supervised model, with the addition of unsupervised data, handles most of these cases.

57 of 60


Morphological Reinflection

Related Work

58 of 60

MSVED (Zhou, Neubig, 2017)


  • Gumbel-softmax classifiers
  • Attention mechanism over discrete features
  • Supervision: x_source + y_tags -> x_target

(Our) supervision: x_target + y_tags -> x_target

59 of 60


Conclusion & Discussion

60 of 60

Contributions:

  • Extracting morphological features of unsupervised models with linear probes

  • Morpheme identification/segmentation with generative unsupervised models via a likelihood-based heuristic

  • A new unsupervised model with continuous & discrete latent variables that separates a word into lemma + suffixes

  • The first attempt to analyze unsupervised models for Turkish