Unsupervised Learning of Morphology
MSc Defense
September 9, 2022
Müge Kural
Structure
Motivation & Background
What is unsupervised learning?
Training machine learning models without providing annotated examples,
then asking for tasks that can be solved within the data:
reduce the dimensionality of the data,
reconstruct the data from low-dimensional representations,
cluster the data, or
corrupt, then recover the data, etc.
The goal is to learn how the model deciphers the data:
which features are important to reconstruct the data,
which examples share the same features,
how these features can be used to generate the data.
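As a toy illustration of the "cluster the data" task (my own sketch, not from the thesis), words can be grouped without any labels using a crude surface cue such as a shared final character n-gram:

```python
# Illustrative only: unsupervised grouping of words by a shared suffix n-gram.
from collections import defaultdict

def cluster_by_suffix(words, n=3):
    """Group words by their last-n-character suffix (a crude unsupervised signal)."""
    clusters = defaultdict(list)
    for w in words:
        clusters[w[-n:]].append(w)
    return dict(clusters)

words = ["walking", "talking", "walked", "talked", "singing"]
print(cluster_by_suffix(words))
# words sharing 'ing' land in one cluster, words sharing 'ked' in another
```

Real unsupervised learners replace this hand-picked cue with features discovered from the data itself.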
What is morphology?
The study of word structure in a language.
Main elements: morphemes
A morpheme is a minimal unit that contributes to a word's meaning.
e.g. three morphemes in untouchable: un-touch-able
Main rules:
e.g. affix ordering: do + -able + -ity = doability, but not do + -ity + -able = *doityable
e.g. vowel harmony: gel+di+ler, oku+du+lar
e.g. vowel deletion: hike + -ing -> hiking
e.g. consonant assimilation: fıstık+çı
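The harmony rule behind gel+di+ler vs. oku+du+lar can be sketched as a small function. This is an illustrative simplification of Turkish vowel harmony (vowel inventories hard-coded, consonant devoicing ignored), not code from the thesis:

```python
# Sketch: choosing the Turkish past-tense suffix vowel by vowel harmony.
BACK = set("aıou")
FRONT = set("eiöü")
ROUNDED = set("ouöü")

def past_suffix(stem):
    """Return the harmonized past-tense suffix for a Turkish verb stem.
    The high suffix vowel agrees in backness and rounding with the
    stem's last vowel."""
    last = next(v for v in reversed(stem) if v in BACK | FRONT)
    if last in BACK:
        vowel = "u" if last in ROUNDED else "ı"
    else:
        vowel = "ü" if last in ROUNDED else "i"
    return "d" + vowel  # ignores devoicing (-ti after voiceless codas)

print(past_suffix("gel"))  # -> di, as in gel+di+ler
print(past_suffix("oku"))  # -> du, as in oku+du+lar
```

An unsupervised learner has to recover such regularities from raw word lists instead of hand-written rules.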
Earlier Studies
Two-level formalism (Koskenniemi, 1983)
Turkish Morphological Analyzer (Oflazer, 1993)
Unsupervised morphology learners:
(1) Children: they first learn
to analyze the structure of the words they hear,
identify their stems and affixes,
map consistent meanings onto them,
and then begin to use those stems and affixes in new combinations.
(2) NLP models?
How do we evaluate an unsupervised morphology learner?
[Goldsmith,2017]:
Academic Competitions
Our evaluations
Models
Ours: Vector-Quantized Variational Autoencoders with multiple codebooks (VQVAE-MC)
Baselines
Vector-Quantized Variational Autoencoders with multiple codebooks
VQVAE with multiple codebooks
The training objective combines three terms: (1) reconstruction loss, (2) codebook embedding loss, (3) KL divergence.
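The quantization step with multiple codebooks can be sketched as follows. This is my own simplification (toy two-entry codebooks, plain nearest-neighbour search, no learned embeddings or straight-through gradients), not the thesis implementation:

```python
# Sketch: the encoder output z is split into K segments and each segment
# is snapped to its nearest entry in the corresponding codebook.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(z, codebooks):
    """z: continuous vector, split evenly across K codebooks.
    Returns (entry indices, quantized vector z_q)."""
    k = len(codebooks)
    d = len(z) // k
    indices, zq = [], []
    for i, book in enumerate(codebooks):
        seg = z[i * d:(i + 1) * d]
        j = min(range(len(book)), key=lambda n: sq_dist(seg, book[n]))
        indices.append(j)
        zq.extend(book[j])
    return indices, zq

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook 1 (e.g. one morphological feature)
    [[0.0, 1.0], [1.0, 0.0]],   # codebook 2 (e.g. another feature)
]
print(quantize([0.9, 1.1, 0.1, 0.8], codebooks))
# -> ([1, 0], [1.0, 1.0, 0.0, 1.0])
```

The discrete indices are what the probing experiments later inspect; the codebook-embedding term of the loss pulls the entries toward the encoder segments they quantize.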
Character-Level Language Models
“How likely is the character sequence ‘abc’ in a language?”
Each character's probability is predicted conditioned on the preceding characters.
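A minimal sketch of this factorization, assuming a toy bigram model estimated from counts (illustrative only; the thesis's CharLM is a neural model, not a count-based one):

```python
# Sketch: p('abc') factored as p(a|<s>) * p(b|a) * p(c|b),
# with conditional probabilities estimated from a tiny corpus.
from collections import Counter

def train_bigram(corpus):
    pairs = Counter()
    ctx = Counter()
    for word in corpus:
        chars = ["<s>"] + list(word)
        for a, b in zip(chars, chars[1:]):
            pairs[(a, b)] += 1
            ctx[a] += 1
    return lambda a, b: pairs[(a, b)] / ctx[a] if ctx[a] else 0.0

corpus = ["ab", "abc", "abd"]
p = train_bigram(corpus)
prob_abc = p("<s>", "a") * p("a", "b") * p("b", "c")
print(prob_abc)  # 1.0 * 1.0 * 0.5 = 0.5
```

A neural character LM replaces the count table with a network that conditions on the whole prefix, but the chain-rule factorization is the same.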
Autoencoders
Variational Autoencoders
Evaluations: Experiments & Results
Probing for Morphological Features
Probing
Evaluation Tasks: Probing
We add a classifier over the model's representations to determine whether they can be classified along the same categories humans assign.
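A minimal probe could look like the sketch below, assuming frozen toy representation vectors and a nearest-centroid classifier standing in for the actual probe architecture (both the values and the classifier choice are mine, for illustration):

```python
# Sketch: a simple classifier trained on frozen representations; if it
# beats the majority-class baseline, the representations encode the
# probed morphological feature.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit_probe(reps, labels):
    """Nearest-centroid 'probe' over frozen representation vectors."""
    by_label = {}
    for r, y in zip(reps, labels):
        by_label.setdefault(y, []).append(r)
    cents = {y: centroid(vs) for y, vs in by_label.items()}
    def predict(r):
        return min(cents,
                   key=lambda y: sum((a - b) ** 2
                                     for a, b in zip(r, cents[y])))
    return predict

# hypothetical 2-d "representations" of past vs. future verb forms
reps = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
labels = ["Past", "Past", "Future", "Future"]
probe = fit_probe(reps, labels)
print(probe([0.05, 0.95]))  # -> Past
```

The real probes below are trained on 32-dimensional model representations, but the logic is the same: only the classifier is trained, never the encoder.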
Probing: AE
We probe the feature vectors, i.e. 32-dimensional reductions of the encoder outputs.
Probing: VAE
We probe the mean vectors, i.e. 32-dimensional reductions of the encoder outputs.
Probing: VQVAE
Tense probes - Dataset
3696 Turkish verbs with five tense classes: Aorist, Progressive1, Past, Future, and Narrative.
e.g. gelmiş -> Narr
Data statistics for pretraining and probing
Tense probes - Results
Probing accuracies of models. The baseline is the percentage of the majority tense (Progressive1) in the dataset. While both AE and VAE representations are classified well, the VQVAE continuous latent vector is classified only slightly better than the baseline, indicating that morphological features are primarily represented in the quantized codebook variables, as intended.
Tense probes - Visualization
Tense probing scores of VAE representations
Person probes - Dataset
3696 Turkish verbs with six person classes: A1sg, A2sg, A3sg, A1pl, A2pl, and A3pl.
e.g. gelmiş -> A3sg
Data statistics for pretraining and probing
Person probes - Results
Probing accuracies of models. The baseline is the percentage of the majority person (A3sg) in the dataset. While both AE and VAE representations are classified well, the VQVAE continuous latent vector is classified only slightly better than the baseline, indicating that morphological features are primarily represented in the quantized codebook variables, as intended.
Person probes - Visualization
Person probing scores of VQVAE. Left: Probes on quantized latent representation. Right: Probes on continuous representations. While quantized latent representations cluster well, continuous representations are more mixed
More probes for VQVAE - Results
zc(x): Continuous latent repr.
zq(x): Quantized latent repr.
Word categorizations based on suffixes
Morphological Segmentation
Identifying a word's morphemes
singing -> sing+ing
geliyordu -> gel+iyor+du
Models
Generative Models: VAE and CharLM
Heuristic
e.g. word: <s>yolda</s>
<s>y</s>
<s>yo</s>
<s>yol</s>
<s>yold</s>
<s>yolda</s>
Generative models assign
low probabilities to invalid/incomplete words and
high probabilities to valid/complete words.
For a word w, define its subwords (prefixes) w0, w1, …, wt.
Detect a morpheme boundary after subword wi IF:
p(wi-1) < p(wi) > p(wi+1) AND length(wi) > 2
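Under these assumptions (prefix scores supplied by some generative model; the score values below are hypothetical), the boundary heuristic can be sketched as:

```python
# Sketch: mark a boundary after prefix w_i when its score is a local peak,
# i.e. p(w_{i-1}) < p(w_i) > p(w_{i+1}), and w_i is longer than 2 characters.

def segment(word, prefix_logp):
    """prefix_logp[i] = model score of the prefix word[:i+1].
    Returns the list of predicted morphemes."""
    boundaries = []
    for i in range(1, len(word) - 1):
        if (prefix_logp[i - 1] < prefix_logp[i] > prefix_logp[i + 1]
                and i + 1 > 2):          # prefix length = i + 1
            boundaries.append(i + 1)
    morphemes, prev = [], 0
    for b in boundaries + [len(word)]:
        morphemes.append(word[prev:b])
        prev = b
    return morphemes

# hypothetical scores peaking at the complete subword 'yol'
word = "yolda"
logp = [-9.0, -7.5, -3.0, -8.0, -4.0]  # y, yo, yol, yold, yolda
print(segment(word, logp))  # -> ['yol', 'da']
```

The length condition is what later blocks the heuristic from finding one-letter morphemes in the error analysis.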
Evaluation Tasks: Morphological Segmentation
To apply our heuristic, we need to obtain subword likelihoods from the models
VAE
CharLM
Dataset
Morpho-Challenge 2010 Turkish dataset
Results
Morphological segmentation results on the test set. *:Our models
Error Analysis
Under-segmentation cases form 39.5% of errors.
Over-segmentation cases form 43% of errors.
Combined under- and over-segmentation cases form 14% of errors.
Error Analysis: Causes
Under-segmentation: consecutive valid words;
the length condition also disables one-letter-morpheme identification
e.g. azaltmasI -> azal-t-ma-sI, azalt-ma-sI
Over-segmentation:
a subword is a valid word, but not at a correct boundary for this word
Phonological alternations
e.g. keşfinin -> keşf-in-in, keşf-i-nin, keşfi-nin
e.g. tarağına -> tarağ-ı-na, tarağ-ın-a, tarağın-a
Errors among different models
Morphological Segmentation
Related Work
MorphAgram
Based on Probabilistic Context-Free Grammars: G = {N, Σ, S, R, θ}
Morphological grammar of MorphAgram: Pr+St+Su (Prefix+Stem+Suffix)
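A sketch of how a segmentation's probability falls out of the rule weights θ, using hypothetical rule probabilities rather than MorphAgram's learned grammar:

```python
# Sketch: a toy PCFG G = (N, Σ, S, R, θ) that splits a word into
# Prefix+Stem+Suffix; a derivation's probability is the product of the
# probabilities θ of the rules it uses.

theta = {  # hypothetical rule probabilities, not learned values
    ("Word", ("Prefix", "Stem", "Suffix")): 0.3,
    ("Word", ("Stem", "Suffix")): 0.7,
    ("Prefix", "un"): 0.5,
    ("Stem", "touch"): 0.2,
    ("Suffix", "able"): 0.4,
}

def derivation_prob(rules):
    p = 1.0
    for r in rules:
        p *= theta[r]
    return p

rules = [("Word", ("Prefix", "Stem", "Suffix")),
         ("Prefix", "un"), ("Stem", "touch"), ("Suffix", "able")]
print(derivation_prob(rules))  # 0.3 * 0.5 * 0.2 * 0.4 = 0.012
```

MorphAgram infers such rule probabilities from unannotated word lists; segmentation then amounts to picking the highest-probability derivation.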
Morphological Reinflection
Morphological reinflection probes the relation between a word's inflected forms:
how well a model generalizes
the inflection rules of words.
gözlükçüsüne + pos=N,tense=PST,per=1,num=SG,evid=NFH -> gözlükçüymüşüz
Fig: The relatedness of inflected forms of the Spanish verbs hablar ‘speak’ and caminar ‘walk’ [1] https://aclanthology.org/W16-2002.pdf
Model
K codebooks: one per morphological feature,
each with N entries: one per class of that feature
e.g. a codebook for the Tense feature with 5 entries (Future, Past, Progressive, Aorist, Narrative)
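The codebook layout and tag-based entry selection can be sketched as follows. The feature inventories come from the slides; the selection logic is my own simplification of the semi-supervised setup:

```python
# Sketch: one codebook per morphological feature, one entry per class.
# When gold tags are available, the entry index is selected directly by
# the tag instead of by nearest-neighbour search over the encoder output.

FEATURES = {
    "tense": ["Future", "Past", "Progressive", "Aorist", "Narrative"],
    "person": ["A1sg", "A2sg", "A3sg", "A1pl", "A2pl", "A3pl"],
}

def select_entries(tags):
    """Map gold tags to codebook entry indices for a supervised example."""
    return {feat: classes.index(tags[feat])
            for feat, classes in FEATURES.items() if feat in tags}

print(select_entries({"tense": "Narrative", "person": "A3sg"}))
# -> {'tense': 4, 'person': 2}
```

Unlabeled examples fall back to the usual unsupervised nearest-entry quantization, which is what makes the setup semi-supervised.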
Semi-supervision for codebook selection
Evaluation Tasks: Morphological Reinflection: Model
Results
Error Analysis
wrong lemma + correct suffix inflection
correct lemma + wrong suffix inflection
The supervised-only model produces erroneous reinflections because it does not apply phonological rules.
The semi-supervised model, with the addition of unsupervised data, handles most of these cases.
Morphological Reinflection
Related Work
MSVED (Zhou, Neubig, 2017)
(Our) supervision: x_target + y_tags -> x_target
Conclusion & Discussion
Contributions