Unsupervised Cross-lingual Representation Learning
July 28, 2019
ACL 2019
Sebastian Ruder
Ivan Vulić
Anders Søgaard
Unsupervised Cross-lingual Representation Learning
Follow along with the tutorial:
Questions:
Why Cross-Lingual NLP?
Speaking more languages means communicating with more people…
...and reaching more users and customers...
Why Cross-Lingual NLP?
...but there are more profound and democratic reasons to work in this area:
“95% of all languages in use today will never gain traction online” (Andras Kornai)
“The limits of my language online mean the limits of my world?”
Why Cross-Lingual NLP?
Inequality of information and representation can also affect how we understand places, events, processes...
We’re in Zagreb searching for...
...jatetxe (EU)
...restaurants (EN)
...éttermek (HU)
Motivation: Cross-Lingual Representations are Everywhere
Searching for “multilingual”, “cross-lingual” and “bilingual” in the ACL anthology (ACL+EMNLP)
Motivation (Very High-Level)
We want to understand and model the meaning of...
...without manual/human input and without perfect MT
Source: dreamstime.com
The World Existed B.E. (Before Embeddings)
The World Existed B.E.
B.E. Example 1: Cross-lingual (parser) transfer
The World Existed B.E.
B.E. Example 2a: Traditional “count-based” cross-lingual vector spaces…
[Gaussier et al., ACL 2004; Laroche and Langlais, COLING 2010]
The World Existed B.E.
B.E. Example 2b: …and bootstrapping from limited bilingual signal (a sort of self-learning)
[Peirsman and Padó, NAACL-10; Vulić and Moens, EMNLP-13]
The World Existed B.E.
B.E. Example 3: Cross-lingual latent topic spaces
[Mimno et al., EMNLP-09; Vulić et al., ACL-11]
So, Why (Unsupervised) Cross-Lingual Embeddings Exactly?
Cross-lingual word embeddings (CLWE-s)
Unsupervised CLWE-s:
Potential for transforming cross-lingual and cross-domain NLP
Unsupervised Cross-Lingual Representations
Conneau et al. (2018)
best supervised
best unsupervised
Unsupervised Cross-Lingual Representations
[Conneau et al.; ICLR 2018]: “Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs”
[Artetxe et al.; ACL 2018]: “Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems”
[Hoshen and Wolf; EMNLP 2018]: “...our method achieves better performance than recent state-of-the-art deep adversarial approaches and is competitive with the supervised baseline”
[Xu et al.; EMNLP 2018]: “Our evaluation (...) shows stronger or competitive performance of the proposed method compared to other state-of-the-art supervised and unsupervised methods...”
[Chen and Cardie; EMNLP 2018]: “In addition, our model even beats supervised approaches trained with cross-lingual resources.”
So, Why (Unsupervised) Cross-Lingual Embeddings Exactly?
How well do unsupervised cross-lingual representations actually perform?
Tutorial Goals
unsupervised and weakly supervised methods
Agenda
What this is not: Comprehensive. While we’ll try to tell a compelling and coherent story from multiple angles, it is impossible to cover all related papers in one tutorial.
(If we haven’t mentioned your paper, don’t be too mad at us…)
Agenda
[6] Unsupervised deep models
Motivation: Crossing the Lexical Chasm
“The(ir) model, however, is only applicable to English, as large enough training sets do not exist for other languages…”
Old paradigm:
New paradigm:
semantic vectors (embeddings)
Multilingual / cross-lingual
representation learning
Motivation: Crossing the Lexical Chasm
Multilingual / Cross-lingual
representation of meaning
[Conneau and Lample, arXiv-19]
Recently: Cross-Lingual Language Modeling Pretraining
Image from: [Lample and Conneau, arXiv-19]
Cross-lingual word embeddings obtained via the lookup table of a cross-lingual LM
[Artetxe et al., ACL-19]: use UNMT and CLWE induction in an alternate-and-iterate fashion
Very recent sub-area of research… (see also [Pires et al., ACL-19])
Cross-Lingual Word Embeddings
Why Cross-Lingual Word Representations?
Capturing meaning across languages: a standard task of bilingual lexicon induction (BLI)
Retrieving nearest neighbours from a shared cross-lingual embedding space (P@1, MRR, MAP)
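To make this concrete, a small numpy sketch of P@1 for BLI with cosine retrieval over the shared space; the dict-of-vectors input format and the variable names are illustrative assumptions, not a fixed API:

```python
import numpy as np

def precision_at_1(src_emb, tgt_emb, gold_pairs):
    """P@1 for BLI: for each source word, retrieve the nearest target word
    by cosine similarity in the shared space and compare it to the gold translation.
    src_emb, tgt_emb: dicts mapping word -> 1-D numpy vector (assumed format)
    gold_pairs: list of (source_word, gold_target_word) tuples
    """
    tgt_words = list(tgt_emb)
    T = np.stack([tgt_emb[w] for w in tgt_words]).astype(float)
    T /= np.linalg.norm(T, axis=1, keepdims=True)          # unit-length target rows
    hits = 0
    for src_word, gold in gold_pairs:
        x = np.asarray(src_emb[src_word], dtype=float)
        x = x / np.linalg.norm(x)
        hits += tgt_words[int(np.argmax(T @ x))] == gold   # nearest neighbour == gold?
    return hits / len(gold_pairs)
```

MRR and MAP only change how the ranked list is scored; the retrieval step stays the same.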
Similarity-Based vs. Classification-Based BLI
Two modes of how to use CLWE-s: similarity-based versus classification-based (feature-based).
Classification-based BLI (combining heterogeneous features)
[Irvine and Callison-Burch, NAACL-13; Heyman et al., EACL-17]
Combining character-level and word-level information with a classifier
More Applications: Cross-Lingual IR and QA
[Vulić and Moens, SIGIR-15; Mitra and Craswell arXiv-17; Litschko et al., SIGIR-18, SIGIR-19]
More Applications: Cross-Lingual NLU for Task-Oriented Dialog
NLU architectures such as NBT [Mrkšić et al., ACL-17; Ramadan et al., ACL-18], GLAD [Zhong et al., ACL-18] or StateNet [Ren et al., EMNLP-18] rely on word embeddings…
More Applications: Cross-Lingual NLU for Task-Oriented Dialog
Why (Unsupervised) Cross-Lingual Word Representations?
[Artetxe et al., ICLR-18, EMNLP-18, ACL-19; Lample et al., ICLR-18, EMNLP-18; Wu et al., NAACL-19;...]
Key component: initialization via unsupervised cross-lingual word embeddings
Image from [Lample et al., ICLR-18]
Unsupervised MT
[Artetxe et al., ICLR-18, EMNLP-18, ACL-19; Lample et al., ICLR-18, EMNLP-18; Wu et al., NAACL-19;...]
Key component: initialization via unsupervised cross-lingual word embeddings
Image from [Artetxe et al., ICLR-18]
Unsupervised MT
Image from [Artetxe et al., ICLR-18]
Training regime in a nutshell:
reconstruct the input in the same language (E+L2)
WMT now has a track on unsupervised MT!
Unsupervised MT: Further Improvements
Image and algorithm from [Lample et al., EMNLP-18], a similar idea in [Artetxe et al., EMNLP-18]
NMT:
Back-translation
PBSMT:
Why (Unsupervised) Cross-Lingual Word Representations?
[Artetxe et al., ACL-19]
Two (and a half) open questions:
Why (Unsupervised) Cross-Lingual Word Representations?
[Ponti et al., EMNLP-18; Glavaš and Vulić, NAACL-18, ACL-18; Jebbara and Cimiano, NAACL-19;...]
Means of transfer: unsupervised cross-lingual word embeddings
Cross-Lingual Word Embeddings
A large number of different methods, but the same end goal:
Induce a shared semantic vector space in which words with similar meaning end up with similar vectors, regardless of their actual language.
We need some bilingual supervision to learn CLWE-s.
Fully unsupervised CLWE-s: they rely only on monolingual data
Cross-Lingual Word Embeddings
Typology of methods for inducing CLWE-s [Ruder et al., JAIR-19; Søgaard et al., M&C Book]
General (Simplified) CLWE Methodology
Image adapted from [Gouws et al., ICML-15]
Joint CLWE Models (selection)
[Ammar et al., arXiv-15; Luong et al., NAACL-15; Guo et al., ACL-15; Shi et al., ACL-15]
[Gouws and Søgaard, NAACL-15; Duong et al., EMNLP-16; Adams et al., EACL-17]
[Vulić and Moens, ACL-15, JAIR-16; Søgaard et al., ACL-15]
A Commercial Break
Book published in June 2019
It covers a lot of material we just skipped…
(An older overview is also in the EMNLP-17 tutorial)
2. Main Framework
Projection-based CLWE Learning
Post-hoc alignment of independently trained monolingual distributional word vectors
Alignment based on word translation pairs (dictionary D)
[Glavaš et al.; ACL-19]
Projection-based CLWE Learning
Most models learn a single projection matrix WL1 (i.e., WL2 = I), but bidirectional learning is also common.
How do we find the “optimal” projection matrix WL1?
...except…
Minimising Euclidean Distance
[Mikolov et al., arXiv-13] minimize the Euclidean distances for translation pairs after projection
The unconstrained least-squares problem has a closed-form (ordinary least squares) solution, but is typically solved with (stochastic) gradient descent
More complex mappings: e.g., non-linear deep feed-forward networks (DFFNs) instead of a linear projection matrix yield worse performance
Better (word translation) results when WL1 is constrained to be orthogonal
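As a reference point, a small numpy sketch of this least-squares mapping, assuming X and Z hold the row-aligned source and target vectors of the seed translation pairs (names are assumptions):

```python
import numpy as np

def fit_linear_map(X, Z):
    """Least-squares mapping W minimising sum_i ||W x_i - z_i||^2.
    X: (n, d) source vectors of the seed pairs
    Z: (n, d) target vectors, row-aligned with X
    Returns W of shape (d, d) such that X @ W.T approximates Z.
    """
    # np.linalg.lstsq solves min_A ||X A - Z||_F; our W is its transpose.
    A, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return A.T
```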
Minimising Euclidean Distance or Cosine?
[Xing et al., NAACL-15]: there is a mismatch between the initial objective function, the distance measure, and the transformation objective
Solution: 1. Normalization of word vectors to unit length + 2. Replacing mean-square error with cosine similarity for learning the mapping
Orthogonal Mapping: Solving the Procrustes Problem
If W is orthogonal, the optimisation problem is the so-called Procrustes problem, with the closed-form solution $W^{*} = UV^{\top}$, where $U \Sigma V^{\top}$ is the SVD of $Z^{\top} X$ (seed pairs stacked row-wise in $X$ and $Z$).
Important! Almost all projection-based CLWE methods, supervised and unsupervised alike, solve the Procrustes problem in the final step or during self-learning...
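A minimal sketch of that closed-form Procrustes step, under the same row-aligned X and Z convention as above:

```python
import numpy as np

def procrustes(X, Z):
    """Orthogonal W minimising ||X W^T - Z||_F:
    W = U V^T, where U S V^T is the SVD of Z^T X."""
    U, _, Vt = np.linalg.svd(Z.T @ X)
    return U @ Vt        # (d, d) orthogonal projection matrix
```

Because W is orthogonal, monolingual distances and similarities are preserved after projection.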
What is Hubness?
Hubness: the tendency of some vectors (i.e., “hubs”) to appear in the ranked lists of nearest neighbours of many other vectors in high-dimensional spaces
[Radovanović et al., JMLR-10; Dinu et al., ICLR-15]
Solutions:
[Dinu et al., ICLR-15; Smith et al., ICLR-17]
[Lazaridou et al., ACL-15]
[Conneau et al., ICLR-18]
Relaxed Cross-Domain Similarity Local Scaling (RCSLS)
If our goal is to optimise for the word translation performance…
Improved word translation retrieval if the retrieval procedure is corrected for hubness [Radovanović et al., JMLR-10; Dinu et al., ICLR-15]
Cross-domain similarity local scaling (CSLS) proposed by [Conneau et al., ICLR-18]
[Joulin et al., EMNLP-18] maximise CSLS (after projection) instead of minimising Euclidean distance; they relax the orthogonality constraint: Relaxed CSLS
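A short sketch of CSLS retrieval as described above; it assumes the projected source matrix WX and the target matrix Y are already unit-normalised, and k=10 is a placeholder neighbourhood size:

```python
import numpy as np

def csls_retrieve(WX, Y, k=10):
    """CSLS: penalise cosine scores by local neighbourhood densities
    to correct retrieval for hubness (Conneau et al., 2018)."""
    sims = WX @ Y.T                                        # (n, m) cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)     # mean sim of each source to its k NN targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)     # mean sim of each target to its k NN sources
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)                             # best target index per source word
```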
How Important is (the Noise in) the Seed Lexicon?
Image adapted from [Lubin et al., NAACL-19]
Orthogonal Procrustes is less affected by noise (i.e., incorrect translation pairs in the training set)
Open question: how to reduce the noise in seed lexicons?
Are larger seed lexicons necessarily better seed lexicons?
How Important is (the Size of) the Seed Lexicon?
Image adapted from [Vulić and Korhonen, ACL-16]
Performance saturates or even drops when adding lower-frequency words into dictionaries.
Better results with fewer translation pairs (but less noisy pairs): Symmetric/mutual nearest neighbours
Identical words can also be useful, but they are heavily dependent on language proximity and writing scripts.
Can we reduce the requirements further?
Reducing the Cross-Lingual Signal Requirements Further
Image from [Artetxe et al., ACL-17]
Other sources of supervision:
[Vulić and Moens, EMNLP-13; Vulić and Korhonen, ACL-16; Zhang et al., NAACL-17; Artetxe et al., ACL-17, Smith et al., ICLR-17]
Reducing the dictionaries to only 25-40 examples.
The crucial idea with limited supervision is bootstrapping or self-learning
The final frontier… Unsupervised methods...
3. (Basic) Self-Learning
Self-Learning in a Nutshell
Image courtesy of Eneko Agirre
In theory, this is the only true/core “unsupervised” part
Bootstrapping with Orthogonal Procrustes
[Artetxe et al., ACL-17, Glavaš et al., ACL-19]
first iteration (only lower-frequency words and noise)
A General Framework (with all the “tricks of the trade”…)
Typical focus: Seed dictionary induction and learning projections.
The only difference between weakly supervised and fully unsupervised methods is in C1 (how the seed lexicon is obtained).
A General Framework (with all the “tricks of the trade”…)
C1. Seed Lexicon Extraction
C2. Self-Learning Procedure
C3. Preprocessing and Postprocessing Steps [Artetxe et al., AAAI-18]
C1. Seed Lexicon Extraction and C2. Self-Learning
The same general framework for all unsupervised CLWE methods (sketched in code below):
1. Extract an initial seed dictionary D(1) (C1)
Repeat:
2. Learn the projection W(k) using D(k)
3. Induce a new dictionary D(k+1) from XW(k) ∪ Y
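A compact numpy sketch of this loop, with Procrustes in step 2 and mutual nearest neighbours in step 3; real systems add frequency cut-offs, CSLS and stochastic dictionary induction on top, so treat it as an illustration only:

```python
import numpy as np

def self_learning(X, Y, seed_pairs, iters=5):
    """X: (n, d) source embeddings, Y: (m, d) target embeddings (unit-normalised).
    seed_pairs: list of (i, j) index pairs forming D(1)."""
    D = list(seed_pairs)
    W = np.eye(X.shape[1])
    for _ in range(iters):
        # 2. Learn the projection W(k) from the current dictionary D(k) (Procrustes)
        src, tgt = X[[i for i, _ in D]], Y[[j for _, j in D]]
        U, _, Vt = np.linalg.svd(tgt.T @ src)
        W = U @ Vt
        # 3. Induce D(k+1) over the joint space via mutual nearest neighbours
        sims = (X @ W.T) @ Y.T
        fwd, bwd = sims.argmax(axis=1), sims.argmax(axis=0)
        D = [(i, int(fwd[i])) for i in range(len(X)) if bwd[fwd[i]] == i]
    return W, D
```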
C1. Seed Lexicon Extraction and C2. Self-Learning
The same general framework for all unsupervised CLWE methods.
Different approaches to C1 (i.e., to obtain D(1)), e.g.,:
All solutions assume approximate isomorphism
of monolingual embedding spaces
(Approximate) Isomorphism
“. . . we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment.”
[Miceli Barone, RepL4NLP-16]
Fully unsupervised CLWE methods rely on
this assumption twice
Is this problematic?
(...to be continued...)
[Mikolov et al., arXiv-13]
C3. Preprocessing and Postprocessing Steps
A general projection-based supervised framework of [Artetxe et al., AAAI-18]
C3. Preprocessing and Postprocessing Steps and Other Choices
A (most robust) unsupervised framework of [Artetxe et al., ACL-18]
Improving C3: Preprocessing Steps and Other “Tricks”
Another look into the future [Zhang et al., ACL-19]
Based on [Artetxe et al., ACL-18]
Iterative Normalization guarantees:
Preprocessing which makes two monolingual embedding spaces more isomorphic
Better results with orthogonal projections, especially for distant language pairs.
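A tiny sketch of iterative normalization as just described, alternating unit-length scaling and mean centering; the fixed iteration count is an assumption:

```python
import numpy as np

def iterative_normalization(X, iters=5):
    """Alternate length normalisation and mean centering so the embeddings
    end up (approximately) unit-length and zero-mean, which makes the two
    monolingual spaces easier to align with an orthogonal map."""
    X = X.astype(float).copy()
    for _ in range(iters):
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit length
        X -= X.mean(axis=0, keepdims=True)              # zero mean
    return X
```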
From Bilingual to Multilingual Word Embeddings
N languages: N-1 seed lexicons needed...
N languages: (N-1)*N/2 seed lexicons needed...
A Quick Recap
Unsupervised CLWE-s are Projection-Based Methods (with some fancy improvements over the basic framework)
C1. Seed Lexicon Extraction
C2. Self-Learning Procedure
C3. Preprocessing and Postprocessing Steps
This comes next...
4. Unsupervised Seed Induction
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
(approaches grouped into Adversarial vs. Non-adversarial)
Additional References (with Similar High-Level Ideas)
GAN
Wasserstein GAN / optimal transport
Heuristic
One promising direction for future research?
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
(approaches grouped into Adversarial vs. Non-adversarial)
Adversarial mapping: Main idea
L1 word embedding
Image credit: Luis Prado, Jojo Ticon, Graphic Tigers
Generator
“true” L2 embedding
“fake” L2 embedding
Discriminator
Why should this work at all?
Main assumption: Embedding spaces in different languages have similar structure so that the same transformation can align source language words with target language words.
More specifically: Embedding spaces should be approximately isomorphic.
Supervised alignment
[Figure: English words (cat, horse, dog, house, table, tree, street, city, river) and their Spanish translations (gato, caballo, perro, casa, mesa, árbol, calle, ciudad, río) shown as points in two separate monolingual embedding spaces]
Inspired by: Slides from Eneko Agirre, Mikel Artetxe
Supervised alignment
[Figure: the same English and Spanish words after supervised alignment, with translation pairs such as cat-gato, horse-caballo and dog-perro mapped close together in the shared space]
Inspired by: Slides from Eneko Agirre, Mikel Artetxe
Unsupervised alignment
Inspired by: Slides from Eneko Agirre, Mikel Artetxe
Early adversarial approaches
Training GANs is often unstable and gets stuck in poor minima.
Early approaches were not as successful…
For more on GANs: NAACL 2019 Deep Adversarial Learning tutorial
Conneau et al. (2018)
Conneau et al. (2018)
Adversarial mapping in detail
For every input sample, generator and discriminator are trained successively with SGD to minimize their losses.
Discriminator loss (maximize the probability of predicting the correct source):
$\mathcal{L}_D = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{src} = 1 \mid W x_i) - \frac{1}{m}\sum_{j=1}^{m} \log P_{\theta_D}(\mathrm{src} = 0 \mid y_j)$
Generator loss (maximize the probability of “fooling” the discriminator):
$\mathcal{L}_W = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{src} = 0 \mid W x_i) - \frac{1}{m}\sum_{j=1}^{m} \log P_{\theta_D}(\mathrm{src} = 1 \mid y_j)$
where $x_i$ is an embedding in L1 and $y_j$ an embedding in L2.
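A condensed PyTorch sketch of one such adversarial step; the discriminator architecture, learning rates and the omission of label smoothing and of the orthogonalization update are simplifications rather than the exact recipe of Conneau et al. (2018):

```python
import torch
import torch.nn as nn

d = 300                                              # embedding dimensionality (placeholder)
W = nn.Linear(d, d, bias=False)                      # generator: the linear map W
D = nn.Sequential(nn.Linear(d, 2048), nn.LeakyReLU(0.2),
                  nn.Linear(2048, 1), nn.Sigmoid())  # discriminator
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

def train_step(x_batch, y_batch):
    """x_batch: L1 embeddings, y_batch: L2 embeddings (both (B, d) float tensors)."""
    # Discriminator: predict the correct source (1 = projected L1, 0 = real L2)
    opt_d.zero_grad()
    fake = W(x_batch).detach()
    d_loss = bce(D(fake), torch.ones(len(fake), 1)) + \
             bce(D(y_batch), torch.zeros(len(y_batch), 1))
    d_loss.backward()
    opt_d.step()
    # Generator: make projected L1 look like real L2, i.e. "fool" the discriminator
    opt_w.zero_grad()
    g_loss = bce(D(W(x_batch)), torch.zeros(len(x_batch), 1))
    g_loss.backward()
    opt_w.step()
    return d_loss.item(), g_loss.item()
```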
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
(approaches grouped into Adversarial vs. Non-adversarial)
Wasserstein GAN / Optimal Transport
Wasserstein GAN
Discriminator: assign higher scores to “true” than to “fake” samples
Generator: “fool” discriminator to predict high scores for “fake” samples
Wasserstein GAN (WGAN; Arjovsky et al., 2017) has been shown to be more stable than the original GAN
Wasserstein GAN
Wasserstein GAN (WGAN; Arjovsky et al., 2017) implicitly minimizes Wasserstein distance
→ What is the Earth Mover’s Distance?
Earth Mover’s Distance (EMD)
Each language is a discrete distribution over its word vectors: $\mu = \sum_{i} p_i\, \delta_{x_i}$, $\nu = \sum_{j} q_j\, \delta_{y_j}$
$p_i$: probability of the $i$-th word, assumed to be uniform; $\delta_{x_i}$: Dirac delta function, a unit point mass at point $x_i$
$\mathrm{EMD}(\mu, \nu) = \min_{T \in \Pi(p, q)} \sum_{i, j} T_{ij}\, c(x_i, y_j)$
cost $c(x_i, y_j)$: distance between $x_i$ and $y_j$
transport polytope: $\Pi(p, q) = \{ T \in \mathbb{R}_{+}^{n \times m} : T \mathbf{1} = p,\ T^{\top} \mathbf{1} = q \}$
transport matrix $T$: indicates the assignment, i.e. how much probability mass is transported from $x_i$ to $y_j$
[Figure: toy EMD example matching English words (dog, cat, horse, street) to Spanish words (perro, gato, caballo, calle), each word carrying probability mass 1/4, with pairwise transport costs d]
Learning cross-lingual embeddings with EMD
Need to learn the alignment (transport matrix T) together with the translation matrix W.
Learning cross-lingual embeddings with EMD
In linear optimization, solutions to the OT problem always lie on a vertex of $\Pi(p, q)$, i.e. $T$ is a sparse matrix with at most $n + m - 1$ non-zero entries (Brualdi, 2006; §8.1.3).
→ Good fit for matching words across languages.
→ Probability mass preservation: We want to translate all words.
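This matching view can be made concrete with a small scipy sketch: for uniform masses over equally many words on both sides, the OT optimum is a one-to-one assignment, so the Hungarian solver recovers it exactly (WX denotes the projected source vectors; names are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_matching(WX, Y):
    """Uniform-mass OT between projected source vectors WX and target vectors Y
    reduces to an assignment problem: each source word is matched to exactly
    one target word, minimising the total Euclidean transport cost."""
    cost = np.linalg.norm(WX[:, None, :] - Y[None, :, :], axis=-1)   # (n, n) distance matrix
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist())), float(cost[rows, cols].mean())
```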
EMD and Wasserstein distance
$W(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\,[\, c(x, y)\,]$
$\Pi(\mu, \nu)$: set of all joint distributions with marginals $\mu$ and $\nu$; infimum: greatest lower bound
Wasserstein GAN
Objective to minimize: Wasserstein distance between the distributions $P_{WX}$ and $P_{Y}$ of transformed source and target word embeddings:
$W(P_{WX}, P_{Y}) = \inf_{\gamma \in \Pi(P_{WX}, P_{Y})} \mathbb{E}_{(u, v) \sim \gamma}\,[\, c(u, v)\,]$
If the cost $c$ is the Euclidean distance, this can be written as (Kantorovich-Rubinstein duality, Villani, 2006):
$W(P_{WX}, P_{Y}) = \sup_{\|f\|_{L} \leq 1}\ \mathbb{E}_{u \sim P_{WX}}[f(u)] - \mathbb{E}_{v \sim P_{Y}}[f(v)]$
supremum: least upper bound; the 1-Lipschitz function $f$ can be replaced with a neural network discriminator with weight clipping
Wasserstein GAN
Discriminator: assign higher scores to “true” than to “fake” samples
Generator: minimize approximate Wasserstein distance
How to track performance
Wasserstein GAN / Optimal Transport
Break
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
(approaches grouped into Adversarial vs. Non-adversarial)
Wasserstein GAN / Optimal Transport
EMD with orthogonality constraint
In order to incorporate the orthogonality constraint, we need to alternate optimizing the transport matrix T and the translation matrix W at every step k:
$T^{(k)} = \arg\min_{T \in \Pi(p, q)} \sum_{i, j} T_{ij}\, c(W^{(k-1)} x_i, y_j)$ (keep $W$ fixed)
$W^{(k)} = \arg\min_{W \in O_d} \sum_{i, j} T^{(k)}_{ij}\, c(W x_i, y_j)$ (keep $T$ fixed)
Alternating optimization of T and W
EMD with orthogonality constraint
In order to incorporate the orthogonality constraint, we need to alternate optimizing the transport matrix T and the translation matrix W at every step k:
can be optimized with an approximate optimal transport solver (Cuturi, 2013)
if c is the squared L2 distance, this is just the Procrustes problem
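A bare-bones sketch of the alternation, using an exact assignment solver for the T-step and Procrustes for the W-step; published systems typically use regularised OT solvers and mini-batches instead, so this only illustrates the scheme:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def alternate_T_W(X, Y, iters=10):
    """Alternately (a) solve OT for the matching T with W fixed and
    (b) solve Procrustes for the orthogonal W with T fixed."""
    W = np.eye(X.shape[1])
    for _ in range(iters):
        # (a) T-step: optimal matching under the current projection
        cost = np.linalg.norm((X @ W.T)[:, None, :] - Y[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)
        # (b) W-step: with the matching fixed, squared-L2 cost reduces to Procrustes
        U, _, Vt = np.linalg.svd(Y[cols].T @ X[rows])
        W = U @ Vt
    return W
```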
Solving the optimal transport (OT) problem
Two reasons for adding an entropy constraint (Cuturi, 2013):
Solving the optimal transport (OT) problem
Entropy constraint: $h(T) = -\sum_{i, j} T_{ij} \log T_{ij}$, added to the objective with a Lagrange multiplier $\lambda$
Solution has the form (Cuturi, 2013): $T^{*} = \mathrm{diag}(u)\, e^{-\lambda C}\, \mathrm{diag}(v)$, where $u$ and $v$ are non-negative vectors.
Solving the optimal transport (OT) problem
Entropy-constrained OT is known as the Sinkhorn distance (“dual-Sinkhorn divergence” specifically)
The Lagrange multiplier $\lambda$ controls the strength of the entropy regularisation
Can be efficiently computed using only matrix-vector multiplications (Sinkhorn iterations) → can back-propagate through it
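A small numpy sketch of these Sinkhorn iterations for uniform marginals; the regularisation strength and iteration count are placeholder values:

```python
import numpy as np

def sinkhorn(C, lam=20.0, iters=100):
    """Entropy-regularised OT (Cuturi, 2013): T = diag(u) K diag(v) with
    K = exp(-lam * C), found by alternately rescaling rows and columns,
    i.e. using nothing but matrix-vector products."""
    n, m = C.shape
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)      # uniform marginals
    K = np.exp(-lam * C)
    u = np.ones(n)
    for _ in range(iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]                   # transport matrix T
```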
Solving the optimal transport (OT) problem
Solution has the form (Cuturi, 2013): $T^{*} = \mathrm{diag}(u)\, e^{-\lambda C}\, \mathrm{diag}(v)$, where $u$ and $v$ are non-negative vectors.
Cosine distance is not a valid metric (does not satisfy triangle inequality).
→ Xu et al. (2018) use square root cosine distance instead.
Solving the optimal transport (OT) problem
Distance should be a valid metric, e.g. Euclidean distance
Gromov Wasserstein distance
Main idea: Compare metric spaces directly (instead of just comparing samples) by comparing distances between pairs of points (distance between distances).
$GW(C, C', p, q) = \min_{T \in \Pi(p, q)} \sum_{i, j, k, l} L(C_{ik}, C'_{jl})\, T_{ij}\, T_{kl}$, where $L(C_{ik}, C'_{jl})$ measures the cost of matching $x_i$ to $y_j$ and $x_k$ to $y_l$
Gromov Wasserstein distance
Main idea: Compare metric spaces directly (instead of just comparing samples) by comparing distances between pairs of points (distance between distances).
Can be optimized efficiently with first-order methods (Peyré et al., 2016)
requires operating over a fourth-order tensor!
Gromov Wasserstein distance
Compute a pseudo-cost matrix $\hat{C}$ from the intra-language similarity matrices $C$ and $C'$ and the current cross-language assignment $T$
Solve the optimal transport problem using $\hat{C}$ as the cost matrix
Repeat this process until convergence.
Finding a good initialisation
The problem is non-convex. Without a good initialisation, it is easy to get stuck in poor local optima!
Finding a good initialisation
In practice, first use WGAN to find a good initialisation and then solve optimal transport problem with orthogonality constraint.
Bidirectionality
Reconstruction
Reconstruction loss: minimise the cosine distance between the original and the twice-projected embeddings, i.e. between $x$ and $W_{YX} W_{XY}\, x$ (and analogously between $y$ and $W_{XY} W_{YX}\, y$).
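A short numpy sketch of such a reconstruction term, assuming two mapping matrices W_xy (L1 to L2) and W_yx (L2 to L1); the names are illustrative:

```python
import numpy as np

def reconstruction_loss(X, W_xy, W_yx):
    """Mean cosine distance between each original vector x and its
    twice-projected version W_yx (W_xy x)."""
    X_cycle = X @ W_xy.T @ W_yx.T
    cos = np.sum(X * X_cycle, axis=1) / (
        np.linalg.norm(X, axis=1) * np.linalg.norm(X_cycle, axis=1))
    return float(np.mean(1.0 - cos))
```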
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
(approaches grouped into Adversarial vs. Non-adversarial)
Heuristic seed induction
We have seen heuristics for specifying a seed dictionary (“same spelling”, “numerals”) but these make strong assumptions about the writing system.
Can we come up with a heuristic that is independent of the writing system?
Heuristic seed induction
Main idea: Translations are similar to other words in the same way across languages.
Specifically: Words with similar meaning have similar monolingual similarity distributions (i.e. distributions of similarity across all words of the same language).
for x in vocab:
    sim(x, “two”)
Similar to going from “distance” to “distance between distances”, we now go from “similarity” to “similarity between similarities”…
intra-language similarity
Gromov-Wasserstein distance (Alvarez-Melis & Jaakkola, 2018) incorporates this, too
Heuristic seed induction
Main idea: Translations are similar to other words in the same way across languages.
Specifically: Words with similar meaning have similar monolingual similarity distributions (i.e. distributions of similarity across all words of the same language).
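A toy numpy sketch of this intuition: represent each word by its sorted monolingual similarity distribution and match words by nearest neighbour over these signatures. The actual initialisation of Artetxe et al. (2018) additionally applies a square-root transform and runs robust self-learning on top, so this is only the core idea:

```python
import numpy as np

def heuristic_seed(X, Y):
    """Match words whose *sorted* monolingual similarity distributions are most alike.
    Assumes both matrices cover equally many (frequency-sorted) words, so the
    signatures have the same dimensionality."""
    def signature(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = np.sort(E @ E.T, axis=1)                  # sorted similarity distribution per word
        return sims / np.linalg.norm(sims, axis=1, keepdims=True)
    sx, sy = signature(X), signature(Y)
    seeds = (sx @ sy.T).argmax(axis=1)                   # nearest neighbour over signatures
    return [(i, int(j)) for i, j in enumerate(seeds)]
```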
Heuristic seed induction
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
(approaches grouped into Adversarial vs. Non-adversarial)
Word embedding spaces as point clouds
Point cloud matching
Approximate distribution alignment via PCA
Iterative closest point (ICP)
ICP is very similar to our self-learning loop:
Hoshen & Wolf (2018) do this in mini-batches, add bidirectionality + reconstruction.
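A stripped-down numpy sketch of the core ICP loop with an orthogonal (Procrustes) fit; Hoshen & Wolf additionally work in a PCA-reduced space, use mini-batches and add the cycle terms, none of which is shown here:

```python
import numpy as np

def icp_align(X, Y, iters=20):
    """Iterative closest point: alternately (1) match every projected source
    vector to its nearest target vector and (2) re-fit the orthogonal map on
    those matches. Assumes unit-normalised X and Y (cosine = dot product)."""
    W = np.eye(X.shape[1])
    for _ in range(iters):
        nn = ((X @ W.T) @ Y.T).argmax(axis=1)            # (1) nearest targets
        U, _, Vt = np.linalg.svd(Y[nn].T @ X)            # (2) Procrustes on the matches
        W = U @ Vt
    return W
```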
Mini-Batch Cycle Iterative Closest Point (MBC-ICP)
Do this for both languages.
Mini-Batch Cycle Iterative Closest Point (MBC-ICP)
Initialization is important!
In practice, MBC-ICP is run three times:
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
Take-aways
5. Systematic Comparisons
State of the art
| Common wisdom | Unsupervised is better than supervised (unless you have several thousands of good seeds)? |
| | VecMap (heuristic initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)? |
| Our message | |
State of the art
| Common wisdom | Unsupervised is better than supervised (unless you have several thousands of good seeds)? |
| | VecMap (heuristic initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)? |
| Our message | Unsupervised is mostly worse than supervised. |
| | Stochastic dictionary induction is a key component in the self-learning step. |
| | GANs are competitive with heuristic (second-order) initialization. |
Reminder
Footnote on Stochasticity and Optimization
Procrustes Analysis overfits the seed
Reminder: Why SGD works so well
Systematic comparisons of unsupervised methods
A General Framework (with all the tricks…)
Typical focus: Seed dictionary induction and learning projections.
Common wisdom: Artetxe et al. (2018) is state of the art, with incremental improvements in 2019.
What we do: Fix learning projections and compare approaches to C1.
Comparing GAN, ICP and Heuristic
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
GAN, ICP, and GWA (with no refinement)
GAN, ICP, and GWA with Procrustes Analysis
GAN and GWA with SDI
State of the art
| Common wisdom | Unsupervised is better than supervised (unless you have several thousands of good seeds)? |
| | VecMap (second order initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)? |
| Our message | Stochastic dictionary induction improves GAN-based seed induction. |
| | How about other flavors of GANs? |
Comparing flavors of GAN
Overview
| Authors | Seed dictionary induction |
| | GAN |
| | Wasserstein GAN / Optimal transport |
| | Heuristic |
| | Point Cloud Matching |
Flavors of GANs
GAN, WGAN, WGAN-GP and CT-GAN (w/o refinement)
State of the art
| Common wisdom | Unsupervised is better than supervised (unless you have several thousands of good seeds)? |
| | VecMap (second order initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)? |
| Our message | Stochastic dictionary induction improves GAN-based seed induction. |
| | WGAN-GP works best as GAN-based seed induction. |
Systematic comparisons of unsupervised vs. supervised
Unsupervised vs. supervised
[Conneau et al.; ICLR 2018]: “Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs”
[Artetxe et al.; ACL 2018]: “Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems”
[Hoshen and Wolf; EMNLP 2018]: “...our method achieves better performance than recent state-of-the-art deep adversarial approaches and is competitive with the supervised baseline”
[Xu et al.; EMNLP 2018]: “Our evaluation (...) shows stronger or competitive performance of the proposed method compared to other state-of-the-art supervised and unsupervised methods...”
[Chen and Cardie; EMNLP 2018]: “In addition, our model even beats supervised approaches trained with cross-lingual resources.”
Unsupervised vs. supervised
{Bulgarian, Catalan, Esperanto, Estonian, Basque, Finnish, Hebrew, Hungarian, Indonesian, Georgian, Korean, Lithuanian, Bokmål, Thai, Turkish}
x
{Bulgarian, Catalan, Esperanto, Estonian, Basque, Finnish, Hebrew, Hungarian, Indonesian, Georgian, Korean, Lithuanian, Bokmål, Thai, Turkish}
Unsupervised vs. supervised: similar languages
While fully unsupervised CLWEs really show impressive performance for similar language pairs, they are still worse than weakly supervised methods…
Unsupervised vs. supervised: all languages
Supervised never fails!
Unsupervised never wins!
Unsupervised vs. supervised: distant languages
State of the art
| Common wisdom | Unsupervised is better than supervised (unless you have several thousands of good seeds)? |
| | VecMap (second order initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)? |
| Our message | Stochastic dictionary induction improves GAN-based seed induction. |
| | WGAN-GP works best as GAN-based seed induction. |
| | Supervision doesn’t seem to hurt. |
6. Open Problems
Open problems
*We argue: maybe instability is not a problem.
(Robustness and) Instability
Robustness and instability
Sensitive to languages [Søgaard et al., ACL-18]
Adversarial training fails for more distant language pairs
Sensitive to domains [Søgaard et al., ACL-18]
Even the most robust unsupervised model breaks down when the domains are dissimilar, even for similar language pairs
Sensitive to both [Søgaard et al., ACL-18]
Domain differences may exacerbate difficulties of generalising across dissimilar languages
Sensitive to algorithm [Hartmann et al., EMNLP-18]
Sensitive to random seed [Søgaard et al.; Artetxe et al.]
Is instability a problem?
Morphology
atuaraangama
whenever I read
atuaruma
when I will read
atuarpoq
read
Morphology
Open questions
Do cross-lingual embeddings, and/or BDI, even make sense in the context of different morphologies?
Isomorphism
Measuring isomorphism
Nearest neighbour (NN) graphs of ten most frequent nouns in English and their German translations
$\Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2$
where $\lambda_{1i}$ and $\lambda_{2i}$ are the Laplacian eigenvalues of the two nearest neighbour graphs
$k$: smallest value so that the sum of the largest $k$ eigenvalues is > 90% of the sum of all eigenvalues
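A compact numpy sketch of this eigenvector similarity measure, taking (symmetrised) adjacency matrices of the two NN graphs as input; the exact thresholding details may differ slightly from the paper:

```python
import numpy as np

def eigenvector_similarity(A1, A2):
    """Compare the Laplacian spectra of two nearest-neighbour graphs:
    the larger the value, the less isomorphic the graphs."""
    def spectrum(A):
        L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian
        return np.sort(np.linalg.eigvalsh(L))[::-1]      # eigenvalues, descending
    def cutoff(ev):
        ratios = np.cumsum(ev) / ev.sum()
        return int(np.searchsorted(ratios, 0.9)) + 1     # smallest k covering >90% of the mass
    e1, e2 = spectrum(A1), spectrum(A2)
    k = min(cutoff(e1), cutoff(e2))
    return float(np.sum((e1[:k] - e2[:k]) ** 2))
```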
Other measures of isomorphism: [Patra et al., ACL-19]
Measuring isomorphism
EN-ES
EN-{ET, FI, EL}
Søgaard et al. (2018)
Isomorphism
Open questions
Do cross-lingual embeddings, and/or BDI, even make sense in the context of different morphologies?
Is non-isomorphism a problem we should deal with during alignment? Or a problem we should deal with when inducing monolingual embeddings?
Evaluation
Open questions
Do cross-lingual embeddings, and/or BDI, even make sense in the context of different morphologies?
Is non-isomorphism a problem we should deal with in alignment? Or a problem we should deal with when inducing monolingual embeddings?
If the purpose of cross-lingual word embeddings is unsupervised MT, bilingual dictionary induction, and cross-lingual transfer, how should we best evaluate them?
Standard approach
We argue:
Bad choice!
Bilingual dictionaries
Resource | # languages |
MUSE | 45 |
Wiktionary | 400+ |
Panlex | 9000+ |
ASJP | 9000+ |
The problem with MUSE
Alternatives to Bilingual Lexicon Induction
Downstream tasks | |
Document classification | Klementiev et al. (2012), etc. |
Syntactic analysis | Gouws and Søgaard (2015), etc. |
Semantic analysis | Glavaš et al. (2019), etc. |
Machine translation | Conneau et al. (2017), etc. |
Word alignment | Levy et al. (2017), etc. |
BLI correlation with downstream tasks
→ For non-orthogonal projections, downstream evaluation is particularly important
[Glavaš et al., ACL-19]
Models | XNLI | CLDC | CLIR |
All models | 0.269 | 0.390 | 0.764 |
All w/o RCSLS | 0.951 | 0.266 | 0.910 |
Correlations of model-level results between BLI and each of the downstream tasks.
The problem (or beauty) of downstream evaluation
Metrics for BDI
Other methodological weaknesses
Conclusions
7. Conclusions and Future Directions
Conclusions: Uncertainty prevails
Open questions
Do cross-lingual embeddings, and/or BDI, even make sense in the context of different morphologies?
Is non-isomorphism a problem we should deal with in alignment? Or a problem we should deal with when inducing monolingual embeddings?
If the purpose of cross-lingual word embeddings is unsupervised MT, bilingual dictionary induction, and cross-lingual transfer, how should we best evaluate them?
How strong is the correlation between extrinsic and intrinsic evaluation tasks?
Denoising seed lexica? (Beyond dropout and symmetry?)
Cross-lingual contextualized embeddings [Schuster et al., NAACL-19; Aldarmaki and Diab, NAACL-19]
Open questions
Do cross-lingual embeddings, and/or BDI, even make sense in the context of different morphologies?
Is non-isomorphism a problem we should deal with in alignment? Or a problem we should deal with when inducing monolingual embeddings?
If the purpose of cross-lingual word embeddings is unsupervised MT, bilingual dictionary induction, and cross-lingual transfer, how should we best evaluate them?
How strong is the correlation between extrinsic and intrinsic evaluation tasks?
Denoising seed lexica? (Beyond dropout and symmetry?)
Cross-lingual contextualized embeddings, e.g., [Aldarmaki and Diab, NAACL-19]
Are there scenarios in cross-lingual NLP where we can really benefit from fully unsupervised cross-lingual word embeddings?
Useful Links and Resources
Useful Links
https://github.com/Babylonpartners/fastText_multilingual
https://fasttext.cc/docs/en/aligned-vectors.html
https://github.com/facebookresearch/XLM
https://github.com/google-research/bert/blob/master/multilingual.md
https://github.com/facebookresearch/UnsupervisedMT
https://github.com/artetxem/monoses
https://github.com/artetxem/undreamt
Useful Links
https://github.com/artetxem/vecmap
https://github.com/sebastianruder/latent-variable-vecmap
https://github.com/facebookresearch/MUSE
https://github.com/dmelis/otalign
https://github.com/facebookresearch/NAM
Useful Links
https://github.com/xrc10/unsup-cross-lingual-embedding-transfer
https://github.com/rlitschk/UnsupCLIR
https://github.com/geert-heyman/BLI_classifier
https://github.com/nmrksic/neural-belief-tracker
https://github.com/salesforce/glad
https://github.com/wenhuchen/Cross-Lingual-NBT
...And Some Resources (!)
https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
At least 50k-100k parallel sentences for 54,376 language pairs
Questions?
Slides: https://tinyurl.com/xlingual
If you found these slides helpful, consider citing the tutorial as:
@inproceedings{ruder2019unsupervised,
  title={Unsupervised Cross-Lingual Representation Learning},
  author={Ruder, Sebastian and S{\o}gaard, Anders and Vuli{\'c}, Ivan},
  booktitle={Proceedings of ACL 2019, Tutorial Abstracts},
  pages={31--38},
  year={2019}
}
Some Important Aspects Not Covered Here
[Upadhyay et al., ACL-16; Heyman et al., NAACL-19; Glavaš et al., ACL-19]
[Guo et al., ACL-15; Mrkšić et al., TACL-17; Vulić et al., EMNLP-17, Upadhyay et al., NAACL-18]
[Søgaard et al., ACL-18; Hartmann et al., arXiv-18; Alvarez Melis and Jaakkola, EMNLP-18, Patra et al., ACL-19; Fujinuma et al., ACL-19]
[Doval et al., EMNLP-18, Zhang et al., ACL-19, Patra et al., ACL-19]
[Søgaard et al., ACL-18; Hartmann et al., EMNLP-18; Braune et al., NAACL-18]
[Grave et al., ICLR-18; Wu et al., EMNLP-18; Heyman et al., NAACL-19, Alaux et al., ICLR-19]