1 of 235

Unsupervised Cross-lingual Representation Learning

July 28, 2019

ACL 2019

Sebastian Ruder

Ivan Vulić

Anders Søgaard

2 of 235

Unsupervised Cross-lingual Representation Learning

Follow along with the tutorial:

  • Slides: https://tinyurl.com/xlingual
  • Useful repos and resources: listed at the end of the tutorial

Questions:

  • Twitter: #ACLUnsupXL during the tutorial
  • Ask us during the break, after the tutorial, or any time during the conference

3 of 235

Why Cross-Lingual NLP?

Speaking more languages means communicating with more people…

...and reaching more users and customers...

4 of 235

Why Cross-Lingual NLP?

...but there are more profound and democratic reasons to work in this area:

  • decreasing the digital divide
  • dealing with inequality of information
  • mitigating cross-cultural biases
  • deploying language technology for underrepresented languages, dialects, minorities; societal impact
  • understanding cross-linguistic differences

“95% of all languages in use today will never gain traction online” (András Kornai)

“The limits of my language online mean the limits of my world?”

5 of 235

Why Cross-Lingual NLP?

Inequality of information and representation can also affect how we understand places, events, processes...

We’re in Zagreb searching for...

...jatetxe (EU)

...restaurants (EN)

...éttermek (HU)

6 of 235

Motivation: Cross-Lingual Representations are Everywhere

  • Cross-lingual and multilingual semantic similarity
  • Bilingual lexicon induction, multi-modal representations

  • Cross-lingual SRL, POS tagging, NER
  • Cross-lingual dependency parsing, sentiment analysis
  • Cross-lingual natural language understanding for dialogue
  • Cross-lingual lexical entailment
  • Cross-lingual annotation and model transfer
  • Cross-lingual you-name-it-task

  • Statistical and neural MT
  • Cross-lingual IR and QA

7 of 235

Motivation: Cross-Lingual Representations are Everywhere

Searching for “multilingual”, “cross-lingual” and “bilingual” in the ACL anthology (ACL+EMNLP)

  • 10+ papers on unsupervised cross-lingual word embeddings at EMNLP 2018

  • The trend continues:
  • 20+ papers on cross-lingual learning and applications at NAACL 2019.

8 of 235

Motivation (Very High-Level)

We want to understand and model the meaning of...

...without manual/human input and without perfect MT

Source: dreamstime.com

9 of 235

The World Existed B.E. (Before Embeddings)

10 of 235

The World Existed B.E.

B.E. Example 1: Cross-lingual (parser) transfer

11 of 235

The World Existed B.E.

B.E. Example 2a: Traditional “count-based” cross-lingual vector spaces…

[Gaussier et al., ACL 2004; Laroche and Langlais, COLING 2010]

12 of 235

The World Existed B.E.

B.E. Example 2a: Traditional “count-based” cross-lingual vector spaces…

[Gaussier et al., ACL 2004; Laroche and Langlais, COLING 2010]

13 of 235

The World Existed B.E.

B.E. Example 2b: …and bootstrapping from limited bilingual signal (a sort of self-learning)

[Peirsman and Padó, NAACL-10; Vulić and Moens, EMNLP-13]

14 of 235

The World Existed B.E.

B.E. Example 3: Cross-lingual latent topic spaces

[Mimno et al., EMNLP-09; Vulić et al., ACL-11]

15 of 235

So, Why (Unsupervised) Cross-Lingual Embeddings Exactly?

Cross-lingual word embeddings (CLWE-s)

  • Simple: quick and efficient to train
  • (Still) state-of-the-art in cross-lingual NLP; omnipresent
  • Lightweight and inexpensive
  • Multilingual modeling of meaning and support for cross-lingual NLP

Unsupervised CLWE-s:

  • Wide portability without bilingual resources?
  • Deploying language technology for virtually any language?
  • Increasing the ability of cross-lingual transfer?
  • An interesting scientific problem still in its infancy:

Potential for transforming cross-lingual and cross-domain NLP

  • What is the current state-of-the-art in unsupervised cross-lingual representation learning?

16 of 235

Unsupervised Cross-Lingual Representations

  • Unsupervised approaches claim performance similar or superior to the best supervised approaches

Conneau et al. (2018)

best supervised

best unsupervised

17 of 235

Unsupervised Cross-Lingual Representations

[Conneau et al.; ICLR 2018]: “Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs”

[Artetxe et al.; ACL 2018]: “Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems”

[Hoshen and Wolf; EMNLP 2018]: “...our method achieves better performance than recent state-of-the-art deep adversarial approaches and is competitive with the supervised baseline”

[Xu et al.; EMNLP 2018]: “Our evaluation (...) shows stronger or competitive performance of the proposed method compared to other state-of-the-art supervised and unsupervised methods...”

[Chen and Cardie; EMNLP 2018]: “In addition, our model even beats supervised approaches trained with cross-lingual resources.”

18 of 235

So, Why (Unsupervised) Cross-Lingual Embeddings Exactly?

  • Supervision is readily available
    • Small word translation dictionaries
    • Linguistic resources, e.g. ASJP database (Wichmann et al., 2018) with 40-item word lists in all the world’s languages
    • Weak supervision (shared vocabulary)

  • Unsupervised representations are practically justified only if they outperform their supervised counterparts

How well do unsupervised cross-lingual representations actually perform?

19 of 235

Tutorial Goals

  • Provide a systematic overview and typology of unsupervised cross-lingual representation learning models

  • Discuss the similarities between current unsupervised approaches

  • Analyze the modeling and empirical similarities and differences between fully unsupervised and weakly supervised methods

  • Critically examine current limitations, with the focus on (in)stability, robustness, and applicability to distant language pairs and low-data regimes

  • Stress the importance of (unsupervised) cross-lingual representations in cross-lingual downstream tasks and applications

  • Detect a large number of challenges and open questions for future research

20 of 235

Agenda

What this is not: Comprehensive. While we’ll try to tell a compelling and coherent story from multiple angles, it is impossible to cover all related papers in one tutorial.

(If we haven’t mentioned your paper, don’t be too mad at us…)

21 of 235

Agenda

[6] Unsupervised deep models

22 of 235

Motivation: Crossing the Lexical Chasm

“The(ir) model, however, is only applicable to English, as large enough training sets do not exist for other languages…”

Old paradigm:

  • Language-specific NLP models
  • Language-specific feature computation and preprocessing

New paradigm:

  • Representation learning: inputs are semantic vectors (embeddings)

Multilingual / cross-lingual representation learning

23 of 235

Motivation: Crossing the Lexical Chasm

Multilingual / Cross-lingual representation of meaning

  • Word-level
  • Cross-lingual word embeddings
  • Words with similar meanings across languages have similar vectors

  • Sentence-/paragraph-level
  • Most recent developments
  • Multilingual unsupervised pretraining

[Conneau and Lample, arXiv-19]

  • (Unsupervised) NMT?

24 of 235

Recently: Cross-Lingual Language Modeling Pretraining

Image from: [Lample and Conneau, arXiv-19]

Cross-lingual word embeddings obtained via the lookup table of a cross-lingual LM

[Artetxe et al., ACL-19]: use UNMT and CLWE induction in an alternate-and-iterate fashion

Very recent sub-area of research…

(see also [Pires et al., ACL-19])

25 of 235

Cross-Lingual Word Embeddings

26 of 235

Why Cross-Lingual Word Representations?

Capturing meaning across languages: a standard task of bilingual lexicon induction (BLI)

Retrieving nearest neighbours from a shared cross-lingual embedding space (P@1, MRR, MAP)

27 of 235

Similarity-Based vs. Classification-Based BLI

Two modes of how to use CLWE-s: similarity-based versus feature-based.

Classification-based BLI (combining heterogeneous features)

[Irvine and Callison-Burch, NAACL-13; Heyman et al., EACL-17]

Combining character-level and word-level information with a classifier

28 of 235

More Applications: Cross-Lingual IR and QA

[Vulić and Moens, SIGIR-15; Mitra and Craswell arXiv-17; Litschko et al., SIGIR-18, SIGIR-19]

29 of 235

More Applications: Cross-Lingual NLU for Task-Oriented Dialog

NLU architectures such as NBT [Mrkšić et al., ACL-17; Ramadan et al., ACL-18], GLAD [Zhong et al., ACL-18] or StateNet [Ren et al., EMNLP-18] rely on word embeddings…

  • CLWE-s: use training data from a resource-rich language?

30 of 235

More Applications: Cross-Lingual NLU for Task-Oriented Dialog

  • Some results from [Mrkšić et al., TACL-17]

  • CLWE-s are used to leverage additional dialogue training data in a resource-rich source language

31 of 235

Why (Unsupervised) Cross-Lingual Word Representations?

  • Recently: unsupervised neural and statistical machine translation

[Artetxe et al., ICLR-18, EMNLP-18, ACL-19; Lample et al., ICLR-18, EMNLP-18; Wu et al., NAACL-19;...]

Key component: initialization via unsupervised cross-lingual word embeddings

  • L1-L1 and L2-L2 reconstruction with denoising autoencoders: C(·) denotes the added noise
  • L1-L2 and L2-L1 reconstruction via a latent space (initialised using CLWE-s)
  • Adversarial component: classify between the encoding of source sentences and the encoding of target sentences, i.e., predict the language of the encoded sentence.

Image from [Lample et al., ICLR-18]

32 of 235

Unsupervised MT

  • Recently: unsupervised neural and statistical machine translation

[Artetxe et al., ICLR-18, EMNLP-18, ACL-19; Lample et al., ICLR-18, EMNLP-18; Wu et al., NAACL-19;...]

Key component: initialization via unsupervised cross-lingual word embeddings

Image from [Artetxe et al., ICLR-18]

33 of 235

Unsupervised MT

Image from [Artetxe et al., ICLR-18]

  • Both translation directions handled together
  • Shared encoder
  • Two decoders, one for each language
  • Embeddings are fixed

Training regime in a nutshell:

  • Denoising autoencoder 1: noisy input in L1, try to reconstruct the input in the same language (E+L1)
  • Denoising autoencoder 2: noisy input in L2, try to reconstruct the input in the same language (E+L2)

  • Back-translation: input in L1, translate E+D2, translate E+D1, output in L1
  • Back-translation: input in L2, translate E+D1, translate E+D2, output in L2

WMT now has a track on unsupervised MT!

34 of 235

Unsupervised MT: Further Improvements

Image and algorithm from [Lample et al., EMNLP-18], a similar idea in [Artetxe et al., EMNLP-18]

NMT:

  • Initialisation: CLWE-s directly; LM: denoising autoencoding;

Back-translation

PBSMT:

  • Phrase-tables generated from CLWE-based bilingual lexicons; n-gram based LM-s; Back-translation

35 of 235

Why (Unsupervised) Cross-Lingual Word Representations?

  • Very recently (even going to the future…): “An effective approach to unsupervised MT”

[Artetxe et al., ACL-19]

  • A set of improvements to their unsupervised MT framework, e.g., using subword-level information, adapting MERT-style training to unsupervised settings: cyclic consistency loss based on the BLEU score plus a language modeling loss (based on n-gram LMs), NMT hybridization, etc.

  • The method is competitive with supervised systems from 2014; we’re moving forward so quickly...

Two (and a half) open questions:

  • What about unsupervised MT for truly distant language pairs? (It seems to work for English-Urdu…)
  • How do different (unsupervised) CLWE-s affect unsupervised MT?
  • How important is the actual initialisation?

36 of 235

Why (Unsupervised) Cross-Lingual Word Representations?

  • Recently: unsupervised cross-lingual transfer of lexical resources

[Ponti et al., EMNLP-18; Glavaš and Vulić, NAACL-18, ACL-18; Jebbara and Cimiano, NAACL-19;...]

Means of transfer: unsupervised cross-lingual word embeddings

37 of 235

Cross-Lingual Word Embeddings

38 of 235

Cross-Lingual Word Embeddings

A large number of different methods, but the same end goal:

Induce a shared semantic vector space in which words with similar meaning end up with similar vectors, regardless of their actual language.

We need some bilingual supervision to learn CLWE-s.

Fully unsupervised CLWE-s: they rely only on monolingual data

39 of 235

Cross-Lingual Word Embeddings

A large number of different methods, but the same end goal:

Induce a shared semantic vector space in which words with similar meaning end up with similar vectors, regardless of their actual language.

Typology of methods for inducing CLWE-s [Ruder et al., JAIR-19; Søgaard et al., M&C Book]

  • Type of bilingual signal
  • Document-level, sentence-level, word-level, no signal (i.e., unsupervised)

  • Comparability
  • Parallel texts, comparable texts, non-comparable

  • Point/Time of alignment
  • Joint embedding models vs. Post-hoc alignment vs. post-specialisation/retrofitting

  • Modality
  • Text only vs. using images for alignment, e.g., [Kiela et al., EMNLP-15; Vulić et al., ACL-16; Gella et al., EMNLP-17]

40 of 235

General (Simplified) CLWE Methodology

  • Previously: (bilingual) data sources seem more important than the chosen algorithm [Levy et al, EACL-17]
  • Most CLWE algorithms are formulated as:

Image adapted from [Gouws et al., ICML-15]
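A common schematic form of this general objective (cf. the typology in [Ruder et al., JAIR-19]; the exact decomposition varies by method, so this is an illustrative summary rather than any single model's loss) is:

J = L_1 + L_2 + Ω

where L_1 and L_2 are the monolingual losses of the two languages and Ω is a cross-lingual regularization term that ties the two spaces together via the available bilingual signal.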

41 of 235

Joint CLWE Models (selection)

  • Using word-level cross-lingual signal: word alignments, bilingual dictionaries...
  • Bilingual extensions of the monolingual skip-gram and CBOW models

[Ammar et al., arXiv-15; Luong et al., NAACL-15; Guo et al., ACL-15; Shi et al., ACL-15]

  • Creating pseudo-bilingual corpora + training monolingual WE methods on such corpora

[Gouws and Søgaard, NAACL-15; Duong et al., EMNLP-16; Adams et al., EACL-17]

  • Using sentence-level cross-lingual signal
  • Compositional sentence model [Hermann and Blunsom, ACL-14]

  • Sentence-level bilingual skip-gram [Coulmance et al., EMNLP-15; Gouws et al., ICML-15]

  • Bilingual sentence autoencoders [Chandar et al., NeurIPS-14]

  • Using document-level cross-lingual signal

[Vulić and Moens, ACL-15, JAIR-16; Søgaard et al., ACL-15]

42 of 235

A Commercial Break

Book published in June 2019

It covers a lot of material we just skipped…

(An older overview is also in the EMNLP-17 tutorial)

43 of 235

2. Main Framework

44 of 235

Projection-based CLWE Learning

Post-hoc alignment of independently trained monolingual distributional word vectors

Alignment based on word translation pairs (dictionary D)

[Glavaš et al.; ACL-19]

45 of 235

Projection-based CLWE Learning

Most models learn a single projection matrix WL1 (i.e., WL2 = I), but bidirectional learning is also common.

How do we find the “optimal” projection matrix WL1?

  • Mean square error: [Mikolov et al., arXiv-13] and most follow-up work

...except…

  • Canonical methods [Faruqui et al., EACL-14; Lu et al., NAACL-15; Rotman et al., ACL-18]
  • Max-margin framework: [Lazaridou et al., ACL-15; Mrkšić et al., TACL-17]
  • Relaxed Cross-Domain Similarity Local Scaling: [Joulin et al., EMNLP-18]

46 of 235

Minimising Euclidean Distance

[Mikolov et al., arXiv-13] minimize the Euclidean distances for translation pairs after projection (a code sketch follows below)

The optimisation problem has no closed-form solution

  • Iterative SGD-based optimisation was used initially

More complex mappings, e.g., non-linear DFFNs instead of a linear projection matrix, yield worse performance

Better (word translation) results when WL1 is constrained to be orthogonal

  • This preserves monolingual vector space topology
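A minimal sketch of this mapping objective on toy random data (not the authors' code; the learning rate, number of steps, and the data itself are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 2000
X = rng.normal(size=(n_pairs, d))   # L1 vectors of the dictionary source words (toy data)
Y = rng.normal(size=(n_pairs, d))   # L2 vectors of their translations (toy data)

W = np.zeros((d, d))                # projection matrix W_L1
lr = 0.05
for step in range(500):
    residual = X @ W - Y                      # projection error for all translation pairs
    grad = 2.0 / n_pairs * X.T @ residual     # gradient of the mean squared Euclidean distance
    W -= lr * grad                            # iterative gradient-based update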

47 of 235

Minimising Euclidean Distance or Cosine?

[Xing et al., NAACL-15]: there is a mismatch between the initial objective function, the distance measure, and the transformation objective

Solution: 1. Normalization of word vectors to unit length + 2. Replacing mean-square error with cosine similarity for learning the mapping

48 of 235

Orthogonal Mapping: Solving the Procrustes Problem

If W is orthogonal, the optimisation problem is the so-called Procrustes problem: W* = argmin_{W: WᵀW = I} ||XW - Y||_F

  • It has a closed-form solution [Schönemann, 1966; Artetxe et al., EMNLP-16]: W* = UVᵀ, where UΣVᵀ is the SVD of XᵀY (see the sketch below)

Important! Almost all projection-based CLWE methods, supervised and unsupervised alike, solve the Procrustes problem in the final step or during self-learning...
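A minimal numpy sketch of this closed-form solution (toy random data; the function name is ours):

import numpy as np

def solve_procrustes(X, Y):
    # W* = argmin_{W orthogonal} ||XW - Y||_F  =>  W* = U Vᵀ, with U S Vᵀ = SVD(Xᵀ Y)
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))    # source-language vectors of the dictionary pairs
Y = rng.normal(size=(1000, 50))    # target-language vectors of their translations
W = solve_procrustes(X, Y)
assert np.allclose(W @ W.T, np.eye(50), atol=1e-8)   # the solution is orthogonal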

49 of 235

What is Hubness?

Hubness: the tendency of some vectors (i.e., “hubs”) to appear in the ranked lists of nearest neighbours of many other vectors in high-dimensional spaces

[Radovanović et al., JMLR-10; Dinu et al., ICLR-15]

Solutions:

  • Globally-corrected retrieval: instead of returning the nearest neighbour of the query, it returns the target element which has the query ranked highest

[Dinu et al., ICLR-15; Smith et al., ICLR-17]

  • A margin-based ranking loss instead of mean-squared error

[Lazaridou et al., ACL-15]

  • Scaled similarity measures instead of cosine: discounting similarity in dense areas: CSLS

[Conneau et al., ICLR-18]

50 of 235

Relaxed Cross-Domain Similarity Local Scaling (RCSLS)

If our goal is to optimise for the word translation performance…

Improved word translation retrieval if the retrieval procedure is corrected for hubness [Radovanović et al., JMLR-10; Dinu et al., ICLR-15]

Cross-domain similarity local scaling (CSLS) proposed by [Conneau et al., ICLR-18]

[Joulin et al., EMNLP-18] maximise CSLS (after projection) instead of minimising Euclidean distance; they relax the orthogonality constraint: Relaxed CSLS
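A rough numpy sketch of CSLS-corrected retrieval, mirroring the formulation of [Conneau et al., ICLR-18] but not their code (the neighbourhood size k=10 and the toy data are assumptions):

import numpy as np

def csls_scores(XW, Y, k=10):
    # CSLS(Wx, y) = 2*cos(Wx, y) - r_T(Wx) - r_S(y)
    # r_T / r_S: mean cosine similarity to the k nearest neighbours in the other space
    XW = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = XW @ Y.T                                        # all pairwise cosine similarities
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)     # r_T(Wx): density around each projected source word
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)     # r_S(y): density around each target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

rng = np.random.default_rng(0)
scores = csls_scores(rng.normal(size=(100, 50)), rng.normal(size=(120, 50)))
translations = scores.argmax(axis=1)      # hubness-corrected nearest-neighbour retrieval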

51 of 235

How Important is (the Noise in the) the Seed Lexicon?

Image adapted from [Lubin et al., NAACL-19]

Orthogonal Procrustes is less affected by noise (i.e., incorrect translation pairs in the training set)

Open question: how to reduce the noise in seed lexicons?

Are larger seed lexicons necessarily better seed lexicons?

52 of 235

How Important is (the Size of) the Seed Lexicon?

Image adapted from [Vulić and Korhonen, ACL-16]

Performance saturates or even drops when adding lower-frequency words into dictionaries.

Better results with fewer translation pairs (but less noisy pairs): Symmetric/mutual nearest neighbours

Identical words can also be useful, but they are heavily dependent on language proximity and writing scripts.

Can we reduce the requirements further?

53 of 235

Reducing the Cross-Lingual Signal Requirements Further

Image from [Artetxe et al., ACL-17]

Other sources of supervision:

  • Shared words and cognates
  • Numerals
  • ...

[Vulić and Moens, EMNLP-13; Vulić and Korhonen, ACL-16; Zhang et al., NAACL-17; Artetxe et al., ACL-17, Smith et al., ICLR-17]

Reducing the dictionaries to only 25-40 examples.

The crucial idea with limited supervision is bootstrapping or self-learning

The final frontier… Unsupervised methods...

54 of 235

3. (Basic) Self-Learning

55 of 235

Self-Learning in a Nutshell

Image courtesy of Eneko Agirre

In theory, this is the only true/core “unsupervised” part

  • The seed dictionary improves over time, but…

  • How do we start?

  • How do we choose new candidates?

  • How do we guarantee that we do not introduce noise?

  • How does the method converge?

56 of 235

Self-Learning in a Nutshell

Image courtesy of Eneko Agirre

In theory, this is the only true/core “unsupervised” part

57 of 235

Bootstrapping with Orthogonal Procrustes

[Artetxe et al., ACL-17, Glavaš et al., ACL-19]

  • Source-to-target and target-to-source mappings

  • Symmetric nearest neighbours

  • Nothing useful is learned after the first iteration (only lower-frequency words and noise)

58 of 235

A General Framework (with all the “tricks of the trade”…)

59 of 235

A General Framework (with all the “tricks of the trade”…)

Typical focus: Seed dictionary induction and learning projections.

The only difference between weakly supervised and fully unsupervised methods is in C1.

60 of 235

A General Framework (with all the “tricks of the trade”…)

C1. Seed Lexicon Extraction

  • Supervised models assume that the lexicon is available (at least some pairs…)
  • Fully unsupervised models: automatically induce the lexicon

C2. Self-Learning Procedure

  • iteratively apply the Procrustes procedure (or something else)
  • different tricks to avoid suboptimal solutions (e.g., stochastic dropout of translation pairs, carefully-tuned frequency cut-offs)

C3. Preprocessing and Postprocessing Steps [Artetxe et al., AAAI-18]

  • length normalization
  • mean centering
  • whitening and de-whitening

61 of 235

C1. Seed Lexicon Extraction and C2. Self-Learning

The same general framework for all unsupervised CLWE methods:

  1. Induce (automatically) initial seed lexicon D(1)

Repeat:

2. Learn the projection W(k) using D(k)

3. Induce a new dictionary D(k+1) from XW(k) ∪ Y
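A compact sketch of this loop (Procrustes for step 2, plain nearest-neighbour retrieval for step 3; CSLS, frequency cut-offs, and the other tricks from the following slides are deliberately omitted):

import numpy as np

def solve_procrustes(X, Y):
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def self_learning(X, Y, seed_pairs, n_iter=5):
    # X, Y: (length-normalised) monolingual embedding matrices
    # seed_pairs: initial dictionary D(1) as a list of (source_idx, target_idx) pairs
    D = list(seed_pairs)
    for _ in range(n_iter):
        src, tgt = map(list, zip(*D))
        W = solve_procrustes(X[src], Y[tgt])      # step 2: learn the projection W(k) from D(k)
        sims = (X @ W) @ Y.T                      # compare the projected space XW(k) with Y
        D = [(i, int(sims[i].argmax())) for i in range(len(X))]   # step 3: new dictionary D(k+1)
    return W, D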

62 of 235

C1. Seed Lexicon Extraction and C2. Self-Learning

The same general framework for all unsupervised CLWE methods.

Different approaches to C1 (i.e., to obtain D(1)), e.g.,:

  • Adversarial learning
  • Similarity of monolingual similarity distributions
  • PCA-based similarity
  • Solving optimal transport problem

All solutions assume approximate isomorphism of the monolingual embedding spaces

63 of 235

(Approximate) Isomorphism

“. . . we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment.”

[Miceli Barone, RepL4NLP-16]

Fully unsupervised CLWE methods rely on this assumption twice

Is this problematic?

(...to be continued...)

[Mikolov et al., arXiv-13]

64 of 235

C3. Preprocessing and Postprocessing Steps

A general projection-based supervised framework of [Artetxe et al., AAAI-18]

  • S0: Normalisation: Unit length normalisation, mean centering, or combination of both (preproc)

  • S1: Whitening: turning covariance matrices into the identity matrix: each dimension obtains unit variance

  • S2: Re-weighting: re-weigh each component according to its cross-correlation to increase the relevance of those that best match across languages (after the mapping)

  • S3: De-whitening: use only if S1 was used; restore the original variance in each dimension (after)

  • S4: Dimensionality reduction: keep only the first n components of the resulting embeddings (and set the rest to 0) (after)
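A rough sketch of the S0/S1 steps (ZCA-style whitening is one concrete way to realise S1; the exact formulation in [Artetxe et al., AAAI-18] may differ in details):

import numpy as np

def normalise(X):
    # S0: unit length normalisation + mean centering (+ renormalisation)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    X = X - X.mean(axis=0, keepdims=True)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def whiten(X):
    # S1: whitening -- decorrelate the dimensions and give each unit variance
    cov = X.T @ X / len(X)
    vals, vecs = np.linalg.eigh(cov)
    W_white = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return X @ W_white, W_white   # keep W_white around to undo it later (S3: de-whitening)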

65 of 235

C3. Preprocessing and Postprocessing Steps and Other Choices

A (most robust) unsupervised framework of [Artetxe et al., ACL-18]

  • Unit length normalisation #1 + mean centering + unit length normalisation #2

  • Re-weighting is done using the singular value matrix after the orthogonal Procrustes

  • Bidirectional mapping and dictionary induction

  • Symmetric re-weighting applied only once (and not in each iteration)

  • CSLS used for word retrieval instead of the simple cosine

  • Self-learning applied only on the top K (K=20,000) most frequent words
  • Stochastic dictionary induction: dropout on current dictionary

66 of 235

Improving C3: Preprocessing Steps and Other “Tricks”

Another look into the future [Zhang et al., ACL-19]

Based on [Artetxe et al., ACL-18]

Iterative Normalization guarantees:

  • length-invariance
  • center-invariance

Preprocessing which makes the two monolingual embedding spaces more isomorphic

Better results with orthogonal projections, especially for distant language pairs.
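A sketch of the Iterative Normalization idea (the iteration count is an assumption; a handful of alternations is usually enough):

import numpy as np

def iterative_normalization(X, n_iter=5):
    # Alternate unit-length normalisation and mean centering until the embeddings
    # are (approximately) both length-invariant and center-invariant.
    for _ in range(n_iter):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # every vector on the unit sphere
        X = X - X.mean(axis=0, keepdims=True)              # zero mean across the vocabulary
    return X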

67 of 235

From Bilingual to Multilingual Word Embeddings

  • Single Hub Space (SHS) versus Iterative Hub Space (IHS) [Heyman et al., NAACL-19]

With a single hub space: for N languages, N-1 seed lexicons needed...

With all language pairs: (N-1)*N/2 seed lexicons needed...

68 of 235

A Quick Recap

Unsupervised CLWE-s are Projection-Based Methods (with some fancy improvements over the basic framework)

C1. Seed Lexicon Extraction

  • How do we extract the initial signal from monolingual data only?
  • How important is the initialisation actually?

C2. Self-Learning Procedure

  • How to denoise the initial sub-optimal seed lexicon?
  • How to design a robust procedure that works for virtually any language pair and avoids poor local optima?

C3. Preprocessing and Postprocessing Steps

This comes next...

69 of 235

4. Unsupervised Seed Induction

70 of 235

Overview

[Overview table (authors × seed dictionary induction approach)]

  • Adversarial: GAN; Wasserstein GAN / Optimal transport
  • Non-adversarial: Heuristic; Point Cloud Matching

71 of 235

Additional References (with Similar High-Level Ideas)

GAN

  • Chen and Cardie (2018): a more robust and multilingual variant of the MUSE model

Wasserstein GAN / optimal transport

  • Grave et al. (2018): convex relaxation for solving optimal transport with orthogonality constraint
  • Alaux et al. (2019): extending the approach of Grave et al. to multilingual settings, using the RCSLS loss instead of the Orthogonal Procrustes formulation

Heuristic

  • Aldarmaki et al. (2018): structural similarity computed based on adjacency matrices and preserving relative distances in two monolingual spaces
  • Heyman et al. (2019): multilingual generalisations of the robust framework from Artetxe et al. (later)

One promising direction for future research?

  • Jawanpuria et al. (2019): simultaneously learning language-specific transformations to a shared latent space and a similarity metric in the shared space, but still supervised...

72 of 235

Overview

[Overview table (authors × seed dictionary induction approach)]

  • Adversarial: GAN; Wasserstein GAN / Optimal transport
  • Non-adversarial: Heuristic; Point Cloud Matching

73 of 235

Adversarial mapping: Main idea

  • Generator: projects source word embedding into the target language
  • Discriminator: differentiate between “fake” projected embeddings and “true” target language embeddings

[Figure: the generator maps an L1 word embedding to a “fake” L2 embedding; the discriminator tries to tell it apart from “true” L2 embeddings. Image credit: Luis Prado, Jojo Ticon, Graphic Tigers]

74 of 235

Adversarial mapping: Main idea

  • Generator needs to match distribution of target language in order to fool discriminator consistently.
  • Hypothesis: Best way to do this is to align words with their translations.

[Figure: generator / discriminator setup as on the previous slide. Image credit: Luis Prado, Jojo Ticon, Graphic Tigers]

75 of 235

Adversarial mapping: Main idea

[Figure: generator / discriminator setup as on the previous slides]

76 of 235

Why should this work at all?

Main assumption:�Embedding spaces in different languages have similar structure so that the same transformation can align source language words with target language words.

More specifically:�Embeddings spaces should be approximately isomorphic.

77 of 235

Supervised alignment

[Figure: two separate monolingual embedding spaces; English words (cat, horse, dog, house, table, tree, street, city, river) on one side and Spanish words (gato, caballo, perro, casa, mesa, árbol, calle, ciudad, río) on the other]

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

78 of 235

Supervised alignment

[Figure: the two spaces after a supervised alignment; each English word ends up close to its Spanish translation (cat/gato, horse/caballo, dog/perro, house/casa, table/mesa, tree/árbol, street/calle, city/ciudad, river/río)]

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

79 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

80 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

81 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

82 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

83 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

84 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

85 of 235

Unsupervised alignment

Inspired by: Slides from Eneko Agirre, Mikel Artetxe

86 of 235

Early adversarial approaches

Training GANs is often unstable, gets stuck in poor minima.

Early approaches were not as successful…

  • The generator is not able to find a good alignment (Miceli Barone, 2016).
  • Training requires careful regularization and hyper-parameter tuning; only works for small embedding sizes (50 dimensions) and on non-standard benchmarks (Zhang et al., 2017a).
  • More robust approach in certain settings (Conneau et al., 2018)

For more on GANs: NAACL 2019 Deep Adversarial Learning tutorial

87 of 235

Conneau et al. (2018)

  1. Monolingual word embeddings: Learn monolingual vector spaces X and Y.
  2. Adversarial mapping: Learn a translation matrix W. Train a discriminator to discriminate samples from WX and Y.

88 of 235

Conneau et al. (2018)

  3. Refinement (Procrustes analysis): Build a bilingual dictionary of frequent words using W. Learn a new W based on these frequent word pairs.
  4. Cross-domain similarity local scaling (CSLS): Use a similarity measure that increases the similarity of isolated word vectors and decreases the similarity of vectors in dense areas.

89 of 235

Adversarial mapping in detail

  • Generator: the projection matrix W
  • Discriminator: differentiate between the “fake” projected source samples WX and the “true” target samples Y

For every input sample (an embedding x in X or an embedding y in Y), generator and discriminator are trained successively with SGD to minimize their losses:

  • Discriminator: maximize the probability of predicting the correct source of each sample
  • Generator: maximize the probability of “fooling” the discriminator
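For reference, these two losses can be written as in [Conneau et al., ICLR-18], with P_θD(src = 1 | v) the discriminator's probability that a vector v comes from the projected source space:

L_D(θ_D | W) = - (1/n) Σ_{i=1..n} log P_θD(src = 1 | W x_i) - (1/m) Σ_{j=1..m} log P_θD(src = 0 | y_j)

L_W(W | θ_D) = - (1/n) Σ_{i=1..n} log P_θD(src = 0 | W x_i) - (1/m) Σ_{j=1..m} log P_θD(src = 1 | y_j)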

90 of 235

Overview

[Overview table (authors × seed dictionary induction approach)]

  • Adversarial: GAN; Wasserstein GAN / Optimal transport
  • Non-adversarial: Heuristic; Point Cloud Matching

91 of 235

Wasserstein GAN / Optimal Transport

  • Wasserstein GAN
  • Earth Mover’s Distance (EMD)
  • EMD and Wasserstein Distance
  • Solving the optimal transport problem
  • Gromov Wasserstein Distance

92 of 235

Wasserstein GAN

Discriminator: assign higher scores to “true” than to “fake” samples

Generator: “fool” discriminator to predict high scores for “fake” samples

Wasserstein GAN (WGAN; Arjovsky et al., 2017) has been shown to be more stable than the original GAN

[Figure: generator / discriminator setup as before]

93 of 235

Wasserstein GAN

Wasserstein GAN (WGAN; Arjovsky et al., 2017) implicitly minimizes Wasserstein distance

  • Discriminator D takes source and target word embeddings as input and estimates Wasserstein distance
  • Generator G uses this information to try to minimize Wasserstein distance
  • Wasserstein distance is the Earth Mover’s distance for continuous distributions

→ What is the Earth Mover’s Distance?

94 of 235

Earth Mover’s Distance (EMD)

  • Distance measure between (discrete) probability distributions.
  • Red distribution: “dirt”
  • Blue distribution: “holes”
  • Minimize overall distance of dirt moved into holes
  • Distance between points can be any distance (Euclidean, etc.)

95 of 235

Earth Mover’s Distance (EMD)

  • Distance measure between (discrete) probability distributions.
  • Discrete probability distributions can be represented as sums of Dirac delta functions: μ = Σ_i p_i δ_{x_i} and ν = Σ_j q_j δ_{y_j}
    • p_i: probability of the i-th word, assumed to be uniform (e.g. 1/n)
    • δ_{x_i}: Dirac delta function, a unit point mass at point x_i

EMD(μ, ν) = min_{T ∈ U(p, q)} Σ_{i,j} T_{ij} c(x_i, y_j)

    • c(x_i, y_j): cost, the distance between x_i and y_j
    • T: transport matrix; indicates the assignment, i.e. how much probability mass is transported from x_i to y_j
    • U(p, q) = {T ≥ 0 : T1 = p, Tᵀ1 = q}: transport polytope

[Figure: four English words (dog, cat, horse, street), each with mass 1/4, matched to four Spanish words (perro, gato, caballo, calle), each with mass 1/4; d denotes the distance between matched points]

96 of 235

Learning cross-lingual embeddings with EMD

Need to learn the alignment T together with the mapping W.

[Figure: as before, but the source words are projected with W before the transport distances d are computed]

97 of 235

Learning cross-lingual embeddings with EMD

In linear optimization, solutions to the OT problem always lie on a vertex of the transport polytope U(p, q), i.e. T is a sparse matrix with only few (at most n + m - 1) nonzero entries (Brualdi, 2006; §8.1.3).

→ Good fit for matching words across languages.

→ Probability mass preservation: We want to translate all words.

  • EMD has been used to learn an alignment between words and their translations based on seed words (Zhang et al., 2016a; 2016b)
  • EMD has also been used for measuring document similarity (Kusner et al., 2015)

98 of 235

EMD and Wasserstein distance

  • Wasserstein distance generalizes EMD to continuous distributions:

W(μ, ν) = inf_{γ ∈ Π(μ, ν)} E_{(x, y) ∼ γ}[c(x, y)]

    • Π(μ, ν): set of all joint distributions with marginals μ and ν
    • infimum: greatest lower bound

  • Not necessary for cross-lingual word embeddings, but theoretically motivates WGAN.

99 of 235

Wasserstein GAN

Objective to minimize: the Wasserstein distance W(P_WX, P_Y) between the distributions of transformed source and target word embeddings.

If the cost c is the Euclidean distance, it can be written as (Kantorovich-Rubinstein duality, Villani, 2006):

W(P_WX, P_Y) = sup_{||f||_L ≤ 1} E_{x ∼ P_WX}[f(x)] - E_{y ∼ P_Y}[f(y)]

    • supremum: least upper bound, here over all 1-Lipschitz functions f
    • the Lipschitz functions f can be replaced with a neural network discriminator with weight clipping

100 of 235

Wasserstein GAN

Discriminator: assign higher scores to “true” than to “fake” samples

Generator: minimize approximate Wasserstein distance

101 of 235

How to track performance

  • Accuracy is not available in the unsupervised setting
  • Wasserstein estimate: for free in WGAN, correlates with accuracy
  • Orthogonality of generator, e.g. ||WᵀW - I|| (if orthogonality is not strictly enforced)

102 of 235

How to track performance

  • Accuracy is not available in the unsupervised setting
  • Wasserstein estimate: for free in WGAN, correlates with accuracy
  • Orthogonality of generator, e.g. ||WᵀW - I|| (if orthogonality is not strictly enforced)
  • Average cosine distance of translations / reconstruction cost
  • Number of mutual nearest neighbours

103 of 235

Wasserstein GAN / Optimal Transport

  • Wasserstein GAN
  • Earth Mover’s Distance (EMD)
  • EMD and Wasserstein Distance
  • Solving the optimal transport problem
  • Gromov Wasserstein Distance

104 of 235

Break

105 of 235

Overview

[Overview table (authors × seed dictionary induction approach)]

  • Adversarial: GAN; Wasserstein GAN / Optimal transport
  • Non-adversarial: Heuristic; Point Cloud Matching

106 of 235

Wasserstein GAN / Optimal Transport

  • Wasserstein GAN
  • Earth Mover’s Distance (EMD)
  • EMD and Wasserstein Distance
  • Solving the optimal transport problem
  • Gromov Wasserstein Distance

107 of 235

EMD with orthogonality constraint

In order to incorporate the orthogonality constraint, we need to alternate optimizing the transport matrix T and the translation matrix W at every step k:

T^(k) = argmin_{T ∈ U(p, q)} Σ_{i,j} T_{ij} c(W^(k-1) x_i, y_j)   (keep W fixed)

W^(k) = argmin_{W: WᵀW = I} Σ_{i,j} T^(k)_{ij} c(W x_i, y_j)   (keep T fixed)

108 of 235

Alternating optimization of T and W

  1. Optimize T (with W fixed).

109 of 235

Alternating optimization of T and W

  • Optimize T (with W fixed).
  • Optimize W (with T fixed).

110 of 235

Alternating optimization of T and W

  • Optimize T (with W fixed).
  • Optimize W (with T fixed).

111 of 235

Alternating optimization of T and W

  • Optimize T (with W fixed).
  • Optimize W (with T fixed).

112 of 235

Alternating optimization of T and W

  • Optimize T (with W fixed).
  • Optimize W (with T fixed).

113 of 235

Alternating optimization of T and W

  • Optimize T (with W fixed).
  • Optimize W (with T fixed).

114 of 235

EMD with orthogonality constraint

In order to incorporate the orthogonality constraint, we need to alternate optimizing the transport matrix T and the translation matrix W at every step k:

  • The T-step can be optimized with an approximate optimal transport solver (Cuturi, 2013)
  • If c is the squared L2 distance, the W-step is just the Procrustes problem
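A small sketch of this alternation; for uniform marginals and equally sized (toy) vocabularies the exact OT solution is a permutation, so the T-step is solved here with the Hungarian algorithm from scipy instead of an approximate solver (an illustrative simplification, not the exact setup of the papers above):

import numpy as np
from scipy.optimize import linear_sum_assignment

def solve_procrustes(A, B):
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def alternate_T_W(X, Y, n_steps=10):
    W = np.eye(X.shape[1])
    for _ in range(n_steps):
        XW = X @ W
        # squared L2 costs between all projected source and all target vectors
        C = (XW ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * XW @ Y.T
        rows, cols = linear_sum_assignment(C)     # T-step: optimal matching with W fixed
        W = solve_procrustes(X[rows], Y[cols])    # W-step: Procrustes on the matched pairs, T fixed
    return W, cols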

115 of 235

Solving the optimal transport (OT) problem

116 of 235

Solving the optimal transport (OT) problem

117 of 235

Solving the optimal transport (OT) problem

118 of 235

Solving the optimal transport (OT) problem

Add an entropy constraint h(T) with a Lagrange multiplier λ:

min_{T ∈ U(p, q)} Σ_{i,j} T_{ij} c(x_i, y_j) - (1/λ) h(T)

Two reasons for adding an entropy constraint (Cuturi, 2013):

  1. Enables efficient computation.
  2. Encourages smooth solutions.

119 of 235

Solving the optimal transport (OT) problem

With the entropy constraint and its Lagrange multiplier λ, the solution has the form (Cuturi, 2013):

T* = diag(u) K diag(v),   with K = e^(-λC) (element-wise)

where u and v are non-negative vectors.

The resulting distance is known as the Sinkhorn distance (“dual-Sinkhorn divergence” specifically).

120 of 235

Solving the optimal transport (OT) problem

The solution T* = diag(u) K diag(v) (Cuturi, 2013) can be efficiently computed with Sinkhorn iterations, using only matrix-vector multiplications → we can back-propagate through it.
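A minimal sketch of these Sinkhorn iterations (the regularisation strength and iteration count are illustrative assumptions):

import numpy as np

def sinkhorn(C, p, q, lam=30.0, n_iter=200):
    # Entropy-regularised OT (Cuturi, 2013): T* = diag(u) K diag(v), with K = exp(-lam * C)
    K = np.exp(-lam * C)
    u = np.ones_like(p)
    for _ in range(n_iter):          # only matrix-vector products
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

# toy usage: uniform marginals over 4 source and 4 target words
C = np.random.rand(4, 4)
T = sinkhorn(C, np.ones(4) / 4, np.ones(4) / 4)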

121 of 235

Solving the optimal transport (OT) problem

  • The distance should be a valid metric, e.g. Euclidean distance
  • Cosine distance is not a valid metric (it does not satisfy the triangle inequality)
  • Xu et al. (2018) use the square root cosine distance instead

122 of 235

Gromov Wasserstein distance

Main idea: Compare metric spaces directly (instead of just comparing samples) by comparing distances between pairs of points (distance between distances).

The inner term measures the cost of matching x_i to y_j and x_k to y_l, by comparing the intra-language distances between (x_i, x_k) and (y_j, y_l).

123 of 235

Gromov Wasserstein distance

Main idea: Compare metric spaces directly (instead of just comparing samples) by comparing distances between pairs of points (distance between distances).

  • The naive formulation requires operating over a fourth-order tensor!
  • Can be optimized efficiently with first-order methods (Peyré et al., 2016)

124 of 235

Gromov Wasserstein distance

  1. Compute a pseudo-cost matrix from the intra-language similarities C_s and C_t and the current transport plan (the cross-language similarity).
  2. Solve the optimal transport problem using this pseudo-cost matrix as the cost matrix.
  3. Repeat this process until convergence.

125 of 235

Finding a good initialisation

  • Optimize T (with W fixed).
  • Optimize W (with T fixed).

The problem is non-convex. Without a good initialisation, it is easy to get stuck in poor local optima!

126 of 235

Finding a good initialisation

In practice, first use WGAN to find a good initialisation and then solve optimal transport problem with orthogonality constraint.

127 of 235

Bidirectionality

  • Can do the projection in both directions.
  • The reverse mapping can be a new transformation or simply the transpose/inverse of W.
  • Can use a GAN instead of EMD.

128 of 235

Reconstruction

  • Enforce consistency by reconstructing the projected embeddings.

  • Reconstruction loss: Minimise the cosine distance between the original and the twice-projected embeddings.
  • Can also be seen as an adversarial autoencoder, CycleGAN, or back-translation.

[Figure: embeddings are projected L1→L2 and back; the cosine distance between the original and reconstructed vectors is minimised in both directions]

129 of 235

Overview

[Overview table (authors × seed dictionary induction approach)]

  • Adversarial: GAN; Wasserstein GAN / Optimal transport
  • Non-adversarial: Heuristic; Point Cloud Matching

130 of 235

Heuristic seed induction

We have seen heuristics for specifying a seed dictionary (“same spelling”, “numerals”) but these make strong assumptions about the writing system.

Can we come up with a heuristic that is independent of the writing system?

131 of 235

Heuristic seed induction

Main idea: Translations are similar to other words in the same way across languages.

Specifically: Words with similar meaning have similar monolingual similarity distributions (i.e. distributions of similarity across all words of the same language).

for x in vocab: sim(x, “two”)

Similar to going from “distance” to “distance between distances”, we now go from “similarity” to “similarity between similarities”

intra-language similarity

Gromov-Wasserstein distance (Alvarez-Melis & Jaakkola, 2018) incorporates this, too

132 of 235

Heuristic seed induction

Main idea: Translations are similar to other words in the same way across languages.

Specifically: Words with similar meaning have similar monolingual similarity distributions (i.e. distributions of similarity across all words of the same language).

133 of 235

Heuristic seed induction

  • Get monolingual similarity matrices M_X and M_Y (e.g. M_X = XXᵀ) for both languages
  • Sort the values in each row of M_X and M_Y to get a similarity distribution for every word; take the square root to get a smoothed density estimate

134 of 235

Heuristic seed induction

  • Nearest neighbours from similarity distributions used as initial dictionary
  • Additional tricks to make self-learning more robust
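A rough sketch of this second-order initialisation (it assumes both matrices contain the same number of, e.g., most frequent words; negative cosines are clipped before the square root, a simplification rather than the exact published procedure):

import numpy as np

def similarity_signature(E):
    # sorted (square-rooted) similarities of each word to all words of its own language
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    M = E @ E.T                                            # monolingual similarity matrix
    return np.sort(np.sqrt(np.clip(M, 0.0, None)), axis=1)

def heuristic_seed(X, Y):
    # initial dictionary: nearest neighbours between the similarity signatures
    Sx, Sy = similarity_signature(X), similarity_signature(Y)
    sims = Sx @ Sy.T
    return {i: int(sims[i].argmax()) for i in range(len(X))}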

135 of 235

Overview

[Overview table (authors × seed dictionary induction approach)]

  • Adversarial: GAN; Wasserstein GAN / Optimal transport
  • Non-adversarial: Heuristic; Point Cloud Matching

136 of 235

Word embedding spaces as point clouds

137 of 235

Point cloud matching

138 of 235

Approximate distribution alignment via PCA

  • Project embeddings to principal axes of variation via PCA
  • Then solve this easier problem first via self-learning
  • Use the solution as initialization for solving the original problem
  • PCA-based alignment is popular in point cloud matching
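A small sketch of the PCA step (the number of components is an assumption):

import numpy as np

def pca_project(E, k=50):
    # project embeddings onto their top-k principal axes of variation
    E = E - E.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    return E @ Vt[:k].T    # low-dimensional coordinates used to solve the easier problem first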

139 of 235

Iterative closest point (ICP)

ICP is very similar to our self-learning loop:

  1. For each point in source space, find closest point in target space.
  2. Estimate transformation that best aligns (minimizes distance between) source point and its match found in previous step.
  3. Transform source points with estimated transformation.
  4. Iterate.

Hoshen & Wolf (2018) do this in mini-batches, add bidirectionality + reconstruction.

140 of 235

Mini-Batch Cycle Iterative Closest Point (MBC-ICP)

  • For each embedding in the mini-batch, find the closest embedding.

Do this for both languages.

141 of 235

Mini-Batch Cycle Iterative Closest Point (MBC-ICP)

  • Learn new transformations that minimize the distance between matched embeddings.

142 of 235

Mini-Batch Cycle Iterative Closest Point (MBC-ICP)

Initialization is important!

In practice, MBC-ICP is run three times:

  1. On 5k most frequent words mapped to principal components.
  2. On all words.
  3. On mutual nearest neighbours (7.5k words).

143 of 235

Overview

[Overview table (authors × seed dictionary induction approach): GAN; Wasserstein GAN / Optimal transport; Heuristic; Point Cloud Matching]

144 of 235

Take-aways

  • Existing methods for unsupervised seed induction share many commonalities:
    • Adversarial term
    • Optimal transport
    • Computing intra-language similarity
    • Iteratively optimizing a distance metric
  • Learning an unsupervised alignment of word representations is a general problem
  • Inspiration can come from many different domains:
    • Transportation theory
    • Computer vision
    • ...

145 of 235

5. Systematic Comparisons

146 of 235

State of the art

  • New papers compare to previous work by reporting numbers of their systems on the same datasets.
  • This systematically compares multi-component systems, but not modeling choices.
  • We know VecMap > MUSE, but not whether heuristic initialization is better than GANs, for example.

147 of 235

State of the art

Common wisdom

Unsupervised is better than supervised (unless you have several thousands of good seeds)?

VecMap (heuristic initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)?

Our message

148 of 235

State of the art

Common wisdom

Unsupervised is better than supervised (unless you have several thousands of good seeds)?

VecMap (heuristic initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)?

Our message

Unsupervised is mostly worse than supervised.�Stochastic dictionary induction is a key component in the self-learning step.�GANs are competitive with heuristic (second-order) initialization.

149 of 235

Reminder

  • The difference between unsupervised and supervised approaches to cross-lingual learning is how we obtain our seed alignments (typically a dictionary).
  • The choice of iterative refinement method is therefore orthogonal to the source of the seed alignments.

150 of 235

Footnote on Stochasticity and Optimization

151 of 235

Procrustes Analysis overfits the seed

  • Procrustes Analysis (PA) minimizes the squared Euclidean distance between the seed points. If these are sampled at random, we minimize the expected overall distance.
  • Problem: Seeds are not selected at random.

152 of 235

PA overfits the seed

  • Supervised: Our dictionaries may be biased toward frequent words, common nouns, etc.
  • Unsupervised: Our dictionaries are both biased and error-prone.

153 of 235

PA overfits the seed

  • Supervised: Our dictionaries may be biased toward frequent words, common nouns, etc.
  • Unsupervised: Our dictionaries are both biased and error-prone.
  • While Lubin et al. (2019) have shown PA is relatively robust, they only investigated moderate noise levels.

154 of 235

PA overfits the seed

  • Supervised: Our dictionaries may be biased toward frequent words, common nouns, etc.
  • Unsupervised: Our dictionaries are both biased and error-prone.
  • While Lubin et al. (2019) have shown PA is relatively robust, they only investigated moderate noise levels. �
  • This may be why stochastic dictionary induction (i.e. dropout on current dictionary), a simple regularization technique (Artetxe et al., 2018) led to huge improvements over the state of the art.
  • The variance induced by the regularization may prevent the model from getting stuck in poor local optima.

155 of 235

Reminder: Why SGD works so well

  • Both GD and SGD iteratively update a set of parameters.
  • Both GD and SGD iteratively update a set of parameters. GD: All parameters at once. SGD: Random samples (minibatches).
  • GD: Great for convex, relatively smooth manifolds, but often trapped into local minima; the turbulence of SGD can get you out of such minima.

156 of 235

Reminder: Why SGD works so well

  • SGD induces a strong inductive bias, provably learns a network close to the random initialization and with a *small generalization error (Li and Liang, 2018); also in the context of matrix factorization (Gunasekar et al., 2017), which can be used for cross-lingual learning (Zou et al., 2013). Drop-out and noise injection is extremely important for stable training of GANs (Chintala, 2016).

157 of 235

Reminder: Why SGD works so well

  • SGD induces a strong inductive bias, provably learns a network close to the random initialization and with a *small generalization error (Li and Liang, 2018); also in the context of matrix factorization (Gunasekar et al., 2017), which can be used for cross-lingual learning (Zou et al., 2013). Drop-out and noise injection is extremely important for stable training of GANs (Chintala, 2016).
  • *: SGD is biased toward wide valleys, which tend to generalize better (Chaudhari et al., 2017).

158 of 235

Systematic comparisons of unsupervised methods

159 of 235

A General Framework (with all the tricks…)

160 of 235

A General Framework (with all the tricks…)

Typical focus: Seed dictionary induction and learning projections.

Common wisdom: Artetxe et al. (2018) state of the art, with incremental improvements in 2019.

What we do: Fix learning projections and compare approaches to C1.

161 of 235

Comparing GAN, ICP and Heuristic

162 of 235

Overview

[Overview table (authors × seed dictionary induction approach): GAN; Wasserstein GAN / Optimal transport; Heuristic; Point Cloud Matching]

163 of 235

GAN, ICP, and GWA (with no refinement)

164 of 235

GAN, ICP, and GWA with Procrustes Analysis

165 of 235

GAN and GWA with SDI

166 of 235

State of the art

Common wisdom

Unsupervised is better than supervised (unless you have several thousands of good seeds)?

VecMap (second order initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)?

Our message

Stochastic dictionary induction improves GAN-based seed induction.�How about other flavors of GANs?

167 of 235

Comparing flavors of GAN

168 of 235

Overview

[Overview table (authors × seed dictionary induction approach): GAN; Wasserstein GAN / Optimal transport; Heuristic; Point Cloud Matching]

169 of 235

Flavors of GANs

  • Ivan and Sebastian introduced vanilla GANs and WGANs.
  • WGANs: Motivated by mode collapse and hubs. Replaces the discriminator by a real-valued function. Uses weight clipping.
  • WGAN-GPs (Gulrajani et al., 2017): Provide gradient regularization (gradient penalty; GP) for more stable training (as an alternative to weight clipping): the norm of the discriminator gradients is regularized to be 1 almost everywhere.
  • CT-GANs (Wei et al., 2018): Use data augmentation (consistency regularization; CT) instead of regularization. Perturb each data point twice, bounding the difference between the discriminator responses to the two datapoints.

170 of 235

GAN, WGAN, WGAN-GP and CT-GAN (w/o refinement)

171 of 235

State of the art

Common wisdom

Unsupervised is better than supervised (unless you have several thousands of good seeds)?

VecMap (second order initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)?

Our message

Stochastic dictionary induction improves GAN-based seed induction. WGAN-GP works best as GAN-based seed induction.

172 of 235

Systematic comparisons of unsupervised vs. supervised

173 of 235

Unsupervised vs. supervised

[Conneau et al.; ICLR 2018]: “Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs”

[Artetxe et al.; ACL 2018]: “Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems”

[Hoshen and Wolf; EMNLP 2018]: “...our method achieves better performance than recent state-of-the-art deep adversarial approaches and is competitive with the supervised baseline”

[Xu et al.; EMNLP 2018]: “Our evaluation (...) shows stronger or competitive performance of the proposed method compared to other state-of-the-art supervised and unsupervised methods...”

[Chen and Cardie; EMNLP 2018]: “In addition, our model even beats supervised approaches trained with cross-lingual resources.”

174 of 235

Unsupervised vs. supervised

  • How come unsupervised is reportedly better than supervised?

175 of 235

Unsupervised vs. supervised

  • How come unsupervised is reportedly better than supervised?
  • Argument 1: Supervision is poor quality.

176 of 235

Unsupervised vs. supervised

  • How come unsupervised is reportedly better than supervised?
  • Argument 1: Supervision is poor quality. Counter-argument: We evaluate on the same data. Possible counter-counter-argument: Maybe the train splits are particularly poor?

177 of 235

Unsupervised vs. supervised

  • How come unsupervised is reportedly better than supervised?
  • Argument 1: Supervision is poor quality. Counter-argument: We evaluate on the same data. Possible counter-counter-argument: Maybe the train splits are particularly poor? Argument 2: Supervision is too limited.

178 of 235

Unsupervised vs. supervised

  • How come unsupervised is reportedly better than supervised?
  • Argument 1: Supervision is poor quality. Counter-argument: We evaluate on the same data. Possible counter-counter-argument: Maybe the train splits are particularly poor? Argument 2: Supervision is too limited.
  • Note that the standard motivation - that resources are lacking - is often not true: to see this, try to think of a language for which we can induce good embeddings, but for which we cannot collect 200 translations into a major language (say from PanLex or ASJP).

179 of 235

Unsupervised vs. supervised

{Bulgarian, Catalan, Esperanto, Estonian, Basque, Finnish, Hebrew, Hungarian, Indonesian, Georgian, Korean, Lithuanian, Bokmål, Thai, Turkish}

x

{Bulgarian, Catalan, Esperanto, Estonian, Basque, Finnish, Hebrew, Hungarian, Indonesian, Georgian, Korean, Lithuanian, Bokmål, Thai, Turkish}

180 of 235

Unsupervised vs. supervised: similar languages

While fully unsupervised CLWEs really show impressive performance for similar language pairs, they are still worse than weakly supervised methods…

  • Furthermore, we don’t really need them for these scenarios...

181 of 235

Unsupervised vs. supervised: all languages

Supervised never fails!

Unsupervised never wins!

182 of 235

Unsupervised vs. supervised: distant languages

183 of 235

State of the art

Common wisdom

Unsupervised is better than supervised (unless you have several thousands of good seeds)?

VecMap (second order initialization and stochastic dictionary induction) is better than MUSE (GANs and Procrustes Analysis)?

Our message

Stochastic dictionary induction improves GAN-based seed induction. WGAN-GP works best as GAN-based seed induction.

Supervision doesn’t seem to hurt.

184 of 235

6. Open Problems

185 of 235

6. Systematic Comparisons

186 of 235

Open problems

  1. Robustness and Instability*
  2. Morphology
  3. Isomorphism
  4. Evaluation

*We argue: Maybe Instability is not a problem.

187 of 235

(Robustness and) Instability

188 of 235

Robustness and instability

  • Robustness is (lack of) sensitivity to language, domain, algorithm, etc.
  • Instability is (lack of) sensitivity to random seed.

189 of 235

Robustness and instability

  • MUSE was shown to lack robustness and stability in Søgaard et al. (2018), Hartmann et al. (2018), and Artetxe et al. (2018).
  • ICP requires 600 random restarts to work reasonably.
  • More robust and stable methods have been presented, but we’ve just seen that even VecMap is not stable on hard language pairs.
  • Let’s briefly review the ways in which MUSE lacks robustness and stability…

190 of 235

Sensitive to languages [Søgaard et al., ACL-18]

Adversarial training fails for more distant language pairs

191 of 235

Sensitive to domains [Søgaard et al., ACL-18]

Even the most robust unsupervised model breaks down when the domains are dissimilar, even for similar language pairs

192 of 235

Sensitive to both [Søgaard et al., ACL-18]

Domain differences may exacerbate difficulties of generalising across dissimilar languages

193 of 235

Sensitive to algorithm [Hartmann et al., EMNLP-18]

194 of 235

Sensitive to random seed [Søgaard et al.; Artetxe et al.]

  • Søgaard et al. (2018): We note, though, that varying our random seed, performance for Estonian, Finnish, and Greek is sometimes (approximately 1 out of 10 runs) on par with Turkish. Detecting main causes and remedies for the inherent instability of adversarial training is one of the most important avenues for future research.

  • Artetxe et al. (2018): Given the instability of these methods, we perform 10 runs for each, and report the best and average accuracies, the number of successful runs (those with >5% accuracy)...

195 of 235

Is instability a problem?

  • Instability is not a new phenomenon in NLP.
  • EM-HMM training is highly unstable (Goldberg et al., 2008); MERT training is highly unstable (Moore & Quirk, 2008); Gibbs sampling for dependency parsing is highly unstable (Naseem and Barzilay, 2011), etc.
  • Common solution: Use multiple random (or non-random) restarts and unsupervised selection criterion.
  • High level view: This form of hill-climbing is not different from tricks in other learning algorithms (drop-out, exploration, etc.).

196 of 235

Morphology

197 of 235

Morphology

  • Two basic assumptions: isomorphism and 1:1 correspondence.
  • Morphologically rich languages have a high average number of morphemes per word. This often breaks 1:1 correspondence.

198 of 235

Morphology

  • Two basic assumptions: isomorphism and 1:1 correspondence.
  • Morphologically rich languages have a high average number of morphemes per word. This often breaks 1:1 correspondence.
  • Instead it leads to 1:m correspondences and less support.

199 of 235

Morphology

  • Two basic assumptions: isomorphism and 1:1 correspondence.
  • Morphologically rich languages have a high average number of morphemes per word. This often breaks 1:1 correspondence.
  • Instead it leads to 1:m correspondences and less support.
  • Lemmatization?

200 of 235

Morphology

  • Two basic assumptions: isomorphism and 1:1 correspondence.
  • Morphologically rich languages have a high average number of morphemes per word. This often breaks 1:1 correspondence.
  • Instead it leads to 1:m correspondences and less support.
  • Lemmatization? Often just leads to m:n.
  • Segmentation?

atuaraangama: whenever I read

atuaruma: when I will read

atuarpoq: read

201 of 235

Morphology

  • Morphology has been shown to lead to poor performance (Adams et al., 2017; Søgaard et al., 2018).
  • … but so far, no flag’s been planted.

202 of 235

Morphology

  • Morphology has been shown to lead to poor performance (Adams et al., 2017; Søgaard et al., 2018).
  • … but so far, no flag’s been planted.
  • Except maybe: Zhang et al. (2019) use subword information for cross-lingual document classification and show competitive performance on BDI for related languages.

203 of 235

Open questions

Do cross-lingual embeddings - and/or BDI - even make sense in the context of different morphologies?

204 of 235

Isomorphism

205 of 235

Isomorphism

  • Philosophy class :) Why would embedding spaces be isomorphic in the first place?
  • Anna Wierzbicka’s semantic primes: “I am also positing certain innate and universal rules of syntax – not in the sense of some intuitively unverifiable formal syntax à la Chomsky, but in the sense of intuitively verifiable patterns determining possible combinations of primitive concepts”
  • Universal structure of lexical semantics (Youn et al., 2016): “Indeed, our results are consistent with the hypothesis that cultural and environmental factors have little statistically significant effect on the semantic network of the subset of basic concepts studied here. To a large extent, the semantic network appears to be a human universal: For instance, SEA/OCEAN and SALT are more closely related to each other than either is to SUN, and this pattern is true for both coastal and inland languages.”

206 of 235

Measuring isomorphism

  • More specifically, monolingual embedding spaces should be approximately isomorphic, i.e. same number of vertices, connected the same way
  • Does not strictly hold even for related languages
  • Can characterise similarity based on structure of nearest neighbour graphs.
  • Eigenvector similarity [Søgaard et al., ACL-18]: compare the Laplacian spectra of the two monolingual nearest-neighbour (NN) graphs, e.g. the NN graphs of the ten most frequent English nouns and their German translations.
  • For each graph, compute the Laplacian eigenvalues and take the smallest k such that the sum of the largest k eigenvalues is > 90% of the sum of all eigenvalues.
  • The (dis)similarity of the two spaces is then Δ = Σ_{i=1}^{k} (λ_{1i} − λ_{2i})² (see the sketch below).
  • Other measures of isomorphism: [Patra et al., ACL-19]
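
A minimal sketch of this eigenvector similarity metric, assuming numpy and scikit-learn; the function names, the k-NN graph construction details (symmetrised graph, unnormalised Laplacian), and the inputs `emb_a`/`emb_b` (embedding matrices of the two word sets being compared) are illustrative assumptions rather than the exact setup of Søgaard et al. (2018):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_spectrum(emb, n_neighbors=3):
    """Laplacian eigenvalues (descending) of a symmetrised k-NN graph over the rows of emb."""
    adj = kneighbors_graph(emb, n_neighbors=n_neighbors, mode="connectivity")
    adj = ((adj + adj.T) > 0).astype(float).toarray()   # make the graph undirected
    lap = np.diag(adj.sum(axis=1)) - adj                 # unnormalised Laplacian L = D - A
    return np.sort(np.linalg.eigvalsh(lap))[::-1]

def eigenvector_similarity(emb_a, emb_b, n_neighbors=3):
    """Sum of squared differences of the top-k Laplacian eigenvalues, where k is the
    smallest prefix covering > 90% of the spectrum mass (taken for both graphs)."""
    spec_a = laplacian_spectrum(emb_a, n_neighbors)
    spec_b = laplacian_spectrum(emb_b, n_neighbors)

    def top_k(spec):
        cum = np.cumsum(spec) / spec.sum()
        return int(np.searchsorted(cum, 0.9)) + 1

    k = min(top_k(spec_a), top_k(spec_b))
    return float(np.sum((spec_a[:k] - spec_b[:k]) ** 2))
```

Lower values indicate more similar (more nearly isomorphic) nearest-neighbour graphs.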

207 of 235

Measuring isomorphism

  • Eigenvector similarity correlates strongly with bilingual dictionary induction performance [Søgaard et al., 2018].

(Figure: BDI performance vs. eigenvector similarity for EN-ES and EN-{ET, FI, EL})

208 of 235

Isomorphism

  • Non-linear methods have been proposed by Nakashole (2018) and Zhang et al. (2019).

209 of 235

Isomorphism

  • Non-linear methods have been proposed by Nakashole (2018) and Zhang et al. (2019).
  • Nakashole (2018) combines several independent linear maps to align two vector spaces.
  • Zhang et al. (2019) use an iterative normalization (alternating projection) technique that enables alignment of non-isomorphic vector spaces (see the sketch below).
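
A minimal sketch of the iterative-normalization idea, under the assumption that it simply alternates unit-length normalization and mean centering of a monolingual embedding matrix for a fixed number of iterations (hypothetical function name; see Zhang et al., 2019 for the exact procedure and its guarantees):

```python
import numpy as np

def iterative_normalization(X, n_iters=10):
    """Alternating projection: repeatedly put every vector on the unit sphere
    and every dimension on the zero-mean hyperplane."""
    X = X.astype(float).copy()
    for _ in range(n_iters):
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # length-normalize each word vector
        X -= X.mean(axis=0, keepdims=True)              # mean-center each dimension
    return X
```

Applying this to both monolingual spaces before mapping is intended to make them easier to align with an orthogonal projection.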

210 of 235

Open questions

Do cross-lingual embeddings (and/or BDI) even make sense in the context of different morphologies?

Is non-isomorphism a problem we should deal with during alignment? Or a problem we should deal with when inducing monolingual embeddings?

211 of 235

Evaluation

212 of 235

Open questions

Do cross-lingual embeddings (and/or BDI) even make sense in the context of different morphologies?

Is non-isomorphism a problem we should deal with in alignment? Or a problem we should deal with when inducing monolingual embeddings?

If the purpose of cross-lingual word embeddings is unsupervised MT, bilingual dictionary induction, and cross-lingual transfer, how should we best evaluate them?

213 of 235

Standard approach

  • Bilingual dictionary induction, using the MUSE dictionaries.

214 of 235

Standard approach

  • Bilingual dictionary induction, using the MUSE dictionaries.

We argue:

Bad choice!

215 of 235

Bilingual dictionaries

Approximate number of languages covered:

  • MUSE: 45
  • Wiktionary: 400+
  • Panlex: 9,000+
  • ASJP: 9,000+

216 of 235

The problem with MUSE

  • Dictionaries differ. MUSE is the de facto standard, but suffers from several weaknesses:
  • Coverage gaps: manual inspection shows that many predictions of state-of-the-art methods are correct, but the corresponding forms/senses are absent from MUSE.
  • Proper nouns (25% on average).
  • Errors (1%).

217 of 235

Alternatives to Bilingual Lexicon Induction

Downstream tasks:

  • Document classification: Klementiev et al. (2012), etc.
  • Syntactic analysis: Gouws and Søgaard (2015), etc.
  • Semantic analysis: Glavaš et al. (2019), etc.
  • Machine translation: Conneau et al. (2017), etc.
  • Word alignment: Levy et al. (2017), etc.

218 of 235

BLI correlation with downstream tasks

  • Without RCSLS, BLI performance correlates almost perfectly with downstream performance for XNLI and CLIR, but only weakly for CLDC.
  • Why is RCSLS different?
  • RCSLS relaxes the orthogonality constraint on the projection matrix.

→ For non-orthogonal projections, downstream evaluation is particularly important

[Glavaš et al., ACL-19]

Correlations of model-level results between BLI and each of the downstream tasks:

  • All models: XNLI 0.269, CLDC 0.390, CLIR 0.764
  • All w/o RCSLS: XNLI 0.951, CLDC 0.266, CLIR 0.910

219 of 235

The problem (or beauty) of downstream evaluation

  • Results are likely to be all over the map (Elming et al., 2013).
  • Statistical testing over tasks is unlikely to lead to significance.
  • ⇒ Raises the bar for publication.

220 of 235

Metrics for BDI

  • P@1
  • P@10
  • MRR
  • Spearman’s ρ (rank correlation)
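
A minimal sketch of P@k and MRR over ranked translation candidates (hypothetical names and toy data; `ranked` maps each source word to its ranked candidate list, `gold` to its set of gold translations):

```python
def precision_at_k(ranked, gold, k):
    """P@k: fraction of source words whose top-k candidates contain a gold translation."""
    hits = sum(1 for w in gold if set(ranked[w][:k]) & gold[w])
    return hits / len(gold)

def mean_reciprocal_rank(ranked, gold):
    """MRR: average of 1 / rank of the first correct translation (0 if none is found)."""
    total = 0.0
    for w, translations in gold.items():
        rr = 0.0
        for rank, cand in enumerate(ranked[w], start=1):
            if cand in translations:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(gold)

# Toy example (hypothetical data):
ranked = {"hund": ["dog", "cat", "hound"], "katze": ["mouse", "cat"]}
gold = {"hund": {"dog", "hound"}, "katze": {"cat"}}
print(precision_at_k(ranked, gold, 1))     # 0.5
print(mean_reciprocal_rank(ranked, gold))  # (1.0 + 0.5) / 2 = 0.75
```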

221 of 235

Other methodological weaknesses

  • Unclear what task to evaluate on.
  • Often unclear what metric to use to measure performance.
  • Community-wide overfitting to a small set of Indo-European languages.
  • BUT: More and more people are working on more distant and low-resource language pairs.

222 of 235

Other methodological weaknesses

  • Unclear what task to evaluate on.
  • Often unclear what metric to use to measure performance.
  • Community-wide overfitting to a small set of Indo-European languages.
  • BUT: More and more people are working on more distant and low-resource language pairs.
  • No agreement on how to set hyper-parameters in experiments; in practice, this makes it advantageous to relegate parameters to hyper-parameters.

223 of 235

Other methodological weaknesses

  • Unclear what task to evaluate on.
  • Often unclear what metric to use to measure performance.
  • Community-wide overfitting to a small set of Indo-European languages.
  • BUT: More and more people are working on more distant and low-resource language pairs.
  • No agreement on how to set hyper-parameters in experiments; in practice, this makes it advantageous to relegate parameters to hyper-parameters.
  • Not clear how to best think about and quantify instability.

224 of 235

Conclusions

225 of 235

7. Conclusions and Future Directions

226 of 235

Conclusions: Uncertainty prevails

  • Comparisons have not been systematic so far.
  • It seemed that VecMap was superior, even to supervised approaches; instead, while stochastic dictionary induction (and symmetry) is very useful with bad seeds, (W)GAN(-GP) is a competitive initialization strategy.
  • Supervision does not hurt and is better if you have >1k alignments.
  • Such a ranking of methods, of course, assumes a set of languages, a task, a dataset, and a metric.
  • Even fixing those, our benchmarks are still of poor quality. In sum, we need better benchmarks, a way of dealing with morphology, and a philosophy of isomorphism.

227 of 235

Open questions

Do cross-lingual embeddings (and/or BDI) even make sense in the context of different morphologies?

Is non-isomorphism a problem we should deal with in alignment? Or a problem we should deal with when inducing monolingual embeddings?

If the purpose of cross-lingual word embeddings is unsupervised MT, bilingual dictionary induction, and cross-lingual transfer, how should we best evaluate them?

How strong is the correlation between extrinsic and intrinsic evaluation tasks?

Denoising seed lexica? (Beyond dropout and symmetry?)

Cross-lingual contextualized embeddings [Schuster et al., NAACL-19; Aldarmaki and Diab, NAACL-19]

228 of 235

Open questions

Do cross-lingual embeddings (and/or BDI) even make sense in the context of different morphologies?

Is non-isomorphism a problem we should deal with in alignment? Or a problem we should deal with when inducing monolingual embeddings?

If the purpose of cross-lingual word embeddings is unsupervised MT, bilingual dictionary induction, and cross-lingual transfer, how should we best evaluate them?

How strong is the correlation between extrinsic and intrinsic evaluation tasks?

Denoising seed lexica? (Beyond dropout and symmetry?)

Cross-lingual contextualized embeddings, e.g., [Aldarmaki and Diab, NAACL-19]

Are there scenarios in cross-lingual NLP where we can really benefit from fully unsupervised cross-lingual word embeddings?

229 of 235

Useful Links and Resources

230 of 235

Useful Links

  • Multilingual fastText vectors in 78 languages:

https://github.com/Babylonpartners/fastText_multilingual

  • More multilingual fastText vectors (44 languages, aligned with RCSLS):

https://fasttext.cc/docs/en/aligned-vectors.html

  • Cross-lingual language modeling pretraining:

https://github.com/facebookresearch/XLM

  • Multilingual BERT:

https://github.com/google-research/bert/blob/master/multilingual.md

  • Unsupervised MT (SMT and NMT):

https://github.com/facebookresearch/UnsupervisedMT

https://github.com/artetxem/monoses

https://github.com/artetxem/undreamt

231 of 235

Useful Links

  • VecMap…

https://github.com/artetxem/vecmap

  • ...and its latent variable variant:

https://github.com/sebastianruder/latent-variable-vecmap

  • GAN-based seed induction (MUSE):

https://github.com/facebookresearch/MUSE

  • Gromov-Wasserstein seed induction:

https://github.com/dmelis/otalign

  • Seed induction based on point cloud matching (non-adversarial):

https://github.com/facebookresearch/NAM

232 of 235

Useful Links

  • GANs + Sinkhorn:

https://github.com/xrc10/unsup-cross-lingual-embedding-transfer

  • Unsupervised cross-lingual IR:

https://github.com/rlitschk/UnsupCLIR

  • BLI - classification framework:

https://github.com/geert-heyman/BLI_classifier

  • (Cross-lingual) dialogue state tracking:

https://github.com/nmrksic/neural-belief-tracker

https://github.com/salesforce/glad

https://github.com/wenhuchen/Cross-Lingual-NBT

233 of 235

...And Some Resources (!)

  • 135M parallel sentences in 1,620 language pairs:

https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix

  • Coming soon: JW300 (presented at ACL-19):

At least 50k-100k parallel sentences for 54,376 language pairs

  • Plus OPUS and other similar repos...

234 of 235

Questions?

Slides: https://tinyurl.com/xlingual

If you found these slides helpful, consider citing the tutorial as:

@inproceedings{ruder2019unsupervised,
  title={Unsupervised Cross-Lingual Representation Learning},
  author={Ruder, Sebastian and S{\o}gaard, Anders and Vuli{\'c}, Ivan},
  booktitle={Proceedings of ACL 2019, Tutorial Abstracts},
  pages={31--38},
  year={2019}
}


235 of 235

Some Important Aspects Not Covered Here

  • Other evaluation tasks beyond BLI

[Upadhyay et al., ACL-16; Heyman et al., NAACL-19; Glavaš et al., ACL-19]

  • Different ways to use cross-lingual word embeddings in a variety of applications

[Guo et al., ACL-15; Mrkšić et al., TACL-17; Vulić et al., EMNLP-17, Upadhyay et al., NAACL-18]

  • Quantifying approximate isomorphism and correlating it to CLWE task performance

[Søgaard et al., ACL-18; Hartmann et al., arXiv-18; Alvarez Melis and Jaakkola, EMNLP-18, Patra et al., ACL-19; Fujinuma et al., ACL-19]

  • Different methods to reduce anisomorphism, or to improve model selection, or to refine the initial mapping

[Doval et al., EMNLP-18, Zhang et al., ACL-19, Patra et al., ACL-19]

  • Other probing tests: hyper-parameter variation, different algorithms, word frequency

[Søgaard et al., ACL-18; Hartmann et al., EMNLP-18; Braune et al., NAACL-18]

  • Other similar approaches: Sinkhorn distances, multilingual learning

[Grave et al., ICLR-18; Wu et al., EMNLP-18; Heyman et al., NAACL-19, Alaux et al., ICLR-19]