1 of 27

Unsupervised Approaches for Neural Machine Translation

Facebook: Howard Lo (羅右鈞)

GitHub: howardyclo

2 of 27

Outline

3 of 27

Motivation

  • Neural Machine Translation (NMT) requires large-scale parallel corpora in order to achieve good performance.
  • Even translating low-resource language pairs requires tens of thousands of parallel sentence pairs.
  • Monolingual data is much easier to find.

4 of 27

Core Techniques

  • Word embeddings (Bilingual/Cross-lingual)
  • Sequence-to-sequence models
  • Adversarial training (Domain Adaptation)
  • Denoising & Back-Translation

5 of 27

Unsupervised Machine Translation using Monolingual Corpora Only

  • Authors: Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato
  • Organization: Facebook AI Research
  • Conference: ICLR 2018
  • Link: https://openreview.net/forum?id=rkYTTf-AZ

6 of 27

Model

  • Pretrain fastText embeddings (Bojanowski et al. 2017) for both languages, then apply the unsupervised word translation method of Conneau et al. (2017) to infer a bilingual dictionary.
  • Use a single bidirectional RNN encoder and a single RNN decoder for both languages.
  • Use a sequence-to-sequence model with attention (Bahdanau et al. 2015), as sketched below.
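As a concrete illustration of this setup, here is a minimal PyTorch sketch of a shared bidirectional encoder and a shared attentional decoder. It is not the authors' code; the GRU cells, hidden sizes, and the simple additive attention are illustrative assumptions.

# Minimal sketch (not the authors' implementation) of the shared encoder/decoder used for
# both languages. Cell type, sizes, and the attention form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # initialized from cross-lingual embeddings
        self.rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):                          # tokens: (batch, src_len)
        return self.rnn(self.embed(tokens))[0]          # latent states: (batch, src_len, 2*hid_dim)

class SharedAttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=300, enc_dim=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + enc_dim, hid_dim, batch_first=True)
        self.attn_score = nn.Linear(hid_dim + enc_dim, 1)  # additive attention scorer
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_tokens, enc_states):          # teacher-forced decoding
        emb = self.embed(prev_tokens)                     # (batch, tgt_len, emb_dim)
        hidden, logits = None, []
        for t in range(emb.size(1)):
            h = hidden[-1] if hidden is not None else torch.zeros(emb.size(0), self.rnn.hidden_size)
            scores = self.attn_score(torch.cat(
                [h.unsqueeze(1).expand(-1, enc_states.size(1), -1), enc_states], dim=-1)).squeeze(-1)
            context = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), enc_states)
            step_in = torch.cat([emb[:, t:t + 1, :], context], dim=-1)
            out, hidden = self.rnn(step_in, hidden)
            logits.append(self.out(out))
        return torch.cat(logits, dim=1)                   # (batch, tgt_len, vocab_size)

The same encoder and decoder instances serve both languages; how the decoder is told which language to generate, and how the pretrained cross-lingual embeddings are plugged in, is omitted here for brevity.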

7 of 27

Objective 1: Denoising Autoencoder
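As a reminder of its form (reconstructed from the paper's description, so treat the notation as approximate): the denoising objective trains the shared encoder e and decoder d to reconstruct a sentence from a corrupted version of it in the same language.

\mathcal{L}_{auto}(\theta, \ell) = \mathbb{E}_{x \sim \mathcal{D}_{\ell},\ \hat{x} \sim d(e(C(x), \ell), \ell)} \big[ \Delta(\hat{x}, x) \big]

Here C(x) is a noisy version of x (random word drops and local swaps), and \Delta is the token-level cross-entropy between the reconstruction \hat{x} and the original x.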

8 of 27

Objective 2: Back-Translation (Cross Domain Training)
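Roughly (again reconstructing the notation), the cross-domain objective takes a sentence x in language \ell_1, translates it to \ell_2 with the current model M, adds noise, and trains the model to recover the original x from that pseudo-translation:

\mathcal{L}_{cd}(\theta, \ell_1, \ell_2) = \mathbb{E}_{x \sim \mathcal{D}_{\ell_1},\ \hat{x} \sim d(e(C(M(x)), \ell_2), \ell_1)} \big[ \Delta(\hat{x}, x) \big]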

9 of 27

Objective 3: Adversarial Training

  • Adversarial loss for the discriminator (a 3-layer feedforward NN with hidden size 1024 and ReLU activations):

  • Adversarial loss for encoder:
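Both losses can be written approximately as follows (reconstructed notation): the discriminator p_D is trained to predict the language of a latent representation, and the encoder is trained to fool it into predicting the other language \ell_j (j \neq i).

\mathcal{L}_{D} = -\, \mathbb{E}_{(x_i, \ell_i)} \big[ \log p_D(\ell_i \mid e(x_i, \ell_i)) \big]

\mathcal{L}_{adv} = -\, \mathbb{E}_{(x_i, \ell_i)} \big[ \log p_D(\ell_j \mid e(x_i, \ell_i)) \big]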

10 of 27

Final Objective
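The final loss combines the three objectives over both translation directions, weighted by hyperparameters (notation approximate):

\mathcal{L} = \lambda_{auto} \big[ \mathcal{L}_{auto}(src) + \mathcal{L}_{auto}(tgt) \big] + \lambda_{cd} \big[ \mathcal{L}_{cd}(src, tgt) + \mathcal{L}_{cd}(tgt, src) \big] + \lambda_{adv} \, \mathcal{L}_{adv}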

11 of 27

Unsupervised Training Algorithm
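A high-level sketch of one training iteration, with the data batches, loss terms, and optimizers abstracted behind hypothetical helpers; none of these names come from the authors' code.

# Sketch of one iteration of the unsupervised training algorithm. `losses`, `optimizers`
# and `lambdas` are hypothetical containers for the objectives and weights defined above.
def train_iteration(batch_src, batch_tgt, model, discriminator, losses, optimizers, lambdas):
    # 1) Denoising autoencoding: reconstruct each sentence from a noised copy, per language.
    l_auto = losses.denoise(model, batch_src, lang="src") + losses.denoise(model, batch_tgt, lang="tgt")

    # 2) Back-translation (cross-domain): translate with the model from the previous iteration,
    #    then train the current model to recover the original sentence.
    l_cd = losses.back_translate(model, batch_src, "src", "tgt") + \
           losses.back_translate(model, batch_tgt, "tgt", "src")

    # 3) Adversarial term: push the encoder towards language-agnostic latent representations.
    l_adv = losses.adversarial(model, discriminator, batch_src, batch_tgt)

    total = lambdas.auto * l_auto + lambdas.cd * l_cd + lambdas.adv * l_adv
    optimizers.model.zero_grad(); total.backward(); optimizers.model.step()

    # 4) Separately update the discriminator to predict the language of each latent state.
    l_disc = losses.discriminate(model, discriminator, batch_src, batch_tgt)
    optimizers.discriminator.zero_grad(); l_disc.backward(); optimizers.discriminator.step()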

12 of 27

Model Selection

Since there is no parallel data for validation, they use a surrogate criterion:
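The criterion is, in essence, round-trip BLEU: translate monolingual sentences into the other language and back with the current models, then compare the result against the original, averaged over both directions (notation reconstructed):

MS(e, d) = \tfrac{1}{2} \, \mathbb{E}_{x \sim \mathcal{D}_{src}} \big[ \mathrm{BLEU}(x, M_{tgt \to src}(M_{src \to tgt}(x))) \big] + \tfrac{1}{2} \, \mathbb{E}_{x \sim \mathcal{D}_{tgt}} \big[ \mathrm{BLEU}(x, M_{src \to tgt}(M_{tgt \to src}(x))) \big]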

13 of 27

Datasets

  • WMT’14 English-French (En-Fr): 36 million sentence pairs; about 30 million pairs remain after preprocessing.
  • The English monolingual corpus is built by randomly selecting 15 million sentences, and the French monolingual corpus is drawn from the complementary set. A further 3,000 En/Fr sentences are extracted from the En/Fr training data.
  • WMT’16 English-German: same procedure as above, resulting in 1.8 million sentences for each language. The newstest2016 dataset is used as the test set.
  • Multi30k-Task1: a multilingual image-description dataset containing ~30,000 images. Training/validation/test sets: 14,500/500/1,000 sentences after preprocessing.

14 of 27

Baselines

  • Word-by-word translation (WBW)
  • Word reordering (WR): use an LSTM-based language model to rerank candidates obtained from the WBW output by swapping neighboring word pairs (see the sketch after this list).
  • Oracle word reordering (ORW): an upper-bound baseline.
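One possible reading of the WR baseline, as a toy sketch: starting from the WBW output, greedily accept a swap of two neighboring words whenever the language model scores the swapped sentence higher. lm_score is a hypothetical function returning a sentence-level log-probability (e.g. from an LSTM LM).

# Toy sketch of the word-reordering (WR) baseline. `lm_score` is a hypothetical callable
# that returns a sentence-level log-probability from a language model.
def reorder(wbw_tokens, lm_score):
    tokens = list(wbw_tokens)
    improved = True
    while improved:
        improved = False
        for i in range(len(tokens) - 1):
            swapped = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
            if lm_score(swapped) > lm_score(tokens):  # keep a neighbor swap only if the LM prefers it
                tokens, improved = swapped, True
    return tokens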

15 of 27

Experiment Results

16 of 27

Experiment Results

Horizontal lines: performance of the unsupervised NMT model, which leverages 15 million monolingual sentences. This is close to the performance of a supervised NMT model trained on 100,000 parallel sentences.

17 of 27

Ablation Study

  • The most critical components are unsupervised word translation and back-translation.
  • The second most critical component is the denoising autoencoder, especially the noise model.
  • Last but not least is adversarial training: to really benefit from back-translation, one has to ensure that the distributions of latent sentence representations are similar across the two languages.

18 of 27

Unsupervised Neural Machine Translation

  • Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre & Kyunghyun Cho*
  • Organization: IXA NLP Group (University of the Basque Country), NYU*
  • Conference: ICLR 2018
  • Link: https://arxiv.org/pdf/1710.11041.pdf
  • The proposed method overlaps substantially with that of the previous paper (Paper1).

19 of 27

Similarities between the Papers

  • Both of them use monolingual corpora only.
  • Both use denoising autoencoders.
  • Both use back-translation.
  • Both use a shared bidirectional RNN encoder.

20 of 27

Differences between the Papers

  • Paper1 additionally uses adversarial training, while Paper2 doesn’t.
  • For cross-lingual embeddings, Paper1 uses fastText (Bojanowski et al. 2017) + MUSE (Conneau et al. 2017), while Paper2 uses word2vec (Mikolov et al. 2013) + VecMap (Artetxe et al. 2017); both approaches rely on learning an orthogonal mapping between the two embedding spaces (see the sketch after this list).
  • Paper1 updates the word embeddings during training, while Paper2 keeps them fixed.
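For intuition, the mapping step shared by this family of methods can be sketched as an orthogonal Procrustes problem: given paired source/target vectors for a (seed) dictionary, find the orthogonal matrix that best aligns them. This is a generic sketch of that step, not the code of either toolkit.

# Generic sketch of the orthogonal mapping step used by offline cross-lingual embedding
# methods. Rows of X (source) and Y (target) are embeddings of dictionary word pairs.
import numpy as np

def procrustes(X, Y):
    # W minimizes ||X @ W - Y||_F subject to W being orthogonal (Procrustes solution).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage: map all source vectors into the target space, then induce a bilingual dictionary
# by nearest-neighbor (or CSLS) retrieval over the mapped vectors.
# mapped_src = src_embeddings @ procrustes(X_dict, Y_dict)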

21 of 27

Differences between the Papers

  • Paper1 uses a shared decoder, while Paper2 uses two language-specific decoders.

22 of 27

Differences between the Papers

  • For the noise model in the denoising objective, Paper1 uses both word swapping and word dropping, while Paper2 only uses swapping.
  • For the training procedure, Paper1 performs denoising, back-translation and adversarial training within a single iteration, while Paper2 alternates between denoising and back-translation.
  • Paper1 proposes a model selection method, while Paper2 doesn’t.
  • The monolingual datasets used in Paper2 also differ from those in Paper1 (see the papers for details).
  • Preprocessing in Paper2 additionally uses byte pair encoding (BPE) with a vocabulary size of 50,000, yielding a slight improvement on English-German translation (a toy sketch of BPE follows).
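For reference, a toy sketch of how BPE merge operations are learned from a word-frequency table. Real preprocessing would use a toolkit such as subword-nmt with on the order of 50,000 merges and would handle end-of-word markers, which are omitted here.

# Toy sketch of learning BPE merge operations (Sennrich et al. 2016). End-of-word markers
# and efficiency tricks are omitted; this only illustrates the merge-learning loop.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"lower": 5, "low": 7, "newest": 3}, num_merges=10))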

23 of 27

Differences between the Papers

  • Paper2 also experiments with semi-supervised settings (10,000 and 100,000 parallel sentences), while Paper1 doesn’t.

24 of 27

Quantitative Analysis in Paper2

Annotations on the example translations:

  • Preserves keywords but lacks fluency and adequacy.
  • Lacks adequacy.
  • Goes beyond word-by-word translation.
  • Can model structural differences between languages.

25 of 27

Code

26 of 27

References

27 of 27

References