1 of 27

Unsupervised Approaches for Neural Machine Translation

Facebook: Howard Lo (羅右鈞)

GitHub: howardyclo

2 of 27

Outline

3 of 27

Motivation

  • Neural Machine Translation (NMT) requires large-scale parallel corpora in order to achieve good performance.
  • Even translating low-resource language pairs requires tens of thousands of parallel sentence pairs.
  • Monolingual data is much easier to find.

4 of 27

Core Techniques

  • Word embeddings (Bilingual/Cross-lingual)
  • Sequence-to-sequence models
  • Adversarial training (Domain Adaptation)
  • Denoising & Back-Translation

5 of 27

Unsupervised Machine Translation using Monolingual Corpora Only

  • Authors: Guillaume Lample, Ludovic Denoyer, Marc'Aurelio Ranzato
  • Organization: Facebook AI Research
  • Conference: ICLR 2018
  • Link: https://openreview.net/forum?id=rkYTTf-AZ

6 of 27

Model

  • Pretrain fastText embeddings (Bojanowski et al. 2017) for both languages, then apply the unsupervised word translation method of Conneau et al. (2017) to infer a bilingual dictionary.
  • Use a single bidirectional RNN encoder and a single RNN decoder for both languages.
  • Use a sequence-to-sequence model with attention (Bahdanau et al. 2015), as sketched below.
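As a concrete illustration of this setup, here is a minimal PyTorch sketch of a shared bidirectional encoder and a shared attentional decoder. It is not the authors' code; the GRU cells, hidden sizes, and the simple additive attention are illustrative assumptions.

# Minimal sketch (not the authors' implementation) of the shared encoder/decoder used for
# both languages. Cell type, sizes, and the attention form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # initialized from cross-lingual embeddings
        self.rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):                          # tokens: (batch, src_len)
        return self.rnn(self.embed(tokens))[0]          # latent states: (batch, src_len, 2*hid_dim)

class SharedAttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=300, enc_dim=600):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + enc_dim, hid_dim, batch_first=True)
        self.attn_score = nn.Linear(hid_dim + enc_dim, 1)  # additive attention scorer
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_tokens, enc_states):          # teacher-forced decoding
        emb = self.embed(prev_tokens)                     # (batch, tgt_len, emb_dim)
        hidden, logits = None, []
        for t in range(emb.size(1)):
            h = hidden[-1] if hidden is not None else torch.zeros(emb.size(0), self.rnn.hidden_size)
            scores = self.attn_score(torch.cat(
                [h.unsqueeze(1).expand(-1, enc_states.size(1), -1), enc_states], dim=-1)).squeeze(-1)
            context = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), enc_states)
            step_in = torch.cat([emb[:, t:t + 1, :], context], dim=-1)
            out, hidden = self.rnn(step_in, hidden)
            logits.append(self.out(out))
        return torch.cat(logits, dim=1)                   # (batch, tgt_len, vocab_size)

The same encoder and decoder instances serve both languages; how the decoder is told which language to generate, and how the pretrained cross-lingual embeddings are plugged in, is omitted here for brevity.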

7 of 27

Objective 1: Denoising Autoencoder
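As a reminder of its form (reconstructed from the paper's description, so treat the notation as approximate): the denoising objective trains the shared encoder e and decoder d to reconstruct a sentence from a corrupted version of it in the same language.

\mathcal{L}_{auto}(\theta, \ell) = \mathbb{E}_{x \sim \mathcal{D}_{\ell},\ \hat{x} \sim d(e(C(x), \ell), \ell)} \big[ \Delta(\hat{x}, x) \big]

Here C(x) is a noisy version of x (random word drops and local swaps), and \Delta is the token-level cross-entropy between the reconstruction \hat{x} and the original x.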

8 of 27

Objective 2: Back-Translation (Cross Domain Training)
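Roughly (again reconstructing the notation), the cross-domain objective takes a sentence x in language \ell_1, translates it to \ell_2 with the current model M, adds noise, and trains the model to recover the original x from that pseudo-translation:

\mathcal{L}_{cd}(\theta, \ell_1, \ell_2) = \mathbb{E}_{x \sim \mathcal{D}_{\ell_1},\ \hat{x} \sim d(e(C(M(x)), \ell_2), \ell_1)} \big[ \Delta(\hat{x}, x) \big]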

9 of 27

Objective 3: Adversarial Training

  • Adversarial loss for the discriminator (a 3-layer feedforward NN with hidden size 1024 and ReLU activations):

  • Adversarial loss for encoder:
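Both losses can be written approximately as follows (reconstructed notation): the discriminator p_D is trained to predict the language of a latent representation, and the encoder is trained to fool it into predicting the other language \ell_j (j \neq i).

\mathcal{L}_{D} = -\, \mathbb{E}_{(x_i, \ell_i)} \big[ \log p_D(\ell_i \mid e(x_i, \ell_i)) \big]

\mathcal{L}_{adv} = -\, \mathbb{E}_{(x_i, \ell_i)} \big[ \log p_D(\ell_j \mid e(x_i, \ell_i)) \big]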

10 of 27

Final Objective
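The final loss combines the three objectives over both translation directions, weighted by hyperparameters (notation approximate):

\mathcal{L} = \lambda_{auto} \big[ \mathcal{L}_{auto}(src) + \mathcal{L}_{auto}(tgt) \big] + \lambda_{cd} \big[ \mathcal{L}_{cd}(src, tgt) + \mathcal{L}_{cd}(tgt, src) \big] + \lambda_{adv} \, \mathcal{L}_{adv}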

11 of 27

Unsupervised Training Algorithm
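A high-level sketch of one training iteration, with the data batches, loss terms, and optimizers abstracted behind hypothetical helpers; none of these names come from the authors' code.

# Sketch of one iteration of the unsupervised training algorithm. `losses`, `optimizers`
# and `lambdas` are hypothetical containers for the objectives and weights defined above.
def train_iteration(batch_src, batch_tgt, model, discriminator, losses, optimizers, lambdas):
    # 1) Denoising autoencoding: reconstruct each sentence from a noised copy, per language.
    l_auto = losses.denoise(model, batch_src, lang="src") + losses.denoise(model, batch_tgt, lang="tgt")

    # 2) Back-translation (cross-domain): translate with the model from the previous iteration,
    #    then train the current model to recover the original sentence.
    l_cd = losses.back_translate(model, batch_src, "src", "tgt") + \
           losses.back_translate(model, batch_tgt, "tgt", "src")

    # 3) Adversarial term: push the encoder towards language-agnostic latent representations.
    l_adv = losses.adversarial(model, discriminator, batch_src, batch_tgt)

    total = lambdas.auto * l_auto + lambdas.cd * l_cd + lambdas.adv * l_adv
    optimizers.model.zero_grad(); total.backward(); optimizers.model.step()

    # 4) Separately update the discriminator to predict the language of each latent state.
    l_disc = losses.discriminate(model, discriminator, batch_src, batch_tgt)
    optimizers.discriminator.zero_grad(); l_disc.backward(); optimizers.discriminator.step()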

12 of 27

Model Selection

Since there is no parallel data for validation, they use a surrogate criterion:
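The criterion is, in essence, round-trip BLEU: translate monolingual sentences into the other language and back with the current models, then compare the result against the original, averaged over both directions (notation reconstructed):

MS(e, d) = \tfrac{1}{2} \, \mathbb{E}_{x \sim \mathcal{D}_{src}} \big[ \mathrm{BLEU}(x, M_{tgt \to src}(M_{src \to tgt}(x))) \big] + \tfrac{1}{2} \, \mathbb{E}_{x \sim \mathcal{D}_{tgt}} \big[ \mathrm{BLEU}(x, M_{src \to tgt}(M_{tgt \to src}(x))) \big]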

13 of 27

Datasets

  • WMT’14 English-French (En-Fr): 36 million sentence pairs; about 30 million pairs remain after preprocessing.
  • The English monolingual corpus is built by randomly selecting 15 million sentences, and the French monolingual corpus is drawn from the complementary set. A further 3,000 En/Fr sentences are extracted from the En/Fr training data.
  • WMT’16 English-German: same procedure as above, resulting in 1.8 million sentences for each language. The newstest2016 dataset is used as the test set.
  • Multi30k-Task1: a multilingual image-description dataset containing ~30,000 images. Training/validation/test sets: 14,500/500/1,000 sentences after preprocessing.

14 of 27

Baselines

  • Word-by-word translation (WBW)
  • Word reordering (WR): use an LSTM-based language model to rerank candidates obtained from the WBW output by swapping neighboring word pairs (see the sketch after this list).
  • Oracle word reordering (ORW): an upper-bound baseline.
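One possible reading of the WR baseline, as a toy sketch: starting from the WBW output, greedily accept a swap of two neighboring words whenever the language model scores the swapped sentence higher. lm_score is a hypothetical function returning a sentence-level log-probability (e.g. from an LSTM LM).

# Toy sketch of the word-reordering (WR) baseline. `lm_score` is a hypothetical callable
# that returns a sentence-level log-probability from a language model.
def reorder(wbw_tokens, lm_score):
    tokens = list(wbw_tokens)
    improved = True
    while improved:
        improved = False
        for i in range(len(tokens) - 1):
            swapped = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
            if lm_score(swapped) > lm_score(tokens):  # keep a neighbor swap only if the LM prefers it
                tokens, improved = swapped, True
    return tokens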

15 of 27

Experiment Results

16 of 27

Experiment Results

Horizontal lines: performance of the unsupervised NMT model, which leverages 15 million monolingual sentences. This is close to the performance of a supervised NMT model trained on 100,000 parallel sentences.

17 of 27

Ablation Study

  • The most critical components are unsupervised word translation and back-translation.
  • The second most critical component is the denoising autoencoder, especially the noise model.
  • Last but not least is adversarial training: to really benefit from back-translation, one has to ensure that the distributions of latent sentence representations are similar across the two languages.

18 of 27

Unsupervised Neural Machine Translation

  • Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre & Kyunghyun Cho*
  • Organization: IXA NLP Group (University of the Basque Country), NYU*
  • Conference: ICLR 2018
  • Link: https://arxiv.org/pdf/1710.11041.pdf
  • The proposed method overlaps substantially with that of the previous paper (Paper1).

19 of 27

Similarities between the Papers

  • Both of them use monolingual corpora only.
  • Both use denoising autoencoders.
  • Both use back-translation.
  • Both use a shared bidirectional RNN encoder.

20 of 27

Differences between the Papers

  • Paper1 additionally uses adversarial training, while Paper2 doesn’t.
  • For cross-lingual embeddings, Paper1 uses fastText (Bojanowski et al. 2017) + MUSE (Conneau et al. 2017), while Paper2 uses word2vec (Mikolov et al. 2013) + VecMap (Artetxe et al. 2017); both approaches rely on learning an orthogonal mapping between the two embedding spaces (see the sketch after this list).
  • Paper1 updates the word embeddings during training, while Paper2 keeps them fixed.
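For intuition, the mapping step shared by this family of methods can be sketched as an orthogonal Procrustes problem: given paired source/target vectors for a (seed) dictionary, find the orthogonal matrix that best aligns them. This is a generic sketch of that step, not the code of either toolkit.

# Generic sketch of the orthogonal mapping step used by offline cross-lingual embedding
# methods. Rows of X (source) and Y (target) are embeddings of dictionary word pairs.
import numpy as np

def procrustes(X, Y):
    # W minimizes ||X @ W - Y||_F subject to W being orthogonal (Procrustes solution).
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage: map all source vectors into the target space, then induce a bilingual dictionary
# by nearest-neighbor (or CSLS) retrieval over the mapped vectors.
# mapped_src = src_embeddings @ procrustes(X_dict, Y_dict)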

21 of 27

Differences between the Papers

  • Paper1 uses a shared decoder, while Paper2 uses two language-specific decoders.

22 of 27

Differences between the Papers

  • For the noise model in the denoising objective, Paper1 uses both word swapping and word dropping, while Paper2 only uses swapping.
  • For the training procedure, Paper1 performs denoising, back-translation and adversarial training within a single iteration, while Paper2 alternates between denoising and back-translation.
  • Paper1 proposes a model selection method, while Paper2 doesn’t.
  • The monolingual datasets used in Paper2 also differ from those in Paper1 (see the papers for details).
  • Preprocessing in Paper2 additionally uses byte pair encoding (BPE) with a vocabulary size of 50,000, yielding a slight improvement on English-German translation (a toy sketch of BPE follows).
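For reference, a toy sketch of how BPE merge operations are learned from a word-frequency table. Real preprocessing would use a toolkit such as subword-nmt with on the order of 50,000 merges and would handle end-of-word markers, which are omitted here.

# Toy sketch of learning BPE merge operations (Sennrich et al. 2016). End-of-word markers
# and efficiency tricks are omitted; this only illustrates the merge-learning loop.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent symbol pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"lower": 5, "low": 7, "newest": 3}, num_merges=10))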

23 of 27

Differences between the Papers

  • Paper2 also experiments with semi-supervised settings (10,000 and 100,000 parallel sentences), while Paper1 doesn’t.

24 of 27

Quantitative Analysis in Paper2

Annotations on the example translations:

  • Preserves keywords but lacks fluency and adequacy.
  • Lacks adequacy.
  • Goes beyond word-by-word translation.
  • Can model structural differences between languages.

25 of 27

Code

26 of 27

References

27 of 27

References