1 of 10

Supervised Alignment in Attention Model

Pengyu He

Becky Marvin

2 of 10

Attention in Neural Machine Translation

  • Training objective

  • Modelling with a neural network

The α_{t,i} are the components of the attention matrix; together they form a soft alignment generated by the neural network.
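
For reference, a minimal sketch of the standard attention computation of Bahdanau et al. (2015), with s_{t-1} the previous decoder state and h_i the encoder annotation of source position i (notation assumed here):

\[
e_{t,i} = a(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}, \qquad
c_t = \sum_{i} \alpha_{t,i}\, h_i
\]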

3 of 10

Supervised Attention

  • Adding an alignment cost to the training objective

  • Different forms of alignment cost (a sketch follows this list)

  • Mean Squared Error

  • Multiplication

  • Cross Entropy
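
A hedged sketch of how such a joint objective can be written; the weight λ, the reference alignment \hat{\alpha} from an external aligner, and the exact cost definitions below are illustrative assumptions (the cited papers' formulations may differ in detail):

\[
J(\theta) = -\log p(\mathbf{y} \mid \mathbf{x}; \theta) + \lambda\, G(\alpha, \hat{\alpha})
\]
\[
G_{\text{MSE}}(\alpha, \hat{\alpha}) = \sum_{t,i} \left(\alpha_{t,i} - \hat{\alpha}_{t,i}\right)^2, \qquad
G_{\text{CE}}(\alpha, \hat{\alpha}) = -\sum_{t,i} \hat{\alpha}_{t,i} \log \alpha_{t,i}
\]

The multiplication variant similarly rewards elementwise agreement between \alpha and \hat{\alpha}.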

5 of 10

Word Alignment

  • Alignment methods

  • GIZA++ (Liu et al., 2016; Chen et al., 2016; Mi et al., 2016) (one- or multi-hot alignment; a conversion sketch follows this list)

- Based on IBM Model 4 or the HMM alignment model

  • MaxEnt (Mi et al., 2016) (one- or multi-hot alignment)

- Based on the maximum entropy method

  • Neural aligner (Alkhouli et al., 2016) (soft alignment)

- Models jumps instead of alignments

- Uses a window of size 2m+1 on the source side and an n-gram context on the target side

- The training alignments were produced by GIZA++; the decoder is based on the neural aligner.
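
To make the one- or multi-hot output of an aligner such as GIZA++ usable as a supervision target for the attention weights, one common recipe is to row-normalize the alignment links; a minimal sketch (the function name and the normalization convention are assumptions, not taken from the cited papers):

    import numpy as np

    def links_to_reference_attention(links, src_len, tgt_len):
        """Turn hard alignment links (tgt_pos, src_pos), e.g. from GIZA++,
        into a reference attention matrix of shape (tgt_len, src_len)."""
        ref = np.zeros((tgt_len, src_len))
        for t, s in links:
            ref[t, s] = 1.0                      # one- or multi-hot row
        row_sums = ref.sum(axis=1, keepdims=True)
        # Normalize multi-hot rows into distributions; unaligned target
        # words keep an all-zero row (one possible convention).
        np.divide(ref, row_sums, out=ref, where=row_sums > 0)
        return ref

    # Example: target word 0 aligned to source 1; target word 1 aligned to sources 2 and 3
    ref = links_to_reference_attention([(0, 1), (1, 2), (1, 3)], src_len=4, tgt_len=2)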

6 of 10

Alignment-Based Decoder (Alkhouli et al., 2016)

  • Translation model, factored into a lexical and an alignment model (a decoding sketch follows this list)

  • Lexical model: Feed-forward Joint Model (FFJM)

  • Alignment model: Feed-forward Alignment Model (FFAM)
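
A rough, hedged sketch of how such an alignment-based decoder could interleave the two models during search; ffam_predict_jump and ffjm_predict_word are hypothetical stand-ins for the FFAM and FFJM, and the real decoder of Alkhouli et al. (2016) uses beam search rather than the greedy loop shown here:

    def alignment_based_greedy_decode(source, ffam_predict_jump, ffjm_predict_word,
                                      max_len=100, eos="</s>"):
        """Alternate alignment (jump) and lexical predictions, one target word per step."""
        target, aligned_pos = [], 0
        for _ in range(max_len):
            # Alignment model (FFAM): predict a jump relative to the current source position.
            jump = ffam_predict_jump(source, target, aligned_pos)
            aligned_pos = min(max(aligned_pos + jump, 0), len(source) - 1)
            # Lexical model (FFJM): predict the next target word given the aligned source window.
            word = ffjm_predict_word(source, target, aligned_pos)
            if word == eos:
                break
            target.append(word)
        return target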

7 of 10

Advantages of Supervised Attention

  • GIZA++ (Liu et al., 2016)

- Implements supervision of attention as a regularizer in the joint training objective

- Serves to mitigate the vanishing gradient problem

  • GIZA++ (Chen et al., 2016)

- Deals with problems of disordered output and word repetition seen in unsupervised attention, due to long input sentences, OOV words and placeholders.

  • Neural aligner (Alkhouli et al., 2016)

Retraining using forced alignments has two benefits:

- Since the alignments are produced using both the lexical and the alignment model, this can be seen as joint training of the two models.

- Since the neural decoder generates these alignments, training the neural models on them yields models that are more consistent with the neural decoder.

8 of 10

Results

  • Aligner = GIZA++, Alignment Cost = Cross Entropy (Liu et al., 2016)

  • Aligner = GIZA++, Alignment Cost = Cross Entropy & Squared Error (Liu et al., 2016)

  • NMT1 and NMT2 are implementations of Bahdanau et al. (2015)
  • SA-NMT stands for Supervised Alignment NMT
  • The development set is NIST02; the test sets are NIST05, NIST06 and NIST08
  • The baseline NMT is trained on English-French WMT data (Common Crawl, Europarl v7)
  • The numbers denote the ratio of likelihood to alignment cost
  • Decay means the alignment-cost weight is reduced to 90% of its value after every epoch (a sketch of this schedule follows below)
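
A minimal sketch of that decay schedule, with λ_0 the initial alignment-cost weight and e the epoch index (symbols assumed here, not from the slides):

\[
\lambda_e = 0.9^{\,e}\, \lambda_0
\]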

9 of 10

Results

  • Aligner = Neural aligner (Alkhouli et al., 2016)

  • Task: IWSLT 2013 German-English
  • Monolingual data: the language model is trained on additional monolingual data
  • Fine-tuning: the model was further tuned on in-domain data
  • dp and wp: linear distortion penalty and word penalty
  • Did not outperform the phrase-based system

10 of 10

Summary

  • Adding an alignment cost to the training objective

  • Different forms of alignment cost

- Mean Squared Error

- Multiplication

- Cross Entropy

  • Alignment methods

- GIZA++ (most popular), MaxEnt, neural models

  • Advantages

- Regularization: mitigates the vanishing gradient problem

- Addresses disordered output and word repetition