1 of 10

Supervised Alignment in Attention Model

Pengyu He

Becky Marvin

2 of 10

Attention in Neural Machine Translation

  • Training objective

  • Modelling with a neural network

The α_{t,i} are the components of the attention matrix; together they form a soft alignment generated by the neural network.
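
For reference, a minimal sketch of the standard attention computation of Bahdanau et al. (2015), with s_{t-1} the previous decoder state and h_i the encoder annotation of source position i (notation assumed here):

\[
e_{t,i} = a(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k} \exp(e_{t,k})}, \qquad
c_t = \sum_{i} \alpha_{t,i}\, h_i
\]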

3 of 10

Supervised Attention

  • Adding an alignment cost to the training objective

  • Different forms of alignment cost (a sketch follows this list)

  • Mean Squared Error

  • Multiplication

  • Cross Entropy
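
A hedged sketch of how such a joint objective can be written; the weight λ, the reference alignment \hat{\alpha} from an external aligner, and the exact cost definitions below are illustrative assumptions (the cited papers' formulations may differ in detail):

\[
J(\theta) = -\log p(\mathbf{y} \mid \mathbf{x}; \theta) + \lambda\, G(\alpha, \hat{\alpha})
\]
\[
G_{\text{MSE}}(\alpha, \hat{\alpha}) = \sum_{t,i} \left(\alpha_{t,i} - \hat{\alpha}_{t,i}\right)^2, \qquad
G_{\text{CE}}(\alpha, \hat{\alpha}) = -\sum_{t,i} \hat{\alpha}_{t,i} \log \alpha_{t,i}
\]

The multiplication variant similarly rewards elementwise agreement between \alpha and \hat{\alpha}.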

5 of 10

Word Alignment

  • Alignment methods

  • GIZA++ (Liu et al., 2016; Chen et al., 2016; Mi et al., 2016) (one- or multi-hot alignment; a conversion sketch follows this list)

- Based on IBM Model 4 or the HMM alignment model

  • MaxEnt (Mi et al., 2016) (one- or multi-hot alignment)

- Based on the maximum entropy method

  • Neural aligner (Alkhouli et al., 2016) (soft alignment)

- Models jumps instead of alignments

- Uses a window of size 2m+1 on the source side and an n-gram context on the target side

- The training alignments were produced by GIZA++; the decoder is based on the neural aligner.
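
To make the one- or multi-hot output of an aligner such as GIZA++ usable as a supervision target for the attention weights, one common recipe is to row-normalize the alignment links; a minimal sketch (the function name and the normalization convention are assumptions, not taken from the cited papers):

    import numpy as np

    def links_to_reference_attention(links, src_len, tgt_len):
        """Turn hard alignment links (tgt_pos, src_pos), e.g. from GIZA++,
        into a reference attention matrix of shape (tgt_len, src_len)."""
        ref = np.zeros((tgt_len, src_len))
        for t, s in links:
            ref[t, s] = 1.0                      # one- or multi-hot row
        row_sums = ref.sum(axis=1, keepdims=True)
        # Normalize multi-hot rows into distributions; unaligned target
        # words keep an all-zero row (one possible convention).
        np.divide(ref, row_sums, out=ref, where=row_sums > 0)
        return ref

    # Example: target word 0 aligned to source 1; target word 1 aligned to sources 2 and 3
    ref = links_to_reference_attention([(0, 1), (1, 2), (1, 3)], src_len=4, tgt_len=2)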

6 of 10

Alignment-Based Decoder (Alkhouli et al., 2016)

  • Translation model, factored into a lexical and an alignment model (a decoding sketch follows this list)

  • Lexical model: Feed-forward Joint Model (FFJM)

  • Alignment model: Feed-forward Alignment Model (FFAM)
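
A rough, hedged sketch of how such an alignment-based decoder could interleave the two models during search; ffam_predict_jump and ffjm_predict_word are hypothetical stand-ins for the FFAM and FFJM, and the real decoder of Alkhouli et al. (2016) uses beam search rather than the greedy loop shown here:

    def alignment_based_greedy_decode(source, ffam_predict_jump, ffjm_predict_word,
                                      max_len=100, eos="</s>"):
        """Alternate alignment (jump) and lexical predictions, one target word per step."""
        target, aligned_pos = [], 0
        for _ in range(max_len):
            # Alignment model (FFAM): predict a jump relative to the current source position.
            jump = ffam_predict_jump(source, target, aligned_pos)
            aligned_pos = min(max(aligned_pos + jump, 0), len(source) - 1)
            # Lexical model (FFJM): predict the next target word given the aligned source window.
            word = ffjm_predict_word(source, target, aligned_pos)
            if word == eos:
                break
            target.append(word)
        return target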

7 of 10

Advantages of Supervised Attention

  • GIZA++ (Liu et al., 2016)

- Implements supervision of attention as a regularizer in the joint training objective

- Serves to mitigate the vanishing gradient problem

  • GIZA++ (Chen et al., 2016)

- Deals with problems of disordered output and word repetition seen in unsupervised attention, due to long input sentences, OOV words and placeholders.

  • Neural aligner (Alkhouli et al., 2016)

Retraining using forced alignments has two benefits:

- Since the alignments are produced using both the lexical and the alignment model, this can be seen as joint training of the two models.

- Since the neural decoder generates these alignments, training the neural models on them yields models that are more consistent with the neural decoder.

8 of 10

Results

  • Aligner = GIZA++, Alignment Cost = Cross Entropy (Liu et al., 2016)

  • Aligner = GIZA++, Alignment Cost = Cross Entropy & Squared Error (Liu et al., 2016)

  • NMT1 and NMT2 are implementations of Bahdanau et al. (2015)
  • SA-NMT stands for Supervised Alignment NMT
  • The development set is NIST02; the test sets are NIST05, NIST06 and NIST08
  • The baseline NMT is trained on English-French WMT data (Common Crawl, Europarl v7)
  • The numbers denote the ratio of likelihood to alignment cost
  • Decay means the alignment-cost weight is reduced to 90% of its value after every epoch (a sketch of this schedule follows below)
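
A minimal sketch of that decay schedule, with λ_0 the initial alignment-cost weight and e the epoch index (symbols assumed here, not from the slides):

\[
\lambda_e = 0.9^{\,e}\, \lambda_0
\]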

9 of 10

Results

  • Aligner = Neural aligner (Alkhouli et al., 2016)

  • Task: IWSLT 2013 German-English
  • Monolingual data: the language model is trained on additional monolingual data
  • Fine-tuning: the model was further tuned on in-domain data
  • dp and wp: linear distortion penalty and word penalty
  • Did not outperform the phrase-based system

10 of 10

Summary

  • Adding an alignment cost to the training objective

  • Different forms of alignment cost

- Mean Squared Error

- Multiplication

- Cross Entropy

  • Alignment methods

- GIZA++ (most popular), MaxEnt, neural models

  • Advantages

- Regularization: mitigates the vanishing gradient problem

- Addresses disordered output and word repetition