Supervised Alignment in Attention Models
Pengyu He
Becky Marvin
Attentions in Neural Machine Translation
α_{t,i} are the components of the attention matrix; together they form a soft alignment generated by the neural network.
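To make α_{t,i} concrete, here is a minimal sketch of one decoder step of Bahdanau-style additive attention; the scoring function, weight matrices, dimensions, and random inputs are illustrative assumptions rather than values from any of the papers discussed.

import numpy as np

def attention_weights(s_prev, H, W_s, W_h, v):
    """Return alpha_{t,i}: one weight per source position i for decoder step t."""
    # Alignment scores e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i)
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v   # shape (n,)
    # Softmax turns the scores into attention weights that sum to 1
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy example: n = 4 source positions, hidden size 8, attention size 6
rng = np.random.default_rng(0)
n, d, a = 4, 8, 6
alpha = attention_weights(rng.normal(size=d), rng.normal(size=(n, d)),
                          rng.normal(size=(d, a)), rng.normal(size=(d, a)),
                          rng.normal(size=a))
print(alpha, alpha.sum())   # one row of the attention matrix; the weights sum to 1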
Supervised Attentions
Word Alignment
- Based on IBM Model 4 or HMM
- Based on the maximum entropy method
- Models jumps instead of absolute alignments
- (2m+1)-word window on the source side, n-gram history on the target side (see the sketch after this list)
- The training alignments were produced by GIZA++; the decoder was based on the neural aligner.
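As an illustration of the (2m+1) source window and n-gram target context, the sketch below assembles the input to a feedforward jump model; the function name jump_model_input, the padding token, and the feature layout are assumptions made for this example, not the exact setup from the paper.

def jump_model_input(src, tgt_prefix, b_prev, m=3, n=3, pad="<pad>"):
    # Source window: 2m+1 words centered on the previously aligned source position
    window = [src[j] if 0 <= j < len(src) else pad
              for j in range(b_prev - m, b_prev + m + 1)]
    # Target context: the n most recently produced target words (n-gram history)
    history = ([pad] * n + tgt_prefix)[-n:]
    # The concatenated features feed a feedforward classifier over jump sizes
    return window + history

src = "das Haus ist klein".split()
print(jump_model_input(src, ["the", "house"], b_prev=1))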
Alignment-Based Decoder (Alkhouli et al., 2016)
Advantages of Supervised Attentions
- Implements supervision of attention as a regularization term in the joint training objective (sketched below)
- Serves to mitigate the vanishing gradient problem
- Deals with problems of disordered output and word repetition seen in unsupervised attention, due to long input sentences, OOV words, and placeholders.
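One possible reading of "supervision as a regularization term" is sketched below: a cross-entropy penalty between the model's attention and a reference alignment distribution (e.g. converted from GIZA++ alignments), added to the usual translation loss. The loss form and the lambda_align weight are illustrative assumptions, not the exact published objective.

import numpy as np

def supervised_attention_loss(attn, gold_attn, token_nll, lambda_align=1.0, eps=1e-9):
    """attn, gold_attn: (T, S) attention / reference alignment distributions;
    token_nll: (T,) negative log-likelihood of each target token."""
    # Cross-entropy between the model's attention and the reference alignment
    align_ce = -(gold_attn * np.log(attn + eps)).sum(axis=1).mean()
    # Joint objective: translation loss plus the alignment regularizer
    return token_nll.mean() + lambda_align * align_ce

# Toy usage: uniform attention from an untrained model vs. one-hot reference alignments
T, S = 3, 5
attn = np.full((T, S), 1.0 / S)
gold = np.eye(T, S)
print(supervised_attention_loss(attn, gold, np.array([2.0, 1.5, 1.8])))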
Retraining using forced alignments has two benefits:
- Since alignments are produced using both the lexical and alignment models, this can be seen as joint training of the two models
- Since the neural decoder generates these alignments, training neural models on them yields models that are more consistent with the neural decoder (see the sketch below).
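The sketch below shows only the shape of the forced-alignment retraining loop implied by these two points; force_align, train_lexical, and train_alignment are hypothetical placeholders supplied by the caller, not the authors' implementation.

def retrain_with_forced_alignments(lex, ali, bitext, force_align,
                                   train_lexical, train_alignment, rounds=1):
    for _ in range(rounds):
        # Forced alignments are produced by decoding with both models jointly,
        # so retraining on them amounts to joint training of the two models.
        forced = [force_align(src, tgt, lex, ali) for src, tgt in bitext]
        # Training on alignments the neural decoder itself produced keeps the
        # retrained models consistent with the decoder's search-time behaviour.
        lex = train_lexical(bitext, forced)
        ali = train_alignment(bitext, forced)
    return lex, ali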
Results
Summary