Neural Machine Translation
Human Language Technologies
Università di Pisa
Slides adapted from Abigail See
1990s-2010s: Statistical Machine Translation
2014: Neural Machine Translation
Huge Impact on Machine Translation Research
What is Neural Machine Translation?
Neural Machine Translation (NMT)
The sequence-to-sequence model
[Diagram: the Encoder RNN reads the source sentence "les pauvres sont démunis" (input); the Decoder RNN, fed "<START> the poor don’t have any money", outputs the target sentence "the poor don’t have any money <END>" by taking the argmax over the vocabulary at each step]
Encoder RNN produces an encoding of the source sentence.
The encoding provides the initial hidden state for the Decoder RNN.
Decoder RNN is a Language Model that generates the target sentence conditioned on the encoding.
Note: this diagram shows test-time behavior: the decoder output is fed in as the next step’s input.
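To make the encoder-decoder picture concrete, here is a minimal sketch in PyTorch (assumed available); the class name, layer sizes and vocabulary sizes are illustrative choices, not the configuration of any particular system:

```python
# Minimal seq2seq sketch (PyTorch assumed available); sizes and names are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)   # Encoder RNN
        self.decoder = nn.GRU(emb, hidden, batch_first=True)   # Decoder RNN (conditional LM)
        self.out = nn.Linear(hidden, tgt_vocab)                 # scores over the target vocabulary

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence; the final hidden state is the "encoding".
        _, enc_state = self.encoder(self.src_emb(src_ids))
        # The encoding provides the initial hidden state of the decoder.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), enc_state)
        return self.out(dec_out)    # one set of logits per decoder step

# Shapes only: a batch of one source sentence (4 tokens) and one decoder input (7 tokens).
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (1, 4)), torch.randint(0, 1000, (1, 7)))
print(logits.shape)  # torch.Size([1, 7, 1000])
```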
Sequence-to-sequence model is versatile
Neural Machine Translation (NMT)
P(y|x) = P(y1|x) P(y2|y1, x) P(y3|y1, y2, x) … P(yT|y1, …, yT-1, x)
Probability of next target word, given
target words so far and source sentence x
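The same factorization, written compactly as a product (T is the target length):

```latex
P(y \mid x) \;=\; \prod_{t=1}^{T} P\big(y_t \mid y_1, \dots, y_{t-1}, x\big)
```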
Training a Neural Machine Translation system
Seq2seq is optimized as a single system.
Backpropagation operates “end to end”
[Diagram: the Encoder RNN reads the source sentence "les pauvres sont démunis" (from corpus); the Decoder RNN is fed the target sentence "<START> the poor don’t have any money" (from corpus) and predicts ŷ1 … ŷ7 at each step]
Total loss: J1 + J2 + J3 + J4 + J5 + J6 + J7, where Jt is the negative log probability of the correct target word at step t (e.g., J1 = negative log prob of "the", J4 = negative log prob of "have", J7 = negative log prob of <END>).
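A sketch of one training step for this loss, assuming the illustrative Seq2Seq class from the earlier sketch is in scope; the decoder is fed the gold target prefix from the corpus (teacher forcing), and the per-step negative log probabilities are combined (here averaged) into J:

```python
# One seq2seq training step (illustrative; reuses the Seq2Seq sketch above).
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
criterion = nn.CrossEntropyLoss()            # J_t = negative log prob of the correct word
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

src = torch.randint(0, 1000, (1, 4))          # "les pauvres sont démunis" (as ids)
tgt = torch.randint(0, 1000, (1, 8))          # "<START> the poor ... money <END>" (as ids)

optimizer.zero_grad()
logits = model(src, tgt[:, :-1])              # decoder input: all gold words except the last
loss = criterion(logits.reshape(-1, 1000),    # J = J1 + J2 + ... (averaged over steps)
                 tgt[:, 1:].reshape(-1))      # targets: all gold words except <START>
loss.backward()                               # backpropagation operates end-to-end,
optimizer.step()                              # updating encoder and decoder together
```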
Better-than-greedy decoding?
[Diagram: greedy decoding generates "the poor don’t have any money <END>" by taking the argmax on each step of the decoder and feeding it back in as the next input]
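Greedy decoding as a short Python sketch; next_word_distribution is a hypothetical stand-in for one decoder step of a trained model, returning a word-to-probability dictionary:

```python
def greedy_decode(next_word_distribution, max_len=20):
    """Take the argmax at every step and feed it back in as the next input."""
    output = ["<START>"]
    for _ in range(max_len):
        probs = next_word_distribution(output)    # P(next word | words so far, source)
        next_word = max(probs, key=probs.get)     # argmax
        output.append(next_word)
        if next_word == "<END>":
            break
    return output[1:]

# Toy stand-in that always continues the reference sentence (illustration only).
reference = "the poor don't have any money <END>".split()
toy_model = lambda prefix: {reference[len(prefix) - 1]: 1.0}
print(" ".join(greedy_decode(toy_model)))   # the poor don't have any money <END>
```

Greedy decoding has no way to undo a bad early decision, which motivates the search strategies below.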
Exhaustive search decoding
Beam search decoding
Beam search decoding: example
[Diagram: beam search tree with beam size k = 2; starting from <START>, the k most probable partial hypotheses (e.g., "the" and "a" at the first step) are kept and expanded at each step, ending with "the poor don’t have any money"]
Beam search decoding: stopping criterion
Beam search decoding: finishing up
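A sketch of beam search under the same hypothetical next_word_distribution interface; scores are sums of log probabilities, finished hypotheses leave the beam, and completed hypotheses are compared with length normalization:

```python
import math

def beam_search(next_word_distribution, k=2, max_len=20):
    """Keep the k highest-scoring partial hypotheses at each step."""
    beam = [(0.0, ["<START>"])]          # (sum of log-probs, partial hypothesis)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for word, p in next_word_distribution(prefix).items():
                candidates.append((score + math.log(p), prefix + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, hyp in candidates[:k]:
            (finished if hyp[-1] == "<END>" else beam).append((score, hyp))
        if not beam:                      # all top-k hypotheses have finished
            break
    finished += beam                      # in case max_len was hit before <END>
    # Compare completed hypotheses with length normalization.
    best = max(finished, key=lambda c: c[0] / (len(c[1]) - 1))
    return best[1][1:]
```

With k = 1 this reduces to greedy decoding; letting k grow without bound would approach exhaustive search.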
Pros/Cons of NMT vs. SMT
Disadvantages
Benefits
MT progress over time
Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf
[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]
NMT: the biggest success story of NLP Deep Learning
Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016
9/2016 Google Brain announces NMT translator
10/2016 Systran follows up
11/2016 Microsoft does the same
Is MT a solved problem?
[Example translations showing remaining errors; one annotation: "Didn’t specify gender"]
slides by C. Manning
NMT research continues
NMT is the flagship task for NLP Deep Learning
ATTENTION
Attention
Sequence-to-sequence: the bottleneck problem
[Diagram: the Encoder RNN reads "les pauvres sont démunis" (input); its final hidden state is the only encoding passed to the Decoder RNN, which generates "the poor don’t have any money <END>" (output) by argmax at each step]
Problems with this architecture? The encoding of the source sentence needs to capture all the information about the source sentence: information bottleneck!
Attention
Sequence-to-sequence with attention
[Diagram: Encoder RNN over "les pauvres sont démunis"; Decoder RNN at <START>]
Attention scores: the dot product of the decoder hidden state with each encoder hidden state.
Attention distribution: take a softmax to turn the scores into a probability distribution.
On this decoder timestep, we’re mostly focusing on the first encoder hidden state ("les").
Attention output: use the attention distribution to take a weighted sum of the encoder hidden states.
The attention output mostly contains information from the hidden states that received high attention.
Concatenate the attention output with the decoder hidden state, then use it to compute ŷ1 as before: ŷ1 = "the".
The same procedure is repeated on each subsequent decoder step, producing ŷ2 = "poor", ŷ3 = "don’t", ŷ4 = "have", ŷ5 = "any", ŷ6 = "money".
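The same steps in a small numerical sketch (NumPy assumed; the encoder and decoder hidden states are random stand-ins for the vectors in the diagram):

```python
import numpy as np

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(4, 8))    # h_1..h_4: one hidden state per source word
dec_state = rng.normal(size=(8,))       # s_t: current decoder hidden state

scores = enc_states @ dec_state                       # attention scores (dot products)
weights = np.exp(scores) / np.exp(scores).sum()       # attention distribution (softmax)
attn_output = weights @ enc_states                    # weighted sum of encoder states
combined = np.concatenate([attn_output, dec_state])   # concatenated with s_t, used to predict ŷ_t
print(weights.round(2), combined.shape)               # 4 weights summing to 1, shape (16,)
```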
Attention in Equations
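One common way to write these steps, assuming encoder hidden states h_1, …, h_N and decoder hidden state s_t at step t (the dot-product variant shown in the diagrams):

```latex
e^t = [\, s_t^{\top} h_1,\ \dots,\ s_t^{\top} h_N \,] \quad \text{(attention scores)} \\
\alpha^t = \mathrm{softmax}(e^t) \quad \text{(attention distribution)} \\
a_t = \sum_{i=1}^{N} \alpha^t_i \, h_i \quad \text{(attention output: weighted sum)} \\
[\, a_t ; s_t \,] \ \text{is then used to compute } \hat{y}_t
```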
Attention is great
There are several attention variants
Attention is a general Deep Learning technique
Attention is all you need
Self Attention to contextualize words
Machine Translation
Self Attention + Encoder-Decoder Attention
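Both kinds of attention in the Transformer (Vaswani et al., 2017) use the same scaled dot-product form, with queries Q, keys K, values V and key dimension d_k:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```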
Application to other tasks
Task Specific Models
Transformer + extra layer fine-tuned to the task
SuperGLUE: benchmark on 10 tasks
Transformer Models
Dimensions
Model | Dimensions | Parameters |
--- | --- | --- |
BERT | 24-layer, 1024-hidden, 16-heads | 340M |
GPT-2 | 48-layer, 1600-hidden, 25-heads | 1558M |
Transformer XL | 18-layer, 1024-hidden, 16-heads | 257M |
XLM | 12-layer, 1024-hidden, 8-heads | |
RoBERTa | 24-layer, 1024-hidden, 16-heads | 355M |
DistilBERT | 6-layer, 768-hidden, 12-heads | 134M |
CTRL | 48-layer, 1280-hidden, 16-heads | 1.6B |
ALBERT | 12 repeating layers, 128-embedding, 4096-hidden, 64-heads | 223M |
T5 | 24-layer, 1024-hidden, 65536 feed-forward, 128-heads | 11B |
GPT-3 | 96-layer, 96 heads × 128 dims | 175B |
Google Neural MT Architecture
MT State of the Art
Language pair | Model | BLEU |
--- | --- | --- |
EN-DE | Transformer Big + BT (Edunov et al., 2018) | 35.0 |
EN-DE | Transformer Big (Vaswani et al., 2017) | 28.4 |
EN-FR | DeepL | 45.9 |
EN-FR | Transformer Big (Vaswani et al., 2017) | 41.0 |
Training Costs
Costs of training an NMT model
Model | BLEU (EN-FR) | Training Cost (FLOPs) | Time |
--- | --- | --- | --- |
ConvS2S | 40.56 | 1.5 × 10²⁰ | |
MoE | 40.56 | 1.2 × 10²⁰ | |
ConvS2S Ensemble | 41.29 | 1.2 × 10²¹ | 35 days |
Multilingual Translation
Single language-pair translation
One Encoder-Decoder for each language pair
En-Es Encoder
En-Es Decoder
En-Nl Encoder
En-Nl Decoder
En-Fr Encoder
En-Fr Decoder
Multilingual Zero-shot Translation
Pairwise translation
Induced translation
Multilingual NMT
Hello, how are you? -> Hola, ¿cómo estás?
<2es> Hello, how are you? -> Hola, ¿cómo estás?
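A minimal sketch of the token trick above; add_target_token is a hypothetical helper, and the single shared model then receives the tagged sentence as its input:

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend the artificial token that tells the shared model which language to produce."""
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("Hello, how are you?", "es"))  # <2es> Hello, how are you?
print(add_target_token("Hello, how are you?", "ja"))  # <2ja> Hello, how are you?
```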
Multilingual NMT
Not pairwise: a single model serves all language pairs
The added token specifies only the target language
The input (source) language is irrelevant: it does not need to be specified
Capable of translating between any language pair
Zero-shot translation
Learns Japanese->Korean from pairs
English->Japanese, Japanese->English, English->Korean, Korean->English
Quality: BLEU Score
Model | Single | Multi | Multi | Multi | Multi |
#nodes | 1024 | 1024 | 1280 | 1536 | 1792 |
#params | 3B | 255M | 367M | 499M | 650M |
Prod English→Japanese | 23.66 | 21.10 | 21.17 | 21.72 | 21.70 |
Prod English→Korean | 19.75 | 18.41 | 18.36 | 18.30 | 18.28 |
Prod Japanese→English | 23.41 | 21.62 | 22.03 | 22.51 | 23.18 |
Prod Korean→English | 25.42 | 22.87 | 23.46 | 24.00 | 24.67 |
Prod English→Spanish | 34.50 | 34.25 | 34.40 | 34.77 | 34.70 |
Prod English→Portuguese | 38.40 | 37.35 | 37.42 | 37.80 | 37.92 |
Prod Spanish→English | 38.00 | 36.04 | 36.50 | 37.26 | 37.45 |
Prod Portuguese→English | 44.40 | 42.53 | 42.82 | 43.64 | 43.87 |
Prod English→German | 26.43 | 23.15 | 23.77 | 23.63 | 24.01 |
Prod English→French | 35.37 | 34.00 | 34.19 | 34.91 | 34.81 |
Prod German→English | 31.77 | 31.17 | 31.65 | 32.24 | 32.32 |
Prod French→English | 36.47 | 34.40 | 34.56 | 35.35 | 35.52 |
avg diff | - | -1.72 | -1.43 | -0.95 | -0.76 |
vs single | - | -5.6% | -4.7% | -3.1% | -2.5% |
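For reference, BLEU (the metric reported in the tables above) combines modified n-gram precisions p_n, typically up to 4-grams with uniform weights w_n, with a brevity penalty BP, where c is the candidate length and r the reference length:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right),
\qquad
\mathrm{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right)
```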
No Language Left Behind
Project by Meta to deliver translations directly between 200 languages—including low-resource languages like Asturian, Luganda, Urdu and more.
https://ai.facebook.com/research/no-language-left-behind/
Stage 1: Automatic dataset construction
Collect sentences in the input language and desired output language
Stage 2: Training
Encoder-decoder model trained on millions of example translations
Stage 3: Evaluation
Evaluate model against a human-translated set of sentence translations
Does NMT understand meaning of sentences?
Visualization of translations
[Figure: translated sentences colored by their meaning; the same sentence in 3 languages is shown in 3 colors]
Conclusions
We learned the history of Machine Translation (MT)