1 of 62

Neural Machine Translation

Human Language Technologies

Università di Pisa


Slides adapted from Abigail See

2 of 62

1990s-2010s: Statistical Machine Translation

  • SMT is a huge research field
    • The best systems are extremely complex
    • Hundreds of important details, too many to mention
    • Systems have many separately-designed subcomponents
    • Lots of feature engineering
      • Need to design features to capture particular language phenomena
  • Systems require compiling and maintaining extra resources
    • Like tables of translated phrases
  • Lots of human effort to maintain
    • Repeated effort for each language pair!

3 of 62

2014: Neural Machine Translation

Huge impact on Machine Translation research

4 of 62

What is Neural Machine Translation?

  • Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network
  • The neural network architecture is called sequence-to-sequence (aka seq2seq) and it involves two RNNs.

5 of 62

Neural Machine Translation (NMT)

The sequence-to-sequence model

Diagram: the Encoder RNN reads the source sentence "les pauvres sont démunis" (input); the Decoder RNN generates the target sentence "the poor don’t have any money <END>" (output), taking the argmax word at each step after <START>.

  • Encoder RNN produces an encoding of the source sentence.
  • The encoding of the source sentence provides the initial hidden state for the Decoder RNN.
  • Decoder RNN is a Language Model that generates the target sentence conditioned on the encoding.
  • Note: this diagram shows test-time behavior: the decoder output is fed in as the next step’s input.
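A minimal sketch of this encoder-decoder pair in PyTorch may help fix ideas. The class and parameter names (Encoder, Decoder, hidden_size) and the use of GRUs are illustrative assumptions, not the exact model from the slides.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len) token ids
        _, h = self.rnn(self.embed(src))           # h: final hidden state
        return h                                   # the encoding of the source sentence

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, tgt_in, h0):                 # tgt_in: (batch, tgt_len) token ids
        out, _ = self.rnn(self.embed(tgt_in), h0)  # conditioned on the source encoding h0
        return self.out(out)                       # logits over the target vocabulary
```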

6 of 62

Sequence-to-sequence model is versatile

  • The general notion here is an encoder-decoder model
    • One neural network takes input and produces a neural representation
    • Another network produces output based on that neural representation
    • If the input and output are sequences, we call it a seq2seq model
  • Sequence-to-sequence is useful for more than just MT
  • Many NLP tasks can be phrased as sequence-to-sequence:
    • Summarization (long text → short text)
    • Dialogue (previous utterances → next utterance)
    • Parsing (input text → output parse as sequence)
    • DB interfaces (natural language → SQL)
    • Code generation (natural language → Python code)

7 of 62

Neural Machine Translation (NMT)

  • The sequence-to-sequence model is an example of a Conditional Language Model.
    • Language Model because the decoder is predicting the next word of the target sentence y
    • Conditional because its predictions are also conditioned on the source sentence x
  • NMT directly calculates P(y|x):

P(y|x) = P(y1|x) · P(y2|y1, x) · P(y3|y1, y2, x) · … · P(yT|y1, …, yT−1, x)

  • Question: How to train a NMT system?
  • Answer: Get a big parallel corpus…

(Each factor is the probability of the next target word, given the target words so far and the source sentence x.)
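As a quick illustration of this factorization, the log probability of a whole translation is just the sum of per-step log probabilities. A tiny sketch, assuming step_probs[t] holds P(yt | y1, …, yt−1, x) as produced by the decoder at step t:

```python
import math

def sequence_log_prob(step_probs):
    # log P(y|x) = sum over t of log P(y_t | y_1..y_{t-1}, x)
    return sum(math.log(p) for p in step_probs)

print(sequence_log_prob([0.4, 0.6, 0.9, 0.8]))   # log-probability of a 4-word target
```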

8 of 62

Training a Neural Machine Translation system

Seq2seq is optimized as a single system.

Backpropagation operates “end to end”

Diagram: the Encoder RNN reads the source sentence "les pauvres sont démunis" (from corpus); the Decoder RNN is fed the target sentence "<START> the poor don’t have any money" (from corpus) and produces predictions ŷ1, …, ŷ7.

  • Jt = negative log probability of the true target word at step t (e.g. of "the", …, of "have", …, of <END>)
  • Total loss: J = J1 + J2 + J3 + J4 + J5 + J6 + J7 (usually averaged over the T decoder steps)
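A hedged training-step sketch, reusing the Encoder/Decoder classes from the earlier sketch; the shifted decoder input (tgt_in = <START> + target, tgt_out = target + <END>) and padding index 0 are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt_in, tgt_out):
    optimizer.zero_grad()
    h = encoder(src)                               # encoding of the source sentence
    logits = decoder(tgt_in, h)                    # teacher forcing: feed the gold prefix
    loss = nn.functional.cross_entropy(            # J_t = -log P(true word at step t)
        logits.reshape(-1, logits.size(-1)),
        tgt_out.reshape(-1),
        ignore_index=0)                            # skip padding positions
    loss.backward()                                # backpropagation operates end to end
    optimizer.step()
    return loss.item()
```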

9 of 62

Better-than-greedy decoding?

  • We showed how to generate (or “decode”) the target sentence by taking argmax on each step of the decoder

  • This is greedy decoding (take most probable word on each step)
  • Problems?

Diagram: starting from <START>, the decoder takes the argmax word at each step ("the", "poor", "don’t", "have", "any", "money", <END>) and feeds it back in as the next input.
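A minimal greedy-decoding sketch over the Decoder from the earlier sketches; START_ID, END_ID and max_len are assumed, illustrative names.

```python
import torch

def greedy_decode(decoder, h, START_ID, END_ID, max_len=50):
    ys = [START_ID]
    for _ in range(max_len):
        logits = decoder(torch.tensor([ys]), h)    # re-run the decoder on the prefix so far
        next_id = int(logits[0, -1].argmax())      # take the most probable word
        ys.append(next_id)
        if next_id == END_ID:                      # stop once <END> is produced
            break
    return ys[1:]                                  # drop <START>
```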

10 of 62

Better-than-greedy decoding?

  • Greedy decoding has no way to undo decisions!
    • les pauvres sont démunis (the poor don’t have any money)
    • → the ____
    • → the poor ____
    • → the poor are ____ (wrong, but can’t go back)
  • Better option: use beam search (a search algorithm) to explore several hypotheses and select the best one

11 of 62

Exhaustive search decoding

  • Ideally, we want to find a (length T) translation y that maximizes P(y|x) = P(y1|x) P(y2|y1, x) … P(yT|y1, …, yT−1, x)
  • We could try computing all possible sequences y, but that means tracking O(V^T) partial translations (V = vocabulary size), which is far too expensive
  • Exhaustive search is therefore infeasible; we need a cheaper approximation

12 of 62

Beam search decoding

  • Core idea: on each step of the decoder, keep track of the k most probable partial translations (called hypotheses)
    • k is the beam size (in practice around 5 to 10)
  • A hypothesis y1, …, yt has a score: the sum of its log probabilities, Σ log P(yi | y1, …, yi−1, x)
  • Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search

13 of 62

Beam search decoding: example

Diagram: the search tree starts at <START> and keeps the 2 best hypotheses at every step. Candidate continuations explored include: the / a; poor / people / person; are / don’t; but / always / not; have / take; in / with; any / enough; money / funds.

Beam size = 2

14 of 62

Beam search decoding: stopping criterion

  • In greedy decoding, usually we decode until the model produces an <END> token
    • For example: <START> the poor don’t have enough money <END>
  • In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
    • When a hypothesis produces <END>, that hypothesis is complete.
    • Place it aside and continue exploring other hypotheses via beam search.
  • Usually we continue beam search until:
    • We reach timestep T (where T is some pre-defined cutoff), or
    • We have at least n completed hypotheses (where n is a pre-defined cutoff)

15 of 62

Beam search decoding: finishing up

  • We now have a list of completed hypotheses: how do we select the one with the highest score?
  • Problem: longer hypotheses accumulate more (negative) log-probability terms, so their scores tend to be lower
  • Fix: normalize each score by the hypothesis length (divide the total log probability by the number of words t) before picking the best one
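A hedged beam-search sketch tying the last few slides together (assumed names: the Decoder from the earlier sketches, START_ID / END_ID, beam size k). Completed hypotheses are set aside, and scores are length-normalized at the end.

```python
import torch

def beam_search(decoder, h, START_ID, END_ID, k=5, max_len=50):
    beams = [([START_ID], 0.0)]                    # (tokens, log-probability score)
    completed = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logits = decoder(torch.tensor([tokens]), h)
            log_probs = logits[0, -1].log_softmax(dim=-1)
            top = log_probs.topk(k)                # expand each hypothesis k ways
            for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((tokens + [idx], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:       # keep only the k best hypotheses
            if tokens[-1] == END_ID:
                completed.append((tokens, score))  # this hypothesis is complete
            else:
                beams.append((tokens, score))
        if not beams:                              # every surviving hypothesis finished
            break
    completed.extend(beams)                        # fall back if <END> was never produced
    best = max(completed, key=lambda c: c[1] / len(c[0]))   # length-normalized score
    return best[0][1:]                             # drop <START>
```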

16 of 62

Pros and Cons of NMT compared to SMT

Benefits:

  • Better performance
    • More fluent
    • Better use of context
    • Better use of phrase similarities
  • A single neural network optimized end-to-end
    • No subcomponents to be individually optimized
  • Requires much less human engineering effort
    • No feature engineering
    • Same method for all language pairs

Disadvantages:

  • NMT is less interpretable
    • Hard to debug
  • NMT is difficult to control
    • For example, can’t easily specify rules or guidelines for translation
    • Safety concerns!

17 of 62

MT progress over time

Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf

[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]

18 of 62

NMT: the biggest success story of NLP Deep Learning

Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016

  • 2014: First seq2seq paper published
  • 2016: Google Translate switches from SMT to NMT

9/2016 Google Brain announces NMT translator

10/2016 Systran follows up

11/2016 Microsoft does the same

  • 2018: everyone uses NMT
  • This is amazing!
    • SMT systems, built by hundreds of engineers over many years, outperformed by NMT systems trained by a handful of engineers in a few months

19 of 62

Is MT a solved problem?

  • Nope!
  • Many difficulties remain:
    • Out-of-vocabulary words
    • Domain mismatch between train and test data
    • Maintaining context over longer text
    • Failures to accurately capture sentence meaning
    • Pronoun (or zero pronoun) resolution errors
    • Morphological agreement errors
    • Low-resource language pairs

20 of 62

Is Machine Translation solved?

  • Using common sense is still hard:

slide by C. Manning

21 of 62

Is MT a solved problem?

  • Picks up biases in training data

Didn’t specify gender

slide by C. Manning

22 of 62

Is MT a solved problem?

  • Uninterpretable systems do strange things

slide by C. Manning

23 of 62

NMT research continues

NMT is the flagship task for NLP Deep Learning

  • NMT research has pioneered many of the recent innovations of NLP Deep Learning
  • In 2021: NMT research continues to thrive
    • Researchers have found many improvements to the “vanilla” seq2seq NMT system we’ve presented today
    • But one improvement is so integral that it is the new vanilla…

ATTENTION

24 of 62

Attention

25 of 62

Sequence-to-sequence: the bottleneck problem

Diagram: the Encoder RNN reads the source sentence "les pauvres sont démunis" (input); a single vector, the encoding of the source sentence, is handed to the Decoder RNN, which generates the target sentence "the poor don’t have any money <END>" (output) by taking the argmax at each step.

Problems with this architecture?

26 of 62

Sequence-to-sequence: the bottleneck problem

Diagram: same architecture as the previous slide.

Problems with this architecture? The single encoding of the source sentence needs to capture all information about the source sentence. Information bottleneck!

27 of 62

Attention

  • Attention provides a solution to the bottleneck problem.
  • Core idea: on each step of the decoder, focus on a particular part of the source sequence
  • First, we will show via diagram (no equations), then we will show with equations

28 of 62

Sequence-to-sequence with attention

Diagram: Encoder RNN over "les pauvres sont démunis"; Decoder RNN at <START>. Attention scores: the dot product of the current decoder hidden state with each encoder hidden state.

29 of 62

Sequence-to-sequence with attention

Diagram: compute the attention scores as before. Attention distribution: take a softmax to turn the scores into a probability distribution. On this decoder timestep, we’re mostly focusing on the first encoder hidden state ("les").

30 of 62

Sequence-to-sequence with attention

Diagram: attention scores → attention distribution → attention output. Use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.

31 of 62

Sequence-to-sequence with attention

Diagram: concatenate the attention output with the decoder hidden state, then use it to compute ŷ1 as before → "the".

32 of 62

Sequence-to-sequence with attention

Diagram: decoder input so far "<START> the"; the attention scores, distribution and output are recomputed at this step and used to produce ŷ2 → "poor".

33 of 62

Sequence-to-sequence with attention

Diagram: decoder input so far "<START> the poor"; repeating the same procedure produces ŷ3 → "don’t".

34 of 62

Sequence-to-sequence with attention

Diagram: decoder input so far "<START> the poor don’t"; ŷ4 → "have".

35 of 62

Sequence-to-sequence with attention

Diagram: decoder input so far "<START> the poor don’t have"; ŷ5 → "any".

36 of 62

Sequence-to-sequence with attention

Diagram: the procedure continues, recomputing attention scores, distribution and output at each step; ŷ6 → "money".

37 of 62

Attention in Equations

  • We have encoder hidden states h1, …, hN and, on decoder timestep t, the decoder hidden state st
  • Attention scores for this step: et = [st·h1, …, st·hN] (dot products)
  • Take a softmax to get the attention distribution αt = softmax(et), a probability distribution that sums to 1
  • Use αt to take a weighted sum of the encoder hidden states: the attention output at = Σi αt,i hi
  • Finally, concatenate the attention output with the decoder hidden state, [at; st], and proceed as in the non-attention seq2seq model
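A compact NumPy sketch of these equations (dot-product scores, softmax, weighted sum, concatenation); the shapes and sizes are illustrative assumptions.

```python
import numpy as np

def attention_step(enc_states, dec_state):
    # enc_states: (N, hidden) encoder hidden states h_1..h_N
    # dec_state:  (hidden,)   decoder hidden state s_t
    scores = enc_states @ dec_state                 # e_t: dot-product attention scores
    scores = scores - scores.max()                  # for numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()  # attention distribution (softmax)
    attn_output = alphas @ enc_states               # weighted sum of encoder states a_t
    return np.concatenate([attn_output, dec_state]), alphas

enc = np.random.randn(4, 8)                         # 4 source words, hidden size 8
dec = np.random.randn(8)
combined, alphas = attention_step(enc, dec)
print(alphas.sum())                                 # 1.0: a probability distribution
```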

38 of 62

Attention is great

  • Attention significantly improves NMT performance
    • It’s very useful to allow decoder to focus on certain parts of the source
  • Attention solves the bottleneck problem
    • Attention allows decoder to look directly at source; bypass bottleneck
  • Attention helps with vanishing gradient problem
    • Provides shortcut to faraway states
  • Attention provides some interpretability
    • By inspecting the attention distribution, we can see what the decoder was focusing on
    • We get alignment for free!
    • This is cool because we never explicitly trained an alignment system
    • The network just learned alignment by itself

39 of 62

There are several attention variants

  • Basic dot-product attention: et,i = st·hi (assumes the encoder and decoder hidden states have the same size)
  • Multiplicative attention: et,i = stᵀ W hi, where W is a learned weight matrix
  • Additive attention: et,i = vᵀ tanh(W1 hi + W2 st), where W1, W2 and v are learned parameters
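A small NumPy sketch of the three scoring functions above; W, W1, W2 and v stand in for learned parameters and are just random placeholders here.

```python
import numpy as np

h = np.random.randn(8)                      # one encoder hidden state h_i
s = np.random.randn(8)                      # decoder hidden state s_t
W = np.random.randn(8, 8)
W1, W2, v = np.random.randn(6, 8), np.random.randn(6, 8), np.random.randn(6)

dot_score  = s @ h                          # basic dot-product attention
mult_score = s @ W @ h                      # multiplicative (bilinear) attention
add_score  = v @ np.tanh(W1 @ h + W2 @ s)   # additive attention
```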

40 of 62

Attention is a general Deep Learning technique

  • We’ve seen that attention is a great way to improve the sequence-to-sequence model for Machine Translation.
  • However: You can use attention in many architectures (not just seq2seq) and many tasks (not just MT)
  • More general definition of attention:
    • Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
  • We sometimes say that the query attends to the values.
  • For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values).
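Under the general definition, attention is just a function of a query and a set of values; a hedged sketch, assuming dot-product scoring:

```python
import numpy as np

def attend(query, values):
    # values: (n, d), query: (d,)
    scores = values @ query                 # how relevant each value is to the query
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()       # softmax over the values
    return weights @ values                 # the query "attends to" the values
```

In the seq2seq + attention model above, the decoder hidden state plays the role of the query and the encoder hidden states are the values.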

41 of 62

Attention is all you need

42 of 62

Self Attention to contextualize words

  • A limitation of word embeddings was that each word had a single vector, even though a word might have multiple meanings
  • Attention allows the representation of a word to incorporate the context in which the word appears
  • It does this by letting each word attend to the other words in the sentence
  • Example: it relates the word “it” to “animal”
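A minimal self-attention sketch in NumPy: each word’s new vector is a weighted sum of all the word vectors in the sentence, so a pronoun like “it” can pull in information from “animal”. The query/key/value projection matrices of the full Transformer layer are omitted; this is an illustrative simplification.

```python
import numpy as np

def self_attention(X):
    # X: (n_words, d) word vectors for one sentence
    scores = X @ X.T / np.sqrt(X.shape[1])                    # each word scores every word
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ X                                        # contextualized word vectors
```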

43 of 62

Machine Translation

  • The encoding component is a stack of encoders
  • The decoding component is a stack of the same number of decoders

44 of 62

Self Attention + Encoder-Decoder Attention

  • Self-attention contextualizes words within each sentence
  • Encoder-decoder attention helps the decoder focus on the relevant parts of the source sentence

45 of 62

Application to other tasks

46 of 62

Task Specific Models

Transformer + an extra layer fine-tuned to the task

47 of 62

SuperGLUE: a benchmark on 10 tasks

48 of 62

Transformer Models

  • BERT (Google)
  • XLNET (Google/CMU)
  • RoBERTa (Facebook)
  • DistilBERT (Hugging Face)
  • CTRL (Salesforce)
  • GPT-2 (OpenAI)
  • ALBERT (Google)
  • Megatron (NVIDIA)
  • T5 (Google)
  • GPT-3 (OpenAI)
  • GPT-4 (OpenAI)

49 of 62

Dimensions

Model          | Dimensions                                                   | Parameters
BERT           | 24-layer, 1024-hidden, 16-heads                              | 340M
GPT-2          | 48-layer, 1600-hidden, 25-heads                              | 1558M
Transformer XL | 18-layer, 1024-hidden, 16-heads                              | 257M
XLM            | 12-layer, 1024-hidden, 8-heads                               |
RoBERTa        | 24-layer, 1024-hidden, 16-heads                              | 355M
DistilBERT     | 6-layer, 768-hidden, 12-heads                                | 134M
CTRL           | 48-layer, 1280-hidden, 16-heads                              | 1.6B
ALBERT         | 12 repeating layers, 128-embedding, 4096-hidden, 64-heads    | 223M
T5             | 24-layer, 1024-hidden, 65536 feed-forward hidden, 128-heads  | 11B
GPT-3          | 96-layer, 96x 128-heads                                      | 175B

50 of 62

Google Neural MT Architecture

  • Training: 3 weeks on 100 TPUs

51 of 62

MT State of the Art

Language pair | Model                                      | BLEU
EN-DE         | Transformer Big + BT (Edunov et al., 2018) | 35.0
EN-DE         | Transformer Big (Vaswani et al., 2017)     | 28.4
EN-FR         | DeepL                                      | 45.9
EN-FR         | Transformer Big (Vaswani et al., 2017)     | 41.0

52 of 62

Training Costs

Costs of training an NMT model

Model            | BLEU (en-fr) | Training Cost (FLOPs) | Time
ConvS2S          | 40.56        | 1.5 × 10^20           |
MoE              | 40.56        | 1.2 × 10^20           |
ConvS2S Ensemble | 41.29        | 1.2 × 10^21           | 35 days

53 of 62

Multilingual Translation

54 of 62

Monolingual Translation

One Encoder-Decoder for each language pair

En-Es Encoder → En-Es Decoder

En-Nl Encoder → En-Nl Decoder

En-Fr Encoder → En-Fr Decoder

55 of 62

Multilingual Zero-shot Translation

Pairwise translation

Induced translation

56 of 62

Multilingual NMT

  • Pairwise NMT: "Hello, how are you?" → "Hola, ¿cómo estás?"
  • Multilingual NMT: "<2es> Hello, how are you?" → "Hola, ¿cómo estás?"
    • An artificial token (here <2es>) specifies only the target language; the input language is irrelevant
    • A single model is capable of translating between any language pair
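A tiny sketch of the trick above: prepend an artificial target-language token to the source sentence and train one model on all pairs. The function name and token format are illustrative assumptions.

```python
def add_target_token(src_sentence, target_lang):
    # e.g. "<2es> Hello, how are you?" tells the model to translate into Spanish
    return f"<2{target_lang}> {src_sentence}"

print(add_target_token("Hello, how are you?", "es"))
```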

57 of 62

Zero-shot translation

  • Simple: one single model
  • Enables zero-shot translation
  • Halves the translation time compared to bridging through an intermediate language
  • Helps with low-resource languages

Example: the model learns Japanese→Korean having seen only English→Japanese, Japanese→English, English→Korean and Korean→English pairs.

58 of 62

Quality: BLEU Score

Model                   | Single | Multi | Multi | Multi | Multi
#nodes                  | 1024   | 1024  | 1280  | 1536  | 1792
#params                 | 3B     | 255M  | 367M  | 499M  | 650M
Prod English→Japanese   | 23.66  | 21.10 | 21.17 | 21.72 | 21.70
Prod English→Korean     | 19.75  | 18.41 | 18.36 | 18.30 | 18.28
Prod Japanese→English   | 23.41  | 21.62 | 22.03 | 22.51 | 23.18
Prod Korean→English     | 25.42  | 22.87 | 23.46 | 24.00 | 24.67
Prod English→Spanish    | 34.50  | 34.25 | 34.40 | 34.77 | 34.70
Prod English→Portuguese | 38.40  | 37.35 | 37.42 | 37.80 | 37.92
Prod Spanish→English    | 38.00  | 36.04 | 36.50 | 37.26 | 37.45
Prod Portuguese→English | 44.40  | 42.53 | 42.82 | 43.64 | 43.87
Prod English→German     | 26.43  | 23.15 | 23.77 | 23.63 | 24.01
Prod English→French     | 35.37  | 34.00 | 34.19 | 34.91 | 34.81
Prod German→English     | 31.77  | 31.17 | 31.65 | 32.24 | 32.32
Prod French→English     | 36.47  | 34.40 | 34.56 | 35.35 | 35.52
ave diff                | -      | -1.72 | -1.43 | -0.95 | -0.76
vs single               | -      | -5.6% | -4.7% | -3.1% | -2.5%

59 of 62

No Language Left Behind

Project by Meta to deliver translations directly between 200 languages—including low-resource languages like Asturian, Luganda, Urdu and more.

https://ai.facebook.com/research/no-language-left-behind/

Stage 1: Automatic dataset construction

Collect sentences in the input language and desired output language

Stage 2: Training

Encoder-decoder model trained on millions of example translations

Stage 3: Evaluation

Evaluate model against a human-translated set of sentence translations

60 of 62

Does NMT understand meaning of sentences?

  • NMT does not attempt to capture the meaning of sentences
    • Example on GPT-2
      • Two plus two is …
        • a three, five is a six seven
        • the result of a simple equation, and the
        • four, and two plus three is four
  • Does it understand at least grammar?
    • Not really: attempts to exploit grammar explicitly have not been very successful

61 of 62

Visualization of translations

Figures: translated sentences colored by their meaning; one sentence in 3 languages; the same sentence in 3 languages shown in 3 colors.

62 of 62

Conclusions

We learned the history of Machine Translation (MT)

  • Since 2014, Neural MT rapidly replaced intricate Statistical MT
  • Sequence-to-sequence is the architecture for NMT (uses 2 RNNs)
  • Attention is a way to focus on particular parts of the input
    • Improves sequence-to-sequence a lot!
  • These models inspired the Transformer model, which has led to state-of-the-art solutions for many tasks besides MT
  • NMT does surprisingly well, considering that it does not yet understand the texts it translates
  • Multilingual translation is possible with zero-shot learning