1 of 42

Machine Translation

CSE 447 / 517

Feb 13th, 2025 (Week 6)

2 of 42

Logistics

  • Assignment 4 (A4) is due on Wednesday, 2/19
  • Project Checkpoint 3 is due on Monday, 3/03

3 of 42

Agenda

  • Machine Translation: The Noisy Channel Model
  • The IBM Model

  • Intro to Neural Machine Translation

4 of 42

NLP Task: Machine Translation

Mr President , Noah's ark was filled not with production factors , but with living creatures .

(From Language X)

Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

(To Language Y)

5 of 42

The Noisy Channel Model

Language X → Language Y

We want to translate Language X into Language Y.

6 of 42

The Noisy Channel Model

Source → Language Y → Channel → Language X (observed)

We want to translate Language X into Language Y.

Imagine there is a source that generates Language Y. But then it is passed through some channel, and we observe Language X on the other side of the channel.

7 of 42

The Noisy Channel Model


ŷ = argmax_y p(y | x) = argmax_y p(x | y) · p(y)

p(y) is the source model, aka a LM for Language Y! This captures the fluency in the target language.

8 of 42

The Noisy Channel Model


ŷ = argmax_y p(y | x) = argmax_y p(x | y) · p(y)

p(x | y) is the channel model; it captures the faithfulness of the translation.
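The argmax over p(x | y) · p(y) can be made concrete with a toy example. Everything below is invented for illustration: a tiny set of candidate Language Y outputs for some fixed input x, a made-up LM score p(y), and a made-up channel score p(x | y).

```python
# Toy noisy-channel decoder: pick the y maximizing p(x | y) * p(y).
# All probabilities below are invented for illustration.

def noisy_channel_decode(candidates, channel, lm):
    """Return the candidate y with the highest p(x|y) * p(y)."""
    return max(candidates, key=lambda y: channel[y] * lm[y])

# Hypothetical candidate translations of some fixed source sentence x.
candidates = ["das Haus", "der Haus", "das Pferd"]
lm = {"das Haus": 0.5, "der Haus": 0.1, "das Pferd": 0.4}        # fluency p(y)
channel = {"das Haus": 0.6, "der Haus": 0.7, "das Pferd": 0.05}  # faithfulness p(x|y)

print(noisy_channel_decode(candidates, channel, lm))  # "das Haus"
```

Note how the two terms trade off: "der Haus" is scored faithful but disfluent, "das Pferd" fluent but unfaithful; the product prefers "das Haus", which is decent on both.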

9 of 42

The Noisy Channel Model

Source → Language Y → Channel → Language X (observed)

Language X → Language Y (what we want)

Refer to this diagram when you lose track of which is which!

10 of 42

IBM Model 1 - Motivation

Mr President , Noah's ark was filled not with production factors , but with living creatures .

Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

IBM Model 1: What is the mapping from each token in Language X to Language Y?


11 of 42

IBM Model 1 - Alignment

IBM Model 1: What is the mapping from each token in Language X to Language Y?

Let l be the length of y and m be the length of x.

Latent variable a = (a_1, ..., a_m), where each a_i ranges over {0, ..., l} (positions in y; a_i = 0 means x_i is aligned to nothing in y).

a_i = j

12 of 42

IBM Model 1 - Alignment

IBM Model 1: What is the mapping from each token in Language X to Language Y?

Let l be the length of y and m be the length of x.

Latent variable a = a1,...,am, each ai ranging over {0,...,l} (positions in y).


a_i = j means: the ith token in Language X is aligned to the jth token in Language Y.

13 of 42

IBM Model 1


Mr President , Noah's ark was filled not with production factors , but with living creatures .

Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

IBM Model 1: What is the mapping from each token in Language X to Language Y?

a_i = j

14 of 42

IBM Model 1


Mr President , Noah's ark was filled not with production factors , but with living creatures .

Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

IBM Model 1: What is the mapping from each token in Language X to Language Y?

a_i = j

 i:   1  2  3  4  ...
a = [0, 0, 0, 1, ???]

15 of 42

IBM Model 1


Mr President , Noah's ark was filled not with production factors , but with living creatures .

Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

IBM Model 1: What is the mapping from each token in Language X to Language Y?

 i:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
a = [0, 0, 0, 1, 2, 3, 5, 4, 0, 6, 6, 7, 8, 0, 0, 9, 10]

a_i = j
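The alignment vector on this slide can be read off mechanically: a_i = j pairs the ith token of x with the jth token of y, and a_i = 0 leaves x_i unaligned. A small sketch using the slide's whitespace tokenization:

```python
# Read off word pairs from the slide's alignment: a_i = j means
# x_i aligns to y_j (j = 0 is the special null position).
x = ("Mr President , Noah's ark was filled not with production factors , "
     "but with living creatures .").split()
y = "Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .".split()
a = [0, 0, 0, 1, 2, 3, 5, 4, 0, 6, 6, 7, 8, 0, 0, 9, 10]

# Skip null-aligned tokens; y is 1-indexed in the alignment, so shift by one.
pairs = [(x[i], y[a[i] - 1]) for i in range(len(x)) if a[i] != 0]
print(pairs[:3])  # [("Noah's", 'Noahs'), ('ark', 'Arche'), ('was', 'war')]
```

"Mr", "President", and the function words with no German counterpart all map to position 0, leaving 11 aligned pairs.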

16 of 42

IBM Model 1


Our channel model:

p(x | y) = Σ_a p(x, a | y)

17 of 42

IBM Model 1


Our channel model:

p(x | y) = Σ_a p(x, a | y)

marginalized over all possible alignment vectors a.

18 of 42

IBM Model 1


Our channel model:

p(x | y) = Σ_a p(x, a | y), where

p(x, a | y) = ∏_{i=1}^{m} p(a_i | l, m) · p(x_i | y_{a_i})

19 of 42

IBM Model 1


Our channel model:

p(x, a | y) = ∏_{i=1}^{m} p(a_i | l, m) · p(x_i | y_{a_i})

The product ∏_{i=1}^{m} goes through every position in x.

20 of 42

IBM Model 1


Our channel model:

p(x, a | y) = ∏_{i=1}^{m} p(a_i | l, m) · p(x_i | y_{a_i})

p(a_i | l, m): how likely is the current alignment, without regard to the text?

21 of 42

IBM Model 1


Our channel model:

p(x, a | y) = ∏_{i=1}^{m} p(a_i | l, m) · p(x_i | y_{a_i})

p(x_i | y_{a_i}): how likely is the current alignment, with regard to the text?

22 of 42

IBM Model 1


Our channel model:

p(x, a | y) = ∏_{i=1}^{m} p(a_i | l, m) · p(x_i | y_{a_i})

p(a_i | l, m) = 1 / (l + 1): a uniform distribution (all distortions modelled by a are treated the same).

23 of 42

IBM Model 1


Our channel model:

p(x, a | y) = ∏_{i=1}^{m} p(a_i | l, m) · p(x_i | y_{a_i})

p(x_i | y_{a_i}) = θ(x_i | y_{a_i}): a learned parameter (the translation probability of x_i given the y token it aligns to).
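With a uniform 1/(l+1) alignment term and a table of translation probabilities, the channel model can be evaluated by brute force for toy inputs. The θ values and sentences below are invented for illustration, and "NULL" stands in for alignment position 0.

```python
from itertools import product

# Toy IBM Model 1 channel model: p(x, a | y) = prod_i 1/(l+1) * theta(x_i | y_{a_i}).
# theta values are invented for illustration; "NULL" plays the role of position 0.
theta = {
    ("Haus", "house"): 0.8, ("Haus", "NULL"): 0.1,
    ("das", "the"): 0.7, ("das", "NULL"): 0.2,
}

def p_x_a_given_y(x, a, y):
    l = len(y)
    prob = 1.0
    for i, ai in enumerate(a):
        y_word = "NULL" if ai == 0 else y[ai - 1]
        prob *= (1.0 / (l + 1)) * theta.get((x[i], y_word), 0.0)
    return prob

def p_x_given_y(x, y):
    # Marginalize over all (l+1)^m alignment vectors (only feasible for toy sizes).
    m, l = len(x), len(y)
    return sum(p_x_a_given_y(x, list(a), y) for a in product(range(l + 1), repeat=m))

print(round(p_x_given_y(["das", "Haus"], ["the", "house"]), 4))  # 0.09
```

Because Model 1 treats positions independently, the brute-force sum over all 9 alignments here equals a product of per-position sums: (1/3 · 0.9) · (1/3 · 0.9) = 0.09, which is what makes the model tractable for real sentence lengths.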

24 of 42

IBM Model 1


Mr President , Noah's ark was filled not with production factors , but with living creatures .

Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

25 of 42

IBM Model 1 - Learning


26 of 42

IBM Model 1 - Learning


How do we estimate this?

27 of 42

IBM Model 1 - Learning


The problem: we don’t know the alignments ahead of time, so we can’t directly apply MLE to find the parameters.

The solution: expectation maximization.
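The EM loop for Model 1 alternates between computing expected alignment counts under the current translation probabilities (E-step) and renormalizing those counts (M-step). A minimal sketch on an invented two-sentence toy corpus, starting from a uniform initialization:

```python
from collections import defaultdict

# EM for IBM Model 1 translation probabilities t(x_word | y_word).
# Toy parallel corpus (alignments are never observed).
corpus = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"])]

x_vocab = {w for x, _ in corpus for w in x}
t = defaultdict(lambda: 1.0 / len(x_vocab))  # uniform initialization

for _ in range(50):
    count = defaultdict(float)   # expected counts c(x_word, y_word)
    total = defaultdict(float)   # expected counts c(y_word)
    # E-step: in Model 1 the posterior over alignments factorizes per position.
    for x, y in corpus:
        for xw in x:
            z = sum(t[(xw, yw)] for yw in y)  # normalizer for this x word
            for yw in y:
                delta = t[(xw, yw)] / z      # posterior that xw aligns to yw
                count[(xw, yw)] += delta
                total[yw] += delta
    # M-step: renormalize the expected counts.
    for (xw, yw), c in count.items():
        t[(xw, yw)] = c / total[yw]

print(round(t[("Haus", "house")], 2))  # converges toward 1.0
```

Even though no alignment is ever given, the co-occurrence pattern ("das" appears with both "the house" and "the book") is enough for EM to pull t(Haus | house) and t(Buch | book) toward 1.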

28 of 42

Neural Machine Translation (NMT)

  • Based on a new model architecture: sequence-to-sequence (seq2seq), also called encoder-decoder
  • High-level view: the model has two parts:
    • An encoder that takes in the source language sentence f and outputs an encoding of the sentence, encode(f)
    • A decoder that at step j predicts the target language word e_j from the previously output target language words e_<j and encode(f)
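The encode-then-decode data flow can be sketched with a stand-in "model": here the encoder just packages f, and the decoder's next-word choice is a hand-written lookup table. Both are invented for illustration; a real system learns both parts as neural networks.

```python
# Toy encoder-decoder loop: encode(f) produces a context; the decoder
# predicts e_j from (e_<j, context) until it emits </s>.

def encode(f):
    return tuple(f)  # a real encoder would produce vectors, not the tokens themselves

# Hypothetical next-word table keyed by (context, previous word).
NEXT_WORD = {
    (("das", "Haus"), "<s>"): "the",
    (("das", "Haus"), "the"): "house",
    (("das", "Haus"), "house"): "</s>",
}

def decode(context, max_len=10):
    output, prev = [], "<s>"
    for _ in range(max_len):
        word = NEXT_WORD.get((context, prev), "</s>")  # stand-in for greedy argmax
        if word == "</s>":
            break
        output.append(word)
        prev = word
    return output

print(decode(encode(["das", "Haus"])))  # ['the', 'house']
```

The point is the interface, not the table: the decoder only ever sees its own previous outputs plus encode(f), exactly the conditioning described above.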

29 of 42

Slides 29–40: figures from Abigail See (images not captured in this transcript).

41 of 42

Final notes on NMT

  • To decode (get a translated sentence from the MT model), we can use methods discussed for previous sequence labeling tasks: greedy decoding, beam search, etc.
  • We showed how to use the encoder-decoder model for MT, but this is a general setup that works:
    • For many different NLP tasks
    • With different NN architectures (RNNs, Transformers)
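The greedy vs. beam-search distinction can be sketched with a toy next-word scorer. The log-probabilities below are invented; a real system would query the decoder at each step.

```python
# Beam search over a toy autoregressive scorer (log-probs invented for illustration).
def log_p(prefix, word):
    table = {
        ((), "the"): -0.2, ((), "a"): -0.4,
        (("the",), "house"): -1.0, (("the",), "home"): -0.9,
        (("a",), "house"): -0.3, (("a",), "home"): -2.0,
    }
    return table.get((prefix, word), -5.0)

def beam_search(vocab, steps, beam_size):
    beams = [((), 0.0)]  # (prefix, total log-probability)
    for _ in range(steps):
        # Extend every surviving prefix by every word, then keep the top beam_size.
        candidates = [(prefix + (w,), score + log_p(prefix, w))
                      for prefix, score in beams for w in vocab]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

vocab = ["the", "a", "house", "home"]
print(beam_search(vocab, steps=2, beam_size=1))  # ('the', 'home'): greedy
print(beam_search(vocab, steps=2, beam_size=2))  # ('a', 'house'): higher total score
```

With beam_size=1 this reduces to greedy decoding, which commits to "the" and misses the overall-best "a house"; a beam of 2 keeps the weaker first word alive long enough to win.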

42 of 42

Questions?

  • Thank you!