Studying Shared Representations in NMT through Zero-Shot Translation

SRA “Language Similarity in Multilingual Translation”

Danni Liu

Meta

16/02/2022

Shared representations through zero-shot translation

  • There are ∼7000 languages in the world
  • Multilingual translation system == multi-task learner for several tasks
    • 1 task == 1 translation direction
    • To cover N languages → N² translation directions
  • Parallel data not always available
  • Zero-shot: translation directions unseen in training
    • Esp. likely when scaling up language coverage


[Figure: parallel data availability across translation directions]

Shared representations through zero-shot translation

  • Zero-shot translation relies on shared representations
    • Utilize learned knowledge from directions w/ parallel data

  • Ideally:
    • Encoder creates language-independent, semantic representation
    • Decoder generates desired target language

[Figure: the encoder maps "Good day" and its German counterpart "Guten Tag" to a shared, language-independent representation; the decoder then generates the requested target language, e.g. Spanish "Buenos días" or French "Bonjour"]

Outline

  • Inspect how language-independent multilingual NMT models are

  • Methods to encourage language-independent representations:
    • Training objectives
    • Model architecture

  • Resources to improve zero-shot translation:
    • Data condition
    • Adapting pretrained models


How language-independent are existing NMT models?

  • Q: How much source language signal is preserved in encoder outputs?

  • Given:
    • Trained multilingual translation model (frozen)
    • Known set of source languages

  • Use a classifier to probe (Adi et al., 2017) the encoder outputs (see the sketch below)
    • Classifier operates on the token level

[Figure: a token-level classifier probes the frozen encoder outputs to predict the source language (de, en, fr, …, pt); source sentences are fed without source-language tags]
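To make the probing setup concrete, here is a minimal PyTorch sketch, assuming a frozen `encoder` callable that maps a batch of token IDs to hidden states; the classifier, dimensions, and training loop are illustrative assumptions, not the exact configuration behind these results.

```python
# Minimal sketch of token-level language probing (in the spirit of Adi et al., 2017).
# `encoder`, NUM_LANGUAGES, and HIDDEN_DIM are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LANGUAGES = 20   # e.g. de, en, fr, ..., pt
HIDDEN_DIM = 512     # dimensionality of the encoder outputs

# The probe itself: a simple linear classifier over individual token states.
probe = nn.Linear(HIDDEN_DIM, NUM_LANGUAGES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(encoder, src_tokens, src_lang_id):
    """One probing step: predict the source language (int id) of every token."""
    with torch.no_grad():                    # the translation model stays frozen
        enc_out = encoder(src_tokens)        # (batch, seq_len, HIDDEN_DIM)
    logits = probe(enc_out)                  # (batch, seq_len, NUM_LANGUAGES)
    # All tokens of a sentence share the sentence's language label
    # (padding handling omitted for brevity).
    labels = torch.full(logits.shape[:2], src_lang_id)
    loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

High probe accuracy then indicates that the frozen encoder outputs still carry strong source-language signals.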

How language-independent are existing NMT models?

  • High classification accuracy → language-specific representation

🙂 some notion of language similarity

😐 strong language signals preserved

[Figure: language classification confusion matrix (predicted vs. true) on Europarl; overall accuracy: 87% (Liu et al., 2021)]

Language-specific representations & zero-shot translation

  • Model ignores target language signal
    • Off-target translation
  • Low quality for zero-shot translation
    • Resort to pivoting → doubles inference-time computation 😐 (see the pivoting sketch below)

[Figure: off-target translation, where the decoder falls back to English output such as "Hello"; pivoting instead chains two supervised models: X → encoder → decoder → En, then En → encoder → decoder → Y]

[Chart: zero-shot vs. pivoting translation quality]
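For comparison, pivot-based inference chains two decoding passes, which is where the doubled inference-time computation comes from. A minimal sketch; `translate_fn` is a hypothetical wrapper around a trained multilingual model, not an API from this work.

```python
# Minimal sketch of pivot-based inference: decode twice, via a pivot language.
# `translate_fn` is a hypothetical callable (text, src_lang, tgt_lang) -> text.
def pivot_translate(translate_fn, text, src_lang, tgt_lang, pivot="en"):
    intermediate = translate_fn(text, src_lang, pivot)    # 1st decoding pass: X -> En
    return translate_fn(intermediate, pivot, tgt_lang)    # 2nd decoding pass: En -> Y
```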

Promote similar representations - similarity regularizer

  • Idea:
    • src and tgt sentences have the same meaning
    • Therefore should have same representation
  • Approach (Arivazhagan et al., 2019; Pham et al., 2019), sketched in code after this slide:
    • Additional training objective
    • Reduce difference between encoder(src) and encoder(tgt)
      • Pool sentence representations to a fixed length
      • Distance metric between the pooled vectors

[Figure: src and tgt both pass through the shared encoder and decoder; besides the MT loss, a similarity loss penalizes the distance between encoder(src) and encoder(tgt)]
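A minimal PyTorch sketch of such a similarity regularizer, assuming mean-pooling and an MSE distance; Arivazhagan et al. (2019) and Pham et al. (2019) explore several pooling and distance choices, so treat the specific combination below as illustrative.

```python
# Minimal sketch of the similarity regularizer: pool encoder outputs to a
# fixed-length sentence vector and penalize the src-tgt distance.
import torch
import torch.nn.functional as F

def mean_pool(states, padding_mask):
    """Average token states, ignoring padded positions.
    states: (batch, seq_len, hidden); padding_mask: (batch, seq_len), True = pad."""
    keep = (~padding_mask).unsqueeze(-1).float()
    return (states * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)

def similarity_loss(enc_src, src_pad, enc_tgt, tgt_pad):
    """Distance between pooled source and target encoder representations."""
    return F.mse_loss(mean_pool(enc_src, src_pad), mean_pool(enc_tgt, tgt_pad))

# Combined objective (lambda_sim is a tunable weight):
# total_loss = mt_loss + lambda_sim * similarity_loss(enc_src, src_pad, enc_tgt, tgt_pad)
```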

Promote similar representations - adversarial classifier

  • Idea:
    • w/o relying on hand-picked distance & pooling methods
    • Language classifiers should have difficulty distinguishing source languages
  • Approach (Arivazhagan et al., 2019), sketched in code after this slide:
    • Adversarial language classifier
    • Alternately optimize for translation and for classification

[Figure: a language classifier sits on top of the encoder; the MT loss and the language classifier loss (weighted by λ) are optimized adversarially]
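A minimal PyTorch sketch of the alternating adversarial setup; `model.encode`, `model.decoder_loss`, and the per-token `lang_labels` are hypothetical stand-ins for whatever interfaces the actual system exposes.

```python
# Minimal sketch of adversarial training with a language classifier.
import torch
import torch.nn.functional as F

def classifier_step(classifier, clf_opt, enc_out, lang_labels):
    """Train the classifier to identify the source language of each token."""
    logits = classifier(enc_out.detach())          # no gradients into the encoder
    loss = F.cross_entropy(logits.flatten(0, 1), lang_labels.flatten())
    clf_opt.zero_grad(); loss.backward(); clf_opt.step()
    return loss.item()

def translation_step(model, classifier, mt_opt, src, tgt, lang_labels, lambda_adv=0.1):
    """Train the NMT model: minimize the MT loss while *maximizing* the classifier loss.
    mt_opt holds only the translation model's parameters, so the classifier
    itself is not updated in this step."""
    enc_out = model.encode(src)                    # hypothetical model API
    mt_loss = model.decoder_loss(enc_out, tgt)     # hypothetical model API
    adv_loss = F.cross_entropy(classifier(enc_out).flatten(0, 1), lang_labels.flatten())
    loss = mt_loss - lambda_adv * adv_loss         # encoder is pushed to fool the classifier
    mt_opt.zero_grad(); loss.backward(); mt_opt.step()
    return loss.item()
```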

Effects of similarity-enforcing training objectives

[Chart: zero-shot translation quality on three datasets; pivoting BLEU: 19.1 / 26.0 / 22.1]

  Data amount:       0.9M   18M   0.5M
  # languages:       4      9     10
  # zero-shot dir.:  6      56    72

🙂 improved zero-shot translation quality

😐 performance gap to pivoting

😐 unaddressed: source word order difference

* Supervised BLEU degradation ≤ 0.5

Promote similar representations - source word order

  • Recall: similar representation for different source languages
  • But:
    • Word order difference
    • Preserved in encoder output
  • Residual connections
    • Shortcut access to bottom layers
    • Facilitate training
  • Side effect
    • One-to-one correspondence

[Figure: the word-order difference between "I can there go" and "I can go there" is carried one-to-one into the two encoders' outputs]

Promote similar representations - source word order

  • Idea:
    • Disentangle source word order
    • Give encoder some reordering capability
  • Approach (sketched in code after this slide):
    • Remove the residual connection in a middle encoder layer
    • Why a middle layer?
      • Gradual transition (syntactic → semantic)
      • Bottom layer: less gain on zero-shot
      • Top layer: slow convergence
    • Why remove it entirely?
      • Also tried: replacing the residual with a mean-pooled sentence embedding → less gain on zero-shot

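A minimal PyTorch sketch of an encoder layer whose self-attention residual can be switched off; building a 6-layer encoder and dropping the residual only in the middle layer mirrors the idea from Liu et al. (2021), though the dimensions and layer choice here are illustrative.

```python
# Minimal sketch: a Transformer encoder layer with an optional self-attention residual.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, keep_residual=True):
        super().__init__()
        self.keep_residual = keep_residual
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        # Dropping the residual here breaks the one-to-one positional
        # correspondence between the layer's input and output.
        x = self.norm1(x + attn_out) if self.keep_residual else self.norm1(attn_out)
        x = self.norm2(x + self.ffn(x))   # the FFN residual is kept as usual
        return x

# e.g. a 6-layer encoder with the residual removed only in the 3rd (middle) layer:
layers = nn.ModuleList([EncoderLayer(keep_residual=(i != 2)) for i in range(6)])
```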

Effects of removing residual connections in middle layer

[Chart: zero-shot translation quality with vs. without the middle-layer residual connection]

Analyzing source language signals

  • More confusion between related languages → Encoder captures language similarity

[Figure: language classification confusion matrices for the baseline Transformer vs. after residual removal (Liu et al., 2021)]

Complementary effects

  • Additional gain (+0.1∼2.8 BLEU) by combining training objectives & residual removal
  • On par with pivoting on Europarl

[Chart: zero-shot translation quality; pivoting BLEU: 26.0]

So far we focused on modeling, what about data?

  • 3 approaches to promoting language-independent representations:
    • Similarity loss
    • Adversarial language classifier
    • Residual removal @ middle encoder layer
  • Techniques above are complementary in improving zero-shot translation

Next up:

  • Start from English-centric data only → add local connectivity
  • Train with general-domain data, test on FLORES-101 (Goyal et al., 2021)


Choosing similar bridging languages

  Data:              WMT21 large-scale task   MultiCCAligned   OpenSubtitles
  Bridge language:   en/id                    en/hi            en/es

[Chart: zero-shot translation quality; pivoting BLEU values: 21.6, 31.9, 13.8, 12.5, 17.9, 19.7]

  • Zero-shot translation is easier with similar bridging languages


English-centric extended with local connectivity

  • Local connectivity in parallel data facilitates zero-shot translation
  • Our methods still improve zero-shot translation, though with smaller gains

[Chart: zero-shot translation quality as English-centric data is extended with 1-stop, 2-stop, and 3-stop local connectivity; pivoting BLEU: 14.6]

*: Avg. supervised performance degraded by 0.6~1.3 BLEU

Pretrain-finetune setup

  • Idea:
    • Pretrain-finetune has become an established paradigm
    • Adapt pretrained models towards more language-independent representations
  • Approach:

[Figure: setup: initialize from the pretrained M2M-124 model (Goyal et al., 2021); train with methods to promote language similarity; test on all {id, ms, tl, jv} × {id, ms, tl, jv} directions]

Finetune pretrained models for zero-shot translation

  • Similarity regularizer helps us settle at more language-independent models
  • Zero-shot translation 1.3-1.6 BLEU behind training with oracle parallel data
  • Training objective more useful than architectural change in this case

[Chart: “zero-shot” translation quality]

Encouraging cross-modality similarity

  • So far: encourage language-independent representations for text-to-text translation
  • Related idea:
    • Paired speech translation data is scarce, but the following are abundant:
      • transcription data
      • text-to-text translation data
    • Modality-independent representations for end-to-end speech-to-text translation

[Figure: a similarity loss encourages matching speech and text encoder representations (Dinh et al., 2022)]

Encouraging cross-modality similarity

  • Zero-shot inference remains difficult
  • Representational similarity improves few-shot speech translation

[Chart: few-shot speech translation quality on CoVoST en-de; BLEU with 100% of the data: 14.9 (Dinh et al., 2022)]

Ongoing: close the gap to pivot-based translation

  • Idea:
    • Pivoting is a very strong baseline
      • Input passes through model twice
      • w/ discrete intermediate tokens
    • Can we improve zero-shot translation by enforcing discrete intermediate representations?

  • Approach (sketched in code after this slide):
    • Vector quantization

[Figure: X → encoder → decoder → IL (interlingua) → encoder → decoder → Y, with parameter sharing between the two halves; the interlingua acts as a discrete intermediate representation]
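A minimal PyTorch sketch of vector quantization over encoder states, in the VQ-VAE style with a straight-through gradient estimator; codebook size, dimensions, and the commitment weight are illustrative assumptions, not details of the ongoing work.

```python
# Minimal sketch: snap each continuous encoder state to its nearest codebook
# entry, yielding a discrete intermediate representation.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=512, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta   # commitment loss weight

    def forward(self, x):
        # x: (batch, seq_len, dim) continuous encoder states
        flat = x.reshape(-1, x.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)     # distances to all codes
        codes = dist.argmin(dim=-1)                        # nearest-neighbor lookup
        quantized = self.codebook(codes).view_as(x)
        # Codebook + commitment losses (VQ-VAE style).
        vq_loss = ((quantized - x.detach()) ** 2).mean() \
                  + self.beta * ((quantized.detach() - x) ** 2).mean()
        # Straight-through estimator: forward uses the discrete codes,
        # backward passes gradients to x as if quantization were the identity.
        quantized = x + (quantized - x).detach()
        return quantized, codes.view(x.shape[:-1]), vq_loss
```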

Summary

  • Language-independent representations do not come by default

  • Various methods to improve zero-shot translation

  • Select suitable resources to improve zero-shot translation

  • Open question: Fully close the gap to pivot-based translation


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. ICLR 2017.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv preprint.

Tu Anh Dinh, Danni Liu, and Jan Niehues. 2022. Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques. To appear in ICASSP 2022.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.

Danni Liu, Jan Niehues, James Cross, Francisco Guzmán, and Xian Li. 2021. Improving zero-shot translation by disentangling positional information. ACL 2021.

Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, and Alexander Waibel. 2019. Improving zero-shot translation with language-independent constraints. WMT 2019.


Question/Discussion

Thank you! :)

{danni.liu, jan.niehues}@maastrichtuniversity.nl
