Studying Shared Representations in NMT through Zero-Shot Translation

SRA “Language Similarity in Multilingual Translation”

Danni Liu

Meta

16/02/2022

Shared representations through zero-shot translation

  • There are ∼7000 languages in the world
  • Multilingual translation system == multi-task learner for several tasks
    • 1 task == 1 translation direction
    • To cover N languages → N² translation directions
  • Parallel data not always available
  • Zero-shot: translation directions unseen in training
    • Esp. likely when scaling up language coverage


[Figure: parallel data availability across translation directions]

Shared representations through zero-shot translation

  • Zero-shot translation relies on shared representations
    • Utilize learned knowledge from directions w/ parallel data

  • Ideally:
    • Encoder creates language-independent, semantic representation
    • Decoder generates desired target language

[Figure: the encoder maps "Good day" and its German counterpart "Guten Tag" to a shared, language-independent representation; the decoder then generates the requested target language, e.g. Spanish "Buenos días" or French "Bonjour"]

Outline

  • Inspect how language-independent multilingual NMT models are

  • Methods to encourage language-independent representations:
    • Training objectives
    • Model architecture

  • Resources to improve zero-shot translation:
    • Data condition
    • Adapting pretrained models


How language-independent are existing NMT models?

  • Q: How much source language signal is preserved in encoder outputs?

  • Given:
    • Trained multilingual translation model (frozen)
    • Known set of source languages

  • Use a classifier to probe (Adi et al., 2017) the encoder outputs (see the sketch below)
    • Classifier operates on the token level

[Figure: a token-level classifier probes the frozen encoder outputs to predict the source language (de, en, fr, …, pt); source sentences are fed without source-language tags]
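To make the probing setup concrete, here is a minimal PyTorch sketch, assuming a frozen `encoder` callable that maps a batch of token IDs to hidden states; the classifier, dimensions, and training loop are illustrative assumptions, not the exact configuration behind these results.

```python
# Minimal sketch of token-level language probing (in the spirit of Adi et al., 2017).
# `encoder`, NUM_LANGUAGES, and HIDDEN_DIM are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LANGUAGES = 20   # e.g. de, en, fr, ..., pt
HIDDEN_DIM = 512     # dimensionality of the encoder outputs

# The probe itself: a simple linear classifier over individual token states.
probe = nn.Linear(HIDDEN_DIM, NUM_LANGUAGES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(encoder, src_tokens, src_lang_id):
    """One probing step: predict the source language (int id) of every token."""
    with torch.no_grad():                    # the translation model stays frozen
        enc_out = encoder(src_tokens)        # (batch, seq_len, HIDDEN_DIM)
    logits = probe(enc_out)                  # (batch, seq_len, NUM_LANGUAGES)
    # All tokens of a sentence share the sentence's language label
    # (padding handling omitted for brevity).
    labels = torch.full(logits.shape[:2], src_lang_id)
    loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

High probe accuracy then indicates that the frozen encoder outputs still carry strong source-language signals.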

How language-independent are existing NMT models?

  • High classification accuracy → language-specific representation

🙂 some notion of language similarity

😐 strong language signals preserved

[Figure: language classification confusion matrix (predicted vs. true) on Europarl; overall accuracy: 87% (Liu et al., 2021)]

Language-specific representations & zero-shot translation

  • Model ignores target language signal
    • Off-target translation
  • Low quality for zero-shot translation
    • Resort to pivoting → doubles inference-time computation 😐 (see the pivoting sketch below)

[Figure: off-target translation, where the decoder falls back to English output such as "Hello"; pivoting instead chains two supervised models: X → encoder → decoder → En, then En → encoder → decoder → Y]

[Chart: zero-shot vs. pivoting translation quality]
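For comparison, pivot-based inference chains two decoding passes, which is where the doubled inference-time computation comes from. A minimal sketch; `translate_fn` is a hypothetical wrapper around a trained multilingual model, not an API from this work.

```python
# Minimal sketch of pivot-based inference: decode twice, via a pivot language.
# `translate_fn` is a hypothetical callable (text, src_lang, tgt_lang) -> text.
def pivot_translate(translate_fn, text, src_lang, tgt_lang, pivot="en"):
    intermediate = translate_fn(text, src_lang, pivot)    # 1st decoding pass: X -> En
    return translate_fn(intermediate, pivot, tgt_lang)    # 2nd decoding pass: En -> Y
```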

Promote similar representations - similarity regularizer

  • Idea:
    • src and tgt sentences have the same meaning
    • Therefore should have same representation
  • Approach (Arivazhagan et al., 2019; Pham et al., 2019), sketched in code after this slide:
    • Additional training objective
    • Reduce difference between encoder(src) and encoder(tgt)
      • Pool sentence representations to a fixed length
      • Distance metric between the pooled vectors

[Figure: src and tgt both pass through the shared encoder and decoder; besides the MT loss, a similarity loss penalizes the distance between encoder(src) and encoder(tgt)]
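A minimal PyTorch sketch of such a similarity regularizer, assuming mean-pooling and an MSE distance; Arivazhagan et al. (2019) and Pham et al. (2019) explore several pooling and distance choices, so treat the specific combination below as illustrative.

```python
# Minimal sketch of the similarity regularizer: pool encoder outputs to a
# fixed-length sentence vector and penalize the src-tgt distance.
import torch
import torch.nn.functional as F

def mean_pool(states, padding_mask):
    """Average token states, ignoring padded positions.
    states: (batch, seq_len, hidden); padding_mask: (batch, seq_len), True = pad."""
    keep = (~padding_mask).unsqueeze(-1).float()
    return (states * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)

def similarity_loss(enc_src, src_pad, enc_tgt, tgt_pad):
    """Distance between pooled source and target encoder representations."""
    return F.mse_loss(mean_pool(enc_src, src_pad), mean_pool(enc_tgt, tgt_pad))

# Combined objective (lambda_sim is a tunable weight):
# total_loss = mt_loss + lambda_sim * similarity_loss(enc_src, src_pad, enc_tgt, tgt_pad)
```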

Promote similar representations - adversarial classifier

  • Idea:
    • w/o relying on hand-picked distance & pooling methods
    • Language classifiers should have difficulty distinguishing source languages
  • Approach (Arivazhagan et al., 2019), sketched in code after this slide:
    • Adversarial language classifier
    • Alternately optimize for translation and for classification

[Figure: a language classifier sits on top of the encoder; the MT loss and the language classifier loss (weighted by λ) are optimized adversarially]
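A minimal PyTorch sketch of the alternating adversarial setup; `model.encode`, `model.decoder_loss`, and the per-token `lang_labels` are hypothetical stand-ins for whatever interfaces the actual system exposes.

```python
# Minimal sketch of adversarial training with a language classifier.
import torch
import torch.nn.functional as F

def classifier_step(classifier, clf_opt, enc_out, lang_labels):
    """Train the classifier to identify the source language of each token."""
    logits = classifier(enc_out.detach())          # no gradients into the encoder
    loss = F.cross_entropy(logits.flatten(0, 1), lang_labels.flatten())
    clf_opt.zero_grad(); loss.backward(); clf_opt.step()
    return loss.item()

def translation_step(model, classifier, mt_opt, src, tgt, lang_labels, lambda_adv=0.1):
    """Train the NMT model: minimize the MT loss while *maximizing* the classifier loss.
    mt_opt holds only the translation model's parameters, so the classifier
    itself is not updated in this step."""
    enc_out = model.encode(src)                    # hypothetical model API
    mt_loss = model.decoder_loss(enc_out, tgt)     # hypothetical model API
    adv_loss = F.cross_entropy(classifier(enc_out).flatten(0, 1), lang_labels.flatten())
    loss = mt_loss - lambda_adv * adv_loss         # encoder is pushed to fool the classifier
    mt_opt.zero_grad(); loss.backward(); mt_opt.step()
    return loss.item()
```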

Effects of similarity-enforcing training objectives

[Chart: zero-shot translation quality on three datasets; pivoting BLEU: 19.1 / 26.0 / 22.1]

  Data amount:       0.9M   18M   0.5M
  # languages:       4      9     10
  # zero-shot dir.:  6      56    72

🙂 improved zero-shot translation quality

😐 performance gap to pivoting

😐 unaddressed: source word order difference

* Supervised BLEU degradation ≤ 0.5

Promote similar representations - source word order

  • Recall: similar representation for different source languages
  • But:
    • Word order difference
    • Preserved in encoder output
  • Residual connections
    • Shortcut access to bottom layers
    • Facilitate training
  • Side effect
    • One-to-one correspondence

[Figure: the word-order difference between "I can there go" and "I can go there" is carried one-to-one into the two encoders' outputs]

Promote similar representations - source word order

  • Idea:
    • Disentangle source word order
    • Give encoder some reordering capability
  • Approach (sketched in code after this slide):
    • Remove the residual connection in a middle encoder layer
    • Why a middle layer?
      • Gradual transition (syntactic → semantic)
      • Bottom layer: less gain on zero-shot
      • Top layer: slow convergence
    • Why remove it entirely?
      • Also tried: replacing the residual with a mean-pooled sentence embedding → less gain on zero-shot

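A minimal PyTorch sketch of an encoder layer whose self-attention residual can be switched off; building a 6-layer encoder and dropping the residual only in the middle layer mirrors the idea from Liu et al. (2021), though the dimensions and layer choice here are illustrative.

```python
# Minimal sketch: a Transformer encoder layer with an optional self-attention residual.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, keep_residual=True):
        super().__init__()
        self.keep_residual = keep_residual
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        # Dropping the residual here breaks the one-to-one positional
        # correspondence between the layer's input and output.
        x = self.norm1(x + attn_out) if self.keep_residual else self.norm1(attn_out)
        x = self.norm2(x + self.ffn(x))   # the FFN residual is kept as usual
        return x

# e.g. a 6-layer encoder with the residual removed only in the 3rd (middle) layer:
layers = nn.ModuleList([EncoderLayer(keep_residual=(i != 2)) for i in range(6)])
```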

Effects of removing residual connections in middle layer

[Chart: zero-shot translation quality with vs. without the middle-layer residual connection]

Analyzing source language signals

  • More confusion between related languages → Encoder captures language similarity

[Figure: language classification confusion matrices for the baseline Transformer vs. after residual removal (Liu et al., 2021)]

Complementary effects

  • Additional gain (+0.1∼2.8 BLEU) by combining training objectives & residual removal
  • On par with pivoting on Europarl

[Chart: zero-shot translation quality; pivoting BLEU: 26.0]

So far we focused on modeling, what about data?

  • 3 approaches to promoting language-independent representations:
    • Similarity loss
    • Adversarial language classifier
    • Residual removal @ middle encoder layer
  • Techniques above are complementary in improving zero-shot translation

Next up:

  • Start from English-centric data only → add local connectivity
  • Train with general-domain data, test on FLORES-101 (Goyal et al., 2021)


Choosing similar bridging languages

  Data:              WMT21 large-scale task   MultiCCAligned   OpenSubtitles
  Bridge language:   en/id                    en/hi            en/es

[Chart: zero-shot translation quality; pivoting BLEU values: 21.6, 31.9, 13.8, 12.5, 17.9, 19.7]

  • Zero-shot translation is easier with similar bridging languages


English-centric extended with local connectivity

  • Local connectivity in parallel data facilitates zero-shot translation
  • Our methods still improve zero-shot translation, though with smaller gains

[Chart: zero-shot translation quality as English-centric data is extended with 1-stop, 2-stop, and 3-stop local connectivity; pivoting BLEU: 14.6]

*: Avg. supervised performance degraded by 0.6~1.3 BLEU

Pretrain-finetune setup

  • Idea:
    • Pretrain-finetune has become an established paradigm
    • Adapt pretrained models towards more language-independent representations
  • Approach:

[Figure: setup: initialize from the pretrained M2M-124 model (Goyal et al., 2021); train with methods to promote language similarity; test on all {id, ms, tl, jv} × {id, ms, tl, jv} directions]

Finetune pretrained models for zero-shot translation

  • Similarity regularizer helps us settle at more language-independent models
  • Zero-shot translation 1.3-1.6 BLEU behind training with oracle parallel data
  • Training objective more useful than architectural change in this case

[Chart: “zero-shot” translation quality]

Encouraging cross-modality similarity

  • So far: encourage language-independent representations for text-to-text translation
  • Related idea:
    • Paired speech translation data is scarce, but the following are abundant:
      • transcription data
      • text-to-text translation data
    • Modality-independent representations for end-to-end speech-to-text translation

[Figure: a similarity loss encourages matching speech and text encoder representations (Dinh et al., 2022)]

Encouraging cross-modality similarity

  • Zero-shot inference remains difficult
  • Representational similarity improves few-shot speech translation

[Chart: few-shot speech translation quality on CoVoST en-de; BLEU with 100% of the data: 14.9 (Dinh et al., 2022)]

Ongoing: close the gap to pivot-based translation

  • Idea:
    • Pivoting is a very strong baseline
      • Input passes through model twice
      • w/ discrete intermediate tokens
    • Can we improve zero-shot translation by enforcing discrete intermediate representations?

  • Approach (sketched in code after this slide):
    • Vector quantization

[Figure: X → encoder → decoder → IL (interlingua) → encoder → decoder → Y, with parameter sharing between the two halves; the interlingua acts as a discrete intermediate representation]
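A minimal PyTorch sketch of vector quantization over encoder states, in the VQ-VAE style with a straight-through gradient estimator; codebook size, dimensions, and the commitment weight are illustrative assumptions, not details of the ongoing work.

```python
# Minimal sketch: snap each continuous encoder state to its nearest codebook
# entry, yielding a discrete intermediate representation.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=512, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta   # commitment loss weight

    def forward(self, x):
        # x: (batch, seq_len, dim) continuous encoder states
        flat = x.reshape(-1, x.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)     # distances to all codes
        codes = dist.argmin(dim=-1)                        # nearest-neighbor lookup
        quantized = self.codebook(codes).view_as(x)
        # Codebook + commitment losses (VQ-VAE style).
        vq_loss = ((quantized - x.detach()) ** 2).mean() \
                  + self.beta * ((quantized.detach() - x) ** 2).mean()
        # Straight-through estimator: forward uses the discrete codes,
        # backward passes gradients to x as if quantization were the identity.
        quantized = x + (quantized - x).detach()
        return quantized, codes.view(x.shape[:-1]), vq_loss
```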

Summary

  • Language-independent representations do not come by default

  • Various methods to improve zero-shot translation

  • Select suitable resources to improve zero-shot translation

  • Open question: Fully close the gap to pivot-based translation


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. ICLR 2017.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv preprint.

Tu Anh Dinh, Danni Liu, and Jan Niehues. 2022. Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques. To appear in ICASSP 2022.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation.

Danni Liu, Jan Niehues, James Cross, Francisco Guzmán, and Xian Li. 2021. Improving zero-shot translation by disentangling positional information. ACL 2021.

Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, and Alexander Waibel. 2019. Improving zero-shot translation with language-independent constraints. WMT 2019.


Question/Discussion

Thank you! :)

{danni.liu, jan.niehues}@maastrichtuniversity.nl
