1 of 156

Multilingual Neural Machine Translation

1

COLING 2020 Tutorial, December 12th, 2020

Raj Dabre

NICT

Kyoto, Japan

Chenhui Chu

Kyoto University

Kyoto, Japan

Anoop Kunchukuttan

Microsoft STCI

Hyderabad, India

2 of 156

2

Tutorial Homepage

https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020

Survey Paper

Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A Survey of Multilingual Neural Machine Translation. ACM Comput. Surv. 53, 5, Article 99 (September 2020), 38 pages. https://doi.org/10.1145/3406095

Updated Bibliography (the field is moving so fast!)

https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020/mnmt_bibliography.pdf

Tutorial Material

3 of 156

Self Introduction (Chenhui Chu)

  • Experience
    • 2020-present: program-specific associate professor @ Kyoto University
    • 2017-2020: research assistant professor @ Osaka University
    • 2015-2017: researcher @ Japan Science and Technology Agency
    • 2014-2015: research fellowship for young scientists (DC2) @ JSPS
  • Research
    • Machine translation (JSPS DC2, Chinese-Japanese MT practical application project, JSPS research activity start-up)
    • Language and vision understanding (ACT-I, MSRA CORE, JSPS young scientists)

3

4 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

4

5 of 156

What is Multilingual NMT (MNMT)?

  • Definition
    • Neural machine translation (NMT) systems handling translation between more than one language pair

  • A research field to study:
    • How can we leverage multilingual data effectively in order to learn distributions across multiple languages so as to improve MT (NLP) performance across all languages?

5

6 of 156

Why MNMT?

6

Size of the top-100 language pairs in OPUS (Tiedemann et al., 2012)

7 of 156

Why MNMT?

7

Data Distribution over language pairs (Arivazhagan et al., 2019)

8 of 156

Motivation and Goal of MNMT (Dabre et al., 2020)

  • Motivation
    • MNMT systems are desirable because training models with data from many language pairs might help a resource-poor language acquire extra knowledge from the other languages
    • MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality compared to bilingual NMT systems
  • Goal
    • Have a one-model-for-all-languages solution to MT (NLP) applications
    • Shared multilingual distributed representations help MT (NLP) for low-resource languages

8

9 of 156

Timeline of MNMT Research

9

[Timeline figure, 2014-2020: NMT; NMT with attention; Multi-source MNMT; Multiway MNMT; Google MNMT; Transformer; BERT; Low-resource MNMT; Massive MNMT; Beyond English-Centric]

10 of 156

Google’s MNMT System (Johnson et al., 2017)

10

A single multilingual model achieves comparable performance for English-French and surpasses state-of-the-art results for English-German.

11 of 156

Language Clusters (Johnson et al., 2017)

11

12 of 156

Mixing Target Languages (Johnson et al., 2017)

12

13 of 156

Google’s Massive MNMT System (Arivazhagan et al., 2019)

13

Studied a single massive MNMT model handling 103 languages, trained on over 25 billion examples

[Figure: BLEU across language pairs]

14 of 156

Effect of Massive MNMT (Arivazhagan et al., 2019)

14

Massive MNMT is helpful for low-resource languages but not high-resource languages

15 of 156

Effect of Task Numbers (Arivazhagan et al., 2019)

15

Increasing the number of languages is not always helpful

16 of 156

Facebook’s Beyond English-Centric MNMT System (Fan et al., 2020)

  • M2M-100: the first MNMT model that can translate between any pair of 100 languages without relying on English data
  • It outperforms English-centric systems by 10 points on the widely used BLEU metric for evaluating machine translations
  • M2M-100 is trained on a total of 2,200 language directions, 10x more than previous best English-centric multilingual models

16

17 of 156

Performance of M2M-100 (Fan et al., 2020)

17

18 of 156

Goal of This Tutorial

  • Understand the basics of MNMT
  • Cover an in-depth survey of existing literature on MNMT
    • Central use-case, resource scenarios, underlying modeling principles, core-issues and challenges of various approaches
    • Strengths and weaknesses of several techniques by comparing them with each other
  • Inspire new ideas for researchers and engineers interested in MNMT

18

19 of 156

Basic NMT Architecture (Dabre et al., 2020)

19

20 of 156

RNN based NMT (Bahdanau et al., 2015)

20

21 of 156

Self-Attention Based NMT (Vaswani et al., 2017)

21

22 of 156

Initial MNMT Models (1/2) (Firat et al., 2016)

22

[Figure (minimal parameter sharing): separate encoders (Encoder 1 ... Encoder X) and separate decoders (Decoder 1 ... Decoder Y), one per language, connected through a shared attention mechanism. Example: encode English "I am a boy." and decode Italian "Sono un ragazzo."; other languages shown include Hindi "मैं लड़का हूँ." and Marathi "मी मुलगा आहे."]

23 of 156

Initial MNMT Models (2/2) (Johnson et al., 2017)

23

[Figure (complete parameter sharing): a single shared encoder, attention mechanism, and decoder for all languages. The target language is indicated by a token prepended to the source, e.g. "<2mr> I am a boy." → "मी मुलगा आहे.", "<2it> I am a boy." → "Sono un ragazzo.", "<2it> मैं लड़का हूँ." → "Sono un ragazzo." A preprocessing sketch follows below.]
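A minimal preprocessing sketch of the target-token trick above (illustrative only; the function name and toy corpus are my own, not Johnson et al.'s code):

```python
# Minimal sketch: prepend a target-language token to each source sentence so a
# single fully shared model can be trained on all translation directions.

def add_target_token(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a <2xx> token indicating the desired target language."""
    return f"<2{tgt_lang}> {src_sentence}"

# Toy multilingual corpus: (source, target, target-language code)
corpus = [
    ("I am a boy.", "मी मुलगा आहे.", "mr"),       # English -> Marathi
    ("I am a boy.", "Sono un ragazzo.", "it"),    # English -> Italian
    ("मैं लड़का हूँ.", "Sono un ragazzo.", "it"),      # Hindi -> Italian
]

training_pairs = [(add_target_token(src, lang), tgt) for src, tgt, lang in corpus]
for src, tgt in training_pairs:
    print(src, "=>", tgt)
```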

24 of 156

24

Overview of MNMT (Dabre et al., 2020): use-cases and core-issues

  • Multiway Modeling; core-issues: Parameter Sharing, Language Divergence, Training Protocols
  • Low-resource Translation: Transfer Learning, Zero-shot Modelling, Zero-resource Modeling
  • Multi-source Translation

25 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

25

26 of 156

Self Introduction (Raj Dabre)

  • Experience
    • 2018-present: Researcher at NICT, Japan
    • 2014-2018: MEXT Ph.D. scholar at Kyoto University, Japan
    • 2011-2014: M.Tech. Government RA at IIT Bombay, India
  • Research
    • Low-Resource Natural Language Processing
      • Multilingual Machine Translation: 2012-present
      • Fundamental Analysis: 2011-2014
    • Efficient Deep Learning:
      • Compact, flexible and fast models (2018-present)

26

27 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

27

28 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

28

29 of 156

Parameter Sharing: Finding The Right Balance

  • Empirically determined component sharing
    • How to maximize impact of shared vocabulary?
    • Universal encoders or decoders?
    • Should attention be universal or language specific?
    • How much sharing is good sharing?

  • Parameterized parameters for shared modeling
    • Learn about sharing

29

30 of 156

Sharing Vocabularies (1)

  • Naive solution: A single vocabulary for all languages (Johnson et al., 2017)
  • Adding constraints: Soft-decoupled encoding (Wang et al., 2019)
    • Word or sub-word vocabulary
    • N-gram embeddings, language specific and latent semantic transformation
    • Good for linguistically similar languages in transfer learning setups
  • Future investigation
    • Indic languages
    • Groups of language families (for massively multilingual NMT)
      • Chung et al. 2020, Lyu et al. 2020

30

31 of 156

Sharing Vocabularies (2)

  • Language sensitive embeddings (Wang et al., 2019)
    • Augment shared embeddings with language specific indicators (factors)
    • One hot, learnable, direction specific
  • Language family shared embeddings
    • Separate shared embeddings for different families
    • Avoid accidental n-gram vocabulary overlaps

31

[Figure: the shared word embedding E_word(I) of an English token "I" is combined with a language embedding E_lang(en), either by concatenation (E_word : E_lang) or by addition (E_word + E_lang), before being fed to the encoder. A sketch follows below.]
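A minimal sketch of combining word and language embeddings (illustrative PyTorch; module names and dimensions are assumptions, not Wang et al.'s implementation):

```python
# Minimal sketch: augment shared word embeddings with a language indicator
# embedding, combined by concatenation or addition before the encoder.
import torch
import torch.nn as nn

class LanguageSensitiveEmbedding(nn.Module):
    def __init__(self, vocab_size, n_langs, d_model, mode="add"):
        super().__init__()
        self.mode = mode
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)
        # For "concat", project the concatenated vector back to d_model.
        self.proj = nn.Linear(2 * d_model, d_model) if mode == "concat" else None

    def forward(self, token_ids, lang_id):
        w = self.word_emb(token_ids)                       # (batch, seq, d_model)
        l = self.lang_emb(lang_id).unsqueeze(1).expand_as(w)
        if self.mode == "add":
            return w + l
        return self.proj(torch.cat([w, l], dim=-1))

emb = LanguageSensitiveEmbedding(vocab_size=1000, n_langs=4, d_model=64, mode="concat")
tokens = torch.randint(0, 1000, (2, 5))    # two sentences of five tokens each
langs = torch.tensor([0, 2])               # language id of each sentence
print(emb(tokens, langs).shape)            # torch.Size([2, 5, 64])
```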

32 of 156

Sharing In Encoders and Decoders (1)

  • Most works agree on a universal encoder
    • Dong et al. 2015; Sachan et al. 2018; Wang et al. 2019
    • Lack of cross-lingual components

  • Component sharing in decoders is tricky
    • Balance between universal and distinct representation (more on this later)
    • Cross-lingual components

32

33 of 156

Sharing In Encoders and Decoders (2)

  • Sachan et al. 2018 get optimal results with sharing attention and embedding
  • Distinct FF parameters for language specific adaptation
  • Reduces MNMT model sizes by 20-30% while slightly improving quality

33

34 of 156

Sharing In Encoders and Decoders (3)

  • Wang et al. 2019 focus on language sensitization
    • Sharing self-attention and FF parameters
    • Shared encoder/decoder →representor
      • Also in Dabre et al. 2019
  • Language sensitization strategies:
    • Language sensitive embeddings
    • Language pair specific cross-attention
    • Language discrimination
  • General improvement of 1-2 BLEU

34

35 of 156

A Note On Attention Sharing (Blackwood et al. 2018)

  • Universal shared attention
    • Firat et al. 2016
    • Too much load on a single set of parameters
  • Source specific attention
    • Reduces load but not too effective
  • Target specific attention
    • Works best (shows importance of unshared components in decoders)
    • In contrast to Sachan et al. 2018, where shared attention was optimal
  • Source-target specific attention
    • Least load but prevents zero-shot

35

36 of 156

Contextual Parameters (Platanios et al. 2018)

36

Generate parameters using parameters!

37 of 156

Contextual Parameters (Platanios et al. 2018)

  • Meta-parameterization
    • Parameters generate parameters: Θ = F(λ) (see the sketch below)
  • Can support parameter groups
    • Potential for self-discovering optimal language families
    • Related work on family specific sharing by Lyu et al. 2020
  • Significant model compression
    • Separate models: O(L*L*P + 2*L*L*W*V)
    • This model: O(P*M + L*W*V)
  • Significant improvement in translation quality
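A minimal sketch of the idea (illustrative PyTorch; the generator here produces a single linear layer's weights from a language embedding, a simplification of Platanios et al.'s contextual parameter generator):

```python
# Minimal sketch: a "contextual parameter generator" produces the weights of a
# layer from a language embedding, so parameters are shared through the
# generator rather than stored separately per language.
import torch
import torch.nn as nn

class ContextualLinear(nn.Module):
    """A linear layer whose weights are generated from a language embedding."""
    def __init__(self, n_langs, lang_dim, d_in, d_out):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.lang_emb = nn.Embedding(n_langs, lang_dim)
        # Generator: maps the language embedding to a flattened weight matrix + bias.
        self.generator = nn.Linear(lang_dim, d_in * d_out + d_out)

    def forward(self, x, lang_id):
        params = self.generator(self.lang_emb(lang_id))        # (d_in*d_out + d_out,)
        W = params[: self.d_in * self.d_out].view(self.d_out, self.d_in)
        b = params[self.d_in * self.d_out:]
        return torch.nn.functional.linear(x, W, b)

layer = ContextualLinear(n_langs=10, lang_dim=8, d_in=32, d_out=32)
x = torch.randn(4, 32)
print(layer(x, torch.tensor(3)).shape)   # torch.Size([4, 32])
```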

37

38 of 156

Contextual Parameters (Platanios et al., 2018)

  • Automatically learns language similarity
    • IT similar to RO
    • VI similar to FR?
  • Make groups of contextual parameters
    • Two step meta-learning
    • Step 1: Discover groups
    • Step 2: Optimal sharing
  • Next steps:
    • Conditional computation
    • Neural architecture search
    • Tensor routing (Zaremoodi et al. 2018)

38

39 of 156

Tensor routing (Zaremoodi et al. 2018)

  • Augmented encoder block
    • Language specific block (dotted)
    • Shared blocks (solid)
    • Routing controls relevance of blocks
  • Another perspective
    • Each block is an expert
    • Related to MoE and GShard works

39

40 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

40

41 of 156

Massively Multilingual Models

  • Google’s work
    • MNMT In the wild
    • Adaptor Layers
    • GShard
  • University of Edinburgh’s work
    • Reproducible
  • Bottlenecks
      • Hardware/computational/cost
      • Representational

41

42 of 156

Google’s Massively MNMT (Aharoni et al. 2019)

  • 59-lingual low-resource models (6-layer transformer base)
  • One to many models better than many-to-many for non-English targets
    • 2-3 BLEU improvement
    • Relative over-representation of English
  • Many to one models worse than many-to-many for English targets
    • 2-3 BLEU drop
    • Relative under-representation of English

42

43 of 156

Google’s Massively MNMT (Aharoni et al. 2019)

  • 103-lingual model (6-layer variation of transformer-big)
  • Many-to-one and one-to-many are both better than many-to-many
    • N-way corpora are detrimental to many-to-one performance
  • Supervised NMT quality degrades with more languages
  • Zero-shot NMT quality increases with more languages

43

Ukrainian to Russian Zero-Shot Translation

44 of 156

Google’s Massively MNMT Model In the Wild

  • Extension of Aharoni et al. 2019 by Arivazhagan et al. 2019
  • Main focus
    • Temperature based data sampling for transfer-interference balance
    • Pushing number of supported language pairs to limit
    • Pushing MNMT performance with 1+ billion parameters
  • Starting point: Quality is directly proportional to data size

44

45 of 156

More Languages and Directions

  • Generic drop in performance for resource rich pairs
  • Performance degrades as the number of pairs increases
    • 25 languages seems to be optimal
  • Separate one-to-all and all-to-one models preferred
    • All-to-all models degrade translation into English

45

46 of 156

Are Larger Models Helpful?

  • 400M parameters
    • Transformer Big:
      • 6-layers, 1024-4096 hidden-filter sizes
      • 16 attention heads
  • 1.3B parameters
    • Transformer Deep:
      • 24-layers, 1024-4096 hidden-filter sizes
      • 16 attention heads
      • Low-resource performance
    • Transformer Wide:
      • 12-layers, 2048-16384 hidden-filter sizes
      • 32 attention heads
      • High-resource performance
  • 128 TPUs and 4M tokens per batch ($$$)

46

47 of 156

Language Aware Multilingualism

  • Zhang et al. 2020
    • OPUS 100 dataset: 55M sentence pairs
  • Deep transformers: Up to 2 BLEU better using 12 layers instead of 6 layers
    • Depth scaled initialization: Better initialization to handle deep stacking
    • Merged attention: Self and cross attention side by side
  • Expressibility-Scalability: Improves by 2-3 BLEU
    • Target language aware layer normalization
    • Language aware linear transformation between Encoder and Decoder
  • Random online back-translation: Boosts zero-shot quality by up to 10 BLEU
    • Back-translate monolingual corpora and use it to train

47

48 of 156

Adapting Previously Trained Models

  • Feed forward (adapter) layers to refine outputs (see the sketch below)
    • Bapna et al. 2019
    • Partial solution to bottleneck
    • Language pair specific
      • Zero-shot performance?
  • 13.5% larger models
  • Improved high-resource pair performance
    • Low-resource performance kept
  • Questions:
    • Multiple adapters?
    • Dynamic adapter size?
    • Adapter-Base Model ratio in massively multilingual situation?
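A minimal sketch of such an adapter block (illustrative PyTorch; the dimensions are assumptions, not Bapna et al.'s configuration):

```python
# Minimal sketch: a language-pair-specific adapter inserted on top of a frozen
# pre-trained layer: layer-norm, down-projection, non-linearity, up-projection,
# and a residual connection. Only these few parameters are trained per pair.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden):
        # Residual adapter around the frozen base-model representation.
        return hidden + self.up(torch.relu(self.down(self.norm(hidden))))

adapter = Adapter()
hidden_states = torch.randn(2, 10, 512)     # stand-in for frozen base-model outputs
print(adapter(hidden_states).shape)         # torch.Size([2, 10, 512])
```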

48

49 of 156

Pushing Limits: Mixtures Of Experts (Gshard; Lepikhin et al. 2020)

49

Replace the feed-forward layer with a mixture-of-experts layer, sandwiched between attention layers

Explosive growth in parameters (600B) as the number of experts grows (2048).

Use model sharding and dynamic routing with large number of accelerators (2048 TPUs).

50 of 156

Engineering + Research = Ultimate Solution?

  • Advances in gating, tensor routing and load balancing to increase experts
    • Language agnostic work
  • Shallower models with large number of experts are recommended
    • Avoid deep modeling issues partially or completely
    • Save 10 times or more training time compared to inefficient 96L models
    • ~7 BLEU improvement globally over deepest model

50

51 of 156

Bottlenecks: Hardware, Cost and Energy Efficiency

  • Needs heavy hardware ($$$)
  • Most existing work by Google with TPUs
    • TPUs are faster than GPUs
    • Typical to use up to 100s of TPUs
    • Equivalent GPU setting not yet known
  • May not be possible to do this in a university setting :(((((

51

52 of 156

Bottlenecks: Hardware, Cost and Energy Efficiency

  • More devices and data = More training time (effective) = $$$$
    • Are the BLEU gains worth the mammoth models?
    • Do BLEU gains translate into human evaluation gains?
    • How do we deploy these models?
      • One model on a 2048 TPU cluster?
    • How about continual learning?
  • Future: Bigger models or better language aware neural modeling?
    • Model size (representational capacity) is no longer the bottleneck

52

53 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

53

54 of 156

Language Divergence

  • Vocabularies
    • Sufficient and fair representation for languages
  • Lessons from visualization and language families
    • Understanding what is learned where
  • Token based control
    • Augmenting input with features

54

55 of 156

On Vocabulary in MNMT

  • Core point: Fair vocabulary representation for all languages
  • Difficulty: Skew in data and hence word vocabulary
  • Key Points:
    • Use large monolingual corpora
    • Oversample smaller vocabularies
    • Temperature sampling
      • p_{V_L}^(1/T), where T is the sampling temperature and p_{V_L} is the probability of vocabulary item V for language L
    • Adjust vocabulary sizes (32K to 128K sub-word vocabularies work well)
    • Trade-offs
      • Sequence length, softmax computation time, enough sub-words per language

55

56 of 156

Key Observations In Practice

  • Larger vocabularies do not always bring proportionate improvements
    • Larger vocabularies = Shorter sequences
    • Larger vocabularies = Slower softmaxes
  • Do your cost-benefit analysis!
  • (Almost) equal representation is important
  • Moderate temperature sampling with fixed vocabulary size is crucial
  • Is T=5 a golden rule?
    • Arivazhagan et al.; Bapna et al. 2019

56

57 of 156

Visualization Of MNMT Representations

  • SVCCA similarity between representations (Kudugunta et al. 2019)
    • Also see Dabre et al. and Johnson et al. 2017
  • Encoder representations cluster sentences into language families
    • Regardless of script sharing
    • Script sharing for stronger clustering
  • High resource languages cause partition
    • Low-resource languages ride the wave
  • Evidence of representation invariance when fine-tuning
    • Explains poor zero shot quality between distant pairs

57

58 of 156

Representation Similarity Evolution

  • Representation similarity varies with depth
    • Shallower layers cluster by script
    • Deeper layers cluster by family
  • Case study: Turkic and Slavic
    • Some use Roman script
      • Turkish, Uzbek and Azerbaijani
    • Some use Cyrillic script
      • Kyrgyz and Kazakh
    • First encoder layer separates by script and last layer closes the gap
      • Distinguishes them from Slavic languages
      • Slavic languages use Roman or Cyrillic scripts

58

59 of 156

Representation Evolution With Depth

For many-to-English: encoder representations converge and decoder representations diverge.
For English-to-many: encoder representations of English diverge based on the target language.
Question: Is divergence a good thing? Does it cause a learning overhead? Should it?

59

60 of 156

Empirically Determined Language Families

  • Train many-to-many model with language tokens
  • Hierarchical clustering of tokens
    • Set number of clusters by elbow-sampling
  • Tan et al. 2019
  • Also see Oncevay et al. 2020

60

Predetermined language families

Empirically determined language families via embedding clustering

61 of 156

Is There An Optimal Number of Languages?

  • Does empirical clustering help? (Upper table)
    • Mostly yes
    • Random clustering gives poorer results
    • Predetermined clustering is equally good
  • Language family specific models (Bottom table)
    • Universal model < Individual models
    • Family specific model > Individual models
    • Related to observations by Dabre et al. 2017 and 2018
  • Next steps
    • Family specific adaptor layers (Bapna et al. 2019)
    • Family specific vocabulary and decoder separation
    • Behavior in extremely low-resource settings (<20k pairs; Dabre et al, 2019)

61

62 of 156

On Language Tags: Embeddings vs Features

62

Johnson et al. 2017: prime the encoder using a <2xx> token prepended to the source; <2xx> is a single token with its own embedding (black-box approach).

  Example: "<2ja> I am a boy" → "私は男の子です" (train NMT)

Ha et al. 2016: distinguish between shared vocabulary units; the encoder is primed with source as well as target language information.

  Example: "ja I@en am@en a@en boy@en ja" → "私は男の子です" (train NMT)

Blackwood et al. 2019: use language tokens at the beginning and end of the source.

63 of 156

On Language Tags: Embeddings vs Features

63

Ha et al. 2017 and Hokamp et al. 2019: keep word embeddings independent of the task (target language) via features.

[Figure: train factored NMT on "I am a boy" → "私は男の子です"; the word embedding E_word(I) is combined with a language feature embedding E_feature(en/ja) by concatenation or addition before the encoder.]

64 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

64

65 of 156

Training Protocols

  • MNMT training fundamentals
  • Training schedules
    • Batching strategies
    • Importance of sampling
  • Leveraging bilingual models
    • Distillation
  • Model expansion and incremental learning

65

66 of 156

On MNMT Training

  • Fundamentally the same as standard NMT: minimizing the negative log-likelihood, summed over all language pairs

Where the individual language-pair negative log-likelihood is the usual cross-entropy over its parallel corpus (a sketch of the objective is given below)

  • Challenges:
    • Good training schedule
    • Language equality
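A standard way to write this objective (a reconstruction consistent with the description above; the notation is mine, not copied from the slides):

```latex
% P: set of language pairs; C_{s,t}: parallel corpus for pair (s,t); theta: shared model parameters.
\begin{align}
  \mathcal{L}(\theta) &= \sum_{(s,t) \in \mathcal{P}} \mathcal{L}_{s,t}(\theta), \\
  \mathcal{L}_{s,t}(\theta) &= - \sum_{(\mathbf{x},\mathbf{y}) \in C_{s,t}} \log p(\mathbf{y} \mid \mathbf{x}; \theta).
\end{align}
```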

66

67 of 156

Training Schedule: Joint Training

  • One pair at a time (Firat et al. 2016, Dong et al. 2015)
    • Cycle through corpora (L1-L2 → L3-L4 → L5-L6 → ... → L1-L2)
  • Useful for models with separate encoders and/or decoders
  • Potential forgetting of language pair information in a previous batch
    • Catastrophic forgetting!

67

68 of 156

Training Schedule: Joint Training

  • Mixed language pairs batch (Johnson et al 2017)
    • Mix all corpora, shuffle and then choose batches
  • Useful for fully shared models
  • For models with separate language encoders/decoders
    • Shard batch and feed to appropriate components

68

69 of 156

Training Schedule: Addressing Language Equality

  • Source of inequality: Corpora size skew
  • Solutions: Oversampling smaller corpora
  • Oversampling before training or during training?
    • Matter of implementation choice
    • Oversampling prior to training creates large duplicated corpora

69

70 of 156

Importance Of Temperature Based Sampling (Arivazhagan et al. 2019)

  • Naive approaches:
    • Ignore corpora size distributions
    • Sample from all corpora equally
  • New approach: Temperature based sampling, p_L^(1/T)
    • Where p_L is the probability of sampling a sentence from the corpus of language pair L
    • T is the sampling temperature
    • Strongly benefits low-resource pairs (a sketch follows below)
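A minimal sketch of the smoothing above (illustrative; the corpus sizes are made up, and T=5 is the value discussed later in the deck):

```python
# Minimal sketch: temperature-based sampling over corpora.
# T=1 follows corpus sizes; large T approaches uniform sampling, boosting
# low-resource pairs.

def sampling_probs(corpus_sizes, temperature=5.0):
    total = sum(corpus_sizes.values())
    # p_L: empirical probability of drawing a sentence from language pair L.
    p = {pair: size / total for pair, size in corpus_sizes.items()}
    smoothed = {pair: prob ** (1.0 / temperature) for pair, prob in p.items()}
    z = sum(smoothed.values())
    return {pair: v / z for pair, v in smoothed.items()}

corpus_sizes = {"en-fr": 40_000_000, "en-hi": 1_500_000, "en-ne": 50_000}
for t in (1.0, 5.0, 100.0):
    print(t, sampling_probs(corpus_sizes, t))
```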

70

71 of 156

Leveraging Bilingual Models

  • Training from scratch is challenging
    • Multitude of pairs
    • Complexity of task
    • Interference of languages
  • Solution: Leverage bilingual models or smaller multilingual models
    • Key principle: Transfer learning via sequence knowledge distillation (Kim et al. 2016)

71

[Figure: sequence-level knowledge distillation (Kim et al. 2016): train a wide and/or deep teacher model on L1→L2, decode the L1 training data with it to obtain distillation data L1→L'2, then train a narrow and/or shallow student model on that data.]

72 of 156

Distillation For MNMT Training (Tan et al. 2019)

72

[Figure: for each pair L1→L2, L3→L4, ..., Lm→Ln, a bilingual teacher decodes its own source-side training data to produce distilled targets (L'2, L'4, ..., L'n) and smoothed labels; the pooled smoothed data and labels are used to train a single MNMT student model. A pipeline sketch follows below.]
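A minimal end-to-end sketch of this distillation pipeline (illustrative; `translate` stands in for beam-search decoding with a trained bilingual teacher, and the data are toy strings):

```python
# Minimal sketch of sequence-level knowledge distillation for MNMT: each
# bilingual teacher decodes its own training sources, and the resulting
# (source, teacher-translation) pairs are pooled to train one multilingual student.

def distill_corpus(teachers, corpora):
    """teachers: {(src, tgt): translate_fn}; corpora: {(src, tgt): [source sentences]}."""
    distilled = []
    for pair, sources in corpora.items():
        translate = teachers[pair]
        for x in sources:
            distilled.append((pair, x, translate(x)))   # teacher output as the training target
    return distilled

# Toy stand-ins for trained bilingual teacher models.
teachers = {
    ("en", "fr"): lambda s: "[fr] " + s,
    ("en", "de"): lambda s: "[de] " + s,
}
corpora = {("en", "fr"): ["hello world"], ("en", "de"): ["good morning"]}
print(distill_corpus(teachers, corpora))
```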

73 of 156

Incremental Training

  • Objective: Expand the capacity of existing models with maximum reusability
  • Promising approaches
    • Vocabulary expansion
      • Surafel et al. 2018
    • Gradual capacity expansion
      • Escolano et al. 2019
    • Adaptor layers (experts) on top of pre-trained models
      • Bapna et al. 2019 (already discussed)

73

74 of 156

Incorporating New Languages (Surafel et al. 2018)

  • Language specific transfer
    • Replace vocabulary
    • Fine-tune on new data
  • Similar to Zoph et al. 2016
  • Expanding to new languages
    • Expand vocabulary
    • Fine-tune on old+new data
  • Increase computational capacity?

74

75 of 156

Capacity Expansion Of Existing Models (Escolano et al. 2019)

  • Add new components while freezing existing components
    • Lightweight training BUT
    • Previous components may not be aware of new languages
      • Poor transfer learning
    • Potential zero-shot learning
      • Will it work for distant languages?

  • Lessons from Sachan et al. 2018; Firat et al. 2016a/b; Bapna et al. 2019
    • Deepen encoders and decoders
    • Only train new components with old and/or new data
      • Vocabulary expansion by Surafel et al. 2018 will help

75

76 of 156

Correlated Topics Not Well Addressed Yet

  • Diverse language learning rates
    • Language pairs are learned at different rates
  • Jean et al. 2019 set task weights using model performance
    • Weights decide sampling (task importance)
  • Lessons from Wang et al. 2018
    • Adding or removing instances from training set

76

77 of 156

Correlated Topics Not Well Addressed Yet

  • Catastrophic forgetting
    • Thompson et al. 2019 on domain adaptation
    • Addressing forgetting for incremental learning?
  • MNMT model convergence
    • Currently report average performance over all pairs
      • I’m looking at you GShard
    • Ignoring individual pair’s performance is unwise
    • Weighted performance metric to the rescue?
    • Better evaluation metrics for MNMT settings?

77

78 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

78

79 of 156

Self Introduction (Anoop Kunchukuttan)

  • Experience
    • 2018-present: Senior Applied Researcher, MT Group, Microsoft India
    • 2012-2017: Ph.D. scholar at IIT Bombay, India
    • 2008-2011: ML/NLP Lead, Life Sciences Group, Persistent Systems
    • 2006-2008: M.Tech. IIT Bombay, India
  • Research Interests
    • Multilingual NLP
    • Machine Translation, Transliteration
    • Representation Learning for NLP
    • Indian language NLP

79

80 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

80

81 of 156

81

Translation for low-resource languages

A large skew in parallel corpus availability

Long tail of low-resource languages

Difficult to obtain corpora for many languages

Can high-resource languages help low-resource languages?

(Arivazhagan et al., 2019b)

82 of 156

82

Transfer learning (Pan & Yang, 2010): storing knowledge gained while solving one problem and applying it to a different but related problem.

[Figure: a parent model trained on a high-resource pair corpus passes knowledge (concepts, domain knowledge, grammar, source-to-target transformation) to a child model trained on a low-resource pair corpus.]

Transfer learning scenarios: many-to-one translation (M2O) and one-to-many translation (O2M)

83 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

83

84 of 156

84

Joint Training

(Ha et al., 2016; Johnson et al., 2017)

Oversample the child / undersample the parent corpus to balance training

Low-resource language performance:

Many-to-one direction ⇒ major gains

One-to-many direction ⇒ minor gains

85 of 156

85

Fine-tuning

  • Fine-tuning vs. joint training
    • Fine-tuning is better in the O2M setting, and vice-versa (joint training is better in M2O)
  • While fine-tuning, a specific set of parameters can be tuned
    • Many-to-one ⇒ encoder bottom layers
    • One-to-many ⇒ decoder top layers

(Zoph et al., 2016; Tan et al., 2019b)

86 of 156

86

Small low-resource corpus ⇒ overfitting might occur ⇒ use mixed fine-tuning

[Figure: mixed fine-tuning]

(Dabre et al., 2019)

87 of 156

87

Transfer from multiple parents

Pre-train a multilingual NMT model on a representative set of high-resource languages

Useful for rapid-adaptation to new languages

(Neubig and Hu, 2018)

88 of 156

88

Transfer to multiple children

Useful for O2M setting, where single-stage fine-tuning is not very beneficial (10% improvement in BLEU score)

Multi-linguality in the mixed-fine tune stage aids translation

(Dabre et al., 2019)

89 of 156

89

What is the objective of the parent-model training?

  • Optimize performance on parent tasks?
  • Optimize performance on child tasks?
  • Enable few-shot learning?

90 of 156

90

Meta-learning

Learning to learn

Learn an initialization from which only a few examples are required to learn the child-task

(Finn et al., 2017)

91 of 156

91

Model-Agnostic Meta-Learning for MNMT (Gu et al., 2018b):

  1. Sample language pairs LP_1 ... LP_i
  2. For each pair, sample training examples and simulate a training step: θ → θ_i
  3. Sample test examples and compute the test error Err_i using θ_i
  4. Average the test errors across language pairs and compute the meta-gradient
  5. Update the original parameters: θ = θ_new

(a toy sketch follows below)

Outperforms multilingual fine-tuning strategy on unseen pairs

Requires far fewer adaptation steps for comparable performance
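A toy first-order sketch of the procedure above (illustrative only: a linear-regression stand-in for "language pairs" and first-order meta-gradients, not Gu et al.'s full NMT setup):

```python
# Minimal first-order MAML sketch: learn an initialization theta from which a
# few inner steps adapt well to each simulated task ("language pair").
import torch

def task_batch(weight, n=16):
    """Simulate one 'language pair': data generated by a task-specific linear map."""
    x = torch.randn(n, 4)
    return x, x @ weight

theta = torch.zeros(4, requires_grad=True)           # shared initialization
meta_opt = torch.optim.SGD([theta], lr=0.1)
inner_lr = 0.05
tasks = [torch.randn(4) for _ in range(8)]           # one random regression task per "pair"

for step in range(100):
    meta_opt.zero_grad()
    for w in tasks:
        # Inner step: simulate training on a few examples of this task.
        x_tr, y_tr = task_batch(w)
        loss_tr = ((x_tr @ theta - y_tr) ** 2).mean()
        grad = torch.autograd.grad(loss_tr, theta)[0]
        theta_i = theta - inner_lr * grad            # task-adapted parameters
        # Outer step: evaluate adapted parameters on held-out examples of the task.
        x_te, y_te = task_batch(w)
        loss_te = ((x_te @ theta_i - y_te) ** 2).mean()
        loss_te.backward()                           # accumulates the meta-gradient into theta.grad
    meta_opt.step()                                  # theta <- theta_new
print(theta)
```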

92 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

92

93 of 156

93

Lexical Transfer

How do we initialize child token embeddings?

Initialize child token embeddings prior to fine-tuning (Zoph et al., 2016):

  • Random assignment of parent embeddings to child vocabulary items
  • Dictionary initialization (using a bilingual dictionary)

Dictionary initialization is faster than random assignment, but random assignment eventually achieves comparable performance.

94 of 156

94

Use bilingual embeddings to bring parent and child embeddings into a common space (Gu et al., 2018a; Kim et al., 2019a):

  • Map parent and child embeddings to a common space
  • Map child embeddings to parent embeddings

Linear mapping functions can be learnt using small bilingual dictionaries.

[Figure: original child embeddings are transformed into modified child embeddings that lie close to the parent embedding space.]

Significant improvements over random assignment of embeddings (a mapping sketch follows below)
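A minimal sketch of learning such a mapping (illustrative; an orthogonal Procrustes solution on synthetic vectors, not Kim et al.'s or Gu et al.'s exact method):

```python
# Minimal sketch: learn a linear map from child-language embeddings to the
# parent embedding space from a small bilingual dictionary, then use the mapped
# child embeddings to initialize the child model before fine-tuning.
import numpy as np

def learn_orthogonal_map(child_vecs, parent_vecs):
    """Orthogonal Procrustes: W = argmin ||child @ W - parent|| with W orthogonal."""
    u, _, vt = np.linalg.svd(child_vecs.T @ parent_vecs)
    return u @ vt

rng = np.random.default_rng(0)
d = 64
true_rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
parent_dict_vecs = rng.normal(size=(500, d))              # parent-side embeddings of dictionary entries
child_dict_vecs = parent_dict_vecs @ true_rotation.T      # child side, here a synthetic rotation

W = learn_orthogonal_map(child_dict_vecs, parent_dict_vecs)
mapped = child_dict_vecs @ W
print(np.allclose(mapped, parent_dict_vecs, atol=1e-6))   # True: the mapping is recovered
```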

95 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

95

96 of 156

96

Introduce noise in parent-source sentences

Prevents over-optimization to parent-source language

Noising schemes

Word insertion

Word deletion

Word Swapping

(Kim et al., 2019a)

Simple methods that give modest improvements over baseline fine-tuning

What if parent and child have different word orders?

97 of 156

97

Reorder parent-source sentence to match child-source sentence

Ensures better alignment of encoder contextual embeddings

(Murthy et al., 2019)

Significant improvements over baseline finetuning, but needs a parser and re-ordering system

98 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

98

99 of 156

99

100 of 156

100

Key Similarities between related languages

(Kunchukuttan & Bhattacharyya., 2020)

101 of 156

101

(Kudugunta et al., 2019)

Transfer Learning works best for related languages

Encoder Representations cluster by language family

(Zoph et al., 2016; Dabre et al, 2017b)

Contact relationship is also captured

Related languages using different scripts also cluster together

102 of 156

102

Utilizing lexical similarity

Subword-level vocabulary improves transfer

Improves parent-child vocabulary overlap encouraging better transfer

child     subword   BLEU score            parent-child overlap (%)
                    baseline   transfer   train    dev
tur-eng   word      8.1        8.5        3.9      3.6
tur-eng   BPE       12.4       13.2       58.8     25.0
uyg-eng   word      8.5        10.6       0.5      1.7
uyg-eng   BPE       11.1       15.4       57.2     48.5

(uzb-eng is the parent language pair)

(Nguyen and Chiang, 2017)

103 of 156

103

Transfer between related languages using different scripts works well

  • Similar scripts ⇒ script conversion: Slavic (Cyrillic & Roman), Indic (Brahmi) (Lee et al., 2016; Dabre et al., 2018; Goyal et al., 2020)
  • Very different scripts ⇒ romanization (unconv, uroman): Turkic; Hindi, Urdu (Gheini & May, 2019; Amrhein & Sennrich, 2020)

Transfer can also be done between languages related by contact (Goyal et al. 2020)

Dravidian and Indo-Aryan languages form a linguistic area in the Indian subcontinent (Emeneau, 1956)

Transfer works without script conversion → but script conversion provides improvements

104 of 156

104

Similar language regularization

Very small low resource language ⇒ Overfitting on finetuning

Pre-train model on multiple languages

Concatenate HRL and LRL corpus

Fine-tune jointly

Joint fine-tuning outperforms just target language fine-tuning

Pre-training with multiple languages is better than single language

(Neubig & Hu, 2018; Chaudhary et al., 2019)

A similar idea is also used for knowledge distillation

(Dabre & Fujita, 2020)

105 of 156

105

Reorder parent-source sentence to match child-source sentence

Significant improvements over baseline finetuning, but needs a parser and re-ordering system

(Murthy et al., 2019)

Reordering rules can be reused if the child-source languages have the same word order

Utilizing Syntactic Similarity

106 of 156

106

Parent Data Selection

Which examples in the parent language-pair are most helpful for transfer?

(Wang et al., 2019)

Let us look at the case of many-to-one translation.

[Figure: several parent-source sentences s_1, s_2, ..., s_5 paired with a target sentence t]

(s_i, t) is a parent sentence pair from the high-resource pair (H, E)

Score s_i by the probability that it belongs to the low-resource source language L (a scoring sketch follows below)

Scoring Functions

  • score(s_i, L) ∝ vocabulary overlap of s_i with L
  • score(s_i, L) ∝ P_{LM_L}(s_i), the probability of s_i under a language model of L

Sample examples using this score

  • Can be extended to multiple parent languages
  • Can be extended to language-level similarity score
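A minimal sketch of the vocabulary-overlap scoring above (illustrative; the tokenization and the toy data are my own):

```python
# Minimal sketch of parent data selection by vocabulary overlap: score each
# parent-source sentence by its token overlap with the low-resource language,
# then keep the highest-scoring parent sentence pairs for transfer.

def overlap_score(sentence: str, lrl_vocab: set) -> float:
    tokens = sentence.split()
    return sum(tok in lrl_vocab for tok in tokens) / max(len(tokens), 1)

def select_parent_data(parent_pairs, lrl_vocab, keep_ratio=0.5):
    scored = sorted(parent_pairs, key=lambda p: overlap_score(p[0], lrl_vocab), reverse=True)
    return scored[: int(len(scored) * keep_ratio)]

# Toy example: the "low-resource language" shares some tokens with the parent source side.
lrl_vocab = {"ndi", "lo", "umfana"}
parent_pairs = [("ngi ndi lo umfana", "I am a boy"), ("completely unrelated words", "x y z")]
print(select_parent_data(parent_pairs, lrl_vocab, keep_ratio=0.5))
```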

107 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

107

108 of 156

108

[Figure: translating Hindi (hi) to Malayalam (ml) without hi-ml parallel data, e.g. "aaja bahuta ThaNDa hai" (hi) / "It is very cold today" (en) / "inn vaLar.e taNuppAN" (ml).

Pivot translation: hi → en with a hi-en model, then en → ml with an en-ml model; multiple decoding steps, multiple translation systems.

Zero-shot translation: hi → ml directly with a many-to-many model; single decoding step, single translation system (Johnson et al., 2016).

Zero-resource translation: the many-to-many model is trained with the unseen pair in mind (Firat et al., 2016b).]

109 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

109

110 of 156

110

[Figure: bilingual pivot vs. multilingual pivot for hi → ml via en.
Bilingual pivot: "aaja bahuta ThaNDa hai" (hi) → "It is very cold today" (en) with a hi-en model, then → "inn vaLar.e taNuppAN" (ml) with an en-ml model.
Multilingual pivot: both decoding steps are performed with a single many-to-many model.]

Multilingual pivot generally outperforms bilingual pivot (Firat et al., 2016)

Pivot translation is a strong baseline (Johnson et al., 2016)

Limitations:

  • Cascading errors (can be reduced by using n-best translations)
  • Decode time (a function of path length)

111 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

111

112 of 156

112

Naive zero-shot performance significantly lags behind pivot translation

              de→fr    fr→de
Pivot         19.71    24.33
Zero-shot     19.22    21.63

Output is generated in the wrong language

Code-mixing is rare

Once a wrong language token is generated, all subsequent tokens are generated in that language

  • Spurious correlation between source input & target language
  • Model always associates English output with any input
  • Copying behaviour

              de→fr    fr→de
Pivot         26.25    20.18
Zero-shot     16.80    12.03

Performance gap reduces when evaluation is restricted to correct-language output

(Results from Arivazhagan et al., 2019a)

(Arivazhagan et al., 2019a; Gu et al., 2019)

113 of 156

113

Restrict the decoder to output only target-language vocabulary items (Ha et al., 2017)

Prevents generation of the wrong language; performance still lags the pivot baseline

                 German-Dutch    German-Romanian
baseline         14.95           10.83
+vocab filter    16.02           11.00

Language-specific subword model learning (Rios et al., 2020)

Vocabulary construction and control reduces copying behaviour and bias towards English output

Joint vocab                 15.4
Language-specific vocab     20.5
  • overlapping model vocab

114 of 156

114

Minimize divergence between encoder representations

Loss = CE + Ω(z_x, z_y)

Ω(z_x, z_y): a distance between the encoder representations of the source (x) and the target (y); see the sketch below

Supervised Objectives

  • Cosine distance (Arivazhagan et al., 2019a)
  • Euclidean distance (Pham et al., 2019)
  • Correlation distance (Saha et al., 2016)

Unsupervised Distribution Matching

(Arivazhagan et al., 2019a)

Use a domain-adversarial loss

Competitive with pivot and improves over baseline MNMT
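A minimal sketch of such an auxiliary alignment term (illustrative PyTorch; the mean pooling and cosine form follow the supervised-objective bullet above, everything else is an assumption):

```python
# Minimal sketch: add a representation-alignment term to the usual cross-entropy
# loss, here the cosine distance between mean-pooled encoder states of a source
# sentence and its target-side equivalent.
import torch
import torch.nn.functional as F

def alignment_loss(enc_src, enc_tgt, src_mask, tgt_mask):
    """enc_*: (batch, seq, d); *_mask: (batch, seq) with 1 for real tokens."""
    def mean_pool(h, m):
        m = m.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)
    z_src, z_tgt = mean_pool(enc_src, src_mask), mean_pool(enc_tgt, tgt_mask)
    return (1.0 - F.cosine_similarity(z_src, z_tgt, dim=-1)).mean()

# total_loss = cross_entropy + lambda_align * alignment_loss(...)
enc_src, enc_tgt = torch.randn(2, 7, 512), torch.randn(2, 9, 512)
src_mask, tgt_mask = torch.ones(2, 7), torch.ones(2, 9)
print(alignment_loss(enc_src, enc_tgt, src_mask, tgt_mask))
```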

115 of 156

115

Avoid pooling to generate encoder representations

(Pham et al., 2019)

Encoder output is variable length

Decoder receives fixed input at every timestep

Minimize divergence between auto-encoded and true target at different points in the decoder

Attention Forcing at bottom decoder layer

Improvements over directly minimizing encoder divergence

116 of 156

116

Encourage output agreement

Encourage equivalent sentence in two languages to generate similar output in an auxiliary language

Decode to language L using both inputs s and t

Score output of one input using other input

Add a loss term to force output agreement

(Al-Shedivat and Parikh, 2019)

Competitive with pivot and no-loss in supervised directions

117 of 156

117

Effect of number of languages & corpus size

Zeroshot performance improves with the number of languages

(Aharoni et al., 2019; Arivazhagan et al., 2019b)

(Aharoni et al., 2019)

(Arivazhagan et al., 2019b)

Zero-shot translation may work well only when the multilingual parallel corpora are large

(Mattoni et al., 2017; Lakew et al., 2017)

118 of 156

118

Can monolingual pre-training help zero-shot translation?

[Figure: mBART pipeline (Liu et al., 2020). mBART training: pre-train on monolingual corpora of many languages (hi, ne, en, ...). Fine-tune with a related parallel corpus: the hi-en parallel corpus yields a hi-en translation model. Decode for a related language: a ne test sentence is translated into en output.]

119 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

119

120 of 156

120

Zero-resource Translation

Zero-shot and zero-resource translation: no parallel corpus between the unseen languages

Zero-shot → no specific training for unseen language pairs

Zero-resource → training takes into account the unseen language pairs of interest,

e.g. using a synthetic parallel corpus

Zero-resource NMT can be used to tune the NMT model for some unseen language pairs of interest

(Firat et al., 2016b)

121 of 156

121

Creating Synthetic Parallel Corpus

Expose M2M model to zeroshot directions

Create synthetic-source to real-target parallel data: { (X'_1, Y_1), (X'_2, Y_2), (X'_3, Y_3), (X'_4, Y_4) }

  • Via back-translation using the M2M model: Y → E → X'
  • Via a pivot language: Y' ← E → X'

(Firat et al., 2016b; Lakew et al., 2017; Gu et al., 2019; Currey & Heafield, 2019)

122 of 156

122

Iterative Refinement

Backtranslation quality depends on quality of underlying translation models

[Figure: iterative refinement loop (Lakew et al., 2017): train an MNMT model on the original parallel data, back-translate to produce backtranslated data, augment the original data with it, and retrain.]

Iterative reinforcement learning approaches which reward original translation directions based on language modelling and reconstruction losses in zeroshot directions (Sestorain et al., 2018)

123 of 156

123

Scaling BT to multiple translation directions

Expensive to generate BT data for O(n^2) language pairs

Random Online backtranslation

(Zhang et al., 2020)

For every real sentence pair (x, y) in language pair (s, t):

  • Sample a new source language s' for target t
  • Generate a back-translated pair (x', y) for the pair (s', t)
  • Add the back-translated pairs to the minibatch (see the sketch below)

Only doubles the effective corpus size

Results approach pivot baseline
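A minimal sketch of the sampling loop above (illustrative; `backtranslate` stands in for decoding with the current multilingual model):

```python
# Minimal sketch of random online back-translation (ROBT): for each real pair
# (x, y) with target language t, sample a random language s', back-translate y
# into s' with the current model, and add the synthetic pair (x', y) for the
# direction s'->t to the same minibatch.
import random

def robt_minibatch(batch, languages, backtranslate):
    """batch: list of (x, y, src_lang, tgt_lang); backtranslate(y, tgt_lang, new_src_lang) -> x'."""
    augmented = list(batch)
    for x, y, s, t in batch:
        s_new = random.choice([l for l in languages if l != t])
        x_synth = backtranslate(y, t, s_new)        # decode y into s_new with the current model
        augmented.append((x_synth, y, s_new, t))    # only doubles the effective batch size
    return augmented

# Toy stand-in for decoding with the current model.
fake_bt = lambda y, t, s_new: f"<{s_new}> {y}"
batch = [("I am a boy", "मी मुलगा आहे", "en", "mr")]
print(robt_minibatch(batch, ["en", "mr", "it", "fr"], fake_bt))
```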

124 of 156

124

Teacher-student training (Chen et al., 2017)

[Figure: source (S), pivot (P), target (T); minimize the divergence between the student's S→T distribution and the teacher's P→T distribution.]

Assumption: for a source-pivot parallel pair, the distributions P(target | source) and P(target | pivot) are similar.

125 of 156

125

Given: a source-pivot and a pivot-target parallel corpus

[Figure: the pivot-target (teacher) model is trained with MLE on the pivot-target parallel corpus; the source-target (student) model is trained via teacher-student training using the source-pivot parallel corpus.]

Teacher-student training approaches:

  • Sentence-level matching
  • Token-level matching

126 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

126

127 of 156

127

Can’t we simply add direct parallel corpora between non-English languages?

How do we acquire such parallel data?

How do we address data imbalance?

Single-bridge vs. Multi-bridge systems

128 of 156

128

Mine parallel corpus from monolingual corpora

Expensive to mine from all 100 x 99 pairs

Which are the most promising language pairs to mine from?

Cluster languages and select bridge languages

(Fan et al., 2020)

129 of 156

129

Extraction from English-centric parallel corpora

(Freitag & Firat, 2020; Rios et al., 2020)

130 of 156

130

Sampling strategies used for English-centric datasets have limitations

  • Large sample space → quadratic number of pairs to sample
  • Pairwise sampling will be biased in favour of instances containing English

Just ~15% sentence pairs exclusively non-English

Sampling independently on source and target marginal is also biased.

Solution 1: (Freitag & Firat, 2020)

  1. Temperature-based sampling of target language first using marginal distribution of target languages
  2. Sample source language uniformly

Solution 2: (Fan et al., 2020)

Sample from probability matrix while ensuring source and target marginals follow a temperature based schedule

Data Sampling Strategies

These approaches improve over standard temperature-based sampling for non-English directions

131 of 156

131

  • Significant improvement in translation quality over pivot baseline for:
    • Newly trained directions
    • Zero-shot directions
  • Multi-bridge translation does not adversely impact English-centric directions
  • Multi-bridge translation and synthetic data augmentation provide complementary benefits

Major Results

                        Fan et al., 2020          Rios et al., 2020
Language Pairs →        New Train   Unseen        New Train   Unseen (zero-shot)
Single-bridge           5.4         7.6           20.0        20.9
Single-bridge pivot     9.8         12.4          22.9        23.7
Multi-bridge            12.3        18.5          25.1        24.0
  + synth-data                                                25.2

Word of caution: Human evaluation gains are not as significant (Fan et al., 2020)

132 of 156

Section Summary

MNMT has helped make significant advances in low-resource MT

  • Novel methods for transfer learning and zeroshot translation

Transfer Learning

  • Optimize the right objective for improved transfer
  • Language-relatedness plays a key role in successful transfer
  • Transfer works better in M2O setting than O2M setting
  • Lexical transfer is easier to achieve than syntactic transfer

Translation between unseen languages

  • Pivot translation is a strong baseline
  • Zeroshot ⇒ spurious correlation between input representation & output language
    • Reducing divergence between internal representations of different languages
    • Multi-bridge systems
  • Zero-resource translation can use synthetic data to reduce spurious correlations

132

133 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

133

134 of 156

Why Multi-source MT?

134

If a source sentence has already been translated into multiple languages, then these sentences can be used together to improve the translation into the target language. Such multi-parallel data arises, for example, in the EU and the UN.

Figures from (Nishimura et al., 2018)

135 of 156

Multi-Source Available: Multi-source Encoder (Zoph et al., 2016)

135

136 of 156

Results of Multi-source Encoder (Zoph et al., 2016)

136

137 of 156

Multi-Source Available: Ensembling (Garmash et al., 2016)

137

138 of 156

Results of Ensembling (Garmash et al., 2016)

138

139 of 156

Multi-Source Available: Concatenation (Dabre et al., 2017)

139

[Figure: pipeline: multilingual (N-way) corpus → word segmentation (BPE) → concatenate source sentences → train NMT model (early stop on best dev BLEU) → model.

N-way corpus:
  Hello ||| Bonjour ||| नमस्कर ||| Kamusta ||| Hallo ||| こんにちは
  I ||| Je ||| मी ||| ako ||| ech ||| 私

Multi-source corpus (concatenated sources ||| target):
  Hello Bonjour नमस्कर Kamusta Hallo ||| こんにちは
  I Je मी ako ech ||| 私]

140 of 156

Results of Concatenation (Dabre et al., 2017)

140

* Concatenation (bold)/Ensembling/Multi-source Encoder

141 of 156

Missing Source Sentences (Nishimura et al., 2018)

141

Scenario:

Methods:

142 of 156

Results of Missing Source Sentences (Nishimura et al., 2018)

142

143 of 156

Summary of Multi-source Approaches

143

144 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

144

145 of 156

Multiway Datasets

145

Languages              Corpora
European languages     Europarl, JRC-Acquis, DGT-Acquis, DGT-TM, ECDC-TM, EAC-TM, etc.
Asian languages        WAT shared tasks, etc.
Indic languages        CVIT-PIB/MKB, PMIndia, IndoWordNet, etc.
Massive                WikiMatrix, JW300, etc.
Others                 UN, TED, OpenSubtitles, etc.

Refer to catalogs like OPUS and the IndicNLP catalog for comprehensive listings of parallel corpora resources.

146 of 156

Low or Zero-Resource Multiway Datasets

146

Corpus                    Domain       Languages
FLORES                    Wikipedia    English, Nepali, Sinhala
XNLI                      Caption      15 languages
CVIT-Mann ki Baat/PIB     General      10 Indian languages
Indic parallel corpus     Wikipedia    6 Indian languages
WMT shared tasks          Web          German, Upper Sorbian

+All the multiway datasets listed in the previous slide can be used for testing

147 of 156

Multi-source Datasets

147

Corpus      N-way    Domain           Languages
Europarl    11       Politics         European languages
TED         5        Spoken           French, German, Czech, Arabic and English
UN          6        Politics         Arabic, Chinese, English, French, Russian and Spanish
ILCI        11       Tourism/health   Indian languages + English
ALT         9        News             South-East Asian languages + English, Japanese
Bible       1,000    Religion         Most major languages

148 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

148

149 of 156

Exploring Pre-trained Models

  • Pre-training embeddings, encoders and decoders have been shown to be useful for NMT
  • But
    • How can pre-training be incorporated into different MNMT architectures?
    • How can the impact of transfer be maximized during fine-tuning?
  • Unsupervised pre-training and unsupervised NMT might be worth investigating

149

150 of 156

Unseen Language Pair Translation

  • Previous work on unseen language pair translation has only addressed cases where the pivot language is related to or shares the same script with the source language
  • The pivot language (mostly English) is unlikely to be related to the source and target languages and this scenario requires further investigation (especially for zero-shot translation).
  • New approaches need to be explored to significantly improve over the simple pivot baseline.

150

151 of 156

Joint Multilingual and Multi-Domain NMT

  • When extending an NMT system to a new language, the parallel corpus in the domain of interest may not be available
  • Transfer learning in this case has to span languages and domains
  • It might be worthwhile to explore adversarial approaches where domain and language invariant representations can be learned for the best translations

151

152 of 156

Multilingual Speech-to-Speech NMT

  • An interesting research direction would be to explore multilingual speech translation, where the ASR, translation and TTS modules can be multilingual
  • Interesting challenges and opportunities may arise in the quest to compose all these multilingual systems in an end-to-end manner.
  • Multilingual end-to-end speech-to-speech translation would also be a future challenging scenario

152

153 of 156

153

Your ideas?

154 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

154

155 of 156

155

Summary of MNMT (Dabre et al., 2020): use-cases, core-issues and challenges

  • Multiway Modeling (use-case); core-issues:
    • Parameter Sharing: full/minimal/controlled parameter sharing, massive models, capacity bottlenecks
    • Language Divergence: balancing language agnostic-dependent representations, impact of language tags, reordering and pre-ordering of languages, vocabularies
    • Training Protocols: parallel/joint training, multi-stage and incremental training, knowledge distillation, optimal stopping
  • Low-resource Translation (use-case); challenges:
    • Transfer Learning: fine-tuning, regularization, lexical transfer, syntactic transfer, language relatedness
    • Zero-shot Modelling: wrong language generation, language invariant representations, output agreement, effect of corpus size and number of languages
    • Zero-resource Modeling: synthetic corpus generation, iterative refinement, teacher-student models, pre-training approaches
  • Multi-source Translation (use-case): available/missing source sentences, multiway-multisource modeling, post-editing

156 of 156

156

Thank You!