Multilingual Neural Machine Translation
1
COLING 2020 Tutorial, December 12th, 2020
Raj Dabre
NICT
Kyoto, Japan
Chenhui Chu
Kyoto University
Kyoto, Japan
Anoop Kunchukuttan
Microsoft STCI
Hyderabad, India
2
Tutorial Homepage
https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020
Survey Paper
Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A Survey of Multilingual Neural Machine Translation. ACM Comput. Surv. 53, 5, Article 99 (September 2020), 38 pages. https://doi.org/10.1145/3406095
Updated Bibliography (the field is moving so fast!)
https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020/mnmt_bibliography.pdf
Tutorial Material
Self Introduction (Chenhui Chu)
3
Outline of This Tutorial
4
What is Multilingual NMT (MNMT)?
5
Why MNMT?
6
Size of the top-100 language pairs in OPUS (Tiedemann, 2012)
Why MNMT?
7
Data Distribution over language pairs (Arivazhagan et al., 2019)
Motivation and Goal of MNMT (Dabre et al., 2020)
8
Timeline of MNMT Research
9
NMT
2014
2015
2016
2017
2018
2019
2020
NMT with attention
Transformer
Multiway MNMT
Massive MNMT
Multi-source MNMT
Google MNMT
Beyond English-Centric
Low-resource MNMT
BERT
Google’s MNMT System (Johnson et al., 2017)
10
A single multilingual model achieves comparable performance for
English-French and surpasses state-of-the-art results for English-German.
Language Clusters (Johnson et al., 2017)
11
Mixing Target Languages (Johnson et al., 2017)
12
Google’s Massive MNMT System (Arivazhagan et al., 2019)
13
Studied a single massive MNMT model handling
103 languages trained on over 25 billion examples
Effect of Massive MNMT (Arivazhagan et al., 2019)
14
Massive MNMT is helpful for low-resource languages but not for high-resource languages
Effect of Task Numbers (Arivazhagan et al., 2019)
15
Increasing the number of languages is not always helpful
Facebook’s Beyond English-Centric MNMT System (Fan et al., 2020)
16
Performance of M2M-100 (Fan et al., 2020)
17
Goal of This Tutorial
18
Basic NMT Architecture (Dabre et al., 2020)
19
RNN based NMT (Bahdanau et al., 2015)
20
Self-Attention Based NMT (Vaswani et al., 2017)
21
Initial MNMT Models (1/2) (Firat et al., 2016)
22
Minimal Parameter Sharing
[Diagram: one encoder per source language (Encoder 1 ... Encoder X) and one decoder per target language (Decoder 1 ... Decoder Y), connected by a shared attention mechanism. Example: the English sentence "I am a boy." (Hindi: "मैं लड़का हूँ.", Italian: "Sono un ragazzo", Marathi: "मी मुलगा आहे.") is encoded with the English encoder and decoded with the Italian decoder.]
Initial MNMT Models (2/2) (Johnson et al., 2017)
23
Complete Parameter Sharing
[Diagram: a single shared encoder, attention mechanism, and decoder serve all language pairs; the desired target language is indicated by a tag such as <2mr> or <2it> prepended to the source sentence. Example inputs "<2mr> I am a boy.", "<2mr> मैं लड़का हूँ.", "<2it> I am a boy.", "<2it> मैं लड़का हूँ." are translated to "मी मुलगा आहे." (Marathi) and "Sono un ragazzo." (Italian) respectively.]
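A minimal sketch of the data preparation behind this complete-sharing setup (the <2xx> tagging convention and example sentences follow the slides; everything else is illustrative):

```python
# Johnson et al. (2017)-style data preparation: prepend a target-language tag to
# every source sentence so that one shared encoder-decoder serves all directions.

def tag_source(src_sentence: str, tgt_lang: str) -> str:
    """Prepend the target-language token, e.g. '<2mr>' for Marathi."""
    return f"<2{tgt_lang}> {src_sentence}"

# Illustrative mixed training corpus covering several directions.
corpus = [
    ("I am a boy.", "mr", "मी मुलगा आहे."),
    ("I am a boy.", "it", "Sono un ragazzo."),
]

training_pairs = [(tag_source(src, tgt_lang), tgt) for src, tgt_lang, tgt in corpus]
for src, tgt in training_pairs:
    print(src, "->", tgt)
```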
24
Overview of MNMT (Dabre et al., 2020)
Use-cases: Multiway Modeling; Low-resource Translation (Transfer Learning); Zero-resource / Zero-shot Modelling; Multi-source Translation
Core-issues: Parameter Sharing; Language Divergence; Training Protocols
Outline of This Tutorial
25
Self Introduction (Raj Dabre)
26
Topics to address
27
Topics to address
28
Parameter Sharing: Finding The Right Balance
29
Sharing Vocabularies (1)
30
Sharing Vocabularies (2)
31
[Diagram: each source token (e.g. "I") has a word embedding E_I^word and the language tag (e.g. "en") has a language embedding E_en^lang; the two are combined either by concatenation (E_I^word ; E_en^lang) or by addition (E_I^word + E_en^lang) before being fed to the encoder.]
Sharing In Encoders and Decoders (1)
32
Sharing In Encoders and Decoders (2)
33
Sharing In Encoders and Decoders (3)
34
A Note On Attention Sharing (Blackwood et al. 2018)
35
Contextual Parameters (Platanios et al., 2018)
36
Generate parameters using parameters!
Contextual Parameters (Platanios et al., 2018)
37
Contextual Parameters (Platanios et al., 2018)
38
Tensor routing (Zaremoodi et al., 2018)
39
Topics to address
40
Massively Multilingual Models
41
Google’s Massively Multilingual NMT (Aharoni et al., 2019)
42
Google’s Massively Multilingual NMT (Aharoni et al., 2019)
43
Ukrainian to Russian Zero-Shot Translation
Google’s Massively MNMT Model In the Wild
44
More Languages and Directions
45
Are Larger Models Helpful?
46
Language Aware Multilingualism
47
Adapting Previously Trained Models
48
Pushing Limits: Mixtures Of Experts (Gshard; Lepikhin et al. 2020)
49
Replace the feed-forward sublayer (sandwiched between attention layers) with a mixture-of-experts layer in alternate Transformer blocks
Explosive growth in parameters (600B) as the number of experts grows (up to 2048)
Use model sharding and dynamic routing across a large number of accelerators (2048 TPU cores)
Engineering + Research = Ultimate Solution?
50
Bottlenecks: Hardware, Cost and Energy Efficiency
51
Bottlenecks: Hardware, Cost and Energy Efficiency
52
Topics to address
53
Language Divergence
54
On Vocabulary in MNMT
55
Key Observations In Practice
56
Visualization Of MNMT Representations
57
Representation Similarity Evolution
58
Representation Evolution With Depth
For many-to-English: encoder representations converge and decoder representations diverge
For English-to-many: encoder representations of English diverge based on the target language
Question: Is divergence a good thing? Does it cause a learning overhead? Should it?
59
Empirically Determined Language Families
60
Predetermined language families
Empirically determined language families via embedding clustering
Is There An Optimal Number of Languages?
61
On Language Tags: Embeddings vs Features
62
Johnson et al. 2017: prime the encoder's input with a <2xx> target-language token, e.g. train NMT on "<2ja> I am a boy" → "私は男の子です" ("I am a boy" in Japanese). <2xx> is a single token with its own embedding: a black-box approach.
Ha et al. 2016: distinguish between shared vocabulary units by marking their language, e.g. train NMT on "ja I@en am@en a@en boy@en ja" → "私は男の子です", so the encoder is primed with source as well as target language information.
Blackwood et al. 2019: use language tokens at the beginning and end of the source sentence.
On Language Tags: Embeddings vs Features
63
Ha et al. 2017 and Hokamp et al. 2019: keep word embeddings independent of the task (target language) via features, e.g. train factored NMT on "I am a boy" → "私は男の子です" with a target-language factor "ja".
[Diagram: each source token (e.g. "I") has a word embedding E_I^word and a language feature embedding E_en^feature, which are concatenated or added to form the encoder input E_I^word+feature.]
Topics to address
64
Training Protocols
65
On MNMT Training
Where the individual language pair negative log-likelihood is
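A standard way to write this objective (a sketch assuming per-pair parallel corpora C_{s,t} and optional weights λ_{s,t}; the slide's exact equation may differ):

```latex
% Joint objective: (weighted) sum of per-language-pair negative log-likelihoods
\mathcal{L}(\theta) \;=\; \sum_{(s,t)} \lambda_{s,t}\, \mathcal{L}_{s,t}(\theta),
\qquad
\mathcal{L}_{s,t}(\theta) \;=\; -\sum_{(\mathbf{x},\mathbf{y}) \in C_{s,t}} \log P(\mathbf{y} \mid \mathbf{x};\, \theta)
```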
66
Training Schedule: Joint Training
67
Training Schedule: Joint Training
68
Training Schedule: Addressing Language Equality
69
Importance Of Temperature Based Sampling (Arivazhagan et al. 2019)
70
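A minimal sketch of temperature-based sampling over language pairs (the p_l^(1/T) formula follows Arivazhagan et al., 2019; the corpus sizes below are made up):

```python
# Temperature-based sampling: sample language pair l with probability proportional
# to p_l^(1/T), where p_l is its share of the total data. T = 1 reproduces the data
# distribution; larger T flattens it towards uniform, upsampling low-resource pairs.

def sampling_probs(corpus_sizes: dict, temperature: float) -> dict:
    total = sum(corpus_sizes.values())
    weights = {lp: (n / total) ** (1.0 / temperature) for lp, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lp: w / z for lp, w in weights.items()}

# Illustrative corpus sizes (number of sentence pairs).
sizes = {"en-fr": 40_000_000, "en-hi": 1_500_000, "en-gu": 150_000}
print(sampling_probs(sizes, temperature=1.0))   # proportional to data size
print(sampling_probs(sizes, temperature=5.0))   # closer to uniform
```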
Leveraging Bilingual Models
71
[Diagram: sequence-level distillation with a bilingual model: train a wide and/or deep (big) NMT model on L1→L2 data, decode the L1 side to obtain distillation data L1→L'2, then train a narrow and/or shallow (small) NMT model on that data.]
Distillation For MNMT Training (Tan et al. 2019)
72
Train MNMT Model With Smoothed Data and Labels
[Diagram: individual bilingual teacher models (L1→L2, L3→L4, ..., Lm→Ln) decode their source sides to produce distillation data (L1→L'2, L3→L'4, ..., Lm→L'n) and smoothed labels; the MNMT student model is then trained on the smoothed data and labels pooled across all pairs.]
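A minimal sketch of a word-level distillation loss in this spirit (PyTorch; the interpolation weight and the use of the full teacher distribution are simplifications, not the exact recipe of Tan et al., 2019):

```python
import torch
import torch.nn.functional as F

# Combine the usual NLL on gold targets with a KL term that pushes the multilingual
# student towards the bilingual teacher's output distribution for the same batch.

def distillation_loss(student_logits, teacher_logits, gold_ids, alpha=0.5):
    # student_logits, teacher_logits: (batch, time, vocab); gold_ids: (batch, time)
    nll = F.cross_entropy(student_logits.transpose(1, 2), gold_ids)
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return (1 - alpha) * nll + alpha * kd

# Tiny illustrative usage with random tensors.
student = torch.randn(2, 5, 100)
teacher = torch.randn(2, 5, 100)
gold = torch.randint(0, 100, (2, 5))
print(distillation_loss(student, teacher, gold))
```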
Incremental Training
73
Incorporating New Languages (Lakew et al., 2018)
74
Capacity Expansion Of Existing Models (Escolano et al. 2019)
75
Correlated Topics Not Well Addressed Yet
76
Correlated Topics Not Well Addressed Yet
77
Outline of This Tutorial
78
Self Introduction (Anoop Kunchukuttan)
79
Section Overview
80
81
Translation for low-resource languages
A large skew in parallel corpus availability
Long tail of low-resource languages
Difficult to obtain corpora for many languages
Can high-resource languages help low-resource languages?
(Arivazhagan et al., 2019b)
82
Storing knowledge gained while solving one problem & applying it to a different but related problem.
Transfer learning
High resource pair corpus
Parent Model
Child Model
Knowledge
Low resource pair corpus
Concepts,
Domain knowledge,
Grammar,
Source-to-target transformation
Transfer Learning Scenarios
Many to one translation (M2O)
One to Many translation (O2M)
(Pan & Yang, 2010)
Section Overview
83
84
Joint Training
(Ha et al., 2016; Johnson et al, 2017)
Oversample child/undersample parent to balance training
Low-resource languages performance
Many-to-One direction ⇒ major gains
One-to-Many direction ⇒ minor gains
85
Fine-tuning
Fine-tuning
(Zoph et al., 2016; Tan et al., 2019b)
86
Small low-resource corpus ⇒ overfitting might occur ⇒ Use mixed-finetuning
Mixed
Fine-tuning
(Dabre et al., 2019)
87
Transfer from multiple parents
Pre-train a multilingual NMT model on a representative set of high-resource languages
Useful for rapid-adaptation to new languages
(Neubig and Hu et al., 2018)
88
Transfer to multiple children
Useful for O2M setting, where single-stage fine-tuning is not very beneficial (10% improvement in BLEU score)
Multi-linguality in the mixed-fine tune stage aids translation
(Dabre et al., 2019)
89
What is the objective of the parent-model training?
90
Meta-learning
Learning to learn
Learn an initialization from which only a few examples are required to learn the child-task
(Finn et al., 2017)
91
Model-Agnostic Meta-Learning for MNMT (Gu et al., 2018b)
[Diagram: the meta-learning loop: (1) sample language pairs LP_1 ... LP_i; (2) for each pair, sample training examples and simulate a training step, giving adapted parameters 𝚹_i' from the shared initialization 𝚹; (3) sample test examples and compute the test error Err_i under 𝚹_i'; (4) average the test errors across language pairs, compute the meta-gradient, and update the original parameters 𝚹 = 𝚹_new.]
Outperforms multilingual fine-tuning strategy on unseen pairs
Requires far fewer adaptation steps for comparable performance
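A toy numpy sketch of this loop using a first-order approximation of MAML on synthetic regression tasks (all task definitions and hyperparameters are illustrative, not the setup of Gu et al., 2018b):

```python
import numpy as np

# First-order MAML sketch: simulate an inner (language-pair-specific) update,
# evaluate on held-out examples, and average the resulting gradients to update
# the shared initialization. "Tasks" here are toy linear-regression problems.

rng = np.random.default_rng(0)

def make_task():
    w_true = rng.normal(size=3)
    X = rng.normal(size=(32, 3))
    y = X @ w_true + 0.01 * rng.normal(size=32)
    return (X[:16], y[:16]), (X[16:], y[16:])        # (train split, test split)

def grad(w, X, y):                                   # gradient of mean squared error
    return 2 * X.T @ (X @ w - y) / len(y)

theta = np.zeros(3)
inner_lr, meta_lr = 0.1, 0.05
for _ in range(100):
    meta_grads = []
    for _ in range(4):                               # sample "language pairs"
        (Xtr, ytr), (Xte, yte) = make_task()
        theta_i = theta - inner_lr * grad(theta, Xtr, ytr)   # simulated training step
        meta_grads.append(grad(theta_i, Xte, yte))           # test-error gradient
    theta = theta - meta_lr * np.mean(meta_grads, axis=0)    # meta-update
print("meta-learned initialization:", theta)
```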
Section Overview
92
93
Lexical Transfer
How do we initialize child token embeddings?
Initialize child token embeddings prior to fine-tuning (Zoph et al., 2016): either randomly assign parent-vocabulary embeddings to child-vocabulary tokens, or use dictionary initialization with a bilingual dictionary between parent and child.
Dictionary initialization converges faster than random assignment, but random assignment eventually achieves comparable performance.
94
Use bilingual embeddings to map parent and child embeddings to a common space, or map child embeddings into the parent embedding space (Gu et al., 2018a; Kim et al., 2019a)
Linear mapping functions can be learnt using small bilingual dictionaries
Significant improvements over random assignment of embeddings
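A minimal sketch of learning such a linear map from a small bilingual dictionary, using the orthogonal Procrustes solution (all data below is random placeholder, not real embeddings):

```python
import numpy as np

# Learn an orthogonal map W so that child dictionary embeddings X land close to
# their parent translations Y (W = argmin ||XW - Y||_F with W orthogonal).
# The mapped child embeddings can then initialize the child model before fine-tuning.

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
dim, dict_size, child_vocab = 64, 500, 8000      # illustrative sizes
X = rng.normal(size=(dict_size, dim))            # child embeddings of dictionary entries
Y = rng.normal(size=(dict_size, dim))            # parent embeddings of their translations
W = procrustes(X, Y)

child_embeddings = rng.normal(size=(child_vocab, dim))
mapped_child = child_embeddings @ W              # child embeddings in the parent space
```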
Section Overview
95
96
Introduce noise in parent-source sentences
Prevents over-optimization to parent-source language
Noising schemes
Word insertion
Word deletion
Word Swapping
(Kim et al., 2019a)
Simple methods that give modest improvements over baseline fine-tuning
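A minimal sketch of the three noising schemes applied to a tokenized parent-source sentence (probabilities and the filler token are illustrative choices):

```python
import random

# Word-level noise on the parent-source side: random deletion, insertion, and local
# swapping, which discourages over-fitting to parent-source word order and lexicon.

def add_noise(tokens, p_drop=0.1, p_insert=0.1, max_swap=2, filler="<blank>"):
    out = [t for t in tokens if random.random() > p_drop]                # word deletion
    out = [x for t in out
           for x in ([filler, t] if random.random() < p_insert else [t])]  # word insertion
    for _ in range(max_swap):                                            # word swapping
        if len(out) > 1:
            i = random.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(add_noise("this is a parent source sentence".split()))
```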
What if parent and child have different word orders?
97
Reorder parent-source sentence to match child-source sentence
Ensures better alignment of encoder contextual embeddings
(Murthy et al., 2019)
Significant improvements over baseline finetuning, but needs a parser and re-ordering system
Section Overview
98
99
100
Key Similarities between related languages
(Kunchukuttan & Bhattacharyya., 2020)
101
(Kudugunta et al., 2019)
Transfer Learning works best for related languages
Encoder Representations cluster by language family
(Zoph et al., 2016; Dabre et al, 2017b)
Contact relationship is also captured
Related languages using different scripts also cluster together
102
Utilizing lexical similarity
Subword-level vocabulary improves transfer
Improves parent-child vocabulary overlap encouraging better transfer
Child pair | Subword unit | BLEU (baseline) | BLEU (transfer) | Vocab overlap, train (%) | Vocab overlap, dev (%) |
tur-eng | word | 8.1 | 8.5 | 3.9 | 3.6 |
tur-eng | BPE | 12.4 | 13.2 | 58.8 | 25.0 |
uyg-eng | word | 8.5 | 10.6 | 0.5 | 1.7 |
uyg-eng | BPE | 11.1 | 15.4 | 57.2 | 48.5 |
(uzb-eng is the parent language pair)
(Nguyen and Chiang, 2017)
103
Transfer between related languages using different scripts works well
Similar scripts → script conversion: Slavic (Cyrillic & Roman), Indic (Brahmi) (Lee et al., 2016; Dabre et al., 2018; Goyal et al., 2020)
Very different scripts → romanization (unconv, uroman): Turkic; Hindi, Urdu (Gheini & May, 2019; Amrhein & Sennrich, 2020)
Transfer can also be done between languages related by contact (Goyal et al. 2020)
Dravidian and Indo-Aryan languages form a linguistic area in the Indian subcontinent (Emeneau, 1956)
Transfer works without script conversion → but script conversion provides improvements
104
Similar language regularization
Very little data for the low-resource language ⇒ overfitting during fine-tuning
Pre-train model on multiple languages
Concatenate HRL and LRL corpus
Fine-tune jointly
Joint fine-tuning outperforms just target language fine-tuning
Pre-training with multiple languages is better than single language
(Neubig & Hu, 2018; Chaudhary et al, 2019)
Similar idea is also used for knowledge distillation
(Dabre & Fujita et al, 2020)
105
Reorder parent-source sentence to match child-source sentence
Significant improvements over baseline finetuning, but needs a parser and re-ordering system
(Murthy et al., 2019)
Reordering rules can be reused if the child-source languages have the same word order
Utilizing Syntactic Similarity
106
Parent Data Selection
Which examples in the parent language-pair are most helpful for transfer?
(Wang et al., 2019)
Let us look at the case of Many-to-one translation
[Diagram: candidate parent-source sentences s_1, s_2, ..., s_5 paired with target t.]
(s_i, t) is a parent sentence pair from the high-resource pair (H, E)
Score s_i by the probability that it belongs to the low-resource source language L (scoring functions), and sample parent examples using this score
Section Overview
107
108
Pivot Translation
Zero-shot Translation
Zero-resource Translation
[Diagram: Pivot translation: the Hindi sentence "aaja bahuta ThaNDa hai" is first translated into English ("It is very cold today") by a hi-en model, and the English output is then translated into Malayalam ("inn vaLar.e taNuppAN") by an en-ml model: multiple translation systems and multiple decoding steps. Zero-shot translation: a single many-to-many model translates the Hindi sentence directly into Malayalam in a single decoding step with a single translation system. (Johnson et al., 2016; Firat et al., 2016b)]
Section Overview
109
110
[Diagram: Bilingual pivot: Hindi "aaja bahuta ThaNDa hai" → English "It is very cold today" (hi-en model) → Malayalam "inn vaLar.e taNuppAN" (en-ml model). Multilingual pivot: the same two-step hi→en→ml translation, but with both steps performed by a single many-to-many model.]
Multilingual pivot generally outperforms bilingual pivot (Firat et al., 2016)
Pivot translation is a strong baseline
Limitations: cascading errors (a function of the path length); these can be reduced by passing n-best translations through the pivot
(Johnson et al., 2016)
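A minimal sketch of bilingual pivoting that keeps n-best pivot hypotheses to soften cascading errors (hi_en_nbest and en_ml_nbest are hypothetical stand-ins for trained decoders returning (hypothesis, log-probability) pairs):

```python
# Bilingual pivot: source -> pivot -> target, keeping n-best pivot hypotheses so
# that errors from the first stage are not locked in by a single 1-best output.

def pivot_translate(src_hi, hi_en_nbest, en_ml_nbest, n=5):
    candidates = []
    for en_hyp, en_score in hi_en_nbest(src_hi, n):
        for ml_hyp, ml_score in en_ml_nbest(en_hyp, n):
            candidates.append((ml_hyp, en_score + ml_score))   # combine log-probs
    return max(candidates, key=lambda c: c[1])[0]
```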
Section Overview
111
112
Naive zero-shot performance significantly lags behind pivot translation
Model | de→fr | fr→de |
Pivot | 19.71 | 24.33 |
Zero-shot | 19.22 | 21.63 |
Output is generated in the wrong language
Code-mixing is rare
Once a wrong language token is generated, all subsequent tokens are generated in that language
Model | de→fr | fr→de |
Pivot | 26.25 | 20.18 |
Zero-shot | 16.80 | 12.03 |
Performance gap reduces when evaluation restricted to correct language output
(Results from Arivazhagan et al., 2019a)
(Arivazhagan et al., 2019a; Gu et al., 2019)
113
Restrict the decoder to output only target-language vocabulary items (Ha et al., 2017)
Prevents generation of the wrong language, but performance still lags behind the pivot baseline
Model | German-Dutch | German-Romanian |
baseline | 14.95 | 10.83 |
+vocab filter | 16.02 | 11.00 |
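A minimal sketch of such vocabulary filtering by masking decoder logits (the vocabulary size and allowed-id set below are illustrative):

```python
import torch

# Build a mask of subword ids allowed for the desired target language and add -inf
# to the logits of everything else, so the decoder cannot emit wrong-language tokens.

def make_vocab_mask(vocab_size, allowed_ids):
    mask = torch.full((vocab_size,), float("-inf"))
    mask[torch.tensor(sorted(allowed_ids))] = 0.0
    return mask

def filtered_logits(logits, mask):
    # logits: (batch, vocab); the mask broadcasts over the batch dimension
    return logits + mask

vocab_size = 100                      # illustrative
french_ids = {1, 4, 7, 42, 99}        # ids of subwords observed in French data
mask = make_vocab_mask(vocab_size, french_ids)
logits = torch.randn(2, vocab_size)
probs = filtered_logits(logits, mask).softmax(dim=-1)
print(probs[:, list(french_ids)].sum(dim=-1))   # all probability mass stays in-language
```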
Language-specific subword model learning (Rios et al., 2020)
Vocabulary construction and control
Reduces copying behaviour and bias towards English output
Joint vocab | 15.4 |
Language-specific vocab | 20.5 |
114
Minimize divergence between encoder representations
Loss = CE + Ω(z_x, z_y)
Ω(z_x, z_y): distance between the encoder representations of the source (x) and the target (y)
Supervised Objectives
Unsupervised Distribution Matching
(Arivazhagan et al., 2019a)
Use a domain-adversarial loss
Competitive with pivot and improves over baseline MNMT
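Spelled out with an explicit interpolation weight λ (a generic formulation; the exact distance Ω varies between the supervised and adversarial variants above):

```latex
% Total loss: translation cross-entropy plus an alignment penalty that pulls
% encoder representations of parallel source/target sentences together.
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \Omega\!\left(z_x, z_y\right),
\qquad
\Omega(z_x, z_y) = \big\lVert \bar{z}_x - \bar{z}_y \big\rVert_2^2
\quad \text{(e.g. distance between pooled encoder states)}
```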
115
Avoid pooling to generate encoder representations
(Pham et al., 2019)
Encoder output is variable length
Decoder receives fixed input at every timestep
Minimize divergence between auto-encoded and true target at different points in the decoder
Attention Forcing at bottom decoder layer
Improvements over directly minimizing encoder divergence
116
Encourage output agreement
Encourage equivalent sentence in two languages to generate similar output in an auxiliary language
Decode to language L using both inputs s and t
Score output of one input using other input
Add a loss term to force output agreement
(Al-Shedivat and Parikh, 2019)
Competitive with pivot and no-loss in supervised directions
117
Effect of number of languages & corpus size
Zeroshot performance improves with the number of languages
(Aharoni et al., 2019; Arivazhagan et al., 2019b)
(Aharoni et al., 2019)
(Arivazhagan et al., 2019b)
Zero-shot translation may work well only when the multilingual parallel corpora are large
(Mattoni et al., 2017; Lakew et al., 2017)
118
Can monolingual pre-training help zero-shot translation?
mBART Training
Finetune with parallel related corpus
Decode for related language
[Diagram: mBART is pre-trained on monolingual corpora for many languages (hi, ne, en, ...); the pre-trained model is fine-tuned on a hi-en parallel corpus to obtain a hi-en translation model, which is then used to decode a Nepali (ne) test sentence into English output.]
(Liu et al., 2020)
Section Overview
119
120
Zero-resource Translation
Zero-shot translation & Zero-resource → No parallel corpus between unseen languages
Zero-shot → no specific training for unseen language pairs
Zero-resource → training takes into account unseen language pairs of interest
e.g. use a synthetic parallel corpus
Zero-resource NMT can be used to tune the NMT model for some unseen language pairs of interest
(Firat et al., 2016b)
121
Creating Synthetic Parallel Corpus
Expose the M2M model to zero-shot directions
Create synthetic-source to real-target parallel data: { (X'_1, Y_1), (X'_2, Y_2), (X'_3, Y_3), (X'_4, Y_4) }
Via back-translation using the M2M model
Via a pivot language: Y → E → X', or Y' ← E → X'
(Firat et al., 2016b; Lakew et al., 2017; Gu et al., 2019; Currey & Heafield, 2019)
122
Iterative Refinement
Backtranslation quality depends on quality of underlying translation models
[Diagram: iterative refinement loop (Lakew et al., 2017): train an MNMT model on the original parallel data, back-translate to generate back-translated data, augment the original data with it, retrain the MNMT model, and repeat.]
Iterative reinforcement learning approaches which reward original translation directions based on language modelling and reconstruction losses in zeroshot directions (Sestorain et al., 2018)
123
Scaling BT to multiple translation directions
Expensive to generate BT data for O(n²) language pairs
Random Online backtranslation
(Zhang et al., 2020)
For every real sentence pair (x, y) in language pair (s, t), back-translate y into a randomly sampled language to create a synthetic pair for a zero-shot direction
Only doubles the effective corpus size
Results approach pivot baseline
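A minimal sketch of random online back-translation as described above (m2m_translate is a hypothetical stand-in for the current multilingual model; this follows the slide's description rather than the exact implementation of Zhang et al., 2020):

```python
import random

# ROBT sketch: for each real pair (x, y) in direction (s, t), pick a random other
# language p and back-translate y into p with the current multilingual model,
# yielding a synthetic pair for the zero-shot direction p -> t. Each real pair adds
# one synthetic pair, so the effective corpus size only doubles.

def robt_batch(batch, languages, m2m_translate):
    augmented = []
    for (x, y, s, t) in batch:
        augmented.append((x, y, s, t))                       # keep the real pair
        p = random.choice([l for l in languages if l not in (s, t)])
        y_bt = m2m_translate(y, src_lang=t, tgt_lang=p)      # back-translate the target
        augmented.append((y_bt, y, p, t))                    # synthetic zero-shot pair
    return augmented
```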
124
Teacher-student training (Chen et al., 2017)
[Diagram: source S, pivot P, target T; minimize the divergence between the pivot-to-target (teacher) and source-to-target (student) output distributions.]
Assumption: for a source-pivot sentence pair, the source-to-target and pivot-to-target translation distributions are similar
125
Given: Source-Pivot and Pivot-Target Parallel Corpus
[Diagram: the Pivot-Target model (teacher) is trained with MLE on the Pivot-Target parallel corpus; the Source-Target model (student) is trained with teacher-student training, using the Source-Pivot parallel corpus to pair student inputs with teacher inputs.]
Teacher-Student Training Approaches: sentence-level matching and token-level matching
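One common way to write the two matching objectives (a sketch of the idea in Chen et al., 2017, with our own notation): given a source-pivot pair (x, z), the pivot-to-target teacher supervises the source-to-target student.

```latex
% Sentence-level matching: the student maximizes the likelihood of the teacher's
% translation(s) of the pivot sentence z.
\mathcal{L}_{\mathrm{sent}}(\theta_{x \to y})
  = - \mathbb{E}_{\hat{y} \sim P(\cdot \mid z;\, \theta_{z \to y})}
      \log P(\hat{y} \mid x;\, \theta_{x \to y})

% Token-level matching: at every position the student matches the teacher's
% next-token distribution.
\mathcal{L}_{\mathrm{tok}}(\theta_{x \to y})
  = \sum_{j} \mathrm{KL}\!\left(
      P(y_j \mid y_{<j}, z;\, \theta_{z \to y})
      \,\Vert\,
      P(y_j \mid y_{<j}, x;\, \theta_{x \to y})
    \right)
```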
Section Overview
126
127
Can’t we simply add direct parallel corpora between non-English languages?
How do we acquire such parallel data?
How do we address data imbalance?
Single-bridge vs. Multi-bridge systems
128
Mine parallel corpus from monolingual corpora
Expensive to mine from all 100 x 99 pairs
Which are the most promising language pairs to mine from?
Cluster languages and select bridge languages
(Fan et al., 2020)
129
Extraction from English-centric parallel corpora
(Freitag & Firat, 2020; Rios et al., 2020)
130
Data Sampling Strategies
Sampling strategies used for English-centric datasets have limitations:
Just ~15% of sentence pairs are exclusively non-English
Sampling independently on the source and target marginals is also biased
Solution 1: (Freitag & Firat, 2020)
Solution 2 (Fan et al., 2020): sample from a probability matrix over language pairs while ensuring the source and target marginals follow a temperature-based schedule
These approaches improve over standard temperature-based sampling for non-English directions
131
Major Results
| | Fan et al., 2020 | | Rios et al., 2020 | |
Language Pairs → | New Train | Unseen | New Train | Unseen |
Single-bridge | 5.4 | 7.6 | 20.0 | 20.9 |
Single-bridge pivot | 9.8 | 12.4 | 22.9 | 23.7 |
Multi-bridge | 12.3 | 18.5 | 25.1 | 24.0 |
Multi-bridge (zeroshot) | | | | 25.2 |
Word of caution: human evaluation gains are not as significant (Fan et al., 2020)
Section Summary
MNMT has helped make significant advances in low-resource MT
Transfer Learning
Translation between unseen languages
132
Outline of This Tutorial
133
Why Multi-source MT?
134
If a source sentence has already been translated into multiple languages, these translations can be used together to improve translation into the target language, for example in multilingual bodies such as the EU and the UN.
Figures from (Nishimura et al., 2018)
Multi-Source Available: Multi-source Encoder (Zoph et al., 2016)
135
Results of Multi-source Encoder (Zoph et al., 2016)
136
Multi-Source Available: Ensembling (Garmash et al., 2016)
137
Results of Ensembling (Garmash et al., 2016)
138
Multi-Source Available: Concatenation (Dabre et al., 2017)
139
[Pipeline: start from a multilingual (N-way) corpus, concatenate the source sentences of each example into a multi-source corpus, apply word segmentation (BPE), train a standard NMT model, and early-stop on the best dev BLEU.]
Multilingual (N-way) corpus:
Hello ||| Bonjour ||| नमस्कर ||| Kamusta ||| Hallo ||| こんにちは
I ||| Je ||| मी ||| ako ||| ech ||| 私
Multi-source corpus (concatenated sources ||| target):
Hello Bonjour नमस्कर Kamusta Hallo ||| こんにちは
I Je मी ako ech ||| 私
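A minimal sketch of the concatenation step (the separator and examples follow the slide; everything else is illustrative):

```python
# Multi-source via concatenation: join all source-language versions of a sentence
# into one long input string, keep the target as-is, then train a standard NMT
# model (after BPE) on the concatenated inputs.

def make_multisource_example(sources, target, sep=" "):
    return sep.join(sources), target

sources = ["Hello", "Bonjour", "नमस्कर", "Kamusta", "Hallo"]
src, tgt = make_multisource_example(sources, "こんにちは")
print(src, "|||", tgt)
```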
Results of Concatenation (Dabre et al., 2017)
140
* Concatenation (bold)/Ensembling/Multi-source Encoder
Missing Source Sentences (Nishimura et al., 2018)
141
Scenario:
Methods:
Results of Missing Source Sentences (Nishimura et al., 2018)
142
Summary of Multi-source Approaches
143
Outline of This Tutorial
144
Multiway Datasets
145
Languages | Corpora |
European languages | Europarl, JRC-Aquis, DGT-Aquis, DGT-TM, ECDC-TM, EAC-TM etc. |
Asian languages | WAT shared tasks etc. |
Indic languages | CVIT-PIB/MKB, PMIndia, IndoWordNet etc. |
Massive | WikiMatrix, JW300 etc. |
Others | UN, TED, Opensubtitles etc. |
Refer to catalogs like OPUS and the IndicNLP catalog for comprehensive listings of parallel corpora resources.
Low or Zero-Resource Multiway Datasets
146
Corpus | Domain | Languages |
FLORES | Wikipedia | English, Nepali, Sinhala |
XNLI | Caption | 15 languages |
CVIT-Mann ki Baat/PIB | General | 10 Indian Languages |
Indic parallel corpus | Wikipedia | 6 Indian Languages |
WMT shared tasks | Web | German, Upper Sorbian |
+All the multiway datasets listed in the previous slide can be used for testing
Multi-source Datasets
147
Corpus | N-way | Domain | Languages |
Europarl | 11 | Politics | European languages |
TED | 5 | Spoken | French, German, Czech, Arabic and English |
UN | 6 | Politics | Arabic, Chinese, English, French, Russian and Spanish |
ILCI | 11 | Tourism/health | Indian languages + English |
ALT | 9 | News | South-East Asian languages + English, Japanese |
Bible | 1,000 | Religion | Most major languages |
Outline of This Tutorial
148
Exploring Pre-trained Models
149
Unseen Language Pair Translation
150
Joint Multilingual and Multi-Domain NMT
151
Multilingual Speech-to-Speech NMT
152
153
Your ideas?
Outline of This Tutorial
154
155
Summary of MNMT (Dabre et al., 2020)
Use-cases, core-issues, and challenges:
Multiway Modeling: Parameter Sharing (full/minimal/controlled parameter sharing, massive models, capacity bottlenecks); Language Divergence (balancing language-agnostic and language-dependent representations, impact of language tags, reordering and pre-ordering of languages, vocabularies); Training Protocols (parallel/joint training, multi-stage and incremental training, knowledge distillation, optimal stopping)
Low-resource Translation / Transfer Learning: fine-tuning, regularization, lexical transfer, syntactic transfer, language relatedness
Zero-shot Modelling: wrong language generation, language invariant representations, output agreement, effect of corpus size and number of languages
Zero-resource Modeling: synthetic corpus generation, iterative refinement, teacher-student models, pre-training approaches
Multi-source Translation: available/missing source sentences, multiway-multisource modeling, post-editing
156
Tutorial Material: https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020
Survey Paper: https://dl.acm.org/doi/10.1145/3406095
Thank You!