1 of 156

Multilingual Neural Machine Translation

1

COLING 2020 Tutorial, December 12th, 2020

Raj Dabre

NICT

Kyoto, Japan

Chenhui Chu

Kyoto University

Kyoto, Japan

Anoop Kunchukuttan

Microsoft STCI

Hyderabad, India

2 of 156

2

Tutorial Homepage

https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020

Survey Paper

Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A Survey of Multilingual Neural Machine Translation. ACM Comput. Surv. 53, 5, Article 99 (September 2020), 38 pages. https://doi.org/10.1145/3406095

Updated Bibliography (the field is moving so fast!)

https://github.com/anoopkunchukuttan/multinmt_tutorial_coling2020/mnmt_bibliography.pdf

Tutorial Material

3 of 156

Self Introduction (Chenhui Chu)

  • Experience
    • 2020-present: program-specific associate professor @ Kyoto University
    • 2017-2020: research assistant professor @ Osaka University
    • 2015-2017: researcher @ Japan Science and Technology Agency
    • 2014-2015: research fellowship for young scientists (DC2) @ JSPS
  • Research
    • Machine translation (JSPS DC2, Chinese-Japanese MT practical application project, JSPS research activity start-up)
    • Language and vision understanding (ACT-I, MSRA CORE, JSPS young scientists)

3

4 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

4

5 of 156

What is Multilingual NMT (MNMT)?

  • Definition
    • Neural machine translation (NMT) systems handling translation between more than one language pair

  • A research field to study:
    • How can we leverage multilingual data effectively in order to learn distributions across multiple languages so as to improve MT (NLP) performance across all languages?

5

6 of 156

Why MNMT?

6

Size of the top-100 language pairs in OPUS (Tiedemann et al., 2012)

7 of 156

Why MNMT?

7

Data Distribution over language pairs (Arivazhagan et al., 2019)

8 of 156

Motivation and Goal of MNMT (Dabre et al., 2020)

  • Motivation
    • MNMT systems are desirable because training models with data from many language pairs might help a resource-poor language acquire extra knowledge from the other languages
    • MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality compared to bilingual NMT systems
  • Goal
    • Have a one-model-for-all-languages solution to MT (NLP) applications
    • Shared multilingual distributed representations help MT (NLP) for low-resource languages

8

9 of 156

Timeline of MNMT Research

9

[Timeline figure, 2014-2020: NMT; NMT with attention; Multi-source MNMT; Multiway MNMT; Google MNMT; Transformer; BERT; Low-resource MNMT; Massive MNMT; Beyond English-Centric]

10 of 156

Google’s MNMT System (Johnson et al., 2017)

10

A single multilingual model achieves comparable performance for English-French and surpasses state-of-the-art results for English-German.

11 of 156

Language Clusters (Johnson et al., 2017)

11

12 of 156

Mixing Target Languages (Johnson et al., 2017)

12

13 of 156

Google’s Massive MNMT System (Arivazhagan et al., 2019)

13

Studied a single massive MNMT model handling 103 languages, trained on over 25 billion examples

[Figure: BLEU across language pairs]

14 of 156

Effect of Massive MNMT (Arivazhagan et al., 2019)

14

Massive MNMT is helpful for low-resource languages but not high-resource languages

15 of 156

Effect of Task Numbers (Arivazhagan et al., 2019)

15

Increasing the number of languages is not always helpful

16 of 156

Facebook’s Beyond English-Centric MNMT System (Fan et al., 2020)

  • M2M-100: the first MNMT model that can translate between any pair of 100 languages without relying on English data
  • It outperforms English-centric systems by 10 points on the widely used BLEU metric for evaluating machine translations
  • M2M-100 is trained on a total of 2,200 language directions, 10x more than previous best English-centric multilingual models

16

17 of 156

Performance of M2M-100 (Fan et al., 2020)

17

18 of 156

Goal of This Tutorial

  • Understand the basics of MNMT
  • Cover an in-depth survey of existing literature on MNMT
    • Central use-case, resource scenarios, underlying modeling principles, core-issues and challenges of various approaches
    • Strengths and weaknesses of several techniques by comparing them with each other
  • Inspire new ideas for researchers and engineers interested in MNMT

18

19 of 156

Basic NMT Architecture (Dabre et al., 2020)

19

20 of 156

RNN based NMT (Bahdanau et al., 2015)

20

21 of 156

Self-Attention Based NMT (Vaswani et al., 2017)

21

22 of 156

Initial MNMT Models (1/2) (Firat et al., 2016)

22

[Figure (minimal parameter sharing): separate encoders (Encoder 1 ... Encoder X) and separate decoders (Decoder 1 ... Decoder Y), one per language, connected through a shared attention mechanism. Example: encode English "I am a boy." and decode Italian "Sono un ragazzo."; other languages shown include Hindi "मैं लड़का हूँ." and Marathi "मी मुलगा आहे."]

23 of 156

Initial MNMT Models (2/2) (Johnson et al., 2017)

23

[Figure (complete parameter sharing): a single shared encoder, attention mechanism, and decoder for all languages. The target language is indicated by a token prepended to the source, e.g. "<2mr> I am a boy." → "मी मुलगा आहे.", "<2it> I am a boy." → "Sono un ragazzo.", "<2it> मैं लड़का हूँ." → "Sono un ragazzo." A preprocessing sketch follows below.]
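A minimal preprocessing sketch of the target-token trick above (illustrative only; the function name and toy corpus are my own, not Johnson et al.'s code):

```python
# Minimal sketch: prepend a target-language token to each source sentence so a
# single fully shared model can be trained on all translation directions.

def add_target_token(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a <2xx> token indicating the desired target language."""
    return f"<2{tgt_lang}> {src_sentence}"

# Toy multilingual corpus: (source, target, target-language code)
corpus = [
    ("I am a boy.", "मी मुलगा आहे.", "mr"),       # English -> Marathi
    ("I am a boy.", "Sono un ragazzo.", "it"),    # English -> Italian
    ("मैं लड़का हूँ.", "Sono un ragazzo.", "it"),      # Hindi -> Italian
]

training_pairs = [(add_target_token(src, lang), tgt) for src, tgt, lang in corpus]
for src, tgt in training_pairs:
    print(src, "=>", tgt)
```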

24 of 156

24

Overview of MNMT (Dabre et al., 2020): use-cases and core-issues

  • Multiway Modeling; core-issues: Parameter Sharing, Language Divergence, Training Protocols
  • Low-resource Translation: Transfer Learning, Zero-shot Modelling, Zero-resource Modeling
  • Multi-source Translation

25 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

25

26 of 156

Self Introduction (Raj Dabre)

  • Experience
    • 2018-present: Researcher at NICT, Japan
    • 2014-2018: MEXT Ph.D. scholar at Kyoto University, Japan
    • 2011-2014: M.Tech. Government RA at IIT Bombay, India
  • Research
    • Low-Resource Natural Language Processing
      • Multilingual Machine Translation: 2012-present
      • Fundamental Analysis: 2011-2014
    • Efficient Deep Learning:
      • Compact, flexible and fast models (2018-present)

26

27 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

27

28 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

28

29 of 156

Parameter Sharing: Finding The Right Balance

  • Empirically determined component sharing
    • How to maximize impact of shared vocabulary?
    • Universal encoders or decoders?
    • Should attention be universal or language specific?
    • How much sharing is good sharing?

  • Parameterized parameters for shared modeling
    • Learn about sharing

29

30 of 156

Sharing Vocabularies (1)

  • Naive solution: A single vocabulary for all languages (Johnson et al., 2017)
  • Adding constraints: Soft-decoupled encoding (Wang et al., 2019)
    • Word or sub-word vocabulary
    • N-gram embeddings, language specific and latent semantic transformation
    • Good for linguistically similar languages in transfer learning setups
  • Future investigation
    • Indic languages
    • Groups of language families (for massively multilingual NMT)
      • Chung et al. 2020, Lyu et al. 2020

30

31 of 156

Sharing Vocabularies (2)

  • Language sensitive embeddings (Wang et al., 2019)
    • Augment shared embeddings with language specific indicators (factors)
    • One hot, learnable, direction specific
  • Language family shared embeddings
    • Separate shared embeddings for different families
    • Avoid accidental n-gram vocabulary overlaps

31

[Figure: the shared word embedding E_word(I) of an English token "I" is combined with a language embedding E_lang(en), either by concatenation (E_word : E_lang) or by addition (E_word + E_lang), before being fed to the encoder. A sketch follows below.]
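A minimal sketch of combining word and language embeddings (illustrative PyTorch; module names and dimensions are assumptions, not Wang et al.'s implementation):

```python
# Minimal sketch: augment shared word embeddings with a language indicator
# embedding, combined by concatenation or addition before the encoder.
import torch
import torch.nn as nn

class LanguageSensitiveEmbedding(nn.Module):
    def __init__(self, vocab_size, n_langs, d_model, mode="add"):
        super().__init__()
        self.mode = mode
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)
        # For "concat", project the concatenated vector back to d_model.
        self.proj = nn.Linear(2 * d_model, d_model) if mode == "concat" else None

    def forward(self, token_ids, lang_id):
        w = self.word_emb(token_ids)                       # (batch, seq, d_model)
        l = self.lang_emb(lang_id).unsqueeze(1).expand_as(w)
        if self.mode == "add":
            return w + l
        return self.proj(torch.cat([w, l], dim=-1))

emb = LanguageSensitiveEmbedding(vocab_size=1000, n_langs=4, d_model=64, mode="concat")
tokens = torch.randint(0, 1000, (2, 5))    # two sentences of five tokens each
langs = torch.tensor([0, 2])               # language id of each sentence
print(emb(tokens, langs).shape)            # torch.Size([2, 5, 64])
```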

32 of 156

Sharing In Encoders and Decoders (1)

  • Most works agree on a universal encoder
    • Dong et al. 2015; Sachan et al. 2018; Wang et al. 2019
    • Lack of cross-lingual components

  • Component sharing in decoders is tricky
    • Balance between universal and distinct representation (more on this later)
    • Cross-lingual components

32

33 of 156

Sharing In Encoders and Decoders (2)

  • Sachan et al. 2018 get optimal results with sharing attention and embedding
  • Distinct FF parameters for language specific adaptation
  • Reduces MNMT model sizes by 20-30% while slightly improving quality

33

34 of 156

Sharing In Encoders and Decoders (3)

  • Wang et al. 2019 focus on language sensitization
    • Sharing self-attention and FF parameters
    • Shared encoder/decoder →representor
      • Also in Dabre et al. 2019
  • Language sensitization strategies:
    • Language sensitive embeddings
    • Language pair specific cross-attention
    • Language discrimination
  • General improvement of 1-2 BLEU

34

35 of 156

A Note On Attention Sharing (Blackwood et al. 2018)

  • Universal shared attention
    • Firat et al. 2016
    • Too much load on a single set of parameters
  • Source specific attention
    • Reduces load but not too effective
  • Target specific attention
    • Works best (shows importance of unshared components in decoders)
    • In contrast to Sachan et al. 2018, where shared attention was optimal
  • Source-target specific attention
    • Least load but prevents zero-shot

35

36 of 156

Contextual Parameters (Platanios et al. 2018)

36

Generate parameters using parameters!

37 of 156

Contextual Parameters (Platanios et al. 2018)

  • Meta-parameterization
    • Parameters generate parameters: Θ = F(λ) (see the sketch below)
  • Can support parameter groups
    • Potential for self-discovering optimal language families
    • Related work on family specific sharing by Lyu et al. 2020
  • Significant model compression
    • Separate models: O(L*L*P + 2*L*L*W*V)
    • This model: O(P*M + L*W*V)
  • Significant improvement in translation quality
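A minimal sketch of the idea (illustrative PyTorch; the generator here produces a single linear layer's weights from a language embedding, a simplification of Platanios et al.'s contextual parameter generator):

```python
# Minimal sketch: a "contextual parameter generator" produces the weights of a
# layer from a language embedding, so parameters are shared through the
# generator rather than stored separately per language.
import torch
import torch.nn as nn

class ContextualLinear(nn.Module):
    """A linear layer whose weights are generated from a language embedding."""
    def __init__(self, n_langs, lang_dim, d_in, d_out):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.lang_emb = nn.Embedding(n_langs, lang_dim)
        # Generator: maps the language embedding to a flattened weight matrix + bias.
        self.generator = nn.Linear(lang_dim, d_in * d_out + d_out)

    def forward(self, x, lang_id):
        params = self.generator(self.lang_emb(lang_id))        # (d_in*d_out + d_out,)
        W = params[: self.d_in * self.d_out].view(self.d_out, self.d_in)
        b = params[self.d_in * self.d_out:]
        return torch.nn.functional.linear(x, W, b)

layer = ContextualLinear(n_langs=10, lang_dim=8, d_in=32, d_out=32)
x = torch.randn(4, 32)
print(layer(x, torch.tensor(3)).shape)   # torch.Size([4, 32])
```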

37

38 of 156

Contextual Parameters (Platanios et al., 2018)

  • Automatically learns language similarity
    • IT similar to RO
    • VI similar to FR?
  • Make groups of contextual parameters
    • Two step meta-learning
    • Step 1: Discover groups
    • Step 2: Optimal sharing
  • Next steps:
    • Conditional computation
    • Neural architecture search
    • Tensor routing (Zaremoodi et al. 2018)

38

39 of 156

Tensor routing (Zaremoodi et al. 2018)

  • Augmented encoder block
    • Language specific block (dotted)
    • Shared blocks (solid)
    • Routing controls relevance of blocks
  • Another perspective
    • Each block is an expert
    • Related to MoE and GShard works

39

40 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

40

41 of 156

Massively Multilingual Models

  • Google’s work
    • MNMT In the wild
    • Adaptor Layers
    • GShard
  • University of Edinburgh’s work
    • Reproducible
  • Bottlenecks
      • Hardware/computational/cost
      • Representational

41

42 of 156

Google’s Massively MNMT (Aharoni et al. 2019)

  • 59-lingual low-resource models (6-layer transformer base)
  • One to many models better than many-to-many for non-English targets
    • 2-3 BLEU improvement
    • Relative over-representation of English
  • Many to one models worse than many-to-many for English targets
    • 2-3 BLEU drop
    • Relative under-representation of English

42

43 of 156

Google’s Massively MNMT (Aharoni et al. 2019)

  • 103-lingual model (6-layer variation of transformer-big)
  • Many-to-one and one-to-many are both better than many-to-many
    • N-way corpora are detrimental to many-to-one performance
  • Supervised NMT quality degrades with more languages
  • Zero-shot NMT quality increases with more languages

43

Ukrainian to Russian Zero-Shot Translation

44 of 156

Google’s Massively MNMT Model In the Wild

  • Extension of Aharoni et al. 2019 by Arivazhagan et al. 2019
  • Main focus
    • Temperature based data sampling for transfer-interference balance
    • Pushing number of supported language pairs to limit
    • Pushing MNMT performance with 1+ billion parameters
  • Starting point: Quality is directly proportional to data size

44

45 of 156

More Languages and Directions

  • Generic drop in performance for resource rich pairs
  • Performance degrades as the number of pairs increases
    • 25 languages seems to be optimal
  • Separate one-to-all and all-to-one models preferred
    • All-to-all models degrade translation into English

45

46 of 156

Are Larger Models Helpful?

  • 400M parameters
    • Transformer Big:
      • 6-layers, 1024-4096 hidden-filter sizes
      • 16 attention heads
  • 1.3B parameters
    • Transformer Deep:
      • 24-layers, 1024-4096 hidden-filter sizes
      • 16 attention heads
      • Low-resource performance
    • Transformer Wide:
      • 12-layers, 2048-16384 hidden-filter sizes
      • 32 attention heads
      • High-resource performance
  • 128 TPUs and 4M tokens per batch ($$$)

46

47 of 156

Language Aware Multilingualism

  • Zhang et al. 2020
    • OPUS 100 dataset: 55M sentence pairs
  • Deep transformers: Up to 2 BLEU better using 12 layers instead of 6 layers
    • Depth scaled initialization: Better initialization to handle deep stacking
    • Merged attention: Self and cross attention side by side
  • Expressibility-Scalability: Improves by 2-3 BLEU
    • Target language aware layer normalization
    • Language aware linear transformation between Encoder and Decoder
  • Random online back-translation: Boosts zero-shot quality by up to 10 BLEU
    • Back-translate monolingual corpora and use it to train

47

48 of 156

Adapting Previously Trained Models

  • Feed forward (adapter) layers to refine outputs (see the sketch below)
    • Bapna et al. 2019
    • Partial solution to bottleneck
    • Language pair specific
      • Zero-shot performance?
  • 13.5% larger models
  • Improved high-resource pair performance
    • Low-resource performance kept
  • Questions:
    • Multiple adapters?
    • Dynamic adapter size?
    • Adapter-Base Model ratio in massively multilingual situation?
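A minimal sketch of such an adapter block (illustrative PyTorch; the dimensions are assumptions, not Bapna et al.'s configuration):

```python
# Minimal sketch: a language-pair-specific adapter inserted on top of a frozen
# pre-trained layer: layer-norm, down-projection, non-linearity, up-projection,
# and a residual connection. Only these few parameters are trained per pair.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden):
        # Residual adapter around the frozen base-model representation.
        return hidden + self.up(torch.relu(self.down(self.norm(hidden))))

adapter = Adapter()
hidden_states = torch.randn(2, 10, 512)     # stand-in for frozen base-model outputs
print(adapter(hidden_states).shape)         # torch.Size([2, 10, 512])
```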

48

49 of 156

Pushing Limits: Mixtures Of Experts (Gshard; Lepikhin et al. 2020)

49

Replace the feed-forward layer with a mixture-of-experts layer, sandwiched between attention layers

Explosive growth in parameters (600B) as the number of experts grows (2048).

Use model sharding and dynamic routing with large number of accelerators (2048 TPUs).

50 of 156

Engineering + Research = Ultimate Solution?

  • Advances in gating, tensor routing and load balancing to increase experts
    • Language agnostic work
  • Shallower models with large number of experts are recommended
    • Avoid deep modeling issues partially or completely
    • Save 10 times or more training time compared to inefficient 96L models
    • ~7 BLEU improvement globally over deepest model

50

51 of 156

Bottlenecks: Hardware, Cost and Energy Efficiency

  • Needs heavy hardware ($$$)
  • Most existing work by Google with TPUs
    • TPUs are faster than GPUs
    • Typical to use up to 100s of TPUs
    • Equivalent GPU setting not yet known
  • May not be possible to do this in a university setting :(((((

51

52 of 156

Bottlenecks: Hardware, Cost and Energy Efficiency

  • More devices and data = More training time (effective) = $$$$
    • Are the BLEU gains worth the mammoth models?
    • Do BLEU gains translate into human evaluation gains?
    • How do we deploy these models?
      • One model on a 2048 TPU cluster?
    • How about continual learning?
  • Future: Bigger models or better language aware neural modeling?
    • Model size (representational capacity) is no longer the bottleneck

52

53 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

53

54 of 156

Language Divergence

  • Vocabularies
    • Sufficient and fair representation for languages
  • Lessons from visualization and language families
    • Understanding what is learned where
  • Token based control
    • Augmenting input with features

54

55 of 156

On Vocabulary in MNMT

  • Core point: Fair vocabulary representation for all languages
  • Difficulty: Skew in data and hence word vocabulary
  • Key Points:
    • Use large monolingual corpora
    • Oversample smaller vocabularies
    • Temperature sampling
      • p_{V_L}^(1/T), where T is the sampling temperature and p_{V_L} is the probability of vocabulary item V for language L
    • Adjust vocabulary sizes (32K to 128K sub-word vocabularies work well)
    • Trade-offs
      • Sequence length, softmax computation time, enough sub-words per language

55

56 of 156

Key Observations In Practice

  • Larger vocabularies do not always bring proportionate improvements
    • Larger vocabularies = Shorter sequences
    • Larger vocabularies = Slower softmaxes
  • Do your cost-benefit analysis!
  • (Almost) equal representation is important
  • Moderate temperature sampling with fixed vocabulary size is crucial
  • Is T=5 a golden rule?
    • Arivazhagan et al.; Bapna et al. 2019

56

57 of 156

Visualization Of MNMT Representations

  • SVCCA similarity between representations (Kudugunta et al. 2019)
    • Also see Dabre et al. and Johnson et al. 2017
  • Encoder representations cluster sentences into language families
    • Regardless of script sharing
    • Script sharing for stronger clustering
  • High resource languages cause partition
    • Low-resource languages ride the wave
  • Evidence of representation invariance when fine-tuning
    • Explains poor zero shot quality between distant pairs

57

58 of 156

Representation Similarity Evolution

  • Representation similarity varies with depth
    • Shallower layers cluster by script
    • Deeper layers cluster by family
  • Case study: Turkic and Slavic
    • Some use Roman script
      • Turkish, Uzbek and Azerbaijani
    • Some use Cyrillic script
      • Kyrgyz and Kazakh
    • First encoder layer separates by script and last layer closes the gap
      • Distinguishes them from Slavic languages
      • Slavic languages use Roman or Cyrillic scripts

58

59 of 156

Representation Evolution With Depth

For many-to-English: encoder representations converge and decoder representations diverge.
For English-to-many: encoder representations of English diverge based on the target language.
Question: Is divergence a good thing? Does it cause a learning overhead? Should it?

59

60 of 156

Empirically Determined Language Families

  • Train many-to-many model with language tokens
  • Hierarchical clustering of tokens
    • Set number of clusters by elbow-sampling
  • Tan et al. 2019
  • Also see Oncevay et al. 2020

60

Predetermined language families

Empirically determined language families via embedding clustering

61 of 156

Is There An Optimal Number of Languages?

  • Does empirical clustering help? (Upper table)
    • Mostly yes
    • Random clustering gives poorer results
    • Predetermined clustering is equally good
  • Language family specific models (Bottom table)
    • Universal model < Individual models
    • Family specific model > Individual models
    • Related to observations by Dabre et al. 2017 and 2018
  • Next steps
    • Family specific adaptor layers (Bapna et al. 2019)
    • Family specific vocabulary and decoder separation
    • Behavior in extremely low-resource settings (<20k pairs; Dabre et al, 2019)

61

62 of 156

On Language Tags: Embeddings vs Features

62

Johnson et al. 2017: prime the encoder using a <2xx> token prepended to the source; <2xx> is a single token with its own embedding (black-box approach).

  Example: "<2ja> I am a boy" → "私は男の子です" (train NMT)

Ha et al. 2016: distinguish between shared vocabulary units; the encoder is primed with source as well as target language information.

  Example: "ja I@en am@en a@en boy@en ja" → "私は男の子です" (train NMT)

Blackwood et al. 2019: use language tokens at the beginning and end of the source.

63 of 156

On Language Tags: Embeddings vs Features

63

Ha et al. 2017 and Hokamp et al. 2019: keep word embeddings independent of the task (target language) via features.

[Figure: train factored NMT on "I am a boy" → "私は男の子です"; the word embedding E_word(I) is combined with a language feature embedding E_feature(en/ja) by concatenation or addition before the encoder.]

64 of 156

Topics to address

  • Parameter sharing

  • Massively multilingual models

  • Language divergence

  • Training protocols

64

65 of 156

Training Protocols

  • MNMT training fundamentals
  • Training schedules
    • Batching strategies
    • Importance of sampling
  • Leveraging bilingual models
    • Distillation
  • Model expansion and incremental learning

65

66 of 156

On MNMT Training

  • Fundamentally the same as standard NMT: minimizing the negative log-likelihood, summed over all language pairs

Where the individual language-pair negative log-likelihood is the usual cross-entropy over its parallel corpus (a sketch of the objective is given below)

  • Challenges:
    • Good training schedule
    • Language equality
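A standard way to write this objective (a reconstruction consistent with the description above; the notation is mine, not copied from the slides):

```latex
% P: set of language pairs; C_{s,t}: parallel corpus for pair (s,t); theta: shared model parameters.
\begin{align}
  \mathcal{L}(\theta) &= \sum_{(s,t) \in \mathcal{P}} \mathcal{L}_{s,t}(\theta), \\
  \mathcal{L}_{s,t}(\theta) &= - \sum_{(\mathbf{x},\mathbf{y}) \in C_{s,t}} \log p(\mathbf{y} \mid \mathbf{x}; \theta).
\end{align}
```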

66

67 of 156

Training Schedule: Joint Training

  • One pair at a time (Firat et al. 2016, Dong et al. 2015)
    • Cycle through corpora (L1-L2 → L3-L4 → L5-L6 → ... → L1-L2)
  • Useful for models with separate encoders and/or decoders
  • Potential forgetting of language pair information in a previous batch
    • Catastrophic forgetting!

67

68 of 156

Training Schedule: Joint Training

  • Mixed language pairs batch (Johnson et al 2017)
    • Mix all corpora, shuffle and then choose batches
  • Useful for fully shared models
  • For models with separate language encoders/decoders
    • Shard batch and feed to appropriate components

68

69 of 156

Training Schedule: Addressing Language Equality

  • Source of inequality: Corpora size skew
  • Solutions: Oversampling smaller corpora
  • Oversampling before training or during training?
    • Matter of implementation choice
    • Oversampling prior to training creates large duplicated corpora

69

70 of 156

Importance Of Temperature Based Sampling (Arivazhagan et al. 2019)

  • Naive approaches:
    • Ignore corpora size distributions
    • Sample from all corpora equally
  • New approach: Temperature based sampling, p_L^(1/T)
    • Where p_L is the probability of sampling a sentence from the corpus of language pair L
    • T is the sampling temperature
    • Strongly benefits low-resource pairs (a sketch follows below)
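A minimal sketch of the smoothing above (illustrative; the corpus sizes are made up, and T=5 is the value discussed later in the deck):

```python
# Minimal sketch: temperature-based sampling over corpora.
# T=1 follows corpus sizes; large T approaches uniform sampling, boosting
# low-resource pairs.

def sampling_probs(corpus_sizes, temperature=5.0):
    total = sum(corpus_sizes.values())
    # p_L: empirical probability of drawing a sentence from language pair L.
    p = {pair: size / total for pair, size in corpus_sizes.items()}
    smoothed = {pair: prob ** (1.0 / temperature) for pair, prob in p.items()}
    z = sum(smoothed.values())
    return {pair: v / z for pair, v in smoothed.items()}

corpus_sizes = {"en-fr": 40_000_000, "en-hi": 1_500_000, "en-ne": 50_000}
for t in (1.0, 5.0, 100.0):
    print(t, sampling_probs(corpus_sizes, t))
```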

70

71 of 156

Leveraging Bilingual Models

  • Training from scratch is challenging
    • Multitude of pairs
    • Complexity of task
    • Interference of languages
  • Solution: Leverage bilingual models or smaller multilingual models
    • Key principle: Transfer learning via sequence knowledge distillation (Kim et al. 2016)

71

[Figure: sequence-level knowledge distillation (Kim et al. 2016): train a wide and/or deep teacher model on L1→L2, decode the L1 training data with it to obtain distillation data L1→L'2, then train a narrow and/or shallow student model on that data.]

72 of 156

Distillation For MNMT Training (Tan et al. 2019)

72

[Figure: for each pair L1→L2, L3→L4, ..., Lm→Ln, a bilingual teacher decodes its own source-side training data to produce distilled targets (L'2, L'4, ..., L'n) and smoothed labels; the pooled smoothed data and labels are used to train a single MNMT student model. A pipeline sketch follows below.]
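A minimal end-to-end sketch of this distillation pipeline (illustrative; `translate` stands in for beam-search decoding with a trained bilingual teacher, and the data are toy strings):

```python
# Minimal sketch of sequence-level knowledge distillation for MNMT: each
# bilingual teacher decodes its own training sources, and the resulting
# (source, teacher-translation) pairs are pooled to train one multilingual student.

def distill_corpus(teachers, corpora):
    """teachers: {(src, tgt): translate_fn}; corpora: {(src, tgt): [source sentences]}."""
    distilled = []
    for pair, sources in corpora.items():
        translate = teachers[pair]
        for x in sources:
            distilled.append((pair, x, translate(x)))   # teacher output as the training target
    return distilled

# Toy stand-ins for trained bilingual teacher models.
teachers = {
    ("en", "fr"): lambda s: "[fr] " + s,
    ("en", "de"): lambda s: "[de] " + s,
}
corpora = {("en", "fr"): ["hello world"], ("en", "de"): ["good morning"]}
print(distill_corpus(teachers, corpora))
```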

73 of 156

Incremental Training

  • Objective: Expand the capacity of existing models with maximum reusability
  • Promising approaches
    • Vocabulary expansion
      • Surafel et al. 2018
    • Gradual capacity expansion
      • Escolano et al. 2019
    • Adaptor layers (experts) on top of pre-trained models
      • Bapna et al. 2019 (already discussed)

73

74 of 156

Incorporating New Languages (Surafel et al. 2018)

  • Language specific transfer
    • Replace vocabulary
    • Fine-tune on new data
  • Similar to Zoph et al. 2016
  • Expanding to new languages
    • Expand vocabulary
    • Fine-tune on old+new data
  • Increase computational capacity?

74

75 of 156

Capacity Expansion Of Existing Models (Escolano et al. 2019)

  • Add new components while freezing existing components
    • Lightweight training BUT
    • Previous components may not be aware of new languages
      • Poor transfer learning
    • Potential zero-shot learning
      • Will it work for distant languages?

  • Lessons from Sachan et al. 2018; Firat et al. 2016a/b; Bapna et al. 2019
    • Deepen encoders and decoders
    • Only train new components with old and/or new data
      • Vocabulary expansion by Surafel et al. 2018 will help

75

76 of 156

Correlated Topics Not Well Addressed Yet

  • Diverse language learning rates
    • Language pairs are learned at different rates
  • Jean et al. 2019 set task weights using model performance
    • Weights decide sampling (task importance)
  • Lessons from Wang et al. 2018
    • Adding or removing instances from training set

76

77 of 156

Correlated Topics Not Well Addressed Yet

  • Catastrophic forgetting
    • Thompson et al. 2019 on domain adaptation
    • Addressing forgetting for incremental learning?
  • MNMT model convergence
    • Currently report average performance over all pairs
      • I’m looking at you GShard
    • Ignoring individual pair’s performance is unwise
    • Weighted performance metric to the rescue?
    • Better evaluation metrics for MNMT settings?

77

78 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

78

79 of 156

Self Introduction (Anoop Kunchukuttan)

  • Experience
    • 2018-present: Senior Applied Researcher, MT Group, Microsoft India
    • 2012-2017: Ph.D. scholar at IIT Bombay, India
    • 2008-2011: ML/NLP Lead, Life Sciences Group, Persistent Systems
    • 2006-2008: M.Tech. IIT Bombay, India
  • Research Interests
    • Multilingual NLP
    • Machine Translation, Transliteration
    • Representation Learning for NLP
    • Indian language NLP

79

80 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

80

81 of 156

81

Translation for low-resource languages

A large skew in parallel corpus availability

Long tail of low-resource languages

Difficult to obtain corpora for many languages

Can high-resource languages help low-resource languages?

(Arivazhagan et al., 2019b)

82 of 156

82

Transfer learning (Pan & Yang, 2010): storing knowledge gained while solving one problem and applying it to a different but related problem.

[Figure: a parent model trained on a high-resource pair corpus passes knowledge (concepts, domain knowledge, grammar, source-to-target transformation) to a child model trained on a low-resource pair corpus.]

Transfer learning scenarios: many-to-one translation (M2O) and one-to-many translation (O2M)

83 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

83

84 of 156

84

Joint Training

(Ha et al., 2016; Johnson et al., 2017)

Oversample the child / undersample the parent corpus to balance training

Low-resource language performance:

Many-to-one direction ⇒ major gains

One-to-many direction ⇒ minor gains

85 of 156

85

Fine-tuning

  • Fine-tuning vs. joint training
    • Fine-tuning is better in the O2M setting, and vice-versa (joint training is better in M2O)
  • While fine-tuning, a specific set of parameters can be tuned
    • Many-to-one ⇒ encoder bottom layers
    • One-to-many ⇒ decoder top layers

(Zoph et al., 2016; Tan et al., 2019b)

86 of 156

86

Small low-resource corpus ⇒ overfitting might occur ⇒ use mixed fine-tuning

[Figure: mixed fine-tuning]

(Dabre et al., 2019)

87 of 156

87

Transfer from multiple parents

Pre-train a multilingual NMT model on a representative set of high-resource languages

Useful for rapid-adaptation to new languages

(Neubig and Hu, 2018)

88 of 156

88

Transfer to multiple children

Useful for O2M setting, where single-stage fine-tuning is not very beneficial (10% improvement in BLEU score)

Multi-linguality in the mixed-fine tune stage aids translation

(Dabre et al., 2019)

89 of 156

89

What is the objective of the parent-model training?

  • Optimize performance on parent tasks?
  • Optimize performance on child tasks?
  • Enable few-shot learning?

90 of 156

90

Meta-learning

Learning to learn

Learn an initialization from which only a few examples are required to learn the child-task

(Finn et al., 2017)

91 of 156

91

Model-Agnostic Meta-Learning for MNMT (Gu et al., 2018b):

  1. Sample language pairs LP_1 ... LP_i
  2. For each pair, sample training examples and simulate a training step: θ → θ_i
  3. Sample test examples and compute the test error Err_i using θ_i
  4. Average the test errors across language pairs and compute the meta-gradient
  5. Update the original parameters: θ = θ_new

(a toy sketch follows below)

Outperforms multilingual fine-tuning strategy on unseen pairs

Requires far fewer adaptation steps for comparable performance
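A toy first-order sketch of the procedure above (illustrative only: a linear-regression stand-in for "language pairs" and first-order meta-gradients, not Gu et al.'s full NMT setup):

```python
# Minimal first-order MAML sketch: learn an initialization theta from which a
# few inner steps adapt well to each simulated task ("language pair").
import torch

def task_batch(weight, n=16):
    """Simulate one 'language pair': data generated by a task-specific linear map."""
    x = torch.randn(n, 4)
    return x, x @ weight

theta = torch.zeros(4, requires_grad=True)           # shared initialization
meta_opt = torch.optim.SGD([theta], lr=0.1)
inner_lr = 0.05
tasks = [torch.randn(4) for _ in range(8)]           # one random regression task per "pair"

for step in range(100):
    meta_opt.zero_grad()
    for w in tasks:
        # Inner step: simulate training on a few examples of this task.
        x_tr, y_tr = task_batch(w)
        loss_tr = ((x_tr @ theta - y_tr) ** 2).mean()
        grad = torch.autograd.grad(loss_tr, theta)[0]
        theta_i = theta - inner_lr * grad            # task-adapted parameters
        # Outer step: evaluate adapted parameters on held-out examples of the task.
        x_te, y_te = task_batch(w)
        loss_te = ((x_te @ theta_i - y_te) ** 2).mean()
        loss_te.backward()                           # accumulates the meta-gradient into theta.grad
    meta_opt.step()                                  # theta <- theta_new
print(theta)
```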

92 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

92

93 of 156

93

Lexical Transfer

How do we initialize child token embeddings?

Initialize child token embeddings prior to fine-tuning (Zoph et al., 2016):

  • Random assignment of parent embeddings to child vocabulary items
  • Dictionary initialization (using a bilingual dictionary)

Dictionary initialization is faster than random assignment, but random assignment eventually achieves comparable performance.

94 of 156

94

Use bilingual embeddings to bring parent and child embeddings into a common space (Gu et al., 2018a; Kim et al., 2019a):

  • Map parent and child embeddings to a common space
  • Map child embeddings to parent embeddings

Linear mapping functions can be learnt using small bilingual dictionaries.

[Figure: original child embeddings are transformed into modified child embeddings that lie close to the parent embedding space.]

Significant improvements over random assignment of embeddings (a mapping sketch follows below)
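A minimal sketch of learning such a mapping (illustrative; an orthogonal Procrustes solution on synthetic vectors, not Kim et al.'s or Gu et al.'s exact method):

```python
# Minimal sketch: learn a linear map from child-language embeddings to the
# parent embedding space from a small bilingual dictionary, then use the mapped
# child embeddings to initialize the child model before fine-tuning.
import numpy as np

def learn_orthogonal_map(child_vecs, parent_vecs):
    """Orthogonal Procrustes: W = argmin ||child @ W - parent|| with W orthogonal."""
    u, _, vt = np.linalg.svd(child_vecs.T @ parent_vecs)
    return u @ vt

rng = np.random.default_rng(0)
d = 64
true_rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
parent_dict_vecs = rng.normal(size=(500, d))              # parent-side embeddings of dictionary entries
child_dict_vecs = parent_dict_vecs @ true_rotation.T      # child side, here a synthetic rotation

W = learn_orthogonal_map(child_dict_vecs, parent_dict_vecs)
mapped = child_dict_vecs @ W
print(np.allclose(mapped, parent_dict_vecs, atol=1e-6))   # True: the mapping is recovered
```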

95 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

95

96 of 156

96

Introduce noise in parent-source sentences

Prevents over-optimization to parent-source language

Noising schemes

Word insertion

Word deletion

Word Swapping

(Kim et al., 2019a)

Simple methods that give modest improvements over baseline fine-tuning

What if parent and child have different word orders?

97 of 156

97

Reorder parent-source sentence to match child-source sentence

Ensures better alignment of encoder contextual embeddings

(Murthy et al., 2019)

Significant improvements over baseline finetuning, but needs a parser and re-ordering system

98 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

98

99 of 156

99

100 of 156

100

Key Similarities between related languages

(Kunchukuttan & Bhattacharyya., 2020)

101 of 156

101

(Kudugunta et al., 2019)

Transfer Learning works best for related languages

Encoder Representations cluster by language family

(Zoph et al., 2016; Dabre et al, 2017b)

Contact relationship is also captured

Related languages using different scripts also cluster together

102 of 156

102

Utilizing lexical similarity

Subword-level vocabulary improves transfer

Improves parent-child vocabulary overlap encouraging better transfer

child     subword   BLEU score            parent-child overlap (%)
                    baseline   transfer   train    dev
tur-eng   word      8.1        8.5        3.9      3.6
tur-eng   BPE       12.4       13.2       58.8     25.0
uyg-eng   word      8.5        10.6       0.5      1.7
uyg-eng   BPE       11.1       15.4       57.2     48.5

(uzb-eng is the parent language pair)

(Nguyen and Chiang, 2017)

103 of 156

103

Transfer between related languages using different scripts works well

  • Similar scripts ⇒ script conversion: Slavic (Cyrillic & Roman), Indic (Brahmi) (Lee et al., 2016; Dabre et al., 2018; Goyal et al., 2020)
  • Very different scripts ⇒ romanization (unconv, uroman): Turkic; Hindi, Urdu (Gheini & May, 2019; Amrhein & Sennrich, 2020)

Transfer can also be done between languages related by contact (Goyal et al. 2020)

Dravidian and Indo-Aryan languages form a linguistic area in the Indian subcontinent (Emeneau, 1956)

Transfer works without script conversion → but script conversion provides improvements

104 of 156

104

Similar language regularization

Very small low resource language ⇒ Overfitting on finetuning

Pre-train model on multiple languages

Concatenate HRL and LRL corpus

Fine-tune jointly

Joint fine-tuning outperforms just target language fine-tuning

Pre-training with multiple languages is better than single language

(Neubig & Hu, 2018; Chaudhary et al., 2019)

A similar idea is also used for knowledge distillation

(Dabre & Fujita, 2020)

105 of 156

105

Reorder parent-source sentence to match child-source sentence

Significant improvements over baseline finetuning, but needs a parser and re-ordering system

(Murthy et al., 2019)

Reordering rules can be reused if the child-source languages have the same word order

Utilizing Syntactic Similarity

106 of 156

106

Parent Data Selection

Which examples in the parent language-pair are most helpful for transfer?

(Wang et al., 2019)

Let us look at the case of many-to-one translation.

[Figure: several parent-source sentences s_1, s_2, ..., s_5 paired with a target sentence t]

(s_i, t) is a parent sentence pair from the high-resource pair (H, E)

Score s_i by the probability that it belongs to the low-resource source language L (a scoring sketch follows below)

Scoring Functions

  • score(s_i, L) ∝ vocabulary overlap of s_i with L
  • score(s_i, L) ∝ P_{LM_L}(s_i), the probability of s_i under a language model of L

Sample examples using this score

  • Can be extended to multiple parent languages
  • Can be extended to language-level similarity score
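A minimal sketch of the vocabulary-overlap scoring above (illustrative; the tokenization and the toy data are my own):

```python
# Minimal sketch of parent data selection by vocabulary overlap: score each
# parent-source sentence by its token overlap with the low-resource language,
# then keep the highest-scoring parent sentence pairs for transfer.

def overlap_score(sentence: str, lrl_vocab: set) -> float:
    tokens = sentence.split()
    return sum(tok in lrl_vocab for tok in tokens) / max(len(tokens), 1)

def select_parent_data(parent_pairs, lrl_vocab, keep_ratio=0.5):
    scored = sorted(parent_pairs, key=lambda p: overlap_score(p[0], lrl_vocab), reverse=True)
    return scored[: int(len(scored) * keep_ratio)]

# Toy example: the "low-resource language" shares some tokens with the parent source side.
lrl_vocab = {"ndi", "lo", "umfana"}
parent_pairs = [("ngi ndi lo umfana", "I am a boy"), ("completely unrelated words", "x y z")]
print(select_parent_data(parent_pairs, lrl_vocab, keep_ratio=0.5))
```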

107 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

107

108 of 156

108

[Figure: translating Hindi (hi) to Malayalam (ml) without hi-ml parallel data, e.g. "aaja bahuta ThaNDa hai" (hi) / "It is very cold today" (en) / "inn vaLar.e taNuppAN" (ml).

Pivot translation: hi → en with a hi-en model, then en → ml with an en-ml model; multiple decoding steps, multiple translation systems.

Zero-shot translation: hi → ml directly with a many-to-many model; single decoding step, single translation system (Johnson et al., 2016).

Zero-resource translation: the many-to-many model is trained with the unseen pair in mind (Firat et al., 2016b).]

109 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

109

110 of 156

110

[Figure: bilingual pivot vs. multilingual pivot for hi → ml via en.
Bilingual pivot: "aaja bahuta ThaNDa hai" (hi) → "It is very cold today" (en) with a hi-en model, then → "inn vaLar.e taNuppAN" (ml) with an en-ml model.
Multilingual pivot: both decoding steps are performed with a single many-to-many model.]

Multilingual pivot generally outperforms bilingual pivot (Firat et al., 2016)

Pivot translation is a strong baseline (Johnson et al., 2016)

Limitations:

  • Cascading errors (can be reduced by using n-best translations)
  • Decode time (a function of path length)

111 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

111

112 of 156

112

Naive zero-shot performance significantly lags behind pivot translation

              de→fr    fr→de
Pivot         19.71    24.33
Zero-shot     19.22    21.63

Output is generated in the wrong language

Code-mixing is rare

Once a wrong language token is generated, all subsequent tokens are generated in that language

  • Spurious correlation between source input & target language
  • Model always associates English output with any input
  • Copying behaviour

              de→fr    fr→de
Pivot         26.25    20.18
Zero-shot     16.80    12.03

Performance gap reduces when evaluation is restricted to correct-language output

(Results from Arivazhagan et al., 2019a)

(Arivazhagan et al., 2019a; Gu et al., 2019)

113 of 156

113

Restrict the decoder to output only target-language vocabulary items (Ha et al., 2017)

Prevents generation of the wrong language; performance still lags the pivot baseline

                 German-Dutch    German-Romanian
baseline         14.95           10.83
+vocab filter    16.02           11.00

Language-specific subword model learning (Rios et al., 2020)

Vocabulary construction and control reduces copying behaviour and bias towards English output

Joint vocab                 15.4
Language-specific vocab     20.5
  • overlapping model vocab

114 of 156

114

Minimize divergence between encoder representations

Loss = CE + Ω(z_x, z_y)

Ω(z_x, z_y): a distance between the encoder representations of the source (x) and the target (y); see the sketch below

Supervised Objectives

  • Cosine distance (Arivazhagan et al., 2019a)
  • Euclidean distance (Pham et al., 2019)
  • Correlation distance (Saha et al., 2016)

Unsupervised Distribution Matching

(Arivazhagan et al., 2019a)

Use a domain-adversarial loss

Competitive with pivot and improves over baseline MNMT
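A minimal sketch of such an auxiliary alignment term (illustrative PyTorch; the mean pooling and cosine form follow the supervised-objective bullet above, everything else is an assumption):

```python
# Minimal sketch: add a representation-alignment term to the usual cross-entropy
# loss, here the cosine distance between mean-pooled encoder states of a source
# sentence and its target-side equivalent.
import torch
import torch.nn.functional as F

def alignment_loss(enc_src, enc_tgt, src_mask, tgt_mask):
    """enc_*: (batch, seq, d); *_mask: (batch, seq) with 1 for real tokens."""
    def mean_pool(h, m):
        m = m.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)
    z_src, z_tgt = mean_pool(enc_src, src_mask), mean_pool(enc_tgt, tgt_mask)
    return (1.0 - F.cosine_similarity(z_src, z_tgt, dim=-1)).mean()

# total_loss = cross_entropy + lambda_align * alignment_loss(...)
enc_src, enc_tgt = torch.randn(2, 7, 512), torch.randn(2, 9, 512)
src_mask, tgt_mask = torch.ones(2, 7), torch.ones(2, 9)
print(alignment_loss(enc_src, enc_tgt, src_mask, tgt_mask))
```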

115 of 156

115

Avoid pooling to generate encoder representations

(Pham et al., 2019)

Encoder output is variable length

Decoder receives fixed input at every timestep

Minimize divergence between auto-encoded and true target at different points in the decoder

Attention Forcing at bottom decoder layer

Improvements over directly minimizing encoder divergence

116 of 156

116

Encourage output agreement

Encourage equivalent sentence in two languages to generate similar output in an auxiliary language

Decode to language L using both inputs s and t

Score output of one input using other input

Add a loss term to force output agreement

(Al-Shedivat and Parikh, 2019)

Competitive with pivot and no-loss in supervised directions

117 of 156

117

Effect of number of languages & corpus size

Zeroshot performance improves with the number of languages

(Aharoni et al., 2019; Arivazhagan et al., 2019b)

(Aharoni et al., 2019)

(Arivazhagan et al., 2019b)

Zero-shot translation may work well only when the multilingual parallel corpora are large

(Mattoni et al., 2017; Lakew et al., 2017)

118 of 156

118

Can monolingual pre-training help zero-shot translation?

[Figure: mBART pipeline (Liu et al., 2020). mBART training: pre-train on monolingual corpora of many languages (hi, ne, en, ...). Fine-tune with a related parallel corpus: the hi-en parallel corpus yields a hi-en translation model. Decode for a related language: a ne test sentence is translated into en output.]

119 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

119

120 of 156

120

Zero-resource Translation

Zero-shot and zero-resource translation: no parallel corpus between the unseen languages

Zero-shot → no specific training for unseen language pairs

Zero-resource → training takes into account the unseen language pairs of interest,

e.g. using a synthetic parallel corpus

Zero-resource NMT can be used to tune the NMT model for some unseen language pairs of interest

(Firat et al., 2016b)

121 of 156

121

Creating Synthetic Parallel Corpus

Expose M2M model to zeroshot directions

Create synthetic-source to real-target parallel data: { (X'_1, Y_1), (X'_2, Y_2), (X'_3, Y_3), (X'_4, Y_4) }

  • Via back-translation using the M2M model: Y → E → X'
  • Via a pivot language: Y' ← E → X'

(Firat et al., 2016b; Lakew et al., 2017; Gu et al., 2019; Currey & Heafield, 2019)

122 of 156

122

Iterative Refinement

Backtranslation quality depends on quality of underlying translation models

[Figure: iterative refinement loop (Lakew et al., 2017): train an MNMT model on the original parallel data, back-translate to produce backtranslated data, augment the original data with it, and retrain.]

Iterative reinforcement learning approaches which reward original translation directions based on language modelling and reconstruction losses in zeroshot directions (Sestorain et al., 2018)

123 of 156

123

Scaling BT to multiple translation directions

Expensive to generate BT data for O(n^2) language pairs

Random Online backtranslation

(Zhang et al., 2020)

For every real sentence pair (x, y) in language pair (s, t):

  • Sample a new source language s' for target t
  • Generate a back-translated pair (x', y) for the pair (s', t)
  • Add the back-translated pairs to the minibatch (see the sketch below)

Only doubles the effective corpus size

Results approach pivot baseline
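A minimal sketch of the sampling loop above (illustrative; `backtranslate` stands in for decoding with the current multilingual model):

```python
# Minimal sketch of random online back-translation (ROBT): for each real pair
# (x, y) with target language t, sample a random language s', back-translate y
# into s' with the current model, and add the synthetic pair (x', y) for the
# direction s'->t to the same minibatch.
import random

def robt_minibatch(batch, languages, backtranslate):
    """batch: list of (x, y, src_lang, tgt_lang); backtranslate(y, tgt_lang, new_src_lang) -> x'."""
    augmented = list(batch)
    for x, y, s, t in batch:
        s_new = random.choice([l for l in languages if l != t])
        x_synth = backtranslate(y, t, s_new)        # decode y into s_new with the current model
        augmented.append((x_synth, y, s_new, t))    # only doubles the effective batch size
    return augmented

# Toy stand-in for decoding with the current model.
fake_bt = lambda y, t, s_new: f"<{s_new}> {y}"
batch = [("I am a boy", "मी मुलगा आहे", "en", "mr")]
print(robt_minibatch(batch, ["en", "mr", "it", "fr"], fake_bt))
```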

124 of 156

124

Teacher-student training (Chen et al., 2017)

[Figure: source (S), pivot (P), target (T); minimize the divergence between the student's S→T distribution and the teacher's P→T distribution.]

Assumption: for a source-pivot parallel pair, the distributions P(target | source) and P(target | pivot) are similar.

125 of 156

125

Given: a source-pivot and a pivot-target parallel corpus

[Figure: the pivot-target (teacher) model is trained with MLE on the pivot-target parallel corpus; the source-target (student) model is trained via teacher-student training using the source-pivot parallel corpus.]

Teacher-student training approaches:

  • Sentence-level matching
  • Token-level matching

126 of 156

Section Overview

  • Transfer Learning for MNMT
    • Training Methods
    • Lexical Transfer
    • Syntactic Transfer
    • Language Relatedness
  • Translation between unseen language pairs
    • Pivot Translation
    • Zeroshot Translation
    • Zero-resource Translation
    • Multibridge MNMT

126

127 of 156

127

Can’t we simply add direct parallel corpora between non-English languages?

How do we acquire such parallel data?

How do we address data imbalance?

Single-bridge vs. Multi-bridge systems

128 of 156

128

Mine parallel corpus from monolingual corpora

Expensive to mine from all 100 x 99 pairs

Which are the most promising language pairs to mine from?

Cluster languages and select bridge languages

(Fan et al., 2020)

129 of 156

129

Extraction from English-centric parallel corpora

(Freitag & Firat, 2020; Rios et al., 2020)

130 of 156

130

Sampling strategies used for English-centric datasets have limitations

  • Large sample space → quadratic number of pairs to sample
  • Pairwise sampling will be biased in favour of instances containing English

Just ~15% sentence pairs exclusively non-English

Sampling independently on source and target marginal is also biased.

Solution 1: (Freitag & Firat, 2020)

  1. Temperature-based sampling of target language first using marginal distribution of target languages
  2. Sample source language uniformly

Solution 2: (Fan et al., 2020)

Sample from probability matrix while ensuring source and target marginals follow a temperature based schedule

Data Sampling Strategies

These approaches improve over standard temperature-based sampling for non-English directions

131 of 156

131

  • Significant improvement in translation quality over pivot baseline for:
    • Newly trained directions
    • Zero-shot directions
  • Multi-bridge translation does not adversely impact English-centric directions
  • Multi-bridge translation and synthetic data augmentation provide complementary benefits

Major Results

                        Fan et al., 2020          Rios et al., 2020
Language Pairs →        New Train   Unseen        New Train   Unseen (zero-shot)
Single-bridge           5.4         7.6           20.0        20.9
Single-bridge pivot     9.8         12.4          22.9        23.7
Multi-bridge            12.3        18.5          25.1        24.0
  + synth-data                                                25.2

Word of caution: Human evaluation gains are not as significant (Fan et al., 2020)

132 of 156

Section Summary

MNMT has helped make significant advances in low-resource MT

  • Novel methods for transfer learning and zeroshot translation

Transfer Learning

  • Optimize the right objective for improved transfer
  • Language-relatedness plays a key role in successful transfer
  • Transfer works better in M2O setting than O2M setting
  • Lexical transfer is easier to achieve than syntactic transfer

Translation between unseen languages

  • Pivot translation is a strong baseline
  • Zeroshot ⇒ spurious correlation between input representation & output language
    • Reducing divergence between internal representations of different languages
    • Multi-bridge systems
  • Zero-resource translation can use synthetic data to reduce spurious correlations

132

133 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

133

134 of 156

Why Multi-source MT?

134

If a source sentence has already been translated into multiple languages, then these sentences can be used together to improve the translation into the target language. Such multi-parallel data arises, for example, in the EU and the UN.

Figures from (Nishimura et al., 2018)

135 of 156

Multi-Source Available: Multi-source Encoder (Zoph et al., 2016)

135

136 of 156

Results of Multi-source Encoder (Zoph et al., 2016)

136

137 of 156

Multi-Source Available: Ensembling (Garmash et al., 2016)

137

138 of 156

Results of Ensembling (Garmash et al., 2016)

138

139 of 156

Multi-Source Available: Concatenation (Dabre et al., 2017)

139

[Figure: pipeline: multilingual (N-way) corpus → word segmentation (BPE) → concatenate source sentences → train NMT model (early stop on best dev BLEU) → model.

N-way corpus:
  Hello ||| Bonjour ||| नमस्कर ||| Kamusta ||| Hallo ||| こんにちは
  I ||| Je ||| मी ||| ako ||| ech ||| 私

Multi-source corpus (concatenated sources ||| target):
  Hello Bonjour नमस्कर Kamusta Hallo ||| こんにちは
  I Je मी ako ech ||| 私]

140 of 156

Results of Concatenation (Dabre et al., 2017)

140

* Concatenation (bold)/Ensembling/Multi-source Encoder

141 of 156

Missing Source Sentences (Nishimura et al., 2018)

141

Scenario:

Methods:

142 of 156

Results of Missing Source Sentences (Nishimura et al., 2018)

142

143 of 156

Summary of Multi-source Approaches

143

144 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

144

145 of 156

Multiway Datasets

145

Languages              Corpora
European languages     Europarl, JRC-Acquis, DGT-Acquis, DGT-TM, ECDC-TM, EAC-TM, etc.
Asian languages        WAT shared tasks, etc.
Indic languages        CVIT-PIB/MKB, PMIndia, IndoWordNet, etc.
Massive                WikiMatrix, JW300, etc.
Others                 UN, TED, OpenSubtitles, etc.

Refer to catalogs like OPUS and the IndicNLP catalog for comprehensive listings of parallel corpora resources.

146 of 156

Low or Zero-Resource Multiway Datasets

146

Corpus                    Domain       Languages
FLORES                    Wikipedia    English, Nepali, Sinhala
XNLI                      Caption      15 languages
CVIT-Mann ki Baat/PIB     General      10 Indian languages
Indic parallel corpus     Wikipedia    6 Indian languages
WMT shared tasks          Web          German, Upper Sorbian

+All the multiway datasets listed in the previous slide can be used for testing

147 of 156

Multi-source Datasets

147

Corpus      N-way    Domain           Languages
Europarl    11       Politics         European languages
TED         5        Spoken           French, German, Czech, Arabic and English
UN          6        Politics         Arabic, Chinese, English, French, Russian and Spanish
ILCI        11       Tourism/health   Indian languages + English
ALT         9        News             South-East Asian languages + English, Japanese
Bible       1,000    Religion         Most major languages

148 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

148

149 of 156

Exploring Pre-trained Models

  • Pre-training embeddings, encoders and decoders have been shown to be useful for NMT
  • But
    • How can pre-training be incorporated into different MNMT architectures?
    • How can the impact of transfer be maximized during fine-tuning?
  • Unsupervised pre-training and unsupervised NMT might be worth investigating

149

150 of 156

Unseen Language Pair Translation

  • Previous work on unseen language pair translation has only addressed cases where the pivot language is related to or shares the same script with the source language
  • The pivot language (mostly English) is unlikely to be related to the source and target languages and this scenario requires further investigation (especially for zero-shot translation).
  • New approaches need to be explored to significantly improve over the simple pivot baseline.

150

151 of 156

Joint Multilingual and Multi-Domain NMT

  • When extending an NMT system to a new language, the parallel corpus in the domain of interest may not be available
  • Transfer learning in this case has to span languages and domains
  • It might be worthwhile to explore adversarial approaches where domain and language invariant representations can be learned for the best translations

151

152 of 156

Multilingual Speech-to-Speech NMT

  • An interesting research direction would be to explore multilingual speech translation, where the ASR, translation and TTS modules can be multilingual
  • Interesting challenges and opportunities may arise in the quest to compose all these multilingual systems in an end-to-end manner.
  • Multilingual end-to-end speech-to-speech translation would also be a future challenging scenario

152

153 of 156

153

Your ideas?

154 of 156

Outline of This Tutorial

  • Overview of Multilingual NMT (30 min by Chenhui Chu)
  • Multiway Modeling (1 hour by Raj Dabre)
  • Low-resource Translation (1 hour by Anoop Kunchukuttan)
  • Multi-source Translation (10 min by Chenhui Chu)
  • Datasets, Future Directions, and Summary (20 min by Chenhui Chu)

154

155 of 156

155

Summary of MNMT (Dabre et al., 2020): use-cases, core-issues and challenges

  • Multiway Modeling (use-case); core-issues:
    • Parameter Sharing: full/minimal/controlled parameter sharing, massive models, capacity bottlenecks
    • Language Divergence: balancing language agnostic-dependent representations, impact of language tags, reordering and pre-ordering of languages, vocabularies
    • Training Protocols: parallel/joint training, multi-stage and incremental training, knowledge distillation, optimal stopping
  • Low-resource Translation (use-case); challenges:
    • Transfer Learning: fine-tuning, regularization, lexical transfer, syntactic transfer, language relatedness
    • Zero-shot Modelling: wrong language generation, language invariant representations, output agreement, effect of corpus size and number of languages
    • Zero-resource Modeling: synthetic corpus generation, iterative refinement, teacher-student models, pre-training approaches
  • Multi-source Translation (use-case): available/missing source sentences, multiway-multisource modeling, post-editing

156 of 156

156

Thank You!