1 of 86

Machine Translation in the Real World

Hassan Sajjad

2 of 86

About me

  • Research Scientist, Qatar Computing Research Institute (2014 to date)
    • Dr. Stephan Vogel, Dr. Lluis Marquez, Dr. Preslav Nakov
  • Post-doctorate, Qatar Computing Research Institute (2013-2014)
    • Dr. Stephan Vogel, Dr. Lluis Marquez, Dr. Preslav Nakov
  • Research Intern, Microsoft Research (2011)
    • Dr. Patrick Pantel, Dr. Michael Gamon
  • PhD, University of Stuttgart (2008-2013)
    • Prof. Dr. Hinrich Schütze, Prof. Dr. Alex Fraser, Dr. Helmut Schmid

2

3 of 86

Research Experience

NLP Areas: statistical machine translation, neural machine translation, neural language models, domain adaptation, multitask learning, word alignment, query expansion, corpus generation, transliteration mining, part-of-speech tagging, machine translation evaluation, comparable corpora extraction, interpretation of deep models

Techniques: unsupervised methods, supervised methods, deep neural networks

Genre/Domain: informal language (SMS, tweets, chat), spoken language (talks, lectures), formal language (news)

Languages: low-resource, resource-rich, and morphologically-rich languages

3


4 of 86

Research Experience

4

Figure - publication venues by area: Machine Learning (ICLR, AAAI), Computational Linguistics (CL, ACL, NAACL, EMNLP, COLING, EACL), Data Resources (LREC)

5 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

5

6 of 86

Translation

Meaningful representation of one language in another language

6

English: He does not go home

German: Er geht ja nicht nach Hause

Spanish: No va a su casa

Chinese: 他不回家

Arabic: هو لا يذهب إلى البيت

7 of 86

Machine Translation (MT)

  • Parallel corpus
    • Pair of sentences in two languages

  • Machine translation system learns from a large pool of parallel sentences

7

He does not go home ↔ Er geht ja nicht nach Hause

I am working on it ↔ Ich arbeite daran

8 of 86

Domain Adaptation for MT

  • Parallel data comes in various styles, genres, and domains
  • An MT system trained on heterogeneous data results in suboptimal performance

About the problem of unwanted pregnancy

About the problem of choice overload

8

9 of 86

Domain Adaptation for MT

  • Parallel data comes in various styles, genres, and domains
  • An MT system trained on heterogeneous data results in suboptimal performance

“Domain adaptation aims to preserve the identity of a domain while exploiting the large heterogeneous data in favor of it”

9

10 of 86

Domain Adaptation for MT

In this work:

  • Neural domain adapted models
  • Fusion model
  • Published @ EMNLP 2015, COLING 2016, CSL 2017

10

11 of 86

Neural Network Joint Model (Devlin 2014)

Given a parallel corpus, minimize the negative log-likelihood of the training data

Figure - three source words, four target words
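As a reference point, the NNJM objective of Devlin et al. (2014) can be written as the negative log-likelihood of each target word given its target-side history and a window of source words around the aligned source position; the notation below is generic and not necessarily the slide's own:

```latex
% Negative log-likelihood over the parallel corpus: each target word e_i is
% conditioned on the n-1 previous target words and a window of source words
% around its aligned source position a_i (window half-width k).
\mathcal{L}(\theta) = -\sum_{i} \log P\big(e_i \mid e_{i-n+1}, \ldots, e_{i-1},\; f_{a_i-k}, \ldots, f_{a_i+k}\big)
```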

11

12 of 86

Neural Network Joint Model (Devlin 2014)

Given a parallel corpus, minimize the negative log-likelihood of the training data

Figure - three source words, four target words

Limitations

  • Does not perform well when various domains are present in the data
  • Model deviates towards the large heterogeneous data

12

13 of 86

Neural Domain Adaptation Model

  • Three novel extensions
    • First model minimizes the cross entropy by regularizing the loss function with respect to the in-domain data
    • Second model additionally penalizes data instances that are similar to the out-of-domain data
    • Third model fuses in-domain and out-of-domain models by adjusting parameters of the composite model in favor of the in-domain data

13

14 of 86

Results

  • Improved performance by up to 0.7 BLEU points
  • The fusion model (third model) performed the best
  • Adapting separate models works better than building a single adapted model on the concatenation of the data

14

15 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

15

16 of 86

Model for Transliteration Mining

  • Transliteration
    • Script conversion
    • Similar pronunciation

  • Transliteration plays a vital role in major applications of NLP
    • Cross language information retrieval
    • Terminology extraction
    • Word alignment
    • Machine translation

16

17 of 86

Model for Transliteration Mining

  • Automatically extract word pairs that are transliterations of each other

  • Previous work: rule based methods or (semi) supervised methods

17

18 of 86

Model for Transliteration Mining

In this work:

  • Novel model for unsupervised transliteration mining
  • Extended to train under semi-supervised and supervised settings
  • Integrated into Moses
  • Published @ ACL 2011, ACL 2012, EACL 2014, CL 2017

18

19 of 86

Model for Transliteration Mining

Intuition:

  • At character level,
    • Transliteration pairs follow a pattern between them
    • Non-transliteration pairs can be considered as a random generation of characters

19

20 of 86

Model for Transliteration Mining

The transliteration mining model is defined as a mixture of a transliteration model and a non-transliteration model:

p(e, f) = (1 - λ) p_tr(e, f) + λ p_E(e) p_F(f)

where λ is the prior probability of non-transliteration, p_tr(e, f) is the transliteration model, and p_E(e) and p_F(f) are the character language model probabilities that make up the non-transliteration model

20

21 of 86

Model for Transliteration Mining

Intuition:

  • At character level,
    • Transliteration pairs follow a pattern between them
    • Non-transliteration pairs can be considered as a random generation of characters
  • Transliteration model
    • Generates source and target sequences jointly and models the dependencies between them
  • Non-transliteration model
    • Two monolingual source and target character sequence models which generate the source and target strings independently of each other

21

22 of 86

Model for Transliteration Mining

Intuition:

  • At character level,
    • Transliteration pairs follow a pattern between them
    • Non-transliteration pairs can be considered as a random generation of characters
  • Transliteration model
    • Generates source and target sequences jointly and models the dependencies between them
  • Non-transliteration model
    • Two monolingual source and target character sequence models which generate the source and target strings independently of each other
  • Mining model
    • Interpolation of transliteration model and non-transliteration model
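A minimal sketch of how the resulting mining decision could be implemented once the component models are trained; `translit_prob`, `src_char_lm`, and `tgt_char_lm` are hypothetical stand-ins for the sub-models, not the actual implementation:

```python
def is_transliteration(src, tgt, translit_prob, src_char_lm, tgt_char_lm, lam=0.5):
    """Classify a word pair via the mixture of a joint transliteration model
    and a non-transliteration model built from two monolingual character LMs.

    translit_prob(src, tgt)  -> p_tr(src, tgt), joint character-level probability
    src_char_lm(src)         -> p_E(src), source character LM probability
    tgt_char_lm(tgt)         -> p_F(tgt), target character LM probability
    lam                      -> prior probability of non-transliteration
    """
    p_tr = (1.0 - lam) * translit_prob(src, tgt)       # transliteration component
    p_ntr = lam * src_char_lm(src) * tgt_char_lm(tgt)  # non-transliteration component
    # Posterior probability that the pair is a transliteration.
    posterior = p_tr / (p_tr + p_ntr)
    return posterior > 0.5
```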

22

23 of 86

Results

  • Unsupervised system outperformed all supervised and semi-supervised systems
  • Better word alignment
  • Resulted in the best Hindi-English machine translation system

23

24 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

24

25 of 86

Interpretation of Neural MT

  • Deep neural models: state-of-the-art for many tasks
  • Issue: opaqueness
  • Interpretation is important
    • Better understanding
    • Increase trust in AI systems
    • Assisting ethical decision making
    • ...

25

26 of 86

Interpretation of Neural MT

In this work:

  • Increase model transparency
  • Whole vector representations
  • Individual neurons

  • Published @ ACL 2017, IJCNLP 2017, AAAI 2019, ICLR 2019
  • NeuroX toolkit (AAAI Demo 2019)

26

Figure - a neural model with an input layer, Layers 1-3, and an output layer

27 of 86

Analyzing Vector Representations

Research Questions:

  • Which parts of the neural MT architecture capture word structure?
  • What is the effect on learning when using different word representations?
  • Where and how much morphology, syntax and semantics of source and target languages is learned?

27

28 of 86

Analyzing Vector Representations

Methodology:

28

29 of 86

Results

  • German-, French-, Czech-, Arabic-English, Arabic-Hebrew
  • Encoder vs. Decoder
  • Layer-wise analysis
  • Representation analysis

29

Figure - layer-wise analysis across the input, Layers 1-3, and output: lower layers capture word-level concepts, higher layers capture syntax and semantics

30 of 86

Analyzing Vector Representations

Limitation:

  • No information about individual neurons

30

Figure - individual neurons within each layer of the model (input, Layers 1-3, output)

31 of 86

Analyzing Individual Neurons

Limitation:

  • No information about individual neurons

Open questions:

  • Learning pattern
  • Role of individual neurons
  • Important vs. less important neurons
  • Representation of information

31

Figure - individual neurons within each layer of the model (input, Layers 1-3, output)

32 of 86

Analyzing Individual Neurons

  • Linguistic Correlation Analysis
    • Identify neurons with respect to a property
      • Noun, verb, adjective
      • Month of year
  • Cross-model Correlation Analysis
    • Identify neurons salient for the model
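A minimal sketch of the idea behind linguistic correlation analysis, assuming token-level activations have already been extracted: fit a regularized linear probe from neuron activations to the property of interest and rank neurons by the magnitude of their learned weights. Function and variable names here are illustrative, not the NeuroX API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_neurons_for_property(activations, labels, top_k=10):
    """activations: (num_tokens, num_neurons) array of extracted activations
    labels: per-token labels for the property of interest (e.g. noun vs. not)
    Returns indices of the neurons most predictive of the property."""
    # An L1-regularized probe encourages a sparse set of responsible neurons.
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(activations, labels)
    # Importance of a neuron = magnitude of its weight(s) in the probe.
    importance = np.abs(probe.coef_).max(axis=0)
    return np.argsort(importance)[::-1][:top_k]
```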

32

33 of 86

Linguistic Correlation Analysis

33

34 of 86

Cross-model Correlation Analysis

  • What does the model care about?

Hypothesis

  • Different models learn similar properties
  • Search for neurons that share similar patterns in different networks
  • Use correlation between neurons as a measure of their importance
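A minimal sketch of the cross-model idea: for each neuron in one model, find its best Pearson correlation with any neuron of a second model run over the same tokens, and rank neurons by that score. The array shapes and names are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def rank_by_cross_model_correlation(acts_a, acts_b, top_k=10):
    """acts_a, acts_b: (num_tokens, num_neurons) activations of two models
    run over the same tokens. Returns neurons of model A ranked by their
    best correlation with any neuron of model B."""
    # Standardize so that a dot product equals the Pearson correlation.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = za.T @ zb / acts_a.shape[0]   # (neurons_a, neurons_b) correlation matrix
    best = np.abs(corr).max(axis=1)      # best match for each neuron of model A
    return np.argsort(best)[::-1][:top_k]
```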

34

35 of 86

Visualization - Top Neurons

Figure - activation visualizations for top neurons: English verb neuron #1902, position neuron #1903, article neuron #590

36 of 86

Focused vs. Distributed Neurons

Open class vs. closed class categories

36

Neuron #1925 - Top 10 words: August, July, January, September, October, presidential, April, May, February, December

Neuron #1960 - Top 10 words: no, No, not, nothing, nor, neither, or, none, whether, appeal

Neuron #1590 - Top 10 words: 50, 10, 51, 61, 47, 37, 48, 33, 43, 49

37 of 86

Controlling of Models

  • Neurons responsible for specific properties

Can we use this information to control models?

  • Benefit: mitigating bias in models, e.g. gender bias

  • Manipulate neurons at test time
  • Experimented with gender, number and tense
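A minimal sketch of such test-time manipulation, assuming a PyTorch-style model where a forward hook can overwrite activations; the layer, neuron indices, and clamp value below are placeholders, not the actual experimental settings:

```python
import torch

def clamp_neurons(layer, neuron_ids, value):
    """Register a forward hook that fixes selected neurons of `layer` to `value`
    during decoding, e.g. to flip a tense or gender property.
    Assumes the hooked module returns a plain tensor."""
    def hook(module, inputs, output):
        output[..., neuron_ids] = value  # overwrite the chosen activations
        return output
    return layer.register_forward_hook(hook)

# Hypothetical usage: force a "tense" neuron of the top decoder layer to a
# past-tense activation value while translating, then remove the hook.
# handle = clamp_neurons(model.decoder.layers[-1], neuron_ids=[1902], value=-1.0)
# ... run translation ...
# handle.remove()
```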

37

38 of 86

Controlling of Models

  • Result of changing tense neurons

38

39 of 86

Media Coverage

39

40 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

40

41 of 86

Practical Machine Translation

  • Built state-of-the-art systems
  • Challenges
    • Real time processing
    • Memory bottlenecks
    • Customization

41

42 of 86

Practical Machine Translation

Ranked at the top or among the best-performing systems

42

  • WMT 2013: Russian-English - 2nd tier

  • IWSLT 2013 & 2016 (lecture and speech translation): Arabic-English - 1st, English-Arabic - 1st

  • NIST 2015: Dialectal Arabic-English - 2nd

43 of 86

Practical Machine Translation

43

Startup grant $100k

32 million tokens translated!

35 countries

44 of 86

Potential Research Directions

  • Explainable and interpretable NLP models
    • Fairness
    • Robustness
    • Easy to debug
  • Towards universal representations
    • Multilingual models
    • Language independent
    • Task independent
  • Adversarial and reinforcement learning
    • Unsupervised methods

44

45 of 86

Thank you

45

46 of 86

Neural Network Language Model (Bengio 2003)

Given a monolingual corpus, minimize the negative log-likelihood of the training data

The cross-entropy loss compares an indicator (one-hot) variable for the observed word with the softmax output of the network, conditioned on the language model context (the preceding words).
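A generic way to write the Bengio-style objective, tying together the indicator variable, the language model context, and the softmax output mentioned above; the notation is illustrative rather than the slide's own:

```latex
% h_i: the language model context (the n-1 preceding words)
% y_{i,w}: indicator (one-hot) variable for the word observed at position i
% P_theta(. | h_i): the softmax output of the network over the vocabulary V
\mathcal{L}(\theta) = -\sum_{i} \sum_{w \in V} y_{i,w}\, \log P_\theta(w \mid h_i)
                    \;=\; -\sum_{i} \log P_\theta(w_i \mid h_i)
```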

46

47 of 86

Neural Domain Adaptation Model

  • Summary
    • The fusion model (third model) performed best when tested on English-German and Arabic-English language pairs
    • Adapting separate models works better than building a single adapted model on the concatenation of the data

47

48 of 86

Machine Translation

  • Machine Translation through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

48

49 of 86

Machine Translation through Transliteration

  • Closely related languages
    • Share grammatical structure
    • Share some vocabulary

Can we leverage the benefit of similar vocabulary?

49

50 of 86

Machine Translation through Transliteration

  • Closely related languages
    • Share grammatical structure
    • Share some vocabulary

Can we leverage the benefit of similar vocabulary?

  • Case study using Hindi and Urdu language pair
    • Share grammatical structure
    • Share large proportion of vocabulary
    • Have different writing scripts (Devanagari vs. Perso-Arabic script)

50

Let’s leverage the benefit of similar vocabulary by modeling transliteration between language pairs

51 of 86

Machine Translation through Transliteration

  • A novel model that considers both translation and transliteration when translating a particular source word given the context

Basic idea:

  • Transliterate all input words
  • Pit them against regular translations on the fly
  • Language model will decide
    • Either to translate or transliterate given the context
    • Which translation/transliteration to choose given the context

51

52 of 86

Machine Translation through Transliteration

52

Figure - two ways to use transliteration. (1) Transliteration as a post-processing step: input sentence → decoder (translation model + language model) → initial output → transliteration of unknown words → final output. (2) Transliteration as a component of the translation model: input sentence → decoder (translation model with translation and transliteration sub-models + language model) → final output.

53 of 86

Machine Translation through Transliteration

  • Apply the noisy channel model to compute the most probable translation:

ê = argmax_e p(e) · p(f | e)

where p(f | e) is the translation model and p(e) is the language model

53

54 of 86

Machine Translation through Transliteration

Estimate conditional probability of words using an interpolation of translation sub-model and transliteration sub-model

54

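As a sketch of how the translation model can combine the two sub-models, assuming a single interpolation weight β (the exact parameterization in the original work may differ):

```latex
% f: source word, e: target word (same direction as the noisy channel slide)
% beta: interpolation weight, an assumption made here for illustration only
p(f \mid e) = (1 - \beta)\, p_{\text{translation}}(f \mid e) + \beta\, p_{\text{transliteration}}(f \mid e)
```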

55 of 86

Machine Translation through Transliteration

Summary

  • Incorporating transliteration improved the translation quality by 4 BLEU points
  • Transliteration is very helpful in improving the translation of closely related languages
  • Published @ ACL 2010, IJCNLP 2011

55

56 of 86

Machine Translation

  • Machine Translation Through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

56


58 of 86

Analysis of Neural Machine Translation

  • Neural MT obtains state-of-the-art performance
  • However, little is known about what these models learn about source and target language

In this work:

  • Analyzed neural MT’s ability to learn source and target language phenomena, such as syntax and morphology
  • Published @ ACL 2017, IJCNLP 2017
  • Featured in MIT News

58

59 of 86

Analysis of Neural Machine Translation

Research Questions:

  • Which parts of the neural MT architecture capture word structure?
  • What is the effect on learning when using different word representations?
  • Where and how much morphology, syntax and semantics of source and target languages is learned?

59

60 of 86

Analysis of Neural Machine Translation

Methodology:

  • Two-step process
    • Train a neural MT system
    • Extract feature representations using the trained model
    • Train a classifier using the feature representations and evaluate it for extrinsic tasks like morphological tagging, semantic tagging, etc.
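A minimal sketch of this two-step methodology, assuming the per-word encoder states and tags have already been dumped to disk; the file names and the choice of classifier are illustrative, not the actual setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1 (assumed done): run the trained NMT model over tagged data and save
# one feature vector per word, e.g. the encoder hidden state at that position.
train_feats = np.load("encoder_states_train.npy")  # (num_words, hidden_dim)
train_tags = np.load("morph_tags_train.npy")       # (num_words,)
test_feats = np.load("encoder_states_test.npy")
test_tags = np.load("morph_tags_test.npy")

# Step 2: train a simple classifier on the frozen representations and use its
# accuracy on the extrinsic task as a measure of what the NMT model has learned.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_tags)
print("morphological tagging accuracy:",
      accuracy_score(test_tags, probe.predict(test_feats)))
```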

60

61 of 86

Analysis of Neural Machine Translation

Methodology:

61

62 of 86

Analysis of Neural Machine Translation

Hypothesis:

  • The performance of the classifier is a quantitative measure of how well suited the representations are for the task at hand

62

63 of 86

Analysis of Neural Machine Translation

Results:

  • German-, French-, Czech-, Arabic-English, Arabic-Hebrew
  • Encoder vs. Decoder
    • The encoder learns various source language phenomena well, while the decoder learns comparatively little about the target language
    • The encoder and attention mechanism take the load off the decoder
    • The decoder’s role might be limited to being a good language model
  • Lower layers of the encoder learn morphological information, while higher layers learn syntax and semantics of the language
  • Character-based representations model infrequent words better than word-based representations

63

64 of 86

Analysis of Neural Machine Translation

Results:

  • Translation quality does not reflect the amount of linguistic information the model has learned (English-English translation experiment)
  • For a more difficult target language, the encoder learns the morphology of the source language better

64

65 of 86

Machine Translation

  • Machine Translation Through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

65

66 of 86

Improving Neural Decoder using Multitask Learning

  • The decoder learns little about target language morphology (encoder 90% vs. decoder 45%)
  • Learning target language morphology may help improve translation into morphologically rich languages

In this work:

  • Injected morphology into the decoder to facilitate translation into morphologically rich languages
  • Published @ IJCNLP2017

66

67 of 86

Improving Neural Decoder using Multitask Learning

  • Train a neural MT system in a multitask setting
  • Given a set of m tasks, the objective function minimizes the overall loss, which is a weighted combination of the m individual task losses

  • For a training pair with source (s), target (t) and task (m), a hyperparameter controls the balance between the translation and morphology prediction tasks
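For the two-task case of translation plus morphology prediction, one standard way to write such a weighted objective is the following; the symbol λ and the two-task form are illustrative, and the general m-task case is the analogous weighted sum:

```latex
% Loss for a training pair (s, t): weighted combination of the translation loss
% and the morphology prediction loss, balanced by the hyperparameter lambda.
\mathcal{L}(s, t) = \lambda\, \mathcal{L}_{\text{translation}}(s, t)
                  + (1 - \lambda)\, \mathcal{L}_{\text{morphology}}(s, t)
```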

67

68 of 86

Improving Neural Decoder using Multitask Learning

Results:

  • English-German, English-Czech
  • Injecting morphology in the decoder helps to improve the translation of morphologically rich languages
  • Translation into morphologically poor languages does not benefit from this

68

69 of 86

Machine Translation

  • Machine Translation Through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

69

70 of 86

Semi-supervised Model for Transliteration Mining

  • Assume there is a small amount of labeled data available to support the training
  • We smooth the labeled data probability estimates with the unlabeled data probability estimates

The smoothed estimate combines the labeled data counts of each character alignment with the unlabeled data probability estimates, taking into account the number of character alignment types observed in the Viterbi alignment of the labeled data
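One common way to realize such smoothing, shown purely as an illustration (the paper's exact estimator may differ), is additive smoothing of the labeled-data counts with the unlabeled-data distribution:

```latex
% n_s(q): labeled-data count of character alignment q
% p_u(q): unlabeled-data probability estimate of q
% N_s:    total labeled-data count; eta: smoothing weight
\hat{p}(q) = \frac{n_s(q) + \eta\, p_u(q)}{N_s + \eta}
```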

70

71 of 86

Model for Transliteration Mining

  • Training
    • Parameters of the transliteration model are learned during EM training
    • Parameters of the non-transliteration model are trained once using monolingual language models

  • Summary
    • Russian-English, Arabic-English, Tamil-English, Hindi-English
    • Our unsupervised method outperformed semi-supervised and supervised models for three out of four language pairs
    • Integration into Moses showed an improvement of up to 1 BLEU point
    • Integration into word alignment showed an absolute gain of 14% in F-measure

71

72 of 86

Miscellaneous Projects

  • Tutorials on deep learning
    • Deep learning for machine translation (course material), DGfS Fall school (Sept. 2017)
    • From theory to practice: deep learning for NLP, (April 2018), University of Duisburg-Essen, Germany
  • Finding lexical variations in informal text
    • Presented a clustering framework that uses phonetic features, substring matches and contextual information to find lexical variations under resource poor conditions
    • Published @ EMNLP 2016, CL (second round of review)
  • Query refinement in a vertical search
    • Proposed a method that, given a user query, refines the search space for efficient vertical search
    • Published @ COLING 2012, Patent

72

73 of 86

In this talk

  • Machine Translation
    • Machine translation through transliteration
    • Domain adaptation for machine translation
    • Analysis of neural machine translation
    • Improving neural decoder using multitask learning
    • Competitions
  • Model for Transliteration Mining
  • Miscellaneous Projects
  • Potential Research Directions

73

74 of 86

Miscellaneous Projects

  • QCRI educational domain parallel corpus
    • Crawled, cleaned and compiled the first parallel corpus focused on educational content such as math, physics, and chemistry (in 20 languages)
    • Part of the IWSLT 2016 evaluation campaign
    • Published @ LREC 2014
  • Part of speech tagging of Urdu
    • Proposed a tagset for Urdu, annotated a dataset of 100,000 words, and presented an empirical study on part-of-speech tagging of Urdu
    • Published @ EACL 2009
  • Rapid classification in a crisis scenario using deep learning
    • Presented a CNN-based method for the rapid classification of tweets in crisis situations such as earthquakes and floods
    • Published @ ICWSM 2017

74


77 of 86

Domain Adaptation for MT

In this work:

  • Bilingual Neural Language model
    • Neural domain adapted model
    • Fusion model
    • Published @ EMNLP 2015, COLING 2016, CSL 2017
  • Neural machine translation
    • Empirical study under various training scenarios
    • Published @ IWSLT 2017

77

78 of 86

Multi-domain Training Scenario for Neural MT

  • Several domains, such as TED, OPUS, News, and UN
  • What are the best strategies to build an optimal neural MT system?
    • Concatenation, Stacking, Selection, Ensemble

78


80 of 86

Multi-domain Training Scenario for Neural MT

  • Summary
    • Arabic-English and German-English language pairs
    • A system trained on the concatenation of all data and then fine-tuned on the in-domain data performs best
    • In contrast to phrase-based systems, data selection hurts NMT performance
    • An ensemble of separately trained models did not perform well
    • Model stacking works well when training proceeds from the out-of-domain data farthest from the in-domain data to the closest
    • Fine-tuning on a diverse data set results in a robust model

80

81 of 86

Unsupervised Model for Transliteration Mining

Transliteration mining model is defined as a mixture of a transliteration model and a non-transliteration model

p(e, f) = (1 - λ) p_tr(e, f) + λ p_ntr(e, f)

where λ is the prior probability of non-transliteration, p_tr is the transliteration model, and p_ntr is the non-transliteration model

81

82 of 86

Unsupervised Model for Transliteration Mining

Transliteration mining model is defined as a mixture of a transliteration model and a non-transliteration model

p(e, f) = (1 - λ) p_tr(e, f) + λ p_E(e) p_F(f)

where λ is the prior probability of non-transliteration, and p_E(e) and p_F(f) are the character language model probabilities of the non-transliteration model

82

83 of 86

Neural Domain Adaptation Model

Method 1: Give higher weight to word sequences that are favored by the in-domain data

The loss uses the probability of each training instance according to the in-domain model together with the probability assigned by the adapted model
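A plausible instance-weighted form of this idea, with symbols chosen for illustration rather than taken from the paper: the weight of a training instance grows with its probability under the in-domain model.

```latex
% p_I: in-domain model, p_A: adapted model being trained, x_i: training instance
\mathcal{L}(\theta) = -\sum_{i} w_i \log p_A(x_i \mid \theta),
\qquad w_i \propto p_I(x_i)
```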

83

84 of 86

Neural Domain Adaptation Model

Method 2: Additionally penalize sequences that are favored by the out-of-domain data

The penalty uses the probability of the training instance according to the out-of-domain model
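Continuing the same illustrative notation, the additional penalty can be sketched as weights that also shrink for instances the out-of-domain model assigns high probability to:

```latex
% p_O: out-of-domain model; instances favored by the in-domain model and
% disfavored by the out-of-domain model receive the largest weights.
w_i \propto \frac{p_I(x_i)}{p_O(x_i)}
```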

84

85 of 86

Neural Domain Adaptation Model

Method 3:

  • Train separate in-domain and out-of-domain models
  • Adjust their parameters in favor of the in-domain data

85

86 of 86

Machine Translation

  • Domain Adaptation for Machine Translation
  • Practical machine translation

A few other notable projects:

  • Machine Translation Through Transliteration (@ ACL 2010)
  • Challenging Language Dependent Segmentation of Arabic (@ ACL 2017)
  • Machine Translation Evaluation using Eye-tracking (@ NAACL 2016)
  • Improving Neural Decoder using Multitask Learning (@ IJCNLP 2017)
  • Translating Dialectal Arabic (@ ACL 2013)

86