1 of 86

Machine Translation in the Real World

Hassan Sajjad

2 of 86

About me

  • Research Scientist, Qatar Computing Research Institute (2014 to date)
    • Dr. Stephan Vogel, Dr. Lluis Marquez, Dr. Preslav Nakov
  • Post-doctorate, Qatar Computing Research Institute (2013-2014)
    • Dr. Stephan Vogel, Dr. Lluis Marquez, Dr. Preslav Nakov
  • Research Intern, Microsoft Research (2011)
    • Dr. Patrick Pantel, Dr. Michael Gamon
  • PhD, University of Stuttgart (2008-2013)
    • Prof. Dr. Hinrich Schütze, Prof. Dr. Alex Fraser, Dr. Helmut Schmid

2

3 of 86

Research Experience

NLP Areas: statistical machine translation, neural machine translation, neural language models, domain adaptation, multitask learning, word alignment, query expansion, corpus generation, transliteration mining, part-of-speech tagging, machine translation evaluation, comparable corpora extraction, interpretation of deep models

Techniques: unsupervised methods, supervised methods, deep neural networks

Genre/Domain: informal language (SMS, tweets, chat), spoken language (talks, lectures), formal language (news)

Languages: low-resource, resource-rich, and morphologically-rich languages

3


4 of 86

Research Experience

4

Figure - publication venues by area: Machine Learning (ICLR, AAAI), Computational Linguistics (CL, ACL, NAACL, EMNLP, COLING, EACL), Data Resources (LREC)

5 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

5

6 of 86

Translation

Meaningful representation of one language in another language

6

English: He does not go home

German: Er geht ja nicht nach Hause

Spanish: No va a su casa

Chinese: 他不回家

Arabic: هو لا يذهب إلى البيت

7 of 86

Machine Translation (MT)

  • Parallel corpus
    • Pair of sentences in two languages

  • Machine translation system learns from a large pool of parallel sentences

7

He does not go home ↔ Er geht ja nicht nach Hause

I am working on it ↔ Ich arbeite daran

8 of 86

Domain Adaptation for MT

  • Parallel data comes in various styles, genres, and domains
  • An MT system trained on heterogeneous data results in suboptimal performance

About the problem of unwanted pregnancy

About the problem of choice overload

8

9 of 86

Domain Adaptation for MT

  • Parallel data comes in various styles, genres, and domains
  • An MT system trained on heterogeneous data results in suboptimal performance

“Domain adaptation aims to preserve the identity of a domain while exploiting the large heterogeneous data in favor of it”

9

10 of 86

Domain Adaptation for MT

In this work:

  • Neural domain adapted models
  • Fusion model
  • Published @ EMNLP 2015, COLING 2016, CSL 2017

10

11 of 86

Neural Network Joint Model (Devlin 2014)

Given a parallel corpus, minimize the negative log-likelihood of the training data

Figure - three source words, four target words
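As a reference point, the NNJM objective of Devlin et al. (2014) can be written as the negative log-likelihood of each target word given its target-side history and a window of source words around the aligned source position; the notation below is generic and not necessarily the slide's own:

```latex
% Negative log-likelihood over the parallel corpus: each target word e_i is
% conditioned on the n-1 previous target words and a window of source words
% around its aligned source position a_i (window half-width k).
\mathcal{L}(\theta) = -\sum_{i} \log P\big(e_i \mid e_{i-n+1}, \ldots, e_{i-1},\; f_{a_i-k}, \ldots, f_{a_i+k}\big)
```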

11

12 of 86

Neural Network Joint Model (Devlin 2014)

Given a parallel corpus, minimize the negative log-likelihood of the training data

Figure - three source words, four target words

Limitations

  • Does not perform well when various domains are present in the data
  • Model deviates towards the large heterogeneous data

12

13 of 86

Neural Domain Adaptation Model

  • Three novel extensions
    • First model minimizes the cross entropy by regularizing the loss function with respect to the in-domain data
    • Second model additionally penalizes data instances that are similar to the out-of-domain data
    • Third model fuses in-domain and out-of-domain models by adjusting parameters of the composite model in favor of the in-domain data

13

14 of 86

Results

  • Improved performance by up to 0.7 BLEU points
  • The fusion model (third model) performed the best
  • Adapting separate models works better than building a single adapted model on the concatenation of the data

14

15 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

15

16 of 86

Model for Transliteration Mining

  • Transliteration
    • Script conversion
    • Similar pronunciation

  • Transliteration plays a vital role in major applications of NLP
    • Cross language information retrieval
    • Terminology extraction
    • Word alignment
    • Machine translation

16

17 of 86

Model for Transliteration Mining

  • Automatically extract word pairs that are transliterations of each other

  • Previous work: rule based methods or (semi) supervised methods

17

18 of 86

Model for Transliteration Mining

In this work:

  • Novel model for unsupervised transliteration mining
  • Extended to train under semi-supervised and supervised settings
  • Integrated into Moses
  • Published @ ACL 2011, ACL 2012, EACL 2014, CL 2017

18

19 of 86

Model for Transliteration Mining

Intuition:

  • At character level,
    • Transliteration pairs follow a pattern between them
    • Non-transliteration pairs can be considered as a random generation of characters

19

20 of 86

Model for Transliteration Mining

The transliteration mining model is defined as a mixture of a transliteration model and a non-transliteration model:

p(e, f) = (1 - λ) p_tr(e, f) + λ p_E(e) p_F(f)

where λ is the prior probability of non-transliteration, p_tr(e, f) is the transliteration model, and p_E(e) and p_F(f) are the character language model probabilities that make up the non-transliteration model

20

21 of 86

Model for Transliteration Mining

Intuition:

  • At character level,
    • Transliteration pairs follow a pattern between them
    • Non-transliteration pairs can be considered as a random generation of characters
  • Transliteration model
    • Generates source and target sequences jointly and models the dependencies between them
  • Non-transliteration model
    • Two monolingual source and target character sequence models which generate the source and target strings independently of each other

21

22 of 86

Model for Transliteration Mining

Intuition:

  • At character level,
    • Transliteration pairs follow a pattern between them
    • Non-transliteration pairs can be considered as a random generation of characters
  • Transliteration model
    • Generates source and target sequences jointly and models the dependencies between them
  • Non-transliteration model
    • Two monolingual source and target character sequence models which generate the source and target strings independently of each other
  • Mining model
    • Interpolation of transliteration model and non-transliteration model
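A minimal sketch of how the resulting mining decision could be implemented once the component models are trained; `translit_prob`, `src_char_lm`, and `tgt_char_lm` are hypothetical stand-ins for the sub-models, not the actual implementation:

```python
def is_transliteration(src, tgt, translit_prob, src_char_lm, tgt_char_lm, lam=0.5):
    """Classify a word pair via the mixture of a joint transliteration model
    and a non-transliteration model built from two monolingual character LMs.

    translit_prob(src, tgt)  -> p_tr(src, tgt), joint character-level probability
    src_char_lm(src)         -> p_E(src), source character LM probability
    tgt_char_lm(tgt)         -> p_F(tgt), target character LM probability
    lam                      -> prior probability of non-transliteration
    """
    p_tr = (1.0 - lam) * translit_prob(src, tgt)       # transliteration component
    p_ntr = lam * src_char_lm(src) * tgt_char_lm(tgt)  # non-transliteration component
    # Posterior probability that the pair is a transliteration.
    posterior = p_tr / (p_tr + p_ntr)
    return posterior > 0.5
```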

22

23 of 86

Results

  • Unsupervised system outperformed all supervised and semi-supervised systems
  • Better word alignment
  • Resulted in the best Hindi-English machine translation system

23

24 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

24

25 of 86

Interpretation of Neural MT

  • Deep neural models: state-of-the-art for many tasks
  • Issue: opaqueness
  • Interpretation is important
    • Better understanding
    • Increase trust in AI systems
    • Assisting ethical decision making
    • ...

25

26 of 86

Interpretation of Neural MT

In this work:

  • Increase model transparency
  • Whole vector representations
  • Individual neurons

  • Published @ ACL 2017, IJCNLP 2017, AAAI 2019, ICLR 2019
  • NeuroX toolkit (AAAI Demo 2019)

26

Figure - a neural model with an input layer, Layers 1-3, and an output layer

27 of 86

Analyzing Vector Representations

Research Questions:

  • Which parts of the neural MT architecture capture word structure?
  • What is the effect on learning when using different word representations?
  • Where and how much morphology, syntax and semantics of source and target languages is learned?

27

28 of 86

Analyzing Vector Representations

Methodology:

28

29 of 86

Results

  • German-, French-, Czech-, Arabic-English, Arabic-Hebrew
  • Encoder vs. Decoder
  • Layer-wise analysis
  • Representation analysis

29

Figure - layer-wise analysis across the input, Layers 1-3, and output: lower layers capture word-level concepts, higher layers capture syntax and semantics

30 of 86

Analyzing Vector Representations

Limitation:

  • No information about individual neurons

30

Figure - individual neurons within each layer of the model (input, Layers 1-3, output)

31 of 86

Analyzing Individual Neurons

Limitation:

  • No information about individual neurons

Open questions:

  • Learning pattern
  • Role of individual neurons
  • Important vs. less important neurons
  • Representation of information

31

Figure - individual neurons within each layer of the model (input, Layers 1-3, output)

32 of 86

Analyzing Individual Neurons

  • Linguistic Correlation Analysis
    • Identify neurons with respect to a property
      • Noun, verb, adjective
      • Month of year
  • Cross-model Correlation Analysis
    • Identify neurons salient for the model
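A minimal sketch of the idea behind linguistic correlation analysis, assuming token-level activations have already been extracted: fit a regularized linear probe from neuron activations to the property of interest and rank neurons by the magnitude of their learned weights. Function and variable names here are illustrative, not the NeuroX API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_neurons_for_property(activations, labels, top_k=10):
    """activations: (num_tokens, num_neurons) array of extracted activations
    labels: per-token labels for the property of interest (e.g. noun vs. not)
    Returns indices of the neurons most predictive of the property."""
    # An L1-regularized probe encourages a sparse set of responsible neurons.
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(activations, labels)
    # Importance of a neuron = magnitude of its weight(s) in the probe.
    importance = np.abs(probe.coef_).max(axis=0)
    return np.argsort(importance)[::-1][:top_k]
```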

32

33 of 86

Linguistic Correlation Analysis

33

34 of 86

Cross-model Correlation Analysis

  • What does the model care about?

Hypothesis

  • Different models learn similar properties
  • Search for neurons that share similar patterns in different networks
  • Use correlation between neurons as a measure of their importance
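A minimal sketch of the cross-model idea: for each neuron in one model, find its best Pearson correlation with any neuron of a second model run over the same tokens, and rank neurons by that score. The array shapes and names are assumptions for illustration, not the actual implementation.

```python
import numpy as np

def rank_by_cross_model_correlation(acts_a, acts_b, top_k=10):
    """acts_a, acts_b: (num_tokens, num_neurons) activations of two models
    run over the same tokens. Returns neurons of model A ranked by their
    best correlation with any neuron of model B."""
    # Standardize so that a dot product equals the Pearson correlation.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = za.T @ zb / acts_a.shape[0]   # (neurons_a, neurons_b) correlation matrix
    best = np.abs(corr).max(axis=1)      # best match for each neuron of model A
    return np.argsort(best)[::-1][:top_k]
```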

34

35 of 86

Visualization - Top Neurons

Figure - activation visualizations for top neurons: English verb neuron #1902, position neuron #1903, article neuron #590

36 of 86

Focused vs. Distributed Neurons

Open class vs. closed class categories

36

Neuron #1925 - Top 10 words: August, July, January, September, October, presidential, April, May, February, December

Neuron #1960 - Top 10 words: no, No, not, nothing, nor, neither, or, none, whether, appeal

Neuron #1590 - Top 10 words: 50, 10, 51, 61, 47, 37, 48, 33, 43, 49

37 of 86

Controlling of Models

  • Neurons responsible for specific properties

Can we use this information to control models?

  • Benefit: mitigating bias in models, e.g. gender bias

  • Manipulate neurons at test time
  • Experimented with gender, number and tense
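A minimal sketch of such test-time manipulation, assuming a PyTorch-style model where a forward hook can overwrite activations; the layer, neuron indices, and clamp value below are placeholders, not the actual experimental settings:

```python
import torch

def clamp_neurons(layer, neuron_ids, value):
    """Register a forward hook that fixes selected neurons of `layer` to `value`
    during decoding, e.g. to flip a tense or gender property.
    Assumes the hooked module returns a plain tensor."""
    def hook(module, inputs, output):
        output[..., neuron_ids] = value  # overwrite the chosen activations
        return output
    return layer.register_forward_hook(hook)

# Hypothetical usage: force a "tense" neuron of the top decoder layer to a
# past-tense activation value while translating, then remove the hook.
# handle = clamp_neurons(model.decoder.layers[-1], neuron_ids=[1902], value=-1.0)
# ... run translation ...
# handle.remove()
```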

37

38 of 86

Controlling of Models

  • Result of changing tense neurons

38

39 of 86

Media Coverage

39

40 of 86

In this talk

  • Domain adaptation for machine translation
  • Model for transliteration mining
  • Interpretation of neural machine translation
  • Practical machine translation

40

41 of 86

Practical Machine Translation

  • Built state-of-the-art systems
  • Challenges
    • Real time processing
    • Memory bottlenecks
    • Customization

41

42 of 86

Practical Machine Translation

Ranked at the top or among the best-performing systems

42

  • WMT 2013: Russian-English - 2nd tier

  • IWSLT 2013 & 2016 (lecture and speech translation): Arabic-English - 1st, English-Arabic - 1st

  • NIST 2015: Dialectal Arabic-English - 2nd

43 of 86

Practical Machine Translation

43

Startup grant $100k

32 million tokens translated!

35 countries

44 of 86

Potential Research Directions

  • Explainable and interpretable NLP models
    • Fairness
    • Robustness
    • Easy to debug
  • Towards universal representations
    • Multilingual models
    • Language independent
    • Task independent
  • Adversarial and reinforcement learning
    • Unsupervised methods

44

45 of 86

Thank you

45

46 of 86

Neural Network Language Model (Bengio 2003)

Given a monolingual corpus, minimize the negative log-likelihood of the training data

The cross-entropy loss compares an indicator (one-hot) variable for the observed word with the softmax output of the network, conditioned on the language model context (the preceding words).
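A generic way to write the Bengio-style objective, tying together the indicator variable, the language model context, and the softmax output mentioned above; the notation is illustrative rather than the slide's own:

```latex
% h_i: the language model context (the n-1 preceding words)
% y_{i,w}: indicator (one-hot) variable for the word observed at position i
% P_theta(. | h_i): the softmax output of the network over the vocabulary V
\mathcal{L}(\theta) = -\sum_{i} \sum_{w \in V} y_{i,w}\, \log P_\theta(w \mid h_i)
                    \;=\; -\sum_{i} \log P_\theta(w_i \mid h_i)
```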

46

47 of 86

Neural Domain Adaptation Model

  • Summary
    • The fusion model (third model) performed best when tested on English-German and Arabic-English language pairs
    • Adapting separate models works better than building a single adapted model on the concatenation of the data

47

48 of 86

Machine Translation

  • Machine Translation through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

48

49 of 86

Machine Translation through Transliteration

  • Closely related languages
    • Share grammatical structure
    • Share some vocabulary

Can we leverage the benefit of similar vocabulary?

49

50 of 86

Machine Translation through Transliteration

  • Closely related languages
    • Share grammatical structure
    • Share some vocabulary

Can we leverage the benefit of similar vocabulary?

  • Case study using Hindi and Urdu language pair
    • Share grammatical structure
    • Share large proportion of vocabulary
    • Have different writing scripts (Devanagari vs. Perso-Arabic script)

50

Let’s leverage the benefit of similar vocabulary by modeling transliteration between language pairs

51 of 86

Machine Translation through Transliteration

  • A novel model that considers both translation and transliteration when translating a particular source word given the context

Basic idea:

  • Transliterate all input words
  • Pit them against regular translations on the fly
  • Language model will decide
    • Either to translate or transliterate given the context
    • Which translation/transliteration to choose given the context

51

52 of 86

Machine Translation through Transliteration

52

Figure - two ways to use transliteration. (1) Transliteration as a post-processing step: input sentence → decoder (translation model + language model) → initial output → transliteration of unknown words → final output. (2) Transliteration as a component of the translation model: input sentence → decoder (translation model with translation and transliteration sub-models + language model) → final output.

53 of 86

Machine Translation through Transliteration

  • Apply the noisy channel model to compute the most probable translation:

ê = argmax_e p(e) · p(f | e)

where p(f | e) is the translation model and p(e) is the language model

53

54 of 86

Machine Translation through Transliteration

Estimate conditional probability of words using an interpolation of translation sub-model and transliteration sub-model

54

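As a sketch of how the translation model can combine the two sub-models, assuming a single interpolation weight β (the exact parameterization in the original work may differ):

```latex
% f: source word, e: target word (same direction as the noisy channel slide)
% beta: interpolation weight, an assumption made here for illustration only
p(f \mid e) = (1 - \beta)\, p_{\text{translation}}(f \mid e) + \beta\, p_{\text{transliteration}}(f \mid e)
```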

55 of 86

Machine Translation through Transliteration

Summary

  • Incorporating transliteration improved the translation quality by 4 BLEU points
  • Transliteration is very helpful in improving the translation of closely related languages
  • Published @ ACL 2010, IJCNLP 2011

55

56 of 86

Machine Translation

  • Machine Translation Through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

56


58 of 86

Analysis of Neural Machine Translation

  • Neural MT obtains state-of-the-art performance
  • However, little is known about what these models learn about source and target language

In this work:

  • Analyzed neural MT’s ability to learn source and target language phenomena, such as syntax and morphology
  • Published @ ACL 2017, IJCNLP 2017
  • Featured in MIT News

58

59 of 86

Analysis of Neural Machine Translation

Research Questions:

  • Which parts of the neural MT architecture capture word structure?
  • What is the effect on learning when using different word representations?
  • Where and how much morphology, syntax and semantics of source and target languages is learned?

59

60 of 86

Analysis of Neural Machine Translation

Methodology:

  • Two-step process
    • Train a neural MT system
    • Extract feature representations using the trained model
    • Train a classifier using the feature representations and evaluate it for extrinsic tasks like morphological tagging, semantic tagging, etc.
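A minimal sketch of this two-step methodology, assuming the per-word encoder states and tags have already been dumped to disk; the file names and the choice of classifier are illustrative, not the actual setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1 (assumed done): run the trained NMT model over tagged data and save
# one feature vector per word, e.g. the encoder hidden state at that position.
train_feats = np.load("encoder_states_train.npy")  # (num_words, hidden_dim)
train_tags = np.load("morph_tags_train.npy")       # (num_words,)
test_feats = np.load("encoder_states_test.npy")
test_tags = np.load("morph_tags_test.npy")

# Step 2: train a simple classifier on the frozen representations and use its
# accuracy on the extrinsic task as a measure of what the NMT model has learned.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_feats, train_tags)
print("morphological tagging accuracy:",
      accuracy_score(test_tags, probe.predict(test_feats)))
```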

60

61 of 86

Analysis of Neural Machine Translation

Methodology:

61

62 of 86

Analysis of Neural Machine Translation

Hypothesis:

  • The performance of the classifier is a quantitative measure of how well suited the representations are for the task at hand

62

63 of 86

Analysis of Neural Machine Translation

Results:

  • German-, French-, Czech-, Arabic-English, Arabic-Hebrew
  • Encoder vs. Decoder
    • The encoder learns various source language phenomena well, while the decoder learns comparatively little about the target language
    • The encoder and attention mechanism take the load off the decoder
    • The decoder’s role might be limited to being a good language model
  • Lower layers of the encoder learn morphological information, while higher layers learn syntax and semantics of the language
  • Character-based representations model infrequent words better than word-based representations

63

64 of 86

Analysis of Neural Machine Translation

Results:

  • Translation quality does not reflect the amount of linguistic information the model has learned (English-English translation experiment)
  • For a more difficult target language, the encoder learns the morphology of the source language better

64

65 of 86

Machine Translation

  • Machine Translation Through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

65

66 of 86

Improving Neural Decoder using Multitask Learning

  • The decoder learns little about target language morphology (encoder 90% vs. decoder 45%)
  • Learning target language morphology may help improve translation into morphologically rich languages

In this work:

  • Injected morphology into the decoder to facilitate translation into morphologically rich languages
  • Published @ IJCNLP2017

66

67 of 86

Improving Neural Decoder using Multitask Learning

  • Train a neural MT system in a multitask setting
  • Given a set of m tasks, the objective function minimizes the overall loss, which is a weighted combination of the m individual task losses

  • For a training pair with source (s), target (t) and task (m), a hyperparameter controls the balance between the translation and morphology prediction tasks
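For the two-task case of translation plus morphology prediction, one standard way to write such a weighted objective is the following; the symbol λ and the two-task form are illustrative, and the general m-task case is the analogous weighted sum:

```latex
% Loss for a training pair (s, t): weighted combination of the translation loss
% and the morphology prediction loss, balanced by the hyperparameter lambda.
\mathcal{L}(s, t) = \lambda\, \mathcal{L}_{\text{translation}}(s, t)
                  + (1 - \lambda)\, \mathcal{L}_{\text{morphology}}(s, t)
```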

67

68 of 86

Improving Neural Decoder using Multitask Learning

Results:

  • English-German, English-Czech
  • Injecting morphology in the decoder helps to improve the translation of morphologically rich languages
  • Translation into morphologically poor languages does not benefit from this

68

69 of 86

Machine Translation

  • Machine Translation Through Transliteration
  • Domain Adaptation for Machine Translation
  • Analysis of Neural Machine Translation
  • Improving Neural Decoder using Multitask Learning
  • Machine Translation Competitions

69

70 of 86

Semi-supervised Model for Transliteration Mining

  • Assume there is a small amount of labeled data available to support the training
  • We smooth the labeled data probability estimates with the unlabeled data probability estimates

The smoothed estimate combines the labeled data counts of each character alignment with the unlabeled data probability estimates, taking into account the number of character alignment types observed in the Viterbi alignment of the labeled data
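One common way to realize such smoothing, shown purely as an illustration (the paper's exact estimator may differ), is additive smoothing of the labeled-data counts with the unlabeled-data distribution:

```latex
% n_s(q): labeled-data count of character alignment q
% p_u(q): unlabeled-data probability estimate of q
% N_s:    total labeled-data count; eta: smoothing weight
\hat{p}(q) = \frac{n_s(q) + \eta\, p_u(q)}{N_s + \eta}
```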

70

71 of 86

Model for Transliteration Mining

  • Training
    • Parameters of the transliteration model are learned during EM training
    • Parameters of the non-transliteration model are trained once using monolingual language models

  • Summary
    • Russian-English, Arabic-English, Tamil-English, Hindi-English
    • Our unsupervised method outperformed semi-supervised and supervised models for three out of four language pairs
    • Integration into Moses showed an improvement of up to 1 BLEU point
    • Integration into word alignment showed an absolute gain of 14% in F-measure

71

72 of 86

Miscellaneous Projects

  • Tutorials on deep learning
    • Deep learning for machine translation (course material), DGfS Fall school (Sept. 2017)
    • From theory to practice: deep learning for NLP, (April 2018), University of Duisburg-Essen, Germany
  • Finding lexical variations in informal text
    • Presented a clustering framework that uses phonetic features, substring matches and contextual information to find lexical variations under resource poor conditions
    • Published @ EMNLP 2016, CL (second round of review)
  • Query refinement in a vertical search
    • Proposed a method that, given a user query, refines the search space for efficient vertical search
    • Published @ COLING 2012, Patent

72

73 of 86

In this talk

  • Machine Translation
    • Machine translation through transliteration
    • Domain adaptation for machine translation
    • Analysis of neural machine translation
    • Improving neural decoder using multitask learning
    • Competitions
  • Model for Transliteration Mining
  • Miscellaneous Projects
  • Potential Research Directions

73

74 of 86

Miscellaneous Projects

  • QCRI educational domain parallel corpus
    • Crawled, cleaned and compiled the first parallel corpus focused on educational content such as math, physics, and chemistry (in 20 languages)
    • Part of the IWSLT 2016 evaluation campaign
    • Published @ LREC 2014
  • Part of speech tagging of Urdu
    • Proposed a tagset for Urdu, annotated a dataset of 100,000 words, and presented an empirical study on part-of-speech tagging of Urdu
    • Published @ EACL 2009
  • Rapid classification in a crisis scenario using deep learning
    • Presented a CNN-based method for the rapid classification of tweets in crisis situations such as earthquakes and floods
    • Published @ ICWSM 2017

74


77 of 86

Domain Adaptation for MT

In this work:

  • Bilingual Neural Language model
    • Neural domain adapted model
    • Fusion model
    • Published @ EMNLP 2015, COLING 2016, CSL 2017
  • Neural machine translation
    • Empirical study under various training scenarios
    • Published @ IWSLT 2017

77

78 of 86

Multi-domain Training Scenario for Neural MT

  • Several domains, such as TED, OPUS, News, and UN
  • What are the best strategies to build an optimal neural MT system?
    • Concatenation, Stacking, Selection, Ensemble

78


80 of 86

Multi-domain Training Scenario for Neural MT

  • Summary
    • Arabic-English and German-English language pairs
    • A system trained on the concatenation of all data and then fine-tuned on the in-domain data performs best
    • In contrast to phrase-based systems, data selection hurts NMT performance
    • An ensemble of separately trained models did not perform well
    • Model stacking works well when training proceeds from the out-of-domain data farthest from the in-domain data to the closest
    • Fine-tuning on a diverse data set results in a robust model

80

81 of 86

Unsupervised Model for Transliteration Mining

Transliteration mining model is defined as a mixture of a transliteration model and a non-transliteration model

p(e, f) = (1 - λ) p_tr(e, f) + λ p_ntr(e, f)

where λ is the prior probability of non-transliteration, p_tr is the transliteration model, and p_ntr is the non-transliteration model

81

82 of 86

Unsupervised Model for Transliteration Mining

Transliteration mining model is defined as a mixture of a transliteration model and a non-transliteration model

p(e, f) = (1 - λ) p_tr(e, f) + λ p_E(e) p_F(f)

where λ is the prior probability of non-transliteration, and p_E(e) and p_F(f) are the character language model probabilities of the non-transliteration model

82

83 of 86

Neural Domain Adaptation Model

Method 1: Give higher weight to word sequences that are favored by the in-domain data

The loss uses the probability of each training instance according to the in-domain model together with the probability assigned by the adapted model
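A plausible instance-weighted form of this idea, with symbols chosen for illustration rather than taken from the paper: the weight of a training instance grows with its probability under the in-domain model.

```latex
% p_I: in-domain model, p_A: adapted model being trained, x_i: training instance
\mathcal{L}(\theta) = -\sum_{i} w_i \log p_A(x_i \mid \theta),
\qquad w_i \propto p_I(x_i)
```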

83

84 of 86

Neural Domain Adaptation Model

Method 2: Additionally penalize sequences that are favored by the out-of-domain data

The penalty uses the probability of the training instance according to the out-of-domain model
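Continuing the same illustrative notation, the additional penalty can be sketched as weights that also shrink for instances the out-of-domain model assigns high probability to:

```latex
% p_O: out-of-domain model; instances favored by the in-domain model and
% disfavored by the out-of-domain model receive the largest weights.
w_i \propto \frac{p_I(x_i)}{p_O(x_i)}
```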

84

85 of 86

Neural Domain Adaptation Model

Method 3:

  • Train separate in-domain and out-of-domain models
  • Adjust their parameters in favor of the in-domain data

85

86 of 86

Machine Translation

  • Domain Adaptation for Machine Translation
  • Practical machine translation

A few other notable projects:

  • Machine Translation Through Transliteration (@ ACL 2010)
  • Challenging Language Dependent Segmentation of Arabic (@ ACL 2017)
  • Machine Translation Evaluation using Eye-tracking (@ NAACL 2016)
  • Improving Neural Decoder using Multitask Learning (@ IJCNLP 2017)
  • Translating Dialectal Arabic (@ ACL 2013)

86