
Quality Estimation for Machine Translation

Fabio Kepler

Unbabel AI

August 2019

BUILDING UNIVERSAL UNDERSTANDING


Outline

  • Definition
  • Models
  • Examples and demonstration
  • WMT19 QE shared task
  • Future approaches


Why Quality Estimation?


Is Machine Translation Solved?


“The AI Research team wobbles!”

We still need humans in the loop


MT Quality

What could we do if we knew the quality of a translation?

  • If it is good, we can skip the human post-editor (+speed, -$)
  • Otherwise, we can at least highlight the parts that are wrong
  • Either way, it helps ensure final quality (higher MQM scores; MQM = Multidimensional Quality Metrics)


Example: Unbabel’s Pipeline


Definition


MT Quality Estimation (QE)

  • Use a separate system to estimate how good a translation is
    • Coming from a black-box MT system
  • With no access to a reference translation
  • With different levels of granularity:
    • Word
    • Sentence
    • Document


Datasets

  • QE data requires a triplet:
    • SOURCE: text in the source language
    • MT: its machine translation in the target language
    • PE: a human post-edit of the MT
  • SOURCE and MT are the inputs
  • PE is used to compute the gold targets, depending on the QE level:
    • Word: OK and BAD tags
    • Sentence: HTER (Human-targeted Translation Edit Rate) score


MT: I really like Machine Translation
PE: I love Machine Translation !

Word-level tags (on the MT output):
  I → OK, really → BAD (deleted), like → BAD (replaced by “love”),
  Machine → OK, Translation → OK, plus a BAD gap tag for the inserted “!”

Sentence-level score:
  HTER = edit distance / number of PE words = 3 / 5 = 0.6
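A worked version of this computation, as a minimal Python sketch (our own code, not OpenKiwi’s; real HTER is based on TER, which also counts block shifts, while this simplification uses plain word-level Levenshtein distance):

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic program over word lists.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

mt = "I really like Machine Translation".split()
pe = "I love Machine Translation !".split()
print(edit_distance(mt, pe) / len(pe))  # 3 edits / 5 PE words = 0.6
```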


WMT QE Shared Task

  • The main venue for QE results (held yearly)
  • The provided MT system is a black box
    • It can be SMT or NMT; since 2019, NMT only
    • Language pairs change each year, with English-German always present


WMT QE Shared Task

  • The main word-level metric is
    • F1-Mult = F1-OK * F1-BAD, the product of the F1 scores of the OK and BAD classes
    • Measured on TARGET tags (plus GAP tags) and SOURCE tags
  • The main sentence-level metric is Pearson correlation
    • With Spearman as a secondary one (for ranking); both are sketched below
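A quick sketch of how these metrics can be computed with scikit-learn and scipy (toy arrays, not WMT data; the helper name f1_mult is ours):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score

def f1_mult(gold, pred):
    # Product of the F1 scores of the OK (1) and BAD (0) classes.
    return f1_score(gold, pred, pos_label=1) * f1_score(gold, pred, pos_label=0)

# Toy word-level tags (1 = OK, 0 = BAD)
gold_tags = np.array([1, 1, 0, 1, 0, 0, 1, 1])
pred_tags = np.array([1, 0, 0, 1, 0, 1, 1, 1])
print("F1-Mult:", f1_mult(gold_tags, pred_tags))

# Toy sentence-level HTER scores
gold_hter = np.array([0.0, 0.6, 0.2, 1.0])
pred_hter = np.array([0.1, 0.5, 0.3, 0.8])
print("Pearson:", pearsonr(gold_hter, pred_hter)[0])
print("Spearman:", spearmanr(gold_hter, pred_hter)[0])
```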


Models


QUETCH: QUality Estimation from ScraTCH

  • The first neural model for QE
  • A very simple architecture
    • SOURCE embeddings are aligned and concatenated to TARGET embeddings (see the sketch below)


Kreutzer, J., Schamoni, S., & Riezler, S. (2015). “QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation.” WMT@EMNLP.
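A minimal PyTorch sketch of the core idea (our simplification, not the authors’ code: real QUETCH classifies a context window around each word, while here a single aligned pair is used):

```python
import torch
import torch.nn as nn

class QuetchLike(nn.Module):
    """For each target word: concatenate its embedding with its aligned
    source word's embedding and classify OK/BAD with a small MLP."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=100):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 2),  # logits for OK/BAD
        )

    def forward(self, src_ids, tgt_ids, alignment):
        # src_ids: (src_len,), tgt_ids: (tgt_len,),
        # alignment: (tgt_len,) source index aligned to each target word
        src = self.src_emb(src_ids)
        tgt = self.tgt_emb(tgt_ids)
        pairs = torch.cat([src[alignment], tgt], dim=-1)
        return self.mlp(pairs)  # (tgt_len, 2) OK/BAD logits
```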


NuQE: Neural Quality Estimation

  • Deeper version of QUETCH
    • With recurrent layers
  • Also needs to align SOURCE to TARGET
  • Can additionally use POS tags as input
  • First used in Unbabel’s winning participation in WMT16


[Architecture diagram: source/target word embeddings (3 x 64) and POS embeddings (3 x 50) → 2 x FF (400) → BiGRU (100) → 2 x FF (200) → BiGRU (200) → 2 x FF (100 + 50) → softmax over OK/BAD]

Martins, A.F., Astudillo, R.F., Hokamp, C., & Kepler, F. (2016). “Unbabel's Participation in the WMT16 Word-Level Translation Quality Estimation Shared Task.” WMT16.


Linear Model

  • A first-order linear model incorporating rich features:
    • n-grams, POS tags, syntactic dependencies
  • Can be used for ensembling
    • By additionally stacking predictions from different models as features
    • This made it the winning system in WMT16


APE for QE

  • Predictions from an APE (Automatic Post-Editing) system can be used to generate QE word tags
    • Or predictions from a different (better) MT system


SOTA 2016-2017

  • Predictions from the winning system of the 2016 APE shared task (Junczys-Dowmunt’s AMU system) stacked as features into the WMT16 linear model
  • State of the art in 2017 Q1
  • Runner-up (~2nd) in WMT17


Predictor-Estimator

  • The winning system in WMT17 (from POSTECH)
  • Uses a two-stage neural model that allows pre-training on large parallel data
    • In a “deep contextualized language model” fashion
    • A year before any Muppet model appeared


Kim, H., Lee, J.-H., & Na, S.-H. (2017). “Predictor-Estimator Using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation.” WMT17@ACL.


Predictor-Estimator

  • The predictor module is pretrained on parallel corpora
    • Predicting every token on the TARGET side given its left and right context produced by two uni-directional LSTMs
  • The estimator module is trained in a finetuning step
    • Estimating word- and sentence-level scores from the input embedded by the predictor module
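A rough sketch of the predictor’s pre-training objective (our own code, not the POSTECH implementation; the real predictor also encodes the SOURCE and attends to it, which is omitted here):

```python
import torch
import torch.nn as nn

class PredictorSketch(nn.Module):
    """Predict each TARGET token from the forward LSTM state over its left
    context and the backward LSTM state over its right context."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, target):  # target: (batch, seq_len) token ids
        emb = self.embed(target)
        left, _ = self.fwd(emb)                    # states after tokens 1..i
        right, _ = self.bwd(torch.flip(emb, [1]))  # run right-to-left
        right = torch.flip(right, [1])             # states after tokens i..n
        # Shift so position i sees only tokens < i (left) and > i (right);
        # the wrap-around at the sequence edges would be masked in a real model.
        left = torch.roll(left, 1, dims=1)
        right = torch.roll(right, -1, dims=1)
        return self.out(torch.cat([left, right], dim=-1))  # per-token logits
```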


Predictor



ELMo


Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning), Jay Alammar, 2019.

22 of 53

BERT


Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning), Jay Alammar, 2019.


Estimator


Unfortunately, there is no reference implementation for any of these models


Implemented Models


OpenKiwi scoreboard

  • Better at word level
  • Slightly behind at sentence level
  • Using an order of magnitude less compute
    • The Predictor-Estimator is pre-trained on only 3M in-domain sentence pairs


OpenKiwi toolkit

Goals

  • Facilitate the research 🔄 production feedback loop
  • Serve as a foundation for future research
  • Lower the barrier to entry for the task


In a Nutshell

  • A single framework covering several WMT winning systems
  • State-of-the-art results
  • Easy-to-use API for training and inference
  • Extensive documentation
  • Modular design for easy extensibility
    • Battle-tested in this year’s shared task submission


OpenKiwi production-easy


  • Tested (76% coverage)

  • Documented


OpenKiwi production-easy


  • Simple usage as a Python package (see the sketch below)
  • Easy training of models with any data
    • Such as Unbabel’s proprietary client data
  • Easy usage of any pre-trained model
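Roughly, package usage looks like this (a sketch based on the OpenKiwi 0.x API as we recall it; the paths, the YAML file, and the exact input format are placeholders, so check the documentation):

```python
import kiwi

# Train a model from a YAML configuration file (placeholder path).
kiwi.train('experiments/train_estimator.yaml')

# Load a trained (or downloaded pre-trained) model and run inference.
model = kiwi.load_model('path/to/model')
examples = [{
    'source': 'This is a simple sentence .',
    'target': "C' est une phrase simple .",
}]
predictions = model.predict(examples)
```

A command-line interface with train and predict pipelines mirrors this API, which covers the train-and-predict-in-one-go usage on the next slide.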


OpenKiwi production-easy

Or train and predict in one go


Simple example

Source

This is a simple sentence .

MT

C’ est une phrase simple .


BAD Example

Source

This is a simple sentence .

MT

C' est une phrase simple qui ajoute beaucoup de mots inutiles .


Demonstration

(not publicly available yet)


Example: Unbabel’s Pipeline


WMT19 QE Shared Task


Surfing the wave

  • Given the great modularity provided by OpenKiwi
  • We exploited the similarity between the current wave of Muppet Models® and the two-stage Predictor-Estimator approach


Predictor Flavors

  • We created several variants, replacing the Predictor with various pretrained models:
    • PredEst-RNN: the original bi-LSTM Predictor-Estimator as implemented in OpenKiwi
    • PredEst-Trans: a Transformer-based version, as implemented by Alibaba (2018)
    • PredEst-BERT: the pretrained multilingual BERT as the Predictor
    • PredEst-XLM: the pretrained XLM as the Predictor


Validation Results: English-German

  • XLM provided the best single model
  • Not that much improvement over the plain Predictor pre-trained with Transformers
    • On about 3M sentence pairs of in-domain parallel data
  • NuQE and the linear model are the only ones that use no extra data (besides POS tags)
    • They lag behind the weakest Predictor-Estimator by almost 5 points


Validation Results: English-Russian

  • BERT and XLM performed considerably better
  • Most probably because the English-Russian in-domain parallel corpus for pre-training the Predictor was very noisy
  • The QE data is also much more skewed than for English-German
    • A large majority of sentences have HTER 0
    • A good number have HTER 1
    • And very few are in between


Back to APE for QE

  • APE or MT outputs can be treated as a surrogate PE and used to derive word-level tags and sentence-level scores (see the sketch below)
    • PSEUDO-APE: an off-the-shelf translation system (OpenNMT)
    • APE-BERT: an APE system built on BERT (Unbabel’s winning system in the APE shared task)
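A toy sketch of the tag-extraction idea (the function name is ours, and difflib is a stand-in for a proper TER-style aligner):

```python
from difflib import SequenceMatcher

def surrogate_tags(mt_tokens, pseudo_pe_tokens):
    """Tag MT tokens OK when they survive into the pseudo post-edit, BAD otherwise."""
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=pseudo_pe_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags

mt = "I really like Machine Translation".split()
pe = "I love Machine Translation !".split()
print(surrogate_tags(mt, pe))  # ['OK', 'BAD', 'BAD', 'OK', 'OK']
```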


Ensemble methods

Word level

  • Learn a convex combination of model predictions via Powell’s conjugate direction method (a variant of coordinate descent), optimizing the F1-Mult score on the dev set (sketched below)
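A sketch of this combination step with toy stand-in data (our own code; preds holds each model’s BAD probabilities on the dev set, and f1_mult is the metric from the earlier snippet):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def f1_mult(gold, pred):
    return f1_score(gold, pred, pos_label=1) * f1_score(gold, pred, pos_label=0)

# Toy stand-ins: 3 models' BAD probabilities for 8 dev-set words, plus gold tags.
preds = np.random.rand(3, 8)
gold = np.array([1, 1, 0, 1, 0, 0, 1, 1])

def neg_f1_mult(w):
    w = np.abs(w) / np.abs(w).sum()      # force a convex combination of models
    bad_prob = w @ preds                 # weighted average of model outputs
    tags = (bad_prob < 0.5).astype(int)  # 1 = OK, 0 = BAD
    return -f1_mult(gold, tags)

x0 = np.full(len(preds), 1.0 / len(preds))  # start from uniform weights
result = minimize(neg_f1_mult, x0, method='Powell')
weights = np.abs(result.x) / np.abs(result.x).sum()
```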

Sentence level

  • Perform L2-regularized regression on the dev set, with model outputs as features
  • Choose the best regularization constant via 20-fold cross-validation (sketched below)
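The sentence-level counterpart, again with toy stand-in data (our own code, using scikit-learn’s RidgeCV; each feature column is one model’s predicted HTER):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Toy stand-ins: 100 dev sentences scored by 5 models, plus gold HTER scores.
X_dev = np.random.rand(100, 5)
y_dev = np.random.rand(100)

alphas = np.logspace(-3, 3, 13)  # candidate regularization constants
ridge = RidgeCV(alphas=alphas, cv=20).fit(X_dev, y_dev)
ensemble_hter = ridge.predict(np.random.rand(10, 5))  # scores for new sentences
```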


Dev Set Results: English-{German, Russian}


Official Results: Word Level


Official Results: Sentence Level



History


Key Takeaways

  • Model diversity can be just as important for ensembling as individual model performance
  • A smart ensembling strategy is key to scaling to many models whose individual performance varies widely


Key Takeaways

  • Having a modular QE framework to build upon was key for quick experimentation with:
    • Submodels
    • Varying architectures
    • Hyper-parameter search
  • The Muppet models are yet again successful in a transfer-learning task
    • But they are very brittle in how they are fine-tuned and used


Reference

Kepler, F., Trénous, J., Treviso, M., Vera, M., Góis, A., Farajian, M.A., Lopes, A.V. and Martins, A.F. “Unbabel’s Participation in the WMT19 Translation Quality Estimation Shared Task.” WMT19@ACL.


Finally


Future Directions

  • How to give QE and MT calibrated probabilities, and how the two relate
  • How to effectively build QE ensembles for use in production
  • What to do when QE has access to the MT model (glass-box QE)
    • How to avoid having two large models trained on similar amounts of data
    • How to make them learn jointly


Thanks!

kepler@unbabel.com