
Quality Estimation for Machine Translation

Fabio Kepler

Unbabel AI

August 2019

BUILDING UNIVERSAL UNDERSTANDING


Outline

  • Definition
  • Models
  • Examples and demonstration
  • WMT19 QE shared task
  • Future approaches


Why Quality Estimation?


Is Machine Translation Solved?


“The AI Research team wobbles!”

We still need humans in the loop


MT Quality

What could we do if we knew the quality of a translation?

  • If it is good, we can skip the human post-editor (+speed, -$)
  • Otherwise, we can at least highlight the parts that are wrong
  • Either way, it helps ensure final quality (higher MQM scores; MQM = Multidimensional Quality Metrics)


Example: Unbabel’s Pipeline


Definition


MT Quality Estimation (QE)

  • Use a separate system to estimate how good a translation is
    • Coming from a black-box MT system
  • With no access to a reference translation
  • With different levels of granularity:
    • Word
    • Sentence
    • Document


Datasets

  • QE data requires a triplet:
    • SOURCE: text in the source language
    • MT: its machine translation in the target language
    • PE: a human post-edit of the MT
  • SOURCE and MT are the inputs
  • PE is used to compute the gold targets, depending on the QE level:
    • Word: OK and BAD tags
    • Sentence: HTER (Human-targeted Translation Edit Rate) score


MT: I really like Machine Translation
PE: I love Machine Translation !

Word-level tags (on the MT output):
  I → OK, really → BAD (deleted), like → BAD (replaced by “love”),
  Machine → OK, Translation → OK, plus a BAD gap tag for the inserted “!”

Sentence-level score:
  HTER = edit distance / number of PE words = 3 / 5 = 0.6
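A worked version of this computation, as a minimal Python sketch (our own code, not OpenKiwi’s; real HTER is based on TER, which also counts block shifts, while this simplification uses plain word-level Levenshtein distance):

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic program over word lists.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

mt = "I really like Machine Translation".split()
pe = "I love Machine Translation !".split()
print(edit_distance(mt, pe) / len(pe))  # 3 edits / 5 PE words = 0.6
```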


WMT QE Shared Task

  • The main venue for QE results (held yearly)
  • The provided MT system is a black box
    • It can be SMT or NMT; since 2019, NMT only
    • Language pairs change each year, with English-German always present


WMT QE Shared Task

  • The main word-level metric is
    • F1-Mult = F1-OK * F1-BAD, the product of the F1 scores of the OK and BAD classes
    • Measured on TARGET tags (plus GAP tags) and SOURCE tags
  • The main sentence-level metric is Pearson correlation
    • With Spearman as a secondary one (for ranking); both are sketched below
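A quick sketch of how these metrics can be computed with scikit-learn and scipy (toy arrays, not WMT data; the helper name f1_mult is ours):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score

def f1_mult(gold, pred):
    # Product of the F1 scores of the OK (1) and BAD (0) classes.
    return f1_score(gold, pred, pos_label=1) * f1_score(gold, pred, pos_label=0)

# Toy word-level tags (1 = OK, 0 = BAD)
gold_tags = np.array([1, 1, 0, 1, 0, 0, 1, 1])
pred_tags = np.array([1, 0, 0, 1, 0, 1, 1, 1])
print("F1-Mult:", f1_mult(gold_tags, pred_tags))

# Toy sentence-level HTER scores
gold_hter = np.array([0.0, 0.6, 0.2, 1.0])
pred_hter = np.array([0.1, 0.5, 0.3, 0.8])
print("Pearson:", pearsonr(gold_hter, pred_hter)[0])
print("Spearman:", spearmanr(gold_hter, pred_hter)[0])
```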


Models


QUETCH: QUality Estimation from ScraTCH

  • The first neural model for QE
  • A very simple architecture
    • SOURCE embeddings are aligned and concatenated to TARGET embeddings (see the sketch below)


Kreutzer, J., Schamoni, S., & Riezler, S. (2015). “QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation.” WMT@EMNLP.
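A minimal PyTorch sketch of the core idea (our simplification, not the authors’ code: real QUETCH classifies a context window around each word, while here a single aligned pair is used):

```python
import torch
import torch.nn as nn

class QuetchLike(nn.Module):
    """For each target word: concatenate its embedding with its aligned
    source word's embedding and classify OK/BAD with a small MLP."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=100):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 2),  # logits for OK/BAD
        )

    def forward(self, src_ids, tgt_ids, alignment):
        # src_ids: (src_len,), tgt_ids: (tgt_len,),
        # alignment: (tgt_len,) source index aligned to each target word
        src = self.src_emb(src_ids)
        tgt = self.tgt_emb(tgt_ids)
        pairs = torch.cat([src[alignment], tgt], dim=-1)
        return self.mlp(pairs)  # (tgt_len, 2) OK/BAD logits
```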


NuQE: Neural Quality Estimation

  • Deeper version of QUETCH
    • With recurrent layers
  • Also needs to align SOURCE to TARGET
  • Can additionally use POS tags as input
  • First used in Unbabel’s winning participation in WMT16


[Architecture diagram: source/target word embeddings (3 x 64) and POS embeddings (3 x 50) → 2 x FF (400) → BiGRU (100) → 2 x FF (200) → BiGRU (200) → 2 x FF (100 + 50) → softmax over OK/BAD]

Martins, A.F., Astudillo, R.F., Hokamp, C., & Kepler, F. (2016). “Unbabel's Participation in the WMT16 Word-Level Translation Quality Estimation Shared Task.” WMT16.


Linear Model

  • A first-order linear model incorporating rich features:
    • n-grams, POS tags, syntactic dependencies
  • Can be used for ensembling
    • By additionally stacking predictions from different models as features
    • This made it the winning system in WMT16


APE for QE

  • Predictions from an APE (Automatic Post-Editing) system can be used to generate QE word tags
    • Or predictions from a different (better) MT system


SOTA 2016-2017

  • Predictions from the winning system of the 2016 APE shared task (Junczys-Dowmunt’s AMU system) stacked as features into the WMT16 linear model
  • State of the art in 2017 Q1
  • Runner-up (~2nd) in WMT17


Predictor-Estimator

  • The winning system in WMT17 (from POSTECH)
  • Uses a two-stage neural model that allows pre-training on large parallel data
    • In a “deep contextualized language model” fashion
    • A year before any Muppet model appeared


Kim, H., Lee, J.-H., & Na, S.-H. (2017). “Predictor-Estimator Using Multilevel Task Learning with Stack Propagation for Neural Quality Estimation.” WMT17@ACL.


Predictor-Estimator

  • The predictor module is pretrained on parallel corpora
    • Predicting every token on the TARGET side given its left and right context produced by two uni-directional LSTMs
  • The estimator module is trained in a finetuning step
    • Estimating word- and sentence-level scores from the input embedded by the predictor module
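A rough sketch of the predictor’s pre-training objective (our own code, not the POSTECH implementation; the real predictor also encodes the SOURCE and attends to it, which is omitted here):

```python
import torch
import torch.nn as nn

class PredictorSketch(nn.Module):
    """Predict each TARGET token from the forward LSTM state over its left
    context and the backward LSTM state over its right context."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, target):  # target: (batch, seq_len) token ids
        emb = self.embed(target)
        left, _ = self.fwd(emb)                    # states after tokens 1..i
        right, _ = self.bwd(torch.flip(emb, [1]))  # run right-to-left
        right = torch.flip(right, [1])             # states after tokens i..n
        # Shift so position i sees only tokens < i (left) and > i (right);
        # the wrap-around at the sequence edges would be masked in a real model.
        left = torch.roll(left, 1, dims=1)
        right = torch.roll(right, -1, dims=1)
        return self.out(torch.cat([left, right], dim=-1))  # per-token logits
```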


Predictor



ELMo


Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning), Jay Alammar, 2019.

22 of 53

BERT


Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning), Jay Alammar, 2019.


Estimator


Unfortunately, there is no reference implementation for any of these models


Implemented Models


OpenKiwi scoreboard

  • Better at word level
  • Slightly behind at sentence level
  • Using an order of magnitude less compute
    • The Predictor-Estimator is pre-trained on only 3M in-domain sentence pairs


OpenKiwi toolkit

Goals

  • Facilitate the research 🔄 production feedback loop
  • Serve as a foundation for future research
  • Lower the barrier to entry for the task


In a Nutshell

  • A single framework covering several WMT winning systems
  • State-of-the-art results
  • Easy-to-use API for training and inference
  • Extensive documentation
  • Modular design for easy extensibility
    • Battle-tested in this year’s shared task submission


OpenKiwi production-easy


  • Tested (76% coverage)

  • Documented


OpenKiwi production-easy


  • Simple usage as a Python package (see the sketch below)
  • Easy training of models with any data
    • Such as Unbabel’s proprietary client data
  • Easy usage of any pre-trained model
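Roughly, package usage looks like this (a sketch based on the OpenKiwi 0.x API as we recall it; the paths, the YAML file, and the exact input format are placeholders, so check the documentation):

```python
import kiwi

# Train a model from a YAML configuration file (placeholder path).
kiwi.train('experiments/train_estimator.yaml')

# Load a trained (or downloaded pre-trained) model and run inference.
model = kiwi.load_model('path/to/model')
examples = [{
    'source': 'This is a simple sentence .',
    'target': "C' est une phrase simple .",
}]
predictions = model.predict(examples)
```

A command-line interface with train and predict pipelines mirrors this API, which covers the train-and-predict-in-one-go usage on the next slide.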


OpenKiwi production-easy

Or train and predict in one go


Simple example

Source

This is a simple sentence .

MT

C’ est une phrase simple .


BAD Example

Source

This is a simple sentence .

MT

C' est une phrase simple qui ajoute beaucoup de mots inutiles .


Demonstration

(not publicly available yet)


Example: Unbabel’s Pipeline


WMT19 QE Shared Task


Surfing the wave

  • Given the great modularity provided by OpenKiwi
  • We exploited the similarity between the current wave of Muppet Models® and the two-stage Predictor-Estimator approach


Predictor Flavors

  • We created several variants, replacing the Predictor with various pretrained models:
    • PredEst-RNN: the original bi-LSTM Predictor-Estimator as implemented in OpenKiwi
    • PredEst-Trans: a Transformer-based version, as implemented by Alibaba (2018)
    • PredEst-BERT: the pretrained multilingual BERT as the Predictor
    • PredEst-XLM: the pretrained XLM as the Predictor


Validation Results: English-German

  • XLM provided the best single model
  • Not that much improvement over the plain Predictor pre-trained with Transformers
    • On about 3M sentence pairs of in-domain parallel data
  • NuQE and the linear model are the only ones that use no extra data (besides POS tags)
    • They lag behind the weakest Predictor-Estimator by almost 5 points


Validation Results: English-Russian

  • BERT and XLM performed considerably better
  • Most probably because the English-Russian in-domain parallel corpus for pre-training the Predictor was very noisy
  • The QE data is also much more skewed than for English-German
    • A large majority of sentences have HTER 0
    • A good number have HTER 1
    • And very few are in between


Back to APE for QE

  • APE or MT outputs can be treated as a surrogate PE and used to derive word-level tags and sentence-level scores (see the sketch below)
    • PSEUDO-APE: an off-the-shelf translation system (OpenNMT)
    • APE-BERT: an APE system built on BERT (Unbabel’s winning system in the APE shared task)
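A toy sketch of the tag-extraction idea (the function name is ours, and difflib is a stand-in for a proper TER-style aligner):

```python
from difflib import SequenceMatcher

def surrogate_tags(mt_tokens, pseudo_pe_tokens):
    """Tag MT tokens OK when they survive into the pseudo post-edit, BAD otherwise."""
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=pseudo_pe_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags

mt = "I really like Machine Translation".split()
pe = "I love Machine Translation !".split()
print(surrogate_tags(mt, pe))  # ['OK', 'BAD', 'BAD', 'OK', 'OK']
```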


Ensemble methods

Word level

  • Learn a convex combination of model predictions via Powell’s conjugate direction method (a variant of coordinate descent), optimizing the F1-Mult score on the dev set (sketched below)
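A sketch of this combination step with toy stand-in data (our own code; preds holds each model’s BAD probabilities on the dev set, and f1_mult is the metric from the earlier snippet):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def f1_mult(gold, pred):
    return f1_score(gold, pred, pos_label=1) * f1_score(gold, pred, pos_label=0)

# Toy stand-ins: 3 models' BAD probabilities for 8 dev-set words, plus gold tags.
preds = np.random.rand(3, 8)
gold = np.array([1, 1, 0, 1, 0, 0, 1, 1])

def neg_f1_mult(w):
    w = np.abs(w) / np.abs(w).sum()      # force a convex combination of models
    bad_prob = w @ preds                 # weighted average of model outputs
    tags = (bad_prob < 0.5).astype(int)  # 1 = OK, 0 = BAD
    return -f1_mult(gold, tags)

x0 = np.full(len(preds), 1.0 / len(preds))  # start from uniform weights
result = minimize(neg_f1_mult, x0, method='Powell')
weights = np.abs(result.x) / np.abs(result.x).sum()
```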

Sentence level

  • Perform L2-regularized regression on the dev set, with model outputs as features
  • Choose the best regularization constant via 20-fold cross-validation (sketched below)
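The sentence-level counterpart, again with toy stand-in data (our own code, using scikit-learn’s RidgeCV; each feature column is one model’s predicted HTER):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Toy stand-ins: 100 dev sentences scored by 5 models, plus gold HTER scores.
X_dev = np.random.rand(100, 5)
y_dev = np.random.rand(100)

alphas = np.logspace(-3, 3, 13)  # candidate regularization constants
ridge = RidgeCV(alphas=alphas, cv=20).fit(X_dev, y_dev)
ensemble_hter = ridge.predict(np.random.rand(10, 5))  # scores for new sentences
```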


Dev Set Results: English-{German, Russian}


Official Results: Word Level


Official Results: Sentence Level



History


Key Takeaways

  • Model diversity can be just as important for ensembling as individual model performance
  • A smart ensembling strategy is key to scaling to many models whose individual performance varies widely


Key Takeaways

  • Having a modular QE framework to build upon was key for quick experimentation with:
    • Submodels
    • Varying architectures
    • Hyper-parameter search
  • The Muppet models are yet again successful in a transfer-learning task
    • But they are very brittle in how they are fine-tuned and used


Reference

Kepler, F., Trénous, J., Treviso, M., Vera, M., Góis, A., Farajian, M.A., Lopes, A.V. and Martins, A.F. “Unbabel’s Participation in the WMT19 Translation Quality Estimation Shared Task.” WMT19@ACL.


Finally


Future Directions

  • How to give QE and MT calibrated probabilities, and how the two relate
  • How to effectively build QE ensembles for use in production
  • What to do when QE has access to the MT model (glass-box QE)
    • How to avoid having two large models trained on similar amounts of data
    • How to make them learn jointly


Thanks!

kepler@unbabel.com