1 of 27

Metrics for Reliable AI-based Translation

Marine Carpuat

Sweta Agrawal

Eleftheria Briakou

2 of 27

Introduction(s)

  1. Your name and affiliation
  2. What languages do you speak?
  3. How often do you use commercial MT systems, and for what purposes? (Share any MT usage experience you'd like.)
  4. What do you aim to learn from this sprint?

3 of 27

Timeline

Friday

Literature Review

Saturday I

Annotation + Analysis

Saturday II

Quality Estimation Metrics + (Optional) Annotation II

Sunday

Final Presentation

4 of 27

Discussion - Literature Review + Takeaways

5 of 27

Literature Review: How do users gauge MT quality when the source or the target language is unknown?

Xiuming Huang. 1990. A Machine Translation System for the Target Language Inexpert. In COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics. [All; Irene Han]

Marianna Martindale, Kevin Duh, and Marine Carpuat. 2021. Machine Translation Believability. In Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing, pages 88–95, Online. Association for Computational Linguistics. [Sections 1 & 2; Arsema Tegegne]

6 of 27

Literature Review: Automatic Metrics for Quality Estimation/Machine Translation Evaluation

Lucia Specia, Kashif Shah, Jose G.C. de Souza, and Trevor Cohn. 2013. QuEst - A translation quality estimation framework. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 79–84, Sofia, Bulgaria. Association for Computational Linguistics. [Sections 1 & 2; Kate Olsen]

Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics. [Sections 1 & 2; Elliot Bearden]

7 of 27

Literature Review: How do the application and the domain of the text impact MT quality and users' perception of it?

Kenneth W. Church and Eduardo H. Hovy. 1993. Good applications for crummy machine translation. Machine Translation, 8:239–258. https://doi.org/10.1007/BF00981759 [through Section 2 (p. 241); Roshni Kainthan]

Daniel Liebling, Katherine Heller, Samantha Robertson, and Wesley Deng. 2022. Opportunities for Human-centered Evaluation of Machine Translation Systems. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 229–240, Seattle, United States. Association for Computational Linguistics. [Sections 1, 2 & 3; Svetlana Semenova]

8 of 27

Annotation

9 of 27

Annotation ~1hr:

Two groups of three participants, with one language pair assigned to each group

high risk: COVID TICO dataset [Group 1]; low risk: TED talk dataset [Group 2]

Goal: collect acceptability judgements on 100 automatic translations using the backtranslation (BT) signal (see the BT sketch below)

  1. Will you accept this translation?
  2. How confident are you in your judgement?

Bilingual speakers annotate (source, MT) pairs [Li, Svetlana]

Monolingual speakers annotate (source, BT-source) pairs

Instructions: use the email you signed up for Technica with :)
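
A minimal sketch of how the BT signal can be produced, assuming the public Helsinki-NLP MarianMT checkpoints as stand-ins for the sprint's actual MT systems:

```python
# Sketch: producing the (source, MT, BT-source) triples for annotation.
# The Helsinki-NLP checkpoints are public stand-ins, not the sprint's systems.
from transformers import pipeline

fwd = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")  # en -> fr
bwd = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")  # fr -> en

src = "Wash your hands frequently to reduce the risk of infection."
mt = fwd(src)[0]["translation_text"]   # bilinguals see (src, mt)
bt = bwd(mt)[0]["translation_text"]    # monolinguals see (src, bt)

print("source:         ", src)
print("translation:    ", mt)
print("backtranslation:", bt)
```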

10 of 27

Annotation Interface

[Screenshots: bilingual and monolingual annotation interfaces]

11 of 27

Annotation Statistics

| Language-Pair | Dataset | # of Examples Annotated | # of Annotations |
|---------------|---------|-------------------------|------------------|
| en-fr | TICO-19 | 47 | x2 Bilingual, x2 Monolingual |
| en-fr | TED | 48 | x2 Bilingual, x2 Monolingual |
| en-ru | TICO-19 | 45 | x1 Bilingual, x2 Monolingual |
| en-ru | TED | 46 | x1 Bilingual, x2 Monolingual |

12 of 27

Analysis: How reliable is the BT signal in distinguishing good from bad translations?

| Language-Pair | Dataset | % times Bilinguals accept | % times Monolinguals accept | % times Monolinguals agree with Bilinguals |
|---------------|---------|---------------------------|-----------------------------|--------------------------------------------|
| en-fr | TICO-19 | 97.3% | 65.3% | 53.2% |
| en-fr | TED | 69.4% | 60.7% | 64.6% |
| en-ru | TICO-19 | 68.9% | 53.1% | 55.6% |
| en-ru | TED | 52.1% | 51.7% | 67.4% |
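
A sketch of how rates like these can be computed, assuming a hypothetical annotations.csv with columns example_id, annotator_type ("bilingual" / "monolingual"), and accept (0/1):

```python
# Sketch: acceptance and agreement rates from per-annotator judgements.
import pandas as pd

df = pd.read_csv("annotations.csv")  # hypothetical file layout

# Majority vote per example within each annotator group.
votes = df.groupby(["annotator_type", "example_id"]).accept.mean() >= 0.5
bi, mono = votes["bilingual"], votes["monolingual"]

print("bilinguals accept:  ", f"{bi.mean():.1%}")
print("monolinguals accept:", f"{mono.mean():.1%}")

# Agreement: compare the two groups' votes on the shared examples.
merged = pd.concat({"bi": bi, "mono": mono}, axis=1).dropna()
print("mono/bi agreement:  ", f"{(merged.bi == merged.mono).mean():.1%}")
```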

13 of 27

How confident are monolingual users in their predictions?

[Confidence plots: when annotators disagree · when annotators agree · overall confidence]

14 of 27

Understanding and Evaluating QE metrics [2-3 hrs] + Discussion [30 mins]

  1. Sentence-level Quality Score (Comet-Src)
  2. Word-level feedback (SimAlign)
  3. Additional Back-translation output from a rule-based MT system
  4. Dictionary

Starter Colab: https://colab.research.google.com/drive/19fjiwkHXuSkNhGuIy3GO2XGTvE_-C0wn?authuser=2#scrollTo=fUFYzIEab1jB

Research Question: How can we convert the sentence-level QE score into feedback that is actionable for users?
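
A minimal sketch of reference-free scoring with the unbabel-comet package; wmt20-comet-qe-da is one public source-based QE checkpoint and may differ from the one used in the Colab:

```python
# Sketch: sentence-level QE with COMET (pip install unbabel-comet).
# "wmt20-comet-qe-da" is the v1 name of a source-based QE checkpoint;
# newer releases use the Hub name "Unbabel/wmt20-comet-qe-da".
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("wmt20-comet-qe-da"))

# QE models score (src, mt) pairs directly; no "ref" key is needed.
data = [{"src": "Wash your hands frequently.",
         "mt": "Lavez-vous les mains fréquemment."}]

output = model.predict(data, batch_size=8, gpus=0)
print(output)  # per-segment scores (plus a corpus-level average)
```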

15 of 27

Correlation between chrF and Comet-Src

| Metric | Covid En-Ru | Covid En-Fr | TED En-Ru | TED En-Fr |
|--------|-------------|-------------|-----------|-----------|
| Comet-Src | 0.294692 | 0.235188 | 0.292845 | 0.218786 |
| Comet-Src-100 | 0.229571 | 0.129757 | 0.281405 | 0.145827 |
| WordAlign-100 | 0.101108 | 0.114441 | – | – |
| Length ratio | 0.274960 | 0.152287 | – | – |
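
A sketch of how such a correlation can be computed, using sacrebleu for sentence-level chrF and scipy for Pearson's r (the toy data here is illustrative):

```python
# Sketch: Pearson correlation between sentence-level chrF and Comet-Src.
from sacrebleu.metrics import CHRF
from scipy.stats import pearsonr

chrf = CHRF()

def sentence_chrf(hyps, refs):
    # sacrebleu's sentence_score takes one hypothesis and a list of references
    return [chrf.sentence_score(h, [r]).score for h, r in zip(hyps, refs)]

# Illustrative toy data; in the sprint these come from the annotated sets.
hyps = ["the cat sat on the mat", "hello world", "good morning everyone"]
refs = ["the cat is on the mat", "hello there world", "good morning, all"]
comet_src = [0.31, 0.52, 0.44]  # per-sentence QE scores from the step above

r, p = pearsonr(sentence_chrf(hyps, refs), comet_src)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```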

16 of 27

How can we convert the sentence-level QE score into feedback that is actionable for users?

[Score distribution plots]

TED, Russian: threshold ≈ 0.25

Covid, Russian: threshold ≈ 0.45

17 of 27

How can we convert the sentence-level QE score into feedback that is actionable for users?

[Score distribution plots]

Covid, French: threshold ≈ 0.7

TED, French: threshold ≈ 0.2
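
One simple way to operationalize these thresholds: map the continuous QE score to a binary accept/caution message. A sketch, hard-coding the dev-set thresholds from these slides (the message wording is illustrative):

```python
# Sketch: turning a sentence-level QE score into user-facing feedback.
# Thresholds below are the dev-set picks from the preceding slides.
THRESHOLDS = {
    ("covid", "ru"): 0.45, ("ted", "ru"): 0.25,
    ("covid", "fr"): 0.70, ("ted", "fr"): 0.20,
}

def qe_feedback(score: float, dataset: str, lang: str) -> str:
    t = THRESHOLDS[(dataset, lang)]
    if score >= t:
        return f"QE score {score:.2f} >= {t}: this translation looks reliable."
    return f"QE score {score:.2f} < {t}: treat this translation with caution."

print(qe_feedback(0.31, "ted", "ru"))    # above the TED/ru threshold -> reliable
print(qe_feedback(0.31, "covid", "ru"))  # below the Covid/ru threshold -> caution
```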

18 of 27

Results on test set

| Language-Pair | Dataset | Accuracy (Comet-Src) | Accuracy (Length) |
|---------------|---------|----------------------|-------------------|
| en-fr | TICO-19 | – | 0.646 |
| en-fr | TED | – | 0.170 |
| en-ru | TICO-19 | 0.600 | – |
| en-ru | TED | 0.586 | – |

19 of 27

How can we convert the sentence-level QE score into feedback that is actionable for users?

[Distribution plots]

TED, Russian: threshold ≈ 0.25

Covid, Russian: threshold ≈ 1.1

20 of 27

How can we convert the sentence-level QE score into feedback that is actionable for users?

[Distribution plots]

Covid, French: threshold ≈ 0.9

TED, French: threshold ≈ 0.2

21 of 27

Follow-up questions / todos

  • Post-hoc analysis of annotations [Lana]
  • Determine a threshold on the COMET score to distinguish good vs. bad translations [almost done for En-Ru]
    • If COMET > threshold, predict that the translation is good on the test set; otherwise, bad [Kate, Irene]
  • Determine a length-based threshold to distinguish good vs. bad translations [Li, Roshni]
    • Plot the distribution of the ratio (MT length) / (source length) and of the ratio (reference length) / (source length)
    • Pick a threshold on (MT length) / (source length) to make good vs. bad predictions
  • Evaluate the good vs. bad predictors: compute the accuracy of each predictor (COMET, length-based) against the human accept/reject decisions (see the sketch after this list)
    • on En-Fr and En-Ru data
    • on each dataset (COVID, TED)
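
A sketch of the evaluation step, assuming per-sentence arrays of Comet-Src scores, length ratios, and human accept/reject labels (all values and cutoffs here are illustrative):

```python
# Sketch: accuracy of the COMET and length-based good/bad predictors
# against human accept/reject decisions. Toy values; the real arrays
# come from the annotated En-Fr / En-Ru COVID and TED sets.
import numpy as np

comet = np.array([0.55, 0.12, 0.71, 0.40])      # Comet-Src scores
len_ratio = np.array([1.02, 0.45, 0.97, 1.60])  # len(MT) / len(source)
accept = np.array([1, 0, 1, 0])                 # human: accept=1 / reject=0

comet_pred = (comet > 0.45).astype(int)    # COMET threshold (dev-set pick)
len_pred = (len_ratio > 0.9).astype(int)   # length-ratio cutoff (example)

print("COMET predictor accuracy: ", (comet_pred == accept).mean())
print("Length predictor accuracy:", (len_pred == accept).mean())
```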

22 of 27

Adding QE feedback to the User-Study Interface [1-2 hrs]

23 of 27

Annotation II ~1hr:

high risk: COVID TICO dataset [Group 2]; low risk: TED talk dataset [Group 1]

Goal: collect acceptability judgements on 100 translations using the backtranslation + QE signal

  1. Will you accept this translation?
  2. How confident are you in your judgement?

Q: How did you use the BT + QE signal?

24 of 27

Analysis: How does access to the QE judgement impact users' perception of quality?

25 of 27

Analysis: Do QE scores improve users' reliability and confidence in gauging MT quality?

26 of 27

Any additional analysis

27 of 27

Takeaways & Findings