Metrics for Reliable AI-based Translation
Marine Carpuat
Sweta Agrawal
Eleftheria Briakou
Introduction(s)
Timeline
Friday: Literature Review
Saturday I: Annotation + Analysis
Saturday II: Quality Estimation Metrics + (Optional) Annotation II
Sunday: Final Presentation
Discussion - Literature Review + Takeaways
Literature Review: How do users gauge MT quality when the source or the target language is unknown?
Xiuming Huang. 1990. A Machine Translation System for the Target Language Inexpert. In COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics. [All, Irene Han]
Marianna Martindale, Kevin Duh, and Marine Carpuat. 2021. Machine Translation Believability. In Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing, pages 88–95, Online. Association for Computational Linguistics. [Section 1 & 2, Arsema Tegegne]
Literature Review: Automatic Metrics for Quality Estimation/Machine Translation Evaluation
Lucia Specia, Kashif Shah, Jose G.C. de Souza, and Trevor Cohn. 2013. QuEst - A translation quality estimation framework. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 79–84, Sofia, Bulgaria. Association for Computational Linguistics. [Section 1 & 2, Kate Olsen]
Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics. [Section 1 & 2, Elliot Bearden]
Literature Review: How do the application and the domain of the text impact MT quality and users' perception of it?
Kenneth W. Church and Eduard H. Hovy. 1993. Good applications for crummy machine translation. Machine Translation, 8:239–258. https://doi.org/10.1007/BF00981759 [up to Section 2, p. 241, Roshni Kainthan]
Daniel Liebling, Katherine Heller, Samantha Robertson, and Wesley Deng. 2022. Opportunities for Human-centered Evaluation of Machine Translation Systems. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 229–240, Seattle, United States. Association for Computational Linguistics. [Sections 1, 2 & 3, Svetlana Semenova]
Annotation
Annotation (~1 hr):
Two groups of three participants each, with one language pair assigned to each group
High risk: COVID TICO dataset [Group 1]; low risk: TED talk dataset [Group 2]
Goal: evaluate 100 automatic translations for acceptability judgements using the backtranslation (BT) signal
Bilingual speakers annotate (source, MT) pairs [Li, Svetlana]
Monolingual speakers look at (source, backtranslated source) pairs; see the backtranslation sketch below
Instructions: use the email address you signed up for Technica with :)
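The backtranslations shown to monolingual annotators can be produced with any off-the-shelf MT system; a minimal sketch using Hugging Face MarianMT checkpoints (the actual systems used for the study data are not specified here, so these model names are illustrative):

```python
# Minimal backtranslation sketch for the monolingual condition.
# Assumes Helsinki-NLP MarianMT checkpoints; the study's actual MT systems may differ.
from transformers import pipeline

fwd = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")  # source -> target (MT)
bwd = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")  # target -> source (BT)

def backtranslate(source_sentences):
    """Return (mt, bt_source) pairs: MT shown to bilinguals, BT shown to monolinguals."""
    mt = [out["translation_text"] for out in fwd(source_sentences)]
    bt = [out["translation_text"] for out in bwd(mt)]
    return list(zip(mt, bt))

print(backtranslate(["The vaccine requires two doses."]))
```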
Annotation Interface
Bilingual
Monolingual
Annotation Statistics
| Language Pair | Dataset | # Examples Annotated | # Annotations |
|---|---|---|---|
| en-fr | TICO-19 | 47 | 2 bilingual, 2 monolingual |
| en-fr | TED | 48 | 2 bilingual, 2 monolingual |
| en-ru | TICO-19 | 45 | 1 bilingual, 2 monolingual |
| en-ru | TED | 46 | 1 bilingual, 2 monolingual |
Analysis: How reliable is the BT signal in distinguishing good from bad translations?
| Language Pair | Dataset | % of times Bilinguals accept | % of times Monolinguals accept | % of times Monolinguals agree with Bilinguals |
|---|---|---|---|---|
| en-fr | TICO-19 | 97.3% | 65.3% | 53.2% |
| en-fr | TED | 69.4% | 60.7% | 64.6% |
| en-ru | TICO-19 | 68.9% | 53.1% | 55.6% |
| en-ru | TED | 52.1% | 51.7% | 67.4% |
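The acceptance and agreement rates above can be recomputed directly from the annotation sheets; a minimal pandas sketch, assuming one row per example with binary columns (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical schema: one row per annotated example, 0/1 acceptability labels.
df = pd.read_csv("annotations_en_fr_tico.csv")  # columns: bilingual_accept, monolingual_accept

bilingual_rate = df["bilingual_accept"].mean()
monolingual_rate = df["monolingual_accept"].mean()
agreement = (df["bilingual_accept"] == df["monolingual_accept"]).mean()

print(f"Bilinguals accept:   {bilingual_rate:.1%}")
print(f"Monolinguals accept: {monolingual_rate:.1%}")
print(f"Agreement:           {agreement:.1%}")
```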
How confident are monolingual users in their predictions?
[Confidence plots: when annotators disagree, when annotators agree, and overall confidence]
Understanding and Evaluating QE metrics [2-3 hrs] + Discussion [30 mins]
Starter Colab: https://colab.research.google.com/drive/19fjiwkHXuSkNhGuIy3GO2XGTvE_-C0wn?authuser=2#scrollTo=fUFYzIEab1jB
Research Question: How can we convert a sentence-level QE score into feedback that is actionable for users?
Correlation between chrF and CometSrc

| Signal | Covid En-Ru | Covid En-Fr | TED En-Ru | TED En-Fr |
|---|---|---|---|---|
| CometSrc | 0.294692 | 0.235188 | 0.292845 | 0.218786 |
| CometSrc-100 | 0.229571 | 0.129757 | 0.281405 | 0.145827 |
| WordAlign-100 | - | - | 0.101108 | 0.114441 |
| Length ratio | - | - | 0.274960 | 0.152287 |
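The correlations above compare reference-free QE signals against sentence-level chrF computed with a reference; a sketch of that computation, assuming the QE scores are exported from the starter Colab and references are available (file and column names are illustrative):

```python
import pandas as pd
import sacrebleu
from scipy.stats import pearsonr

# Illustrative file/column names; QE scores (e.g. CometSrc) assumed precomputed in the Colab.
df = pd.read_csv("ted_en_ru_scores.csv")  # columns: mt, reference, comet_src

df["chrf"] = [
    sacrebleu.sentence_chrf(hyp, [ref]).score
    for hyp, ref in zip(df["mt"], df["reference"])
]

r, p = pearsonr(df["comet_src"], df["chrf"])
print(f"Pearson r between CometSrc and chrF: {r:.3f} (p={p:.3g})")
```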
How can we convert a sentence-level QE score into feedback that is actionable for users?
TEDx, Russian: threshold 0.25
Covid, Russian: threshold ≈ 0.45
How can we convert a sentence-level QE score into feedback that is actionable for users?
Covid, French: threshold ≈ 0.7
TEDx, French: threshold 0.2
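One simple way to make a continuous QE score actionable is to pick a per-dataset threshold on labeled development data (for example, the cutoff that best separates translations bilinguals accepted from those they rejected) and flag anything below it. A sketch under that assumption, with hypothetical file and column names:

```python
import numpy as np
import pandas as pd

# Hypothetical dev set: sentence-level QE score plus bilingual acceptability label (1 = accept).
dev = pd.read_csv("dev_en_ru_covid.csv")  # columns: comet_src, accept

def best_threshold(scores, labels):
    """Sweep candidate thresholds; keep the one maximizing accuracy of 'score >= t -> accept'."""
    candidates = np.unique(scores)
    accs = [((scores >= t).astype(int) == labels).mean() for t in candidates]
    best = int(np.argmax(accs))
    return candidates[best], accs[best]

t, acc = best_threshold(dev["comet_src"].to_numpy(), dev["accept"].to_numpy())
print(f"Chosen threshold: {t:.2f} (dev accuracy {acc:.3f})")
```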
Results on test set
| Language Pair | Dataset | Accuracy (CometSrc) | Accuracy (Length) |
|---|---|---|---|
| en-fr | TICO-19 | - | 0.646 |
| en-fr | TED | - | 0.170 |
| en-ru | TICO-19 | 0.600 | - |
| en-ru | TED | 0.586 | - |
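The test-set numbers above are the accuracy of the thresholded signal against bilingual acceptability judgements; a short sketch of that evaluation step, reusing a threshold chosen on the dev split (file and column names again illustrative):

```python
import pandas as pd

# Hypothetical held-out test set with the same schema as the dev file.
test = pd.read_csv("test_en_ru_covid.csv")  # columns: comet_src, accept

threshold = 0.45  # e.g. the Covid/Russian CometSrc threshold from the slides above
pred = (test["comet_src"] >= threshold).astype(int)
accuracy = (pred == test["accept"]).mean()
print(f"Test accuracy: {accuracy:.3f}")
```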
How can we convert a sentence-level QE score into feedback that is actionable for users?
TEDx, Russian: threshold 0.25
Covid, Russian: threshold ≈ 1.1
How can we convert a sentence-level QE score into feedback that is actionable for users?
Covid, French: threshold ≈ 0.9
TEDx, French: threshold 0.2
Follow-up questions / TODOs
Adding QE feedback to the User-Study Interface [1-2 hrs]
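One option for surfacing the thresholded QE score in the user-study interface is a short reliability message shown next to each translation; a minimal sketch of such a mapping (labels and the default threshold are illustrative, not the study's final design):

```python
def qe_feedback(comet_src: float, threshold: float = 0.45) -> str:
    """Map a sentence-level QE score to a user-facing reliability message (illustrative)."""
    if comet_src >= threshold:
        return "✅ This translation is likely reliable."
    return "⚠️ This translation may contain errors; verify important details."

print(qe_feedback(0.62))
print(qe_feedback(0.18))
```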
Annotation II (~1 hr):
High risk: COVID TICO dataset [Group 2]; low risk: TED talk dataset [Group 1]
Goal: evaluate 100 translations for acceptability judgements using the backtranslation + QE signal
Q: How did you use the BT + QE signals?
Analysis: How does access to the QE judgement impact users' perception of quality?
Analysis: Do QE scores improve users' reliability and confidence in gauging MT quality?
Any additional analysis
Takeaways & Findings