Machine Translation and data at Mozilla
A data-intensive use case
Evgeny Pavlov, Machine Learning Engineer
August 2025
Agenda
Mozilla and MT intro
Bergamot and beyond
01
Translate
web pages
in Firefox
Quick stats
Local MT models in Firefox
43 languages in production
1-5% COMET22 difference from Google Translate
20-35 MB typical model size
MT and data at Mozilla — Intro
MT Marathon 2025
Training pipeline
Teacher-student knowledge distillation
Fully Automated
Training
Starts with an auto-generated .yml config
> task config-generator -- sv en
> task train -- --config run.yml
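For illustration, here is a hypothetical, abbreviated sketch of the kind of .yml the config generator might emit; every field name and dataset identifier below is an assumption for illustration, not the pipeline's actual schema:

```yaml
# Hypothetical sketch only: field names and dataset ids are illustrative.
experiment:
  name: sv-en
  src: sv
  trg: en
datasets:
  train:
    - opus_ParaCrawl/v9
  devtest:
    - flores_dev
```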
Powered by Taskcluster
Firefox CI task orchestrator
Automated training in parallel
A dashboard of models currently being trained
Super open source
Based on European research projects
mozilla / translations
browsermt / bergamot-translator
marian-nmt / marian-dev
Data management
Just look at the data!
02
Data sources
Open-source data
Data cleaning
aka “garbage in, garbage out”
Robustness
All kinds of texts on the web -> OpusTrainer
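OpusTrainer makes students robust by injecting web-style noise (casing, typos, and similar perturbations) into the training data. A minimal Python sketch of the idea follows; it is not OpusTrainer's actual API, and the modifier names and rates are illustrative:

```python
import random

# Illustrative noise modifiers in the spirit of OpusTrainer's augmentation:
# each one perturbs the source side so the model sees messy, web-like input.
def uppercase(src, tgt):
    return src.upper(), tgt  # ALL-CAPS text is common on the web

def title_case(src, tgt):
    return src.title(), tgt

def swap_typo(src, tgt, rng):
    # Swap two adjacent characters to simulate a typo.
    if len(src) < 2:
        return src, tgt
    i = rng.randrange(len(src) - 1)
    return src[:i] + src[i + 1] + src[i] + src[i + 2:], tgt

def augment(pairs, rate=0.3, seed=0):
    """Apply a random modifier to a fraction of the (src, tgt) pairs."""
    rng = random.Random(seed)
    out = []
    for src, tgt in pairs:
        if rng.random() < rate:
            mod = rng.choice(["upper", "title", "typo"])
            if mod == "upper":
                src, tgt = uppercase(src, tgt)
            elif mod == "title":
                src, tgt = title_case(src, tgt)
            else:
                src, tgt = swap_typo(src, tgt, rng)
        out.append((src, tgt))
    return out
```

The target side is left untouched: the student should produce clean output even from noisy input.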
That’s a lot…
Can we simplify it?
LLMs for data generation
A beautiful new world
03
How can LLMs help?
Infinite parallel data? Sounds like a silver bullet
No noise: not mined from the internet, so it's clean!
Dataset distribution: full control over domain, formatting, balancing, diversification, etc.
Pre-trained: get rid of teacher training, no distillation
Low resource: get at least some data
Inspiration
NewsPALM
WMT24++
Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data (Finkelstein et al., 2024, WMT24)
WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects (Deutsch et al., 2025)
Which model to use?
A lot of questions
LLM evals WMT24++ en-ru
LLM evals WMT24++ en-zh
Learnings
Everything matters
LLM inference speed en-ru
Inference optimization
Throughput focused
Looks promising so far
We can
Run large models
Get high throughput
Reach good translation quality
Dataset selection
Diversity is good
Cluster and sample uniformly
Input data -> Embeddings -> Clustering -> Output sizes
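The pipeline above (embed, cluster, sample uniformly per cluster) can be sketched in Python. The cluster count, per-cluster sample size, and use of KMeans are illustrative assumptions, not the values or algorithm of the real pipeline:

```python
import random
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

def sample_diverse(sentences, embeddings, n_clusters=4, per_cluster=2, seed=0):
    """Cluster sentence embeddings, then sample uniformly from each cluster
    so the selected subset covers the whole distribution, not just the
    densest regions."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    buckets = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        buckets[label].append(sentence)
    rng = random.Random(seed)
    selected = []
    for label in sorted(buckets):
        bucket = buckets[label]
        selected.extend(rng.sample(bucket, min(per_cluster, len(bucket))))
    return selected
```

Uniform sampling across clusters is what keeps rare domains represented in the output sizes.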
Let’s run it!
Metaflow + Nvidia Cloud (H100)
Hallucinations
Where’s my perfect data?
System prompt (WMT24++):
“You are a professional translator …”
“Produce only the {to_lang} translation, without any additional explanation or commentary.”
Input: start of an HTML document up to <body>
Output (Gemma 27b, n=32 QE-reranked, en-ru): “<!DOCTYPE html>...”
Solution: filter with MetricX24-QE
[Figure: MetricX24-QE score distribution, 1M generated examples, Qwen 235b, QE reranked with 8 samples, en-zh]
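QE reranking and filtering can be sketched as follows, assuming each sampled candidate already carries a score from a QE model such as MetricX24-QE (normalized here so that higher = better; raw MetricX scores are lower-is-better and would be negated first). The threshold value is illustrative:

```python
def rerank(candidates):
    """Pick the best-scoring of the n sampled translations for one source."""
    return max(candidates, key=lambda c: c["score"])

def qe_filter(pairs, threshold=0.5):
    """Keep only generated pairs whose best candidate clears the threshold,
    dropping hallucinations like boilerplate HTML echoed back as 'translation'."""
    kept = []
    for source, candidates in pairs:
        best = rerank(candidates)
        if best["score"] >= threshold:
            kept.append((source, best["text"]))
    return kept
```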
Generated datasets
H100 = $11/hour on GCP

| Dataset | Size, QE filtered | Model | Decoding | GPU hours (H100) |
| en-ru 1M | 970K | Gemma 27b BF16 | Greedy | 4 |
| en-ru 10M | 7M | Gemma 27b BF16 | Greedy | 50 |
| en-ru 1M | 890K | Gemma 27b FP8 | Sampling, n=32, MetricX24-QE L | 36 + 70 QE = 106 |
| en-zh 1M | 890K | Qwen 235b a22 FP8 | Sampling, n=8, MetricX24-QE L | 32 + 4 QE = 36 |
| en-zh 10M | 8.7M | Qwen 235b a22 FP8 | Sampling, n=8, MetricX24-QE L | 296 + 37 QE = 333 |
Finetuning experiments
1. Boost prod model: finetune the production student
2. Replace teacher: pre-train student on best 50M NLLB, finetune on LLM-generated data
3. Try NewsPALM: how does our data compare to the beast?
4. 1M vs 10M: how does data size affect quality?
5. Curriculum training vs finetuning
6. QE-reranked vs greedy decoded
Finetuning results
| Pair | Pre-training, M | Generated, M | Model | Decoding | Post-training | Diff, COMET22 |
| de-en | NLLB 50 | 1 | PALM 2 340B | MBR 512 | Finetuning | 0.7 - 4 |
| en-ru | NLLB 50 | 1 | Gemma 3 27B | Greedy | Finetuning | 0.7 |
| en-ru | - | 7 | Gemma 3 27B | Greedy | Curriculum, NLLB 50 | 0.5 - 1.7 |
| en-ru | Distillation 600 | 1 | Gemma 3 27B | QE 32 | Finetuning | -0.1 |
| en-zh | Distillation 600 | 1 | Qwen 3 235B | QE 8 | Finetuning | 0.2 - 0.9 |
| en-zh | Distillation 600 | 8 | Qwen 3 235B | QE 8 | Finetuning | 0.5 - 2 |
Finetuning results
A mixed bag

Positive:
- Finetuning works, including on distilled models
- Even 1M examples can boost quality
- Curriculum beats finetuning

Negative:
- Marginal improvement with smaller LLMs and datasets
- Costlier for larger datasets
- NLLB pre-trained students are already good
Conclusion
Useful, but not quite a silver bullet just yet

Are LLMs useful for parallel data generation? Definitely!
Can we get rid of distillation? Probably… but requires careful setup
Is it expensive to use LLMs? Less than you think
Links
Contributions are welcome!
Mozilla translations
github.com/mozilla/translations
Student models and evals
github.com/mozilla/firefox-translations-models
LLMs experiment
github.com/mozilla/translations/tree/main/experiments/llmaat
LLMs evals
wandb.ai/moz-translations/llm-evals
Matrix
https://matrix.to/#/#firefoxtranslations:mozilla.org
Thank you!