1 of 35

Machine Translation and data at Mozilla

A data-intensive use case

Evgeny Pavlov, Machine Learning Engineer

August 2025


2 of 35

Agenda

  1. Mozilla and MT intro

  2. Data management

  3. LLMs for data generation


3 of 35

Mozilla and MT intro

Bergamot and beyond

01


4 of 35

Translate web pages in Firefox


5 of 35

Quick stats

Local MT models in Firefox

  • 43 languages in production
  • 1-5% COMET22 difference from Google Translate
  • 20-35 MB typical model size


6 of 35

Training pipeline

Teacher-student knowledge distillation


7 of 35

Fully Automated Training

Starts with an auto-generated .yml config

> task config-generator -- sv en

> task train -- --config run.yml


8 of 35

Powered by Taskcluster

Firefox CI task orchestrator


9 of 35

Automated training in parallel

A dashboard of models currently in training


10 of 35

Super open source

Based on European research projects

  • Project Bergamot (2018-2021)
  • HPLT (2022 - ongoing)
  • OPUS
  • Marian
  • Bicleaner AI
  • And many others


Repositories:

  • mozilla/translations
  • browsermt/bergamot-translator
  • marian-nmt/marian-dev


11 of 35

Data management

Just look at the data!

02


12 of 35

Data sources

Open-source data

  • OPUS
  • mtdata
  • NewsCrawl (monolingual)
  • HPLT (parallel + mono)


13 of 35

Data cleaning

aka “Garbage in - garbage out”

  • OpusCleaner
  • Bicleaner AI (score-and-threshold filtering; see the sketch after this list)
  • Monocleaner
  • Dataset-specific fixes
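These cleaners ultimately come down to scoring sentence pairs and keeping the ones above a threshold. A minimal sketch of that pattern, assuming a tab-separated corpus with a precomputed score column; the column layout and the 0.5 cut-off are illustrative, not the exact Bicleaner AI output format:

```python
# Minimal sketch: filter a parallel TSV by a sentence-pair quality score.
# Columns (src, trg, score) and the 0.5 threshold are illustrative.
import sys

THRESHOLD = 0.5  # assumed cut-off; Bicleaner-style scores are in [0, 1]

def filter_corpus(in_path: str, out_path: str) -> None:
    kept = total = 0
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            total += 1
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed lines
            src, trg, score = fields[0], fields[1], float(fields[2])
            if score >= THRESHOLD:
                fout.write(f"{src}\t{trg}\n")
                kept += 1
    print(f"kept {kept}/{total} pairs", file=sys.stderr)

if __name__ == "__main__":
    filter_corpus("corpus.scored.tsv", "corpus.clean.tsv")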


14 of 35

Robustness

All kinds of text on the web -> OpusTrainer augmentation (sketch after the list)

  • Casing
  • Typos
  • Emojis
  • Punctuation
  • URLs
  • Numbers
  • Units of measurement
  • Short phrases
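OpusTrainer injects this kind of noise on the fly during training. A rough Python sketch of the idea; the probabilities and perturbations are illustrative, not OpusTrainer's actual modifier set:

```python
# Rough sketch of on-the-fly robustness augmentation in the spirit of OpusTrainer.
# Probabilities and perturbations are illustrative, not the real modifier config.
import random

def augment(src: str, trg: str) -> tuple[str, str]:
    r = random.random()
    if r < 0.05:                      # casing: upper-case both sides
        return src.upper(), trg.upper()
    if r < 0.10:                      # casing: title-case both sides
        return src.title(), trg.title()
    if r < 0.15:                      # typo: drop a random character on the source side
        if len(src) > 1:
            i = random.randrange(len(src))
            src = src[:i] + src[i + 1:]
        return src, trg
    if r < 0.20:                      # punctuation: strip terminal punctuation
        return src.rstrip(".!?"), trg.rstrip(".!?")
    return src, trg                   # leave most pairs untouched

if __name__ == "__main__":
    print(augment("Hello, world! Visit https://example.com",
                  "Hallo, Welt! Besuche https://example.com"))
```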


15 of 35

That’s a lot…

Can we simplify it?


16 of 35

LLMs for data generation

A beautiful new world

03


17 of 35

How can LLMs help?

Infinite parallel data? Sounds like a silver bullet

  • No noise: not mined from the internet -> clean!
  • Dataset distribution: full control over domain, formatting, balancing, diversification, etc.
  • Pre-trained: no teacher training, no distillation needed
  • Low resource: get at least some data


18 of 35

Inspiration

NewsPaLM

WMT24++

Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data (Finkelstein et al., 2024, WMT24)

WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects (Deutsch et al., 2025)


19 of 35

Which model to use?

A lot of questions

  • Model size vs quality
  • Sampling parameters
  • Number of samples for QE-reranking
  • Prompts
  • Language support
  • Inference speed
  • Cost


20 of 35

LLM evals WMT24++ en-ru


21 of 35

LLM evals WMT24++ en-zh


22 of 35

Learnings

Everything matters

  • Size matters, but… it depends on the language (Gemma 27b == Qwen 235b for en-ru)
  • Sampling parameters matter a lot (good defaults for QE: min_p=0.02, t=1)
  • Diminishing returns when scaling QE-reranking candidates from n=8 to 16, 32, 128 (see the sketch after this list)
  • Large MoE models rock (better than a smaller model with a higher n)
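Concretely, QE-reranking with these parameters means: sample n candidates per source, score each with a reference-free QE metric, keep the best one. A sketch; the prompt template and qe_score() below are stand-ins, not the production setup:

```python
# Sketch of QE-reranking: sample n candidate translations per source and keep the
# one the QE metric likes best. Sampling parameters follow the slide (min_p=0.02, t=1).
from vllm import LLM, SamplingParams

PROMPT = "Translate the following English text to Russian. Output only the translation.\n\n{src}"

def qe_score(source: str, hypothesis: str) -> float:
    """Stand-in for a real QE model (e.g. MetricX24-QE, where lower is better and you
    would flip the sign). Here: a crude length-ratio heuristic so the sketch runs."""
    ratio = len(hypothesis) / max(len(source), 1)
    return -abs(1.0 - ratio)

def translate_reranked(llm: LLM, sources: list[str], n: int = 8) -> list[str]:
    params = SamplingParams(n=n, temperature=1.0, min_p=0.02, max_tokens=512)
    outputs = llm.generate([PROMPT.format(src=s) for s in sources], params)
    best = []
    for src, out in zip(sources, outputs):
        candidates = [c.text.strip() for c in out.outputs]
        best.append(max(candidates, key=lambda hyp: qe_score(src, hyp)))
    return best
```

With a real QE scorer plugged in, this loop is the shape of what produces the QE-reranked datasets shown later in the deck.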


23 of 35

LLM inference speed en-ru


24 of 35

Inference optimization

Throughput focused

  • vLLM: 10x throughput vs HF Transformers (see the sketch below)
  • Smart batching, max_tokens limits, etc.
  • Use quantized models; FP8 is lossless
  • Use MoE models
  • Modern GPUs (H100 or newer)
  • The system prompt is cached (prefix caching)
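A minimal throughput-oriented vLLM setup along these lines; the checkpoint path and prompts are placeholders, not the production configuration:

```python
# Throughput-oriented vLLM setup: quantized checkpoint, prefix caching, greedy decoding.
# vLLM batches continuously on its own, so we just hand it the full prompt list.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/gemma-3-27b-it-fp8",   # placeholder: any (quantized) checkpoint you run
    tensor_parallel_size=1,               # increase for multi-GPU serving
    max_model_len=4096,
    enable_prefix_caching=True,           # the shared system-prompt prefix is computed once
)

params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding, capped output

system = "You are a professional translator. Produce only the translation."
sources = ["Hello, world!", "The weather is nice today."]
prompts = [f"{system}\n\nTranslate to Swedish:\n{line}" for line in sources]

for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```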


25 of 35

Looks promising so far

We can:

  • run large models
  • get high throughput
  • reach good translation quality


26 of 35

Dataset selection

Diversity is good

Cluster and sample uniformly

  • Input data: 600M distillation mix in English
  • Embeddings: 384-dim multilingual-e5-small
  • Clustering: k-means on a 10M sample, 5000 clusters, nearest-neighbor search on centroids (see the sketch below)
  • Output sizes: 1, 10, 50 million
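A condensed sketch of that selection pipeline, assuming sentence-transformers for the e5 embeddings and scikit-learn MiniBatchKMeans for the clustering (the library choices are assumptions; the settings follow the slide):

```python
# Condensed sketch of diversity-driven selection: embed, cluster a sample,
# assign everything to its nearest centroid, then sample uniformly per cluster.
import random
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

def select_diverse(sentences: list[str], n_clusters: int = 5000,
                   fit_sample: int = 10_000_000, target_size: int = 1_000_000) -> list[str]:
    model = SentenceTransformer("intfloat/multilingual-e5-small")
    # e5 models expect a "query: " prefix; embeddings are 384-dimensional.
    emb = model.encode(["query: " + s for s in sentences],
                       batch_size=256, normalize_embeddings=True)

    # Fit k-means on a subsample (10M of the 600M mix in the real run);
    # the number of clusters cannot exceed the subsample size.
    idx = np.random.choice(len(emb), size=min(len(emb), fit_sample), replace=False)
    km = MiniBatchKMeans(n_clusters=min(n_clusters, len(idx)), batch_size=4096)
    km.fit(emb[idx])

    # Nearest-centroid assignment for the full set, then uniform sampling per cluster.
    labels = km.predict(emb)
    per_cluster = max(1, target_size // km.n_clusters)
    selected: list[str] = []
    for c in range(km.n_clusters):
        members = np.flatnonzero(labels == c).tolist()
        selected.extend(sentences[i] for i in random.sample(members, min(per_cluster, len(members))))
    return selected
```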


27 of 35

Let’s run it!

Metaflow + Nvidia Cloud (H100)
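A bare-bones sketch of what such a Metaflow flow can look like; the step names, resource requests, and placeholder functions are illustrative, not the actual pipeline:

```python
# Bare-bones Metaflow sketch of the generation run.
from metaflow import FlowSpec, step, resources

def translate(text: str) -> str:
    """Placeholder for the LLM translation call (vLLM + QE-reranking in the real run)."""
    return text

def qe_score(source: str, hypothesis: str) -> float:
    """Placeholder for a reference-free QE metric such as MetricX24-QE (lower is better)."""
    return 0.0

class LLMDataFlow(FlowSpec):

    @step
    def start(self):
        # In the real flow: the diversity-sampled English sentences selected earlier.
        self.sources = ["Hello, world!", "The weather is nice today."]
        self.next(self.generate)

    @resources(gpu=1, memory=64000)
    @step
    def generate(self):
        self.pairs = [(s, translate(s)) for s in self.sources]
        self.next(self.qe_filter)

    @resources(gpu=1)
    @step
    def qe_filter(self):
        # Drop hallucinations with a QE threshold (e.g. MetricX24-QE < 2).
        self.clean = [(s, t) for s, t in self.pairs if qe_score(s, t) < 2.0]
        self.next(self.end)

    @step
    def end(self):
        print(f"kept {len(self.clean)} pairs")

if __name__ == "__main__":
    LLMDataFlow()
```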


28 of 35

Hallucinations

Where’s my perfect data?

System prompt (WMT24++):

“You are a professional translator …”

“Produce only the {to_lang} translation, without any additional explanation or commentary.”

Input:

“1. start of HTML document up to <body>”

Output (Gemma 27b n=32 QE-reranked, en-ru):

“<!DOCTYPE html>...”

Solutions:

  1. More prompt engineering (no guarantees)
  2. QE post-filtering (e.g. MetricX24-QE < 2)


Figure: MetricX24-QE score distribution for 1M generated examples (Qwen 235b, QE-reranked with 8 samples, en-zh)


29 of 35

Generated datasets

H100 ≈ $11/hour on GCP

Dataset (size) | After QE filtering | Model             | Decoding                               | GPU hours (H100)
en-ru 1M       | 970K               | Gemma 27b BF16    | Greedy                                 | 4
en-ru 10M      | 7M                 | Gemma 27b BF16    | Greedy                                 | 50
en-ru 1M       | 890K               | Gemma 27b FP8     | Sampling, n=32, MetricX24-QE reranking | 36 + 70 QE = 106
en-zh 1M       | 890K               | Qwen 235b a22 FP8 | Sampling, n=8, MetricX24-QE reranking  | 32 + 4 QE = 36
en-zh 10M      | 8.7M               | Qwen 235b a22 FP8 | Sampling, n=8, MetricX24-QE reranking  | 296 + 37 QE = 333
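At that rate, the cheapest run (en-ru 1M, greedy) costs about 4 × $11 ≈ $44 of GPU time, while the largest (en-zh 10M, 333 GPU hours including QE scoring) comes to roughly 333 × $11 ≈ $3,700.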


30 of 35

Finetuning experiments

  1. Boost prod model: finetune the production student
  2. Replace teacher: pre-train the student on the best 50M NLLB data, finetune on LLM-generated data
  3. Try NewsPaLM: how does our data compare to the beast?
  4. 1M vs 10M: how does data size affect quality?
  5. Curriculum training vs finetuning
  6. QE-reranked vs greedy decoding


31 of 35

Finetuning results

Pair  | Pre-training, M  | Generated, M | Model       | Decoding | Post-training        | Diff, COMET22
de-en | NLLB 50          | 1            | PaLM 2 340B | MBR 512  | Finetuning           | 0.7 to 4
en-ru | NLLB 50          | 1            | Gemma 3 27B | Greedy   | Finetuning           | 0.7
en-ru | -                | 7            | Gemma 3 27B | Greedy   | Curriculum (NLLB 50) | 0.5 to 1.7
en-ru | Distillation 600 | 1            | Gemma 3 27B | QE 32    | Finetuning           | -0.1
en-zh | Distillation 600 | 1            | Qwen 3 235B | QE 8     | Finetuning           | 0.2 to 0.9
en-zh | Distillation 600 | 8            | Qwen 3 235B | QE 8     | Finetuning           | 0.5 to 2


32 of 35

Finetuning results

A mixed bag

Negative:
  • Marginal improvement with smaller LLMs and datasets
  • Costlier for larger datasets
  • Curriculum beats finetuning

Positive:
  • Finetuning works, including for distilled models
  • Even 1M examples can boost quality
  • NLLB pre-trained students are already good


33 of 35

Conclusion

Useful, but not quite a silver bullet just yet

Are LLMs useful for parallel data generation? Definitely!

  • Last-mile boost
  • Noisy original corpus
  • Low resource

Can we get rid of distillation? Probably… but it requires careful setup:

  • LLM choice
  • QE-reranking
  • QE-filtering
  • Curriculum training

Is it expensive to use LLMs? Less than you think, with:

  • vLLM
  • Modern GPUs
  • Inference optimizations


34 of 35

Links

Contributions are welcome!

Mozilla translations

github.com/mozilla/translations

Student models and evals

github.com/mozilla/firefox-translations-models

LLM experiments

github.com/mozilla/translations/tree/main/experiments/llmaat

LLM evals

wandb.ai/moz-translations/llm-evals

Matrix

https://matrix.to/#/#firefoxtranslations:mozilla.org


35 of 35

Thank you!
