1 of 35

Machine Translation and data at Mozilla

A data-intensive use case

Evgeny Pavlov, Machine Learning Engineer

August 2025


2 of 35

Agenda

  1. Mozilla and MT intro

  2. Data management

  3. LLMs for data generation


3 of 35

Mozilla and MT intro

Bergamot and beyond

01


4 of 35

Translate web pages in Firefox


5 of 35

Quick stats

Local MT models in Firefox

  • 43 languages in production
  • 1-5% COMET22 difference from Google Translate
  • 20-35 MB typical model size


6 of 35

Training pipeline

Teacher-student knowledge distillation


7 of 35

Fully Automated Training

Starts with an auto-generated .yml config

> task config-generator -- sv en

> task train -- --config run.yml


8 of 35

Powered by Taskcluster

Firefox CI task orchestrator


9 of 35

Automated training in parallel

A dashboard of models currently in training


10 of 35

Super open source

Based on European research projects

  • Project Bergamot (2018-2021)
  • HPLT (2022 - ongoing)
  • OPUS
  • Marian
  • Bicleaner AI
  • And many others


Repositories:

  • mozilla/translations
  • browsermt/bergamot-translator
  • marian-nmt/marian-dev


11 of 35

Data management

Just look at the data!

02


12 of 35

Data sources

Open-source data

  • OPUS
  • mtdata
  • NewsCrawl (monolingual)
  • HPLT (parallel + mono)


13 of 35

Data cleaning

aka “Garbage in - garbage out”

  • OpusCleaner
  • Bicleaner AI (score-and-threshold filtering; see the sketch after this list)
  • Monocleaner
  • Dataset-specific fixes
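These cleaners ultimately come down to scoring sentence pairs and keeping the ones above a threshold. A minimal sketch of that pattern, assuming a tab-separated corpus with a precomputed score column; the column layout and the 0.5 cut-off are illustrative, not the exact Bicleaner AI output format:

```python
# Minimal sketch: filter a parallel TSV by a sentence-pair quality score.
# Columns (src, trg, score) and the 0.5 threshold are illustrative.
import sys

THRESHOLD = 0.5  # assumed cut-off; Bicleaner-style scores are in [0, 1]

def filter_corpus(in_path: str, out_path: str) -> None:
    kept = total = 0
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            total += 1
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed lines
            src, trg, score = fields[0], fields[1], float(fields[2])
            if score >= THRESHOLD:
                fout.write(f"{src}\t{trg}\n")
                kept += 1
    print(f"kept {kept}/{total} pairs", file=sys.stderr)

if __name__ == "__main__":
    filter_corpus("corpus.scored.tsv", "corpus.clean.tsv")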


14 of 35

Robustness

All kinds of text on the web -> OpusTrainer augmentation (sketch after the list)

  • Casing
  • Typos
  • Emojis
  • Punctuation
  • URLs
  • Numbers
  • Units of measurement
  • Short phrases
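OpusTrainer injects this kind of noise on the fly during training. A rough Python sketch of the idea; the probabilities and perturbations are illustrative, not OpusTrainer's actual modifier set:

```python
# Rough sketch of on-the-fly robustness augmentation in the spirit of OpusTrainer.
# Probabilities and perturbations are illustrative, not the real modifier config.
import random

def augment(src: str, trg: str) -> tuple[str, str]:
    r = random.random()
    if r < 0.05:                      # casing: upper-case both sides
        return src.upper(), trg.upper()
    if r < 0.10:                      # casing: title-case both sides
        return src.title(), trg.title()
    if r < 0.15:                      # typo: drop a random character on the source side
        if len(src) > 1:
            i = random.randrange(len(src))
            src = src[:i] + src[i + 1:]
        return src, trg
    if r < 0.20:                      # punctuation: strip terminal punctuation
        return src.rstrip(".!?"), trg.rstrip(".!?")
    return src, trg                   # leave most pairs untouched

if __name__ == "__main__":
    print(augment("Hello, world! Visit https://example.com",
                  "Hallo, Welt! Besuche https://example.com"))
```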


15 of 35

That’s a lot…

Can we simplify it?


16 of 35

LLMs for data generation

A beautiful new world

03


17 of 35

How can LLMs help?

Infinite parallel data? Sounds like a silver bullet

  • No noise: not mined from the internet -> clean!
  • Dataset distribution: full control over domain, formatting, balancing, diversification, etc.
  • Pre-trained: no teacher training, no distillation needed
  • Low resource: get at least some data


18 of 35

Inspiration

NewsPaLM

WMT24++

Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data (Finkelstein et al., 2024, WMT24)

WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects (Deutsch et al., 2025)


19 of 35

Which model to use?

A lot of questions

  • Model size vs quality
  • Sampling parameters
  • Number of samples for QE-reranking
  • Prompts
  • Language support
  • Inference speed
  • Cost


20 of 35

LLM evals WMT24++ en-ru


21 of 35

LLM evals WMT24++ en-zh


22 of 35

Learnings

Everything matters

  • Size matters, but… it depends on the language (Gemma 27b == Qwen 235b for en-ru)
  • Sampling parameters matter a lot (good defaults for QE: min_p=0.02, t=1)
  • Diminishing returns when scaling QE-reranking candidates from n=8 to 16, 32, 128 (see the sketch after this list)
  • Large MoE models rock (better than a smaller model with a higher n)
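Concretely, QE-reranking with these parameters means: sample n candidates per source, score each with a reference-free QE metric, keep the best one. A sketch; the prompt template and qe_score() below are stand-ins, not the production setup:

```python
# Sketch of QE-reranking: sample n candidate translations per source and keep the
# one the QE metric likes best. Sampling parameters follow the slide (min_p=0.02, t=1).
from vllm import LLM, SamplingParams

PROMPT = "Translate the following English text to Russian. Output only the translation.\n\n{src}"

def qe_score(source: str, hypothesis: str) -> float:
    """Stand-in for a real QE model (e.g. MetricX24-QE, where lower is better and you
    would flip the sign). Here: a crude length-ratio heuristic so the sketch runs."""
    ratio = len(hypothesis) / max(len(source), 1)
    return -abs(1.0 - ratio)

def translate_reranked(llm: LLM, sources: list[str], n: int = 8) -> list[str]:
    params = SamplingParams(n=n, temperature=1.0, min_p=0.02, max_tokens=512)
    outputs = llm.generate([PROMPT.format(src=s) for s in sources], params)
    best = []
    for src, out in zip(sources, outputs):
        candidates = [c.text.strip() for c in out.outputs]
        best.append(max(candidates, key=lambda hyp: qe_score(src, hyp)))
    return best
```

With a real QE scorer plugged in, this loop is the shape of what produces the QE-reranked datasets shown later in the deck.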


23 of 35

LLM inference speed en-ru


24 of 35

Inference optimization

Throughput focused

  • vLLM: 10x throughput vs HF Transformers (see the sketch below)
  • Smart batching, max_tokens limits, etc.
  • Use quantized models; FP8 is lossless
  • Use MoE models
  • Modern GPUs (H100 or newer)
  • The system prompt is cached (prefix caching)
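A minimal throughput-oriented vLLM setup along these lines; the checkpoint path and prompts are placeholders, not the production configuration:

```python
# Throughput-oriented vLLM setup: quantized checkpoint, prefix caching, greedy decoding.
# vLLM batches continuously on its own, so we just hand it the full prompt list.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/gemma-3-27b-it-fp8",   # placeholder: any (quantized) checkpoint you run
    tensor_parallel_size=1,               # increase for multi-GPU serving
    max_model_len=4096,
    enable_prefix_caching=True,           # the shared system-prompt prefix is computed once
)

params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding, capped output

system = "You are a professional translator. Produce only the translation."
sources = ["Hello, world!", "The weather is nice today."]
prompts = [f"{system}\n\nTranslate to Swedish:\n{line}" for line in sources]

for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```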


25 of 35

Looks promising so far

We can:

  • run large models
  • get high throughput
  • reach good translation quality


26 of 35

Dataset selection

Diversity is good

Cluster and sample uniformly

  • Input data: 600M distillation mix in English
  • Embeddings: 384-dim multilingual-e5-small
  • Clustering: k-means on a 10M sample, 5000 clusters, nearest-neighbor search on centroids (see the sketch below)
  • Output sizes: 1, 10, 50 million
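A condensed sketch of that selection pipeline, assuming sentence-transformers for the e5 embeddings and scikit-learn MiniBatchKMeans for the clustering (the library choices are assumptions; the settings follow the slide):

```python
# Condensed sketch of diversity-driven selection: embed, cluster a sample,
# assign everything to its nearest centroid, then sample uniformly per cluster.
import random
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

def select_diverse(sentences: list[str], n_clusters: int = 5000,
                   fit_sample: int = 10_000_000, target_size: int = 1_000_000) -> list[str]:
    model = SentenceTransformer("intfloat/multilingual-e5-small")
    # e5 models expect a "query: " prefix; embeddings are 384-dimensional.
    emb = model.encode(["query: " + s for s in sentences],
                       batch_size=256, normalize_embeddings=True)

    # Fit k-means on a subsample (10M of the 600M mix in the real run);
    # the number of clusters cannot exceed the subsample size.
    idx = np.random.choice(len(emb), size=min(len(emb), fit_sample), replace=False)
    km = MiniBatchKMeans(n_clusters=min(n_clusters, len(idx)), batch_size=4096)
    km.fit(emb[idx])

    # Nearest-centroid assignment for the full set, then uniform sampling per cluster.
    labels = km.predict(emb)
    per_cluster = max(1, target_size // km.n_clusters)
    selected: list[str] = []
    for c in range(km.n_clusters):
        members = np.flatnonzero(labels == c).tolist()
        selected.extend(sentences[i] for i in random.sample(members, min(per_cluster, len(members))))
    return selected
```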


27 of 35

Let’s run it!

Metaflow + Nvidia Cloud (H100)
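A bare-bones sketch of what such a Metaflow flow can look like; the step names, resource requests, and placeholder functions are illustrative, not the actual pipeline:

```python
# Bare-bones Metaflow sketch of the generation run.
from metaflow import FlowSpec, step, resources

def translate(text: str) -> str:
    """Placeholder for the LLM translation call (vLLM + QE-reranking in the real run)."""
    return text

def qe_score(source: str, hypothesis: str) -> float:
    """Placeholder for a reference-free QE metric such as MetricX24-QE (lower is better)."""
    return 0.0

class LLMDataFlow(FlowSpec):

    @step
    def start(self):
        # In the real flow: the diversity-sampled English sentences selected earlier.
        self.sources = ["Hello, world!", "The weather is nice today."]
        self.next(self.generate)

    @resources(gpu=1, memory=64000)
    @step
    def generate(self):
        self.pairs = [(s, translate(s)) for s in self.sources]
        self.next(self.qe_filter)

    @resources(gpu=1)
    @step
    def qe_filter(self):
        # Drop hallucinations with a QE threshold (e.g. MetricX24-QE < 2).
        self.clean = [(s, t) for s, t in self.pairs if qe_score(s, t) < 2.0]
        self.next(self.end)

    @step
    def end(self):
        print(f"kept {len(self.clean)} pairs")

if __name__ == "__main__":
    LLMDataFlow()
```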


28 of 35

Hallucinations

Where’s my perfect data?

System prompt (WMT24++):

“You are a professional translator …”

“Produce only the {to_lang} translation, without any additional explanation or commentary.”

Input:

“1. start of HTML document up to <body>”

Output (Gemma 27b n=32 QE-reranked, en-ru):

“<!DOCTYPE html>...”

Solutions:

  1. More prompt engineering (no guarantees)
  2. QE post-filtering (e.g. MetricX24-QE < 2)


Figure: MetricX24-QE score distribution for 1M generated examples (Qwen 235b, QE-reranked with 8 samples, en-zh)


29 of 35

Generated datasets

H100 ≈ $11/hour on GCP

Dataset (size) | After QE filtering | Model             | Decoding                               | GPU hours (H100)
en-ru 1M       | 970K               | Gemma 27b BF16    | Greedy                                 | 4
en-ru 10M      | 7M                 | Gemma 27b BF16    | Greedy                                 | 50
en-ru 1M       | 890K               | Gemma 27b FP8     | Sampling, n=32, MetricX24-QE reranking | 36 + 70 QE = 106
en-zh 1M       | 890K               | Qwen 235b a22 FP8 | Sampling, n=8, MetricX24-QE reranking  | 32 + 4 QE = 36
en-zh 10M      | 8.7M               | Qwen 235b a22 FP8 | Sampling, n=8, MetricX24-QE reranking  | 296 + 37 QE = 333
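At that rate, the cheapest run (en-ru 1M, greedy) costs about 4 × $11 ≈ $44 of GPU time, while the largest (en-zh 10M, 333 GPU hours including QE scoring) comes to roughly 333 × $11 ≈ $3,700.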


30 of 35

Finetuning experiments

  1. Boost prod model: finetune the production student
  2. Replace teacher: pre-train the student on the best 50M NLLB data, finetune on LLM-generated data
  3. Try NewsPaLM: how does our data compare to the beast?
  4. 1M vs 10M: how does data size affect quality?
  5. Curriculum training vs finetuning
  6. QE-reranked vs greedy decoding


31 of 35

Finetuning results

Pair  | Pre-training, M  | Generated, M | Model       | Decoding | Post-training        | Diff, COMET22
de-en | NLLB 50          | 1            | PaLM 2 340B | MBR 512  | Finetuning           | 0.7 to 4
en-ru | NLLB 50          | 1            | Gemma 3 27B | Greedy   | Finetuning           | 0.7
en-ru | -                | 7            | Gemma 3 27B | Greedy   | Curriculum (NLLB 50) | 0.5 to 1.7
en-ru | Distillation 600 | 1            | Gemma 3 27B | QE 32    | Finetuning           | -0.1
en-zh | Distillation 600 | 1            | Qwen 3 235B | QE 8     | Finetuning           | 0.2 to 0.9
en-zh | Distillation 600 | 8            | Qwen 3 235B | QE 8     | Finetuning           | 0.5 to 2


32 of 35

Finetuning results

A mixed bag

Negative:
  • Marginal improvement with smaller LLMs and datasets
  • Costlier for larger datasets
  • Curriculum beats finetuning

Positive:
  • Finetuning works, including for distilled models
  • Even 1M examples can boost quality
  • NLLB pre-trained students are already good


33 of 35

Conclusion

Useful, but not quite a silver bullet just yet

Are LLMs useful for parallel data generation? Definitely!

  • Last-mile boost
  • Noisy original corpus
  • Low resource

Can we get rid of distillation? Probably… but it requires careful setup:

  • LLM choice
  • QE-reranking
  • QE-filtering
  • Curriculum training

Is it expensive to use LLMs? Less than you think, with:

  • vLLM
  • Modern GPUs
  • Inference optimizations


34 of 35

Links

Contributions are welcome!

Mozilla translations

github.com/mozilla/translations

Student models and evals

github.com/mozilla/firefox-translations-models

LLM experiments

github.com/mozilla/translations/tree/main/experiments/llmaat

LLM evals

wandb.ai/moz-translations/llm-evals

Matrix

https://matrix.to/#/#firefoxtranslations:mozilla.org


35 of 35

Thank you!
