1 of 39

State of Machine Learning at Wikipedia

Santhosh Thottingal, Principal Software Engineer, Wikimedia Foundation

2 of 39

Projects using Machine Learning at Wikipedia

  • Use cases
  • Guiding principles
  • Product design
  • Challenges
  • Impact

3 of 39

Content Translation

01

Machine Translation

4 of 39

Content Translation

Easier translation of articles between languages.

Reusing work already done by another community (notability, verifiability, etc.) lowers the risk of deletion.

It also expands the number of people who can contribute, as it requires a different set of skills compared to writing completely new content.

5 of 39

Language gap

Number of Wikipedia articles per language:

  • English: 6M
  • German: 2M
  • Indonesian: 657K
  • Telugu: 84K

6 of 39

7 of 39

Human curation of Machine translation

8 of 39

Machine Translation misuse prevention

9 of 39

Content Translation Impact

1.6 million+ articles published by translating. Combined, this would be a top-10 Wikipedia.

4% deletion rate, compared with the 13% deletion rate of articles created without translation.

10 of 39

Machine translation services

  • Apertium
  • Google
  • Yandex
  • Elia
  • LingoCloud
  • MinT

11 of 39

12 of 39

MinT

A self-hosted neural machine translation service by Wikipedia.

Serves multiple MT models and provides a single API interface (a request sketch follows the list below).

  • NLLB: generic model by Meta
  • NLLB-Wikipedia: Wikipedia-optimized model
  • OpusMT: for low-resource languages
  • SoftCatala: for English-Catalan
  • IndicTrans2: for 22 Indic languages and English
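To illustrate the "single API interface" idea, here is a minimal request sketch in Python. The endpoint URL and payload fields are placeholders assumed for illustration, not the documented MinT API.

```python
# Hypothetical sketch of calling a MinT-style translation API over HTTP.
# MINT_API_URL and the payload/response shapes are assumptions, not the
# documented MinT interface.
import requests

MINT_API_URL = "https://example.wikimedia.org/mint/translate"  # placeholder URL

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Send plain text to the MT service and return the translated text."""
    response = requests.post(
        MINT_API_URL,
        json={"text": text, "from": source_lang, "to": target_lang},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("translation", "")

print(translate("Wikipedia is a free encyclopedia.", "en", "ml"))
```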

13 of 39

MinT

A self-hosted neural machine translation service by Wikipedia.

Serves multiple MT models and provides a single API interface.

198 languages, 35,924 language pairs

14 of 39

Knowledge Integrity

02

AI article & edit quality assessment, vandalism patrol/prevention

15 of 39

16 of 39

17 of 39

Objective Revision Evaluation Service (ORES)

18 of 39

19 of 39

Prediction threshold preferences

20 of 39

Revert Risk is now a service hosted in the Lift Wing system.

21 of 39

Revert Risk models

Revert Risk Language-Agnostic
  • Characteristics: can run in all Wikipedia language editions; mainly based on metadata
  • Training data: implicit annotations (past reverts)
  • Pros: fast; light on resource usage; covers all languages
  • Cons: lower accuracy on IP edits; basic NLP power

Revert Risk Multilingual
  • Characteristics: can run in the top 47 Wikipedia language editions; uses an LLM (mBERT)
  • Training data: implicit annotations (past reverts)
  • Pros: advanced NLP power; fair on IP edits
  • Cons: covers just 47 languages; heavy on computation resources

A hedged sketch of querying the language-agnostic model follows.
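The sketch below shows how the language-agnostic revert-risk model hosted on Lift Wing might be queried over HTTP. The endpoint path, payload fields, and auth requirements reflect my reading of the public Lift Wing documentation and may differ in practice; treat this as illustrative, not the canonical client.

```python
# Hedged sketch of querying the revert-risk (language-agnostic) model on
# Lift Wing. Endpoint path, payload and auth details are assumptions based on
# public documentation and may differ; check the Lift Wing docs before use.
import requests

LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict"
)

def revert_risk(lang: str, rev_id: int) -> dict:
    """Return the model's prediction for a single revision."""
    response = requests.post(
        LIFTWING_URL,
        json={"lang": lang, "rev_id": rev_id},
        headers={"User-Agent": "revert-risk-demo/0.1"},  # identify the client
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: score one English Wikipedia revision by its revision id.
print(revert_risk("en", 12345))
```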

22 of 39

Structured Tasks

03

“add a link” and “add an image” to help new editors get started with easy tasks

23 of 39

Add a link

Newcomer task

New editors review machine-generated suggestions for linking words in one Wikipedia article to other Wikipedia articles.

24 of 39

“Add a link” is available via the Suggested edits feed on Homepage

Onboarding 1: Explains value and impact of this small contribution

Onboarding 2: “Human in the loop” reviews machine suggestions

25 of 39

Evaluating machine suggestions of specific text to make into links...

...as an easy and fast way of contributing

Encouragement to do more post-edit


26 of 39

Add a link

Algorithm

An algorithm developed by the WMF Research team automatically generates link recommendations for Wikipedia articles. The model's performance is evaluated on precision and recall. Based on manual feedback from editors, hard-coded rules are added to avoid unwanted linking (e.g. links to dates).

Machine-learning model: the model predicts the probability of a link in the article (anchor-text + target-page).

  • Identify unlinked text that could potentially contain a link
  • Generate candidate links by looking up existing links with this text
  • Score candidates and pick the most likely as the target-page

Training: the model is trained on existing sentences, with millions of positive examples (what is linked) and negative examples (what is not linked). A simplified sketch of the candidate-generation and scoring steps follows.
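The toy code below walks through the three steps listed above. The anchor index, the scoring function, and the threshold are simplified assumptions for illustration, not the WMF Research implementation.

```python
# Illustrative sketch of "add a link" candidate generation and scoring.
# Data structures and the scoring function are simplified stand-ins.
from dataclasses import dataclass

# Anchor text observed in existing wikilinks, mapped to candidate target pages
# and how often each target was linked from that text (toy data).
ANCHOR_INDEX = {
    "neural network": {"Artificial neural network": 950, "Neural circuit": 50},
    "python": {"Python (programming language)": 800, "Python (genus)": 200},
}

@dataclass
class LinkSuggestion:
    anchor: str
    target: str
    probability: float

def score(anchor: str, target: str, context: str) -> float:
    """Stand-in for the trained model: here, just the link-frequency prior."""
    counts = ANCHOR_INDEX[anchor]
    return counts[target] / sum(counts.values())

def suggest_links(text: str, threshold: float = 0.6) -> list[LinkSuggestion]:
    suggestions = []
    for anchor, candidates in ANCHOR_INDEX.items():
        if anchor in text.lower():  # 1. find unlinked text that could hold a link
            # 2-3. generate candidates from existing links and score them
            best = max(candidates, key=lambda t: score(anchor, t, text))
            p = score(anchor, best, text)
            if p >= threshold:  # keep only confident suggestions
                suggestions.append(LinkSuggestion(anchor, best, p))
    return suggestions

print(suggest_links("A neural network was trained in Python."))
```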

27 of 39

Add a link Impact

  • +17% activation: increase in the probability that a newcomer makes their first edit
  • +16% retention: increase in the probability that a new editor is retained
  • +18% productivity: increase in the number of edits newcomers make during their first couple of weeks
  • -11% reverts: decrease in revert rates compared to baseline newcomer edits (although this comparison is imperfect)

28 of 39

Add a link Challenges

Moderation burden

More edits means more work for patrollers.

Wider language support

Language characteristics and complexity affect sentence parsing, and ML models perform relatively poorly on low-resource languages.

Data scarcity

Data scarcity on small wikis leads to less performant ML models.

29 of 39

Optical Character Recognition

04

Document digitization

30 of 39

Optical Character Recognition

Tesseract

Self-hosted, open source OCR engine (a usage sketch follows this list).

Transkribus

Externally hosted OCR system. Used for digitizing historical and handwritten documents relevant for Wikisource.

Google Cloud Vision OCR

External service
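As a small example of the self-hosted option, the sketch below runs Tesseract from Python via the pytesseract wrapper. This is illustrative only; it is not how the Wikisource OCR tool is wired up internally, and the image file name is a placeholder.

```python
# Minimal sketch of running the Tesseract OCR engine from Python via
# pytesseract. Illustrative only; not the Wikisource integration.
from PIL import Image
import pytesseract

def ocr_page(image_path: str, lang: str = "eng") -> str:
    """Return the text Tesseract recognizes on one scanned page image."""
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)

print(ocr_page("scanned_page.png", lang="eng"))  # placeholder file name
```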

31 of 39

Lift Wing

05

Machine learning hosting platform

32 of 39

Lift Wing

Scalability: microservices can be scaled independently based on demand, allowing for more efficient resource utilization and improved performance.

Flexibility: the microservices architecture allows different languages and frameworks for each model service, providing greater flexibility in development.

Faster deployment: smaller codebases and independent deployment of microservices enable faster and more frequent releases, accelerating the path to production.

Fault isolation: a failure in one microservice is less likely to impact the entire system, improving overall resilience and uptime.

A minimal sketch of one model wrapped as its own inference service follows.
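The sketch below shows the general pattern of wrapping a single model as its own KServe inference service, assuming the KServe Python SDK. The model name and prediction logic are placeholders, not an actual Lift Wing service.

```python
# Hedged sketch of one model served as its own KServe inference service.
# The class name, model name and prediction logic are placeholders.
from kserve import Model, ModelServer

class DemoTopicModel(Model):
    """Hypothetical model service: loads once, then serves predict() calls."""

    def __init__(self, name: str):
        super().__init__(name)
        self.load()

    def load(self):
        # In a real service, trained model artifacts would be loaded here.
        self.ready = True

    def predict(self, payload: dict, headers: dict | None = None) -> dict:
        article = payload.get("article", "")
        # Placeholder logic standing in for real inference.
        return {"article": article, "topics": ["STEM.Technology"]}

if __name__ == "__main__":
    # Each microservice runs its own server and can scale independently.
    ModelServer().start([DemoTopicModel("demo-topic-model")])
```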

33 of 39

Lift Wing

Architecture diagram: an API Gateway in front of Lift Wing, which runs on Kubernetes (k8s) with KServe in the production environment. Community model governance is provided through model cards.

34 of 39

More machine learning use cases

Topic classification: language-agnostic, link-based article topic classification. Labels a given Wikipedia article in any language with a topic.

Language identification: given a content snippet, this model detects the language of the snippet. Supports ~200 languages (a sketch follows below).

Section alignment: identifies missing sections between a pair of existing articles across languages. Used in the Section Translation feature of Content Translation.
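The sketch below shows snippet-level language identification with a fastText-style classifier. Whether the hosted model is actually fastText, and the model file name used here, are assumptions made only for illustration.

```python
# Illustrative sketch of language identification on a content snippet using a
# fastText-style classifier. The model file name is a placeholder, and the
# hosted Wikimedia model is not necessarily fastText.
import fasttext

model = fasttext.load_model("lid-model.bin")  # placeholder model file

def detect_language(snippet: str) -> tuple[str, float]:
    """Return the top predicted language code and its confidence."""
    labels, probs = model.predict(snippet.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("ഇത് ഒരു മലയാളം വാക്യം ആണ്."))
```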

35 of 39

Third Party Machine Learning Services

Machine Translation

Google, Yandex, and Elia machine translation services in Content Translation.

Text to Speech

The Phonos extension uses the Google TTS API* to read IPA.

Machine Vision

Machine Vision uses Google's Cloud Vision API to identify potential depicts statements for images on Commons.

Image to Text (OCR)

Wikisource uses Google's OCR API and Transkribus.

Content moderation

Community Tech's CopyPatrol makes use of Turnitin's API to detect plagiarism between passages added to Wikipedia and external documents.

Named Entity Recognition

The Architecture team used Rosette to identify Wikidata items from text.

36 of 39

Model Cards

On-wiki model cards for every model hosted on WMF servers, to make machine learning open source, transparent, and human-centered.

  • Use case, users
  • Training data
  • Ethical considerations
  • Owners
  • License
  • Model architecture

37 of 39

38 of 39

Machine learning at Wikipedia

Santhosh Thottingal

Thank You

39 of 39