1 of 39

State of Machine Learning at Wikipedia

Santhosh Thottingal, Principal Software Engineer, Wikimedia Foundation

2 of 39

Projects using Machine Learning at Wikipedia

  • Use cases
  • Guiding principles
  • Product design
  • Challenges
  • Impact

3 of 39

Content Translation

01

Machine Translation

4 of 39

Content Translation

Easier translation of articles between languages.

Reusing work already done by another community (notability, verifiability, etc.) lowers the risk of deletion.

It also expands the number of people who can contribute, as it requires a different set of skills compared to writing completely new content.

5 of 39

Language gap

Number of Wikipedia articles per language:

  • English: 6M
  • German: 2M
  • Indonesian: 657K
  • Telugu: 84K

6 of 39

7 of 39

Human curation of Machine translation

8 of 39

Machine Translation misuse prevention

9 of 39

Content Translation Impact

1.6 million+ articles published by translating. Combined, this would be a top-10 Wikipedia.

4% deletion rate, compared with the 13% deletion rate of articles created without translation.

10 of 39

Machine translation services

  • Apertium
  • Google
  • Yandex
  • Elia
  • LingoCloud
  • MinT

11 of 39

12 of 39

MinT

A self-hosted neural machine translation service by Wikipedia.

Serves multiple MT models and provides a single API interface (a request sketch follows the list below).

  • NLLB: generic model by Meta
  • NLLB-Wikipedia: Wikipedia-optimized model
  • OpusMT: for low-resource languages
  • SoftCatala: for English-Catalan
  • IndicTrans2: for 22 Indic languages and English
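To illustrate the "single API interface" idea, here is a minimal request sketch in Python. The endpoint URL and payload fields are placeholders assumed for illustration, not the documented MinT API.

```python
# Hypothetical sketch of calling a MinT-style translation API over HTTP.
# MINT_API_URL and the payload/response shapes are assumptions, not the
# documented MinT interface.
import requests

MINT_API_URL = "https://example.wikimedia.org/mint/translate"  # placeholder URL

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Send plain text to the MT service and return the translated text."""
    response = requests.post(
        MINT_API_URL,
        json={"text": text, "from": source_lang, "to": target_lang},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("translation", "")

print(translate("Wikipedia is a free encyclopedia.", "en", "ml"))
```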

13 of 39

MinT

A self-hosted neural machine translation service by Wikipedia.

Serves multiple MT models and provides a single API interface.

198 languages, 35,924 language pairs

14 of 39

Knowledge Integrity

02

AI article & edit quality assessment, vandalism patrol/prevention

15 of 39

16 of 39

17 of 39

Objective Revision Evaluation Service (ORES)

18 of 39

19 of 39

Prediction threshold preferences

20 of 39

Revert Risk is now a service hosted in the Lift Wing system.

21 of 39

Revert Risk models

Revert Risk Language-Agnostic
  • Characteristics: can run in all Wikipedia language editions; mainly based on metadata
  • Training data: implicit annotations (past reverts)
  • Pros: fast; light on resource usage; covers all languages
  • Cons: lower accuracy on IP edits; basic NLP power

Revert Risk Multilingual
  • Characteristics: can run in the top 47 Wikipedia language editions; uses an LLM (mBERT)
  • Training data: implicit annotations (past reverts)
  • Pros: advanced NLP power; fair on IP edits
  • Cons: covers just 47 languages; heavy on computation resources

A hedged sketch of querying the language-agnostic model follows.
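The sketch below shows how the language-agnostic revert-risk model hosted on Lift Wing might be queried over HTTP. The endpoint path, payload fields, and auth requirements reflect my reading of the public Lift Wing documentation and may differ in practice; treat this as illustrative, not the canonical client.

```python
# Hedged sketch of querying the revert-risk (language-agnostic) model on
# Lift Wing. Endpoint path, payload and auth details are assumptions based on
# public documentation and may differ; check the Lift Wing docs before use.
import requests

LIFTWING_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/models/"
    "revertrisk-language-agnostic:predict"
)

def revert_risk(lang: str, rev_id: int) -> dict:
    """Return the model's prediction for a single revision."""
    response = requests.post(
        LIFTWING_URL,
        json={"lang": lang, "rev_id": rev_id},
        headers={"User-Agent": "revert-risk-demo/0.1"},  # identify the client
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example: score one English Wikipedia revision by its revision id.
print(revert_risk("en", 12345))
```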

22 of 39

Structured Tasks

03

“add a link” and “add an image” to help new editors get started with easy tasks

23 of 39

Add a link

Newcomer task

New editors review machine-generated suggestions for linking words in one Wikipedia article to other Wikipedia articles.

24 of 39

“Add a link” is available via the Suggested edits feed on Homepage

Onboarding 1: Explains value and impact of this small contribution

Onboarding 2: “Human in the loop” reviews machine suggestions

25 of 39

Evaluating machine suggestions of specific text to make into links...

...as an easy and fast way of contributing

Encouragement to do more post-edit


26 of 39

Add a link

Algorithm

An algorithm developed by the WMF Research team automatically generates link recommendations for Wikipedia articles. The model's performance is evaluated on precision and recall. Based on manual feedback from editors, hard-coded rules are added to avoid unwanted linking (e.g. links to dates).

Machine-learning model: the model predicts the probability of a link in the article (anchor-text + target-page).

  • Identify unlinked text that could potentially contain a link
  • Generate candidate links by looking up existing links with this text
  • Score candidates and pick the most likely as the target-page

Training: the model is trained on existing sentences, with millions of positive examples (what is linked) and negative examples (what is not linked). A simplified sketch of the candidate-generation and scoring steps follows.
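The toy code below walks through the three steps listed above. The anchor index, the scoring function, and the threshold are simplified assumptions for illustration, not the WMF Research implementation.

```python
# Illustrative sketch of "add a link" candidate generation and scoring.
# Data structures and the scoring function are simplified stand-ins.
from dataclasses import dataclass

# Anchor text observed in existing wikilinks, mapped to candidate target pages
# and how often each target was linked from that text (toy data).
ANCHOR_INDEX = {
    "neural network": {"Artificial neural network": 950, "Neural circuit": 50},
    "python": {"Python (programming language)": 800, "Python (genus)": 200},
}

@dataclass
class LinkSuggestion:
    anchor: str
    target: str
    probability: float

def score(anchor: str, target: str, context: str) -> float:
    """Stand-in for the trained model: here, just the link-frequency prior."""
    counts = ANCHOR_INDEX[anchor]
    return counts[target] / sum(counts.values())

def suggest_links(text: str, threshold: float = 0.6) -> list[LinkSuggestion]:
    suggestions = []
    for anchor, candidates in ANCHOR_INDEX.items():
        if anchor in text.lower():  # 1. find unlinked text that could hold a link
            # 2-3. generate candidates from existing links and score them
            best = max(candidates, key=lambda t: score(anchor, t, text))
            p = score(anchor, best, text)
            if p >= threshold:  # keep only confident suggestions
                suggestions.append(LinkSuggestion(anchor, best, p))
    return suggestions

print(suggest_links("A neural network was trained in Python."))
```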

27 of 39

Add a link Impact

  • +17% activation: increase in the probability that a newcomer makes their first edit
  • +16% retention: increase in the probability that a new editor is retained
  • +18% productivity: increase in the number of edits newcomers make during their first couple of weeks
  • -11% reverts: decrease in revert rates compared to baseline newcomer edits (although this comparison is imperfect)

28 of 39

Add a link Challenges

Moderation burden

More edits means more work for patrollers.

Wider language support

Language characteristics and complexity affect sentence parsing, and ML models perform relatively poorly on low-resource languages.

Data scarcity

Data scarcity on small wikis leads to less performant ML models.

29 of 39

Optical Character Recognition

04

Document digitization

30 of 39

Optical Character Recognition

Tesseract

Self-hosted, open source OCR engine (a usage sketch follows this list).

Transkribus

Externally hosted OCR system. Used for digitizing historical and handwritten documents relevant for Wikisource.

Google Cloud Vision OCR

External service
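As a small example of the self-hosted option, the sketch below runs Tesseract from Python via the pytesseract wrapper. This is illustrative only; it is not how the Wikisource OCR tool is wired up internally, and the image file name is a placeholder.

```python
# Minimal sketch of running the Tesseract OCR engine from Python via
# pytesseract. Illustrative only; not the Wikisource integration.
from PIL import Image
import pytesseract

def ocr_page(image_path: str, lang: str = "eng") -> str:
    """Return the text Tesseract recognizes on one scanned page image."""
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)

print(ocr_page("scanned_page.png", lang="eng"))  # placeholder file name
```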

31 of 39

Lift Wing

05

Machine learning hosting platform

32 of 39

Lift Wing

Scalability: microservices can be scaled independently based on demand, allowing for more efficient resource utilization and improved performance.

Flexibility: the microservices architecture allows different languages and frameworks for each model service, providing greater flexibility in development.

Faster deployment: smaller codebases and independent deployment of microservices enable faster and more frequent releases, accelerating the path to production.

Fault isolation: a failure in one microservice is less likely to impact the entire system, improving overall resilience and uptime.

A minimal sketch of one model wrapped as its own inference service follows.
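The sketch below shows the general pattern of wrapping a single model as its own KServe inference service, assuming the KServe Python SDK. The model name and prediction logic are placeholders, not an actual Lift Wing service.

```python
# Hedged sketch of one model served as its own KServe inference service.
# The class name, model name and prediction logic are placeholders.
from kserve import Model, ModelServer

class DemoTopicModel(Model):
    """Hypothetical model service: loads once, then serves predict() calls."""

    def __init__(self, name: str):
        super().__init__(name)
        self.load()

    def load(self):
        # In a real service, trained model artifacts would be loaded here.
        self.ready = True

    def predict(self, payload: dict, headers: dict | None = None) -> dict:
        article = payload.get("article", "")
        # Placeholder logic standing in for real inference.
        return {"article": article, "topics": ["STEM.Technology"]}

if __name__ == "__main__":
    # Each microservice runs its own server and can scale independently.
    ModelServer().start([DemoTopicModel("demo-topic-model")])
```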

33 of 39

Lift Wing

Architecture diagram: an API Gateway in front of Lift Wing, which runs on Kubernetes (k8s) with KServe in the production environment. Community model governance is provided through model cards.

34 of 39

More machine learning use cases

Topic classification: language-agnostic, link-based article topic classification. Labels a given Wikipedia article in any language with a topic.

Language identification: given a content snippet, this model detects the language of the snippet. Supports ~200 languages (a sketch follows below).

Section alignment: identifies missing sections between a pair of existing articles across languages. Used in the Section Translation feature of Content Translation.
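The sketch below shows snippet-level language identification with a fastText-style classifier. Whether the hosted model is actually fastText, and the model file name used here, are assumptions made only for illustration.

```python
# Illustrative sketch of language identification on a content snippet using a
# fastText-style classifier. The model file name is a placeholder, and the
# hosted Wikimedia model is not necessarily fastText.
import fasttext

model = fasttext.load_model("lid-model.bin")  # placeholder model file

def detect_language(snippet: str) -> tuple[str, float]:
    """Return the top predicted language code and its confidence."""
    labels, probs = model.predict(snippet.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("ഇത് ഒരു മലയാളം വാക്യം ആണ്."))
```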

35 of 39

Third Party Machine Learning Services

Machine Translation

Google, Yandex, and Elia machine translation services in Content Translation.

Text to Speech

The Phonos extension uses the Google TTS API* to read IPA.

Machine Vision

Machine Vision uses Google's Cloud Vision API to identify potential depicts statements for images on Commons.

Image to Text (OCR)

Wikisource uses Google's OCR API and Transkribus.

Content moderation

Community Tech's CopyPatrol makes use of Turnitin's API to detect plagiarism between passages added to Wikipedia and external documents.

Named Entity Recognition

The Architecture team used Rosette to identify Wikidata items from text.

36 of 39

Model Cards

On-wiki model cards for every model hosted on WMF servers, to make machine learning open source, transparent, and human-centered.

  • Use case, users
  • Training data
  • Ethical considerations
  • Owners
  • License
  • Model architecture

37 of 39

38 of 39

Machine learning at Wikipedia

Santhosh Thottingal

Thank You

39 of 39