State of Machine learning at Wikipedia
Santhosh Thottingal, Principal Software Engineer, Wikimedia Foundation
Projects using Machine Learning at Wikipedia
Content Translation
01
Machine Translation
Content Translation
Easier translation of articles between languages.
Reusing work done by another community (notability, verifiability, ...) lowers the risk of deletion.
It also expands the number of people who can contribute, as it requires a different set of skills compared to writing completely new content.
Language gap
English | 6M |
German | 2M |
Indonesian | 657K |
Telugu | 84K |
Human curation of Machine translation
Machine Translation misuse prevention
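One way to picture misuse prevention is a check on how much of the machine output a translator actually changed before publishing. The following is a minimal sketch under stated assumptions: the word-level similarity measure, the function names, and the 15% limit are all hypothetical and are not taken from Content Translation itself.

```python
from difflib import SequenceMatcher

# Hypothetical limit: flag translations where less than 15% of the MT output changed.
MIN_MODIFIED_FRACTION = 0.15


def modified_fraction(machine_output: str, published_text: str) -> float:
    """Rough share of the machine output that was changed, from 0.0 to 1.0."""
    mt_tokens = machine_output.split()
    final_tokens = published_text.split()
    # Word-level similarity; autojunk is disabled so long texts are compared fully.
    similarity = SequenceMatcher(None, mt_tokens, final_tokens, autojunk=False).ratio()
    return 1.0 - similarity


def needs_review(machine_output: str, published_text: str,
                 minimum: float = MIN_MODIFIED_FRACTION) -> bool:
    """True when the text is (almost) unmodified machine translation."""
    return modified_fraction(machine_output, published_text) < minimum
```

A text published exactly as the machine produced it has a modified fraction of 0.0 and would be flagged for review; a fully rewritten text would not.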
Content Translation Impact
1.6 Million+ | Articles published by translating. Combined, this would be a top-10 Wikipedia |
4% | Low deletion rate, compared with a 13% deletion rate for articles created without translation |
Machine translation services
Apertium
Yandex
Elia
LingoCloud
MinT
MinT
A self-hosted Neural Machine Translation service by Wikipedia
Serves multiple MT models and provides a single API interface
198 languages
35,924 language pairs
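The idea of serving several MT models behind a single API can be sketched as a router that maps each language pair to whichever model backend covers it. This is an illustration only: the class name, the method names, and the stand-in backends are hypothetical and do not describe MinT's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a backend takes source text and returns translated text.
Translator = Callable[[str], str]


class TranslationRouter:
    """Single entry point that dispatches to per-language-pair model backends."""

    def __init__(self) -> None:
        self._backends: Dict[Tuple[str, str], Translator] = {}

    def register(self, source: str, target: str, backend: Translator) -> None:
        """Attach a backend to one (source, target) language pair."""
        self._backends[(source, target)] = backend

    def supported_pairs(self) -> List[Tuple[str, str]]:
        """List the language pairs that currently have a backend."""
        return sorted(self._backends)

    def translate(self, source: str, target: str, text: str) -> str:
        """Translate text, or fail clearly when the pair has no backend."""
        backend = self._backends.get((source, target))
        if backend is None:
            raise ValueError(f"unsupported language pair: {source}->{target}")
        return backend(text)


# Stand-in backends for illustration; a real backend would run model inference.
router = TranslationRouter()
router.register("en", "de", lambda text: f"[en->de model A] {text}")
router.register("en", "te", lambda text: f"[en->te model B] {text}")
```

Callers only see `translate(source, target, text)`; which model handles a given pair stays an internal detail that can change without affecting clients.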
Knowledge Integrity
02
AI article & edit quality assessment, vandalism patrol/prevention
Objective Revision Evaluation Service (ORES)
Prediction Threshold preferences
Prediction Threshold
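A prediction threshold turns a model's probability into a decision about which edits to surface. The sketch below assumes scores between 0 and 1; the level names and cut-off values are hypothetical, since real thresholds are tuned per wiki and per model.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical threshold levels, strictest first.
LEVELS: List[Tuple[str, float]] = [
    ("very likely problematic", 0.9),
    ("likely problematic", 0.7),
    ("may be problematic", 0.4),
]


def label_for(probability: float,
              levels: List[Tuple[str, float]] = LEVELS) -> Optional[str]:
    """Return the strictest label whose minimum the score reaches, else None."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for label, minimum in levels:
        if probability >= minimum:
            return label
    return None


def flag_revisions(scores: Dict[int, float], preferred_minimum: float) -> List[int]:
    """Revision IDs whose score meets the threshold a patroller has chosen."""
    return sorted(rev for rev, score in scores.items() if score >= preferred_minimum)
```

A lower threshold surfaces more edits (higher recall, more false alarms); a higher threshold surfaces fewer edits with higher precision.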
Revert Risk is now a service hosted in the Lift Wing system
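Calling a hosted model such as Revert Risk amounts to an HTTP POST with a small JSON payload. The sketch below builds such a request without sending it. The base URL, the KServe-style `:predict` path, the payload fields, and the response shape are assumptions to verify against the current Lift Wing documentation; the sample response is hand-written for illustration and is not real model output.

```python
import json
from urllib import request

# Assumed endpoint layout; check the Lift Wing documentation before relying on it.
BASE_URL = "https://api.wikimedia.org/service/lw/inference/v1/models"


def build_request(model_name: str, rev_id: int, lang: str) -> request.Request:
    """Prepare (but do not send) a prediction request for one revision."""
    url = f"{BASE_URL}/{model_name}:predict"
    body = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def revert_probability(response_body: str) -> float:
    """Extract the probability of a revert from an assumed response shape."""
    data = json.loads(response_body)
    return float(data["output"]["probabilities"]["true"])


# Hand-written example body, for illustration only.
SAMPLE_RESPONSE = (
    '{"output": {"prediction": false, '
    '"probabilities": {"true": 0.12, "false": 0.88}}}'
)
```

To actually send the request, pass the result of `build_request(...)` to `urllib.request.urlopen` and feed the decoded body to `revert_probability`.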
Technology Revert Risk
 | Revert Risk Language Agnostic | Revert Risk Multilingual |
Characteristics | | |
Training Data | Implicit Annotations (past reverts) | |
Pros | | |
Cons | | |
Structured Tasks
03
“add a link” and “add an image” to help new editors get started with easy tasks
Add a link
Newcomer task
New editors review machine suggestions for making words in one Wikipedia article link to other Wikipedia articles.
“Add a link” is available via the Suggested edits feed on Homepage
Onboarding 1: Explains value and impact of this small contribution
Onboarding 2: “Human in the loop” reviews machine suggestions
Evaluating machine suggestions of specific text to make into links...
...as an easy and fast way of contributing
Encouragement to do more post-edit
Add a link
Algorithm
An algorithm developed by the WMF Research team automatically generates link recommendations for Wikipedia articles. The model's performance is evaluated on precision and recall. Based on manual feedback from editors, hard-coded rules are implemented to avoid unwanted linking (e.g. links to dates).
Machine-learning model: The model predicts the probability of a link in the article (anchor text + target page).
Training: The model is trained on millions of positive examples (what is linked) and negative examples (what is not linked) taken from existing sentences.
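The pipeline described above (a probability per anchor-text and target-page pair, hard-coded exclusion rules, and evaluation by precision and recall) can be sketched as follows. The data class, the 0.5 threshold, and the date-matching rule are hypothetical stand-ins, not the WMF Research implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class LinkCandidate:
    anchor_text: str
    target_page: str
    probability: float  # model score for this (anchor text, target page) pair


# Hypothetical hard-coded rule: anchors that look like years or dates are skipped.
_DATE_LIKE = re.compile(r"(\d{1,2}\s+)?([A-Z][a-z]+\s+)?\d{3,4}")


def is_blocked(candidate: LinkCandidate) -> bool:
    """True when a hard-coded rule forbids suggesting this link."""
    return _DATE_LIKE.fullmatch(candidate.anchor_text) is not None


def select_links(candidates: List[LinkCandidate],
                 threshold: float = 0.5) -> List[LinkCandidate]:
    """Keep candidates that score above the threshold and pass the rules."""
    return [c for c in candidates
            if c.probability >= threshold and not is_blocked(c)]


def precision_recall(predicted: Set[Tuple[str, str]],
                     actual: Set[Tuple[str, str]]) -> Tuple[float, float]:
    """Precision and recall of predicted links against known links."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Raising the threshold trades recall for precision, which matters here because newcomers review every suggestion that gets through.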
Add a link Impact
+17% | Activation: increase in probability that a newcomer makes their first edit |
+16% | Retention: increase in probability that a new editor is retained |
+18% | Productivity: increase in the number of edits newcomers make during their first couple of weeks |
-11% | Reverts: decrease in revert rates compared to baseline newcomer edits (although this comparison is imperfect) |
Add a link Challenges
Moderation burden: More edits mean more work for patrollers. |
Wider language support: Language characteristics and complexity affect sentence parsing. ML models perform relatively poorly on low-resource languages. |
Data scarcity: Limited data on small wikis leads to less performant ML models. |
Optical Character Recognition
04
Document digitization
Optical Character Recognition
Tesseract | Self-hosted open-source OCR engine |
Transkribus | Externally hosted OCR system. Used for digitizing historical and handwritten documents relevant for Wikisource. |
Google Cloud Vision OCR | External service |
Lift Wing
05
Machine learning hosting platform
Lift Wing
Scalability: Microservices can be scaled independently based on demand, allowing more efficient resource utilization and improved performance. |
Flexibility: A microservices architecture allows different languages and frameworks for each model service, giving greater flexibility in development. |
Faster deployment: Smaller codebases and independent deployment of microservices enable faster and more frequent releases to production. |
Fault isolation: A failure in one microservice is less likely to impact the entire system, improving overall resilience and uptime. |
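Fault isolation can be illustrated in miniature: each model service is called on its own, and a failure in one is recorded without affecting the others. This is an in-process sketch with hypothetical handler names; in a real deployment each model runs as a separate service rather than a function in the same program.

```python
from typing import Any, Callable, Dict

# Hypothetical type: a model service takes a payload and returns a prediction.
ModelHandler = Callable[[dict], Any]


def call_services(services: Dict[str, ModelHandler],
                  payload: dict) -> Dict[str, dict]:
    """Call every service independently; one failure does not stop the rest."""
    results: Dict[str, dict] = {}
    for name, handler in services.items():
        try:
            results[name] = {"ok": True, "output": handler(payload)}
        except Exception as error:  # isolate the failure to this one service
            results[name] = {"ok": False, "error": str(error)}
    return results


def healthy_model(payload: dict) -> dict:
    """Stand-in for a working model service."""
    return {"score": 0.5}


def broken_model(payload: dict) -> dict:
    """Stand-in for a model service that is currently failing."""
    raise RuntimeError("model failed to load")
```

Calling `call_services({"a": healthy_model, "b": broken_model}, {})` still returns a usable result for `a` alongside an error entry for `b`.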
Lift Wing
[Architecture diagram; labels: API Gateway, KServe, k8s]
Lift Wing
Production environment
Community Model Governance
Model Cards
More machine learning use cases
Topic classification system: Language-agnostic, link-based article topic classification. Labels a given Wikipedia article in any language with a topic. |
Language identification: Given a content snippet, this model detects the language of the snippet. Supports ~200 languages. |
Section alignment: Identifies missing sections between a pair of existing articles in any two languages. Used in the Section Translation feature of Content Translation. |
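Once section titles in two languages have been aligned, finding missing sections reduces to a lookup. The sketch below assumes the alignment is already available as a simple dictionary; the function name and the example titles are hypothetical, and the actual model that produces alignments is not shown.

```python
from typing import Dict, List


def missing_sections(source_sections: List[str],
                     target_sections: List[str],
                     alignment: Dict[str, str]) -> List[str]:
    """Source sections with no counterpart in the target article.

    `alignment` maps a source-language section title to its target-language
    equivalent. Sections without a known alignment are treated as missing.
    """
    present = set(target_sections)
    missing = []
    for title in source_sections:
        target_title = alignment.get(title)
        if target_title is None or target_title not in present:
            missing.append(title)
    return missing


# Illustrative English-to-German example with a hand-written alignment.
example = missing_sections(
    ["History", "Geography", "Economy"],
    ["Geschichte"],
    {"History": "Geschichte", "Geography": "Geographie"},
)
```

In this example, `Geography` and `Economy` are reported as candidates for translation because the target article has only the section aligned with `History`.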
Third Party Machine Learning Services
Machine Translation | Google, Yandex, Elia machine translation services in Content Translation |
Text to Speech | The Phonos extension uses the Google TTS API to read IPA* |
Machine Vision | Machine Vision uses Google's Cloud Vision API to identify potential depicts statements for images in Commons. |
Image to Text (OCR) | Wikisource uses Google's OCR API and Transkribus |
Content moderation | Community Tech's CopyPatrol uses Turnitin's API to detect plagiarism between passages added to Wikipedia and external documents |
Named Entity Recognition | Architecture team used Rosette to identify Wikidata items from text |
Model Cards
On-wiki model cards for every model hosted on WMF servers to make open source, transparent, human-centered machine learning
Machine learning at Wikipedia
Santhosh Thottingal
Thank You