1 of 65

Research at the Service of Free Knowledge

Leila Zia, Head of Research

K-CAP 2021

2021-12-03

2 of 65

2001

3 of 65

An online encyclopedia that anyone can edit and access for free

4 of 65

0.5M volunteer editors

280+ languages

15B monthly pageviews

10M monthly edits

The largest encyclopedia

55M articles

5 of 65

6 of 65

7 of 65

8 of 65

Wikipedia is an evolving radical model for the governance of knowledge.

9 of 65

Who operates Wikipedia?

10 of 65

Wikimedia Foundation

  • It is a non-profit organization of ~500 staff
  • It provides broad support to Wikimedia communities and projects: servers, data centers, legal and communications support, etc.
  • It does not create or modify content.
  • It does not define or enforce policies on the projects

11 of 65

Wikimedia projects

12 of 65

Research

Research by Victoruler (CC BY 3.0, from the Noun Project)

13 of 65

Research priorities

DARIO TARABORELLI /CC0

Addressing knowledge gaps

Improving knowledge integrity

14 of 65

Verifiability

Transparency

Neutrality

Consensus

Privacy

Mission

A connected and open Web and internet

Freedom of speech and thought

Autonomy and ownership

Decentralization

Independence

Open data, science, and code

Multilinguality

Equity

infrastructure and compute resources

15 of 65

Addressing knowledge gaps

16 of 65

The research program: Addressing Knowledge Gaps

Identify gaps

Bridge gaps

Measure gaps

17 of 65

https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

English Wikipedia (950,277)

Native Speakers 527M

18 of 65

https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

Russian Wikipedia (298,215)

Native Speakers 254M

19 of 65

https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

Spanish Wikipedia (261,495)

Native Speakers 389M

20 of 65

https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

Portuguese Wikipedia (185,133)

Native Speakers 193M

21 of 65

https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

Arabic Wikipedia (87,017)

Native Speakers 467M
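
One way to read the five maps together is to normalize each count by the number of native speakers. The sketch below does only that arithmetic with the figures shown above. It assumes the number in parentheses is the count of items plotted on each map, which the slides do not state, and the function name is illustrative.

```python
# Figures copied from the slides: (count in parentheses, native speakers in millions).
# Assumption: the parenthesized number is the count of items plotted on each map.
maps = {
    "English": (950_277, 527),
    "Russian": (298_215, 254),
    "Spanish": (261_495, 389),
    "Portuguese": (185_133, 193),
    "Arabic": (87_017, 467),
}


def items_per_million_speakers(count, speakers_millions):
    """Normalize a map count by the native-speaker population (in millions)."""
    return count / speakers_millions


# Sort languages from the highest to the lowest normalized count.
ranking = sorted(
    ((lang, items_per_million_speakers(*values)) for lang, values in maps.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for lang, ratio in ranking:
    print(f"{lang}: {ratio:.0f} items per million native speakers")
```

Under this normalization Arabic ends up far below the other four languages, even though it has the second-largest speaker population in the list.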

22 of 65

Content

In order to understand knowledge gaps we must understand not only the content gaps but also the readership and contributorship gaps.

Knowledge

is socially constructed.

Bruno Latour and Steve Woolgar. Laboratory life: The construction of scientific facts.

23 of 65

Johnson, Isaac, Florian Lemmerich, Diego Sáez-Trumper, Robert West, Markus Strohmaier, and Leila Zia. "Global gender differences in Wikipedia readership." AAAI ICWSM 2021

Wikipedia pageviews of readers by language and gender

24 of 65

The Knowledge Gap Index

Aiko Chou, Martin Gerlach, Fabian Kaelin, Isaac Johnson, Marc Miquel, Miriam Redi, Leila Zia

25 of 65

Knowledge Equity

[from Wikimedia 2030 strategy]

_Knowledge equity: As a social movement, we will focus our efforts on the knowledge and communities that have been left out by structures of power and privilege. We will welcome people from every background to build strong and diverse communities. We will break down the social, political, and technical barriers preventing people from accessing and contributing to free knowledge._

26 of 65

How far are we from reaching knowledge equity?

Operationalize knowledge equity

Identify and measure the individual components (knowledge gaps) based on which we can track our progress towards this goal

27 of 65

Goal: The Knowledge Gap Index

  • Identify and quantify all knowledge gaps
  • A research product for WMF, communities, affiliates, and researchers to monitor the progress towards knowledge equity based on quantifiable evidence.

Example: EU’s Gender Equality Index

28 of 65

Image Credits: Marc Miquel Ribe

Example: Monitoring the Gender Gap on Wikipedia

29 of 65

1: Identify

Build a Taxonomy of Knowledge Gaps

2: Quantify

Develop Metrics to Quantify Knowledge Gaps

3: Expose

Surface gaps in the Knowledge Gap Index

0.5

We are HERE

Our Roadmap

30 of 65

Taxonomy of Knowledge gaps: how we built it

Knowledge is not only about content!

Readers

Contributors

Content


31 of 65

Knowledge gaps:

Disparities with respect to coverage of specific groups of readers, contributors or content across Wikimedia projects.

32 of 65

Taxonomy of Knowledge gaps: how we built it

Finding evidence of knowledge gaps from different sources

Academic Literature

Movement Strategy and Initiatives

Community Surveys

33 of 65

34 of 65

2: Quantify

Develop Metrics to Quantify Knowledge Gaps

1: Identify

Build a Taxonomy of Knowledge Gaps

35 of 65

Knowledge Gaps Metrics:

Data Categorization

Data

Metrics Generation

Questions about knowledge gaps

What is the most popular motivation for editing Farsi Wikipedia?

Which gender group has higher quality articles and images in Greek Wikipedia?

What is the geographical distribution of articles in Kiswahili?

Stakeholder consultations

36 of 65

Cultural Background Gap:

What is the extent of local content coverage?

The language cultural context is defined as all the places, people, objects ...

that relate to the territories where the language is spoken

Miquel-Ribé, Marc, and David Laniado. "Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions." Proceedings of the International AAAI Conference on Web and Social Media. Vol. 13. 2019.

Content Classifier

Cultural context? y/n

Cultural context label

Mapping between languages and local territories

Articles labeled as local and non-local content

Data Categorization

Aggregation over all articles: proportion of local/nonlocal content by language

Metrics Generation

Data
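
The final aggregation step on this slide can be sketched as follows. This is a minimal illustration, not the pipeline from the cited paper: the input rows and the function name are hypothetical, and the local/non-local labels are assumed to come from the content classifier.

```python
from collections import defaultdict

# Hypothetical output of the categorization step: (language, article title, is_local).
labeled_articles = [
    ("ca", "Barcelona", True),
    ("ca", "Tokyo", False),
    ("ca", "Sardana", True),
    ("sw", "Dar es Salaam", True),
    ("sw", "Paris", False),
]


def local_content_share(rows):
    """Return the proportion of articles labeled as local content, per language."""
    totals = defaultdict(int)
    local = defaultdict(int)
    for language, _title, is_local in rows:
        totals[language] += 1
        if is_local:
            local[language] += 1
    return {language: local[language] / totals[language] for language in totals}


shares = local_content_share(labeled_articles)
for language, share in shares.items():
    print(f"{language}: {share:.0%} local content")
```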

37 of 65

Readability Gap:

What is the readability of content on Wikipedia?

Readability is the ease with which a reader can understand a written text

There exist automatic readability scores for English … but what about other languages?
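
As an example of such an English score, the sketch below computes the Flesch Reading Ease formula. The syllable counter is a rough vowel-group heuristic chosen for this illustration. Both the heuristic and the formula's constants are tuned to English, which is part of why they do not transfer directly to other languages.

```python
import re


def count_syllables(word):
    """Rough English heuristic: count groups of consecutive vowels (minimum 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text. Expects non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


simple = "The cat sat on the mat. It was a warm day."
dense = "Neoplatonist philosophical epistemology presupposes metaphysical hierarchical emanation."

print(flesch_reading_ease(simple), flesch_reading_ease(dense))
```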

38 of 65

Motivation Gap:

Why do people read Wikipedia?

Previous research developed survey questions in different languages to measure the motivation behind readership.

But how to measure this at scale, and for all languages?

39 of 65

Open questions

  • What barriers contribute to knowledge gaps?
  • What are the causes of knowledge gaps?
  • What is the relationship between different gaps?
  • How can we measure gaps more efficiently and at scale?

40 of 65

Open questions

Content

  • How to find knowledge that is not already on the projects across languages, content types, and in a scalable way?
  • Build a notability model.
  • Automatically generating tags for images
  • How should we define and measure article importance?

Readership

  • How can we address the imbalances in readership?
  • How do people learn on Wikipedia?
  • What is the role of visual knowledge in (encyclopedic) learning?
  • How are images being used on Wikipedia?

41 of 65

Open questions

Contributorship

  • How can we help diversify Wikipedia’s editor population?

General

  • What is the relationship between content, readership, and contributorship gaps?
  • How do we measure knowledge gaps?
  • ...

42 of 65

Link recommendation

Martin Gerlach (Research Team)

Growth Team (Marshall Miller, Rita Ho, Kosta Harlan, many more)

Djellel Difallah (NYU Abu Dhabi)

43 of 65

Editing is hard

Problem

Does this article need an update? How do I start editing?

Is this saying a reference is needed? How do I add it?

Is this where the references are? How is it different from citations?

Technical: What is an infobox?

Conceptual: What is notability?

Cultural: Why are people so mean?

44 of 65

Structured task editing

  • Break down editing into simpler tasks
    • Easier to: understand, do on mobile, get positive experience
  • We picked the task of: "adding a link"
    • Well-defined, frequent, and attractive task type
  • Machine-in-the-loop:
    • Generate recommendations using ML approaches.
    • Editors verify the output and validate the insertion.

Solution

45 of 65

Link recommendation

Entity-linking task

  • Mention detection
  • Link generation
  • Link disambiguation

Hypatia (born c. 350–370; died 415 AD) was a Hellenistic Neoplatonist philosopher, astronomer, and mathematician, who lived in Alexandria, Egypt, then part of the Eastern Roman Empire.

astronomer

Astronomy

Astronomer

--no link--

?

46 of 65

The Add-a-link Task in Wikipedia

  • Language support
    • 280+ language versions of Wikipedia (multilinguality); highest impact for smaller communities; bias
  • Other considerations
    • Manual of style constraints (decentralization)
    • Prefer simpler over complex models (scalability, infrastructure resources, transparency, etc)
    • Utility: find a balance between precision and recall, preferring precision over recall

47 of 65

Step 1: Mention detection

  • Build mention dictionary from all existing links
    • e.g. English Wikipedia: 7M anchors, 170M links
  • String-matching sweeping window for all possible n-grams (N=1...10) that match the dictionary
  • Give preference to larger N
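
The sweeping-window matching above can be sketched as a greedy scan that tries the largest window first. This is a minimal illustration under stated assumptions: whitespace tokenization, lowercased dictionary keys, and a toy anchor set; the function name is hypothetical and this is not the production code.

```python
def detect_mentions(tokens, anchor_dictionary, max_n=10):
    """Greedy left-to-right scan that prefers the longest matching n-gram."""
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the largest window first so longer anchors win over shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in anchor_dictionary:
                match = (i, n, candidate)
                break
        if match:
            mentions.append(match)
            i += match[1]  # skip past the matched span
        else:
            i += 1
    return mentions


# Hypothetical toy dictionary; keys are lowercased anchor texts.
anchors = {"eastern roman empire", "roman empire", "alexandria", "egypt"}
tokens = "who lived in Alexandria Egypt then part of the Eastern Roman Empire".split()
mentions = detect_mentions(tokens, anchors)
print(mentions)
```

Because the 3-gram is tried before the 2-gram, "Eastern Roman Empire" is detected as one mention and "Roman Empire" is never matched separately.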

48 of 65

Step 2: Link generation

  • From the anchor-dictionary extract all used links
  • Drop links based on constraints (type-based: disambiguation pages, etc.)
  • Drop links based on link probability heuristic:
    • Text to Link ratio < 6.5% (picked empirically, and supported by previous work)
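
A minimal sketch of this filtering step follows. It assumes the ratio is the number of times an anchor text appears as a link divided by the number of times it appears in text at all, and that mentions below the 6.5% cutoff are dropped. The data structures, counts, and function name are hypothetical.

```python
LINK_PROBABILITY_THRESHOLD = 0.065  # the 6.5% cutoff from the slide


def generate_candidates(mention, anchor_dictionary, text_counts, excluded_pages):
    """Return candidate link targets for a mention, or [] if it is filtered out."""
    targets = anchor_dictionary.get(mention, {})
    times_linked = sum(targets.values())
    times_seen = text_counts.get(mention, 0)
    # Assumed heuristic: drop mentions that are rarely linked when they occur.
    if times_seen == 0 or times_linked / times_seen < LINK_PROBABILITY_THRESHOLD:
        return []
    # Type-based constraint: remove excluded targets such as disambiguation pages.
    return [target for target in targets if target not in excluded_pages]


# Hypothetical toy data: anchor -> {target page: number of times linked}.
anchor_dictionary = {
    "astronomer": {"Astronomer": 900, "Astronomy": 80, "Astronomer (disambiguation)": 20},
    "the": {"The": 5},
}
text_counts = {"astronomer": 5000, "the": 1_000_000}
excluded = {"Astronomer (disambiguation)"}

print(generate_candidates("astronomer", anchor_dictionary, text_counts, excluded))
print(generate_candidates("the", anchor_dictionary, text_counts, excluded))
```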

49 of 65

Step 3-a: Link disambiguation - features

  • N-gram size: the number of tokens in the anchor (based on simple tokenization).
  • Frequency: count of the anchor-link pair in the anchor-dictionary.
  • Ambiguity: how many different candidate links exist for an anchor in the anchor-dictionary.
  • Kurtosis: the kurtosis of the shape of the distribution of candidate-links for a given anchor in the anchor-dictionary
  • Levenshtein distance: a string similarity measure between the anchor and the link, e.g., the Levenshtein distance between “kitten” and “sitting” is 3.
  • Wiki2Vec Distance (entity embedding): similarity between the article (source-page) and the link (target-page) based on the content of the pages.
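
Most of these features can be computed from the anchor dictionary alone. The sketch below is illustrative only: the function names and toy counts are hypothetical, the kurtosis is the plain (non-excess) population form, which is an assumption, and the Wiki2Vec distance is omitted because it needs trained embeddings.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def kurtosis(values):
    """Population (non-excess) kurtosis; 0.0 when the variance is zero."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) if m2 > 0 else 0.0


def candidate_features(anchor, link, anchor_dictionary):
    """Dictionary-based features for one (anchor, candidate link) pair."""
    counts = anchor_dictionary[anchor]
    return {
        "ngram_size": len(anchor.split()),
        "frequency": counts[link],
        "ambiguity": len(counts),
        "kurtosis": kurtosis(list(counts.values())),
        "levenshtein": levenshtein(anchor.lower(), link.lower()),
    }


# Hypothetical toy counts: anchor -> {target page: number of times linked}.
toy_dictionary = {"astronomer": {"Astronomer": 900, "Astronomy": 80}}
features = candidate_features("astronomer", "Astronomy", toy_dictionary)
print(features)
print(levenshtein("kitten", "sitting"))
```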

50 of 65

Step 3-b: Link disambiguation - classifier

  • Extract fully linked sentences from the lead sections
    • Positive example: linked mention with correct link
    • Negative example: linked mention with incorrect link, unlinked mention
  • Train a binary classifier (XGBoost)
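
How positive and negative examples could be derived from one fully linked sentence is sketched below. The input format and function name are assumptions for illustration. Each labeled pair would then be turned into a feature vector and passed to a binary classifier such as XGBoost; that training step is not shown here.

```python
def build_training_examples(sentence_mentions, anchor_dictionary):
    """Label every (anchor, candidate) pair from one sentence.

    sentence_mentions: list of (anchor, true_link) pairs, where true_link is
    None for a mention that the sentence leaves unlinked.
    """
    examples = []
    for anchor, true_link in sentence_mentions:
        for candidate in anchor_dictionary.get(anchor, {}):
            # Positive only when the candidate is the link actually used.
            label = 1 if candidate == true_link else 0
            examples.append((anchor, candidate, label))
    return examples


# Hypothetical toy data: anchor -> {target page: number of times linked}.
toy_anchors = {
    "astronomer": {"Astronomer": 900, "Astronomy": 80},
    "alexandria": {"Alexandria": 500},
    "lived": {"Life": 3},
}
sentence = [("astronomer", "Astronomer"), ("alexandria", "Alexandria"), ("lived", None)]
examples = build_training_examples(sentence, toy_anchors)
print(examples)
```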

51 of 65

Evaluation

Held-out test set + Manual evaluation (thanks: Bennoit Evellin, Habib Mhenni, Martin Urbanec, Bluetpp, -revi)

Tested Wikis: Arabic, Bengali, Czech, English, French, Vietnamese

Precision: 70% - 92%
How many suggestions are correct?

Recall: 30% - 66%
How many of the possible links are captured?
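
The two questions above correspond to the following computation, sketched on hypothetical toy data. Links are represented as (anchor, target) pairs, which is an assumption of this example and not necessarily the evaluation setup used for the reported numbers.

```python
def precision_recall(suggested, reference):
    """Both arguments are sets of (anchor, target) pairs for the same text."""
    true_positives = len(suggested & reference)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical links present in the held-out text.
reference_links = {
    ("philosopher", "Philosopher"),
    ("astronomer", "Astronomer"),
    ("Alexandria", "Alexandria"),
    ("Egypt", "Egypt"),
}
# Hypothetical model suggestions: two correct, one wrong target.
suggested_links = {
    ("astronomer", "Astronomer"),
    ("Alexandria", "Alexandria"),
    ("mathematician", "Mathematics"),
}

precision, recall = precision_recall(suggested_links, reference_links)
print(f"precision={precision:.2f} recall={recall:.2f}")
```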

52 of 65

Link recommendation model

  • Training pipeline for each language.
  • Models/datasets published publicly
  • Link recommendation API on Kubernetes

53 of 65

User interface

Evaluate the suggestion

Feedback on algorithm

Edit summary

Next suggestion

54 of 65

In practice

  • Results from pilot-wikis (Arabic, Bengali, Czech, Vietnamese)
    • Newcomers prefer structured editing
    • Revert rate of structured edits is much lower (7.9% vs 25.5% for unstructured tasks)
    • Careless editing is rare
    • Reactions from community mostly positive
  • Deployment
    • Currently: Arabic, Bengali, Czech, French, Hungarian, Persian, Polish, Romanian, Russian, Vietnamese
    • Planned: Catalan, Hebrew, Hindi, Korean, Norwegian, Portuguese, Simple English, Swedish, Ukrainian

55 of 65

Open questions

Content

  • How to find knowledge that is not already on the projects across languages, content types, and in a scalable way?
  • Build a notability model.
  • Automatically generating tags for images
  • How should we define and measure article importance?

Readership

  • How can we address the imbalances in readership?
  • How do people learn on Wikipedia?
  • What is the role of visual knowledge in (encyclopedic) learning?
  • How are images being used on Wikipedia?

56 of 65

Open questions

Contributorship

  • How can we help diversify Wikipedia’s editor population?

General

  • What is the relationship between content, readership, and contributorship gaps?
  • How do we measure knowledge gaps?
  • ...

57 of 65

Scaling Research on

Free Knowledge

Growth by Fabio Rinaldi (CC BY 3.0, from the Noun Project)

58 of 65

DARIO TARABORELLI /CC0

A sustainable distributed network of Wikimedia projects relies on an empowered global network of Wikimedia researchers.

59 of 65

4:1,000,000,000

The Research Team

Martin Gerlach

Research Scientist

Isaac Johnson

Research Scientist

Emily Lescak

Senior Research Community Officer

Miriam Redi

Research Manager

Diego Sáez-Trumper

Senior Research Scientist

Pablo Aragón

Research Scientist

Leila Zia

Director, Head of Research

Fabian Kaelin

Senior Research Engineer

60 of 65

Formal Collaborators

61 of 65

The current initiatives and principles

To further expand and nurture the research community around the Wikimedia projects we:

  • have a commitment to open code, open data, and open science (OA Policy)
  • organize annual research events including Wiki Workshop
  • organize monthly Public Research Showcases and Research Office Hours
  • build Formal Collaborations
  • offer internship opportunities or mentor Outreachy interns, with a focus on underrepresented communities
  • offer research funds
  • award the Wikimedia Foundation Research Award of the Year
  • ...

62 of 65

Support the

Wikimedia projects

Help by shashank singh (CC BY 3.0, from the Noun Project)

63 of 65

  • Free your code! All code that is used in production for the Wikimedia projects must be open sourced. We can’t use your learnings if you don’t release them under a free license.

  • Free your data! Did you develop a taxonomy or glossary and didn’t publish it under a CC BY-SA 4.0 or a more permissive license? Wikimedia projects cannot benefit from your work!

  • Free your knowledge! Paywalls create major barriers in equity and knowledge sharing. CC BY-SA 4.0 or more permissive license on papers!

64 of 65

  • Submit to Wiki Workshop 2022!

wikiworkshop.org (expect updated information in a week)

The 9th edition will take place as part of TheWebConf 2022.

  • Apply for a Research Fund (deadline: January 3, 2022)

https://w.wiki/4VR8

USD 2K-50K

or follow us on Twitter: @WikiResearch

  • Join us in our Monthly Office Hours https://w.wiki/uJo or Showcases https://w.wiki/uJn

  • Volunteer your time for research https://bit.ly/3oiOnXu , edit Wikipedia or donate!

65 of 65

leila@wikimedia.org

http://research.wikimedia.org