1 of 80

Wikipedia beyond the encyclopedic value

Diego Sáez Trumper - May 2022

Sapienza University of Rome

diego@wikimedia.org

2 of 80

I am Diego Sáez Trumper

  • Senior Research Scientist
  • Studies
    • PhD in Information Technology
    • Bachelor in Acoustic Engineering
  • Academic Research
    • UPF, Cambridge, UFMG, UACH
  • Industry Experience
    • Yahoo! Labs
    • Start-ups
    • Qatar Computing Research Institute
  • Activism
    • Free Education
    • Free Software

3 of 80

Agenda

  • Introduction to Wikimedia Ecosystem
  • Basic concepts
  • Data and tools
  • Wikidata
  • Open Questions and Challenges

4 of 80

5 of 80

  • 200K registered volunteer editors
  • 280 languages
  • 15B monthly pageviews
  • 10M monthly edits
  • 48M articles

6 of 80

Wikimedia Foundation

  • Non-profit organization
  • Operates Wikipedia and sister projects
  • ~380 employees
    • The Research team has only 8 people!

7 of 80

Martin Gerlach

Research Scientist

Isaac Johnson

Research Scientist

Emily Lescak

Senior Research Community Officer

Miriam Redi

Research Manager

Diego Sáez-Trumper

Senior Research Scientist

Leila Zia

Director, Head of Research

Fabian Kaelin

Senior Research Engineer

Pablo Aragón

Research Scientist

Wikimedia Research Team

8 of 80

Wikimedia Research

  • Fundamental Research
    • Make our data easier to understand and consume
    • Grow the community of researchers around (local) Wikipedias
  • Knowledge Gaps
    • Measure and address gaps in readers, editors, and content
  • Knowledge Integrity
    • Measure and improve the accountability of the content
    • Fight disinformation

https://research.wikimedia.org

9 of 80

DARIO TARABORELLI /CC0

10 of 80

Minimalist user data collection

DARIO TARABORELLI /CC0

Revision histories are public data

No 3rd-party collection or sharing of user data

Private user data removed after 90 days

11 of 80

The world’s most visited online medical resource

[Heilman and West (2015) doi.org/10.2196/jmir.4069]

12 of 80

Correlation between HDI and Wikipedia usage

Lemmerich, F., Sáez-Trumper, D., West, R., & Zia, L. (2019, January). Why the world reads Wikipedia: Beyond English speakers. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (pp. 618-626).

13 of 80

All languages

Geotagged articles across all Wikipedia languages (2 million)

iccl.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

14 of 80

Only 20% of sessions include more than one language.

15 of 80

English Wikipedia

Geotagged articles in English Wikipedia (950,000)

16 of 80

Spanish Wikipedia

Geotagged articles in Spanish Wikipedia (261,000)

17 of 80

Portuguese Wikipedia

Geotagged articles in Portuguese Wikipedia (185,000)

18 of 80

Arabic Wikipedia

Geotagged articles in Arabic Wikipedia (87,000)

19 of 80

Resources

  • Statistics
  • Dumps
  • MediaWiki Utilities
  • MediaWiki API
  • Page Views
  • SQL Replicas / Quarry
  • Clicks
  • Event Stream
  • Wikidata
  • Commons
  • ORES
  • PAWS

20 of 80

21 of 80

22 of 80

23 of 80

24 of 80

Article quality

Manual quality assessments for articles in the English Wikipedia

https://en.wikipedia.org/wiki/Wikipedia:Content_assessment

25 of 80

26 of 80

27 of 80

Reminder

  • Article: An encyclopedia entry.
  • Revision: Specific version of an article.
  • Wikitext: Markup language used to write Wikipedia.
  • Templates: Pages that are embedded (transcluded) into other pages.
  • MediaWiki: The software behind Wikipedia.
  • Namespace: Used for the separation of content.
  • Edition: Usually refers to a specific Wikipedia language edition.

28 of 80

(Try to) not get confused.

Wikipedia Data ≠ Wikidata

Wikipedia Project ≠ WikiProject

29 of 80

How to access content?

  • (XML) Dumps:
    • Historical (full history)
    • Current (latest revision of each article)
  • (MediaWiki) APIs

30 of 80

Quarry / SQL Replicas

Quarry is a public querying interface for Wiki Replicas, a set of live replica SQL databases of public Wikimedia Wikis.

  • https://quarry.wmflabs.org
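Quarry runs plain SQL against replicas of the MediaWiki database schema. The sketch below rebuilds a tiny, invented stand-in for the `page` table in an in-memory SQLite database, just to show the kind of query you would paste into Quarry; the real replicas have far more tables and columns:

```python
import sqlite3

# Invented stand-in for the MediaWiki `page` table exposed by the
# Wiki Replicas (the real table has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, "Valdivia"), (2, 0, "Rome"), (3, 1, "Talk:Rome")],  # made-up rows
)

# The same statement could be pasted into Quarry to count
# articles (namespace 0) on a live replica.
query = "SELECT COUNT(*) FROM page WHERE page_namespace = 0"
n_articles = conn.execute(query).fetchone()[0]
print(n_articles)  # 2
```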

31 of 80

It’s collaborative and searchable

32 of 80

Wikimedia Statistics

Live dashboards about:

  • Editing
  • Reading
  • Users

33 of 80

34 of 80

Page Views

This page documents the Pageview API (v1), a public API developed and maintained by the Wikimedia Foundation that serves analytical data about article pageviews of Wikipedia and its sister projects. With it, you can get pageview trends on specific articles or projects; filter by agent type or access method, among other things.
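A hedged sketch of the per-article endpoint: the path layout below follows the documented v1 REST route, while `pageviews_url` and the default date range are illustrative choices:

```python
def pageviews_url(article, project="en.wikipedia.org",
                  start="20220101", end="20221231"):
    """Build a Pageview API (v1) request for daily per-article counts.

    Path layout: /metrics/pageviews/per-article/{project}/{access}
    /{agent}/{article}/{granularity}/{start}/{end}; "all-access" and
    "all-agents" aggregate over access methods and agent types.
    """
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return f"{base}/{project}/all-access/all-agents/{article}/daily/{start}/{end}"

print(pageviews_url("Valdivia"))
```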

35 of 80

36 of 80

37 of 80

38 of 80

39 of 80

40 of 80

Click dataset

The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on.

Automatic monthly releases of datasets for Wikipedia in English, Russian, German, Spanish and Japanese.
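The released files are tab-separated `(prev, curr, type, n)` rows. The rows below are invented, but they follow that layout, including the `other-search` pseudo-referer used for external search engines:

```python
from collections import Counter

# Invented rows in the Clickstream layout: prev <TAB> curr <TAB> type <TAB> n
sample = (
    "other-search\tRome\texternal\t1500\n"
    "Italy\tRome\tlink\t300\n"
    "Colosseum\tRome\tlink\t120"
)

def top_referers(tsv, article):
    """Count how readers reached `article`, grouped by referer."""
    counts = Counter()
    for line in tsv.splitlines():
        prev, curr, kind, n = line.split("\t")
        if curr == article:
            counts[prev] += int(n)
    return counts

print(top_referers(sample, "Rome").most_common(1))  # [('other-search', 1500)]
```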

41 of 80

42 of 80

ORES

ORES (Objective Revision Evaluation Service) is a web service and API that provides machine learning as a service for Wikimedia projects, maintained by the Scoring Platform team.

  • http://ores.wikimedia.org
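A request for ORES scores can be assembled as below; the `/v3/scores/{context}/` shape matches the public API, while `ores_url` and the chosen models are just an illustration (and the revision ID is a placeholder):

```python
from urllib.parse import urlencode

def ores_url(revid, context="enwiki", models=("damaging", "articlequality")):
    """Build an ORES v3 scores request for one revision on one wiki.

    `context` names the wiki and `models` the classifiers to apply;
    the revision ID passed below is a placeholder.
    """
    qs = urlencode({"models": "|".join(models), "revids": revid})
    return f"https://ores.wikimedia.org/v3/scores/{context}/?{qs}"

print(ores_url(123456))
```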

43 of 80

ORES

44 of 80

Wikimedia Toolforge

Toolforge is a hosting environment for developers working on services that provide value to the Wikimedia movement. These services allow developers to easily do ad hoc analytics, administer bots, run webservices, and generally create tools to help editors and other volunteers in their work. The environment also includes access to a variety of data services.

  • https://tools.wmflabs.org/

45 of 80

PAWS

PAWS (PAWS: A Web Shell) is a Jupyter notebooks deployment that has been customized to make interacting with Wikimedia wikis easier.

46 of 80

Wikimedia Commons

Wikimedia Commons (or simply Commons) is an online repository of free-use images, sound, and other media files.

As of 2018, the repository contains over 44 million free media files, managed and editable by registered volunteers. In July 2013, the number of edits on Commons reached 100,000,000.

47 of 80

Event Stream

EventStreams is a web service that exposes continuous streams of structured event data. It does so over HTTP using chunked transfer encoding following the Server-Sent Events protocol. EventStreams can be consumed directly via HTTP, but is more commonly used via a client library.
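On the wire, each event arrives as an SSE block with `event:` and `data:` lines. The sketch below parses two invented events in that format; a real consumer would read the same kind of lines from an HTTP response to a stream such as https://stream.wikimedia.org/v2/stream/recentchange:

```python
import json

# Two invented events in Server-Sent Events wire format, as
# EventStreams would deliver them over a chunked HTTP response.
raw = (
    "event: message\n"
    'data: {"title": "Rome", "wiki": "enwiki", "type": "edit"}\n'
    "\n"
    "event: message\n"
    'data: {"title": "Valdivia", "wiki": "eswiki", "type": "edit"}\n'
)

def parse_sse(text):
    """Yield the JSON payload of each `data:` line of an SSE stream."""
    for line in text.splitlines():
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

edits = list(parse_sse(raw))
print([e["title"] for e in edits])  # ['Rome', 'Valdivia']
```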

48 of 80

Working Examples

49 of 80

Homework

  • Select one Wikipedia article (A).
  • Considering the current version of A, get the wikitext of the first paragraph.
  • List all the wikilinks in that paragraph using mwparserfromhell.
  • Using the Pageview API, find the most popular pages in A's neighborhood during 2022.
  • Using the MediaWiki API, get the last 20 revisions of A and repeat the procedure above for all those revisions.

Details and tips here.
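For a first pass at the wikilinks step, links can be pulled out with a plain regex before reaching for mwparserfromhell; the `wikilinks` helper and the sample paragraph below are invented for illustration, and the regex ignores the nesting that the library handles properly:

```python
import re

# Invented wikitext fragment; real input would come from the MediaWiki API.
paragraph = "'''Valdivia''' is a city in [[Chile]], on the [[Valdivia River|river]]."

def wikilinks(wikitext):
    """Extract [[target|label]] link targets with a regex: a rough
    stand-in for mwparserfromhell's filter_wikilinks()."""
    return [m.split("|")[0] for m in re.findall(r"\[\[([^\]]+)\]\]", wikitext)]

print(wikilinks(paragraph))  # ['Chile', 'Valdivia River']
```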

50 of 80

Wikidata

Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.

Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others.

In October 2018, there were around 23 million edits in Wikidata, compared with around 4 million in English Wikipedia.

51 of 80

52 of 80

53 of 80

54 of 80

55 of 80

56 of 80

Interact with Wikidata

  • Wikidata UI
  • Wikidata (JSON) Dump
  • Wikibase API
  • SPARQL
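The JSON dump stores, roughly, one entity per line. The entity below is an invented, heavily trimmed example in that shape, just to show how the `id`, `labels`, and `sitelinks` fields hang together:

```python
import json

# One invented, heavily trimmed entity in the shape of a line of the
# Wikidata JSON dump (real entities carry many more fields).
line = json.dumps({
    "id": "Q1339",
    "labels": {"en": {"language": "en", "value": "Johann Sebastian Bach"}},
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Johann Sebastian Bach"}},
})

entity = json.loads(line)
print(entity["id"], entity["labels"]["en"]["value"])
```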

57 of 80

Wikibase API Example

  • Get the Wikidata item from an article title:

https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Valdivia
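The response nests the matched item under `entities`, keyed by its QID. The trimmed response below is invented (Q12345 is a placeholder, not Valdivia's real item), but it follows that shape:

```python
# Invented, trimmed response in the shape returned by wbgetentities
# (Q12345 is a placeholder QID, and the real payload also nests
# labels, claims, sitelinks, and more).
sample_response = {"entities": {"Q12345": {"id": "Q12345", "type": "item"}}, "success": 1}

def qid_from_response(resp):
    """Pull the matched item's QID out of a wbgetentities response."""
    return next(iter(resp["entities"]))

print(qid_from_response(sample_response))  # Q12345
```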

58 of 80

59 of 80

60 of 80

61 of 80

62 of 80

Unstructured vs Structured data

Q1421401

63 of 80

Linked Data (property: subclass of)

64 of 80

Linked Data (property: has part)

65 of 80

Structured Geo-located Data

Q985517

66 of 80

Hierarchies

67 of 80

68 of 80

Mining Wikidata in Practice!

(Slides by Miriam Redi)

69 of 80

Wikidata:

Show me all characters in the Marvel Universe!

70 of 80

Wikidata:

Show me all characters in the Marvel Universe!

  1. Query language called SPARQL
  2. Interface at query.wikidata.org

71 of 80

SPARQL:

A simple Query

SELECT ?a
WHERE {
  ?a property1 value1.
  ?a property2 value2.
}

SELECT = the variables that you want returned (variables start with a question mark).

WHERE = restrictions on them, mostly in the form of triples.

?a = a variable

72 of 80

SPARQL:

Triples

Return all fruits with yellow color and sour taste

1 SUBJECT 2 PROPERTY 3 VALUE

SELECT ?fruit
WHERE {
  ?fruit property:color value:yellow.
  ?fruit property:taste value:sour.
}

73 of 80

SPARQL:

Names of Bach’s children

Semi-Human form:

SELECT ?child
WHERE {
  ?child property:father value:Johann Sebastian Bach.
}

74 of 80

SPARQL:

Names of Bach’s children

Wikidata form:

1: items should be prefixed with wd:, and properties with wdt:

2: find the QID corresponding to the VALUE

3: find the PID corresponding to the PROPERTY

SELECT ?child
WHERE {
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
}

75 of 80

SPARQL:

Names of Bach’s children

Adding back human-readable form ;)

SELECT ?child ?childLabel
WHERE {
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". } # JUST TAKE THIS AS A LEAP OF FAITH ;)
}

76 of 80

SPARQL:

Count Bach’s children

SELECT (COUNT(?child) AS ?childrenCount)
WHERE {
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". } # JUST TAKE THIS AS A LEAP OF FAITH ;)
}

77 of 80

SPARQL:

Names and images of Bach’s children

SELECT ?child ?childLabel ?image
WHERE {
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  ?child wdt:P18 ?image.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

78 of 80

SPARQL:

Visualizing Names and images of Bach’s children

SELECT ?child ?childLabel ?image

#defaultView:ImageGrid

WHERE

{

# ?child father Bach

?child wdt:P22 wd:Q1339.

?child wdt:P18 ?image.

SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }

}

79 of 80

SPARQL:

From question to query :)

FIND THE COUNTRY OF DEATH OF ALL BACH’S CHILDREN

Tips:

  • Use the left panel!
  • CTRL+SPACE to search for properties/items
  • Use the provided examples :)

MORE INFO at: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
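The queries above can also be run outside the web interface: the SPARQL endpoint at query.wikidata.org accepts the query as a GET parameter and can return JSON. `sparql_url` is a helper name introduced here, and the live request is left commented out:

```python
from urllib.parse import urlencode

# The Bach-children query from the slides, ready for the public endpoint.
query = """
SELECT ?child ?childLabel WHERE {
  ?child wdt:P22 wd:Q1339 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def sparql_url(q):
    """Build a GET request against query.wikidata.org asking for JSON."""
    return "https://query.wikidata.org/sparql?" + urlencode({"query": q, "format": "json"})

url = sparql_url(query)
# import json, urllib.request  # uncomment to hit the live endpoint
# rows = json.load(urllib.request.urlopen(url))["results"]["bindings"]
```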

80 of 80

Homework II

  • Modify this query to get all presidential elections in the last 12 years: https://w.wiki/4uUe
  • Choose at least two countries, for example Italy and France.
  • Using the sitelinks of the corresponding Wikidata items, get the number of pageviews for each of the past 3 elections at the time of the election (month). For example, for the 2016 USA Presidential Election, held on Nov 8, get all the pageviews that page received in November 2016 in each Wikipedia.
  • Now, using the data from the 2 oldest elections, develop an ML model to predict the number of pageviews for the last election. For example, use the number of pageviews for the USA Presidential Election in 2016 and 2012 to predict the number of pageviews in 2020. Do this for at least 10 Wikipedia language editions.
  • Share a notebook from PAWS with your results.