Wikipedia beyond the encyclopedic value
Diego Sáez Trumper - May 2022
Sapienza University of Rome
diego@wikimedia.org
I am Diego Sáez Trumper
Agenda
200K
Registered volunteer editors
280 languages
15B monthly
pageviews
10M monthly
edits
48M
articles
Wikimedia Foundation
Research Scientist
Research Scientist
Senior Research Community Officer
Research Manager
Director, Head of Research
Senior Research Engineer
Research Scientist
Learn more: https://research.wikimedia.org
Wikimedia Research Team
...and many formal collaborators
Wikimedia Research
https://research.wikimedia.org
DARIO TARABORELLI /CC0
Minimalist user data collection
Revision histories are public data
No 3rd-party collection or sharing of user data
Private user data removed after 90 days
The world’s most visited online medical resource
[Heilman and West (2015) doi.org/10.2196/jmir.4069]
Correlation between HDI and Wikipedia usage
Lemmerich, F., Sáez-Trumper, D., West, R., & Zia, L. (2019, January). Why the world reads Wikipedia: Beyond English speakers. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (pp. 618-626).
All languages
Geotagged articles across all Wikipedia languages (2 million)
Only 20% of sessions include more than one language.
English Wikipedia
Geotagged articles in English Wikipedia (950,000)
Spanish Wikipedia
Geotagged articles in Spanish Wikipedia (261,000)
Portuguese Wikipedia
Geotagged articles in Portuguese Wikipedia (185,000)
Arabic Wikipedia
Geotagged articles in Arabic Wikipedia (87,000)
Resources
Article quality
Manual quality assessments of articles in English Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:Content_assessment
Reminder
Try not to get confused:
Wikipedia Data ≠ Wikidata
Wikipedia Project ≠ WikiProject
How to access content?
Quarry / SQL Replicas
Quarry is a public querying interface for Wiki Replicas, a set of live replica SQL databases of public Wikimedia wikis.
It’s collaborative and searchable
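To give a feel for the kind of SQL one might paste into Quarry, here is a minimal sketch that runs such a query against an in-memory SQLite stand-in. The toy `page` table only mimics a few columns of the MediaWiki `page` table (`page_id`, `page_namespace`, `page_title`, `page_len`), the rows are made up, and the real replicas are not SQLite, so treat this as an illustration rather than a description of the replicas.

```python
import sqlite3

# The kind of query one could run in Quarry: the longest pages in the
# main (article) namespace, which MediaWiki stores as namespace 0.
LONGEST_ARTICLES_SQL = """
SELECT page_title, page_len
FROM page
WHERE page_namespace = 0
ORDER BY page_len DESC
LIMIT 2
"""


def build_toy_replica():
    """Create an in-memory stand-in for a wiki replica (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        "page_id INTEGER PRIMARY KEY, page_namespace INTEGER, "
        "page_title TEXT, page_len INTEGER)"
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?, ?)",
        [
            (1, 0, "Valdivia", 52000),
            (2, 0, "Rome", 180000),
            (3, 1, "Rome", 9000),  # namespace 1 is a talk page, filtered out
            (4, 0, "Johann_Sebastian_Bach", 140000),
        ],
    )
    return conn


rows = build_toy_replica().execute(LONGEST_ARTICLES_SQL).fetchall()
print(rows)
```

In Quarry itself only the SQL text is needed; the Python wrapper is here just to make the sketch runnable offline.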
Wikimedia Statistics
Live dashboards about:
Page Views
The Pageview API (v1) is a public API developed and maintained by the Wikimedia Foundation that serves analytical data about article pageviews of Wikipedia and its sister projects. With it, you can get pageview trends for specific articles or projects and filter by agent type or access method, among other things.
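A minimal sketch of how a per-article request could be put together and its result summed. The URL layout follows the public v1 `per-article` route as commonly documented (project, access, agent, article, granularity, start, end); the helper names `pageviews_url` and `total_views` and the sample payload are made up for illustration, and no network call is made here.

```python
from urllib.parse import quote

PAGEVIEWS_BASE = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
)


def pageviews_url(article, start, end, project="en.wikipedia",
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; dates are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join(
        [PAGEVIEWS_BASE, project, access, agent, title, granularity, start, end]
    )


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a JSON response."""
    return sum(item["views"] for item in payload.get("items", []))


url = pageviews_url("Valdivia", "20220501", "20220531")
print(url)

# Made-up payload with the same overall shape as a real response.
sample_payload = {"items": [{"views": 10}, {"views": 5}]}
print(total_views(sample_payload))
```

To fetch real data, the URL can be requested with any HTTP client; Wikimedia asks clients to send a descriptive `User-Agent` header.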
Clickstream dataset
The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on.
Automatic monthly releases of datasets for Wikipedia in English, Russian, German, Spanish and Japanese.
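As a sketch of how the (referer, resource) pairs can be used, the snippet below finds the top referers of one article. It assumes the dump's tab-separated layout of four columns (previous page, current page, link type, count); the sample rows and the helper name `top_referers` are made up.

```python
import csv
import io
from collections import Counter


def top_referers(tsv_text, target, k=3):
    """Return the k most common referers of `target` with their counts."""
    counts = Counter()
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed or empty lines
        prev, curr, _link_type, n = row
        if curr == target:
            counts[prev] += int(n)
    return counts.most_common(k)


# Made-up rows in the assumed (prev, curr, type, n) layout.
SAMPLE = (
    "other-search\tValdivia\texternal\t1200\n"
    "Chile\tValdivia\tlink\t300\n"
    "Chile\tSantiago\tlink\t900\n"
    "other-empty\tValdivia\texternal\t450\n"
)

referers = top_referers(SAMPLE, "Valdivia")
print(referers)
```

For a real monthly file, the same function can be fed the decompressed text of the dump instead of `SAMPLE`.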
ORES
ORES (Objective Revision Evaluation Service) is a web service and API, maintained by the Scoring Platform team, that provides machine learning as a service for Wikimedia projects.
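A rough sketch of asking ORES for scores and reading one back. The URL pattern (`/v3/scores/<wiki>/?models=...&revids=...`) and the nested response shape are assumptions based on the v3 API as it was commonly used; the helper names, revision IDs and probabilities below are made up, and nothing is fetched.

```python
from urllib.parse import urlencode


def ores_scores_url(context, rev_ids, models):
    """Build a v3 scores URL for one wiki, e.g. context='enwiki'."""
    query = urlencode({
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return f"https://ores.wikimedia.org/v3/scores/{context}/?{query}"


def probability_true(payload, context, rev_id, model):
    """Read P(true) for one revision and model from an assumed v3 payload."""
    score = payload[context]["scores"][str(rev_id)][model]["score"]
    return score["probability"]["true"]


ores_url = ores_scores_url("enwiki", [123, 456], ["damaging", "goodfaith"])
print(ores_url)

# Made-up payload in the assumed response shape.
sample_scores = {
    "enwiki": {
        "scores": {
            "123": {
                "damaging": {
                    "score": {
                        "prediction": False,
                        "probability": {"false": 0.98, "true": 0.02},
                    }
                }
            }
        }
    }
}
print(probability_true(sample_scores, "enwiki", 123, "damaging"))
```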
Wikimedia Toolforge
Toolforge is a hosting environment for developers working on services that provide value to the Wikimedia movement. These services allow developers to easily do ad hoc analytics, administer bots, run webservices, and generally create tools to help editors and other volunteers in their work. The environment also includes access to a variety of data services.
PAWS
PAWS (PAWS: A Web Shell) is a Jupyter notebooks deployment that has been customized to make interacting with Wikimedia wikis easier.
Wikimedia Commons
Wikimedia Commons (or simply Commons) is an online repository of free-use images, sound, and other media files.
As of 2018, the repository contains over 44 million free media files, managed and editable by registered volunteers. In July 2013, the number of edits on Commons reached 100,000,000.
EventStreams
EventStreams is a web service that exposes continuous streams of structured event data. It does so over HTTP using chunked transfer encoding following the Server-Sent Events protocol. EventStreams can be consumed directly via HTTP, but is more commonly used via a client library.
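To make the Server-Sent Events format concrete, here is a small hand-rolled parser sketch. It handles only the basics of the protocol (field lines, comment lines starting with `:`, and a blank line ending each event). The sample lines are made up, although `wiki`, `title` and `type` mirror fields seen in the `recentchange` stream at https://stream.wikimedia.org/v2/stream/recentchange. In practice a client library would normally do this work.

```python
import json


def iter_sse_events(lines):
    """Yield one dict of fields per Server-Sent Event from text lines."""
    fields = {}
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                fields["data"] = "\n".join(data_parts)
                yield fields
            fields, data_parts = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if name == "data":
            data_parts.append(value)
        else:
            fields[name] = value


# Made-up stream excerpt.
SAMPLE_STREAM = [
    ":ok\n",
    "\n",
    "event: message\n",
    "id: 1\n",
    'data: {"wiki": "enwiki", "title": "Valdivia", "type": "edit"}\n',
    "\n",
]

events = list(iter_sse_events(SAMPLE_STREAM))
titles = [json.loads(e["data"])["title"] for e in events]
print(titles)
```

The same generator can be fed the decoded lines of a streaming HTTP response instead of `SAMPLE_STREAM`.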
Working Examples
http://paws-public.wmflabs.org/paws-public/User:Diego_(WMF)/WikiMediaPublicTools.ipynb
https://github.com/digitalTranshumant/Wiki-examples
Homework
Details and tips here.
Wikidata
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others.
In October 2018, there were 23 M edits in Wikidata, compared with around 4 M edits in English Wikipedia.
Interact with Wikidata
Wikibase API Example
https://www.wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Valdivia
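A minimal sketch of building the request above programmatically and pulling labels out of the JSON answer. The parameters `props`, `languages` and `format` are standard `wbgetentities` options added here for convenience; the helper names and the sample payload (including the placeholder ID `Q0`) are made up, and no request is sent.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wbgetentities_url(site, title, languages=("en",)):
    """Build a wbgetentities URL that looks an item up by its sitelink."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": "labels|descriptions",
        "languages": "|".join(languages),
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def labels_by_qid(payload, language="en"):
    """Map each returned entity ID to its label in the given language."""
    labels = {}
    for qid, entity in payload.get("entities", {}).items():
        label = entity.get("labels", {}).get(language)
        if label is not None:
            labels[qid] = label["value"]
    return labels


entity_url = wbgetentities_url("enwiki", "Valdivia")
print(entity_url)

# Made-up payload; "Q0" is a placeholder, not the real item ID.
sample_entities = {
    "entities": {
        "Q0": {"labels": {"en": {"language": "en", "value": "Valdivia"}}}
    }
}
print(labels_by_qid(sample_entities))
```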
Unstructured vs Structured data
Q1421401
Linked Data (property: subclass of)
Linked Data (property: has part)
Structured Geo-located Data
Q985517
Hierarchies
Mining Wikidata in Practice!
(Slides by Miriam Redi)
Wikidata:
Show me all characters in the Marvel Universe!
SPARQL:
A simple Query
SELECT ?a
WHERE
{
  ?a property1 value1.
  ?a property2 value2.
}
SELECT = variables that you want returned (variables start with a question mark),
WHERE = restrictions on them, mostly in the form of triples.
?a= variable
SPARQL:
Triples
Return all fruits with yellow color and sour taste
1 SUBJECT 2 PROPERTY 3 VALUE
SELECT ?fruit
WHERE
{
  ?fruit property:color value:yellow.
  ?fruit property:taste value:sour.
}
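What the query engine does with those two triple patterns can be sketched in a few lines of Python: find the subjects matching each (property, value) pair and intersect them. The triple store and the helper name `subjects_matching` are made up; a real SPARQL engine is far more general.

```python
# Toy triple store: (subject, property, value).
TRIPLES = {
    ("lemon", "color", "yellow"),
    ("lemon", "taste", "sour"),
    ("banana", "color", "yellow"),
    ("banana", "taste", "sweet"),
    ("lime", "color", "green"),
    ("lime", "taste", "sour"),
}


def subjects_matching(triples, constraints):
    """Return subjects that satisfy every (property, value) constraint."""
    candidates = None
    for prop, value in constraints:
        matching = {s for s, p, v in triples if p == prop and v == value}
        # Each extra pattern in WHERE narrows the set of ?fruit bindings.
        candidates = matching if candidates is None else candidates & matching
    return sorted(candidates or [])


fruits = subjects_matching(TRIPLES, [("color", "yellow"), ("taste", "sour")])
print(fruits)
```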
SPARQL:
Names of Bach’s children
Semi-Human form:
SELECT ?child
WHERE
{
  ?child property:father value:Johann Sebastian Bach.
}
SPARQL:
Names of Bach’s children
Wikidata form:
1: items should be prefixed with wd:, and properties with wdt:
2: find the QID corresponding to the VALUE
3: find the PID corresponding to the PROPERTY
SELECT ?child
WHERE
{
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
}
SPARQL:
Names of Bach’s children
Adding back human-readable form ;)
SELECT ?child ?childLabel
WHERE
{
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". } # JUST TAKE THIS AS A LEAP OF FAITH ;)
}
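Outside the query editor, the same query can be sent to the public endpoint at https://query.wikidata.org/sparql. The sketch below builds the request URL and flattens a response in the standard SPARQL JSON results format. The helper names and the sample payload (with the placeholder entity `Q0`) are made up, and the language is set to "en" on the assumption that `[AUTO_LANGUAGE]` is filled in by the web interface rather than by the endpoint.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

BACH_CHILDREN = """
SELECT ?child ?childLabel
WHERE
{
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def sparql_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def flatten(payload):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    names = payload["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in payload["results"]["bindings"]
    ]


query_url = sparql_url(BACH_CHILDREN)

# Made-up payload in the SPARQL JSON results shape; Q0 is a placeholder.
sample_results = {
    "head": {"vars": ["child", "childLabel"]},
    "results": {
        "bindings": [
            {
                "child": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q0",
                },
                "childLabel": {"type": "literal", "value": "Example child"},
            }
        ]
    },
}
print(flatten(sample_results))
```

When fetching `query_url` for real, it is good practice to send a descriptive `User-Agent` header.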
SPARQL:
Count Bach’s children
SELECT (COUNT(?child) as ?childrenCount)
WHERE
{
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". } # JUST TAKE THIS AS A LEAP OF FAITH ;)
}
SPARQL:
Names and images of Bach’s children
SELECT ?child ?childLabel ?image
WHERE
{
  # ?child father Bach
  ?child wdt:P22 wd:Q1339.
  ?child wdt:P18 ?image.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
SPARQL:
Visualizing Names and images of Bach’s children
SELECT ?child ?childLabel ?image
#defaultView:ImageGrid
WHERE
{
# ?child father Bach
?child wdt:P22 wd:Q1339.
?child wdt:P18 ?image.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
SPARQL:
From question to query :)
FIND THE COUNTRY OF DEATH OF ALL BACH’S CHILDREN
TIPS:
USE THE LEFT PANEL!
CTRL+SPACE to search for properties/items
USE EXAMPLES PROVIDED :)
MORE INFO at: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
Homework II