A First Date with Wikidata: Structuring Wikipedia and Beyond
Andrew Lih - Wikimedia District of Columbia
Author and digital strategist
Museum Computer Network
November 14, 2018
#Wikidata
Licensed via CC-BY-SA 4.0
andrew@ANDREWLIH.com
@fuzheado
Andrew Lih
Wikidata-vangelist
Digital strategist, communications professor
Author of The Wikipedia Revolution; Contributor to Leveraging Wikipedia by ALA
Wikimedia District of Columbia (USA)
GLAM
Galleries, Libraries, Archives and Museums - Cultural Partners
Wikimedia DC
Library of Congress
US NARA
Smithsonian
Local chapter for Wikipedia / Wikimedia community in USA – Washington, D.C.
Wikipedia-oriented meetups and workshops
Full-time Wikipedian in Residence
Research facility hosts Wikipedia exhibit and conferences
Meetups, article improvement drives, Wikidata training
Joint projects funded by Knight Foundation
GLAM + Wikipedia/Wikimedia
2010 - British Museum hosted first Wikipedian in Residence (WiR)
2013 - GLAM Wiki Toolset, mass uploading to Wikimedia Commons
GLAM Wiki US Consortium, hosted by US National Archives, full time WiR
2014 - Full time GLAM Wiki coordinator at Europeana
2017 - Wikidata work with Smithsonian, National Archives, LC, OCLC
Wikidata: The evolution of Wikipedia into the ultimate, free linked open database
Wikidata in one page
Basic Wikidata info and tools in one PDF page.
Clickable live hyperlinks.
Available in 10 languages
2017: A Wikidata turning point
A hub for the future of Wikimedia, popular and scholarly content
Overview
Why Wikidata?
Design of Wikidata
Features, RDF, triples
Queries and tools
Case studies
Calls to action
Wikipedia
today
More than 5.7 million English articles
Top 10 most visited site
High reputation and cultural partnerships
Tate outsources artist biographies on its website to Wikipedia
The museum does “not have the resources to create biographies for every individual” in its collection, spokeswoman says (10 September 2018)
https://www.theartnewspaper.com/news/tate-uses-wikipedia-entries-on-artists-for-its-website
A Tate spokesperson says that it is “working on a partnership with Wikipedia to ensure the biographies for artists in our collection are as accurate as possible”.
Fake news and social media
Facebook tries fighting fake news with publisher info button on links (October 5, 2017)
https://techcrunch.com/2017/10/05/facebook-article-information-button/
Facebook thinks showing Wikipedia entries about publishers and additional Related Articles will give users more context about the links they see. So today it’s beginning a test of a new “i” button on News Feed links that opens up an informational panel.
Fake news and social media
Fascinating to think...
2001 (wiki = fast/loose)
2018 (wiki = stable/reliable)
Wikipedia
challenges
Knowledge scattered among 30 million articles in 200+ languages
Inconsistency, gaps and replication of info
How to consolidate knowable facts?
Lesson: Consolidating images and multimedia
2001: Images scattered across Wikipedia editions
2004: Wikimedia Commons created to centralize and consolidate multimedia
Wikidata
as the future
Convert encyclopedic lexical content into "structured" statements
Turn human readable into machine understandable
Link to stable external data of LAM institutions
"Semantic web" realized
Facts and figures from articles, infoboxes are only in human-readable prose
Navigation boxes at bottom of Wikipedia articles done by hand
Wikidata
capabilities
Launched in 2012
Provides the power of searching, sorting and querying
Modular, interconnected mesh of all human knowledge
Fundamentals: Statements
Factual claims are stored as statements in Wikidata
subject - predicate - object
or
item - property - value
or
thing - relationship - thing
Moving to structured data
United States Congress
instance of
bicameral legislature
United States Congress
in country
United States
Lexical
Unstructured
Semantic
Structured
The United States Congress is the bicameral legislature of the Federal government of the United States.
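One way to picture the move from lexical to semantic is to write the sentence above as data. The Python below is only a sketch for illustration, not how Wikidata is built; the `Statement` type and the `about` helper are made-up names.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One claim in item - property - value form."""
    item: str
    prop: str
    value: str


# The sentence about Congress, restated as two structured statements.
statements = [
    Statement("United States Congress", "instance of", "bicameral legislature"),
    Statement("United States Congress", "in country", "United States"),
]


def about(item: str, data: list[Statement]) -> dict[str, str]:
    """Collect every property/value pair recorded for one item."""
    return {s.prop: s.value for s in data if s.item == item}


print(about("United States Congress", statements))
```

Once the facts are separate statements, a program can filter or sort on any one of them without parsing prose.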
Wikidata items and
Q numbers
Q numbers - unique identifiers for items in Wikidata
Each item should have:
Wikidata item for United States Congress (Q11268)
Q numbers - unique identifiers in Wikidata
In a web browser, mobile or laptop, go to wikidata.org
Hit search icon. Start typing "United States Congress." Notice how many matches appear.
Select the item with the US seal.
Look at Statements and Identifiers
Wikidata Basics
Q numbers - item/object
P numbers - property/relationship
"Statement triple": item - property - value
Wikidata item page
Claims capture factual, provable information
Any number of statements can be associated with an item
Wikidata statement triples
Underneath the surface...
Using symbols makes them language independent (identifiers vs names)
Item "George Washington" (Q23) - Property "instance of" (P31) - Value "human" (Q5)
Item "George Washington" (Q23) - Property "place of burial" (P119) - Value "Mount Vernon" (Q731635)
Item "George Washington" (Q23) - Property "LCAuth ID" (P244) - Value "n86140996"
Wikidata and languages
Labels and descriptions for different languages
Exciting for translation applications
Q23 - P31 - Q5
en: "George Washington" - "instance of" - "human"
de: "George Washington" - "ist ein(e)" - "Mensch"
es: "George Washington" - "instancia de" - "ser humano"
ms: "George Washington" - "contoh" - "manusia"
zh: 乔治·华盛顿 - 性質 - 人類
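A rough sketch of why identifiers make statements language independent: the triple is stored once as Q23 - P31 - Q5, and labels are looked up only when it is displayed. The label values come from the slide; the `LABELS` table and the `render` helper are illustrative names, not part of any Wikidata library.

```python
# Labels keyed by identifier, then by language code (values from the slide).
LABELS = {
    "Q23": {"en": "George Washington", "de": "George Washington", "es": "George Washington"},
    "P31": {"en": "instance of", "de": "ist ein(e)", "es": "instancia de"},
    "Q5": {"en": "human", "de": "Mensch", "es": "ser humano"},
}

# The statement itself is stored once, using identifiers only.
TRIPLE = ("Q23", "P31", "Q5")


def render(triple, lang, fallback="en"):
    """Return the triple as readable text in the requested language."""
    parts = []
    for ident in triple:
        names = LABELS[ident]
        # Fall back to English when no label exists for the language.
        parts.append(names.get(lang, names[fallback]))
    return " - ".join(parts)


for lang in ("en", "de", "es"):
    print(lang, render(TRIPLE, lang))
```

Adding a language means adding labels, never touching the statement.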
Sidebar:
Museu Nacional fire in Brazil
Wikimedia community in Brazil and overseas helping to digitally reconstruct rooms/collections via Wikidata
Fire at Museu Nacional in Brazil
What can we do when a museum burns down?
Full presentation at GLAM WIKI 2018 conference in Tel Aviv
Felipe de Souza Beck/Commons
Lessons learned
Wikidata supports cross-cultural/language collaboration, independent of language.
Example: TABernacle lists [link]
English user knowing no Portuguese can help catalog and describe artifacts
Felipe de Souza Beck/Commons
Impact and learning
Alfonso Gómez Paiva/Commons
Wikidata link on Wikipedia articles - hidden in plain sight
Wikidata stores statements as explicit triples - item + property + value
Item "United States Congress" (Q11268) - Property "instance of" (P31) - Value "bicameral legislature" (Q189445)
Wikidata statement triples
Relationships are "first class" = very fast to search and sort
Seconds vs minutes to search
Ad hoc data model highly adaptive
Well-suited to the wiki way
Traditional databases
Schemas well-defined and controlled
Relational databases and SQL: Columns need lots of planning and forethought
Changes can be complex, with many cascading effects
Searches involving relationships can be slow or expensive (join operations)
Artist | Date of birth | Country | Medium |
Henri Matisse | December 31, 1869 | France | Painting |
Claude Monet | November 14, 1840 | France | Painting |
Edward Hopper | July 22, 1882 | United States | Painting |
Work | Creator | Date | Location |
Les Bêtes De La Mer | Henri Matisse | 1950 | NGA |
Cape Cod Morning | Edward Hopper | 1950 | SAAM |
Nighthawks | Edward Hopper | 1942 | Art Inst Chicago |
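To make the join cost concrete, here is a minimal sketch that loads the two tables above into an in-memory SQLite database using Python's standard `sqlite3` module. The table and column names are assumptions chosen for this example. Answering "which works are by artists from a given country?" requires joining the two tables on the artist's name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema must be planned up front: every column is declared before any data.
conn.executescript("""
    CREATE TABLE artist (name TEXT PRIMARY KEY, date_of_birth TEXT, country TEXT, medium TEXT);
    CREATE TABLE work (title TEXT, creator TEXT, year INTEGER, location TEXT);
""")
conn.executemany("INSERT INTO artist VALUES (?, ?, ?, ?)", [
    ("Henri Matisse", "December 31, 1869", "France", "Painting"),
    ("Claude Monet", "November 14, 1840", "France", "Painting"),
    ("Edward Hopper", "July 22, 1882", "United States", "Painting"),
])
conn.executemany("INSERT INTO work VALUES (?, ?, ?, ?)", [
    ("Les Bêtes De La Mer", "Henri Matisse", 1950, "NGA"),
    ("Cape Cod Morning", "Edward Hopper", 1950, "SAAM"),
    ("Nighthawks", "Edward Hopper", 1942, "Art Inst Chicago"),
])


def works_by_country(country):
    """List work titles whose creator's country matches (needs a join)."""
    rows = conn.execute(
        "SELECT work.title FROM work "
        "JOIN artist ON work.creator = artist.name "
        "WHERE artist.country = ? ORDER BY work.title",
        (country,),
    )
    return [title for (title,) in rows]


print(works_by_country("United States"))
```

Recording a new kind of fact, such as an artist's teacher, would mean altering the schema and possibly every query that touches it.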
Wikidata and RDF databases
Relationships are explicit and precise
Database can take any shape and grow according to need
Also known as "graph databases"
[Diagram: a graph linking the nodes Edward Hopper, July 22, 1882, United States, Nighthawks, Cape Cod Morning, painting and creative work through edges labeled date of birth, citizen of, creator, instance of and subclass of]
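The graph on this slide can be approximated as a plain list of triples. This is a sketch only: the pairing of nodes and edge labels is reconstructed from the slide's labels and is an assumption, and the helper names are invented for the example.

```python
# Edges reconstructed from the slide's labels; the exact pairing is an assumption.
EDGES = [
    ("Edward Hopper", "date of birth", "July 22, 1882"),
    ("Edward Hopper", "citizen of", "United States"),
    ("Nighthawks", "creator", "Edward Hopper"),
    ("Cape Cod Morning", "creator", "Edward Hopper"),
    ("Nighthawks", "instance of", "painting"),
    ("Cape Cod Morning", "instance of", "painting"),
    ("painting", "subclass of", "creative work"),
]


def objects(subject, predicate):
    """Values reached from a subject along one kind of edge."""
    return [o for s, p, o in EDGES if s == subject and p == predicate]


def subjects(predicate, obj):
    """Subjects that point at a value along one kind of edge."""
    return [s for s, p, o in EDGES if p == predicate and o == obj]


def is_a(item, target):
    """Follow 'instance of' once, then 'subclass of' any number of times."""
    frontier = objects(item, "instance of")
    seen = set()
    while frontier:
        cls = frontier.pop()
        if cls == target:
            return True
        if cls not in seen:
            seen.add(cls)
            frontier.extend(objects(cls, "subclass of"))
    return False


# A new kind of fact needs no schema change: append another edge.
EDGES.append(("Nighthawks", "location", "Art Inst Chicago"))

print(subjects("creator", "Edward Hopper"))
print(is_a("Nighthawks", "creative work"))
```

The relationship is the unit of storage, so following "creator" or "subclass of" is a direct lookup rather than a join across tables.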
Summary
UPSIDES
RDF triples make for a very flexible and fast system
Suitable for the BEBOLD wiki culture
Multiple parallel ontologies can co-exist
DOWNSIDES
Schema-on-the-fly system can make modeling inconsistent and difficult
Hard for newcomers to understand
Multiple parallel ontologies can co-exist
Wikidata items
Using identifiers removes language dependence and ambiguity in:
Writing systems (Chinese, Serbian, Kazakh, et al)
Phonetization variations
Spelling variations
Maiden vs. married names
Canonical identifiers help link to external databases
Item: Muammar Gaddafi (Q19878)
Muammar Gaddafi
Muammar Muhammad Abu Minyar al-Gaddafi
Colonel Gaddafi
Kadhafi
Mu‘ammar al Qaḏḏāfi
Moammar Al Qadhafi
Qaḏḏāfi
Gadafi
Kadaffi
Al-Khadafy
Gadaffi
Kaddafi
Muammar al–Gaddafi
Jaddafi
Qaddafy
Muammar Gaddafi
Muhamar Gadaffi
Mu‘ammar al-Qaḏḏafī
Al-Qadhdhaafi
Gadhafi
Qaḏḏafi
Qaḏḏāfī
Muammar Muhammad Abu Minyar al-Gaddafi
Khadafi
Mu‘ammar al-Qaḏḏāfi
Gaddafi
Muammar el Gadafi
Muamar al-Gaddafi
Muamar al Gaddafi
Mu‘ammar al Qaḏḏafi
Kadafi
Omar Gadafi
Kaddaffi
Moammar Jaddafi
Muamar Gadafi
Muamar el-Gadafi
Mu‘ammar al Qaḏḏāfī
Al-Qathafi
Mu‘ammar al-Qaḏḏafi
Muamar al Gadafi
Moammar Gadafi
Muammar al-Gaddafi
Muhammad Ghadaffi
Muammar el Gaddafi
Muamar al Gaddafhi
Mu‘ammar al-Qaḏḏāfī
Mu‘ammar al Qaḏḏafī
Khaddafi
Muammar al Gaddafi
Qaḏḏafī
El Kazzafi
Muhamad Gadafi
Muamar al-Gaddafhi
53 Latinized variations!
(May 2017)
Speed, consistency, automation
Wikidata has more than 52 million items
Simple searches take less than a second
Complex queries supported by open standards like SPARQL
Search example - Find all bicameral legislatures
Item ? - Property "instance of" (P31) - Value "bicameral legislature" (Q189445)
Wikidata Search - Result from Query
52 million items in less than a second
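The search pattern above (unknown item, P31, Q189445) maps almost word for word onto a SPARQL query. The sketch below only builds the query text; `instances_of` is a hypothetical helper, and the `wd:` / `wdt:` prefixes and the label service are assumed to be the ones predefined at query.wikidata.org.

```python
def instances_of(class_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query listing items whose "instance of" (P31) is class_qid."""
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Q number: {class_qid!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        # The unknown item is a variable; property and value are fixed.
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        # Ask the query service to supply English labels for the results.
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


print(instances_of("Q189445"))
```

The printed text can be pasted into the query editor at query.wikidata.org.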
Hands On:
Editing Wikidata items
In a web browser, mobile or laptop, go to wikidata.org
Hit search icon.
Look for the most local village/township/neighborhood you call home (i.e., not New York City, but Greenwich Village)
How accurate is it? Can you help improve it?
Anonymous editing allowed
Editing Wikidata
Just as with Wikipedia:
Edit button
+add button
Wikidata items have identifiers - VIAF links back
Properties: Identifiers
Indexes into other databases
Authority control
Accession numbers
Catalog identifiers
Stable URLs to other sites
Wikidata items have identifiers - links to external databases
Barack Obama (Q76) has more than 83 identifiers!
Some prominent identifiers - links to external databases
WorldCat
VIAF
LC Name Authority File
ISNI
GND (Integrated Authority File)
SUDOC (French universities)
BnF (Bibliothèque nationale de France)
MusicBrainz
Bio Directory of Congress
Quora topic ID
C-SPAN person ID
Freebase
NDLAuth ID (National Diet Library of Japan)
SELIBR (National Library of Sweden Libris)
NLA (Australia) ID
NKCR Czech National Authority Database (National Library of Czech Republic)
RSL ID (person) Russian State Library
IMDb
Dutch National Thesaurus for Author names
Declarator.org - Russian non-governmental database with information on the income of government officials
NUKAT - Center of Warsaw University Library catalog
CiNii (Scholarly and Academic Information Navigator) Japan
NNDB people ID - Notable Names Database
Politifact
Encyclopedia Britannica ID
CONOR ID (Slovenia)
NYT topic ID
Guardian topic ID
Parlement & Politiek ID (Dutch politics site)
Social Networks and Archival Context ID (SNAC)
NARA
California Digital Library
University of Virginia
University of California, Berkeley
Smithsonian related properties
OCLC related properties
Wikidata as database of databases
As meta-database, Wikidata doesn't need to hold all the data but can act as a hub for federated searches - search across multiple databases
Wikidata has more than 44 SPARQL endpoints in federation: Europeana, Smithsonian AAM, Getty, Yale Center for British Art, British Museum, et al.
Scenario: Heritage months
Smithsonian American Art Museum database does not record gender or ethnicity of artists.
For Black History Month and Women's History Month this posed a problem.
Sum of all databases - Wikidata and federated search can solve these issues
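A toy sketch of the heritage-month idea, assuming the museum's records carry a link to each artist's Wikidata item. Every record and artist key below is a made-up placeholder, not real collection data; in practice the lookup would come from a SPARQL query or a federated search rather than a hard-coded dictionary. Q6581072 is the Wikidata item for "female".

```python
FEMALE = "Q6581072"  # Wikidata item for "female"

# Placeholder museum records: note that there is no gender field at all.
museum_records = [
    {"title": "Artwork 1", "artist": "ARTIST-A"},
    {"title": "Artwork 2", "artist": "ARTIST-B"},
    {"title": "Artwork 3", "artist": "ARTIST-A"},
]

# Placeholder stand-in for what Wikidata knows about each linked artist.
gender_by_artist = {
    "ARTIST-A": FEMALE,
    "ARTIST-B": "Q6581097",  # Wikidata item for "male"
}


def works_by_gender(records, genders, gender_qid):
    """Titles of works whose linked artist has the given gender value."""
    return [r["title"] for r in records if genders.get(r["artist"]) == gender_qid]


print(works_by_gender(museum_records, gender_by_artist, FEMALE))
```

The museum database stays as it is; the missing attribute is supplied through the shared identifier.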
Consistency and bounds checking
Constraint reports/violations provide warnings on logic and bounds
Searching and displaying Wikidata
Querying tools
Presentation
Visualization
APIs and endpoints
Basic search with Wikidata
SPARQL endpoint at query.wikidata.org
Superficially similar to SQL
One of the busiest endpoints on the Internet
Basic search with Wikidata
SPARQL is an open standard
Try the "Examples" button for lots of interesting searches
Hint: Don't write queries from scratch. Modify existing ones!
Use auto-complete with CTRL-SPACE (Beware Mac users!)
Hands On:
Run your first Wikidata Query
Visit query.wikidata.org
Or directly click: http://bit.ly/wikidata-catquery
Hit blue triangle button to run!
Try horses (Q726)
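The same kind of query can also be sent from a script instead of the web editor. This is a minimal sketch using only Python's standard library. The endpoint path `/sparql`, the `format=json` parameter and the User-Agent string are assumptions about how the public query service is commonly called. The `SAMPLE` response is hand-written with a placeholder row so the parsing step can be tried without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

# Horses (Q726) instead of cats, as suggested on the slide.
HORSES = (
    "SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q726 . "
    'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } } LIMIT 10'
)


def request_url(query: str) -> str:
    """Encode the query into a GET URL that asks for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def labels(result: dict, var: str = "itemLabel") -> list[str]:
    """Pull one variable out of a SPARQL JSON results document."""
    return [row[var]["value"] for row in result["results"]["bindings"] if var in row]


def run(query: str) -> dict:
    """Send the query over the network (defined here, but not called)."""
    req = Request(request_url(query), headers={"User-Agent": "wikidata-first-date-example/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)


# Hand-written stand-in for a response; the row is a placeholder, not real data.
SAMPLE = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
         "itemLabel": {"type": "literal", "value": "placeholder horse"}},
    ]},
}

print(request_url(HORSES))
print(labels(SAMPLE))
```

Calling `labels(run(HORSES))` would fetch live results, subject to the service's usage limits.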
Advanced search with Wikidata
Statistics, graphs, maps via Wikidata
Discover stories in the data:
Example: Where have members of Congress been educated?
Example: Education of Congress
List all members of Congress who have ever served and examine where they have been educated
?moc wdt:P31 wd:Q5 . #"instances of" humans
?moc wdt:P1157 ?lcbioid . #LC "Congress Bio ID"
?moc wdt:P69 ?school . #grab "educated at"
COUNT the occurrences of each school
ORDER them from highest to lowest
LIMIT it to the top 15 results
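Put together, the pieces above might look like the query below. This is a reconstruction under assumptions, not necessarily the exact query used in the talk: the three triple patterns come from the slide, while the variable names `?members` and `?schoolLabel`, the GROUP BY clause and the label service are filled in. The small `top_schools` function shows the same COUNT, ORDER and LIMIT steps on local (member, school) pairs.

```python
from collections import Counter

EDUCATION_OF_CONGRESS = """
SELECT ?school ?schoolLabel (COUNT(?moc) AS ?members) WHERE {
  ?moc wdt:P31 wd:Q5 .        # "instances of" humans
  ?moc wdt:P1157 ?lcbioid .   # LC "Congress Bio ID"
  ?moc wdt:P69 ?school .      # grab "educated at"
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?school ?schoolLabel
ORDER BY DESC(?members)
LIMIT 15
""".strip()


def top_schools(pairs, n=15):
    """COUNT each school, ORDER from highest to lowest, LIMIT to n results."""
    return Counter(school for _member, school in pairs).most_common(n)


# Placeholder pairs, only to show the shape of the aggregation.
demo = [("member-1", "School A"), ("member-2", "School A"), ("member-3", "School B")]

print(EDUCATION_OF_CONGRESS)
print(top_schools(demo, n=1))
```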
Education of Congress
Run time: about 15 seconds
Results can be shown in multiple ways
Tables
Maps
Charts
Timelines
Education of Congress
Harvard, Yale and Princeton (H-Y-P) dominate
University of Michigan very prominent
Some surprises - Union College?
Union College
In 1800, the U.S. Big Four colleges:
Harvard, Yale, Princeton, Union (!)
Union College lost ground amid a financial scandal and Civil War attrition (1861-1865)
Deeper query: members of Congress educated at Union - table mode, date of birth
Members of Congress educated at Union - Timeline mode, Civil War (1861-1865)
Union College
Schenectady, NY
Columbia University
New York, NY
Impact of Wikidata
Google closed their own Freebase project in 2016, in favor of backing Wikidata
Google search results and Knowledge Graph use Wikidata
Schema.org has endorsed using Wikidata
Interesting Wikidata tools
Wikidata Query
Wikidata Graph Builder
Monumental
Reasonator
Vizquery
Gender Gap Tool
Wikidata Query: SPARQL endpoint
Try example queries
Reasonator: Nicely formatted Wikidata pages
SQID: Browsing Wikidata entries and linkages
Wikidata Distributed Game: click to contribute
Quickstatements: Bulk upload tables
https://tools.wmflabs.org/wikidata-todo/quick_statements.php
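As a rough idea of what a bulk upload looks like, the sketch below turns rows of (item, property, value) into tab-separated lines, one statement per line, with plain strings in double quotes. This follows the tool's simple text syntax as generally described; treat the exact format as an assumption and check the tool's own help page before uploading. The rows reuse the George Washington statements from earlier slides, and `to_quickstatements` is an illustrative name.

```python
def to_quickstatements(rows):
    """Render (item, property, value) rows as tab-separated command lines."""
    lines = []
    for item, prop, value in rows:
        # Q and P identifiers are written bare; other values are quoted strings.
        # (A literal string that happens to look like "Q5" would need special care.)
        is_entity = value[:1] in ("Q", "P") and value[1:].isdigit()
        rendered = value if is_entity else f'"{value}"'
        lines.append("\t".join((item, prop, rendered)))
    return "\n".join(lines)


rows = [
    ("Q23", "P31", "Q5"),          # George Washington - instance of - human
    ("Q23", "P119", "Q731635"),    # place of burial - Mount Vernon
    ("Q23", "P244", "n86140996"),  # LCAuth ID, a plain string
]

print(to_quickstatements(rows))
```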
Wikidata Graph Builder: Visualizing relationships
Query: Washington DC museums, metadata
Consider this query to find all museums in DC...
Query: Washington DC museums, metadata
Alternative, within 100 km of Washington DC city center:
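The two queries on these slides were shown as screenshots. A rough reconstruction is sketched below as Python strings. The identifiers are assumptions rather than values taken from the slides: Q61 for Washington, D.C., Q33506 for museum, P131 for "located in the administrative territorial entity", P625 for "coordinate location", and the `wikibase:around` service offered by the Wikidata query service.

```python
MUSEUMS_IN_DC = """
SELECT ?museum ?museumLabel WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 .   # instance of museum, or of a subclass
  ?museum wdt:P131 wd:Q61 .               # located in Washington, D.C.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
""".strip()


def museums_near_dc(radius_km: int = 100) -> str:
    """Build the radius-based variant; braces are doubled for the f-string."""
    return f"""
SELECT ?museum ?museumLabel ?coord WHERE {{
  wd:Q61 wdt:P625 ?center .               # coordinates of the city center
  SERVICE wikibase:around {{
    ?museum wdt:P625 ?coord .
    bd:serviceParam wikibase:center ?center .
    bd:serviceParam wikibase:radius "{radius_km}" .   # kilometres
  }}
  ?museum wdt:P31/wdt:P279* wd:Q33506 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


print(MUSEUMS_IN_DC)
print(museums_near_dc())
```

The first version depends on each museum having a "located in" statement; the second depends only on coordinates, so it also catches museums just outside the city boundary.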
Query: Washington DC museums, results
Raw table results from query...
Wikidata helps make RDF more accessible
Wikidata: making RDF and semantic web more accessible/human
Generic SPARQL query - scary
Wikidata query - user friendly
Vizquery: Simple Wikidata item selection
Much simpler way to do queries
Advanced Wikidata tools
Scholia - citations/authors of scholarly articles and journals
Bulk uploading - Quickstatements, Petscan
Wikidata Game, Distributed Game - contributing by clicking
Alternative contribution methods
Instead of individual item pages...
Task lists, games and other interfaces contribute to Wikidata
Wikidata Games: "One-click" contributions based on task lists
Notable external databases
Art and museum databases, thesauri, dictionaries, encyclopedias, national and academic libraries
Internet-based databases - IMDb, MusicBrainz, Quora
Mix'n'Match: a one-click "game" interface to help match external data to Wikidata
https://tools.wmflabs.org/mix-n-match
Wikidata Mix-n-match
Wikidata Game: Identifier match for SAAM
Getty AAT
Query: Washington DC museums, multiple views
Sum of all Paintings
Project: Wiki Art Depiction Explorer
Knight Prototype Fund 2018 - Arts and Tech
Rob Fernandez, WMDC
Effie Kapsalis, Smithsonian
Andrew Lih, WMDC
Background
Wikipedia and rich multimedia metadata
1
One of the toughest problems: metadata is typically created for art scholars.
Public-friendly metadata would be useful for discovery and re-use.
What is depicted in an artwork?
How do you consistently convey objects, themes, or concepts?
How can we provide the public with predictable points-of-entry to discover artwork?
While making it language independent and machine searchable?
This has been one of the grand challenges in the cultural and heritage sector.
Vision and machine algorithms are limited in what they can do
Wikipedia has shown that volunteer contributors can solve these types of problems at scale
Wiki Art Depiction Explorer (WADE)
Using Wikipedia, Wikimedia Commons and Wikidata for art metadata
2
The Wiki Art Depiction Explorer will help enrich semantic data about artworks in Wikidata, making it free for the world
Every artwork described in Wikidata can have a depicts property, pointing to any number of terms
Wiki editors are already adding depicts statements to artworks
WikiProject Sum of All Paintings
Goal – richly describe more than 300,000 notable paintings in Wikidata/Wikipedia
Adding depicts properties now is haphazard, unguided and unassisted
Terms differ among individuals, institutions, and fields of study
What if there were a way to visualize the state of metadata for groups of artworks and provide direction for collaborators?
How might a context-driven metadata interface support effective and consistent crowdsourced results for describing art?
Approach
Established experience, leadership, and audience
3
What "vectors" can we use to suggest good depicts terms?
Goal: Create a tool to engage the art-loving public in enhancing cultural heritage metadata for discovery and creativity
Most depicted items in all paintings in Wikidata
European, Christian art themes prominent
Previous depicts statements for Edward Hopper works
Find us, talk to us!
Effie Kapsalis, SI @digitaleffie
Andrew Lih, WMDC @fuzheado
Caveats
Wikidata still an early work in progress
Many areas well-modeled
Many areas quite bare (items with no statements)
Instances vs subclasses
Issues
Shifting to Wikidata:
The future is structured through Wikidata
Wikidata: Internet duct tape
Research, academic hub
CC0 - no copyright
Join top cultural and commercial institutions already working with Wikidata
Ask questions!
Wikidata in one page
Actionable steps
@fuzheado
andrew.LIH@gmail.com
andrew@ANDREWLIH.com
Try editing Wikidata items
Try modifying SPARQL queries
Mix'n'match - Connect your data set or help match terms
Meetup - Contact us, not just DC or NY
Install Wikibase instance - engine that runs Wikidata - http://wikiba.se/
Thank you! Discussion - Q&A
FOLLOWUPS?