1 of 125

A First Date with Wikidata: Structuring Wikipedia and Beyond

Andrew Lih - Wikimedia District of Columbia

Author and digital strategist

Museum Computer Network

November 14, 2018

#Wikidata

Licensed via CC-BY-SA 4.0

andrew@ANDREWLIH.com

@fuzheado

2 of 125

Andrew Lih

Wikidata-vangelist

Digital strategist, communications professor

Author of The Wikipedia Revolution; Contributor to Leveraging Wikipedia by ALA

Wikimedia District of Columbia (USA)

3 of 125

GLAM

Galleries, Libraries, Archives and Museums - Cultural Partners

Wikimedia DC

Library of Congress

US NARA

Smithsonian

Local chapter for Wikipedia / Wikimedia community in USA – Washington, D.C.

Wikipedia-oriented meetups and workshops

Full-time Wikipedian in Residence

Research facility hosts Wikipedia exhibit and conferences

Meetups, article improvement drives, Wikidata training

Joint projects funded by Knight Foundation

4 of 125

GLAM + Wikipedia/Wikimedia

2010 - British Museum hosted first Wikipedian in Residence (WiR)

2013 - GLAM Wiki Toolset, mass uploading to Wikimedia Commons

GLAM Wiki US Consortium, hosted by US National Archives, full time WiR

2014 - Full time GLAM Wiki coordinator at Europeana

2017 - Wikidata work with Smithsonian, National Archives, LC, OCLC

5 of 125

Wikidata: The evolution of Wikipedia into the ultimate, free linked open database

6 of 125

Wikidata in one page

Basic Wikidata info and tools in one PDF page.

Clickable live hyperlinks.

Available in 10 languages

7 of 125

2017: A Wikidata turning point

  • Wikidata used by
    • Google Knowledge Graph
    • Digital assistants: Siri, Alexa
  • Infoboxes on Wikipedia
  • Wikicite, WikidataCon conferences
  • Structured Data on Wikimedia Commons

A hub for the future of Wikimedia, popular and scholarly content

8 of 125

Overview

Why Wikidata?

Design of Wikidata

Features, RDF, triples

Queries and tools

Case studies

Calls to action

9 of 125

Wikipedia

today

More than 5.7 million English articles

Top 10 most visited site

High reputation and cultural partnerships

10 of 125

Tate outsources artist biographies on its website to Wikipedia

The museum does “not have the resources to create biographies for every individual” in its collection, spokeswoman says (10 September 2018)

https://www.theartnewspaper.com/news/tate-uses-wikipedia-entries-on-artists-for-its-website

A Tate spokesperson says that it is “working on a partnership with Wikipedia to ensure the biographies for artists in our collection are as accurate as possible”.

11 of 125

Fake news and social media

Facebook tries fighting fake news with publisher info button on links (October 5, 2017)

https://techcrunch.com/2017/10/05/facebook-article-information-button/

Facebook thinks showing Wikipedia entries about publishers and additional Related Articles will give users more context about the links they see. So today it’s beginning a test of a new “i” button on News Feed links that opens up an informational panel.

12 of 125

Fake news and social media

13 of 125

Fascinating to think...

2001 (wiki=fast/loose)

2018 (wiki=stable/reliable)

14 of 125

Wikipedia

challenges

Knowledge scattered among 30 million articles in 200+ languages

Inconsistency, gaps and replication of info

How to consolidate knowable facts?

15 of 125

Lesson: Consolidating images and multimedia

2001: Images scattered across Wikipedia editions

2004: Wikimedia Commons created to centralize and consolidate multimedia

commons.wikimedia.org

16 of 125

Wikidata

as the future

Convert encyclopedic lexical content into "structured" statements

Turn human readable into machine understandable

Link to stable external data of LAM institutions

"Semantic web" realized

17 of 125

Facts and figures from articles, infoboxes are only in human-readable prose

18 of 125

Navigation boxes at bottom of Wikipedia articles done by hand

19 of 125

Wikidata

capabilities

Launched in 2012

Provides the power of searching, sorting and querying

Modular, interconnected mesh of all human knowledge

20 of 125

Fundamentals: Statements

Factual claims are stored as statements in Wikidata

subject - predicate - object

or

item - property - value

or

thing - relationship - thing
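
The three-part structure above can be sketched in a few lines of Python. The `Statement` type and its field names are illustrative only and are not part of any Wikidata library; the identifiers are the ones this deck uses for the United States Congress.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One claim: item - property - value (subject - predicate - object)."""
    item: str   # Q number of the subject
    prop: str   # P number of the relationship
    value: str  # a Q number, or a literal such as a string or a date


# "United States Congress" (Q11268) is an "instance of" (P31)
# "bicameral legislature" (Q189445).
congress = Statement(item="Q11268", prop="P31", value="Q189445")

print(" - ".join(congress))
```

Because each part is an identifier rather than a word, the same statement can be displayed in any language that has labels for those identifiers.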

21 of 125

Moving to structured data

Lexical, unstructured:

The United States Congress is the bicameral legislature of the Federal government of the United States.

Semantic, structured:

United States Congress - instance of - bicameral legislature

United States Congress - in country - United States

22 of 125

Wikidata items and Q numbers

Q numbers - unique identifiers for items in Wikidata

Each item should have:

  • Labels
  • Descriptions
  • Aliases

23 of 125

Wikidata item for United States Congress (Q11268)

Q numbers - unique identifiers in Wikidata

24 of 125

Hands On:

View Wikidata item

Q11268

In a web browser, mobile or laptop, go to wikidata.org

Hit search icon. Start typing "United States Congress." Notice how many matches appear.

Select the item with the US seal.

Look at Statements and Identifiers

25 of 125

Wikidata item for United States Congress (Q11268)

26 of 125

Wikidata Basics

Q numbers - item/object

  • Anyone can make a Q item
  • Corresponds to Wikipedia article / concept
  • Examples
    • Q1 - the Universe
    • Q2 - Earth
    • Q5 - human
    • Q146 - cat
    • Q729 - animal
    • Q571 - book
    • Q7075 - library
    • Q33506 - museum

P numbers - property/relationship

  • Controlled vocabulary for consistency
  • Proposal / discussion / approval process
  • Examples
    • P31 - instance of
    • P279 - subclass of
    • P214 - VIAF ID
    • P217 - inventory number
    • P569 - date of birth
    • P625 - coordinate location
    • P1014 - Getty AAT ID
  • See: Wikidata:List_of_properties

"Statement triple": item - property - value

27 of 125

Wikidata item page

Claims capture factual, provable information

Any number of statements can be associated with an item

28 of 125

Wikidata statement triples

Underneath the surface...

Using symbols makes them language independent (identifiers vs names)

Item | Property | Value
"George Washington" (Q23) | "instance of" (P31) | "human" (Q5)
"George Washington" (Q23) | "place of burial" (P119) | "Mount Vernon" (Q731635)
"George Washington" (Q23) | "LCAuth ID" (P244) | "n86140996"

29 of 125

Wikidata and languages

Labels and descriptions for different languages

Exciting for translation applications

Language | Item (Q23) | Property (P31) | Value (Q5)
en | "George Washington" | "instance of" | "human"
de | "George Washington" | "ist ein(e)" | "Mensch"
es | "George Washington" | "instancia de" | "ser humano"
ms | "George Washington" | "contoh" | "manusia"
zh | 乔治·华盛顿 | 性質 | 人類
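
A rough sketch of how language-independent identifiers and per-language labels fit together. The `LABELS` dictionary and the `render` helper are hypothetical names for illustration; the label values are taken from this slide, and Malay is deliberately left out to show a fallback.

```python
# Labels keyed by identifier, then by language code (values from the slide).
LABELS = {
    "Q23": {"en": "George Washington", "de": "George Washington",
            "es": "George Washington", "zh": "乔治·华盛顿"},
    "P31": {"en": "instance of", "de": "ist ein(e)",
            "es": "instancia de", "zh": "性質"},
    "Q5": {"en": "human", "de": "Mensch", "es": "ser humano", "zh": "人類"},
}


def render(triple, lang, fallback="en"):
    """Return the triple's labels in `lang`, falling back to `fallback`."""
    parts = []
    for identifier in triple:
        by_lang = LABELS.get(identifier, {})
        # Use the requested language, then the fallback, then the raw ID.
        parts.append(by_lang.get(lang) or by_lang.get(fallback) or identifier)
    return " - ".join(parts)


triple = ("Q23", "P31", "Q5")
print(render(triple, "de"))
print(render(triple, "ms"))  # no Malay labels loaded here, so English is used
```

The stored statement never changes; only the lookup of display labels does.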

30 of 125

Sidebar:

Museu Nacional fire in Brazil

Wikimedia community in Brazil and overseas helping to digitally reconstruct rooms/collections via Wikidata

31 of 125

Fire at Museu Nacional in Brazil

What can we do when a museum burns down?

Full presentation at GLAM WIKI 2018 conference in Tel Aviv

32 of 125

Felipe de Souza Beck/Commons

Lessons learned

Wikidata supports cross-cultural/language collaboration, independent of language.

Example: TABernacle lists [link]

English user knowing no Portuguese can help catalog and describe artifacts

33 of 125

Felipe de Souza Beck/Commons

Impact and learning

  • Commons: 471 → 3,724 images
    • Successful public outreach campaign
    • Many pictures of same items; hard to aggregate
    • Three uploaders were key
    • Remaining image gaps (~30% very relevant items with no found image)
  • Wikipedia: 8 → 57 articles
    • Scarce bibliographical references
    • Community will engage more when project is already successful
  • Wikidata: 8 → 429 items
    • 2,506 item statements
  • Wikisource: 0 → 3 items
    • Awareness of project low

Alfonso Gómez Paiva/Commons

34 of 125

Wikidata link on Wikipedia articles - hidden in plain sight

35 of 125

Wikidata stores statements as explicit triples - item + property + value

Item: United States Congress (Q11268)

Property: "instance of" (P31)

Value: "bicameral legislature" (Q189445)

36 of 125

Wikidata statement triples

Claims capture factual, provable information

Using symbols makes them language independent (identifiers vs names)

Item: United States Congress (Q11268)

Property: "instance of" (P31)

Value: "bicameral legislature" (Q189445)

37 of 125

Wikidata statement triples

Relationships are "first class" = �very fast to search and sort

Seconds vs minutes to search

Ad hoc data model highly adaptive

Well-suited to the wiki way

Item: United States Congress (Q11268)

Property: "instance of" (P31)

Value: "bicameral legislature" (Q189445)

38 of 125

Traditional databases

Schemas well-defined and controlled

Relational databases and SQL: Columns need lots of planning and forethought

Changes can be complex, with many cascading effects

Searches involving relationships can be slow or expensive (join operations)

Artist | Date of birth | Country | Medium
Henri Matisse | December 31, 1869 | France | Painting
Claude Monet | November 14, 1840 | France | Painting
Edward Hopper | July 22, 1882 | United States | Painting

Work | Creator | Date | Location
Les Bêtes De La Mer | Henri Matisse | 1950 | NGA
Cape Cod Morning | Edward Hopper | 1950 | SAAM
Nighthawks | Edward Hopper | 1942 | Art Inst Chicago
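
To make the join cost concrete, here is a small sketch that loads the artist and work tables from this slide into an in-memory SQLite database. The table and column names are assumptions chosen for the example, and dates are rewritten in ISO form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (name TEXT PRIMARY KEY, born TEXT, country TEXT, medium TEXT);
    CREATE TABLE work (title TEXT PRIMARY KEY, creator TEXT, year INTEGER, location TEXT);
""")
conn.executemany("INSERT INTO artist VALUES (?, ?, ?, ?)", [
    ("Henri Matisse", "1869-12-31", "France", "Painting"),
    ("Claude Monet", "1840-11-14", "France", "Painting"),
    ("Edward Hopper", "1882-07-22", "United States", "Painting"),
])
conn.executemany("INSERT INTO work VALUES (?, ?, ?, ?)", [
    ("Les Bêtes De La Mer", "Henri Matisse", 1950, "NGA"),
    ("Cape Cod Morning", "Edward Hopper", 1950, "SAAM"),
    ("Nighthawks", "Edward Hopper", 1942, "Art Inst Chicago"),
])

# "Which works were made by artists from the United States?" needs a join,
# because the country lives in one table and the work in the other.
rows = conn.execute("""
    SELECT work.title
    FROM work JOIN artist ON work.creator = artist.name
    WHERE artist.country = ?
    ORDER BY work.title
""", ("United States",)).fetchall()
us_titles = [title for (title,) in rows]
print(us_titles)
```

Recording a new kind of fact, such as who owned a work before the museum, would mean altering a table or adding another one, which is the planning overhead the slide describes.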

39 of 125

Wikidata and RDF databases

Relationships are explicit and precise

Database can take any shape and grow according to need

Also known as "graph databases"

[Graph diagram. Nodes: Edward Hopper, July 22, 1882, United States, Nighthawks, Cape Cod Morning, painting, creative work. Edge labels: date of birth, citizen of, creator, instance of, subclass of.]
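
The same facts can be held as a plain list of triples. This is a minimal sketch, not how Wikidata is implemented: the `match` helper is hypothetical, and the way nodes are connected is an assumption read off the node and edge labels on this slide.

```python
# Each fact is (subject, predicate, object); no table schema is declared.
TRIPLES = [
    ("Edward Hopper", "date of birth", "July 22, 1882"),
    ("Edward Hopper", "citizen of", "United States"),
    ("Nighthawks", "creator", "Edward Hopper"),
    ("Cape Cod Morning", "creator", "Edward Hopper"),
    ("Nighthawks", "instance of", "painting"),
    ("Cape Cod Morning", "instance of", "painting"),
    ("painting", "subclass of", "creative work"),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    ]


works = [s for s, _, _ in match(predicate="creator", obj="Edward Hopper")]
print(works)

# A new kind of relationship is just one more triple, with no schema change.
TRIPLES.append(("Nighthawks", "location", "Art Inst Chicago"))
print(match(subject="Nighthawks", predicate="location"))
```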

40 of 125

Summary

UPSIDES

RDF triples make for a very flexible and fast system

Suitable for the BEBOLD wiki culture

Multiple parallel ontologies can co-exist

DOWNSIDES

Schema-on-the-fly system can make modeling inconsistent and difficult

Hard for newcomers to understand

Multiple parallel ontologies can co-exist

41 of 125

Wikidata items

Using identifiers removes language dependence and ambiguity in:

Writing systems (Chinese, Serbian, Kazakh, et al.)

Phonetization variations

Spelling variations

Maiden vs. married names

Canonical identifiers help link to external databases

Item: Muammar Gaddafi (Q19878)

Muammar Gaddafi

Muammar Muhammad Abu Minyar al-Gaddafi

Colonel Gaddafi

Kadhafi

Mu‘ammar al Qaḏḏāfi

Moammar Al Qadhafi

Qaḏḏāfi

Gadafi

Kadaffi

Al-Khadafy

Gadaffi

Kaddafi

Muammar al–Gaddafi

Jaddafi

Qaddafy

Muammar Gaddafi

Muhamar Gadaffi

Mu‘ammar al-Qaḏḏafī

Al-Qadhdhaafi

Gadhafi

Qaḏḏafi

Qaḏḏāfī

Muammar Muhammad Abu Minyar al-Gaddafi

Khadafi

Mu‘ammar al-Qaḏḏāfi

Gaddafi

Muammar el Gadafi

Muamar al-Gaddafi

Muamar al Gaddafi

Mu‘ammar al Qaḏḏafi

Kadafi

Omar Gadafi

Kaddaffi

Moammar Jaddafi

Muamar Gadafi

Muamar el-Gadafi

Mu‘ammar al Qaḏḏāfī

Al-Qathafi

Mu‘ammar al-Qaḏḏafi

Muamar al Gadafi

Moammar Gadafi

Muammar al-Gaddafi

Muhammad Ghadaffi

Muammar el Gaddafi

Muamar al Gaddafhi

Mu‘ammar al-Qaḏḏāfī

Mu‘ammar al Qaḏḏafī

Khaddafi

Muammar al Gaddafi

Qaḏḏafī

El Kazzafi

Muhamad Gadafi

Muamar al-Gaddafhi

53 Latinized variations!

(May 2017)

42 of 125

Speed, consistency, automation

Wikidata has more than 52 million items

Simple searches take less than a second

Complex queries supported by open standards like SPARQL

43 of 125

Search example - Find all bicameral legislatures

Item: ?

Property: "instance of" (P31)

Value: "bicameral legislature" (Q189445)
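
Written as a query, the pattern with the unknown item might look like the sketch below. The helper name `instances_query` is made up for this example; `wd:` and `wdt:` are prefixes that the Wikidata Query Service predefines.

```python
def instances_query(class_qid):
    """Build a SPARQL query for every item that is an instance of class_qid."""
    # ?item is the unknown slot; P31 is "instance of".
    return f"SELECT ?item WHERE {{ ?item wdt:P31 wd:{class_qid} . }}"


query = instances_query("Q189445")  # bicameral legislature
print(query)
```

The printed text can be pasted into the query box at query.wikidata.org.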

44 of 125

Wikidata Search - Result from Query

52 million items in less than a second

45 of 125

Hands On:

Editing Wikidata items

In a web browser, mobile or laptop, go to wikidata.org

Hit search icon.

Look for the most local village/township/neighborhood you call home (i.e., not New York City, but Greenwich Village)

How accurate is it? Can you help improve it?

Anonymous editing allowed

46 of 125

Editing Wikidata

Just as with Wikipedia:

Edit button

+add button

47 of 125

Wikidata items have identifiers - VIAF links back

48 of 125

Properties: Identifiers

Indexes into other databases

Authority control

Accession numbers

Catalog identifiers

Stable URLs to other sites
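
One way to picture identifier properties: each one stores a short string that slots into a URL pattern on the external site. The sketch below is illustrative only. The `URL_TEMPLATES` table and `external_url` helper are hypothetical, and the two URL patterns are assumptions about how the Library of Congress and VIAF sites address records, not values read from Wikidata.

```python
# Assumed URL patterns; "{id}" is replaced by the stored identifier.
URL_TEMPLATES = {
    "P244": "https://id.loc.gov/authorities/names/{id}",  # LCAuth ID
    "P214": "https://viaf.org/viaf/{id}",                 # VIAF ID
}


def external_url(prop, identifier):
    """Return a link into the external database, or None if no pattern is known."""
    template = URL_TEMPLATES.get(prop)
    return template.format(id=identifier) if template else None


# George Washington's LCAuth ID as shown earlier in this deck.
print(external_url("P244", "n86140996"))
```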

49 of 125

Wikidata items have identifiers - links to external databases

Barack Obama (Q76) has more than 83 identifiers!

50 of 125

Some prominent identifiers - links to external databases

WorldCat

VIAF

LC Name Authority File

ISNI

GND (Integrated Authority File)

SUDOC (French universities)

BnF (Bibliothèque nationale de France)

MusicBrainz

Bio Directory of Congress

Quora topic ID

C-SPAN person ID

Freebase

NDLAuth ID (National Diet Library of Japan)

SELIBR (National Library of Sweden Libris)

NLA (Australia) ID

NKCR Czech National Authority Database (National Library of Czech Republic)

RSL ID (person) Russian State Library

IMDB

Dutch National Thesaurus for Author names

Declarator.org - Russian non-governmental database with information on the income of government officials

NUKAT - Center of Warsaw University Library catalog

CiNii (Scholarly and Academic Information Navigator) Japan

NNDB people ID - Notable Names Database

Politifact

Encyclopedia Britannica ID

CONOR ID (Slovenia)

NYT topic ID

Guardian topic ID

Parlement & Politiek ID (Dutch politics site)

Social Networks and Archival Context ID (SNAC)

NARA

California Digital Library

University of Virginia

University of California, Berkeley

51 of 125

Smithsonian related properties

52 of 125

OCLC related properties

53 of 125

Wikidata as database of databases

As meta-database, Wikidata doesn't need to hold all the data but can act as a hub for federated searches - search across multiple databases

Wikidata has more than 44 SPARQL endpoints in federation: Europeana, Smithsonian AAM, Getty, Yale Center for British Art, British Museum, et al.

See: SPARQL Federation endpoints

54 of 125

Scenario: Heritage months

Smithsonian American Art Museum database does not record gender or ethnicity of artists.

For Black History Month and Women's History Month this posed a problem.

Sum of all databases - Wikidata and federated search can solve these issues

55 of 125

Consistency and bounds checking

Constraint reports/violations provide warnings on logic and bounds
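
To give a feel for what a logic or bounds check does, here is a toy example. It is not Wikidata's constraint system; the `check_dates` function and its warning texts are invented for illustration, and the death date in the usage line is a deliberately wrong value.

```python
from datetime import date


def check_dates(born, died=None, today=None):
    """Return a list of warnings for implausible birth/death dates."""
    today = today or date.today()
    warnings = []
    if born > today:
        warnings.append("date of birth is in the future")
    if died is not None and died < born:
        warnings.append("date of death precedes date of birth")
    return warnings


# Edward Hopper's date of birth from this deck, paired with a bad death date.
print(check_dates(date(1882, 7, 22), died=date(1800, 1, 1)))
```

A warning flags the statement for human review rather than blocking the edit.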

56 of 125

Searching and displaying Wikidata

Querying tools

Presentation

Visualization

APIs and endpoints

57 of 125

Basic search with Wikidata

SPARQL endpoint at query.wikidata.org

Superficially similar to SQL

One of the busiest endpoints on the Internet

https://wikidata.org/wiki/Wikidata:SPARQL_query_service
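
Programs can use the same endpoint as the web page. The sketch below only builds the request URL and does not contact the network; the `request_url` helper is a made-up name, and fetching the URL with an HTTP client is left to the reader.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def request_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# Five items that are an "instance of" (P31) cat (Q146).
url = request_url("SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . } LIMIT 5")
print(url)
```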

58 of 125

Basic search with Wikidata

SPARQL is an open standard

Try the "Examples" button for lots of interesting searches

Hint: Don't write queries from scratch. Modify existing ones!

Use auto-complete with CTRL-SPACE (Beware Mac users!)

http://tinyurl.com/y7nvjgm9

59 of 125

Hands On:

Run your first Wikidata Query

Visit query.wikidata.org

  • Choose "Examples" folder
  • Select "Cats"

Or directly click: http://bit.ly/wikidata-catquery

Hit blue triangle button to run!

Try horses (Q726)
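
Following the hint to modify rather than write from scratch, the sketch below starts from a query along the lines of the "Cats" example and swaps in the horse identifier. The exact text of the example on the query service may differ, and `swap_class` is a hypothetical helper.

```python
import re

CATS_QUERY = """SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}"""


def swap_class(query, old_qid, new_qid):
    """Replace one class identifier with another, matching whole identifiers only."""
    return re.sub(rf"\bwd:{old_qid}\b", f"wd:{new_qid}", query)


horses_query = swap_class(CATS_QUERY, "Q146", "Q726")
print(horses_query)
```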

60 of 125

Advanced search with Wikidata

Statistics, graphs, maps via Wikidata

Discover stories in the data:

Example: Where have members of Congress been educated?

61 of 125

Example: Education of Congress

List all members of Congress who have ever served and examine where they have been educated

http://tinyurl.com/k8tqzj7

?moc wdt:P31 wd:Q5 . #"instances of" humans

?moc wdt:P1157 ?lcbioid . #LC "Congress Bio ID"

?moc wdt:P69 ?school . #grab "educated at"

COUNT the occurrences of each school

ORDER them from highest to lowest

LIMIT it to the top 15 results
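
Put together, the three patterns and the COUNT / ORDER / LIMIT steps could form a query like the one below. This is a reconstruction under assumptions, not necessarily the query behind the short link: the variable names, the label service line, and the `top_schools_query` helper are illustrative.

```python
def top_schools_query(limit=15):
    """Build a SPARQL query counting where members of Congress were educated."""
    return f"""SELECT ?school ?schoolLabel (COUNT(?moc) AS ?count) WHERE {{
  ?moc wdt:P31 wd:Q5 .        # "instance of" human
  ?moc wdt:P1157 ?lcbioid .   # has a Congress Bio ID
  ?moc wdt:P69 ?school .      # "educated at"
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?school ?schoolLabel
ORDER BY DESC(?count)
LIMIT {limit}"""


print(top_schools_query())
```

GROUP BY is what makes COUNT work per school; without it the query would count all matches together.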

62 of 125

Education of Congress

Run time: about 15 seconds

Results can be shown in multiple ways

Tables

Maps

Charts

Timelines

http://tinyurl.com/k8tqzj7

63 of 125

Education of Congress

H-Y-P dominate

University of Michigan very prominent

Some surprises - Union College?

64 of 125

Union College

In 1800, the U.S. Big Four colleges:

Harvard, Yale, Princeton, Union (!)

Union College lost ground amid a financial scandal and Civil War attrition (1861-1865)

65 of 125

Deeper query: members of Congress educated at Union - table mode, date of birth

66 of 125

Members of Congress educated at Union - Timeline mode, Civil War (1861-1865)

Union College

Schenectady, NY

Columbia University

New York, NY

67 of 125

Impact of Wikidata

Google closed their own Freebase project in 2016, in favor of backing Wikidata

Google search results and Knowledge Graph use Wikidata

Schema.org has endorsed using Wikidata

68 of 125

Interesting Wikidata tools

Wikidata Query

Wikidata Graph Builder

Monumental

Reasonator

Vizquery

Gender Gap Tool

69 of 125

Wikidata Query: SPARQL endpoint

Try example queries

70 of 125

Reasonator: Nicely formatted Wikidata pages

71 of 125

SQID: Browsing Wikidata entries and linkages

72 of 125

SQID: Browsing Wikidata entries and linkages

73 of 125

Wikidata Distributed Game: click to contribute

74 of 125

Quickstatements: Bulk upload tables

https://tools.wmflabs.org/wikidata-todo/quick_statements.php
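
A bulk upload usually starts from a spreadsheet-like list of statements. The sketch below turns such rows into tab-separated command lines, assuming the tool's plain-text format of item, property and value separated by tabs with string values in double quotes. The `quickstatements_line` helper is hypothetical; check the tool's help page before pasting real batches.

```python
import re


def quickstatements_line(item, prop, value):
    """Format one statement as a tab-separated command line."""
    if not re.fullmatch(r"Q\d+", value):
        value = f'"{value}"'  # plain strings are wrapped in double quotes
    return "\t".join((item, prop, value))


rows = [
    ("Q23", "P31", "Q5"),           # George Washington - instance of - human
    ("Q23", "P244", "n86140996"),   # George Washington - LCAuth ID
]
batch = "\n".join(quickstatements_line(*row) for row in rows)
print(batch)
```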

75 of 125

Wikidata Graph Builder: Visualizing relationships

76 of 125

Query: Washington DC museums, metadata

Consider this query to find all museums in DC...

77 of 125

Query: Washington DC museums, metadata

Alternative, within 100 km of Washington DC city center:
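
The query on the slide is an image, so the following is only a sketch of what a radius search can look like. It assumes Q61 is the item for Washington, D.C. and uses the query service's `wikibase:around` search; the helper name and variable names are illustrative.

```python
def museums_near_query(center_qid="Q61", radius_km=100):
    """Build a SPARQL query for museums within radius_km of an item's coordinates."""
    return f"""SELECT ?museum ?museumLabel ?coord WHERE {{
  wd:{center_qid} wdt:P625 ?center .           # coordinate location of the center
  SERVICE wikibase:around {{
    ?museum wdt:P625 ?coord .
    bd:serviceParam wikibase:center ?center .
    bd:serviceParam wikibase:radius "{radius_km}" .   # kilometres
  }}
  ?museum wdt:P31/wdt:P279* wd:Q33506 .        # museum, or a subclass of museum
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}"""


print(museums_near_query())
```

Because the results carry coordinates, the query service can show them on a map as well as in a table.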

78 of 125

Query: Washington DC museums, results

Raw table results from query...

79 of 125

Wikidata helps make RDF more accessible

80 of 125

Wikidata: making RDF and semantic web more accessible/human

Wikidata query - user friendly

81 of 125

Vizquery: Simple Wikidata item selection

Much simpler way to do queries

82 of 125

Advanced Wikidata tools

Scholia - citations/authors of scholarly articles and journals

Bulk uploading - Quickstatements, Petscan

Wikidata Game, Distributed Game - contributing by clicking

83 of 125

Alternative contribution methods

Instead of individual item pages...

Task lists, games and other interfaces contribute to Wikidata

84 of 125

Wikidata Games: "One-click" contributions based on task lists

85 of 125

Notable external databases

Art and museum databases, thesauri, dictionaries, encyclopedias, national and academic libraries

Internet-based databases - IMDb, MusicBrainz, Quora

Mix'n'Match: a one-click "game" interface to help match external data to Wikidata

https://tools.wmflabs.org/mix-n-match

86 of 125

Wikidata Mix-n-match

87 of 125

Wikidata Game: Identifier match for SAAM

88 of 125

Getty AAT

89 of 125

Query: Washington DC museums, multiple views

90 of 125

Sum of all Paintings

  • Have an item for every notable painting in the world
  • Helps identify high-impact, high-traffic entries in Wikidata that need improvement
  • Wikidata:WikiProject_sum_of_all_paintings

91 of 125

Project: Wiki Art Depiction Explorer

Knight Prototype Fund 2018 - Arts and Tech

Rob Fernandez, WMDC

Effie Kapsalis, Smithsonian

Andrew Lih, WMDC

92 of 125

Background

Wikipedia and rich multimedia metadata

1

93 of 125

One of the toughest problems: metadata is typically created for art scholars.

Public-friendly metadata would be useful for discovery and re-use.

94 of 125

What is depicted in an artwork?

How do you consistently convey objects, themes, or concepts?

95 of 125

How can we provide the public with predictable points-of-entry to discover artwork?

96 of 125

While making it language independent and machine searchable?

97 of 125

This has been one of the grand challenges in the cultural and heritage sector.

98 of 125

Vision and machine algorithms are limited in what they can do

99 of 125

Wikipedia has shown that volunteer contributors can solve these types of problems at scale

100 of 125

Wiki Art Depiction Explorer (WADE)

Using Wikipedia, Wikimedia Commons and Wikidata for art metadata

2

101 of 125

The Wiki Art Depiction Explorer will help enrich semantic data about artworks in Wikidata, making it free for the world

102 of 125

Every artwork described in Wikidata can have a depicts property, pointing to any number of terms

103 of 125

Wiki editors are already adding depicts statements to artworks

104 of 125

WikiProject Sum of All Paintings

Goal – richly describe more than 300,000 notable paintings in Wikidata/Wikipedia

105 of 125

Adding depicts properties now is haphazard, unguided and unassisted

106 of 125

107 of 125

Terms differ among individuals, institutions, and fields of study

108 of 125

What if there were a way to visualize the state of metadata for groups of artworks and provide direction for collaborators?

109 of 125

How might a context-driven metadata interface support effective and consistent crowdsourced results for describing art?

110 of 125

Approach

Established experience, leadership, and audience

3

111 of 125

What "vectors" can we use to suggest good depicts terms?

  • Similar work by artist
  • Same series - P179; follows/followed by - P155/P156
  • Movement - P135, eg. Impressionism
  • GPS "coordinates of the point of view" - P1259
  • Based on - P144
  • Art vocabularies (Europeana, Getty, Smithsonian)
  • Descriptions/critical analysis from Wikipedia article
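
As a rough illustration of the first vector in the list above (similar work by the same artist), the sketch below ranks depicts terms already used on an artist's other works. All data is hypothetical placeholder data, and `suggest_depicts` is an invented helper, not part of the tool.

```python
from collections import Counter

# Hypothetical works: creator plus existing "depicts" terms.
WORKS = {
    "work A": {"creator": "artist X", "depicts": ["street", "window", "night"]},
    "work B": {"creator": "artist X", "depicts": ["window", "woman"]},
    "work C": {"creator": "artist X", "depicts": []},
    "work D": {"creator": "artist Y", "depicts": ["horse"]},
}


def suggest_depicts(target, works=WORKS, top=3):
    """Rank terms used on other works by the same creator, most frequent first."""
    creator = works[target]["creator"]
    counts = Counter(
        term
        for title, work in works.items()
        if title != target and work["creator"] == creator
        for term in work["depicts"]
    )
    already = set(works[target]["depicts"])
    return [term for term, _ in counts.most_common() if term not in already][:top]


print(suggest_depicts("work C"))
```

The other vectors (series, movement, point of view, vocabularies) could each feed a similar ranked list, with a human contributor making the final choice.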

112 of 125

Goal: Create a tool to engage art-loving public in enhancing cultural heritage metadata for discovery and creativity

113 of 125

Most depicted items in all paintings in Wikidata

http://tinyurl.com/yanpbxh5

European, Christian art themes prominent

114 of 125

Most depicted items in SAAM paintings in Wikidata

http://tinyurl.com/y7uq2rxv

115 of 125

Most depicted items in NPG paintings in Wikidata

http://tinyurl.com/yclxf326
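
The "most depicted" charts come from counting depicts values across paintings. The sketch below is one plausible shape for such a query, not necessarily the one behind the short links. It assumes P180 is "depicts", P195 is "collection" and Q3305213 is "painting"; the Q number for a specific museum would have to be looked up and passed in.

```python
def most_depicted_query(collection_qid=None, limit=20):
    """Build a SPARQL query ranking what paintings depict, optionally per collection."""
    collection = (
        f"  ?painting wdt:P195 wd:{collection_qid} .\n" if collection_qid else ""
    )
    return (
        "SELECT ?depicted ?depictedLabel (COUNT(?painting) AS ?count) WHERE {\n"
        "  ?painting wdt:P31 wd:Q3305213 .\n"   # instance of painting
        "  ?painting wdt:P180 ?depicted .\n"    # depicts
        f"{collection}"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        "GROUP BY ?depicted ?depictedLabel\n"
        "ORDER BY DESC(?count)\n"
        f"LIMIT {limit}"
    )


print(most_depicted_query())
```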

116 of 125

Previous depicts statements for Edward Hopper works

117 of 125

Find us, talk to us!

Effie Kapsalis, SI @digitaleffie

Andrew Lih, WMDC @fuzheado

118 of 125

Caveats

Wikidata still an early work in progress

Many areas well-modeled

Many areas quite bare (items with no statements)

Instances vs subclasses

119 of 125

Issues

Shifting to Wikidata:

  • Control of modeling
  • Display and formatting
  • Features vs. commonality

120 of 125

The future is structured through Wikidata

Wikidata: Internet duct tape

Research, academic hub

CC0 - no copyright

Join top cultural and commercial institutions already working with Wikidata

Ask questions!

121 of 125

122 of 125

Wikidata in one page

123 of 125

Actionable steps

@fuzheado

andrew.LIH@gmail.com

andrew@ANDREWLIH.com

Try editing Wikidata items

Try modifying SPARQL queries

Mix'n'match - Connect your data set or help match terms

Meetup - Contact us, not just DC or NY

Install Wikibase instance - engine that runs Wikidata - http://wikiba.se/

124 of 125

125 of 125

Thank you! Discussion - Q&A

FOLLOWUPS?