1 of 28

From Freebase to Wikidata

The Great Migration

Thomas Pellissier Tanon (ex-Google), Denny Vrandečić (Google), Sebastian Schaffert (Google), Thomas Steiner (Google), and Lydia Pintscher (Wikimedia Deutschland)

WWW 2016, April 14, 2016, Montreal, Canada

2 of 28

Introduction


3 of 28

Web-based Knowledge Bases

  • Web-based knowledge bases that make their data available under free licenses in a machine-readable form have become central for the data strategy of many projects and organizations.
  • They find applications in areas as diverse as Web search, natural language annotation, and translation.
  • One such collaborative knowledge base is Freebase, publicly launched by Metaweb in 2007 and acquired by Google in 2010.
  • Another example is Wikidata, a collaborative knowledge base developed by Wikimedia Deutschland since 2012 and operated by the Wikimedia Foundation.

4 of 28

Freebase

  • Freebase is an open and collaborative knowledge base publicly launched in 2007 by Metaweb and acquired in 2010 by Google.
  • It was used as the open core of the Google Knowledge Graph, and has found many use cases outside of Google.
  • Due to the success of Wikidata, Google announced in 2014 its intent to close Freebase and help with the migration of the content to Wikidata.

5 of 28

Wikidata

  • Wikidata is a collaborative knowledge base launched in October 2012 and hosted by the Wikimedia Foundation.
  • Its community has been growing quickly and, as of mid-2015, comprises about 6,000 active contributors.
  • The content of Wikidata is released into the public domain under the Creative Commons CC0 dedication.
  • As of September 2015, Wikidata counted about 70 million statements on 18 million entities.

6 of 28

Freebase Data Model

  • Freebase is built on the notions of objects, facts, types, and properties.
  • Each Freebase object has a stable identifier called a “mid” (for Machine ID), one or more types, and uses properties from these types in order to provide facts.
  • Freebase uses Compound Value Types (CVTs) to represent n-ary relations with n > 2.
  • CVT values are just objects, i.e., they have a mid and can have types.
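
To make the CVT mechanism concrete, here is a minimal illustrative sketch in Python. The identifiers are simplified stand-ins, not actual Freebase mids or properties; a marriage connects two people with a start and an end date, so the relation is reified as a CVT object that has its own mid and type:

    # Illustrative sketch (simplified identifiers, not the actual dump format):
    # an n-ary "marriage" relation reified as a CVT object with its own mid.

    # Plain binary facts attach directly to a topic ...
    person = {
        "mid": "/m/0abc12",
        "types": ["/people/person"],
        "/people/person/date_of_birth": "1952-03-11",
    }

    # ... while an n-ary relation (spouses + start date + end date) is expressed
    # through a CVT: the CVT value is itself an object with a mid and a type.
    marriage_cvt = {
        "mid": "/m/0xyz34",              # CVTs are regular objects with a mid
        "types": ["/people/marriage"],   # ... and with types
        "/people/marriage/spouse": ["/m/0abc12", "/m/0def56"],
        "/people/marriage/from": "1991-06-25",
        "/people/marriage/to": "2000-11-08",
    }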

7 of 28

Wikidata Data Model

  • Wikidata’s data model relies on the notions of item and statement. An item represents an entity and has a stable identifier called a “Q-ID”; it may carry labels, descriptions, and aliases in multiple languages, as well as statements and links to pages about the entity in other Wikimedia projects.
  • In contrast to Freebase, Wikidata statements do not aim to encode true facts but claims from different sources, which may also contradict each other; this allows, for example, border conflicts to be expressed from different political points of view.
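
As an illustration of the data model described above, a single statement can be sketched as the following simplified Python structure, loosely following the Wikibase JSON format (fields are trimmed; in the full format the statement would be attached to an item, e.g., Q17 for Japan, under its claims):

    # Simplified sketch of a Wikidata statement: a main property-value pair plus
    # optional qualifiers, references, and a rank (not the complete JSON format).
    statement = {
        "mainsnak": {
            "property": "P36",  # "capital"
            "datavalue": {"type": "wikibase-entityid",
                          "value": {"entity-type": "item", "numeric-id": 1490}},  # Q1490: Tokyo
        },
        "qualifiers": {
            "P580": [{"time": "+1869-00-00T00:00:00Z"}],  # "start time"
        },
        "references": [
            {"P854": ["https://example.org/source"]},  # "reference URL"
        ],
        "rank": "normal",  # preferred / normal / deprecated
    }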

8 of 28

Motivation for the Freebase Shutdown

  • When Freebase was publicly launched back in 2007, it was thought of as a “Wikipedia for structured data”.
  • The Knowledge Graph team at Google has been closely watching the Wikimedia Foundation’s project Wikidata and believes strongly in a robust community-driven effort to collect and curate structured knowledge about the world.
  • The team now thinks it can serve that goal best by supporting Wikidata, as the project is growing fast, has an active community, and is better suited to lead an open, collaborative knowledge base.
  • The Knowledge Graph Search API was launched as a replacement for the Freebase APIs.

9 of 28

Migration Challenges—Licensing

  • Wikidata is published under Creative Commons Zero (CC0 1.0), which waives all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
  • Freebase is published under a Creative Commons Attribution (CC BY 2.5) license.
  • Google does not own the copyright of some parts of the content of the knowledge base, such as images or long entity descriptions extracted from Wikipedia.
  • We filtered the Freebase dump by removing this kind of content before creating a data dump that Google could relicense under CC0 (a filtering sketch follows this list).
  • This step removed about 42 million of the 3 billion facts in the original corpus, i.e., about 1.4%, from the set that could be republished.
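
A minimal sketch of such a filtering pass, assuming the dump is read as tab-separated N-Triples; the excluded predicates below are illustrative examples of image and description properties, not the complete list used for the migration:

    # Hedged sketch: drop triples whose predicate carries content that Google
    # cannot relicense under CC0 (the predicate set here is illustrative only).
    import gzip

    EXCLUDED_PREDICATES = {
        "<http://rdf.freebase.com/ns/common.topic.description>",
        "<http://rdf.freebase.com/ns/common.topic.image>",
    }

    def filter_dump(in_path: str, out_path: str) -> None:
        with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
            for line in src:                    # one triple per line
                parts = line.split("\t")
                if len(parts) >= 3 and parts[1] not in EXCLUDED_PREDICATES:
                    dst.write(line)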

10 of 28

Migration Challenges—References

  • The Wikidata community is very eager to have references for its statements, i.e., sources backing each claim, which Freebase usually did not store.
  • In order to provide the Wikidata community with references for the facts in Freebase, we have reused data from the Google Knowledge Vault, which aims at extracting facts from the Web.
  • Issue: in many cases, the pages the facts were extracted from did not meet Wikidata’s requirements for reliable references.
  • It became necessary to filter the references before potential inclusion in Wikidata by introducing a domain blacklist.
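
A minimal sketch of the blacklist idea, assuming every Knowledge Vault fact carries the URL of the page it was extracted from; the blacklisted domains are purely illustrative:

    # Hedged sketch: keep a candidate reference only if its source domain is not
    # blacklisted (the actual blacklist used for the migration is not reproduced).
    from urllib.parse import urlparse

    DOMAIN_BLACKLIST = {"example-mirror.org", "example-forum.net"}  # illustrative

    def acceptable_reference(source_url: str) -> bool:
        domain = urlparse(source_url).netloc.lower()
        # Reject blacklisted domains and their subdomains.
        return not any(domain == d or domain.endswith("." + d)
                       for d in DOMAIN_BLACKLIST)

    candidates = [
        {"statement": "...", "source": "https://example-forum.net/thread/42"},
        {"statement": "...", "source": "https://www.example.edu/staff/page"},
    ]
    kept = [c for c in candidates if acceptable_reference(c["source"])]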

11 of 28

Migration Challenges—Data Quality and Maintenance

  • The data quality of Freebase was discussed by the Wikidata community, and it was decided that the overall quality was not sufficient for a direct import.
  • In consequence, a fully automatic upload of all Freebase content into Wikidata did not seem advisable, as the expectations of the Wikidata community regarding the quality of automatically uploaded data are high.
  • Simply ingesting the Freebase data would have meant overwhelming the existing editors and thereby harming the project in the long run.
  • As an alternative approach, we decided to rely on crowdsourced human curation and created the Primary Sources Tool.

12 of 28

Migration Challenges—Data Topic Mappings

  • It is challenging to create data mappings between Freebase topics and properties on the one hand, and Wikidata items and properties on the other.
  • Two mappings between Freebase topics and Wikidata items were initially available, and in addition to those, we worked on further mappings.
  • One existing mapping is from Google and one from Samsung. Both are based on Wikipedia links already present in Freebase: if a Freebase topic and a Wikidata item share at least 2 (Google) or 1 (Samsung) Wikipedia links, they are assumed to describe the same subject (see the sketch after this list).
  • For the mapping between Freebase and Wikidata properties, we have chosen to apply a manual approach with the help of the Wikidata community.
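
The link-overlap heuristic can be sketched as follows; the data layout, tie handling, and thresholds are simplifications, not the exact procedure behind the Google and Samsung mappings:

    # Hedged sketch: a Freebase topic and a Wikidata item are assumed to describe
    # the same subject if they share at least `min_shared` Wikipedia links
    # (2 for the Google mapping, 1 for the Samsung one).
    from collections import defaultdict

    def build_mapping(freebase_links, wikidata_links, min_shared=2):
        """freebase_links: {mid: set of Wikipedia article URLs}
           wikidata_links: {qid: set of Wikipedia article URLs (sitelinks)}"""
        # Index Wikidata items by each of their Wikipedia links.
        items_by_link = defaultdict(set)
        for qid, links in wikidata_links.items():
            for link in links:
                items_by_link[link].add(qid)

        mapping = {}
        for mid, links in freebase_links.items():
            shared = defaultdict(int)       # shared-link count per candidate item
            for link in links:
                for qid in items_by_link.get(link, ()):
                    shared[qid] += 1
            candidates = [qid for qid, n in shared.items() if n >= min_shared]
            if len(candidates) == 1:        # keep only unambiguous matches here
                mapping[mid] = candidates[0]
        return mapping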

13 of 28

Primary Sources Tool—Front-end


14 of 28

Primary Sources Tool—Introduction

  • A crowdsourced human curation tool that displays Freebase statements to Wikidata contributors for verification, so that these statements can be added to the currently shown Wikidata item.
  • With just one click, the contributor can reject or approve a statement and, in case of approval, add it to Wikidata (see the sketch after this list).
  • The code of the Primary Sources Tool is openly available (https://github.com/google/primarysources) under the terms of the Apache 2.0 license.
  • The tool is deployed as a gadget, so that it can be easily enabled as an add-on feature in Wikidata.
  • It is independent from Freebase and can be—and already has been—reused to import other datasets into Wikidata.
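
To illustrate the approval step referenced above, the following sketch shows how a curation tool could push an approved statement to Wikidata through the Wikibase API's wbcreateclaim action. This is a simplified illustration, not the Primary Sources Tool's actual code, and it assumes an authenticated requests session with edit rights:

    # Hedged sketch: create a claim on a Wikidata item once a curator approves it.
    import json
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def push_approved_statement(session: requests.Session, item: str,
                                prop: str, target_item_id: int) -> dict:
        # Fetch the CSRF token required for any edit action.
        token = session.get(API, params={
            "action": "query", "meta": "tokens", "format": "json",
        }).json()["query"]["tokens"]["csrftoken"]

        # Create the claim; here the value is another Wikidata item.
        value = json.dumps({"entity-type": "item", "numeric-id": target_item_id})
        return session.post(API, data={
            "action": "wbcreateclaim", "entity": item, "property": prop,
            "snaktype": "value", "value": value,
            "token": token, "format": "json",
        }).json()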

15 of 28

Primary Sources Tool—Front-end

  • For the front-end, our objective was to model the migration process as closely as possible on the natural Wikidata editing flow.
  • To achieve this, we first created a Wikidata user script. Wikidata user scripts are part of the Wikidata tool chain and are created by users, but unlike gadgets they do not appear in a user’s preferences.
  • Once a user script has matured, it can be converted into a gadget. Gadgets are scripts that are likewise created by users, but which can be simply enabled in user preferences under the section “Gadgets”.
  • They can only be edited by administrators and are assumed to be stable.

16 of 28

Primary Sources Tool—Leaderboard

  • At the time of writing (April 2016), the tool has been used by more than a hundred users who have performed about 160,000 approval or rejection actions.
  • More than 14 million statements have been uploaded to the tool in total.
  • To visualize the migration progress, we have created a real-time leaderboard.

17 of 28

Primary Sources Tool—Other Datasets

  • The Primary Sources Tool was designed from the start to be used with datasets other than the Freebase dataset.
  • A concrete first example is the “FBK StrepHit Soccer” dataset, which, among other datasets, can be activated in the tool by clicking on its gears icon.
  • Either one specific dataset or all datasets can be active at a time.

18 of 28

Primary Sources Tool—Back-end


19 of 28

Implementation

Back-end:

  • Service-oriented architecture in C++
  • FastCGI served by lighttpd
  • REST API using CppCMS
  • Data model using Protocol Buffers

Persistence:

  • CppDB as abstraction layer
  • Relational database (MySQL—Schema)
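
Purely as an illustration of what the back-end has to track, a statement record with its approval state might look like the sketch below; the field names are hypothetical, and the authoritative definitions are the Protocol Buffers data model and the MySQL schema mentioned above:

    # Hypothetical record layout (illustration only, not the actual schema).
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class State(Enum):
        UNAPPROVED = "unapproved"   # served to curators, not yet reviewed
        APPROVED = "approved"       # accepted and pushed to Wikidata
        WRONG = "wrong"             # rejected by a curator

    @dataclass
    class Statement:
        id: int                     # primary key in the relational store
        subject: str                # Wikidata item, e.g. "Q42"
        property: str               # Wikidata property, e.g. "P69"
        value: str                  # serialized value (item, string, time, ...)
        dataset: str                # e.g. "freebase" or "FBK StrepHit Soccer"
        state: State = State.UNAPPROVED
        curated_by: Optional[str] = None   # Wikidata user who approved/rejected it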

20 of 28

Requirements vs. Reality

Requirements:

  • Serve >100 million statements and keep track of approval state
  • Access data through a REST API
  • Average response time <0.5 seconds
  • Run on limited resources of http://tools.wmflabs.org

Implementation:

  • Currently ~15M statements
  • Response time (GET): 54ms median, 260ms average, 840ms 90th percentile
  • Response time (POST): 80ms median, 197ms average, 622ms 90th percentile
  • Maximum memory consumption: 40MB
  • Maintenance-free for about a year

21 of 28

Evaluation / Statistics


22 of 28

Freebase vs. Wikidata (August 2015)

Freebase:

  • 48 million topics
  • 3 billion triples
  • 442 million “useful” facts
  • 68 million labels

Wikidata:

  • 14.5 million items
  • 66 million statements
  • 82 million labels

Differences:

  • Encoding of statements in Wikidata is more efficient.
  • Many Freebase topics do not match Wikidata’s “Notability Criteria”.
  • Freebase contains many redundancies, e.g., reverse edges.
  • Freebase contains duplicate data.

23 of 28

Raw Data

  • 4.56 million topics from Freebase mapped to Wikidata (~9.5% of Freebase, 21% increase for Wikidata).
  • 19.6 million Wikidata statements (out of 64 million Freebase triples).
  • 14 million new and unique Wikidata statements (after removing duplicates and facts already in Wikidata).

24 of 28

Usage Data (April 2016)

Data Statistics:

  • 14.6M statements from Freebase and other sources served by backend
  • 137k statements manually approved for Wikidata.
  • 30k statements manually marked as wrong.
  • 100k approvals (out of 133k) done by top 10 users.

Access Statistics:

  • ~4000 entities shown per day.
  • ~500 approvals, ~150 rejections per day.

25 of 28

Conclusions and Future Work


26 of 28

Future Work

  • The largest gains for the migration can be achieved by extending the mapping to more Freebase topics. A possible way of realizing this is to create a user interface that would allow users to create new Wikidata items for a suggested topic or to add a mapping for them to an existing Wikidata item.
  • In order to suggest interesting topics to add to Wikidata, we could rank the topics that are not mapped yet by the number of incoming links from already mapped topics (see the sketch after this list).
  • Further improvements could come from uploading high-quality datasets with a bot, such as the already reviewed facts or some sets of external identifiers.
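
The ranking idea above could be sketched like this; the data layout is hypothetical:

    # Hedged sketch: rank unmapped Freebase topics by how many already-mapped
    # topics link to them, as a proxy for how interesting they would be to add.
    from collections import Counter

    def rank_unmapped(edges, mapped):
        """edges: iterable of (source_mid, target_mid) Freebase links.
           mapped: set of mids that already have a Wikidata mapping."""
        scores = Counter()
        for source, target in edges:
            if source in mapped and target not in mapped:
                scores[target] += 1      # incoming link from a mapped topic
        return scores.most_common()      # most-linked unmapped topics first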

27 of 28

Conclusions

  • Provided the Wikidata community with more than 14 million new Wikidata statements using a generalizable approach consisting of data preparation scripts and the Primary Sources Tool.
  • The effort needed to map two fairly different knowledge bases has also been a good occasion to highlight how difficult it is to find adequate metrics that measure the size of knowledge bases, and in consequence their “value”, in a meaningful way.
  • With the help of the Primary Sources Tool and the Freebase dataset—and in future even more datasets—we will increase the completeness and accuracy of Wikidata.

[Figure] Freebase article spatio-temporal distribution

[Figure] Wikidata article spatio-temporal distribution

28 of 28

Resources