1 of 19

Data quality issues in DBpedia and the challenges of redesigning the ontology

Presented by Gustavo Publio, PhD candidate at AKSW Group, Leipzig University

2 of 19

DBpedia ontology - overview

  • DBpedia has existed for over a decade
  • Almost 30 million instances, 80 million properties and more than 88 million type statements
  • Ontology releases currently follow 31 of the 35 W3C “Data on the Web” best practices

3 of 19

DBpedia data quality

  • Many data quality issues can be identified in DBpedia, including:
    • Inconsistencies
    • Incompleteness
    • Non-conforming or wrong data
    • Etc.
  • Such issues are caused by different failure points such as:
    • Mappings
    • Misleading ontology structure
    • Wrong information at source (Wikipedia)

4 of 19

DBpedia data quality - mapping issues

  • Wrong classification of properties
    • Consider http://dbpedia.org/ontology/sex: the property has domain dbo:Person, but all Person instances use foaf:gender instead. If we query for this property as predicate...
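
A minimal sketch of such a query, assuming the public DBpedia endpoint and the standard dbo: namespace. The grouping by subject type is an illustrative addition and is not taken from the slides. It lists which kinds of resources actually use dbo:sex as a predicate, so the declared domain can be compared with real usage.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

# Count the subjects that use dbo:sex, grouped by their rdf:type.
# Subjects without any type still appear, with ?subjectType unbound.
SELECT ?subjectType (COUNT(DISTINCT ?s) AS ?uses)
WHERE {
  ?s dbo:sex ?value .
  OPTIONAL { ?s a ?subjectType }
}
GROUP BY ?subjectType
ORDER BY DESC(?uses)
```

If dbo:Person is missing from the result or ranks low, the declared domain does not match how the property is used.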

8 of 19

DBpedia data quality - ontology structure

  • Inaccuracy or omission of types and properties
    • SPARQL query to retrieve the historical list of all presidents of Brazil

Try #1

SELECT DISTINCT ?person
WHERE {
  ?person a dbo:President .
  ?person dbo:birthPlace :Brazil
}

  • Returns only 15 results!

10 of 19

DBpedia data quality - ontology structure

  • Inaccuracy or omission of types and properties
    • SPARQL query to retrieve the historical list of all presidents of Brazil

Try #2

SELECT DISTINCT ?person
WHERE {
  ?person a dbo:President .
  ?person dbo:birthPlace/dbo:country :Brazil
}

  • Returns 69 results!
    • Many entries are still missing (recent ones such as Fernando Henrique, Luís Inácio, Dilma Rousseff, etc.)
    • It also lists vice-presidents whose rdf:type is dbo:President

11 of 19

DBpedia data quality - ontology structure

  • Inaccuracy or omission of types and properties
    • SPARQL query to retrieve the historical list of all presidents of Brazil

Query with expected results

SELECT DISTINCT ?person
WHERE {
  ?person a dbo:Person .
  ?person dct:subject <http://dbpedia.org/resource/Category:Presidents_of_Brazil> .
}
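
One way to make the omission visible, as a hedged sketch: list the members of the same category that do not carry the dbo:President type. It assumes that dct: resolves to http://purl.org/dc/terms/ and that category membership is reasonably complete. No result counts are claimed here.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>

# Persons in the Wikipedia category "Presidents of Brazil"
# that are NOT typed as dbo:President in the data.
SELECT DISTINCT ?person
WHERE {
  ?person a dbo:Person ;
          dct:subject <http://dbpedia.org/resource/Category:Presidents_of_Brazil> .
  FILTER NOT EXISTS { ?person a dbo:President }
}
ORDER BY ?person
```

Each row is a candidate for a missing type statement, which can be traced back to the mapping or to the ontology structure.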

12 of 19

Initiatives for improving DBpedia data quality

  • DBpedia Links
    • Ensures that the number of links keeps growing
  • DBpedia Fusion
    • Fusing the chapters and adding sources other than Wikipedia
  • Redesigning the ontology
    • Releasing a cleaner alternative version alongside the current one
  • Moving the ontology hosting to GitHub
    • For better issue tracking, versioning, validation, etc.

15 of 19

DBpedia ontology structure redesign

  • Top-down approach
    • Currently there are 50 top-level classes, including Colours, Diploma and Polyhedron (?)
  • New hierarchy of top-level classes, as suggested by Gerard Kuys:
    • Activity
    • Agent
    • Concept
    • Communication System
    • Condition
    • Event
    • Physical Thing
    • Place
    • TimePeriod
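
To inspect the current top-level classes before regrouping them under the hierarchy above, a query along the following lines could be used. This is a sketch under the assumption that top-level classes are declared as direct subclasses of owl:Thing and that the ontology is loaded at the queried endpoint. The slides do not state how the figure of 50 was obtained.

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# DBpedia ontology classes declared as direct subclasses of owl:Thing.
SELECT ?class
WHERE {
  ?class a owl:Class ;
         rdfs:subClassOf owl:Thing .
  FILTER (STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}
ORDER BY ?class
```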

16 of 19

DBpedia ontology hosting and validation

  • Moving from the current FTP hosting to GitHub
    • Better versioning
    • Issue tracking
    • Support for different versions
  • Travis CI as the continuous integration system
  • Using RDFUnit to run tests over the ontology
  • Tests were written in the Shapes Constraint Language (SHACL)
    • SHACL is a W3C Recommendation
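
The kind of constraint such a test expresses can be sketched as a plain SPARQL query. This is an illustration only, not one of the project's actual SHACL shapes, and the chosen rule (every class needs an English label) is a hypothetical example. It assumes the ontology is loaded in the queried graph.

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Hypothetical validation rule: every DBpedia ontology class
# should have an English rdfs:label. Each row is a violation.
SELECT ?class
WHERE {
  ?class a owl:Class .
  FILTER (STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
  FILTER NOT EXISTS {
    ?class rdfs:label ?label .
    FILTER (LANG(?label) = "en")
  }
}
ORDER BY ?class
```

An empty result means the rule holds. In a continuous integration setup, a non-empty result would fail the build.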

17 of 19

Demo

18 of 19

DBpedia wants your support!

  • Existing ways to contribute
    • Mappings
    • Chapters
    • Error tracking
    • Etc.
  • Give us feedback on the ontology redesign
  • Collaborative creation of SHACL tests and validations

19 of 19

Thank you!

Special thanks to Dimitris Kontokostas and Jörn Hees for their support with the initial environment setup