1 of 26

The Vadalog System: Datalog-based Reasoning for Knowledge Graphs

Luigi Bellomarini, Georg Gottlob, and Emanuel Sallinger

University of Oxford

VLDB 2018

Presented by Yuheng Wang

2 of 26

Knowledge Graphs (KGs)

Stores interlinked descriptions of entities/concepts
Encodes the semantics underlying the used terminology

3 of 26

Reasoning over KGs

Most modern ontology languages cannot express this example
Ontological reasoning over KGs has numerous requirements

4 of 26

Three Concrete Requirements for Reasoning over KGs

Recursion over KGs

Full recursion combined with arbitrary joins
At least encompass Datalog

Ontological reasoning

To express SPARQL reasoning under the OWL 2 QL entailment regime and set semantics

Low complexity

To allow scalability

5 of 26

Reasoning over KGs with Vadalog

Warded Datalog is promising

Satisfies all 3 requirements
However, the algorithms are alternating-time Turing machine algorithms, far away from a practical implementation

Vadalog’s core: Warded Datalog

Guarantees termination while keeping small memory footprint

6 of 26

Warded Datalog

Why wardedness?

To tame the propagation of nulls during the construction
Otherwise the reasoning is undecidable and the program will be infinite

7 of 26

Warded Datalog

What is wardedness?

affected(Σ): KeyPerson[1]

Positions that has an existentially quantified variable
If there’s a position 𝐩 with 𝒗 in the body belongs to affected(Σ), the position 𝐩′ with 𝒗 in the head also belongs to affected(Σ)

𝒗 is harmless in a rule: appears at least once in nonaffected(Σ) of the body
𝒗 is harmful in a rule: appears in affected(Σ) of the body
𝒗 is dangerous in a rule: is harmful and appears appears in the head

8 of 26

Warded Datalog

What is wardedness?

Warded rule

Dangerous variables in a rule appear within a single atom (the ward)
The ward shares with other atoms only harmless variables

9 of 26

Termination and Recursion Control

3 complementary ways

Warded forest
Harmless Warded Datalog
Lifted linear forest

10 of 26

Warded Forest

Warded forest of a chase graph for Warded Datalog

All nodes of the chase graphs
All edges of the chase graph that correspond to the application of linear rules
One edge for each nonlinear rule – namely the one from the fact bound to the ward

11 of 26

Warded Forest

In each connected component we only have edges representing the application of linear rules or rules with warded dangerous variables
Every single fact will then inherit all its labelled nulls from exactly one fact

which is its direct parent in the application of a linear rule
or the ward in the application of a warded rule

12 of 26

Warded Forest

Two facts are isomorphic if they refer to the same predicate name, have the same constants in the same positions and there exists a bijection of labelled nulls into labelled nulls

13 of 26

Warded Forest

Theorem 1 gives a fundamental result in terms of description of the “topology” of the warded chase forest, since we see that each subtree is uniquely identified up to isomorphism of the nulls, by its root

14 of 26

Warded Forest

It points out a form of structural periodicity that we will exploit to efficiently guarantee termination

15 of 26

Harmless Warded Datalog

Theorem 1 ensures that for isomorphic facts, subtrees rooted at these facts will be isomorphic as well
Unfortunately, this property of subtrees in warded forests does not extend to generic chase graphs

16 of 26

Harmless Warded Datalog

Pruning a chase graph on the basis of isomorphism on labelled nulls is incorrect
However, if we forbid joins on harmful variables, then Theorem 1 extends to generic chase graphs

17 of 26

Harmless Warded Datalog

The limitation of Theorem 2 is it is restricted to Harmless Warded Datalog
To solve this, a Harmful Joins Elimination Algorithm is designed

18 of 26

Lifted Linear Forest

Ideally Theorem 1 and 2 would be enough for the implementation of a termination strategy
However, the algorithm has very low chances to detect isomorphic subtrees because the isomorphism check relies on the specific ground values in the facts

19 of 26

Lifted Linear Forest

Pattern-isomorphic

P(1,2,x,y) ⇔ P(3,4,z,y)
P(5,5,x,y) not ⇔ P(3,4,z,y)

20 of 26

Lifted Linear Forest

Counterexample on pattern-isomorphic pruning

There is a third fact k sharing a constant c with a′ ′ but not with b′ ′
A non-linear rule could then join a′ ′ with k, producing a′
Since k does not join with b′ ′, there would be no facts b′ pattern-isomorphic to a′

21 of 26

Lifted Linear Forest

A lifted linear forest for a chase graph is the subgraph that contains all the facts from the graph and only the edges that correspond to the application of linear rules

22 of 26

Algorithm and Architecture

23 of 26

Evaluation I: Synthetic Data

Data integration scenarios generated by iBench
STB-128

250 warded rules, 25% of which contain existentials
15 cases of harmful joins and 30 cases of propagation of labeled nulls with warded rules
Target instance contains 800k facts, with 20% of labelled nulls
We ran 16 different queries and averaged the response times
Queries involve on average 5 joins, harmful in 8 cases

ONT-256

It is a set of 789 warded rules, 35% of which contain existentials
295 cases of harmful joins and more than 300 propagations of labelled nulls
Rules contain multiple joins as well as pervasive recursion
Target instance contains ∼2 million facts, with an incidence of 50% of labelled nulls
We ran 11 different queries and averaged the response times
Queries involve an average of 5 joins, which in 5 cases are harmful.

24 of 26

Evaluation I: Synthetic Data

Competitors are the systems reported in ChaseBench, who also had the results on scenarios STB-128 and ONT-256

25 of 26

Evaluation II: Real Scenarios

Company and person data extracted from DBpedia
PSC (persons with significant control)

All 67K companies and subsets of 1K, 10K, 100K, 1M and 1.5M of all persons
Executed the 5 settings with indexed tables in PostgreSQL, MySQL, Oracle and graphs in Neo4J

AllPSC

This scenario is a variation of the previous one
We actually want to group in a single set all the PSC for a company

1 of 26

2 of 26

3 of 26

4 of 26

5 of 26

6 of 26

7 of 26

8 of 26

9 of 26

10 of 26

11 of 26

12 of 26

13 of 26

14 of 26

15 of 26

16 of 26

17 of 26

18 of 26

19 of 26

20 of 26

21 of 26

22 of 26

23 of 26

24 of 26

25 of 26

26 of 26