1 of 26

The Vadalog System: Datalog-based Reasoning for Knowledge Graphs

Luigi Bellomarini, Georg Gottlob, and Emanuel Sallinger

University of Oxford

VLDB 2018

Presented by Yuheng Wang

2 of 26

Knowledge Graphs (KGs)

  • Stores interlinked descriptions of entities/concepts
  • Encodes the semantics underlying the used terminology

3 of 26

Reasoning over KGs

  1. Most modern ontology languages cannot express this example
  2. Ontological reasoning over KGs has numerous requirements

4 of 26

Three Concrete Requirements for Reasoning over KGs

  • Recursion over KGs
    • Full recursion combined with arbitrary joins
    • At least encompass Datalog
  • Ontological reasoning
    • To express SPARQL reasoning under the OWL 2 QL entailment regime and set semantics
  • Low complexity
    • To allow scalability

5 of 26

Reasoning over KGs with Vadalog

  • Warded Datalog is promising
    • Satisfies all 3 requirements
    • However, the algorithms are alternating-time Turing machine algorithms, far away from a practical implementation
  • Vadalog’s core: Warded Datalog
    • Guarantees termination while keeping small memory footprint

6 of 26

Warded Datalog

Why wardedness?

  • To tame the propagation of nulls during the construction
  • Otherwise the reasoning is undecidable and the program will be infinite

7 of 26

Warded Datalog

What is wardedness?

  • Σ:

  • affected(Σ): KeyPerson[1]
    • Positions that has an existentially quantified variable
    • If there’s a position 𝐩 with 𝒗 in the body belongs to affected(Σ), the position 𝐩 with 𝒗 in the head also belongs to affected(Σ)
  • 𝒗 is harmless in a rule: appears at least once in nonaffected(Σ) of the body
  • 𝒗 is harmful in a rule: appears in affected(Σ) of the body
  • 𝒗 is dangerous in a rule: is harmful and appears appears in the head

8 of 26

Warded Datalog

What is wardedness?

  • Σ:

  • Warded rule
    • Dangerous variables in a rule appear within a single atom (the ward)
    • The ward shares with other atoms only harmless variables

9 of 26

Termination and Recursion Control

  • 3 complementary ways
    • Warded forest
    • Harmless Warded Datalog
    • Lifted linear forest

10 of 26

Warded Forest

  • Warded forest of a chase graph for Warded Datalog
    • All nodes of the chase graphs
    • All edges of the chase graph that correspond to the application of linear rules
    • One edge for each nonlinear rule – namely the one from the fact bound to the ward

11 of 26

Warded Forest

  • In each connected component we only have edges representing the application of linear rules or rules with warded dangerous variables
  • Every single fact will then inherit all its labelled nulls from exactly one fact
    • which is its direct parent in the application of a linear rule
    • or the ward in the application of a warded rule

12 of 26

Warded Forest

  • Two facts are isomorphic if they refer to the same predicate name, have the same constants in the same positions and there exists a bijection of labelled nulls into labelled nulls

13 of 26

Warded Forest

  • Theorem 1 gives a fundamental result in terms of description of the “topology” of the warded chase forest, since we see that each subtree is uniquely identified up to isomorphism of the nulls, by its root

14 of 26

Warded Forest

  • It points out a form of structural periodicity that we will exploit to efficiently guarantee termination

15 of 26

Harmless Warded Datalog

  • Theorem 1 ensures that for isomorphic facts, subtrees rooted at these facts will be isomorphic as well
  • Unfortunately, this property of subtrees in warded forests does not extend to generic chase graphs

16 of 26

Harmless Warded Datalog

  • Pruning a chase graph on the basis of isomorphism on labelled nulls is incorrect
  • However, if we forbid joins on harmful variables, then Theorem 1 extends to generic chase graphs

17 of 26

Harmless Warded Datalog

  • The limitation of Theorem 2 is it is restricted to Harmless Warded Datalog
  • To solve this, a Harmful Joins Elimination Algorithm is designed

18 of 26

Lifted Linear Forest

  • Ideally Theorem 1 and 2 would be enough for the implementation of a termination strategy
  • However, the algorithm has very low chances to detect isomorphic subtrees because the isomorphism check relies on the specific ground values in the facts

19 of 26

Lifted Linear Forest

  • Pattern-isomorphic
    • P(1,2,x,y) ⇔ P(3,4,z,y)
    • P(5,5,x,y) not ⇔ P(3,4,z,y)

20 of 26

Lifted Linear Forest

  • Counterexample on pattern-isomorphic pruning
    • There is a third fact k sharing a constant c with a′ ′ but not with b′ ′
    • A non-linear rule could then join a′ ′ with k, producing a
    • Since k does not join with b′ ′, there would be no facts b pattern-isomorphic to a

21 of 26

Lifted Linear Forest

  • A lifted linear forest for a chase graph is the subgraph that contains all the facts from the graph and only the edges that correspond to the application of linear rules

22 of 26

Algorithm and Architecture

23 of 26

Evaluation I: Synthetic Data

  • Data integration scenarios generated by iBench
  • STB-128
    • 250 warded rules, 25% of which contain existentials
    • 15 cases of harmful joins and 30 cases of propagation of labeled nulls with warded rules
    • Target instance contains 800k facts, with 20% of labelled nulls
    • We ran 16 different queries and averaged the response times
    • Queries involve on average 5 joins, harmful in 8 cases
  • ONT-256
    • It is a set of 789 warded rules, 35% of which contain existentials
    • 295 cases of harmful joins and more than 300 propagations of labelled nulls
    • Rules contain multiple joins as well as pervasive recursion
    • Target instance contains ∼2 million facts, with an incidence of 50% of labelled nulls
    • We ran 11 different queries and averaged the response times
    • Queries involve an average of 5 joins, which in 5 cases are harmful.

24 of 26

Evaluation I: Synthetic Data

  • Competitors are the systems reported in ChaseBench, who also had the results on scenarios STB-128 and ONT-256

25 of 26

Evaluation II: Real Scenarios

  • Company and person data extracted from DBpedia
  • PSC (persons with significant control)
    • All 67K companies and subsets of 1K, 10K, 100K, 1M and 1.5M of all persons
    • Executed the 5 settings with indexed tables in PostgreSQL, MySQL, Oracle and graphs in Neo4J
  • AllPSC
    • This scenario is a variation of the previous one
    • We actually want to group in a single set all the PSC for a company

26 of 26

THANK YOU