1 of 19

ProvGen: Generating synthetic PROV graphs with predictable structure

Hugo Firth

Newcastle University, UK

Paolo Missier

Newcastle University, UK

IPAW’14

Cologne, Germany, DLR

Tuesday 15th March 2016

2 of 19

Introduction

  • Where provenance is captured, it is often captured in bulk.

  • We’ve seen that provenance may come from varied domains.
  • Scalable infrastructure is required to take advantage of such large, heterogeneous datasets.
  • No generic, scalable architecture exists for PROV management and querying.
  • A bootstrapping problem: it is hard to encourage development with insufficient public data for effective benchmarks.


3 of 19

Justification

Linked Data Benchmark Council:

aims to establish industry cooperation between vendors of RDF and Graph database technologies in developing, endorsing, and publishing reliable and insightful benchmark results[ ldbc.eu ]

  • Data benchmarking is established practice in many other “Big data” domains:

Facebook:

We're proud to release LinkBench this week [...] and hope it will be a useful tool for other developers who need to benchmark and tune database systems [ facebook.com ]

  • Some nascent initiatives to collect and document a public corpus of PROV datasets (e.g. ProvBench’14 [ bit.ly/1c0q5rS ])


4 of 19

Justification (cont.)

  • Why are existing linked data benchmarks not enough?
  • Many existing implementations focus on domain-specific data, such as social networking.

  • Collections of PROV data, as meta-data, vary topologically as a function of the underlying application:

    • Scientific workflows produce very many small provenance ‘traces’

    • Version control systems produce fewer, larger, provenance ‘traces’

  • A generic provenance management architecture should be able to quantify its performance across the breadth of its potential applications.

  • There is no coherent picture of a common query workload.


5 of 19

Approach

  • We contend that synthetic PROV graphs can provide a useful addition to existing community efforts.

  • Not without precedent; however, most generation techniques only ensure global statistical properties of a graph, e.g.:

Preferential attachment:

Continuously add vertices to a graph; the probability of creating an edge with a vertex v is proportional to the degree of v. The resulting degree distribution follows a power law. Common in social networking data.

  • Statistical models lose semantics. In many PROV applications (data reproducibility etc…) semantics are vital.
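The preferential-attachment process described above can be sketched in a few lines of Python. This is the textbook technique the slide names, not ProvGen's code:

```python
import random

def preferential_attachment(n):
    """Grow a graph vertex by vertex; each new vertex attaches to an
    existing vertex with probability proportional to its degree."""
    edges = [(0, 1)]      # seed: a single edge between vertices 0 and 1
    # Each vertex appears in this list once per incident edge, so a
    # uniform draw from it is a degree-proportional draw over vertices.
    endpoints = [0, 1]
    for v in range(2, n):
        target = random.choice(endpoints)
        edges.append((v, target))
        endpoints.extend([v, target])
    return edges
```

The resulting degree distribution follows a power law — exactly the kind of purely statistical guarantee that, as noted below, discards PROV semantics.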


6 of 19

Approach (cont.)

  • We propose ProvGen [ prov-gen ], a tool for generating synthetic PROV datasets with predictable structure.
  • ProvGen aims to satisfy 3 key objectives:

    1. Generate potentially large synthetic PROV graphs.
    2. Provide a simple interface to support non-technical use.
    3. Generate synthetic PROV graphs with realistic structure and properties.


7 of 19

The Model - elementary create operations

  • Iterative generation process which starts from a single node.

  • Predefined set of create operations representing valid PROV.

  • Operations account for all relationships defined in PROV-DM.

  • Example of operations to represent used relationship:

      • given an entity node e, add a new activity node a and an edge used(a, e)

      • given an activity node a, add a new entity node e and an edge used(a, e)

      • given a pair of unrelated nodes (a, e), add edge used(a, e)
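A minimal sketch of how the three used operations above might look as code. The data structures are illustrative stand-ins, not ProvGen's implementation:

```python
import itertools

_fresh = itertools.count()  # fresh node identifiers

def new_node(graph, kind):
    """Add a node of the given PROV kind ('Entity' or 'Activity')."""
    node_id = next(_fresh)
    graph["nodes"][node_id] = kind
    return node_id

def used_from_entity(graph, e):
    """Given an entity node e, add a new activity a and an edge used(a, e)."""
    a = new_node(graph, "Activity")
    graph["edges"].append(("used", a, e))
    return a

def used_from_activity(graph, a):
    """Given an activity node a, add a new entity e and an edge used(a, e)."""
    e = new_node(graph, "Entity")
    graph["edges"].append(("used", a, e))
    return e

def used_link(graph, a, e):
    """Given a pair of unrelated nodes (a, e), add the edge used(a, e)."""
    graph["edges"].append(("used", a, e))
```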


8 of 19

The Model - control flow

  • Users control the execution of create operations in 3 ways:

Seed graphs: restrict create operations to those relationships found in the seed graph. Specified as W3C PROV-N.

Constraint rules: control local graph topology, such as degree and required relationships. Specified using a custom DSL.

Execution parameters: control global properties, such as the size, shape and connectedness of the graph.

Example seed graph (PROV-N; lines truncated on the slide):

document
default <http://anotherexample.org/>
prefix ex <http://example.org/>
entity(e2, [ prov:type="File", ex:pa…
ex:content="There was a…
activity(a1, 2011-11-16T16:05:00, -, …
wasGeneratedBy(e2, a1, -, [ex:fct="s…
wasAssociatedWith(a1, ag2, -, [prov:…
agent(ag2, [ prov:type="prov:Person"…

Example constraint rules:

An Entity has relationship "Used" at most 1 times;
An Entity has relationship "WasDerivedFrom" exactly 1 times unless e1 has property {ex_version:"original"};
An Entity has relationship "WasGeneratedBy" exactly 1 times;
An Entity has out degree at most 2;
An Activity has relationship "WasGeneratedBy" exactly 1 times;
An Agent has relationship "WasAssociatedWith" between 1, 120 times with distribution exponential(2.4);


9 of 19

Constraint DSL

  • Constraint rules are specified in a natural English syntax.

  • They consist of 3 structural components: a Determiner, an Imperative and an optional Condition.

Determiner: determines the nodes to which a constraint applies. May be variable or invariable.

Condition: specifies the applicability of an imperative to a determined node. May be selective or greedy.

Imperative: specifies requirements upon determined nodes. Requirements may be qualified.

  • Constraints are in addition to those implied by valid PROV.

  • By constraining local topology and properties, they control the semantics of traces.

“An Entity must have in degree at most 2;”

“An Agent must have relationship “WasAssociatedWith”, unless it has relationship “ActedOnBehalfOf” from an Agent;”

Decomposed into components:

Determiner: “An Entity” / “An Agent”

Imperative: “must have in degree at most 2” / “must have relationship “WasAssociatedWith””

Condition: (none) / “unless it has relationship “ActedOnBehalfOf” from an Agent”
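One way such a rule could be split into its three components. This is a toy regex over the two example rules, far simpler than the actual DSL grammar:

```python
import re

# Hypothetical decomposition into the three named components; the
# real ProvGen DSL grammar is richer than this single pattern.
RULE = re.compile(
    r"(?P<determiner>An?\s+\w+)\s+"
    r"(?P<imperative>must have .+?)"
    r"(?:,?\s+(?P<condition>unless .+?))?;"
)

def parse_rule(text):
    """Return the components present in a rule as a dict."""
    m = RULE.match(text)
    return {k: v for k, v in m.groupdict().items() if v}
```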


10 of 19

The System

  • System built upon Neo4j DBMS [ neo4j.org ]

  • Built for flexibility: future versions won’t rely upon Neo4j.
  • Create operations map directly to Cypher:

  • Overview of system architecture:

    1. Seed is parsed, producing a set of edges.
    2. PROV operations are filtered by the edge set.
    3. User constraint rules are parsed.
    4. Constraints are provided to operations where applicable.
    5. The generator loop iteratively evaluates and executes valid operations.

MATCH (a:Activity)
CREATE (a)-[:USED]->(:Entity)

MATCH (e:Entity)
CREATE (e)<-[:USED]-(:Activity)
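The five architecture steps might be sketched as follows, with simplified stand-ins for ProvGen's operation and rule objects (names and structures here are illustrative assumptions):

```python
import random

class CreateOp:
    """Simplified create operation: adds one node and one edge of
    type `rel`, attached to a randomly chosen existing node."""
    def __init__(self, rel, new_kind):
        self.rel, self.new_kind = rel, new_kind
    def apply(self, graph):
        target = random.randrange(len(graph["nodes"]))
        graph["nodes"].append(self.new_kind)
        graph["edges"].append((self.rel, len(graph["nodes"]) - 1, target))

def generate(seed_edges, rules, operations, max_nodes):
    # 1. The seed is parsed into a set of edges (passed in directly here).
    allowed = {rel for rel, _, _ in seed_edges}
    # 2. PROV operations are filtered by the seed's relationship types.
    ops = [op for op in operations if op.rel in allowed]
    graph = {"nodes": ["Entity"], "edges": []}
    # 3-4. Parsed constraint rules (plain predicates here) gate each op.
    # 5. The generator loop evaluates and executes valid operations.
    while len(graph["nodes"]) < max_nodes:
        op = random.choice(ops)
        if all(rule(op, graph) for rule in rules):
            op.apply(graph)
    return graph
```

With a seed containing only used edges, the wasGeneratedBy operation is filtered out before the loop ever runs, so the output contains only used relationships.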


11 of 19

Use case 1 - The provenance of Wikipedia

  • Consider the task of generating provenance similar to that of Wikipedia. [ wikipedia.org ]
  • Conforms to “document revision” user interaction model*.

entity(e1, [ prov:type="Document" ])
entity(e2, [ prov:type="Document" ])
activity(a1, [ prov:type="create" ])
activity(a2, [ prov:type="edit" ])
agent(ag, [ prov:type="prov:Person" ])
used(a2, e1)
wasGeneratedBy(e2, a2, -, [ ex:fct="save" ])
wasGeneratedBy(e1, a1, -, [ ex:fct="publish" ])
wasAssociatedWith(a2, ag, -, [ prov:role="contributor" ])
wasAssociatedWith(a1, ag, -, [ prov:role="creator" ])
wasDerivedFrom(e2, e1)

* Along with other rich sources of provenance, such as version control systems. [ git2prov ]

  • Design a seed PROV-N trace to reflect this.
  • A user creates a document, which is subsequently edited by another user.


12 of 19

Use case 1 (cont.)

  • Reason about the semantic rules of the Wikipedia model.

  • Express these rules using ProvGen’s constraint DSL, e.g:

All functions use an existing document, except creation

An Activity must have relationship “Used” exactly 1 times, unless it has property(“prov:type”=“create”);

A user is more likely to contribute to the same document multiple times

An Agent must have relationship “WasAssociatedWith” with probability 0.1;

The Agent, ag1, must have relationship “WasAssociatedWith” with the Activity a1, with probability 0.3, when ag1 has pattern(ag1-[*..10]-a1);

On average a user contributes to x documents with distribution y

An Agent must have relationship “WasAssociatedWith” between 1, 1000 times, with distribution gamma(…, …);


13 of 19

Use case 1 (cont.)

  • Provide both seed trace and constraint rules as input to ProvGen and ...

… Some time later …


14 of 19

Evaluation method - control

  • Encompassing goal of ProvGen is to generate provenance graphs for domains where they are lacking.
  • Effectiveness of ProvGen is dependent upon the degree of similarity between synthetic and organic data.

...what organic data?

  • Counterintuitively our evaluation requires generating data for domains where it already exists in abundance.
  • Control dataset: Wikipedia-PROV [ provbench ], ~4000 vertices & ~6000 edges


15 of 19

Evaluation method - similarity criteria

  • The ProvGen model is designed to reason about topology and semantics.
  • But … we want to be able to demonstrate similarity empirically.
  • Demonstrating global statistical similarity (e.g. degree distribution) is of little value.
  • For the purposes of this preliminary evaluation we compromise.

  • Take the global average incidence of three local topological features:

    1. Usages per Entity
    2. Associations per Agent
    3. Distinct families of Entities contributed to per Agent
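Each criterion reduces to the same computation: the average number of edges of one relationship type incident on nodes of one kind. A minimal sketch, with a hypothetical edge representation:

```python
def avg_incidence(nodes, edges, rel, kind):
    """Global average number of `rel` edges incident on nodes of
    `kind`, e.g. used relationships per Entity."""
    counts = {n: 0 for n, k in nodes.items() if k == kind}
    for r, src, dst in edges:
        if r != rel:
            continue
        for n in (src, dst):
            if n in counts:
                counts[n] += 1
    return sum(counts.values()) / len(counts)
```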


16 of 19

Evaluation

Evaluation Criteria                         | Control avg | stdDev | Test avg | stdDev
used relationships per Entity               | 1.0         | 0.0    | 1.0      | 0.0
wasAssociatedWith relationships per Agent   | 2.4         | 6.2    | 2.9      | 0.1
Distinct Entity families per Agent          | 1.1         | 0.8    | 1.8      | 0.9

  • Less than 1 standard deviation from the discovered mean for all criteria.
  • The criteria are imperfect for measuring similarity of graph semantics, as they only capture a single hop.
  • However, encouraged by promising results, now investigating a more thorough evaluation method on more datasets.

May have noticed:

  • Associations per Agent is not normally distributed.
  • Part of the power of the system is the ability to express complex distributions in constraint rules.
  • No distribution fitting was performed here, but a Gamma distribution works.


17 of 19

Further work

Evaluation

  • Investigate a method for similarity comparison which better accounts for semantics, such as frequent subgraph matching.
  • Assess ProvGen’s ability to generate data from more domains.

Architecture

  • Remove the incidental dependency on the transactional model.
  • Investigate parallelizability.

Model

  • Add parameter syntax to seed traces:

entity(e1, [prov:type:"prov:Document", version:"1"])
entity(e2, [prov:type:"prov:Document", version:"${e1.version+1}"])
wasDerivedFrom(e2, e1)
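The proposed ${e1.version+1} syntax could be expanded against already-generated nodes roughly like this. A hypothetical sketch; the actual syntax is still future work:

```python
import re

def expand(template, bindings):
    """Expand ${node.attr+k} placeholders using attribute values of
    previously generated nodes, e.g. ${e1.version+1} with e1.version=1
    becomes 2."""
    def repl(m):
        node, attr, inc = m.group(1), m.group(2), m.group(3)
        value = int(bindings[node][attr]) + (int(inc) if inc else 0)
        return str(value)
    return re.sub(r"\$\{(\w+)\.(\w+)(?:\+(\d+))?\}", repl, template)
```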


18 of 19

Summary

PROV community nascent

  • Starting to develop key infrastructure & algorithms, but...
  • ...there is a lack of suitable test data available to provenance researchers.

Propose ProvGen

  • A tractable tool for generating PROV graphs.
  • Reasons over topology and semantics with seeds + constraints.

Tool very prototypical

  • Still adding features. Suggestions?

Potential for stronger evaluation

  • More datasets, & frequent subgraph mining for similarity.


19 of 19

Fin

Thank you

Any questions?
