ProvGen: Generating synthetic PROV graphs with predictable structure
Hugo Firth
Newcastle University, UK
Paolo Missier
Newcastle University, UK
IPAW’14
Cologne, Germany, DLR
Tuesday 15th March 2016
Introduction
Justification
Linked Data Benchmark Council:
“aims to establish industry cooperation between vendors of RDF and Graph database technologies in developing, endorsing, and publishing reliable and insightful benchmark results” [ ldbc.eu ]
Facebook:
“We're proud to release LinkBench this week [...] and hope it will be a useful tool for other developers who need to benchmark and tune database systems” [ facebook.com ]
Justification (cont.)
Approach
Preferential attachment:
Continuously add vertices to a graph.
Probability of creating an edge with a vertex v is proportional to the degree of v.
Graph exhibits a power law.
Common in social networking data.
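The growth process above can be sketched in a few lines of Python. This is a minimal illustration of generic preferential attachment, not ProvGen's own generator; the function name, the single-edge starting graph, and the one-edge-per-new-vertex choice are assumptions made for the example.

```python
import random
from collections import Counter

def preferential_attachment(n_vertices, seed=0):
    """Grow an undirected graph one vertex at a time.

    Each new vertex links to one existing vertex, chosen with probability
    proportional to that vertex's current degree.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]  # assumed starting graph: one edge, so no degree is zero
    # Every vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = [0, 1]
    for new_vertex in range(2, n_vertices):
        target = rng.choice(endpoints)
        edges.append((new_vertex, target))
        endpoints.extend((new_vertex, target))
    return edges

edges = preferential_attachment(1000)
degree = Counter(v for edge in edges for v in edge)
```

Sorting `degree.values()` typically shows a few high-degree hubs and a long tail of degree-1 vertices, which is the heavy-tailed shape the slide refers to.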
Approach (cont.)
The Model - elementary create operations
The Model - control flow
Seed graphs
Restricts create operations to those relationships found in the graph.
Specified as W3C PROV-N.
Constraint rules
Control local graph topology such as degree, and required relationships.
Specified using custom DSL.
Execution parameters
Control global properties such as size, shape and connectedness of graph.
document
default <http://anotherexample.org/>
prefix ex <http://example.org/>
entity(e2, [ prov:type=“File”, ex:pa
ex:content=“There was a
activity(a1, 2011-11-16T16:05:00, -,
wasGeneratedBy(e2, a1, -, [ex:fct=“s
wasAssociatedWith(a1, ag2, -, [prov:
agent(ag2, [ prov:type=“prov:Person”
An Entity has relationship “Used” at most 1 times;
An Entity has relationship “WasDerivedFrom” exactly 1 times unless e1 has property {ex_version:“original”};
An Entity has relationship “WasGeneratedBy” exactly 1 times;
An Entity has out degree at most 2;
An Activity has relationship “WasGeneratedBy” exactly 1 times;
An Agent has relationship “WasAssociatedWith” between 1, 120 times with distribution exponential(2.4);
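To make the effect of such rules concrete, here is a minimal sketch that checks two of them against a tiny in-memory graph. It is not ProvGen's implementation: the `node_types` and `edges` structures and both helper functions are hypothetical, and the sketch assumes that "has relationship" counts edges at either end of a node.

```python
from collections import Counter

# Hypothetical graph: node id -> PROV type, plus (source, relationship, target) edges.
node_types = {"e1": "Entity", "e2": "Entity", "a1": "Activity", "ag2": "Agent"}
edges = [
    ("e2", "WasGeneratedBy", "a1"),
    ("e2", "WasDerivedFrom", "e1"),
    ("a1", "Used", "e1"),
    ("a1", "WasAssociatedWith", "ag2"),
]

def relationship_counts(relationship):
    """Count edges of one relationship type incident to each node (either end)."""
    counts = Counter()
    for source, rel, target in edges:
        if rel == relationship:
            counts[source] += 1
            counts[target] += 1
    return counts

def violations(node_type, relationship, low, high):
    """Return nodes of node_type whose edge count falls outside [low, high]."""
    counts = relationship_counts(relationship)
    return [node for node, kind in node_types.items()
            if kind == node_type and not low <= counts[node] <= high]

# An Entity has relationship "Used" at most 1 times;
used_violations = violations("Entity", "Used", 0, 1)
# An Entity has relationship "WasGeneratedBy" exactly 1 times;
generated_violations = violations("Entity", "WasGeneratedBy", 1, 1)
```

In this toy graph `e1` has no generating activity, so it is the only node reported by the second check.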
Constraint DSL
Determiner
Determines the nodes to which a constraint applies. May be variable or invariable.
Condition
Specifies the applicability of an imperative to a determined node. May be selective or greedy.
Imperative
Specifies requirements upon determined nodes. Requirements may be qualified.
“An Entity must have in degree at most 2;”
“An Agent must have relationship “WasAssociatedWith”, unless it has relationship “ActedOnBehalfOf” from an Agent;”
Determiners: An Entity; An Agent
Imperatives: must have in degree at most 2; must have relationship “WasAssociatedWith”
Condition: unless it has relationship “ActedOnBehalfOf” from an Agent
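The three-part structure can be illustrated with a small regular-expression sketch that splits a rule into determiner, imperative, and optional condition. The pattern and the `split_rule` helper are hypothetical and cover only the simple "must have" forms quoted above (written here with straight quotes); the actual DSL grammar is richer.

```python
import re

# Assumed shape: <determiner> must have <requirement>[, unless|when <condition>];
RULE = re.compile(
    r"^(?P<determiner>(?:An?|The)\s+\w+)\s+"
    r"(?P<imperative>must have .+?)"
    r"(?:,?\s+(?P<condition>(?:unless|when)\s+.+?))?;$"
)

def split_rule(rule):
    """Split one constraint rule into its determiner, imperative and condition."""
    match = RULE.match(rule.strip())
    if match is None:
        raise ValueError(f"not a recognised rule: {rule!r}")
    return match.groupdict()

simple = split_rule("An Entity must have in degree at most 2;")
conditional = split_rule(
    'An Agent must have relationship "WasAssociatedWith", '
    'unless it has relationship "ActedOnBehalfOf" from an Agent;'
)
```

For a rule with no condition, the `condition` entry is `None`.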
The System
MATCH
(a:Activity)
CREATE
(a)-[:USED]->(:Entity)
MATCH
(a:Entity)
CREATE
(a)<-[:USED]-(:Activity)
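The two queries above extend the graph along the same relationship, once from each end. A minimal sketch of how such a pair might be rendered from a single seed relationship follows; the `create_operations` helper is hypothetical and only builds query strings, it does not talk to a database.

```python
def create_operations(source_label, relationship, target_label):
    """Return the two Cypher statements that add one edge of the given type:
    one matching an existing source node, one matching an existing target node."""
    from_source = (
        f"MATCH (a:{source_label}) "
        f"CREATE (a)-[:{relationship}]->(:{target_label})"
    )
    from_target = (
        f"MATCH (a:{target_label}) "
        f"CREATE (a)<-[:{relationship}]-(:{source_label})"
    )
    return from_source, from_target

from_activity, from_entity = create_operations("Activity", "USED", "Entity")
```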
Use case 1 - The provenance of Wikipedia
entity(e1, [ prov:type="Document" ])
entity(e2, [ prov:type="Document" ])
activity(a1, [ prov:type="create" ])
activity(a2, [ prov:type="edit" ])
agent(ag, [ prov:type="prov:Person" ])
used(a2, e1)
wasGeneratedBy(e2, a2, -, [ ex:fct="save" ])
wasGeneratedBy(e1, a1, -, [ ex:fct="publish" ])
wasAssociatedWith(a2, ag, -, [ prov:role="contributor" ])
wasAssociatedWith(a1, ag, -, [ prov:role="creator" ])
wasDerivedFrom(e2, e1)
* Along with other rich sources of provenance, such as version control systems. [ git2prov ]
Use case 1 (cont.)
All functions use an existing document, except creation
An Activity must have relationship “Used” exactly 1 times, unless it has property(“prov:type”=“create”);
A user is more likely to contribute to the same document multiple times
An Agent must have relationship “WasAssociatedWith” with probability 0.1;
The Agent, ag1, must have relationship “WasAssociatedWith” with the Activity a1, with probability 0.3, when ag1 has pattern(ag1-[*..10]-a1);
On average a user contributes to x documents with distribution y
An Agent must have relationship “WasAssociatedWith” between 1, 1000 times, with distribution gamma(…, …);
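As a rough illustration of the last rule, the sketch below draws a per-agent association count from a gamma distribution and keeps it within the stated bounds. The slide elides the gamma parameters, so the shape and scale used here are placeholders chosen only for the example; rejection sampling and rounding to the nearest integer are likewise assumptions, not necessarily how ProvGen bounds the draw.

```python
import random

def sample_association_count(rng, shape, scale, low=1, high=1000):
    """Draw an integer count from gamma(shape, scale), rejecting draws outside [low, high]."""
    while True:
        count = round(rng.gammavariate(shape, scale))
        if low <= count <= high:
            return count

rng = random.Random(42)
# shape=0.5 and scale=6.0 are placeholder values for illustration only.
counts = [sample_association_count(rng, shape=0.5, scale=6.0) for _ in range(1000)]
```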
Use case 1 (cont.)
… Some time later …
Evaluation method - control
...what organic data?
Evaluation method - similarity criteria
Evaluation
Evaluation Criteria | Control avg | Control stdDev | Test avg | Test stdDev |
used relationships per Entity | 1.0 | 0.0 | 1.0 | 0.0 |
wasAssociatedWith relationships per Agent | 2.4 | 6.2 | 2.9 | 0.1 |
Distinct Entity families per Agent | 1.1 | 0.8 | 1.8 | 0.9 |
You may have noticed:
Associations per Agent is not normally distributed.
Part of the power of the system is to express complex distributions in constraint rules.
No fitting, but Gamma works
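For reference, a per-agent average and standard deviation of the kind reported in the table can be computed as in the sketch below. The edge list and agent ids are made up for illustration, and the use of the population standard deviation (`pstdev`) is an assumption; the slides do not say which estimator was used.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical wasAssociatedWith edges as (activity, agent) pairs.
associations = [("a1", "ag1"), ("a2", "ag1"), ("a3", "ag1"), ("a4", "ag2")]
agents = ["ag1", "ag2", "ag3"]

per_agent = Counter(agent for _, agent in associations)
counts = [per_agent[agent] for agent in agents]  # agents with no edges count as 0
avg = mean(counts)
std_dev = pstdev(counts)
```

Running the same computation on the control graph and on the generated graph gives one pair of columns each.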
Further work
Evaluation
Architecture
Model
entity(e1, [prov:type=“prov:Document”, version=“1”])
entity(e2, [prov:type=“prov:Document”, version=“${e1.version+1}”])
wasDerivedFrom(e2, e1)
Summary
PROV community nascent
Propose ProvGen
Tool very prototypical
Potential for stronger evaluation
Fin
Thank you
Any questions?