1 of 32

Graph Databases in Java

What, Why, How

Greater Milwaukee Java Meetup

7/28/2016 @ Digital Measures

2 of 32

Andrew Glassman

Back End Architect - Digital Measures

@a_glassman

http://aglassman.github.io

Nick Halase

Software Engineer - Digital Measures

@nickhalase

3 of 32

What is a Graph Database?

A graph database is a type of data storage that uses vertices and edges to directly relate data.

4 of 32

Graph Entities

5 of 32

Vertices

Singular: Vertex
Represents a single entity, like a row in a database.
A Vertex can have many properties of different data types. Depending on the implementation, these properties are either defined by a schema or left schema-less.
Vertices are referred to as Nodes in some libraries.

6 of 32

Edges

Edges represent the relationships between vertices.
Edges typically have a main “type”.
An Edge can have many properties of different data types. Depending on the implementation, these properties are either defined by a schema or left schema-less.
Edges are directional. A bidirectional relationship between two nodes would need to be represented using two edges.
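
To make the vertex and edge definitions concrete, here is a minimal in-memory sketch in plain Java. The class and method names (PropertyGraphSketch, Vertex, Edge, property, addEdge) are hypothetical and chosen for illustration only; they are not the API of any graph database covered in this talk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, minimal property-graph model; not the API of any real graph database.
class PropertyGraphSketch {

    // A vertex is a single entity with a label and arbitrary key/value properties.
    static class Vertex {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        final List<Edge> outEdges = new ArrayList<>();

        Vertex(String label) {
            this.label = label;
        }

        Vertex property(String key, Object value) {
            properties.put(key, value);
            return this;
        }

        // Creates a directed edge from this vertex to the target vertex.
        Edge addEdge(String type, Vertex target) {
            Edge edge = new Edge(type, this, target);
            outEdges.add(edge);
            return edge;
        }
    }

    // An edge is directed (out -> in), has a main type, and carries its own properties.
    static class Edge {
        final String type;
        final Vertex out;
        final Vertex in;
        final Map<String, Object> properties = new HashMap<>();

        Edge(String type, Vertex out, Vertex in) {
            this.type = type;
            this.out = out;
            this.in = in;
        }

        Edge property(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    public static void main(String[] args) {
        Vertex alice = new Vertex("Person").property("name", "Alice").property("age", 34);
        Vertex bob = new Vertex("Person").property("name", "Bob");

        // A mutual "knows" relationship takes two directed edges, one per direction.
        alice.addEdge("knows", bob).property("since", 2010);
        bob.addEdge("knows", alice).property("since", 2010);

        for (Edge e : alice.outEdges) {
            System.out.println(e.out.properties.get("name") + " -" + e.type + "-> "
                    + e.in.properties.get("name"));
        }
    }
}
```

Note that the two Person vertices carry different property sets, which is the schema-less case; a schema-backed implementation would validate the keys and types instead.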

7 of 32

Example Domains

8 of 32

Example Domain - Taxonomy

Vertices

Classification

Name, description

Characteristic

Name, description

Species

Name, description

Edges

belongs_to
has

9 of 32

Example Domain - Social Graph

Vertices

Person

Name
Age
Height

Edges

knows

Since

reports_to

10 of 32

Example Domain - Logistics

Vertices

Hub

Name
Geo-location
Processing Time

Truck

Name
Hauling Capacity
Fuel Capacity
Efficiency

Edges

route

Name
Travel Time

assigned_to

11 of 32

Example Domain - Logistics

[Diagram: Hub and Truck vertices connected by route and assigned_to edges]

12 of 32

Why use a Graph?

13 of 32

Why use a graph model?

Highly interconnected data
Complex queries
Rapidly changing domain
The relationships between data elements are as important as the data elements themselves

Source

14 of 32

Highly Interconnected Data

Highly interconnected data gets very difficult to model, manage, and query in an RDBMS.
Typical Example

Movies - relational:

Actor, Director, Producer, Movie, Show, Episode, Voice Actor, Writer, Studio
Maintaining all these tables gets tedious. You need many PK/FK relations and join tables. Just imagine the SQL queries this would generate! (So many joins!)

15 of 32

Highly Interconnected Data

Movies - Graph

Vertices

Person, Studio
Content

type

Relations

directed (Tarantino directed Pulp Fiction(movie))
wrote (Tarantino wrote Pulp Fiction(movie))
financed (Miramax financed Pulp Fiction(movie))
contains (Pulp Fiction(movie) contains Misirlou(song))

16 of 32

Complex Queries

In some domains, graph query languages can express the same query much more concisely than SQL.
An RDBMS is based on relational algebra. Graph query languages are usually based on conjunctive regular path queries (CRPQs). This makes each suited to different problems.
Graphs are great at traversing relationships. Imagine doing a “degrees of separation” query in SQL.

Great for join-intensive queries. Just find the first node, or nodes, and traverse the graph.
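
As a rough illustration of why “degrees of separation” is natural for a graph, the sketch below finds the starting vertex and then walks knows edges breadth-first. It is plain Java over an in-memory adjacency map, not a query in any graph query language; the class name and the sample names are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory social graph; a real graph DB would do this traversal for you.
class DegreesOfSeparation {

    // Adjacency list: person name -> names of the people they know (directed edges).
    private final Map<String, List<String>> knows = new HashMap<>();

    void addKnows(String from, String to) {
        knows.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        knows.computeIfAbsent(to, k -> new ArrayList<>()); // make sure the target vertex exists
    }

    // Breadth-first traversal: returns the number of hops from start to target,
    // or -1 if the target cannot be reached or either person is unknown.
    int degrees(String start, String target) {
        if (!knows.containsKey(start) || !knows.containsKey(target)) {
            return -1;
        }
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                return distance.get(current);
            }
            for (String next : knows.get(current)) {
                if (!distance.containsKey(next)) { // visit each vertex once
                    distance.put(next, distance.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        DegreesOfSeparation graph = new DegreesOfSeparation();
        graph.addKnows("Ann", "Bob");
        graph.addKnows("Bob", "Cat");
        graph.addKnows("Cat", "Dan");
        System.out.println(graph.degrees("Ann", "Dan")); // prints 3
    }
}
```

In SQL, the same question needs either one self-join per hop or a recursive query, because the number of hops is not known in advance.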

17 of 32

Complex Queries

Some standards have been specified, such as RDF. Many graph implementations (Blazegraph, Stardog) are built on the RDF, OWL2, and SPARQL specs.
OWL2 is a web ontology language. Ontologies are important when working with large graphs, especially when merging data sets.
SPARQL is the query language used to query RDF stores.
RDF stores are also known as “triple stores”, because every statement is stored as a subject-predicate-object triple. SPARQL was designed to be able to query distributed resources that implement ontologies.
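
To show the idea behind a triple store, here is a toy sketch in plain Java. Each statement is a subject-predicate-object triple, and a lookup is a pattern in which null acts as a wildcard, loosely like a variable in a SPARQL triple pattern. This is an assumption-laden simplification: real RDF uses IRIs and typed literals rather than bare strings, and the class and method names are hypothetical, not Blazegraph or Stardog APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy triple store for illustration only; not a real RDF implementation.
class TripleStoreSketch {

    // One statement: subject, predicate, object.
    static class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Returns every triple matching the pattern; a null argument matches anything.
    List<Triple> match(String subject, String predicate, String object) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : triples) {
            boolean subjectOk = subject == null || subject.equals(t.subject);
            boolean predicateOk = predicate == null || predicate.equals(t.predicate);
            boolean objectOk = object == null || object.equals(t.object);
            if (subjectOk && predicateOk && objectOk) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TripleStoreSketch store = new TripleStoreSketch();
        store.add("Tarantino", "directed", "PulpFiction");
        store.add("Tarantino", "wrote", "PulpFiction");
        store.add("Miramax", "financed", "PulpFiction");

        // "Who or what is related to PulpFiction, and how?"
        for (Triple t : store.match(null, null, "PulpFiction")) {
            System.out.println(t);
        }
    }
}
```

A linear scan is enough for a demo; production triple stores index the triples so that each pattern shape can be answered without scanning everything.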

18 of 32

Rapidly Changing Domain / Model

Almost all graph DBs are schemaless. You can add arbitrary properties, even to vertices or edges of the same type.
You can add new node types or edge types at will, and interconnect them without any side effects.

19 of 32

When NOT to use a Graph?

20 of 32

Inappropriate Use-Cases

Collections of key-value pairings
Storing aggregate data

Finding the amount spent across a set of categories in a particular budget
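
The budget example involves no relationships to traverse: it is a flat group-and-sum. A small sketch in plain Java makes the point; the Expense class, the category names, and the amounts (in cents) are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical budget data; the point is that no graph traversal is involved.
class BudgetAggregation {

    static class Expense {
        final String category;
        final long amountCents;

        Expense(String category, long amountCents) {
            this.category = category;
            this.amountCents = amountCents;
        }
    }

    // Groups expenses by category and sums the amounts, like SQL's GROUP BY with SUM.
    static Map<String, Long> totalByCategory(List<Expense> expenses) {
        return expenses.stream().collect(Collectors.groupingBy(
                e -> e.category,
                TreeMap::new, // sorted keys, so the output order is stable
                Collectors.summingLong(e -> e.amountCents)));
    }

    public static void main(String[] args) {
        List<Expense> expenses = Arrays.asList(
                new Expense("food", 2000),
                new Expense("rent", 120000),
                new Expense("food", 1500));
        System.out.println(totalByCategory(expenses)); // prints {food=3500, rent=120000}
    }
}
```

An RDBMS answers this kind of question with a single GROUP BY, which is why aggregate-heavy workloads are listed here as a poor fit for a graph.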

21 of 32

How to use a Graph DB

22 of 32

Graph is on the rise.

23 of 32

But it’s still dwarfed by RDBMS

24 of 32

Neo4j

Pros

Free Community Edition + Enterprise Edition
Low barrier to entry. Easy Install. Great tutorials.
Provides Java APIs
Can run embedded or standalone.
Provides REST API for all operations.
Proprietary Cypher query language.
Provides full and incremental backups.

Cons

TinkerPop2 only; the TinkerPop3 implementation has few contributors.

25 of 32

Stardog

Pros

Based on RDF; supports OWL2 and SPARQL
Provides TinkerPop3 support
Full text, semantically typed search
Geospatial queries
High availability mode using Zookeeper.
Free community license, flexible developer license

Cons

No incremental backups
Expensive per “node”, but flexible on price

26 of 32

Blazegraph

Pros

Based on RDF, supports SPARQL (REST client)
Supports TinkerPop3
Supports running on GPUs for fast queries.
High Availability configurations supported
Embedded option as well.

Cons

Needs a commercial license to use. Gray area.
Embedded license is GPLv2
TinkerPop implementation only supports embedded mode.

27 of 32

DataStax Graph

Pros

Supports TinkerPop3
Built on Cassandra, so you get a robust, scalable platform out of the box.
Supports incremental backups and snapshots.

Cons

Not built on RDF (not really a con)
No community edition.

Other

DataStax Blog - Benefits of Gremlin

28 of 32

OrientDB

Pros

Community Edition
Multi-Model (graph + document)
Supports TinkerPop3
High Availability configurations.

Cons

Seems to have remote transaction issues.
Unstable community that seems driven by “most features” over quality features.

29 of 32

Apache TinkerPop

An open graph computing framework
An abstraction layer over graph databases
Helps prevent vendor lock-in
Maturity

TinkerPop3 has gained the sponsorship of the Apache Software Foundation

30 of 32

Apache TinkerPop

In development for 7 years
TinkerPop2 final release was September 2014
Since TinkerPop2, TinkerPop3 has been maintained under the Apache Software Foundation

As of 3.2.1 the project is no longer “incubating”
Released July 18th, 2016 (last week!)

31 of 32

Why TinkerPop?

source: tinkerpop.apache.org/docs/current/tutorials/getting-started/#_why_tinkerpop

32 of 32

Why TinkerPop?

Industry adoption and rockstar developers

Blazegraph
DSEGraph
Hadoop (Giraph)
Hadoop (Spark)
IBM Graph
Neo4j
Sqlg
Stardog
TinkerGraph
Titan
Titan (Amazon)
Titan (Tupl)
Unipop

source: tinkerpop.apache.org