1 of 32

Graph Databases in Java

What, Why, How

Greater Milwaukee Java Meetup

7/28/2016 @ Digital Measures

2 of 32

Andrew Glassman

Back End Architect - Digital Measures

@a_glassman

http://aglassman.github.io

Nick Halase

Software Engineer - Digital Measures

@nickhalase

3 of 32

What is a Graph Database?

A graph database is a type of data storage that uses vertices and edges to directly relate data.

4 of 32

Graph Entities

5 of 32

Vertices

  • Singular: Vertex
  • Represents a single entity, like a row in a database.
  • A vertex can have many properties of different data types. Depending on the implementation, these properties are either defined by a schema or left schema-less.
  • Vertices are referred to as Nodes in some libraries.

6 of 32

Edges

  • Edges represent the relationships between vertices.
  • Edges typically have a main “type”.
  • An edge can have many properties of different data types. Depending on the implementation, these properties are either defined by a schema or left schema-less.
  • Edges are directional. A bidirectional relationship between two nodes would need to be represented using two edges.
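
To make the vertex and edge definitions concrete, here is a minimal in-memory sketch in plain Java. The class names (Vertex, Edge, PropertyGraphSketch) and the sample people are hypothetical and chosen only for illustration; this is not the API of any particular graph database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A vertex: a single entity with a label and schema-less properties. */
class Vertex {
    final String label;
    final Map<String, Object> properties = new HashMap<>();
    final List<Edge> outgoing = new ArrayList<>();

    Vertex(String label) {
        this.label = label;
    }

    Vertex property(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    /** Creates a directed edge from this vertex to the target vertex. */
    Edge addEdge(String type, Vertex target) {
        Edge edge = new Edge(type, this, target);
        outgoing.add(edge);
        return edge;
    }
}

/** A directed edge with a main type and its own properties. */
class Edge {
    final String type;
    final Vertex from;
    final Vertex to;
    final Map<String, Object> properties = new HashMap<>();

    Edge(String type, Vertex from, Vertex to) {
        this.type = type;
        this.from = from;
        this.to = to;
    }

    Edge property(String key, Object value) {
        properties.put(key, value);
        return this;
    }
}

class PropertyGraphSketch {
    public static void main(String[] args) {
        Vertex alice = new Vertex("Person").property("name", "Alice").property("age", 34);
        Vertex bob = new Vertex("Person").property("name", "Bob");

        // A bidirectional relationship needs one edge in each direction.
        alice.addEdge("knows", bob).property("since", 2010);
        bob.addEdge("knows", alice).property("since", 2010);

        for (Edge e : alice.outgoing) {
            System.out.println(alice.properties.get("name") + " -" + e.type + "-> "
                    + e.to.properties.get("name"));
        }
    }
}
```

Note that the two Person vertices carry different property sets, which is the schema-less case described above.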

7 of 32

Example Domains

8 of 32

Example Domain - Taxonomy

  • Vertices
    • Classification
      • Name, description
    • Characteristic
      • Name, description
    • Species
      • Name, description
  • Edges
    • belongs_to
    • has

9 of 32

Example Domain - Social Graph

  • Vertices
    • Person
      • Name
      • Age
      • Height
  • Edges
    • Knows
      • Since
    • reports_to

10 of 32

Example Domain - Logistics

  • Vertices
    • Hub
      • Name
      • Geo-location
      • Processing Time
    • Truck
      • Name
      • Hauling Capacity
      • Fuel Capacity
      • Efficiency

  • Edges
    • Route
      • Name
      • Travel Time
    • Assigned_To

11 of 32

Example Domain - Logistics

[Diagram: Hub and Truck vertices connected by route and assigned_to edges]

12 of 32

Why use a Graph?

13 of 32

Why use a graph model?

  • Highly interconnected data
  • Complex queries
  • Rapidly changing domain
  • The relationships between data elements are as important as the data elements themselves

14 of 32

Highly Interconnected Data

  • Modeling data that is highly interconnected gets very difficult to manage and query in an RDBMS.
  • Typical Example
    • Movies - relational:
      • Actor, Director, Producer, Movie, Show, Episode, Voice Actor, Writer, Studio
      • Maintaining all these tables would get tedious. Now you need tons of PK/FK relations, and join tables. Just imagine the SQL queries this would generate! (so many joins!)

15 of 32

Highly Interconnected Data

  • Movies - Graph
    • Vertices
      • Person, Studio
      • Content
        • type
    • Relations
      • directed (Tarantino directed Pulp Fiction(movie))
      • wrote (Tarantino wrote Pulp Fiction(movie))
      • financed (Miramax financed Pulp Fiction(movie))
      • contains (Pulp Fiction(movie) contains Misirlou(song))

16 of 32

Complex Queries

  • In some domains, graph query languages can express the same query much more concisely than SQL.
  • RDBMSs are based on relational algebra. Graph query languages are usually based on conjunctive regular path queries (CRPQs). This makes each suited to different problems.
  • Graph is great at traversing relationships. Imagine doing a “degrees of separation” query in SQL.
    • MySQL
    • Neo4j
  • Great for Join intensive queries. Just find the first node, or nodes, and traverse the graph.
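
The "degrees of separation" idea can be sketched as a breadth-first traversal: find the start node, then walk outward one hop at a time. The example below is plain Java over an in-memory adjacency map, not a query against Neo4j or any other database; the SocialGraph class and the names in it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SocialGraph {
    // person -> people they know (directed "knows" edges)
    private final Map<String, Set<String>> knows = new HashMap<>();

    /** Adds a mutual relationship as two directed edges. */
    void addKnows(String a, String b) {
        knows.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        knows.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Returns the number of hops from start to target, or -1 if unreachable. */
    int degreesOfSeparation(String start, String target) {
        if (start.equals(target)) {
            return 0;
        }
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : knows.getOrDefault(current, Collections.<String>emptySet())) {
                if (distance.containsKey(next)) {
                    continue; // already visited
                }
                int hops = distance.get(current) + 1;
                if (next.equals(target)) {
                    return hops;
                }
                distance.put(next, hops);
                queue.add(next);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        SocialGraph graph = new SocialGraph();
        graph.addKnows("Ann", "Bob");
        graph.addKnows("Bob", "Cid");
        graph.addKnows("Cid", "Dee");
        System.out.println(graph.degreesOfSeparation("Ann", "Dee"));
    }
}
```

Each hop here is a map lookup on the current node's neighbors, whereas the relational equivalent typically needs one self-join (or a recursive query) per degree.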

17 of 32

Complex Queries

  • Some standards have been specified, such as RDF. Many graph implementations are built on top of RDF, OWL2, and SPARQL specs. (Blazegraph, Stardog)
  • OWL2 is a web ontology language. Ontologies are important when working with large graphs, especially when merging data sets.
  • SPARQL is the query language used to query RDF stores.
  • RDF stores are often called “triple stores” because they hold data as subject-predicate-object triples. SPARQL was designed to be able to query distributed resources that implement ontologies.
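
As a rough illustration of the triple idea, the sketch below stores subject-predicate-object triples in a list and matches them against a pattern in which null acts as a wildcard, loosely like a variable in a single SPARQL triple pattern. It is plain Java, not a real RDF library; the Triple and TripleStore classes are hypothetical, and real stores use IRIs and indexes rather than bare strings and a linear scan.

```java
import java.util.ArrayList;
import java.util.List;

/** One subject-predicate-object statement. */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public String toString() {
        return subject + " " + predicate + " " + object;
    }
}

class TripleStore {
    private final List<Triple> triples = new ArrayList<>();

    void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    /** Returns all triples matching the pattern; a null argument matches anything. */
    List<Triple> match(String subject, String predicate, String object) {
        List<Triple> result = new ArrayList<>();
        for (Triple t : triples) {
            boolean subjectOk = subject == null || subject.equals(t.subject);
            boolean predicateOk = predicate == null || predicate.equals(t.predicate);
            boolean objectOk = object == null || object.equals(t.object);
            if (subjectOk && predicateOk && objectOk) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TripleStore store = new TripleStore();
        store.add("Tarantino", "wrote", "Pulp Fiction");
        store.add("Miramax", "financed", "Pulp Fiction");
        store.add("Pulp Fiction", "contains", "Misirlou");

        // "Who or what is related to Pulp Fiction, and how?"
        for (Triple t : store.match(null, null, "Pulp Fiction")) {
            System.out.println(t);
        }
    }
}
```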

18 of 32

Rapidly Changing Domain / Model

  • Graph DBs are almost all schemaless. You can add arbitrary properties, even to vertices or edges of the same type.
  • You can add new node types, or edge types at will, and interconnect them without any side effects.

19 of 32

When NOT to use a Graph?

20 of 32

Inappropriate Use-Cases

  • Collections of key-value pairings
  • Storing aggregate data
    • Finding the amount spent across a set of categories in a particular budget

21 of 32

How to use a Graph DB

22 of 32

Graph is on the rise.

23 of 32

But it’s still dwarfed by RDBMS

24 of 32

  • Pros
    • Free Community Edition + Enterprise Edition
    • Low barrier to entry. Easy Install. Great tutorials.
    • Provides Java APIs
    • Can run Embedded, or Stand Alone
    • Provides REST API for all operations.
    • Proprietary Cypher query language.
    • Provides full, and incremental backups.
  • Cons
    • TinkerPop2 only; the TinkerPop3 implementation has few contributors.

25 of 32

  • Pros
    • Supports OWL2, Based on RDF, and supports SPARQL
    • Provides TinkerPop3 support
    • Full text, semantically typed search
    • Geospatial queries
    • High availability mode using Zookeeper.
    • Free community license, flexible developer license
  • Cons
    • No incremental backups
    • Expensive per “node”, but flexible on price

26 of 32

  • Pros
    • Based on RDF, supports SPARQL (REST client)
    • Supports TinkerPop3
    • Supports running on GPUs for fast queries.
    • High Availability configurations supported
    • Embedded option as well.
  • Cons
    • Need commercial license to use. Gray area.
    • Embedded license is GPLv2
    • TinkerPop implementation only supports embedded mode.

27 of 32

  • Pros
    • Supports TinkerPop3
    • Built on Cassandra, so you get a robust, scalable platform out of the box.
    • Supports incremental backups, and snapshots.
  • Cons
    • Not built on RDF (not really a con)
    • No community edition.
  • Other

28 of 32

  • Pros
    • Community Edition
    • Multi-Model (graph + document)
    • Supports TinkerPop3
    • High Availability configurations.
  • Cons

29 of 32

Apache TinkerPop

  • An open graph computing framework
  • An abstraction layer over graph databases
  • Helps prevent vendor lock-in
  • Maturity
    • TinkerPop3 has gained the sponsorship of the Apache Software Foundation
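
The lock-in point can be illustrated with a small sketch: application code depends only on a vendor-neutral interface, so the backend behind it can be swapped. Everything below is hypothetical plain Java written for this illustration (GraphStore, InMemoryGraphStore, ReportingService); it is NOT TinkerPop's actual API, only the general shape of an abstraction layer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical vendor-neutral interface; not TinkerPop's real API. */
interface GraphStore {
    void addEdge(String from, String label, String to);

    /** Returns the vertices reached from 'from' over outgoing edges with the given label. */
    List<String> out(String from, String label);
}

/** One possible backend: an in-memory adjacency map. */
class InMemoryGraphStore implements GraphStore {
    private final Map<String, Map<String, List<String>>> adjacency = new HashMap<>();

    @Override
    public void addEdge(String from, String label, String to) {
        adjacency.computeIfAbsent(from, k -> new HashMap<>())
                .computeIfAbsent(label, k -> new ArrayList<>())
                .add(to);
    }

    @Override
    public List<String> out(String from, String label) {
        return adjacency
                .getOrDefault(from, Collections.<String, List<String>>emptyMap())
                .getOrDefault(label, Collections.<String>emptyList());
    }
}

/** Application code sees only GraphStore, never a concrete vendor class. */
class ReportingService {
    private final GraphStore graph;

    ReportingService(GraphStore graph) {
        this.graph = graph;
    }

    List<String> managersOf(String person) {
        return graph.out(person, "reports_to");
    }

    public static void main(String[] args) {
        GraphStore store = new InMemoryGraphStore();
        store.addEdge("Ann", "reports_to", "Bob");
        System.out.println(new ReportingService(store).managersOf("Ann"));
    }
}
```

Replacing InMemoryGraphStore with a class backed by a different database would leave ReportingService unchanged, which is the benefit an abstraction layer aims for.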

30 of 32

Apache TinkerPop

  • In development for 7 years
  • TinkerPop2 final release was September 2014
  • Since TinkerPop2, TinkerPop3 has been maintained under the Apache Software Foundation
    • As of 3.2.1 the project is no longer “incubating”
    • Released July 18th, 2016 (last week!)

31 of 32

Why TinkerPop?

source: tinkerpop.apache.org/docs/current/tutorials/getting-started/#_why_tinkerpop

32 of 32

Why TinkerPop?

  • Industry adoption and rockstar developers
    • Blazegraph
    • DSEGraph
    • Hadoop (Giraph)
    • Hadoop (Spark)
    • IBM Graph
    • Neo4j
    • Sqlg
    • Stardog
    • TinkerGraph
    • Titan
    • Titan (Amazon)
    • Titan (Tupl)
    • Unipop

source: tinkerpop.apache.org