LATC Support Action - May 2011 - v0.2

NoSQL solutions for

Linked Data processing


Michael Hausenblas, DERI, NUI Galway

In this article I’m going to discuss a number of NoSQL systems with regard to their Linked Data processing capabilities. The goal is to give developers and system architects a quick overview and help them assess if and to what extent they can use a certain NoSQL system for storing, querying and exposing RDF data. I will not take dedicated RDF databases into account, as there are plenty of resources available comparing them. Also out of scope are solutions based on relational databases, as this is the topic of another of my activities, the RDB2RDF mapping standardisation.

Table of Contents

Candidates

   BigQuery | Cassandra | CouchDB

   Hadoop | HBase | MongoDB

   Neo4j | SimpleDB | Riak | Others

Summary

Appendix

Setting the stage

A good starting point to dive into the topic is Arto Bendiken’s excellent write-up on ‘How RDF Databases Differ from Other NoSQL Solutions’ as well as Sandro Hawke’s ‘RDF meets NoSQL’. I’ll first briefly introduce the NoSQL systems, then describe RDF plug-ins or the like I’ve found for them and then conclude with some general observations.

Who should read this?

If you’re a developer or system architect who either already uses one of the NoSQL systems below or is thinking about using one, and you want to process RDF data or make your data available as Linked Data, then you should continue reading.

How were the NoSQL systems selected?

I’m not very strict regarding the NoSQL systems I consider here: both software you can download and run on your own machines and cloud computing offerings are relevant, as long as someone has provided an RDF processing plug-in for the system or has experimented with processing RDF on it.

Only to a lesser extent are the observations exhibited here based on my own experiments; the majority of the information presented in this article is based on work other people have done and is typically available via blog posts, academic publications or software documentation. All errors and omissions are mine.

Candidates

In the following I’ll quickly introduce each of the selected NoSQL systems and then describe its Linked Data processing capabilities.

BigQuery

Summary

BigQuery is a web service that enables you to do interactive analysis of massive datasets.

Provider

Google

Maturity

Available since 2010; in Google Code Labs status

Development

Java, Python, shell, curl

Type

cloud computing IaaS offering

Usage

Terms of Services, Quota

BigQuery is a cloud computing offering by Google, a rather new player in town. It is supposed to complement MapReduce jobs in terms of interactive query processing and was introduced together with Google Storage and the Google Prediction API in early 2010.

In late 2010 I coded up bigquery-linkeddata, a Python-based service and set of tools that allows you to load RDF/N-Triples content into Google Storage and exposes an endpoint for querying the data using BigQuery's SELECT syntax (no SPARQL-to-SQL mapping is available so far).
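
The loading step can be sketched roughly like this; note this is a simplified stand-in, not the actual bigquery-linkeddata code: the regex only handles the simplest N-Triples lines (no blank nodes, datatypes or escaped quotes) and the column names are made up.

```python
import csv
import io
import re

# Simplified N-Triples line: <s> <p> <o> .
# Does NOT cover blank nodes, datatype/language tags or escaped quotes.
NT_LINE = re.compile(r'^(<[^>]+>)\s+(<[^>]+>)\s+(.+?)\s*\.\s*$')

def ntriples_to_csv(nt_text):
    """Turn simple N-Triples into a 3-column CSV for bulk loading."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["subject", "predicate", "object"])
    for line in nt_text.splitlines():
        match = NT_LINE.match(line.strip())
        if match:
            writer.writerow([term.strip('<>') for term in match.groups()])
    return out.getvalue()
```

Once the data sits in a three-column table, BigQuery's SELECT syntax can filter on any of the columns, which is what the endpoint exposes.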

Cassandra

Summary

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.

Provider

Apache (was: Facebook)

Maturity

Open-sourced in July 2008, now an Apache top-level project

Development

Java

Type

stand-alone distributed database management system

Usage

Available under ASF 2.0

Cassandra is an established NoSQL system used by several notable companies including Cisco, Facebook and Rackspace.

There is a Cassandra storage adaptor for RDF.rb available, developed by Arto Bendiken. The RDF.rb lib with Cassandra is also used in Dydra, an RDF cloud service Arto is involved in.

The other work I’m aware of in this area is by Günter Ladwig and Andreas Harth: they developed Cumulus, which uses Cassandra.
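
The core idea behind storing triples in a column-oriented system like Cassandra, as I understand it from Cumulus, is to write each triple redundantly under several index orderings so that any access pattern hits a single row key. A toy illustration in plain Python (not the actual Cumulus code; class and method names are my own):

```python
from collections import defaultdict

class TripleIndex:
    """Toy stand-in for triple indexes mapped onto column families:
    one 'family' per term ordering, row key -> remaining two terms."""

    def __init__(self):
        self.spo = defaultdict(set)  # subject  -> {(predicate, object)}
        self.pos = defaultdict(set)  # predicate -> {(object, subject)}
        self.osp = defaultdict(set)  # object   -> {(subject, predicate)}

    def add(self, s, p, o):
        # each triple is written three times, once per ordering
        self.spo[s].add((p, o))
        self.pos[p].add((o, s))
        self.osp[o].add((s, p))

    def by_subject(self, s):
        return {(s, p, o) for (p, o) in self.spo[s]}
```

The price of the redundancy is write amplification; the benefit is that subject-, predicate- and object-bound lookups are all single-row reads.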

CouchDB

Summary

Apache CouchDB is a document-oriented database written in Erlang; it can be queried and indexed in a MapReduce fashion.

Provider

Apache

Maturity

Open-sourced in February 2008, now an Apache top-level project

Development

JavaScript (default), PHP, Ruby, Python, Erlang

Type

stand-alone document-oriented database

Usage

Available under ASF 2.0

CouchDB is another established distributed, schema-free NoSQL system that manages the data as a collection of JSON documents and is used by Ubuntu, Couchbase and many more.

Greg Lappen has provided a CouchDB storage adaptor for RDF.rb, based on Ben Lavender’s skeleton. As CouchDB’s native format is JSON, it seems that efforts like JSON-LD (JavaScript Object Notation for Linked Data) are a good fit, although I wasn’t able to find implementations for it yet.
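
To illustrate why a JSON document store is a natural fit, here is a sketch of an RDF resource rendered as a single self-describing document; the field names roughly follow the JSON-LD draft, and the IRIs are invented for the example.

```python
import json

# One RDF resource as one JSON document: the @context maps the short
# field name to a full predicate IRI, @id names the subject.
doc = {
    "@context": {"name": "http://xmlns.com/foaf/0.1/name"},
    "@id": "http://example.org/people/alice",
    "name": "Alice",
}

print(json.dumps(doc, indent=2))
```

Such a document can be stored in CouchDB as-is and still be interpreted, via the context, as the triple ex:alice foaf:name "Alice".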

Last but not least, see also a recent discussion on the CouchDB users list regarding a ‘CouchDB x RDF databases comparison’.

Hadoop

Summary

Apache Hadoop is a software framework written in Java that supports reliable, scalable, distributed computing and includes subprojects such as HDFS (a distributed file system) and a MapReduce engine.

Provider

Apache

Maturity

Around since 2005, now an Apache top-level project

Development

Java, Python, etc.

Type

MapReduce framework

Usage

Available under ASF 2.0

Hadoop is a veteran and used by a large number of commercial and educational entities. You can use it locally (that is, on your machines) or as an IaaS cloud computing offering, such as Amazon’s Elastic MapReduce (EMR).

Arto Bendiken has written RDFgrid, a framework for batch-processing RDF data with Hadoop and Amazon EMR; Datagraph uses it in Dydra.

A group of researchers published about ‘Extensions to the Pig data processing platform for scalable RDF data processing using Hadoop’ - Apache Pig is a high-level language on top of Hadoop MapReduce with certain similarities to SQL.

Finally, a number of best practices for processing RDF data with MapReduce/Hadoop are available.
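
One of the most common patterns in those best practices is grouping all triples of a subject into one record as the first job. A pure-Python stand-in for the map and reduce phases of such a job (the data is made up):

```python
from collections import defaultdict

def map_phase(triples):
    """Map each triple to (subject, (predicate, object))."""
    for s, p, o in triples:
        yield s, (p, o)

def reduce_phase(pairs):
    """Group the mapped pairs by key, one record per subject."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

records = reduce_phase(map_phase([
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),
]))
```

In an actual Hadoop job the framework performs the shuffle and grouping between the two phases; the point here is only the shape of the computation.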

HBase

Summary

Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable and is written in Java.

Provider

Apache

Maturity

Around since 2007, now an Apache top-level project

Development

Java

Type

column-oriented store

Usage

Available under ASF 2.0

A number of companies, including Mendeley, Facebook and Adobe, are using HBase.

Gabriel Mateescu has written up a developerWorks article on how to process RDF data using HBase. Further, Paolo Castagna wrote an experimental HBase implementation and a group of researchers reported on ‘Scalable RDF store based on HBase and MapReduce’.
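
A typical layout in such write-ups keys rows by subject and uses one column per predicate; a minimal in-memory sketch of that idea (the helper names are my own, and a real store would also keep reverse indexes for object-bound lookups):

```python
# Toy stand-in for a subject-keyed column-oriented layout:
# row key = subject, column qualifier = predicate, cells = objects.
table = {}

def put(s, p, o):
    """Append one object value under the subject's predicate column."""
    table.setdefault(s, {}).setdefault(p, []).append(o)

def get_row(s):
    """Fetch all predicate/object columns for one subject."""
    return table.get(s, {})
```

Subject-bound lookups become single-row reads; queries bound only by object or predicate need either a scan or an additional index table.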

MongoDB

Summary

MongoDB is a high-performance, schema-free, (JSON) document-oriented database written in C++.

Provider

10gen

Maturity

First public release in February 2009

Development

Many languages incl. C/C++, Java, etc.

Type

document-oriented database

Usage

Licensing

MongoDB is used by an array of sites and providers including SourceForge, CERN and Foursquare, to name a few.

Rob Vesse reported on his experiments with MongoDB as an RDF store. Further, Antoine Imbert developed MongoDB::RDF for Perl. I also found William Waites’ write-up on ‘Mongo as an RDF store’ interesting.
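
A common layout in these experiments is one document per subject with predicates as fields. Here is a minimal in-memory sketch of that idea; no real MongoDB is involved and the function names are mine:

```python
# In-memory stand-in for a MongoDB collection: a list of dicts.
collection = []

def insert_resource(subject, properties):
    """Store one RDF subject as a document, predicates as fields."""
    doc = {"_id": subject}
    doc.update(properties)
    collection.append(doc)

def find(field, value):
    """Mimic a simple equality query over the collection."""
    return [doc for doc in collection if doc.get(field) == value]
```

The appeal is that predicate-bound queries map directly onto MongoDB's field queries and can be backed by secondary indexes on frequently used predicates.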

Neo4j

Summary

Neo4j is a high-performance graph database, implemented in Java.

Provider

Neo Technology

Maturity

Around since 2003, first public release in May 2007

Development

Java and language bindings

Type

graph database

Usage

Licenses

Neo4j is a rather young but promising player in the NoSQL area. It has built-in RDF processing support, albeit still under development, and there seems to be some uptake.
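
The mapping a graph database invites is straightforward: subjects and objects become nodes, and each triple becomes a relationship typed with the predicate. A plain-Python sketch of that idea (not Neo4j API code):

```python
# Toy property-graph view of a triple set.
nodes = set()
edges = []  # (source node, relationship type, target node)

def add_triple(s, p, o):
    """Subjects and objects become nodes, the predicate the edge type."""
    nodes.update([s, o])
    edges.append((s, p, o))

def neighbours(node):
    """Outgoing (relationship type, target) pairs for a node."""
    return [(p, o) for (s, p, o) in edges if s == node]
```

This direct correspondence is why graph traversals (friend-of-a-friend style queries) tend to be the natural strength of such a store for RDF data.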

SimpleDB

Summary

Amazon SimpleDB is a distributed database/web-service written in Erlang.

Provider

Amazon

Maturity

Available since 2007

Development

Java and others

Type

cloud computing IaaS offering

Usage

Pricing

SimpleDB is often used together with other Amazon Web Services offerings such as the Simple Storage Service (S3), for example by Alexa, Livemocha or, more recently, Netflix.

Two researchers published a paper ‘RDF on Cloud Number Nine’, summarising their experiences with RDF processing in SimpleDB - see also their open source project, Stratustore, which is coded in Java.

Riak

Summary

Riak is a Dynamo-inspired, distributed key/value store with built-in MapReduce support.

Provider

Basho

Maturity

Available since 2009

Development

Erlang, Python, Java, PHP, Javascript, Ruby

Type

key-value distributed database

Usage

Apache 2.0 license

Riak is a distributed database that supports high availability through tunable guarantees for durability and eventual consistency; it is used in production by institutions such as Comcast, Wikia or Opscode.

Andrew McKnight shared his thoughts concerning SPARQL on Riak, and I gave it a shot as well, trying to store an RDF graph in Riak using Link headers.
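
For illustration, here is roughly how a triple whose object is another stored resource can be encoded as a Riak link in an HTTP Link header; the header shape follows Riak's link-walking convention, while the bucket and key names are invented for the example.

```python
def triple_to_link_header(bucket, key, predicate):
    """Render one triple (pointing at another stored resource) as a
    Riak link; the predicate URI travels in the riaktag parameter."""
    return '</riak/{0}/{1}>; riaktag="{2}"'.format(bucket, key, predicate)

# e.g. ex:alice foaf:knows ex:bob, with ex:bob stored at people/bob
header = triple_to_link_header("people", "bob",
                               "http://xmlns.com/foaf/0.1/knows")
```

The nice property is that link walking then gives you predicate-directed traversal for free; the obvious limitation is that literal-valued triples have to live in the object body instead.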

Others

There are a number of projects, both from research and with a commercial background, that already process RDF with NoSQL systems or plan to do so; in the following I briefly list some I’m aware of:

Summary

This is by no means a comprehensive survey of NoSQL databases (but there are some very good ones, such as the one by NoSQLPedia). In the context of this article I’m merely looking into the RDF/Linked Data processing capabilities of some NoSQL systems.

The motivation for this write-up is rather simple: in the LATC Support Action we face the question ‘What database should I use to manage my RDF data?’ almost on a daily basis. So, rather than pulling the information together again and again I decided to invest some more time and start collecting stuff in a more structured form.

I currently see essentially three options for managing and processing Linked Data: dedicated RDF stores, RDB-based solutions with built-in RDF support or RDB2RDF mappings and, the focus of this article, solutions based on NoSQL.

Some of the features I’d expect to find in a NoSQL system concerning Linked Data processing (in no particular order) are:

Most of the above features are currently not widely supported by the NoSQL systems and the respective plug-ins discussed above, so my conclusion for the time being is that NoSQL-based solutions for Linked Data still have a lot of room for improvement.

Appendix

Acknowledgements

A big thank you to all the wonderful people that write about their experiences regarding NoSQL and RDF as well as to the LATC crew and clients for their support and motivation. I’d like to thank the following people for feedback and comments: Friedrich Lindenberg, KevBurnsJr, Luca Garulli, peco, and Anders Nawroth.

The work on this article has been enabled through funding received from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 256975, LOD Around-The-Clock (LATC) Support Action.

License

 

This article is licensed under the Creative Commons Attribution 3.0 Unported License.

Change Log