LATC Support Action - May 2011 - v0.2

NoSQL solutions for

Linked Data processing


Michael Hausenblas, DERI, NUI Galway

In this article I’m going to discuss a number of NoSQL systems with regard to their Linked Data processing capabilities. The goal is to give developers and system architects a quick overview and help them assess if and to what extent they can use a certain NoSQL system for storing, querying and exposing RDF data. I will not take dedicated RDF databases into account, as there are plenty of resources available comparing them. Also out of scope are solutions based on relational databases, as this is the topic of another of my activities, the RDB2RDF mapping standardisation.

Table of Contents

Candidates

   BigQuery | Cassandra | CouchDB

   Hadoop | HBase | MongoDB

   Neo4j | SimpleDB | Riak | Others

Summary

Appendix

Setting the stage

A good starting point to dive into the topic is Arto Bendiken’s excellent write-up on ‘How RDF Databases Differ from Other NoSQL Solutions’ as well as Sandro Hawke’s ‘RDF meets NoSQL’. I’ll first briefly introduce the NoSQL systems, then describe RDF plug-ins or the like I’ve found for them and then conclude with some general observations.

Who should read this?

If you’re a developer or system architect who either already uses one of the NoSQL systems below or is thinking about using one, and you want to process RDF data or make your data available as Linked Data, then you should continue reading.

How were the NoSQL systems selected?

I’m not very strict regarding the NoSQL systems I consider here: both software you can download and run on your own machines and cloud computing offerings are relevant, as long as someone has provided an RDF processing plug-in for the system or has experimented with processing RDF on it.

Only to a lesser extent are the observations exhibited here based on my own experiments; the majority of the information presented in this article is based on work other people have done and is typically available via blog posts, academic publications or software documentation. All errors and omissions are mine.

Candidates

In the following I’ll quickly introduce each of the selected NoSQL systems and then describe its Linked Data processing capabilities.

BigQuery

Summary

BigQuery is a web service that enables you to do interactive analysis of massive datasets.

Provider

Google

Maturity

Available since 2010; in Google Code Labs status

Development

Java, Python, shell, curl

Type

cloud computing IaaS offering

Usage

Terms of Services, Quota

BigQuery is a cloud computing offering by Google, a rather new player in town. It is supposed to complement MapReduce jobs in terms of interactive query processing and was introduced together with Google Storage and the Google Prediction API in early 2010.

In late 2010 I coded up bigquery-linkeddata, a Python-based service and set of tools that allows you to load RDF/N-Triples content into Google Storage and exposes an endpoint for querying the data using BigQuery's SELECT syntax (no SPARQL-to-SQL mapping is available so far).
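
The loading step can be sketched roughly like this; note this is a simplified stand-in, not the actual bigquery-linkeddata code: the regex only handles the simplest N-Triples lines (no blank nodes, datatypes or escaped quotes) and the column names are made up.

```python
import csv
import io
import re

# Simplified N-Triples line: <s> <p> <o> .
# Does NOT cover blank nodes, datatype/language tags or escaped quotes.
NT_LINE = re.compile(r'^(<[^>]+>)\s+(<[^>]+>)\s+(.+?)\s*\.\s*$')

def ntriples_to_csv(nt_text):
    """Turn simple N-Triples into a 3-column CSV for bulk loading."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["subject", "predicate", "object"])
    for line in nt_text.splitlines():
        match = NT_LINE.match(line.strip())
        if match:
            writer.writerow([term.strip('<>') for term in match.groups()])
    return out.getvalue()
```

Once the data sits in a three-column table, BigQuery's SELECT syntax can filter on any of the columns, which is what the endpoint exposes.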

Cassandra

Summary

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.

Provider

Apache (was: Facebook)

Maturity

Open-sourced in July 2008, now an Apache top-level project

Development

Java

Type

stand-alone distributed database management system

Usage

Available under ASF 2.0

Cassandra is an established NoSQL system used by several notable companies including Cisco, Facebook and Rackspace.

There is a Cassandra storage adaptor for RDF.rb available, developed by Arto Bendiken. The RDF.rb lib with Cassandra is also used in Dydra, an RDF cloud service Arto is involved in.

The other work I’m aware of in this area is by Günter Ladwig and Andreas Harth: they developed Cumulus, which uses Cassandra.
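
The core idea behind storing triples in a column-oriented system like Cassandra, as I understand it from Cumulus, is to write each triple redundantly under several index orderings so that any access pattern hits a single row key. A toy illustration in plain Python (not the actual Cumulus code; class and method names are my own):

```python
from collections import defaultdict

class TripleIndex:
    """Toy stand-in for triple indexes mapped onto column families:
    one 'family' per term ordering, row key -> remaining two terms."""

    def __init__(self):
        self.spo = defaultdict(set)  # subject  -> {(predicate, object)}
        self.pos = defaultdict(set)  # predicate -> {(object, subject)}
        self.osp = defaultdict(set)  # object   -> {(subject, predicate)}

    def add(self, s, p, o):
        # each triple is written three times, once per ordering
        self.spo[s].add((p, o))
        self.pos[p].add((o, s))
        self.osp[o].add((s, p))

    def by_subject(self, s):
        return {(s, p, o) for (p, o) in self.spo[s]}
```

The price of the redundancy is write amplification; the benefit is that subject-, predicate- and object-bound lookups are all single-row reads.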

CouchDB

Summary

Apache CouchDB is a document-oriented database written in Erlang; it can be queried and indexed in a MapReduce fashion.

Provider

Apache

Maturity

Open-sourced in February 2008, now an Apache top-level project

Development

JavaScript (default), PHP, Ruby, Python, Erlang

Type

stand-alone document-oriented database

Usage

Available under ASF 2.0

CouchDB is another established distributed, schema-free NoSQL system that manages the data as a collection of JSON documents and is used by Ubuntu, Couchbase and many more.

Greg Lappen has provided a CouchDB storage adaptor for RDF.rb, based on Ben Lavender’s skeleton. As CouchDB’s native format is JSON, it seems that efforts like JSON-LD (JavaScript Object Notation for Linked Data) are a good fit, although I wasn’t able to find implementations for it yet.
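
To illustrate why a JSON document store is a natural fit, here is a sketch of an RDF resource rendered as a single self-describing document; the field names roughly follow the JSON-LD draft, and the IRIs are invented for the example.

```python
import json

# One RDF resource as one JSON document: the @context maps the short
# field name to a full predicate IRI, @id names the subject.
doc = {
    "@context": {"name": "http://xmlns.com/foaf/0.1/name"},
    "@id": "http://example.org/people/alice",
    "name": "Alice",
}

print(json.dumps(doc, indent=2))
```

Such a document can be stored in CouchDB as-is and still be interpreted, via the context, as the triple ex:alice foaf:name "Alice".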

Last but not least, see also a recent discussion on the CouchDB users list regarding a ‘CouchDB x RDF databases comparison’.

Hadoop

Summary

Apache Hadoop is a software framework written in Java that supports reliable, scalable, distributed computing and includes subprojects such as HDFS (a distributed file system) and a MapReduce engine.

Provider

Apache

Maturity

Around since 2005, now an Apache top-level project

Development

Java, Python, etc.

Type

MapReduce framework

Usage

Available under ASF 2.0

Hadoop is a veteran and used by a large number of commercial and educational entities. You can use it locally (that is, on your machines) or as an IaaS cloud computing offering, such as Amazon’s Elastic MapReduce (EMR).

Arto Bendiken has written RDFgrid, a framework for batch-processing RDF data with Hadoop and Amazon EMR; Datagraph uses it in Dydra.

A group of researchers published about ‘Extensions to the Pig data processing platform for scalable RDF data processing using Hadoop’ - Apache Pig is a high-level language on top of Hadoop MapReduce with certain similarities to SQL.

Finally, a number of best practices for processing RDF data with MapReduce/Hadoop are available.
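
One of the most common patterns in those best practices is grouping all triples of a subject into one record as the first job. A pure-Python stand-in for the map and reduce phases of such a job (the data is made up):

```python
from collections import defaultdict

def map_phase(triples):
    """Map each triple to (subject, (predicate, object))."""
    for s, p, o in triples:
        yield s, (p, o)

def reduce_phase(pairs):
    """Group the mapped pairs by key, one record per subject."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

records = reduce_phase(map_phase([
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),
]))
```

In an actual Hadoop job the framework performs the shuffle and grouping between the two phases; the point here is only the shape of the computation.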

HBase

Summary

Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable and is written in Java.

Provider

Apache

Maturity

Around since 2007, now an Apache top-level project

Development

Java

Type

column-oriented store

Usage

Available under ASF 2.0

A number of companies, including Mendeley, Facebook and Adobe, are using HBase.

Gabriel Mateescu has written up a developerWorks article on how to process RDF data using HBase. Further, Paolo Castagna wrote an experimental HBase implementation and a group of researchers reported on ‘Scalable RDF store based on HBase and MapReduce’.
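
A typical layout in such write-ups keys rows by subject and uses one column per predicate; a minimal in-memory sketch of that idea (the helper names are my own, and a real store would also keep reverse indexes for object-bound lookups):

```python
# Toy stand-in for a subject-keyed column-oriented layout:
# row key = subject, column qualifier = predicate, cells = objects.
table = {}

def put(s, p, o):
    """Append one object value under the subject's predicate column."""
    table.setdefault(s, {}).setdefault(p, []).append(o)

def get_row(s):
    """Fetch all predicate/object columns for one subject."""
    return table.get(s, {})
```

Subject-bound lookups become single-row reads; queries bound only by object or predicate need either a scan or an additional index table.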

MongoDB

Summary

MongoDB is a high-performance, schema-free, (JSON) document-oriented database written in C++.

Provider

10gen

Maturity

First public release in February 2009

Development

Many languages incl. C/C++, Java, etc.

Type

document-oriented database

Usage

Licensing

MongoDB is used by an array of sites and providers including SourceForge, CERN and Foursquare, to name a few.

Rob Vesse reported on his experiments with MongoDB as an RDF store. Further, Antoine Imbert developed MongoDB::RDF for Perl. I also found William Waites’ write-up on ‘Mongo as an RDF store’ interesting.
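
A common layout in these experiments is one document per subject with predicates as fields. Here is a minimal in-memory sketch of that idea; no real MongoDB is involved and the function names are mine:

```python
# In-memory stand-in for a MongoDB collection: a list of dicts.
collection = []

def insert_resource(subject, properties):
    """Store one RDF subject as a document, predicates as fields."""
    doc = {"_id": subject}
    doc.update(properties)
    collection.append(doc)

def find(field, value):
    """Mimic a simple equality query over the collection."""
    return [doc for doc in collection if doc.get(field) == value]
```

The appeal is that predicate-bound queries map directly onto MongoDB's field queries and can be backed by secondary indexes on frequently used predicates.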

Neo4j

Summary

Neo4j is a high-performance graph database, implemented in Java.

Provider

Neo Technology

Maturity

Around since 2003, first public release in May 2007

Development

Java and language bindings

Type

graph database

Usage

Licenses

Neo4j is a rather young but promising player in the NoSQL area. It has built-in RDF processing support, albeit still under development, and there seems to be some uptake.
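
The mapping a graph database invites is straightforward: subjects and objects become nodes, and each triple becomes a relationship typed with the predicate. A plain-Python sketch of that idea (not Neo4j API code):

```python
# Toy property-graph view of a triple set.
nodes = set()
edges = []  # (source node, relationship type, target node)

def add_triple(s, p, o):
    """Subjects and objects become nodes, the predicate the edge type."""
    nodes.update([s, o])
    edges.append((s, p, o))

def neighbours(node):
    """Outgoing (relationship type, target) pairs for a node."""
    return [(p, o) for (s, p, o) in edges if s == node]
```

This direct correspondence is why graph traversals (friend-of-a-friend style queries) tend to be the natural strength of such a store for RDF data.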

SimpleDB

Summary

Amazon SimpleDB is a distributed database/web-service written in Erlang.

Provider

Amazon

Maturity

Available since 2007

Development

Java and others

Type

cloud computing IaaS offering

Usage

Pricing

SimpleDB is often used together with other Amazon Web Services offerings such as the Simple Storage Service (S3), for example by Alexa, Livemocha or, more recently, Netflix.

Two researchers published a paper ‘RDF on Cloud Number Nine’, summarising their experiences with RDF processing in SimpleDB - see also their open source project, Stratustore, which is coded in Java.

Riak

Summary

Riak is a Dynamo-inspired, distributed key/value store with built-in MapReduce support.

Provider

Basho

Maturity

Available since 2009

Development

Erlang, Python, Java, PHP, Javascript, Ruby

Type

key-value distributed database

Usage

Apache 2.0 license

Riak is a distributed database that supports high availability through tunable guarantees for durability and eventual consistency; it is used in production by institutions such as Comcast, Wikia or Opscode.

Andrew McKnight shared his thoughts concerning SPARQL on Riak, and I gave it a shot as well, trying to store an RDF graph in Riak using Link headers.
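
For illustration, here is roughly how a triple whose object is another stored resource can be encoded as a Riak link in an HTTP Link header; the header shape follows Riak's link-walking convention, while the bucket and key names are invented for the example.

```python
def triple_to_link_header(bucket, key, predicate):
    """Render one triple (pointing at another stored resource) as a
    Riak link; the predicate URI travels in the riaktag parameter."""
    return '</riak/{0}/{1}>; riaktag="{2}"'.format(bucket, key, predicate)

# e.g. ex:alice foaf:knows ex:bob, with ex:bob stored at people/bob
header = triple_to_link_header("people", "bob",
                               "http://xmlns.com/foaf/0.1/knows")
```

The nice property is that link walking then gives you predicate-directed traversal for free; the obvious limitation is that literal-valued triples have to live in the object body instead.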

Others

There are a number of projects, both from research and with a commercial background, that already process RDF with NoSQL systems or plan to do so; in the following I briefly list some I’m aware of:

Summary

This is by no means a comprehensive survey of NoSQL databases (but there are some very good ones, such as the one by NoSQLPedia). In the context of this article I’m merely looking into the RDF/Linked Data processing capabilities of some NoSQL systems.

The motivation for this write-up is rather simple: in the LATC Support Action we face the question ‘What database should I use to manage my RDF data?’ almost on a daily basis. So, rather than pulling the information together again and again I decided to invest some more time and start collecting stuff in a more structured form.

I currently see essentially three options for managing and processing Linked Data: dedicated RDF stores, RDB-based solutions with built-in RDF support or RDB2RDF mappings and, the focus of this article, solutions based on NoSQL.

Some of the features I’d expect to find in a NoSQL system concerning Linked Data processing (in no particular order) are:

Most of the above features are currently not widely supported by the NoSQL systems and the respective plug-ins discussed above, so my conclusion for the time being is that NoSQL-based solutions for Linked Data still have a lot of room for improvement.

Appendix

Acknowledgements

A big thank you to all the wonderful people that write about their experiences regarding NoSQL and RDF as well as to the LATC crew and clients for their support and motivation. I’d like to thank the following people for feedback and comments: Friedrich Lindenberg, KevBurnsJr, Luca Garulli, peco, and Anders Nawroth.

The work on this article has been enabled through funding received from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 256975, LOD Around-The-Clock (LATC) Support Action.

License

 

This article is licensed under the Creative Commons Attribution 3.0 Unported License.

Change Log