LATC Support Action - May 2011 - v0.2
NoSQL solutions for
Linked Data processing
Michael Hausenblas, DERI, NUI Galway
In this article I’m going to discuss a number of NoSQL systems with regard to their Linked Data processing capabilities. The goal is to give developers and system architects a quick overview and help them assess if and to what extent they can use a certain NoSQL system for storing, querying and exposing RDF data. I will not take dedicated RDF databases into account, as there are plenty of resources available comparing them. Also out of scope are solutions based on relational databases, as this is the topic of another of my activities, the RDB2RDF mapping standardisation.
Table of Contents
BigQuery | Cassandra | CouchDB | Hadoop | HBase
MongoDB | Neo4j | SimpleDB | Riak | Others
Setting the stage
A good starting point to dive into the topic is Arto Bendiken’s excellent write-up on ‘How RDF Databases Differ from Other NoSQL Solutions’ as well as Sandro Hawke’s ‘RDF meets NoSQL’. I’ll first briefly introduce the NoSQL systems, then describe RDF plug-ins or the like I’ve found for them and then conclude with some general observations.
Who should read this?
If you’re a developer or system architect who either already uses one of the NoSQL systems below or you think about using one of them, and you want to process RDF data or make your data available as Linked Data, then you should continue reading.
How were the NoSQL systems selected?
I’m not very strict regarding the NoSQL systems I consider here: both software you can download and run on your own machines and cloud computing offerings are relevant, as long as someone has provided an RDF processing plug-in for them or has done an experiment with processing RDF on the system.
Only to a lesser extent are the observations exhibited here based on my own experiments; the majority of the information presented in this article is based on work other people have done and is typically available via blog posts, academic publications or software documentation. All errors and omissions are mine.
Candidates
In the following I’ll quickly introduce each of the NoSQL systems and then describe its Linked Data processing capabilities.
BigQuery
Summary | BigQuery is a web service that enables you to do interactive analysis of massive datasets. |
Provider | Google |
Maturity | Available since 2010; in Google Code Labs status |
Development | Java, Python, shell, curl |
Type | cloud computing IaaS offering |
Usage |
BigQuery is a cloud computing offering by Google, a rather new player in town. It is supposed to complement MapReduce jobs in terms of interactive query processing and was introduced together with Google Storage and the Google Prediction API in early 2010.
In late 2010 I coded up bigquery-linkeddata, a Python-based service and set of tools that allows you to load RDF/N-Triples content into Google Storage as well as to expose an endpoint for querying the data using BigQuery's SELECT syntax (no SPARQL-to-SQL mapping is available so far).
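To illustrate the kind of transformation such a loader performs, here is a minimal sketch (not the actual bigquery-linkeddata code) that parses a simplified subset of N-Triples into (subject, predicate, object) rows, the tabular shape BigQuery ingests; the regex handles only IRIs and plain literals, not blank nodes or typed literals:

```python
import re

# Simplified N-Triples line pattern: subject and predicate are IRIs,
# the object is either an IRI or a plain (unquoted-internally) literal.
TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)")\s*\.\s*$'
)

def ntriples_to_rows(text):
    """Turn N-Triples lines into (subject, predicate, object) tuples,
    the row shape a tabular store such as BigQuery expects."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comments
        m = TRIPLE.match(line)
        if m:
            s, p, o_iri, o_lit = m.groups()
            rows.append((s, p, o_iri if o_iri is not None else o_lit))
    return rows

data = '''
<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .
'''
rows = ntriples_to_rows(data)
```

From rows like these, a CSV upload to Google Storage and a plain `SELECT subject FROM triples WHERE predicate = '...'` is straightforward.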
Cassandra
Summary | The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. |
Provider | Apache (was: Facebook) |
Maturity | Open-sourced in July 2008, now an Apache top-level project |
Development | Java |
Type | stand-alone distributed database management system |
Usage | Available under ASF 2.0 |
Cassandra is an established NoSQL system used by a number of well-known companies, including Cisco, Facebook and Rackspace.
There is a Cassandra storage adaptor for RDF.rb available, developed by Arto Bendiken. The RDF.rb lib with Cassandra is also used in Dydra, an RDF cloud service Arto is involved in.
The other work I’m aware of in this area is by Günter Ladwig and Andreas Harth: they developed Cumulus, which uses Cassandra.
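As a rough illustration of how a column store can index triples, the sketch below models the multi-index layout such adaptors typically use, one permutation per lookup pattern, with plain Python dicts standing in for Cassandra column families; the class and method names are my own, not the API of any of the projects above:

```python
from collections import defaultdict

class TripleIndexes:
    """Three permutations of each triple, so any triple pattern with one
    bound position becomes a single row-key lookup. In Cassandra terms,
    each index would be a column family keyed by the first component."""

    def __init__(self):
        self.spo = defaultdict(list)  # subject   -> [(predicate, object)]
        self.pos = defaultdict(list)  # predicate -> [(object, subject)]
        self.osp = defaultdict(list)  # object    -> [(subject, predicate)]

    def add(self, s, p, o):
        # Write amplification (one write per index) buys cheap reads.
        self.spo[s].append((p, o))
        self.pos[p].append((o, s))
        self.osp[o].append((s, p))

    def by_subject(self, s):
        return [(s, p, o) for p, o in self.spo[s]]

    def by_predicate(self, p):
        return [(s, p, o) for o, s in self.pos[p]]

idx = TripleIndexes()
idx.add("ex:a", "foaf:knows", "ex:b")
idx.add("ex:b", "foaf:knows", "ex:c")
```

The trade-off is typical for column stores: storage is duplicated per permutation, in exchange for lookups that need no joins or scans.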
CouchDB
Summary | Apache CouchDB is a document-oriented database written in Erlang; it can be queried and indexed in a MapReduce fashion. |
Provider | Apache |
Maturity | Open-sourced in February 2008, now an Apache top-level project |
Development | JavaScript (default), PHP, Ruby, Python, Erlang |
Type | stand-alone document-oriented database |
Usage | Available under ASF 2.0 |
CouchDB is another established distributed, schema-free NoSQL system that manages the data as a collection of JSON documents and is used by Ubuntu, Couchbase and many more.
Greg Lappen has provided a CouchDB storage adaptor for RDF.rb, based on Ben Lavender’s skeleton. As CouchDB’s native data format is JSON, efforts like JSON-LD (JavaScript Object Notation for Linked Data) seem to be a good fit, although I wasn’t able to find implementations of it yet.
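To give an idea of what a JSON-LD-flavoured mapping could look like, here is a sketch that groups triples by subject into one JSON document per resource, roughly the shape a document store such as CouchDB would hold; the "@id" key follows JSON-LD, everything else is illustrative:

```python
import json
from collections import defaultdict

def triples_to_documents(triples):
    """Group triples by subject into one JSON document per resource --
    roughly what a JSON-LD-style mapping would produce for a document
    store such as CouchDB."""
    grouped = defaultdict(dict)
    for s, p, o in triples:
        grouped[s].setdefault(p, []).append(o)
    # "@id" follows JSON-LD; predicate keys are kept as-is for brevity
    # (a real mapping would also carry a "@context" for prefix expansion).
    return [{"@id": s, **props} for s, props in grouped.items()]

triples = [
    ("http://example.org/a", "foaf:name", "Alice"),
    ("http://example.org/a", "foaf:knows", "http://example.org/b"),
]
docs = triples_to_documents(triples)
print(json.dumps(docs, indent=2))
```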
Last but not least, see also a recent discussion on the CouchDB users list regarding a ‘CouchDB x RDF databases comparison’.
Hadoop
Summary | Apache Hadoop is a software framework written in Java that supports reliable, scalable, distributed computing, including subprojects such as HDFS (a distributed file system) and a MapReduce engine. |
Provider | Apache |
Maturity | Around since 2005, now an Apache top-level project |
Development | Java, Python, etc. |
Type | MapReduce framework |
Usage | Available under ASF 2.0 |
Hadoop is a veteran and used by a large number of commercial and educational entities. You can use it locally (that is, on your machines) or as an IaaS cloud computing offering, such as Amazon’s Elastic MapReduce (EMR).
Arto Bendiken has written RDFgrid, a framework for batch-processing RDF data with Hadoop and Amazon EMR; Datagraph uses it in Dydra.
A group of researchers published on ‘Extensions to the Pig data processing platform for scalable RDF data processing using Hadoop’ - Apache Pig is a high-level language on top of Hadoop MapReduce with certain similarities to SQL.
Finally, a number of best practices for processing RDF data with MapReduce/Hadoop are available.
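As a minimal illustration of the MapReduce pattern applied to RDF, the sketch below counts predicate usage in N-Triples data. The map and reduce functions mirror what a Hadoop Streaming job would do (there they would read stdin and print tab-separated key/value pairs), though here they run in-process and the N-Triples regex is deliberately simplified:

```python
import re
from collections import Counter

# Simplified: only captures the predicate IRI of a well-formed line.
TRIPLE = re.compile(r'^<[^>]+>\s+<([^>]+)>\s+.*\.\s*$')

def map_phase(lines):
    """Mapper: emit (predicate, 1) for each N-Triples line."""
    for line in lines:
        m = TRIPLE.match(line.strip())
        if m:
            yield (m.group(1), 1)

def reduce_phase(pairs):
    """Reducer: sum counts per predicate. Hadoop would group and sort
    keys between the two phases; Counter does the grouping here."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/knows> <http://example.org/b> .',
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/knows> <http://example.org/c> .',
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .',
]
counts = reduce_phase(map_phase(lines))
```

The same shape (per-triple map, per-key reduce) underlies most of the RDF-on-Hadoop work mentioned above, whatever the actual analysis is.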
HBase
Summary | Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable and is written in Java. |
Provider | Apache |
Maturity | Around since 2007, now an Apache top-level project |
Development | Java |
Type | column-oriented store |
Usage | Available under ASF 2.0 |
A number of organisations, including Mendeley, Facebook and Adobe, are using HBase.
Gabriel Mateescu has written up a developerWorks article on how to process RDF data using HBase. Further, Paolo Castagna wrote an experimental HBase implementation and a group of researchers reported on ‘Scalable RDF store based on HBase and MapReduce’.
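A recurring design question in such column-store layouts is which index table to consult for a given triple pattern. The sketch below shows one plausible routing scheme; the table names and the three-permutation layout are assumptions for illustration, not a description of any specific implementation above:

```python
def choose_table(s, p, o):
    """Pick an index table for a (s, p, o) pattern, where None means
    an unbound position. With three permutation tables (row key first),
    any bound prefix becomes a cheap prefix scan on some table."""
    bound = (s is not None, p is not None, o is not None)
    routing = {
        (True,  True,  True):  "spo",  # fully bound: existence check
        (True,  True,  False): "spo",  # s, p bound -> scan spo prefix
        (True,  False, False): "spo",
        (True,  False, True):  "osp",  # o, s bound -> scan osp prefix
        (False, True,  True):  "pos",
        (False, True,  False): "pos",
        (False, False, True):  "osp",
        (False, False, False): "spo",  # nothing bound: full scan anywhere
    }
    return routing[bound]
```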
MongoDB
Summary | MongoDB is a high-performance, schema-free, (JSON) document-oriented database written in C++. |
Provider | 10gen |
Maturity | First public release in February 2009 |
Development | Many languages incl. C/C++, Java, etc. |
Type | document-oriented database |
Usage |
MongoDB is used by an array of sites and providers, including SourceForge, CERN and Foursquare.
Rob Vesse reported on his experiments with MongoDB as an RDF store. Further, Antoine Imbert developed MongoDB::RDF for Perl. I also found William Waites’ write-up on ‘Mongo as an RDF store’ interesting.
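One straightforward mapping, an assumption on my part rather than necessarily the layout any of the projects above uses, stores each triple as its own document. The tiny matcher below stands in for MongoDB's equality queries, so the sketch stays self-contained; with pymongo you would issue the same query dict via `collection.find()`:

```python
# One document per triple; "s"/"p"/"o" field names are illustrative.
docs = [
    {"s": "ex:a", "p": "foaf:name",  "o": "Alice"},
    {"s": "ex:a", "p": "foaf:knows", "o": "ex:b"},
    {"s": "ex:b", "p": "foaf:knows", "o": "ex:c"},
]

def find(collection, query):
    """Return documents whose fields equal the query's -- a minimal
    stand-in for MongoDB's equality queries on a collection."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]

knows = find(docs, {"p": "foaf:knows"})
```

With secondary indexes on "s", "p" and "o", each basic triple pattern maps to one indexed query; joins for multi-pattern SPARQL queries would still have to happen client-side.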
Neo4j
Summary | Neo4j is a high-performance graph database, implemented in Java. |
Provider | Neo Technology |
Maturity | Around since 2003, first public release in May 2007 |
Development | Java and language bindings |
Type | graph database |
Usage |
Neo4j is a rather young but promising player in the NoSQL arena. It has built-in RDF processing support, albeit still under development, and there seems to be some take-up.
SimpleDB
Summary | Amazon SimpleDB is a distributed database/web-service written in Erlang. |
Provider | Amazon |
Maturity | Available since 2007 |
Development | Java and others |
Type | cloud computing IaaS offering |
Usage |
SimpleDB is often used together with other Amazon Web Services offerings such as the Simple Storage Service (S3), for example by Alexa, Livemocha and, only recently, Netflix.
Two researchers published a paper ‘RDF on Cloud Number Nine’, summarising their experiences with RDF processing in SimpleDB - see also their open source project, Stratustore, which is coded in Java.
Riak
Summary | Riak is a Dynamo-inspired key/value store and distributed database platform with built-in MapReduce support. |
Provider | Basho |
Maturity | Available since 2009 |
Development | Erlang, Python, Java, PHP, Javascript, Ruby |
Type | key-value distributed database |
Usage |
Riak is a distributed database that supports high availability by allowing tunable levels of guarantees for durability and eventual consistency; it is used in production by organisations such as Comcast, Wikia and Opscode.
Andrew McKnight shared his thoughts concerning SPARQL on Riak and I gave it a shot as well and tried to store an RDF graph in Riak, using Link headers.
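To sketch the Link-header approach: each resource becomes a Riak object, and triples whose object is another resource become entries in the HTTP Link header, tagged with the predicate via riaktag (the `riaktag` syntax follows Riak's HTTP API; the bucket/key naming and the prefix-based resource test below are my assumptions):

```python
def links_for(bucket, triples, subject):
    """Build an HTTP Link header value linking `subject` to related
    resources in the same bucket, tagging each link with the predicate
    (Riak's riaktag mechanism). Literal-valued triples are skipped --
    they would live in the object's body instead."""
    parts = []
    for s, p, o in triples:
        if s == subject and o.startswith("ex:"):  # object is a resource
            key = o.split(":", 1)[1]
            parts.append('</riak/%s/%s>; riaktag="%s"' % (bucket, key, p))
    return ", ".join(parts)

triples = [
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "foaf:name", "Alice"),
]
header = links_for("resources", triples, "ex:a")
```

The appeal is that Riak's link walking then gives you graph traversal over the stored RDF for free; the limitation is that literals and link metadata need a separate home.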
Others
There are a number of projects, both from research and from commercial backgrounds, that already process RDF with NoSQL systems or plan to do so; in the following I briefly list some I’m aware of:
Summary
This is by no means a comprehensive survey of NoSQL databases (but there are some very good ones, such as the one by NoSQLPedia). In the context of the article I’m merely looking into RDF/Linked Data processing capabilities of some NoSQL systems.
The motivation for this write-up is rather simple: in the LATC Support Action we face the question ‘What database should I use to manage my RDF data?’ almost on a daily basis. So, rather than pulling the information together again and again I decided to invest some more time and start collecting stuff in a more structured form.
I currently see essentially three options for managing and processing Linked Data: dedicated RDF stores; RDB-based solutions with built-in RDF support or RDB2RDF mappings; and, the focus of this article, solutions based on NoSQL.
Some of the features I’d expect to find in a NoSQL system concerning Linked Data processing (in no particular order) are:
Most of the above features are currently not widely supported by the NoSQL systems and the respective plug-ins discussed above, so my conclusion for the time being is that NoSQL-based solutions for Linked Data still have a lot of room for improvement.
Appendix
Acknowledgements
A big thank you to all the wonderful people who write about their experiences regarding NoSQL and RDF, as well as to the LATC crew and clients for their support and motivation. I’d like to thank the following people for feedback and comments: Friedrich Lindenberg, KevBurnsJr, Luca Garulli, peco, and Anders Nawroth.
The work on this article has been enabled through funding received from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 256975, LOD Around-The-Clock (LATC) Support Action.
License
This article is licensed under the Creative Commons Attribution 3.0 Unported License.
Change Log