CouchDB overview
Advanced Databases course, ULB
by Irina Kameniar and Alexey Grigorev
2013
Table of Contents
Document-Oriented Storages and Documents
CouchDB and Relational Databases
Apache CouchDB is a document-oriented NoSQL database that uses JSON to store data and JavaScript-based MapReduce for querying. For access it provides an HTTP API, which means it can be used from nearly every programming language. CouchDB implements Multi-Version Concurrency Control to avoid locking and uses an optimistic approach to conflict resolution, leaving users to decide what to do with conflicts.
In this work we want to make an overview of CouchDB. In particular, we want to focus on
For the practical part of this work we want to illustrate its main features with examples as well as to create a simple toy application to better understand this technology.
This work is organized as follows. First, we give an overview of what CouchDB is, its core and API, and what it means for it to be a document-oriented storage; then we describe the MapReduce approach for querying data. The next part is devoted to replication, conflict management and concurrency issues. Finally, we describe how to create a simple application in CouchDB.
CouchDB’s key features are:
A document is the central data structure in CouchDB, and it uses JSON to store documents
Each document has an id, which must be unique per database. Usually the best ids are UUIDs (Universally Unique Identifiers: random strings with an extremely low collision probability[1]), but in general an id can be anything
How CouchDB works on a single machine
It consists of two components:
To check if it's running, send a GET request to this address using curl[3]:
> curl -X GET http://localhost:5984/
If you see the following reply, everything works:
{
  "couchdb": "Welcome",
  "uuid": "2af023889ce22a70de68547c956e273a",
  "version": "1.4.0",
  "vendor": {
    "version": "1.4.0",
    "name": "The Apache Software Foundation"
  }
}
(here and henceforth formatted for better readability)
To get the list of all available databases, issue a GET request to "_all_dbs":
curl -X GET http://localhost:5984/_all_dbs
To create a new database, issue a PUT request to the URL of the database you want to create:
curl -X PUT http://localhost:5984/new_database
When an operation is successful, it replies with
{"ok":true}
Adding
To add a new document, we issue a PUT request to url/{database_name}/{document_id}
Since the schema is not rigid, we may put anything we want in it, for example:
> curl -X PUT http://localhost:5984/new_database/super_toaster -d '{"title":"toaster","price":"10$"}'
{"ok":true,"id":"super_toaster","rev":"1-8f71d392bd5139ba142eb87ea52096d7"}
It returns the id of the newly added document plus its revision id.
To retrieve this document, use the same url:
> curl -X GET http://localhost:5984/new_database/super_toaster
{
  "_id": "super_toaster",
  "_rev": "1-8f71d392bd5139ba142eb87ea52096d7",
  "title": "toaster",
  "price": "10$"
}
Note that we don't have to specify the id in the document body; CouchDB takes care of adding it itself.
Mechanisms behind versioning and revisions will be discussed below.
You can easily generate a lot of JSON data with http://json-generator.appspot.com/, and it's easy to bulk-post it to CouchDB[4]. For that we have prepared 80k+ lines of JSON (1500 documents) with user data to be inserted into the database (available at http://goo.gl/jkcCim)
To create this database execute the following:
# create a database "users"
curl -X PUT http://localhost:5984/users/
# download database data into "database.json"
wget http://goo.gl/jkcCim --no-check-certificate -O test-database.json
# use bulk post to add your data to CouchDB
curl -X POST http://localhost:5984/users/_bulk_docs -H "Content-Type:application/json" -d @test-database.json
# at this point, CouchDB will answer with a list of newly added ids
You don't have to interact with CouchDB only via HTTP requests: there is a web application for managing the database through a web browser, called Futon, which comes along with CouchDB. To access it, open your browser and go to http://localhost:5984/_utils/
With Futon you can create databases and explore existing ones
Let's check the user data we have inserted: click on "users"
To see what's inside a document, just click on it
There are two options:
1) to see formatted version of JSON
2) or raw JSON data
Main core components:
A design document is a special type of document that contains application code. Design documents also live inside the database, but they are highly structured. Otherwise they are very similar to usual documents: they can be replicated, and they have an id and a revision id.
Virtual Documents
We typically want to fetch all the data we need to display in one request, so it makes sense to store related records together; and if there is a need for joining, we want to pre-compute it. For that there is a technique called "virtual documents", which uses views to collate data together
A design document starts with a special prefix "_design/".
A design document may contain:
Validation is a powerful tool to ensure that only the documents you want end up in your database.
A design document may contain a function "validate_doc_update". It is used to prevent invalid or unauthorized updates; for example, so that only authorized users can add blog posts.
Validation functions, like all other CouchDB functions, must not have any side effects, and they run in isolation; they can also block invalid updates from other CouchDB instances during replication. The function is executed each time a document is added or updated. If it raises an exception, the update is rejected; otherwise it is accepted.
Validation is optional: if there is no such function, every update gets accepted.
A design document may contain only one validation function, but if you have several design documents, all their validation functions are executed on a write request. If at least one of them rejects, the update is rejected.
NB: order of execution is not defined, so you must not make any assumptions about it
function(newDoc, oldDoc, ctx) {
  // some logic
  if (/* validation */) {
    throw({unauthorized: 'some message'})
  }
}
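To illustrate how several validation functions combine, here is a small local sketch (plain JavaScript, not CouchDB internals; all function names are our own, for illustration): each design document contributes one validator, any one of them can veto an update by throwing.

```javascript
// Sketch (not CouchDB internals): how multiple validate_doc_update
// functions combine -- if any one throws, the whole update is rejected.

function runValidations(validators, newDoc, oldDoc, ctx) {
  for (var i = 0; i < validators.length; i++) {
    validators[i](newDoc, oldDoc, ctx); // throws on rejection
  }
  return true; // accepted only if every validator passed
}

// Two hypothetical validators from two design documents:
var requireTitle = function (newDoc) {
  if (!newDoc.title) throw { forbidden: 'title is required' };
};
var requireAuthor = function (newDoc, oldDoc, ctx) {
  if (ctx && ctx.name === null) throw { unauthorized: 'please log in' };
};

var ok = runValidations([requireTitle, requireAuthor],
                        { title: 'toaster' }, null, { name: 'bob' });
// ok === true; a document without a title would be rejected
```

Note that, as stated above, the order in which the real validators run is undefined, which is fine: whether an update is accepted does not depend on the order.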
Types are needed to ensure that documents have the proper shape, i.e. contain all required fields. Assigning a type field to each document is a common pattern. It is not part of CouchDB, and it is up to the user to decide whether to include type fields or not. It is also quite convenient in MapReduce queries
Consider the following validation function:
function(newDoc, oldDoc, ctx) {
  if (newDoc.type == "post") {
    // validate post
  }
  if (newDoc.type == "comment") {
    // validate comment
  }
}
For Relational Databases you can issue any query, and as long as your data is structured correctly, you'll get an answer.
However, documents aren't always as structured as relations in Relational Databases, and so we need a different approach. For CouchDB this approach is MapReduce.
MapReduce[5] is a paradigm of parallel computation.
A user has to provide two functions that will operate on all data:
These functions provide CouchDB with great flexibility: they can adapt to various document structures.
So a view is a combination of a map and a reduce function.
Views allow for parallel and incremental computation (described below). Since MapReduce produces key-value pairs, view results are stored in a B-Tree (like documents), but in their own file.
Views can be used for:
View functions are stored inside "views" field of a design document. Once you create a view, you query it to get results.
Map is applied to each document and emits zero or more key/value pairs - view rows. A map function doesn't depend on any information outside of the document, which allows CouchDB views to be generated incrementally and in parallel. View rows are sorted by key in a B-Tree, which makes range retrievals efficient. When writing a map function, your goal is to build an index that stores related data records under nearby keys.
Writing a map function
Map functions must not have any side effects. A map function may fail during execution - CouchDB recovers from this easily. That doesn't mean your functions should fail systematically, so it is important to check that the fields you want to use actually exist.
Keys in the result returned by a map function can be anything, including arrays that compose several keys. A map function takes one parameter, "doc", which refers to a document from the database. The emit function is to be used inside map; it takes two arguments, a key and a value, and may be called multiple times (including zero times). Emitted pairs are stored in the B-Tree
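To make the mechanics concrete, here is a small local simulation (plain JavaScript, not CouchDB itself) of how the emitted key/value pairs form a sorted index. In real CouchDB a map function takes only (doc) and emit is provided globally; here we pass emit as a second argument so the sketch is self-contained.

```javascript
// Local sketch of the map step: run a map function over every document,
// collect the emitted pairs and keep them sorted by key -- roughly what
// CouchDB stores in the view B-Tree.

function buildIndex(docs, mapFn) {
  var rows = [];
  var emit = function (key, value) { rows.push({ key: key, value: value }); };
  docs.forEach(function (doc) {
    mapFn(doc, emit); // real map functions take only (doc); emit is global there
  });
  rows.sort(function (a, b) { return a.key < b.key ? -1 : a.key > b.key ? 1 : 0; });
  return rows;
}

var docs = [
  { name: 'Mamie Burns', isActive: true },
  { name: 'Elvira Adams', isActive: true },
  { name: 'Stacie Aguirre', isActive: false }
];
var rows = buildIndex(docs, function (doc, emit) {
  if (doc.isActive) emit(doc.name.split(' ')[1], doc.name);
});
// rows: [{key: 'Adams', ...}, {key: 'Burns', ...}] -- sorted by last name
```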
Examples
For our generated database (see Generating a database) we want to retrieve all active female users with at least 3 friends
function(doc) {
  if (doc.isActive && doc.gender == 'female' && doc.friends.length >= 3) {
    emit(null, doc);
  }
}
This gives us output sorted by document id, which looks unordered.
Since the results are sorted by the keys emitted by a map function, to order the result by the last name of a user we pass the last name as the first argument of the emit function:
function(doc) {
  if (doc.isActive && doc.gender == 'female' && doc.friends.length >= 3) {
    var lastName = doc.name.split(" ")[1];
    emit(lastName, doc);
  }
}
Note that using JavaScript gives us a lot of flexibility, and we can transform our output as we want.
Incremental Computation of Map Results
A map function runs through all records when you first query the view. A call to emit creates an entry in the view results, where everything is sorted by key. Index entries for each document can be computed independently and in parallel. If a document is changed, the map function is re-run only for this single document, to recompute its keys and values. If a document is deleted, the corresponding entries are marked invalid and don't show up in the results
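The incremental update described above can be sketched locally (a simplification of our own, not CouchDB's actual code): the index remembers which rows came from which document, so changing one document only re-runs map for that document.

```javascript
// Sketch of incremental view maintenance: the index keeps the rows per
// document, so an update replaces only that document's rows and a delete
// removes them.

function View(mapFn) {
  this.mapFn = mapFn;
  this.byDoc = {}; // docId -> rows emitted for that document
}
View.prototype.update = function (doc) {
  var rows = [];
  var emit = function (key, value) { rows.push({ key: key, value: value }); };
  this.mapFn(doc, emit);
  this.byDoc[doc._id] = rows; // old rows for this doc are replaced
};
View.prototype.remove = function (docId) {
  delete this.byDoc[docId]; // deleted docs drop out of the results
};
View.prototype.rows = function () {
  var all = [];
  for (var id in this.byDoc) all = all.concat(this.byDoc[id]);
  return all.sort(function (a, b) { return a.key < b.key ? -1 : 1; });
};

var view = new View(function (doc, emit) { emit(doc.name, null); });
view.update({ _id: '1', name: 'Adams' });
view.update({ _id: '2', name: 'Burns' });
view.update({ _id: '1', name: 'Zimmer' }); // only doc 1 is re-mapped
view.remove('2');
// view.rows() now contains a single row with key 'Zimmer'
```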
Consider the following view function:
function(doc) {
  if (doc.isActive && doc.gender == 'female' && doc.friends.length >= 3) {
    var lastName = doc.name.split(" ")[1];
    emit(lastName, {"name": doc.name, "email": doc.email});
  }
}
It outputs the names and emails of all active female users with at least 3 friends and sorts the result by their last names. We save this view in a design document "females" under the name "byLastName".
To query a view, use the following url:
> curl -X GET HOST/db/_design/{design_document}/_view/{view_name}
This returns all rows from the view.
For our example it is
> curl -X GET http://localhost:5984/users/_design/females/_view/byLastName
{
  "total_rows": 353,
  "offset": 0,
  "rows": [
    {
      "id": "d2a8c788-665a-4475-a123-56cef7a03dbe",
      "key": "Adams",
      "value": {
        "name": "Elvira Adams",
        "email": "elviraadams@centuria.com"
      }
    },
    /* remaining 352 rows are truncated */
  ]
}
You can also pass a key parameter to the view:
> curl -X GET 'HOST/db/_design/{design_document}/_view/{view_name}?key="{key}"'
where "{key}" is a key we used in an emit call.
Example
> curl -X GET 'http://localhost:5984/users/_design/females/_view/byLastName?key="Burns"'
{
  "total_rows": 353,
  "offset": 41,
  "rows": [
    {
      "id": "ec8b0f81-4507-4a7e-8aa8-e9dbf9e57089",
      "key": "Burns",
      "value": {
        "name": "Elvia Burns",
        "email": "elviaburns@opticon.com"
      }
    },
    {
      "id": "ee6e6491-7e3b-48a7-8fda-241e571f1b4c",
      "key": "Burns",
      "value": {
        "name": "Mamie Burns",
        "email": "mamieburns@indexia.com"
      }
    },
    /* 1 record is not shown */
  ]
}
If we want to specify several keys, we use the keys parameter.
However, the value passed to this parameter has to be a properly URL-encoded JSON string, which is not very convenient or readable. There is an alternative syntax for querying: issue a POST request with all needed parameters in the request body:
> curl -X POST HOST/db/_design/{design_document}/_view/{view_name} -d '{...}'
For example, to see only the users with last name "Burns":
$ curl -X POST http://localhost:5984/users/_design/females/_view/byLastName -d '{"keys": ["Burns"]}'
We can also ask for the result over a key range:
> curl -X GET 'HOST/db/_design/{design_document}/_view/{view_name}?startkey="abc"&endkey="zzz"'
With our view, suppose we want to list all users with last names between "Adkins" and "Aguirre":
$ curl -X GET 'http://localhost:5984/users/_design/females/_view/byLastName?startkey="Adkins"&endkey="Aguirre"'
{
  "total_rows": 353,
  "offset": 1,
  "rows": [
    {
      "id": "128a8c1c-8422-4cb4-8fd0-6b82503eb96d",
      "key": "Adkins",
      "value": {
        "name": "Mallory Adkins",
        "email": "malloryadkins@qaboos.com"
      }
    },
    /* 1 record is not shown */
    {
      "id": "0c55ee3b-1f22-40e3-a9f1-57b0b2f8742d",
      "key": "Aguirre",
      "value": {
        "name": "Stacie Aguirre",
        "email": "stacieaguirre@velity.com"
      }
    }
  ]
}
Note that the range is inclusive, that is, records with the endkey are also included in the result.
If we want to start from the beginning, we simply don't specify startkey at all. The same applies to endkey: if we want the result to run till the end, we just leave it out of the query.
It is also possible to output the result in descending order:
> curl -X GET 'HOST/db/_design/{design_document}/_view/{view_name}?startkey="zzz"&endkey="aaa"&descending=true'
Note that in this case the start and end keys swap roles: we must pass the greater key as startkey and the smaller one as endkey.
> curl -X GET 'http://localhost:5984/users/_design/females/_view/byLastName?startkey="Alford"&endkey="Aguirre"&descending=true'
{
  "total_rows": 353,
  "offset": 348,
  "rows": [
    {
      "id": "0c55ee3b-1f22-40e3-a9f1-57b0b2f8742d",
      "key": "Aguirre",
      "value": {
        "name": "Stacie Aguirre",
        "email": "stacieaguirre@velity.com"
      }
    },
    /* 1 record is not shown */
    {
      "id": "128a8c1c-8422-4cb4-8fd0-6b82503eb96d",
      "key": "Adkins",
      "value": {
        "name": "Mallory Adkins",
        "email": "malloryadkins@qaboos.com"
      }
    }
  ]
}
Lastly, there are two important parameters for pagination: limit, the number of items shown in the result, and skip, how many records to skip before starting to output the result.
Suppose that we show 10 items per page; this is our limit. The value for skip can then be calculated with the formula skip = (page - 1) * limit.
For page #2 skip is 10, and thus we have the following request:
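The skip formula can be packaged as a tiny helper (our own illustration; page numbers are 1-based, and the view name below is the one from our example):

```javascript
// Compute query parameters for a given page; skip = (page - 1) * limit.
function pageParams(page, limit) {
  return { limit: limit, skip: (page - 1) * limit };
}

var p = pageParams(2, 10); // { limit: 10, skip: 10 }
var url = '/users/_design/females/_view/byLastName' +
          '?limit=' + p.limit + '&skip=' + p.skip;
```

A caveat worth knowing: large skip values are expensive, because the rows being skipped are still scanned; for deep pagination the recommended pattern is to pass the last key of the previous page as startkey instead.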
> curl -X GET 'http://localhost:5984/users/_design/females/_view/byLastName?limit=10&skip=10'
For more details see http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
After the map phase, CouchDB runs a series of reduce calls, one for each group returned from map. These functions operate on the sorted rows emitted by the map functions.
function(keys, values, rereduce) {
  return sum(values);
}
It takes three parameters, which we describe below; one of them, rereduce, may be omitted.
So we might write this query:
function(keys, values) {
  return sum(values);
}
In CouchDB, reduce functions take advantage of B-Tree properties. The view result is a pre-order traversal[6] of the tree. For every leaf node there is a chain of internal nodes reaching back to the root, so reduce runs on every leaf node first, and then on every intermediate node going up to the root. As a consequence, the end result of reduce is cached and can be updated incrementally on changes to the data (first the map results are recalculated, then reduce). When reduce runs on leaves, the rereduce parameter is false; for inner nodes it is true, meaning the function receives intermediate results from downstream nodes.
So reduce is also computed incrementally: reduce results are cached in the intermediate nodes of the B-Tree.
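The reduce/rereduce mechanism can be sketched locally (a simplification, not CouchDB's engine): reduce first runs on groups of leaf rows (rereduce = false), then the same function combines the cached partial results (rereduce = true).

```javascript
// Local sketch of reduce/rereduce. For a sum, leaf values and partial
// sums are combined the same way, so the rereduce flag can be ignored.
function reduceSum(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

var leaves = [1, 2, 3, 4, 5, 6];
// pretend the B-Tree holds the rows in two leaf nodes:
var partial1 = reduceSum(null, leaves.slice(0, 3), false); // 6
var partial2 = reduceSum(null, leaves.slice(3), false);    // 15
// an inner node combines cached partial results:
var total = reduceSum(null, [partial1, partial2], true);   // 21
```

This is why a reduce function should be commutative and associative. A function like sum(values) / values.length is only safe while rereduce never fires; a robust average has to carry both the sum and the count through the rereduce step.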
Example
Suppose we want to calculate the average balance of all active female users with at least 3 friends. Here is our view:
function(doc) {
  if (doc.isActive && doc.gender == 'female' && doc.friends.length >= 3) {
    // strip the leading "$" and the thousands separator, e.g. "$2,097.16"
    var balance = doc.balance.replace(',', '').slice(1);
    emit(null, parseFloat(balance));
  }
}
function(keys, values) {
  return sum(values) / values.length;
}
The result is a single value. It is also possible to calculate the average per group. Say we want to see the average balance per first letter of the user's last name:
function(doc) {
  if (doc.isActive && doc.gender == 'female' && doc.friends.length >= 3) {
    var balance = doc.balance.replace(',', '').slice(1);
    var lastName = doc.name.split(" ")[1];
    var firstLetter = lastName[0];
    emit(firstLetter, parseFloat(balance));
  }
}
function(keys, values) {
  return sum(values) / values.length;
}
Now to query this view we issue the following GET request:
> curl -X GET http://localhost:5984/users/_design/females/_view/avg_sal_firstletter?group=true
Note that we pass the parameter group set to true, so that the reduce results are grouped by key
{
  "rows": [
    {"key": "A", "value": 2865.6666666666665},
    {"key": "B", "value": 2502.2370370370372},
    {"key": "C", "value": 2695.6999999999998},
    /* 15 records are not shown */
    {"key": "T", "value": 2146},
    {"key": "U", "value": 2746},
    {"key": "V", "value": 2176.5999999999999},
    {"key": "W", "value": 2561.4210526315787},
    {"key": "Y", "value": 1471.6666666666667},
    {"key": "Z", "value": 3480}
  ]
}
If we don't pass the group parameter, the default value (false) is used, and the result of the reduce phase is not grouped:
> curl -X GET http://localhost:5984/users/_design/females/_view/avg_sal_firstletter
{
  "rows": [
    {"key": null, "value": 2590.2145937493101}
  ]
}
In this case the result is average of all values in the database.
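What group=true changes can be simulated locally (our own sketch, not CouchDB's query engine): with grouping, the reduce function runs once per distinct key; without it, once over all rows.

```javascript
// Sketch of the group parameter: group=true reduces per distinct key,
// group=false reduces over the whole result.
function query(rows, reduceFn, group) {
  if (!group) {
    return [{ key: null, value: reduceFn(rows.map(function (r) { return r.value; })) }];
  }
  var byKey = {};
  rows.forEach(function (r) {
    (byKey[r.key] = byKey[r.key] || []).push(r.value);
  });
  return Object.keys(byKey).sort().map(function (k) {
    return { key: k, value: reduceFn(byKey[k]) };
  });
}

var avg = function (values) {
  return values.reduce(function (a, b) { return a + b; }, 0) / values.length;
};
var rows = [
  { key: 'A', value: 2 }, { key: 'A', value: 4 },
  { key: 'B', value: 10 }
];
// grouped:   [{key: 'A', value: 3}, {key: 'B', value: 10}]
// ungrouped: [{key: null, value: 16/3}]
```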
For more information about views and their API, you may refer to CouchDB's wiki page http://wiki.apache.org/couchdb/HTTP_view_API which gives a more detailed description.
Replication is a mechanism that synchronizes two or more database instances.
Reasons for doing replication:
Three properties that you can scale
Read Requests
Read: process an HTTP request, open a socket, find where the data is stored. This takes processing time, and responses to these requests may be cached. If your users generate more requests than you can handle, you need to add an additional server and make sure all users read the same data.
Write Requests
Same steps as for reads, but responses cannot be cached. If you have several servers, a write must eventually occur on all of them.
Data (when the amount grows too large)
Replication is the basis for scaling all three.
Distributed systems operate over some network, and networks are often segmented. Eventual consistency means that data on all the servers will be consistent eventually, but the database (as a single unit) is always available. For CouchDB “availability” means being always available for writes: even if the current version of a document is in conflict, it still can accept new writes to the data. And it is up to the application how to resolve the conflicts at the later time. This concept is tightly related to Conflict Resolution which is discussed in the next section.
The opposite of the Eventual Consistency model is Strong Consistency. In this model, synchronization and replication between all database servers happen with each transaction. That is, a transaction is executed on all the servers and considered successful only when every server has replied that it has committed successfully. This approach is widely used for Relational Databases and has proven hard to scale, and therefore many NoSQL solutions sacrifice consistency in favor of better scalability. For more details please refer to Amazon's Dynamo paper [DeCandia et al, 2007].
Data is kept locally; there is no need for constant network access to communicate with other CouchDB instances. Synchronization happens whenever possible (e.g. when a network connection appears).
Replication in CouchDB works incrementally; only differences are replicated, not whole databases. If something goes wrong during replication, the next run resumes from the point of interruption once the problem is fixed.
Note that replication is unidirectional (from source to target). If you want bidirectional replication, run it twice, swapping the source and the target for the second run.
Incremental Replication
CouchDB achieves eventual consistency by incremental replication: the process in which all document changes are periodically copied between servers. The result is a "shared nothing" cluster of databases, with each node being independent and self-sufficient: there is no single point of failure in the system. Changes can be propagated in any way we like, and after replication each server can continue working independently.
This is how it works
To scale the system we just add another server
Schematically we may show replication like that:
When the replication process runs for the first time, it compares the two servers and obtains a list of changed documents, which includes:
Documents that exist both on source and on target are not transferred (only differences are moved).
Databases in CouchDB have a sequence number. It gets incremented on every change, and the database remembers which change was associated with which sequence number, so calculating the difference between source and target is efficient.
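This is the idea behind CouchDB's changes feed, sketched here in a simplified form (our own illustration; the real feed keeps only the latest change per document): every write bumps the sequence number, and "what changed since seq N" becomes a cheap lookup.

```javascript
// Sketch of a per-database sequence number: every change is logged with
// an increasing seq, so a replicator only needs "changes since my last
// checkpoint" instead of comparing whole databases.

function Db() { this.seq = 0; this.log = []; }
Db.prototype.write = function (id) {
  this.seq += 1;
  this.log.push({ seq: this.seq, id: id });
};
Db.prototype.changesSince = function (since) {
  return this.log.filter(function (c) { return c.seq > since; });
};

var source = new Db();
source.write('a'); // seq 1
source.write('b'); // seq 2
var checkpoint = source.seq; // the replicator remembers this per target
source.write('c'); // seq 3
// the next replication run only looks at changes after the checkpoint:
var pending = source.changesSince(checkpoint); // [{seq: 3, id: 'c'}]
```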
If the replication process is interrupted, the target database may be left in an inconsistent state. But if you trigger the replication again, it continues from the moment of interruption.
To synchronize two databases we issue a simple PUT request where we specify
CouchDB figures out which documents and revisions are on the source but not yet on the target, and transfers them to the target:
> curl -X PUT http://localhost:5984/_replicate -d '{"source":"users","target":"users_replica"}'
The database replies with some statistics and tells if it was successful or not
NB: the request for replication will stay open till the replication process finishes, so it may take a while.
This is called local replication because both source and target are local. It is useful for backups, taking snapshots, etc. The "source" and "target" parameters are URLs; names without a host are relative to the server.
There are remote replications as well:
> curl -X PUT http://localhost:5984/_replicate -d '{"source":"users","target":"http://localhost:5985/users"}'
> curl -X PUT http://localhost:5984/_replicate -d '{"source":"http://localhost:5985/users","target":"users"}'
> curl -X PUT http://localhost:5984/_replicate -d '{"source":"http://localhost:5985/users","target":"http://localhost:5984/users"}'
A reply for a replication request may look like the following:
{
  "ok": true,
  "source_last_seq": 10,
  "session_id": "c7a2bbbf9e4af774de3049eb86eaa447",
  "history": [
    {
      "session_id": "c7a2bbbf9e4af774de3049eb86eaa447",
      "start_time": "Wed, 20 Nov 2013 19:30:46 GMT",
      "end_time": "Wed, 20 Nov 2013 19:30:47 GMT",
      "start_last_seq": 0,
      "end_last_seq": 1,
      "recorded_seq": 1,
      "missing_checked": 0,
      "missing_found": 1,
      "docs_read": 1,
      "docs_written": 1,
      "doc_write_failures": 0
    }
  ]
}
Some details:
Continuous Replication
This was manual replication: replication that is triggered by hand and happens just once. But it is possible to set up continuous replication: replication that listens for changes and propagates them as they happen. To do that, add "continuous": true to the replication request. Note that you need to issue this request again every time your server restarts (it doesn't remember the replication settings from previous runs).
In Futon replication is simple: just click on "Replication"
In a typical relational database, when we modify a table we take a lock, and all other clients that want to access the table are queued.
This sequential execution of tasks wastes a lot of processor power and time: under high load, the system may spend much of its time figuring out whose turn is next. Instead, CouchDB uses MVCC (Multi-Version Concurrency Control) to manage concurrent access to the data.
This concurrency model allows CouchDB to run effectively even under high load, without queuing requests.
B-Tree (CouchDB uses a variation of a B-Tree[7] called B+Tree[8])
A B-Tree is a sorted data structure that allows searching, insertion and deletion in logarithmic time. A lookup takes O(log N) time, a range query O(log N + K), where K is the number of results.
This data structure is used everywhere in CouchDB, for documents as well as views. Its usage imposes an important restriction: data can be accessed only by key.
The reason is to make huge performance gains possible.
In CouchDB the implementation is a little bit different from original B+Trees. It adds:
Support for MVCC
Append-only design
All documents have versions (like in version control systems such as SVN)
If you want to change a document, you create a new version and save it over the old one. After that there are two versions of the same document. Since the new version is just appended to the database file, read requests don't have to be suspended.
Once a new version is appended, all new requests are routed to this newer version
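The append-only idea can be sketched in a few lines (a deliberately simplified model, not CouchDB's storage engine): updates append a new version instead of overwriting, so a reader that started earlier keeps seeing the version that was current when it started.

```javascript
// Sketch of append-only MVCC: old versions stay in place, readers pin the
// version that was latest when they began, writers just append.

function Store() { this.versions = []; }
Store.prototype.write = function (body) {
  this.versions.push(body);           // append-only: old versions stay
  return this.versions.length - 1;    // index plays the role of a revision
};
Store.prototype.snapshot = function () {
  var rev = this.versions.length - 1; // pin the current latest version
  var versions = this.versions;
  return function read() { return versions[rev]; };
};

var store = new Store();
store.write({ price: '10$' });
var read = store.snapshot();     // a reader starts here...
store.write({ price: '15$' });   // ...a writer appends meanwhile
// the reader still sees the old version; new readers get the new one
// read() -> { price: '10$' }
```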
Updates in CouchDB
Each revision is identified with a new "_rev" value.
If you want to update or delete a document, you must specify the revision you're updating. This ensures that you will not overwrite somebody else's update.
Suppose you try to update a document without providing the revision id:
> curl -X PUT http://localhost:5984/new_database/super_toaster -d '{"title":"toaster","price":"15$"}'
CouchDB responds with an error:
{"error":"conflict","reason":"Document update conflict."}
So we add the revision id to the document we're updating:
> curl -X PUT http://localhost:5984/new_database/super_toaster -d '{"title":"toaster","price":"15$","_rev":"1-8f71d392bd5139ba142eb87ea52096d7"}'
This time the database replies with "ok" and a new revision update:
{"ok":true,"id":"super_toaster","rev":"2-9c85d3c3324c3777a4665f00330b73b5"}
The same applies to deletes. Since a delete is just an update that sets a special flag to true, CouchDB also needs to know which revision it deletes:
> curl -X DELETE http://localhost:5984/new_database/super_toaster?rev=2-9c85d3c3324c3777a4665f00330b73b5
A conflicting change is a change that occurs simultaneously in two or more replicas. This happens regularly in distributed databases.
A document conflict means that there are now two latest revisions of the same document.
CouchDB detects a conflicting change in a document and signals it by setting the "_conflicts" field. When there are two current revisions of the same document, CouchDB has to choose one winning revision: the revision that will be returned as the latest. The losing revisions aren't deleted; they are stored as well, as previous revisions.
CouchDB doesn't attempt to reconcile the conflicting changes: it ensures that all conflicts are detected, but it's up to the application to deal with them. Essentially this is the same mechanism as used by SVN[9] and other popular version control systems.
Consider replication from A to B (assuming triggered replication, not continuous), in direction A→B only (not B→A).
All other types of replication reduce to these steps.
To see if we have any conflicts we may use this simple view:
function(doc) {
  if (doc._conflicts) {
    emit(doc._conflicts, null)
  }
}
"_conflict" is an array that contains all conflicting revisions
CouchDB uses a deterministic algorithm to ensure that each CouchDB instance will come up with the same winning and losing revision.
Note that your application should never depend on the details of this algorithm; treat the winner as an arbitrary choice.
Algorithm:
for example, "2-de..." wins over "2-7e..."
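The tie-breaking rule can be sketched as follows (a simplification: CouchDB's real algorithm also takes the full revision history into account, not just the rev strings; the truncated "2-de..." revs are the placeholders from the example above):

```javascript
// Sketch of a deterministic winner pick consistent with the example above:
// the higher revision number wins; on a tie, the ASCII-greater rev string.

function pickWinner(revs) {
  return revs.slice().sort(function (a, b) {
    var na = parseInt(a, 10), nb = parseInt(b, 10); // numeric prefix before '-'
    if (na !== nb) return nb - na;       // higher revision number wins
    return a < b ? 1 : a > b ? -1 : 0;   // tie: ASCII-greater rev string wins
  })[0];
}

var winner = pickWinner(['2-7e...', '2-de...']); // '2-de...' ('d' > '7' in ASCII)
```

Since every instance applies the same rule to the same set of revisions, all instances agree on the winner without any coordination.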
If we don't agree with CouchDB's automatic choice, we delete one revision and keep the other:
> curl -X DELETE $HOST/database/document_id?rev=2-de...
This returns a new revision (remember that a delete is also an update)
Next, we put the values we want to keep back to the database, specifying the revision we like
> curl -X PUT $HOST/database/document_id -d '{..., "_rev":"2-7e..."}'
It also responds with a new revision ID.
This way we resolve the conflict
Now we need to replicate B → A, so both instances are synchronized.
Let's have a look at a typical revision ID:
3-dad88c6c6a0df7f0e09e1e2d0d145eeb
3 - an integer, the current version number; it gets incremented with each update
dad88c6c6a0df7f0e09e1e2d0d145eeb - an md5 hash over a set of properties: the JSON body, attachments, and the "_deleted" flag
It means that:
> curl -X PUT $HOST/db/a -d '{"a":1}'
{"ok":true,"id":"a","rev":"1-23202479633c2b380f79507a776743d5"}
> curl -X PUT $HOST/db/b -d '{"a":1}'
{"ok":true,"id":"b","rev":"1-23202479633c2b380f79507a776743d5"}
When there is a conflict, the history branches into a tree. Each branch can extend its own history independently. The leaf documents of the tree are the set of conflicting revisions; in this case these are r4a, r3b, r2c.
The way to resolve a conflict:
Note that when we delete a record, another revision is added to the revision tree, and the deleted record still exists as a "deleted" node. It is still possible to retrieve this record, but it is marked with the "_deleted" flag set to true.
Afterwards, during compaction, data from non-leaf nodes is removed.
There is also a mechanism for "pruning" the revision tree to prevent it from growing too large.
As a showcase we have chosen to implement a small blog on top of CouchDB.
Our schema is as follows:
For generating a database we used the same approach as described in Generating a database. That is:
However, this time we created two types of documents: articles and comments on these articles. To distinguish between them we used the approach described in Types: article documents got type "article" and comment documents type "comment".
To retrieve all articles sorted by date of creation (descending):
function(doc) {
  if (doc.type == "article") {
    var dateTimeStr = doc.created.split(' ')[0];
    var date = Date.parse(dateTimeStr);
    // deep-copy the document so that we can truncate the content
    var copy = JSON.parse(JSON.stringify(doc));
    copy.content = copy.content.slice(0, 100) + '...';
    // negating the timestamp makes the default (ascending) key order descending
    emit(-date, copy);
  }
}
We save it as the "articles" view of the "queries" design document in the "blog" database.
Thus the url for querying this view is /blog/_design/queries/_view/articles?limit={limit}&skip={page * limit}
To retrieve a certain article, we might just query it directly by id. However, we also want to show the related comments, and therefore we need to join articles with comments.
To do that we use the technique called "view collation":
function(doc) {
  if (doc.type == "article") {
    emit([doc._id, 'a'], doc);
  } else if (doc.type == "comment") {
    emit([doc.article_id, 'b'], doc);
  }
}
That is, as the key we output the article id - the key we join on - together with an artificially created "tag" that distinguishes the two entities. Since in views all data is sorted by key, we can just issue a range query (see the Map section and Querying Views) to obtain the needed result.
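The collation join can be simulated locally (our own sketch; real CouchDB sorts array keys element by element with its own collation rules, approximated here by joining with a separator): rows sort by [article_id, tag], so a range over one article id yields the article immediately followed by its comments.

```javascript
// Local sketch of view collation: article rows get key [id, 'a'],
// comment rows [article_id, 'b']; after sorting, each article is
// directly followed by its comments.

function collate(docs) {
  var rows = [];
  docs.forEach(function (doc) {
    if (doc.type === 'article') rows.push({ key: [doc._id, 'a'], value: doc });
    else if (doc.type === 'comment') rows.push({ key: [doc.article_id, 'b'], value: doc });
  });
  rows.sort(function (x, y) {
    var a = x.key.join('\u0000'), b = y.key.join('\u0000');
    return a < b ? -1 : a > b ? 1 : 0;
  });
  return rows;
}

var docs = [
  { _id: 'p1', type: 'article', title: 'Hello' },
  { _id: 'c1', type: 'comment', article_id: 'p1', text: 'First!' },
  { _id: 'p2', type: 'article', title: 'Bye' },
  { _id: 'c2', type: 'comment', article_id: 'p1', text: 'Nice' }
];
// the equivalent of the startkey/endkey range query for article 'p1':
var rows = collate(docs).filter(function (r) { return r.key[0] === 'p1'; });
// rows: the 'p1' article first, then its two comments
```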
In this case we save this view as "blog/_design/queries/_view/fullArticle", and the url for querying is built as follows:
start = ["{id}","a"]
end = ["{id}","b"]
url = /blog/_design/queries/_view/fullArticle?startkey={start}&endkey={end}
That query returns a sequence where an article is followed by its comments.
To edit a posted article, we delete the current one (having its id and revision id) and replace it with the edited article; the id remains the same:
var url = '/blog/' + currentID + "?rev=" + currentRev;
$.ajax({
  url: url,
  type: 'DELETE',
  success: function (data) {
    var urlid = '/_uuids';
    $.ajax({
      url: urlid,
      dataType: 'json',
      success: function (data) {
        var uuid = data.uuids[0];
        var url2 = '/blog/' + currentID;
        var article = $("#editform").serializeObject();
        article.type = 'article';
        article.article_id = currentID;
        article.created = new Date();
        article.tags = article.tags.split(',');
        $.ajax({
          url: url2,
          type: 'PUT',
          data: JSON.stringify(article),
          dataType: 'json',
          success: function (data) {
            alert("Article was updated");
            location.reload();
          }
        });
      }
    });
  }
});
To add a new comment, we dynamically create a form and insert the comment into the database via an ajax request, using the current article id:
var article = $("#commentform").serializeObject();
article.type = 'comment';
article.article_id = currentID;
article.created = new Date();
$.ajax({
  url: url,
  type: 'PUT',
  data: JSON.stringify(article),
  dataType: 'json',
  success: function (data) {
    alert("Comment was added");
  }
});
To add a new post we created a separate page with a content form:
To insert the new post into the database we used the following request:
// uuid is obtained beforehand via a GET request to /_uuids
var url = '/blog/' + uuid;
var article = $("#usrform").serializeObject();
article.type = 'article';
article.created = new Date();
article.tags = article.tags.split(',');
$.ajax({
  url: url,
  type: 'PUT',
  data: JSON.stringify(article),
  dataType: 'json',
  success: function (data) {
    alert("Post was added");
  }
});
To collect the values from the form elements we used the function serializeObject:
$.fn.serializeObject = function () {
  var o = {};
  var a = this.serializeArray();
  $.each(a, function () {
    if (o[this.name] !== undefined) {
      // repeated field names are collected into an array
      if (!o[this.name].push) {
        o[this.name] = [o[this.name]];
      }
      o[this.name].push(this.value || '');
    } else {
      o[this.name] = this.value || '';
    }
  });
  return o;
};
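The merging logic of serializeObject (repeated field names become arrays, single names stay scalar) can be exercised without jQuery by feeding it the array form that serializeArray would produce. A sketch with hypothetical field names:

```javascript
// Plain-JS version of the merging step in serializeObject.
function toObject(pairs) {
  var o = {};
  pairs.forEach(function (p) {
    if (o[p.name] !== undefined) {
      // second occurrence of a name: promote the value to an array
      if (!o[p.name].push) {
        o[p.name] = [o[p.name]];
      }
      o[p.name].push(p.value || '');
    } else {
      o[p.name] = p.value || '';
    }
  });
  return o;
}

var result = toObject([
  { name: 'title', value: 'Hello' },
  { name: 'tag', value: 'couchdb' },
  { name: 'tag', value: 'nosql' }
]);
// result: { title: 'Hello', tag: ['couchdb', 'nosql'] }
```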
The blog application itself is also a CouchDB document, which stores (as attachments) all the HTML, CSS and JavaScript files used for displaying the results. The document's id is “app”, and it contains the main index file. Attachments can easily be accessed from a web browser, and our application relies on this: we put all the code into an index.html file and upload it as an attachment.
To access the application, open http://localhost:5984/blog/app/index.html.
Compared with traditional relational databases, CouchDB has the following advantages and disadvantages.
Advantages
Disadvantages
MongoDB and CouchDB are both document-oriented databases. Beyond both storing documents, though, it turns out that they don't have much in common.
Protocol
MongoDB uses a custom binary protocol. CouchDB uses HTTP REST. There's undeniably something nice about CouchDB's approach. Practically anything can be a CouchDB client, which is why not having a shell really isn't a big deal: your normal shell is Couch's shell.
On the flip side, whether you use CouchDB or MongoDB, most of what you do will go through a library, which completely abstracts the underlying protocol. Yes, CouchDB's approach leads to nice and impressive documentation and examples, but it doesn't really change how you code. Ultimately, MongoDB's approach is more flexible, since you can build an HTTP REST interface on top of its binary protocol.
Organization
In CouchDB, you have the concept of a database most people are familiar with, which contains documents. MongoDB has the same concept of a database, but rather than containing documents directly, a database contains one or more collections, which contain your documents. In other words, MongoDB has one extra layer of containers.
Since we are dealing with schemaless documents, nothing prevents us from using a single collection in MongoDB and achieving the same thing.
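Conversely, CouchDB simulates collections with a type field and a filtering view, as in the blog application above. A minimal sketch that exercises such a map function outside CouchDB (we define a local emit to capture the rows):

```javascript
// Collect emitted rows so the map function can be run outside CouchDB.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// View map function: emits only documents of one "collection".
function byType(doc) {
  if (doc.type == "comment") {
    emit(doc._id, doc);
  }
}

byType({ _id: "c1", type: "comment", text: "hi" });
byType({ _id: "a1", type: "article", title: "post" });
// rows now contains only the comment document
```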
Given that you can simulate either approach with either engine, the difference might seem cosmetic; still, between the two, the CouchDB approach seems less fortunate. It isn't just that collections help organize things beyond documents, such as indexes (or views in Couch) and sharding, and provide additional administrative flexibility (backup tools are collection-aware, for example). Nor is it that the collection approach is more efficient, resulting in less wasted space and fewer CPU cycles. It's that collection-per-entity (or table-per-entity) simply maps well onto how most applications are laid out.
In this work we made an overview of Apache CouchDB: what it is, how to use it, and how it addresses the issues of concurrency, replication and conflict management. Additionally, we created a simple application to show how to use CouchDB, and finally, compared it to traditional relational databases and to another document-oriented database, MongoDB.
We conclude that CouchDB is best used in partitioned, high-load environments, thanks to its Eventual Consistency model and lock-free Multi-Version Concurrency Control approach. In other environments it might be more beneficial to use MongoDB, for the reasons discussed in the comparison section.
However, we also see that the main problem of CouchDB, MongoDB and other NoSQL solutions is their lack of maturity. Traditional relational databases have decades of development behind them, and we think that there should be a good reason not to use them. If an application does not need to operate over a highly partitioned network, it is better to use a relational database with its consistency guarantees, rely on a data schema, and be able to use SQL for ad-hoc querying.
[3] curl is a Unix utility for sending HTTP requests, http://curl.haxx.se/
[9] SVN is a version control system for managing source code, see http://en.wikipedia.org/wiki/Apache_Subversion for more details