1 of 89

Understanding the Storage Architecture

2 of 89

Column-oriented databases

  • Column-oriented databases are among the most popular types of non-relational databases.
  • Made famous by Google's engineering efforts
  • Popularized by the growth of social-networking giants like Facebook, LinkedIn, and Twitter, they can rightly be called the flag bearers of the NoSQL revolution

3 of 89

  • Google's publications provided a view into the world of its search-engine success and shed light on the mechanics of large-scale, big data efforts like Google Earth, Google Analytics, and Google Maps.
  • A cluster of inexpensive hardware can hold huge amounts of data, far more than a single machine can, and process it effectively and efficiently within a reasonable timeframe.

4 of 89

Three key themes emerged:

  • Data needs to be stored in a networked filesystem that can expand to multiple machines. Files themselves can be very large and be stored in multiple nodes, each running on a separate machine.
  • Data needs to be stored in a structure that provides more flexibility than the traditional normalized relational database structures. The storage scheme needs to allow for effective storage of huge amounts of sparse data. It needs to accommodate changing schemas without the necessity of altering the tables.

5 of 89

  • Data needs to be processed in a way that computations on it can be performed in isolated subsets of the data and then combined to generate the desired output.
  • This implies computational efficiency when algorithms run on the same nodes where the data resides.
  • It also avoids transferring large amounts of data across the network to carry out the computations.

8 of 89

  • Millions of rows
  • Flexible schema
  • Useful for evolving data: stores each version of a cell value as it evolves
  • Like a 3D spreadsheet, where the third dimension is time
  • Can store complex data in column families (groups of related columns)
  • Friendly storage containers that keep multiple versions of cell data

9 of 89

  • Designed to scale
  • Easily accommodates millions of columns and billions of rows
  • A row-key uniquely identifies a row in the database
  • As data grows, rows are split into bundles that contain contiguous ranges of row-keys

13 of 89

Column Databases as Nested Maps of Key/Value Pairs

  • Example:

{ "row_key_1" : { "name" : { "fname" : "Charan",
                             "lname" : "Raj" },
                  "location" : { "zip" : "23456" },
                  "preference" : { "d/r" : "D" } },

14 of 89

  "row_key_2" : { "name" : { "fname" : "smitha",
                             "mname" : "S",
                             "lname" : "Rao" },
                  "location" : { "zip" : { 1 : "1000001",
                                           5 : "123478" } },
                  "preference" : { "d/r" : { 1 : "D",
                                             5 : "R" } } },
  ...
}

15 of 89

Laying out the Webtable

  • Consider a Webtable that stores copies of crawled web pages.
  • Such a table stores the contents of a web page in addition to attributes that relate to the page.
  • Such attributes can be an anchor that references the page or the MIME types that relate to the content.
  • Google first introduced this example in its research paper on Bigtable.
  • A Webtable uses a reversed web page URL as the row-key for a web page.

16 of 89

  • The URL www.example.com implies a row-key of com.example.www.
  • The row-key determines the sort order of rows in a column-oriented database.
  • Therefore, rows that relate to two subdomains of example.com, like www.example.com and news.example.com, are stored close to each other when the reversed URL is used as the row-key.
  • This makes querying for all content relating to a domain easier.
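As a quick illustration, reversing the host portion of a URL and then sorting the resulting row-keys groups subdomains together. This is a sketch in Python; the `rowkey` helper is our invention for the example, not part of Bigtable or HBase:

```python
def rowkey(url):
    # Hypothetical helper: reverse the host portion of a URL to build a
    # Webtable-style row-key, e.g. www.example.com -> com.example.www
    host = url.split("//")[-1].split("/")[0]
    return ".".join(reversed(host.split(".")))

# Subdomains of example.com now sort next to each other:
keys = sorted(rowkey(u) for u in
              ["www.example.com", "news.example.com", "www.other.org"])
```

Sorted, the keys come out as com.example.news, com.example.www, org.other.www, so all example.com content is contiguous.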


18 of 89

HBASE DISTRIBUTED STORAGE ARCHITECTURE

  • The distributed HBase deployment model is typical of many column-oriented databases
  • HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS).
  • Examples of column-based NoSQL databases include Cassandra, HBase, and Hypertable.

19 of 89

  • A robust HBase deployment involves a few more parts than HBase alone.
  • HBase uses a master-worker pattern.
  • Therefore, there is usually a master and a set of workers, commonly known as range servers (called region servers in HBase).

21 of 89

  • When HBase starts, the master allocates a set of ranges to a range server.
  • Each range stores an ordered set of rows, where each row is identified by a unique row-key.
  • As the number of rows stored in a range grows in size beyond a configured threshold, the range is split into two and rows are divided between the two new ranges.

22 of 89

  • Like most column databases, HBase stores the columns belonging to a column-family together.
  • Therefore, each region maintains a separate store for each column-family in every table.
  • Each store in turn maps to a physical file that is stored in the distributed filesystem.
  • HBase abstracts access to the underlying filesystem with the help of a thin wrapper that acts as the intermediary between the store and the underlying physical file.

24 of 89

  • Each region has an in-memory store, or cache, and a write-ahead-log (WAL).

  • Write-ahead logging (WAL) is a family of techniques for providing atomicity and durability (two of the ACID properties) in database systems

25 of 89

  • WAL is a common technique used across a variety of database systems, including popular relational databases like PostgreSQL and MySQL.

In HBase, a client program can decide to turn the WAL on or off.

Switching it off boosts performance but reduces reliability and recoverability in case of failure.

When data is written to a region, it is first written to the WAL, then to the in-memory store, and then to disk.
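The write path above (log first, apply second) can be sketched in a few lines of Python. This is a toy illustration of the WAL technique, not HBase's implementation; the Region class and its tab-separated log format are invented for the example:

```python
import os
import tempfile

class Region:
    """Toy region: a WAL file on disk, a dict as the in-memory store."""
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.store = {}

    def put(self, key, value):
        # 1. Append the mutation to the write-ahead log and force it to disk...
        with open(self.wal_path, "a") as wal:
            wal.write(f"{key}\t{value}\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. ...only then apply it to the in-memory store.
        self.store[key] = value

    def recover(self):
        # After a crash, replay the WAL to rebuild the in-memory store.
        self.store = {}
        with open(self.wal_path) as wal:
            for line in wal:
                key, value = line.rstrip("\n").split("\t", 1)
                self.store[key] = value

path = os.path.join(tempfile.mkdtemp(), "wal.log")
region = Region(path)
region.put("row1", "value1")
```

Because the log hits disk before the store is touched, a crash after `put` returns can always be repaired by `recover`.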

26 of 89

  • A distributed filesystem like the Hadoop Distributed File System (HDFS) has a namenode and a set of datanodes, a structure similar to the master and range servers that column databases like HBase use.
  • The physical storage file for an HBase column-family store resides in an HDFS datanode.
  • HBase uses an API that acts as the intermediary for conversations between an HBase store and an HDFS file.
  • For example, HBase could be used with CloudStore, formerly known as the Kosmos FileSystem (KFS), instead of HDFS.

27 of 89

  • To access HBase for the first time, a client accesses two catalogs via ZooKeeper.
  • These catalogs are named -ROOT- and .META.
  • The catalogs maintain state and location information for all the regions.
  • -ROOT- keeps information about all the .META. tables, and a .META. file keeps records for the user-space tables, that is, the tables that hold the data.
  • When a client wants to access a specific row, it first asks ZooKeeper for the -ROOT- catalog.
  • The -ROOT- catalog then locates the relevant .META. catalog.

28 of 89

  • .META. provides all the region details for accessing the specific row.
  • Column databases rely heavily on caching all relevant information from this three-step lookup process.
  • The three-step process is therefore not repeated the next time the client asks for the row data; clients contact the region servers directly instead.

29 of 89

DOCUMENT STORE INTERNALS

  • Each document is stored in BSON format.
  • BSON is a binary-encoded representation of a JSON-like document format.
  • Its structure is close to a nested set of key/value pairs.
  • BSON is a superset of JSON and supports additional types like regular expressions, binary data, and dates.
  • Each document has a unique identifier, which MongoDB can auto-generate as an object id if one is not explicitly specified when the data is inserted into a collection.

31 of 89

  • MongoDB drivers and clients serialize and de-serialize to and from BSON as they access BSON encoded data.
  • The MongoDB server understands the BSON format and doesn’t need the additional overhead of serialization.
  • The binary representations are read in the same format as they are transferred across the wire. This provides a great performance boost.

32 of 89

Storing Data in Memory-Mapped Files

  • A memory-mapped file is a segment of virtual memory that is assigned byte-for-byte to a file or a file-like resource that can be referenced through a file descriptor.
  • This implies that applications can interact with such files as if they were parts of primary memory. This obviously improves I/O performance compared to the usual disk reads and writes.
  • Accessing and manipulating memory is much faster than making system calls.
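Python's standard `mmap` module demonstrates the idea: once a file is mapped, it is manipulated through ordinary memory operations (slice assignment here) rather than explicit read()/write() calls. A minimal sketch of the technique, unrelated to how MongoDB itself is configured:

```python
import mmap
import os
import tempfile

# Create a small file to map.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)          # one 4 kB page of zeroes

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)    # map the whole file into memory
    mm[0:5] = b"hello"               # write through memory, no write() call
    mm.flush()                       # ask the OS to sync the dirty page
    mm.close()

with open(path, "rb") as f:
    first = f.read(5)                # the bytes written via memory are in the file
```

The operating system's page cache carries the mapped pages, which is why the next slide's point about the kernel buffer matters.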

33 of 89

  • In many operating systems, like Linux, a memory region mapped to a file is part of a buffer in RAM.
  • This transparent buffer is commonly called the page cache.
  • It is implemented in the operating system's kernel.

34 of 89

Guidelines for Using Collections and Indexes in MongoDB

  • There is no restriction on the number of collections in a database.
  • Do not add disparate (different kinds of) data into a single collection.
  • Mixing an eclectic (diverse) bunch together creates complexities for indexes.
  • If you often query the varied data set together, keep the data in one collection; otherwise, partitioning it into separate collections is more efficient.

35 of 89

  • Sometimes, a collection may grow indefinitely and threaten to hit the 2 GB database size limit.
  • Then it is good to use capped collections.
  • Capped collections in MongoDB are like circular buffers with a predefined size.
  • When a capped collection hits its limit, the oldest data records are deleted to make room for new ones.
  • Old records are identified by insertion order, so eviction is effectively first-in, first-out.
  • By default, documents in a capped collection are fetched in insertion order; reverse natural order retrieves the most recent first.
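The eviction behaviour described above can be mimicked with Python's fixed-size deque, which silently drops the oldest entries once full. This is an analogy for the capped-collection semantics, not MongoDB's implementation:

```python
from collections import deque

# A capped collection behaves like a fixed-size circular buffer: once the
# limit is reached, the oldest entries are dropped to make room for new ones.
capped = deque(maxlen=3)
for doc in ["d1", "d2", "d3", "d4", "d5"]:
    capped.append(doc)

# d1 and d2 were evicted; the survivors keep their insertion order.
```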

36 of 89

  • Every MongoDB collection is indexed on its _id field.
  • Indexes can be defined on any other attributes of the document.
  • When queried, documents are returned in the natural order of their _id.
  • Capped collections always return documents in their insertion (natural) order.
  • MongoDB offers enhanced performance but it does so at the expense of reliability.

37 of 89

MongoDB Reliability and Durability

  • Cursors are not automatically refreshed if data is modified.
  • By default, MongoDB flushes data to disk once every minute.
  • Only at these synchronization points are inserts and updates recorded on disk.
  • Any failure between two synchronizations can lead to inconsistency.
  • You can force a flush to disk, but all of that comes at the expense of some performance.

38 of 89

  • To avoid complete loss during a system failure, it’s advisable to set up replication.
  • Two MongoDB instances can be set up in a master-slave arrangement to replicate and keep the data in synch.
  • Replication is asynchronous (the copies are not updated simultaneously).

39 of 89

  • However, it's better to have data replicated.
  • In current versions of MongoDB, replica pairs of master and slave have been replaced with replica sets, where three replicas form a set.
  • Out of the three, one acts as the master (primary) and the other two act as slaves (secondaries).
  • Replica sets allow automatic recovery and automatic failover.
  • Whereas replication is viewed more as a failover and disaster-recovery plan, sharding is useful for horizontal scaling (an increase in nodes).

40 of 89

Horizontal Scaling

  • In more recent versions, MongoDB supports auto-sharding for scaling horizontally with ease.
  • Sharding is a type of database partitioning that separates large databases into smaller, faster, more easily managed parts.
  • MongoDB allows ordered collections to be saved across multiple machines.
  • Each machine that saves part of the collection is then a shard.

41 of 89

  • Sharding is a method for distributing data across multiple machines.
  • Shards are replicated to allow failover.
  • For example, a large collection could be split into four shards, and each shard in turn may be replicated three times.

42 of 89

  • One collection in a database may reside on a single node, while another collection in the same database may be sharded across multiple nodes.
  • Each shard stores contiguous sets of the ordered documents.
  • Such bundles are called chunks in MongoDB jargon. Each chunk is identified by three attributes: the first document key (min key), the last document key (max key), and the collection.
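The three attributes that identify a chunk, and the key routing they enable, can be sketched as follows. The `Chunk` tuple and `find_chunk` helper are illustrative only; in MongoDB this bookkeeping lives in the config servers and the mongos router:

```python
from collections import namedtuple

# A chunk is identified by its min key, max key, and collection (per the text).
Chunk = namedtuple("Chunk", ["min_key", "max_key", "collection"])

chunks = [
    Chunk("a", "g", "orders"),
    Chunk("g", "n", "orders"),
    Chunk("n", "~", "orders"),   # "~" stands in for the unbounded max key
]

def find_chunk(key, chunks):
    # Route a document key to the chunk whose [min_key, max_key) range holds it.
    for c in chunks:
        if c.min_key <= key < c.max_key:
            return c
    raise KeyError(key)
```

Because chunk ranges are contiguous and ordered, a router can send a query on the shard key to exactly one chunk.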

44 of 89

  • Targeted query: a query on the shard key; routed only to the relevant shards, so more efficient
  • Global query: a query on a general index; scattered to all shards
  • The mongos process routes queries and pulls shard state from the config servers

45 of 89

MongoDB modifier operations

  • $inc
  • $set
  • $unset
  • $push
  • $pushAll

46 of 89

  • $addToSet
  • $pop
  • $pull
  • $pullAll
  • $rename

48 of 89

UNDERSTANDING KEY/VALUE STORES IN MEMCACHED AND REDIS

49 of 89

memcached

  • Memcached can be downloaded from http://memcached.org
  • Memcached is a distributed, high-performance object-caching system.
  • It’s extremely popular and used by a number of high-traffic venues like Facebook, Twitter, Wikipedia, and YouTube.
  • Memcached is extremely simple and has a bare minimum set of features.
  • For example, there is no support for backup, failover, or recovery.
  • It has a simple API and can be used with almost any web-programming language.
  • The primary objective of using Memcached in an application stack is often to reduce database load.

50 of 89

  • The heart of Memcached is a slab allocator.
  • Memcached stores its values in a slab. A slab itself is composed of pages, which in turn are made up of chunks or buckets.
  • The smallest size a chunk can be is 1 kB, and chunk sizes grow by a factor of 1.25.
  • Therefore, chunk sizes can be 1 kB (1.25^0), 1.25 kB (1.25^1), 1.5625 kB (1.25^2), and so on. Memcached can store data values up to a maximum of 1 MB in size.
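The size progression and best-fit chunk selection can be computed directly. The sketch below uses the slide's numbers (1 kB base, factor 1.25, 1 MB cap); real Memcached defaults differ slightly and the growth factor is configurable:

```python
def slab_sizes(base_kb=1.0, factor=1.25, max_kb=1024.0):
    # Generate chunk size classes growing geometrically by `factor`.
    sizes, size = [], base_kb
    while size <= max_kb:
        sizes.append(size)
        size *= factor
    return sizes

def chunk_for(obj_kb, sizes):
    # Pick the smallest chunk class that fits the object.
    return next(s for s in sizes if s >= obj_kb)

sizes = slab_sizes()   # 1 kB, 1.25 kB, 1.5625 kB, ...
```

An object of 1.4 kB lands in the 1.5625 kB class, which is the wasted-space effect the next slide describes.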

51 of 89

  • Values are stored and referenced by a key. A key can be up to 250 bytes in size.
  • Each object is stored in the closest-sized chunk or bucket. This means an object 1.4 kB in size is stored in a chunk that is 1.5625 kB in size.
  • This leads to wasted space, especially when objects are only slightly larger than an available chunk size.
  • By default, Memcached uses up all of the memory allotted to it.

53 of 89

  • LRU algorithms govern the removal of old cache objects.
  • LRU algorithms work on a per-slab class basis. Fragmentation may occur as objects are stored and cleaned up.
  • Reallocation of memory solves part of this problem.
  • Memcached is an object cache that doesn’t organize data elements in collections, like lists, sets, sorted sets, or maps.
  • Redis, on the other hand, provides support for all these rich data structures.
  • Redis is similar to Memcached in approach but more robust.

54 of 89

Redis Internals

  • Everything in Redis is ultimately represented as a string.
  • Even collections like lists, sets, sorted sets, and maps are composed of strings.
  • Redis defines a special structure, which it calls simple dynamic string or SDS.
  • This structure consists of three parts, namely:
  • buff: a character array that stores the string
  • len: a long that stores the length of the buff array
  • free: the number of additional bytes available for use
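In Python terms, the SDS header could be sketched as below. The real structure is a C struct; the append logic, which shows how the spare `free` bytes let a string grow without reallocation, is our illustration:

```python
class SDS:
    """Sketch of Redis's simple dynamic string header (buff, len, free)."""
    def __init__(self, text, extra=0):
        self.len = len(text)          # bytes currently in use
        self.free = extra             # spare bytes preallocated for growth
        self.buff = bytearray(text.encode()) + bytearray(extra)

    def append(self, more):
        data = more.encode()
        if len(data) <= self.free:    # fits in the spare bytes: no reallocation
            self.buff[self.len:self.len + len(data)] = data
            self.free -= len(data)
        else:                         # too big: reallocate the buffer
            self.buff = self.buff[:self.len] + bytearray(data)
            self.free = 0
        self.len += len(data)

    def value(self):
        return bytes(self.buff[:self.len]).decode()
```

Tracking `len` explicitly also makes length lookups O(1), unlike C's strlen over NUL-terminated strings.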

55 of 89

  • Redis keeps its data set in the primary memory, persisting it to disk as required.
  • Unlike MongoDB, it does not use memory-mapped files for that purpose.
  • Instead, Redis implements its own virtual memory subsystem. When a value is swapped to disk, a pointer to that disk page is stored with the key.


57 of 89

Consistent hashing

  • An important principle in distributed hash tables
  • A hash function maps keys to slots (nodes)
  • Example: 50 keys distributed across 7 nodes using key mod 7
  • Key 85 goes to node 1, because 85 mod 7 is 1
  • Key 18 goes to node 4, because 18 mod 7 is 4
  • Consistent hashing is useful when the number of nodes changes, because simple mod-N hashing then remaps most keys
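The weakness of the plain mod-N scheme is easy to demonstrate: adding a single node remaps almost every key, which is exactly the churn consistent hashing avoids. A small Python check of this, using the slide's 50-key example:

```python
# With plain mod-N hashing, changing the node count remaps almost every key.
keys = range(50)
before = {k: k % 7 for k in keys}   # 7 nodes
after = {k: k % 8 for k in keys}    # one node added
moved = sum(1 for k in keys if before[k] != after[k])
# Only keys 0..6 keep their node; the other 43 of 50 keys all move.
```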

58 of 89

  • Consistent hashing forms an important principle for distributed hash tables.
  • In consistent hashing, addition or removal of a slot does not significantly change the mapping of keys to the slots.
  • To appreciate this hashing scheme, let’s first look at an elementary hashing scheme and understand the problems that show up as slots are added or removed.


60 of 89

  • In consistent hashing, the mapping of keys is not majorly affected when nodes are added or removed. A good way to explain consistent hashing is to draw a circle and mark the nodes on it.
  • Consistent hashing provides effective partitioning.
  • Object versioning helps keep the data consistent.
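The circle can be drawn in code as well: nodes are hashed onto a ring, and a key belongs to the first node clockwise from the key's own hash. This minimal sketch omits the virtual nodes that real systems add for balance:

```python
import hashlib
from bisect import bisect

def point(s):
    # Hash a string to a position on the ring [0, 2**32).
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    """Minimal consistent-hash ring (no virtual nodes), a sketch only."""
    def __init__(self, nodes):
        self.points = sorted((point(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise to the first node at or after the key's position.
        i = bisect(self.points, (point(key), ""))
        return self.points[i % len(self.points)][1]
```

Adding a node moves only the keys that fall on the new node's arc; every other key keeps its owner.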

61 of 89

Object Versioning

  • In a large, distributed, and highly scalable system, ACID transactions impose a huge overhead.
  • So Dynamo proposes object versioning and vector clocks for maintaining consistency.
  • Vector clocks not only help determine the order of events but also help resolve inconsistencies by identifying their root causes.
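A vector clock can be sketched as a map of per-node counters: each write increments the writer's counter, and comparing two clocks reveals either an ordering or a conflict. The helper names below are ours for illustration; Dynamo's actual implementation differs in detail:

```python
def increment(clock, node):
    # A write at `node` bumps that node's counter in a copy of the clock.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if clock a has seen everything in clock b (b happened before a, or equal).
    return all(a.get(n, 0) >= c for n, c in b.items())

def conflict(a, b):
    # Neither descends from the other: concurrent writes needing reconciliation.
    return not descends(a, b) and not descends(b, a)
```

Two replicas that each extend the same ancestor clock produce a conflict, which is precisely the case the application (or a merge rule) must reconcile.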

62 of 89

Gossip-Based Membership and Hinted Handoff

  • A gossip protocol is a style of communication protocol inspired by the form of gossip or rumor seen in social networks and offices.
  • A gossip communication protocol involves periodic, pairwise, interprocess interactions.
  • Reliability is usually low and peer selection is often random.
  • In hinted handoff, instead of requiring a full quorum for a durable write, a relaxed quorum is allowed.
  • The write is performed on the healthy nodes, and hints are recorded so the failed node can catch up when it comes back up.


64 of 89

CRUD operations

65 of 89

THE UNIQUE PRIMARY KEY

  • By default, MongoDB uses a BSON object id as the primary key
  • The object id is a 12-byte structure:
  • The first four bytes represent the timestamp
  • The next three bytes represent the machine id
  • The following two bytes encode the process id
  • The last three bytes are the increment or the sequence counter
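Given that layout, an object id can be unpacked byte by byte. The `decode_object_id` helper below is our illustration, applied to the sample id that appears later in these slides:

```python
# Decode the 12-byte MongoDB ObjectId layout described above:
# 4-byte timestamp | 3-byte machine id | 2-byte process id | 3-byte counter.
def decode_object_id(hex_id):
    raw = bytes.fromhex(hex_id)
    assert len(raw) == 12
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),
        "machine": raw[4:7].hex(),
        "pid": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }

parts = decode_object_id("4cccff35d3c7ab3d1941b103")
```

The leading timestamp bytes are why documents sorted by _id come out in rough insertion order.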

66 of 89

Creating Records in a Document-Centric Database

  • Consider an order, a product, and their relationship table in a traditional entity-relationship diagram

67 of 89

t = { "order_date" : "Sat Oct 30 2010 22:30:12",
      "line_items" : [
        { "item" : { "name" : "latte",
                     "unit_price" : 4 },
          "quantity" : 1 },
        { "item" : { "name" : "cappuccino",
                     "unit_price" : 4.25 },
          "quantity" : 1 },
        {

68 of 89

          "item" : { "name" : "regular",
                     "unit_price" : 2 },
          "quantity" : 2 }
      ]
    }
> db.orders.save(t);

69 of 89

  • > db.orders.find();
  • { "_id" : ObjectId("4cccff35d3c7ab3d1941b103"), "order_date" : "Sat Oct 30 2010 22:30:12", "line_items" : [ ...

  • bin/mongod --dbpath ~/data/db
  • bin/mongo

70 of 89

DBRef

  • Although storing the entire nested document collection together is advised, sometimes it's necessary to store the nested objects separately.
  • When nested documents are stored separately, it's your responsibility to join the record sets together.
  • There is no notion of a database join in MongoDB, so you must either manually implement the join operation by using the object id on the client side or use DBRef.
  • In MongoDB, DBRef is a formal specification for creating references between documents.
  • A DBRef includes a collection name as well as an object id.

71 of 89

> t2 = { order_date: new Date(),
         "line_items": [
           { "item_name": "latte", "quantity": 1 },
           { "item_name": "cappuccino", "quantity": 1 },
           { "item_name": "regular", "quantity": 2 }
         ] };

72 of 89

{
  "order_date" : "Sat Oct 30 2010 23:03:31",
  "line_items" : [
    { "item_name" : "latte", "quantity" : 1 },
    { "item_name" : "cappuccino", "quantity" : 1 },
    { "item_name" : "regular", "quantity" : 2 }
  ]
}
> db.orders2.save(t2);

73 of 89

  • p4 = { "name": "latte", "unit_price": 4 };

{ "name" : "latte", "unit_price" : 4 }

  • p5 = { "name": "cappuccino", "unit_price": 4.25 };

{ "name" : "cappuccino", "unit_price" : 4.25 }

  • p6 = { "name": "regular", "unit_price": 2 };

{ "name" : "regular", "unit_price" : 2 }

74 of 89

> db.products2.save(p4);

> db.products2.save(p5);

> db.products2.save(p6);

75 of 89

  • t3 = { order_date: new Date(),
           "line_items": [
             { "item_name": new DBRef('products2', p4._id), "quantity": 1 },
             { "item_name": new DBRef('products2', p5._id), "quantity": 1 },
             { "item_name": new DBRef('products2', p6._id), "quantity": 2 }
           ] };
  • db.orders3.save(t3);

76 of 89

  • order1 = db.orders2.findOne();
{
  "_id" : ObjectId("4ccd06e8d3c7ab3d1941b104"),
  "order_date" : "Sat Oct 30 2010 23:03:31",
  "line_items" : [
    { "item_name" : "latte", "quantity" : 1 },
    { "item_name" : "cappuccino", "quantity" : 1 },
    { "item_name" : "regular", "quantity" : 2 }
  ]
}

77 of 89

Although similar to the find() method, the findOne() method returns a single document rather than a cursor.

var myDocument = db.bios.findOne();

78 of 89

Accessing Documents from MongoDB

  • use mydb
  • use library
  • show dbs
  • show collections
  • db.orders.find();
  • var refdate = new Date(2010, 9, 25);
  • db.orders.find({"order_date": {$gt: refdate}});

79 of 89

  • db.orders.find({ "line_items.item_name" : "latte" });
  • db.orders.find({ "line_items.quantity" : 2 });

  • db.orders.ensureIndex({ "line_items.quantity" : 1 });
  • db.media.ensureIndex({title: 1});
  • db.media.ensureIndex({title: -1});

  • 1 indicates ascending order
  • -1 indicates descending order

80 of 89

  • db.media.find({Artist: "nirvana"})
  • db.media.find().sort({title: 1})
  • db.media.find().limit(10)
  • db.media.find().skip(20)
  • db.media.count()
  • db.media.find({publisher: "Apress", type: "book"}).count()

81 of 89

Using the Create Operation in Column-Oriented Databases

  • bin/start-hbase.sh
  • bin/hbase shell

  • Create the products table:
  • hbase(main):001:0> create 'products', 'type', 'characteristics', 'source'
  • 0 row(s) in 1.1570 seconds

82 of 89

  • hbase(main):001:0> put 'products', 'product1', 'type:category', 'coffee beans'
  • 0 row(s) in 0.0710 seconds
  • hbase(main):002:0> put 'products', 'product1', 'type:name', 'arabica'
  • 0 row(s) in 0.0020 seconds
  • hbase(main):003:0> put 'products', 'product1', 'type:genus', 'Coffea'
  • 0 row(s) in 0.0050 seconds
  • hbase(main):004:0> put 'products', 'product1', 'characteristics:cultivation_method', 'organic'

83 of 89

  • To create a record with the following fields:
  • type:category = "coffee beans"
  • type:name = "arabica"
  • type:genus = "Coffea"
  • characteristics:cultivation_method = "organic"
  • characteristics:acidity = "low"
  • source:country = "yemen"
  • source:terrain = "mountainous"

84 of 89

  • hbase(main):005:0> put 'products', 'product1', 'characteristics:acidity', 'low'
  • 0 row(s) in 0.0030 seconds
  • hbase(main):006:0> put 'products', 'product1', 'source:country', 'yemen'
  • 0 row(s) in 0.0050 seconds
  • hbase(main):007:0> put 'products', 'product1', 'source:terrain', 'mountainous'
  • 0 row(s) in 0.0050 seconds
  • hbase(main):008:0>

85 of 89

  • hbase(main):008:0> get 'products', 'product1'
  • COLUMN                              CELL
  • characteristics:acidity             timestamp=1288555025970, value=low
  • characteristics:cultivation_method  timestamp=1288554998029, value=organic
  • source:country                      timestamp=1288555050543, value=yemen
  • source:terrain                      timestamp=1288555088136, value=mountainous
  • type:category                       timestamp=1288554892522, value=coffee beans
  • type:genus                          timestamp=1288554961942, value=Coffea
  • type:name                           timestamp=1288554934169, value=arabica
  • 7 row(s) in 0.0190 seconds

86 of 89

  • What if you put a value for type:category a second time, storing "beans" instead of its original value of "coffee beans", as follows?
  • hbase(main):009:0> put 'products', 'product1', 'type:category', 'beans'
  • 0 row(s) in 0.0050 seconds

87 of 89

  • hbase(main):010:0> get 'products', 'product1'
  • COLUMN                              CELL
  • characteristics:acidity             timestamp=1288555025970, value=low
  • characteristics:cultivation_method  timestamp=1288554998029, value=organic
  • source:country                      timestamp=1288555050543, value=yemen
  • source:terrain                      timestamp=1288555088136, value=mountainous
  • type:category                       timestamp=1288555272656, value=beans
  • type:genus                          timestamp=1288554961942, value=Coffea
  • type:name                           timestamp=1288554934169, value=arabica
  • 7 row(s) in 0.0370 seconds

88 of 89

  • hbase(main):011:0> get 'products', 'product1', { COLUMN => 'type:category', VERSIONS => 4 }
  • COLUMN         CELL
  • type:category  timestamp=1288555272656, value=beans
  • type:category  timestamp=1288554892522, value=coffee beans

89 of 89