1 of 63

Surveying Database Internals

2 of 63

MONGODB INTERNALS

  • MongoDB follows a client-server architecture, commonly found in traditional RDBMSs.
  • Client-server architecture involves a single server and multiple clients connecting to the server.
  • In a sharded and replicated scenario, multiple servers — instead of only one — form the topology.
  • In a standalone mode or in a clustered and sharded topology, data is transported from client to server and back, and among the nodes.

3 of 63

MongoDB Wire Protocol

  • Clients speak to a MongoDB server using a simple TCP/IP-based socket connection.
  • The wire protocol used for the communication is a simple request-response-based socket protocol.
  • The payload of a wire protocol message is BSON encoded, and multi-byte values are ordered in little-endian format, the same byte ordering BSON uses.
  • In a standard request-response model, a client sends a request to a server and the server responds to the request.
  • In the wire protocol, a request is sent with a message header and a request payload, and a response comes back with a message header and a response payload. The message header format is quite similar between the request and the response.

4 of 63

5 of 63

MongoDB wire protocol operations:

  • OP_INSERT (code: 2002) — Insert a document. The “create” operation in CRUD jargon.
  • OP_UPDATE (code: 2001) — Update a document. The update operation in CRUD.
  • OP_QUERY (code: 2004) — Query a collection of documents. The “read” operation in CRUD.
  • OP_GET_MORE (code: 2005) — Get more data from a query. Query response can contain a large number of documents. To enhance performance and avoid sending the entire set of documents, databases involve the concept of a cursor that allows for incremental fetching of the records.
  • The OP_GET_MORE operation facilitates fetching additional documents via a cursor.

6 of 63

  • OP_REPLY (code: 1) — Reply to a client request. This operation sends responses in reply to OP_QUERY and OP_GET_MORE operations.
  • OP_KILL_CURSORS (code: 2007) — Operation to close a cursor.
  • OP_DELETE (code: 2006) — Delete a document.

7 of 63

A standard message header has the following properties:

  • messageLength — The length of the message in bytes; 4 bytes hold the length value.
  • requestID — A unique message identifier. The client or the server, depending on which is initiating the operation, can generate the identifier.
  • responseTo — In the case of OP_QUERY and OP_GET_MORE the response from the database includes the requestID from the original client request as the responseTo value. This allows clients to map requests to responses.
  • opCode — The operation code. The allowed operations are listed earlier.
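A standard header is just four little-endian int32 values, so it can be sketched with Python's struct module (an illustrative sketch; the field values below are made up):

import struct

# Pack and unpack a wire protocol message header: four little-endian
# ("<") 4-byte signed integers ("i"), 16 bytes in total.
def pack_header(message_length, request_id, response_to, op_code):
    return struct.pack("<iiii", message_length, request_id, response_to, op_code)

def unpack_header(data):
    return struct.unpack("<iiii", data[:16])

header = pack_header(16, 1, 0, 2004)  # a bare OP_QUERY header
print(unpack_header(header))          # (16, 1, 0, 2004)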

8 of 63

Inserting a Document

  • When creating and inserting a new document, a client sends an OP_INSERT operation via a request that includes:
  • A message header — A standard message header structure that includes messageLength, requestID, responseTo, and opCode.
  • An int32 value — Zero (which is simply reserved for future use).
  • A cstring — The fully qualified collection name.
  • An array — This array contains one or more documents that need to be inserted into a collection.
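As a rough illustration, such an OP_INSERT message could be assembled in Python (assuming the bson module that ships with PyMongo; the collection and document are made up):

import struct
import bson  # ships with PyMongo

OP_INSERT = 2002

def build_insert(request_id, collection, documents):
    body = struct.pack("<i", 0)                # reserved int32 zero
    body += collection.encode() + b"\x00"      # cstring collection name
    for doc in documents:
        body += bson.encode(doc)               # BSON documents, back to back
    header = struct.pack("<iiii", 16 + len(body), request_id, 0, OP_INSERT)
    return header + body

msg = build_insert(1, "test.movies", [{"title": "Vertigo", "year": 1958}])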

9 of 63

Querying a Collection

  • An OP_QUERY message from the client includes:
  • A message header — A standard header with messageLength, requestID, responseTo, and opCode elements in it.
  • An int32 value — Contains flags that represent query options.
  • A cstring — Fully qualified collection name.
  • An int32 value — Number of documents to skip.
  • Another int32 value — Number of documents to return.
  • A BSON document — The query document itself, optionally followed by a field selector document.
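Continuing the same illustrative Python sketch, an OP_QUERY message could be built like this (again assuming PyMongo's bson module):

import struct
import bson

OP_QUERY = 2004

def build_query(request_id, collection, query, skip=0, limit=0, flags=0):
    body = struct.pack("<i", flags)            # query option flags
    body += collection.encode() + b"\x00"      # fully qualified collection name
    body += struct.pack("<ii", skip, limit)    # documents to skip / to return
    body += bson.encode(query)                 # the query document itself
    header = struct.pack("<iiii", 16 + len(body), request_id, 0, OP_QUERY)
    return header + body

msg = build_query(2, "test.movies", {"year": 1958})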

10 of 63

  • In response to a client OP_QUERY operation request, a MongoDB database server responds with an OP_REPLY. An OP_REPLY message from the server includes:
  • A message header — The message header format in a client request and a server response is quite similar. Also, as mentioned earlier, the responseTo header property for an OP_REPLY contains the requestID value of the corresponding client OP_QUERY request. This allows clients to map requests to responses.
  • An int32 value — Contains response flags that typically denote an error or exception situation. Response flags could contain information about query failure or an invalid cursor id.
  • An int64 value — Contains the cursor id that allows a client to fetch more documents.
  • An int32 value — Starting point in the cursor.
  • Another int32 value — Number of documents returned.
  • An array — Contains the documents returned in response to the query.
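The same fields can be read back off the wire; a minimal parsing sketch in Python (assuming PyMongo's bson module):

import struct
import bson

OP_REPLY = 1

def parse_reply(msg):
    length, request_id, response_to, op_code = struct.unpack("<iiii", msg[:16])
    assert op_code == OP_REPLY
    flags, = struct.unpack("<i", msg[16:20])         # response flags
    cursor_id, = struct.unpack("<q", msg[20:28])     # int64 cursor id
    start, count = struct.unpack("<ii", msg[28:36])  # starting point, docs returned
    documents = bson.decode_all(msg[36:])            # the returned documents
    return flags, cursor_id, start, count, documents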

11 of 63

MongoDB Database Files

  • MongoDB stores database and collection data in files that reside at a path specified by the --dbpath option to the mongod server program.
  • The default value for dbpath is /data/db.
  • You can query for a collection's storage properties using the Mongo shell. To use the shell, first start mongod, then connect to the server using the mongo command-line program. After you connect, query for a collection's size as follows:
> db.movies.dataSize();        // 327280: size of the data
> db.movies.storageSize();     // 500480: storage allocated to the collection
> db.movies.totalSize();       // 860928: collection storage plus index size
> db.system.namespaces.find()  // lists namespaces (collections and their indexes)

12 of 63

> db.movies.$_id_.dataSize();  // index data size for the _id index
> db.movies.totalIndexSize();  // total index data size
> db.movies.validate();        // size measurements and more

13 of 63

MEMBASE ARCHITECTURE

  • Membase supports the Memcached protocol and so client applications that use Memcached can easily include Membase in their application stack.
  • Each Membase node runs an instance of the ns_server, which is sort of a node supervisor and manages the activities on the node.

  • Clients interact with the ns_server using the Memcached protocol or a REST interface.

14 of 63

  • The REST interface is supported with the help of a component called Menelaus.
  • Menelaus includes a robust jQuery layer that maps REST calls down to the server.
  • Clients accessing Membase using the Memcached protocol reach the underlying data through a proxy called Moxi.
  • A proxy server is a system or router that provides a gateway between users and the internet.
  • Moxi acts as an intermediary that, with the help of vBuckets, always routes clients to the appropriate node.
  • vBuckets map keys to nodes through consistent hashing, so requests are routed correctly, as sketched below.
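A toy sketch of vBucket-style routing in Python (the hash function, vBucket count, and node names are illustrative; Membase's actual mapping differs in detail):

import hashlib

NUM_VBUCKETS = 1024
servers = ["node-a:11211", "node-b:11211", "node-c:11211"]
# The vBucket map assigns each vBucket to a node; Moxi consults such a map.
vbucket_map = {vb: servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)}

def route(key):
    # Hash the key to a vBucket, then look the vBucket up in the map.
    digest = hashlib.md5(key.encode()).digest()
    vbucket = int.from_bytes(digest[:4], "little") % NUM_VBUCKETS
    return vbucket, vbucket_map[vbucket]

print(route("user:1001"))  # for example: (317, 'node-b:11211')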

15 of 63

16 of 63

Membase consists of the following components:

  • ns_server — The core supervisor.
  • Memcached and Membase engine — Membase builds on top of Memcached and adds, among other things, support for the TAP protocol, which streams data changes between nodes.
  • Vbucketmigrator — Depending on how ns_server starts one or more vbucketmigrator processes, data is either replicated or transferred between nodes.
  • Moxi — Memcached proxy with support for vBuckets hashing for client routing.

17 of 63

HYPERTABLE UNDER THE HOOD

  • Hypertable is a high-performance alternative to HBase. The essential characteristics of Hypertable are quite similar to HBase, which in turn is a clone of the Google Bigtable.
  • Hypertable is actually not a new project; it started around the same time as HBase, in 2007.
  • Hypertable runs on top of a distributed filesystem like HDFS.
  • In HBase, column-family-centric data is stored in a row-key sorted and ordered manner.
  • In Hypertable, all information is appended to the row keys.
  • All data for all versions for each row-key is stored in a sorted manner for each column-family.

18 of 63

  • Access groups in Hypertable provide a way to physically store related column data together. In traditional RDBMS, data is sorted in rows and stored as such.
  • That is, data for two contiguous rows is typically stored next to each other. In column-oriented stores, data for two columns is physically stored together.
  • With Hypertable access groups you have the flexibility to put one or more columns in the same group.
  • Keeping all columns in the same access group simulates a traditional RDBMS environment. Keeping each column separate from the others simulates a column-oriented database.

19 of 63

Bloom Filter

  • Bloom Filter is a data structure used to test whether an element is a member of a set.
  • Think of a Bloom Filter as an array of m number of bits.
  • An empty Bloom Filter has a value of 0 in all its m positions. Now if elements a, b, and c are members of a set, then they are mapped via a set of k hash functions to the Bloom Filter.
  • This means each of the members, that is, a, b, and c, is mapped via the k different hash functions to k positions on the Bloom Filter. Whenever a member is mapped via a hash function to a specific position in the m-bit array, the value in that particular position is set to 1.

20 of 63

  • Now to test if a given element, say w, is a member of a set, you need to pass it through the k hash functions and map the outcome on the Bloom Filter array.
  • If the value at any of the mapped positions is 0, the element is not a member of the set. If the value at all the positions is 1, then either the element is a member of the set or the element maps to one or more positions whose values were set to 1 by other elements (a false positive). A minimal sketch follows.
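A minimal Bloom Filter sketch in Python, using k salted hashes over an m-bit array (the sizes are chosen arbitrarily for illustration):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m        # an empty filter: 0 in all m positions

    def _positions(self, element):
        # Map the element to k positions via k salted hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1

    def might_contain(self, element):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(element))

bf = BloomFilter()
for member in ("a", "b", "c"):
    bf.add(member)
print(bf.might_contain("a"))  # True
print(bf.might_contain("w"))  # False, barring a false positive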

21 of 63

APACHE CASSANDRA

  • Apache Cassandra is simultaneously a very popular and an infamous NoSQL database (a column-family store).
  • Peer-to-Peer Model — Cassandra moves away from the master-slave model and instead uses a peer-to-peer model. This means there is no single master, but all the nodes are potentially masters.
  • This makes the writes and reads extremely scalable and even allows nodes to function in cases of partition tolerance.
  • However, extreme scalability comes at a cost, which in this case is a compromise on strong consistency. The peer-to-peer model follows a weak consistency model.

22 of 63

Based on Gossip and Anti-entropy

  • Cassandra uses a gossip-based protocol to communicate among the peers and to detect node failure.
  • The idea is similar to the concept of human gossip: any peer arbitrarily chooses other nodes to send messages to.
  • In Cassandra, gossip is more systematic and is triggered by a Gossiper class that runs on the basis of a timer.
  • Nodes register themselves with the Gossiper class and receive updates as gossip propagates through the network.
  • Gossip is meant for large distributed systems and is not particularly reliable.
  • In Cassandra, the Gossiper class keeps track of nodes as gossip spreads through them.

23 of 63

  • On each timer tick, the Gossiper chooses a random node and sends that node a message.
  • This message is named GossipDigestSyncMessage. The receiving node, if active, sends an acknowledgment back to the Gossiper.
  • To complete gossip, the Gossiper sends an acknowledgment in response to the acknowledgment it receives.
  • If the communication completes all steps, gossip successfully shares the state information between the Gossiper and the node. If during gossip the communication fails, it indicates that the node may be down.
  • To detect failure, Cassandra uses an algorithm called Phi Accrual Failure Detection, sketched below.
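The flavor of the algorithm can be conveyed with a small sketch that, for simplicity, assumes exponentially distributed heartbeat intervals (real implementations use a richer statistical model):

import math

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []           # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        # Suspicion level: -log10 of the probability that a heartbeat
        # could still arrive this late. Higher phi = more likely down.
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return -math.log10(math.exp(-elapsed / mean))

d = PhiAccrualDetector()
for t in (0, 1, 2, 3):
    d.heartbeat(t)
print(d.phi(3.5))   # low suspicion
print(d.phi(10.0))  # high suspicion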

24 of 63

Fast Writes

  • Writes in Cassandra are extremely fast because they are simply appended to commit logs on any available node, and no locks are imposed.
  • A write into the in-memory data structure is performed only after a successful write into the commit log.

  • Typically, there is a dedicated disk on each machine for the commit log, because all writes into the commit log are sequential.
  • When the in-memory structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk.

25 of 63

  • All writes are sequential to disk and also generate an index for efficient lookup based on a rowkey.
  • These indexes are also persisted along with the data.
  • Over time, many such logs could exist on disk, and a merge process could run in the background to collate (combine) the different logs into one. This process of compaction merges data in SSTables, the underlying storage format. It also leads to deletion of data items marked for deletion and creation of new indexes. The write path is sketched below.
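A toy version of this write path in Python (the names and the threshold are made up; real Cassandra is far more involved):

import json, os

class WritePath:
    def __init__(self, log_path, flush_threshold=1000):
        self.log = open(log_path, "a")   # sequential, append-only commit log
        self.memtable = {}               # the in-memory structure
        self.flush_threshold = flush_threshold
        self.generation = 0

    def write(self, row_key, columns):
        # 1. Append to the commit log and force it to disk.
        self.log.write(json.dumps({row_key: columns}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then apply the write in memory; no locks, no seeks.
        self.memtable.setdefault(row_key, {}).update(columns)
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Dump the memtable to disk as a sorted, immutable, SSTable-like file.
        path = f"sstable-{self.generation}.json"
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.generation += 1
        self.memtable.clear()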

26 of 63

Hinted Handoff

  • During a write operation a request sent to a Cassandra node may fail if the node is unavailable.
  • A write may not reconcile (restore) correctly if the node is partitioned from the network.
  • To handle these cases, Cassandra involves the concept of hinted handoff.
  • Hinted handoff can best be explained through a small illustration, so consider two nodes in a network, X and Y. A write is attempted on X, but X is down, so the write operation is sent to Y. Y stores the information with a little hint that says the write is for X, and passes it on when X comes online. A sketch follows.
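The X and Y scenario, as a toy Python sketch (all names are hypothetical):

class Node:
    def __init__(self, name):
        self.name, self.up = name, True
        self.data, self.hints = {}, []   # hints: writes held for other nodes

    def accept(self, key, value, intended_for=None):
        if intended_for is not None:     # keep the write plus a hint
            self.hints.append((intended_for, key, value))
        else:
            self.data[key] = value

def hinted_write(key, value, x, y):
    # Try node X; if it is down, node Y stores the write with a hint for X.
    if x.up:
        x.accept(key, value)
    else:
        y.accept(key, value, intended_for=x.name)

def replay_hints(holder, nodes_by_name):
    # When intended nodes come back online, hand their hinted writes over.
    remaining = []
    for intended, key, value in holder.hints:
        node = nodes_by_name[intended]
        if node.up:
            node.accept(key, value)
        else:
            remaining.append((intended, key, value))
    holder.hints = remaining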

27 of 63

BERKELEY DB

  • Berkeley DB comes in three distinct flavors and supports multiple configurations:
  • Berkeley DB — Key/value store programmed in C. This is the original flavor.
  • Berkeley DB Java Edition (JE) — Key/value store rewritten in Java. Can easily be incorporated into a Java stack.
  • Berkeley DB XML — Written in C++, this version wraps the key/value store to behave as an indexed and optimized XML storage system.
  • Berkeley DB, also referred to as BDB, is a key/value store deep in its guts. Simple as it may be at its core, a number of different configurations are possible with BDB.
  • For example, BDB can be scaled out as a highly available cluster of master-slave replicas.

28 of 63

  • BDB is a key/value data store.
  • It is a pure storage engine that stores key/value pairs.
  • BDB easily allows for higher-level API, query, and modeling abstractions on top of the underlying key/value store.
  • This facilitates fast and efficient storage of application-specific data, without the overhead of translating it into an abstracted data format.
  • The flexibility offered by this simple, yet elegant design, makes it possible to store structured and semi-structured data in BDB.

29 of 63

  • The key of a key/value pair almost always serves as the primary key, which is indexed. Other properties within the value could serve as secondary indexes. Secondary indexes are maintained separately in a secondary database.
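A sketch of a primary store with a secondary index, assuming the bsddb3 Python bindings for Berkeley DB (the file names and value layout are made up):

from bsddb3 import db

primary = db.DB()
primary.open("people.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

secondary = db.DB()
secondary.set_flags(db.DB_DUPSORT)       # many people can share one city
secondary.open("people_by_city.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

def city_of(key, value):
    # Derive the secondary key (the city) from the stored value.
    return value.split(b",")[1]

primary.associate(secondary, city_of)    # BDB maintains the index for us

primary.put(b"1", b"alice,london")
primary.put(b"2", b"bob,paris")
print(secondary.get(b"london"))          # b'alice,london'
secondary.close(); primary.close()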

30 of 63

Storage Configuration

  • BDB supports four types of data structures: B-tree, Hash, Queue, and Recno.
  • B-Tree Storage : It’s a balanced tree data structure that keeps its elements sorted and allows for fast sequential access, insertions, and deletions.
  • Keys and values can be arbitrary data types.
  • In BDB, the B-tree access method allows duplicates. This is a good choice if you need complex data types as keys. It's also a great choice if data access patterns lead to access of contiguous or neighboring records.
  • B-tree keeps a substantial amount of metadata to perform efficiently. Most BDB applications use the B-tree storage configuration.

31 of 63

  • Hash Storage :
  • Like the B-tree, a hash also allows complex types to be keys. Hashes have a more linear structure as compared to a B-tree.
  • BDB hash structures allow duplicates.
  • a hash database usually outperforms a B-tree when the data set far exceeds the amount of available memory because a B-tree keeps more metadata than a hash, and a larger data set implies that the B-tree metadata may not fit in the in-memory cache.
  • In such an extreme situation, the B-tree metadata as well as the actual data record itself must often be fetched from files, which can cause multiple I/Os per operation. The hash access method is designed to minimize the I/Os required to access the data record and therefore, in these extreme cases, may perform better than a B-tree.
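With bsddb3, the access method is just a flag chosen at open time, so switching between B-tree and hash is a one-line change (a sketch; the file name is made up):

from bsddb3 import db

d = db.DB()
# db.DB_BTREE keeps keys sorted for range scans; db.DB_HASH is the
# candidate when the data set far exceeds the available memory.
d.open("cache.db", dbtype=db.DB_HASH, flags=db.DB_CREATE)
d.put(b"session:42", b"payload")
print(d.get(b"session:42"))  # b'payload'
d.close()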

32 of 63

  • Queue Storage
  • A queue is a set of sequentially stored fixed-length records. Keys are restricted to logical record numbers, which are integer types.
  • Records are appended sequentially allowing for extremely fast writes. If you are impressed by Apache Cassandra’s fast writes by appending to logs, give BDB with the queue access method a try and you won’t be disappointed.
  • Methods also allow reading and updating effectively from the head of the queue.
  • A queue has additional support for row-level locking. This allows effective transactional integrity even in cases of concurrent processing.
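A queue-access sketch, again assuming the bsddb3 bindings (the record length and file name are arbitrary):

from bsddb3 import db

q = db.DB()
q.set_re_len(64)     # queue records are fixed length
q.set_re_pad(0)      # pad shorter records with NUL bytes
q.open("jobs.db", dbtype=db.DB_QUEUE, flags=db.DB_CREATE)

# Appends assign sequential integer record numbers: extremely fast writes.
q.append(b"first job")
q.append(b"second job")

# consume() reads and deletes from the head of the queue.
recno, data = q.consume()
print(recno, data.rstrip(b"\x00"))  # 1 b'first job'
q.close()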

33 of 63

  • Recno Storage
  • Recno is similar to a queue but allows variable-length records.
  • Like a queue, Recno keys are restricted to integers. The different configurations allow you to store arbitrary types of data in a collection.
  • You are welcome to store disparate value types for two keys in a collection.
  • Value types can be complex classes, which could represent a JSON document, a complex data structure, or a structured data set.
  • A single key or a single value can be as large as 4 GB in size. The possibility of secondary indexes allows filtering on the basis of value properties.
  • In general, the storage is compact and efficient.

34 of 63

USING MYSQL AS NOSQL

  • The emergence of NoSQL and the growing excitement around it is making developers wonder if NoSQL options are a replacement to RDBMS.
  • This has led some to claim that RDBMS is dead and that NoSQL is the next predominant database technology.
  • It has also spurred opposing reactions where arguments are being put forward to prove that NoSQL is a “flash in the pan.”
  • NoSQL and RDBMS are both important and have their place. Peaceful coexistence of the two technologies is a reality.
  • Plurality in technology has always been the norm, and polyglot persistence is the present and the future.

35 of 63

  • Polyglot persistence is a term that refers to using multiple data storage technologies within a single system, in order to meet varying data storage needs.

36 of 63

37 of 63

  • The two (NoSQL and RDBMS) share many underlying ideas.
  • One little glimpse of that appears in the context of the structure of indexes in RDBMS and in some of the NoSQL products, which often use B-tree and B-tree-like structures for storage.
  • MySQL — MySQL is one of the most popular open-source relational databases.
  • MySQL is modular in design, provides pluggable storage engines, and allows pluggable modules to support additional features as desired.
  • At a conceptual level, a MySQL server, accessed from a client, could be depicted as follows.

38 of 63

39 of 63

  • MySQL is a fast database server.
  • Its typical read and write response times for a few thousand rows are impressive.
  • As the amount of data increases, you can boost MySQL performance by running the server on a machine that has ample memory.
  • MySQL, like most RDBMS products, caches fetched rows in its storage engine's buffer pool, thereby providing improved performance on subsequent fetches of the same rows.
  • However, with increased data the SQL overheads become substantial. Each fetch, especially when frequent and concurrently issued by multiple clients, leads to several expensive actions:

40 of 63

With increased data, each fetch leads to several expensive actions:

  • Parsing SQL statements
  • Opening tables
  • Locking tables
  • Making SQL execution plans
  • Unlocking tables
  • Closing tables
  • Managing concurrent access

41 of 63

42 of 63

  • To boost performance under heavy loads, you need to cache as much data as possible.
  • Memcached is a typical in-memory caching solution that works well with MySQL.
  • Rows are cached and served to clients exclusively from memory. When large amounts of memory — say over 32 GB — are available, MySQL with Memcached works well to serve more than 400,000 queries per second.
  • These queries, of course, are primary-key lookups; a cache-aside sketch follows.
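A typical cache-aside read in Python, assuming the pymemcache and mysql-connector libraries (the connection details, table, and key layout are made up):

import json
from pymemcache.client.base import Client
import mysql.connector

cache = Client(("localhost", 11211))
conn = mysql.connector.connect(user="app", database="moviesdb")

def get_movie(movie_id):
    key = f"movie:{movie_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)       # served purely from memory
    cur = conn.cursor(dictionary=True)  # fall back to MySQL on a miss
    cur.execute("SELECT id, title FROM movies WHERE id = %s", (movie_id,))
    row = cur.fetchone()
    if row is not None:
        cache.set(key, json.dumps(row), expire=300)
    return row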

43 of 63

Using Memcached with MySQL is beneficial but the architecture has its drawbacks:

  • Data is in-memory in two places: the storage engine buffer and Memcached.
  • Replicating data between the storage engine and Memcached can lead to inconsistent states of data.
  • The data is fetched into Memcached via the SQL layer, hence the SQL overhead is still present.
  • Memcached performance is superior only as long as all relevant data fits in memory. Otherwise, disk I/O overheads can be high and can make the system slow.

44 of 63

HandlerSocket

  • An alternative to using MySQL with Memcached is to bypass the SQL layer and get directly to the storage engine.
  • This is exactly what the HandlerSocket plugin does. HandlerSocket is an open-source plugin for MySQL that allows bypassing the SQL layer to access the underlying storage engine.

45 of 63

46 of 63

  • Using HandlerSocket is better than using Memcached with the SQL layer, because it not only avoids the SQL layer but also avoids duplicate caches and the possible inconsistencies in replication.
  • HandlerSocket interfaces directly with the storage engine, so there are not two replicas but a single store.
  • The HandlerSocket-based NoSQL solution is especially suitable for high-performance reads.
  • The base MySQL storage engine provides transactional support and recovery from a crash.
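A rough sketch of talking to HandlerSocket from Python over a raw socket: the protocol is tab-separated and newline-terminated, and 9998 is the conventional read-only port. Treat the exact command fields as illustrative and consult the HandlerSocket protocol documentation:

import socket

sock = socket.create_connection(("localhost", 9998))
reader = sock.makefile()

def send(fields):
    sock.sendall(("\t".join(fields) + "\n").encode())
    return reader.readline().rstrip("\n").split("\t")

# Open index 1 on test.movies via the PRIMARY key, exposing two columns.
print(send(["P", "1", "test", "movies", "PRIMARY", "id,title"]))

# Fetch the row whose primary key equals 42: no SQL parsing, no plans.
print(send(["1", "=", "1", "42"]))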

47 of 63

MOSTLY IMMUTABLE DATA STORES

  • RDBMS offers transactional support and the consistency that NoSQL frequently lacks. This is a major reason for not adopting NoSQL.
  • A lot of modern-day applications have little or no need for transactional support. This is mainly because the data is often written once and read and manipulated many times.
  • Many social media applications, for example those that send messages or tweets or publish and propagate status updates, are primarily written once and consumed many times.
  • The write-once and read-many-times paradigm is also prevalent for updates, e-mails, SMS messages, or feedback.
  • Applications that solicit responses to polls, feedback, ratings, and comments are also often write-once and read-multiple-times type cases.

48 of 63

Polyglot Persistence at Facebook

  • Facebook in particular uses MySQL for many mission-critical features. Facebook is also a big HBase user.
  • Facebook is about large volume and superior performance and its MySQL optimizations are no exception to that.
  • Its work is focused on maximizing queries per second and controlling the variance of the request-response times.

49 of 63

Facebook

  • Read responses were an average of 4ms and writes were 5ms.
  • Maximum rows read per second scaled up to a value of 450 million, which is obviously very large compared to most systems.
  • 13 million queries per second were processed at peak.
  • 3.2 million row updates and 5.2 million InnoDB disk operations were performed in boundary cases.

50 of 63

  • Facebook has focused on reliability more than maximizing queries per second, although the queries per-second numbers are very impressive too.
  • Active sub-second-level monitoring and profiling allows Facebook database teams to identify points of server performance fractures, called stalls. Slower queries and problems have been progressively identified and corrected.
  • Facebook is also the birthplace of Cassandra. Facebook has lately moved away from Cassandra in favor of HBase.
  • The current Facebook messaging infrastructure is built on HBase. Facebook’s new messaging system supports storage of more than 135 billion messages a month.
  • Facebook has adopted polyglot persistence strategies.

51 of 63

  • Like Facebook, Twitter and LinkedIn have adopted polyglot persistence.
  • Twitter, for example, uses MySQL and Cassandra actively.
  • Twitter also uses a graph database, named FlockDB, for maintaining relationships, such as who’s following whom and who you receive phone notifications from.
  • Twitter’s popularity and data volume have grown immensely over the years.

52 of 63

Data Warehousing and Business Intelligence

  • An entire category of applications is built to store and manipulate archived data sets. [to store old records or documents so that you can access them again in the future]
  • Usually, these data warehouses are built out of old transactional data, which is typically referred to as fact data.
  • Data, in the data warehouse, is then analyzed and manipulated to uncover patterns or decipher trends.
  • All such archived and warehoused data is read-only, and the transactional requirements are minimal.
  • These data sets have traditionally been stored in special-purpose data stores, which have the capability to store large volumes of data and analyze the data.

53 of 63

  • With the advent of Hadoop, some of the large-scale analytics is done by MapReduce-based jobs. The MapReduce-based model of analytics is being enriched by the availability of querying tools like Hive and high-level languages like Pig.
  • The Apache Mahout project builds a machine learning infrastructure on top of Hadoop.
  • Therefore, you could run collaborative filtering algorithms or Naive Bayes classifiers over a Hadoop MapReduce infrastructure using Mahout.
  • A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources.

54 of 63

WEB FRAMEWORKS AND NOSQL

  • Using Rails with NoSQL — Rails is arguably the most popular agile web development framework.
  • Rails implements a Model-View-Controller (MVC) framework.
  • The use of ActiveRecord enables automatic mapping of model objects to data persisted in relational tables.
  • Views provide the user interface to manipulate the underlying data and controllers facilitate the coordination between the model and the view.
  • To use MongoDB with Rails, first switch ActiveRecord off.

gem install rails

gem install mongo_mapper
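One way to switch ActiveRecord off (assuming Rails 3 or later) is to generate the application without it:

rails new myapp --skip-active-record

MongoMapper models can then be defined in place of ActiveRecord models.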

55 of 63

Using Django with NoSQL

  • Django is to the Python community what Rails is to Ruby developers. Django is a lightweight web framework that allows for rapid prototyping and fast development.
  • The Django idiom is also based on an ORM (Object-Relational Mapper) to map models to databases. The SQL standard and the presence of a disintermediating ORM layer make it possible for Django applications to swap one RDBMS for another. However, doing the same in the NoSQL world is not common.
  • The django-nonrel independent open-source project was put together to address all these issues and to provide a common level of abstraction for Django to work with multiple NoSQL products.

56 of 63

Using Spring Data

  • Though Rails and Django are popular web frameworks for agile development, a lot of enterprise developers still use Java to build their new-generation applications.
  • Spring is a popular Java dependency injection framework that has been widely adopted around the world. Spring has included NoSQL support via its Spring Data project.
  • For simplicity and faster development, you can use the SpringSource Tool Suite (STS).
  • STS uses Maven to configure and build a project, so the definitions are specified in a project's project object model (POM).

57 of 63

MIGRATING FROM RDBMS TO NOSQL

  • Migrating from a structured schema to a schema-less form is not very hard.
  • In many cases you could simply export the data from RDBMS tables and move them into NoSQL collections.
  • However, things get complicated when the NoSQL database is a column-family, sorted ordered, or a key/value store.
  • Changes in paradigm often lead to redesign efforts.
  • The greater impedance mismatch is around ad-hoc querying and secondary indexes, which are often difficult to support in a NoSQL environment.

58 of 63

  • NoSQL looks at the data store from a query perspective and not from a generic storage viewpoint.
  • To facilitate data importation from RDBMS to Hadoop for NoSQL-style manipulations, Cloudera has created an open-source product called Sqoop.
  • Sqoop is a command-line tool with the following capabilities (a sample invocation follows the list):
  • Imports individual RDBMS tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse
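A sample import invocation might look like the following (the connection string, table, and paths are made up; the flags are standard Sqoop options):

sqoop import --connect jdbc:mysql://localhost/moviesdb \
    --table movies --username app --target-dir /data/movies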

59 of 63

Using Django with NoSQL

# models.py:
from django.db import models

class MyModel(models.Model):
    name = models.CharField(max_length=64)
    # Denormalized copies of other fields, maintained automatically in
    # save() so that queries reduce to simple equality filters.
    lowercase_name = models.CharField(max_length=64, editable=False)
    # auto_now=True sets the timestamp whenever the record is modified.
    last_modified = models.DateTimeField(auto_now=True)
    month_last_modified = models.IntegerField(editable=False)

    def save(self, *args, **kwargs):
        # *args and **kwargs forward any positional and keyword arguments
        # to Django's own save(); self is the model instance being saved.

60 of 63

        self.lowercase_name = self.name.lower()
        self.month_last_modified = self.last_modified.month
        super(MyModel, self).save(*args, **kwargs)

def run_query(name, month):
    # filter() expresses the query; both conditions are equality lookups.
    return MyModel.objects.filter(lowercase_name=name.lower(),
                                  month_last_modified=month)

61 of 63

Spring code

<?xml version="1.0" encoding="UTF-8"?>  <!-- encoding declares the character set -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <!-- the xmlns attribute defines the XML namespace -->
  ...

  <!-- Spring: a Java platform providing comprehensive infrastructure
       support for developing Java applications -->
  <dependency>
    <groupId>org.springframework</groupId>
    <!-- artifactId names the Maven artifact -->
    <artifactId>spring-context</artifactId>
    <version>${org.springframework-version}</version>

62 of 63

    <exclusions>
      <!-- Exclude commons-logging in favor of SLF4J -->
      <exclusion>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

63 of 63

  <!-- the Spring container "injects" dependencies such as these into other objects -->
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-io</artifactId>
    <version>1.3.2</version>
  </dependency>

<dependency>