1 of 94

Hadoop ecosystem

The Hadoop ecosystem contains components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie.

3 of 94

HBase & HCatalog

Apache HBase:

HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is scalable and distributed, and provides real-time read and write access to data in HDFS.

HCATALOG:

It is a table and storage management layer for Hadoop. HCatalog lets different components of the Hadoop ecosystem, such as MapReduce, Hive, and Pig, easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
By default, HCatalog supports RCFile, CSV, JSON, sequenceFile and ORC file formats.

4 of 94

Apache Mahout

Mahout is an open source framework for creating scalable machine learning algorithms, and a data mining library. Once data is stored in HDFS, Mahout provides data science tools to automatically find meaningful patterns in those big data sets.

Algorithms of Mahout are:

Clustering – It takes the items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
Collaborative filtering – It mines user behavior and makes product recommendations (e.g. Amazon recommendations).
Classification – It learns from existing categorizations and then assigns unclassified items to the best category.
Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
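
To make frequent pattern mining concrete, here is a minimal plain-Python sketch that counts how often pairs of items appear in the same basket. It does not use Mahout; the `carts` data, the `frequent_pairs` helper and the `min_support` threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts; each cart is a set of item names.
carts = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def frequent_pairs(baskets, min_support):
    """Return the item pairs that co-occur in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical order, so (a, b) == (b, a)
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(carts, min_support=2)
print(pairs)
```

With these carts, only bread/milk and eggs/milk reach the threshold of two baskets.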

5 of 94

Apache Sqoop & Flume

Sqoop:
Imports data from external sources into related Hadoop ecosystem components such as HDFS, HBase or Hive.
It also exports data from Hadoop to other external sources.
Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.

6 of 94

Apache Flume

Flume efficiently collects, aggregates and moves large amounts of data from its origin into HDFS.
It is a fault-tolerant and reliable mechanism.
This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment.
It uses a simple, extensible data model that allows for online analytic applications.
Using Flume, we can get data from multiple servers into Hadoop immediately.

7 of 94

Query languages for Hadoop

MapReduce (MR) is a standard Big Data processing model for processing large datasets in a parallel and distributed way.
The low-level, batch-oriented nature of MR makes it difficult to program directly, which has given rise to abstraction layers on top of MR.
Several High-Level MapReduce Query Languages built on top of MR provide more abstract query languages and extend the MR programming model.

8 of 94

These High-Level MapReduce Query Languages take the burden of MR programming away from developers and allow a smooth migration of existing SQL skills to Big Data.
The common High-Level MapReduce Query Languages are built directly on top of MR and translate queries into executable native MR jobs.
Four such languages are considered here, JAQL, Hive, Big SQL and Pig, with regard to their performance and ease of programming.

10 of 94

Query languages for Hadoop

Pig, from Yahoo! and now an Apache project, has an imperative language called Pig Latin for performing operations on large data files.
Jaql, from IBM, is a declarative query language for JSON data.
Hive, from Facebook, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.

11 of 94

HIVE & PIG

Hive:

Apache Hive is an open source data warehouse system for querying and analysing large datasets stored in Hadoop files.
Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL.
Hive automatically translates HiveQL's SQL-like queries into MapReduce jobs, which execute on Hadoop.

12 of 94

Pig:

Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS.
As a component of the Hadoop ecosystem, Pig uses the Pig Latin language.
Many of its operations resemble those of SQL, but Pig Latin is a data flow language rather than a declarative query language.
It loads the data, applies the required filters and transforms the data into the required format.

13 of 94

STREAM COMPUTING

Big data stream computing analyzes and processes data in real time to gain immediate insight. It is typically applied to vast amounts of data that must be processed at high speed.
A stream computing system is a high-performance computer system that analyzes multiple data streams from many sources live.
The word "stream" in stream computing means pulling in streams of data, processing the data and streaming it back out as a single flow.
Stream computing uses software algorithms that analyze the data in real time as it streams in, which increases speed and accuracy in data handling and analysis.

14 of 94

In a stream processing system, applications typically act as continuous queries, ingesting data continuously, analyzing and correlating the data, and generating a stream of results.
Applications are represented as data-flow graphs composed of operators and interconnected by streams.
The individual operators implement algorithms for data analysis, such as parsing, filtering, feature extraction, and classification.
Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensors readings from sites in a forest, etc.).
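
The idea of operators connected by streams, each making a single pass over the data, can be sketched with Python generators. This is an illustration only, not the API of any stream processing product; the feed format (`symbol,price`) and all function names are assumptions.

```python
def parse(lines):
    """Operator 1: turn raw 'symbol,price' text into (symbol, float) records."""
    for line in lines:
        symbol, price = line.strip().split(",")
        yield symbol, float(price)

def only_symbol(records, wanted):
    """Operator 2: filter, keeping only records for one symbol."""
    for symbol, price in records:
        if symbol == wanted:
            yield symbol, price

def running_mean(records):
    """Operator 3: single-pass aggregate that emits the mean seen so far."""
    count, total = 0, 0.0
    for symbol, price in records:
        count += 1
        total += price
        yield symbol, total / count

# Hypothetical market feed; in a real system this would be an unbounded source.
feed = ["ACME,10.0", "XYZ,5.0", "ACME,14.0", "ACME,12.0"]

# The data-flow graph: parse -> filter -> aggregate
results = list(running_mean(only_symbol(parse(feed), "ACME")))
print(results)
```

Each operator sees every record once and keeps only a small amount of state, which is what allows the pipeline to keep up with a continuous feed.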

15 of 94

Stream computing

16 of 94

IBM announced its stream computing system, called System S.
ATI Technologies also announced a stream computing technology that enables graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems.

18 of 94

PIG

Pig raises the level of abstraction for processing large datasets.
With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
Pig Latin is a parallel data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.

Pig is made up of two pieces:

The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.

20 of 94

Apache Pig Components

There are several components in the Apache Pig framework.

Parser

At first, all Pig scripts are handled by the parser. The parser checks the syntax of the script, does type checking, and performs other miscellaneous checks. The parser's output is a DAG (directed acyclic graph) that represents the Pig Latin statements as well as logical operators. In this DAG (the logical plan), the logical operators of the script are represented as nodes and the data flows are represented as edges.

Optimizer

Next, the logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.

21 of 94

Compiler

Then the compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in sorted order.

These jobs are executed on Hadoop to produce the desired results.

22 of 94

Important points

A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
Pig transforms your query into a series of MapReduce tasks without you being aware of it.
You focus on the data rather than on the nature of the execution.
Pig is a scripting language for exploring large datasets.
Pig’s sweet spot is its ability to process terabytes of data using only half a dozen lines of Pig Latin issued from the console.

23 of 94

Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by user-defined functions (UDFs).
As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.
Pig isn’t suitable for all data processing tasks.
If you want to perform a query that touches only a small amount of data in a large dataset, then Pig will not perform well, since it is set up to scan the whole dataset, or at least large portions of it.

24 of 94

Execution Types

Pig has two execution types or modes:

local mode and
MapReduce mode.

Local mode:

In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set the option to local.
% pig -x local
grunt>
This starts Grunt, the Pig interactive shell.

25 of 94

MapReduce mode

In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.
The cluster may be a pseudo or fully distributed cluster.
To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. Pig releases will only work against particular versions of Hadoop.

26 of 94

Running Pig Programs

There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:

Script:

Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.

Grunt:

Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.

Embedded:

You can run Pig programs from Java using the PigServer class, much like you can use JDBC to run SQL programs from Java. For programmatic access to Grunt, use PigRunner.

27 of 94

Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.

28 of 94

You can run your Pig scripts in the shell after invoking the Grunt shell. The Grunt shell also offers a number of useful shell and utility commands.

30 of 94

PigLatin

Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs.
A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command. For example, a GROUP operation is a type of statement.

grouped_records = GROUP records BY year;

Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it. In Grunt no error
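
What the GROUP statement produces can be sketched in plain Python. This is not how Pig implements GROUP; it only shows the shape of the result, one tuple per key holding a bag of the matching input tuples. The `(year, temperature)` field layout of `records` is an assumption made for the example.

```python
from collections import defaultdict

# A relation is a bag of tuples; the (year, temperature) layout is hypothetical.
records = [(1950, 0), (1950, 22), (1949, 111), (1949, 78)]

def group_by(relation, key_index):
    """Mimic GROUP ... BY: one output tuple per key, holding a bag of matching tuples."""
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    # Sorted here only to make the printout deterministic; Pig does not promise an order.
    return sorted(bags.items())

grouped_records = group_by(records, key_index=0)
for group, bag in grouped_records:
    print(group, bag)
```

The nested result, a tuple containing a bag of tuples, is an instance of the fully nested data model that Pig Latin works with.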

31 of 94

To analyze data in Hadoop using Apache Pig, we use the Pig Latin language.
Pig Latin statements are first transformed into MapReduce jobs by an interpreter layer; Hadoop then processes these jobs.
Pig Latin is a very simple language with SQL-like semantics.
It can be used productively.

32 of 94

It also contains a rich set of functions for data manipulation.
Moreover, these functions are extensible: we can add our own by writing user-defined functions (UDFs) in Java.

33 of 94

Data Model in Pig Latin

The data model of Pig is fully nested. The outermost structure of the Pig Latin data model is a relation, which is a bag, where:
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.

34 of 94

Statements in Pig Latin

Statements are the basic constructs for processing data with Pig Latin.
Statements work with relations, and include expressions and schemas.
Every statement ends with a semicolon (;).
Through statements, we perform operations using the operators that Pig Latin offers.
With the exception of LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output.
When we enter a LOAD statement in the Grunt shell, only its semantic checking is carried out.
To see the contents of the relation, we need to use the DUMP operator.
The MapReduce job that loads the data from the file system is carried out only after the DUMP operation is performed.
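
The deferred execution described above can be illustrated with a small Python sketch: building the data flow only records the steps, and the data is read when `dump()` is called. The `LazyRelation` class and the `load` function are hypothetical and are not part of Pig.

```python
class LazyRelation:
    """Records the steps of a data flow; nothing runs until dump() is called."""

    def __init__(self, source, steps=()):
        self.source = source          # a function that reads the data when called
        self.steps = list(steps)

    def filter(self, predicate):
        return LazyRelation(self.source, self.steps + [("filter", predicate)])

    def foreach(self, func):
        return LazyRelation(self.source, self.steps + [("foreach", func)])

    def dump(self):
        rows = iter(self.source())    # the data is read only now
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

reads = []

def load():
    """Stand-in for LOAD; notes each time the data is actually read."""
    reads.append("read")
    return [("alice", 31), ("bob", 17), ("carol", 45)]

adults = LazyRelation(load).filter(lambda t: t[1] >= 18).foreach(lambda t: t[0])
assert reads == []                    # building the plan did not touch the data
names = adults.dump()
print(names)
```

Deferring the work this way gives the system the whole data flow before anything runs, which is what makes plan-level optimization possible.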

35 of 94

Pig Latin Datatypes

int
A signed 32-bit integer. For example: 10
long
A signed 64-bit integer. For example: 10L
float
A signed 32-bit floating point number. For example: 10.5F
double
A 64-bit floating point number. For example: 10.5
chararray
A character array (string) in Unicode UTF-8 format. For example: 'Data Flair'
bytearray
A byte array (blob).

36 of 94

boolean
A Boolean value. For example: true/false. Note: it is case insensitive.
datetime
A date-time. For example: 1970-01-01T00:00:00.000+00:00
biginteger
A Java BigInteger. For example: 60708090709
bigdecimal
A Java BigDecimal. For example: 185.98376256272893883

37 of 94

Complex Types

Tuple
Bag
Map

Pig Latin Operators

Arithmetic Operators

Comparison Operators

Type Construction Operators

38 of 94

Data Processing Operators

Loading and Storing

LOAD

It loads the data from a file system into a relation.

STORE

It stores a relation to the file system (local/HDFS).

Filtering

FILTER

It removes unwanted rows from a relation.

DISTINCT

It removes duplicate rows from a relation.

FOREACH, GENERATE

It transforms the data based on columns of data.

STREAM

It transforms a relation using an external program.
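
As a rough analogy, the effect of FILTER, DISTINCT and FOREACH ... GENERATE on a relation can be written with Python list operations. The `people` relation and its `(name, city, age)` layout are made up, and the Pig Latin in the comments is only indicative of the equivalent statement.

```python
# Hypothetical relation with fields (name, city, age)
people = [
    ("alice", "Paris", 31),
    ("bob", "Oslo", 17),
    ("alice", "Paris", 31),   # duplicate row
    ("carol", "Oslo", 45),
]

# FILTER people BY age >= 18: drop unwanted rows
adults = [t for t in people if t[2] >= 18]

# DISTINCT adults: remove duplicate rows (dict.fromkeys keeps first-seen order)
unique_adults = list(dict.fromkeys(adults))

# FOREACH unique_adults GENERATE name, age + 1: build new columns from existing ones
projected = [(name, age + 1) for name, _city, age in unique_adults]

print(projected)
```

STREAM is left out because it hands the relation to an external program, which has no meaningful in-process equivalent.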

39 of 94

Diagnostic Operators
DUMP

It prints the contents of a relation to the console.

DESCRIBE

It describes the schema of a relation.

EXPLAIN

It shows the logical and physical execution plans used to compute a relation.

ILLUSTRATE

It displays the step-by-step execution of a series of statements.

40 of 94

Grouping and Joining

JOIN: It joins two or more relations.
COGROUP: It groups the data in two or more relations.
GROUP: It groups the data in a single relation.
CROSS: It creates the cross product of two or more relations.

Sorting

ORDER

It arranges a relation in an order based on one or more fields.

LIMIT

We can get a particular number of tuples from a relation.

Combining and Splitting

UNION

We can combine two or more relations into one relation.

SPLIT

It splits a single relation into two or more relations.
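
The joining, sorting, and combining and splitting operators can likewise be mimicked in a few lines of Python. The `users` and `orders` relations are hypothetical, and the sketch only shows what each operator does to the tuples, not how Pig executes it on a cluster.

```python
users = [(1, "alice"), (2, "bob"), (3, "carol")]      # (id, name)
orders = [(1, 250), (3, 40), (1, 75)]                 # (user_id, amount)

# JOIN users BY id, orders BY user_id: concatenate the matching tuples
joined = [u + o for u in users for o in orders if u[0] == o[0]]

# ORDER joined BY amount DESC, followed by LIMIT 2
top2 = sorted(joined, key=lambda t: t[3], reverse=True)[:2]

# SPLIT orders INTO small IF amount < 100, large IF amount >= 100
small = [o for o in orders if o[1] < 100]
large = [o for o in orders if o[1] >= 100]

# UNION small, large: combine relations without removing duplicates
combined = small + large

print(top2)
```

Note that the join keeps the key fields of both inputs, so each joined tuple here has four fields.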

41 of 94

Hive

Apache Hive is an open source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files.
It processes structured and semi-structured data in Hadoop.
Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster.
Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
Metadata such as table schemas is stored in a database called the metastore.

43 of 94

Metastore

It stores metadata for each and every table, including partition metadata.
This helps the driver to track the progress of various data sets distributed over the cluster.
It stores the data in a traditional RDBMS format.
A backup server regularly replicates the data so that it can be retrieved in case of data loss.

44 of 94

Driver

It acts like a controller which receives the HiveQL statements.
The driver starts the execution of the statement by creating sessions.
It monitors the life cycle and progress of the execution.
Driver stores the necessary metadata generated during the execution of a HiveQL statement.
It also acts as a collection point of data or query result obtained after the Reduce operation

45 of 94

Compiler

It performs the compilation of the HiveQL query.
This converts the query to an execution plan. The plan contains the tasks.
It also contains the steps that MapReduce needs to perform to produce the output the query asks for.
The compiler in Hive first converts the query to an Abstract Syntax Tree (AST).
It checks the AST for compatibility and compile-time errors, then converts it to a Directed Acyclic Graph (DAG).

46 of 94

Optimizer – It performs various transformations on the execution plan to provide optimized DAG. It aggregates the transformations together, such as converting a pipeline of joins to a single join, for better performance. The optimizer can also split the tasks, such as applying a transformation on data before a reduce operation, to provide better performance.
Executor – Once compilation and optimization complete, the executor executes the tasks. Executor takes care of pipelining the tasks.

47 of 94

CLI, UI, and Thrift Server
The CLI (command-line interface) provides a user interface for an external user to interact with Hive.
The Thrift server in Hive allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.

48 of 94

Hive Shell

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL.
When starting Hive for the first time, we can check that it is working by listing its tables. The command must be terminated with a semicolon to tell Hive to execute it:

hive> SHOW TABLES;

Time taken: 10.425 seconds

Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here. The tab key will autocomplete Hive keywords and functions.
For a fresh install, the command takes a few seconds to run since it is lazily creating the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to where you ran the hive command from.)

50 of 94

Hive Client
Hive Services
Processing and Resource Management
Distributed Storage

51 of 94

Hive Client

Hive supports applications written in languages such as Python, Java, C++, and Ruby, which perform queries on Hive using JDBC, ODBC, and Thrift drivers. Hence, one can write a Hive client application in a language of one's choice.
Hive clients are categorized into three types:

1. Thrift Clients

The Hive server is based on Apache Thrift so that it can serve the request from a thrift client.

2. JDBC client

Hive allows for the Java applications to connect to it using the JDBC driver. JDBC driver uses Thrift to communicate with the Hive Server.

3. ODBC client

Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive Server.

52 of 94

Hive Services

To perform all queries, Hive provides various services such as HiveServer2 and Beeline. The services offered by Hive include:

cli

The command line interface to Hive (the shell). This is the default service.

Hive server

HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive. It allows multiple clients to submit requests to Hive and retrieve the final results. It is designed to provide the best support for open API clients like JDBC and ODBC.

hwi

The Hive Web Interface.

jar

The Hive equivalent to hadoop jar, a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.

53 of 94

Meta Store

The metastore is a central repository that stores metadata about the structure of tables and partitions, including column and column type information.
It also stores information about the serializer and deserializer required for read/write operations, and about the HDFS files where the data is stored. The metastore is generally a relational database.
The metastore provides a Thrift interface for querying and manipulating Hive metadata.

We can configure the metastore in either of two modes:

Remote: In remote mode, the metastore is a Thrift service and is useful for non-Java applications.
Embedded: In embedded mode, the client can directly interact with the metastore using JDBC.

54 of 94

Embedded

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded database instance backed by the local disk. This is called the embedded metastore configuration.
Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.

55 of 94

The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service, but connects to a database running in a separate process, either on the same machine or on a remote machine.
There’s another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes to the Hive service. This brings better manageability and security, since the database tier can be completely firewalled off, and the clients no longer need the database credentials.

58 of 94

SQL vs HiveQL

59 of 94

HiveQL

The Hive Query Language (HiveQL) is the query language Hive uses to process and analyze structured data in tables whose metadata is held in the metastore.

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference [WHERE where_condition]

[GROUP BY col_list]

[HAVING having_condition]

[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]

[LIMIT number];

Creating a database:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
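
To see how the WHERE, GROUP BY, HAVING and LIMIT clauses of the SELECT syntax fit together, the sketch below runs a query through Python's built-in `sqlite3` module. SQLite is used only as a stand-in because the example has no Hive cluster to talk to; the `sales` table and its data are made up, and Hive-specific clauses such as CLUSTER BY, DISTRIBUTE BY and SORT BY are not available in SQLite, so ORDER BY is used instead.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS sales (city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Oslo", 10), ("Oslo", 30), ("Paris", 5), ("Rome", 50), ("Rome", 20)],
)

rows = conn.execute(
    """
    SELECT city, SUM(amount) AS total
    FROM sales
    WHERE amount > 5              -- row-level condition, applied before grouping
    GROUP BY city
    HAVING SUM(amount) >= 40      -- group-level condition, applied after grouping
    ORDER BY total DESC
    LIMIT 2
    """
).fetchall()
conn.close()

print(rows)
```

WHERE removes the Paris row before grouping, while HAVING is evaluated on the per-city totals.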

61 of 94

Data Types in Hive

62 of 94

Complex Data Types

63 of 94

Operators

Relational Operators
Arithmetic Operators
Logical Operators
String Operators
Operators on Complex Types

64 of 94

Hive DDL commands

Hive DDL commands are the statements used for defining and changing the structure of a table or database in Hive. They are used to build or modify the tables and other objects in the database.
The several types of Hive DDL commands are:
CREATE
SHOW
DESCRIBE
USE
DROP
ALTER
TRUNCATE

65 of 94

Hive DML Commands

Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and delete data from a Hive table once the table and database schema have been defined using Hive DDL commands.
The various Hive DML commands are:
LOAD
SELECT
INSERT
DELETE
UPDATE
EXPORT
IMPORT

66 of 94

Joins

Inner join in Hive
Left Outer Join in Hive
Right Outer Join in Hive
Full Outer Join in Hive
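
The difference between an inner join and a left outer join can be shown with Python's built-in `sqlite3` module, used here as a stand-in for Hive since the join semantics are the same. The `customers` and `orders` tables are hypothetical. Right and full outer joins mirror the left outer case; they are not run here because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 250), (3, 40);
    """
)

# Inner join: only customers that have a matching order
inner = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON c.id = o.customer_id ORDER BY c.id"
).fetchall()

# Left outer join: every customer, with NULL (None) where no order matches
left_outer = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT OUTER JOIN orders o ON c.id = o.customer_id ORDER BY c.id"
).fetchall()
conn.close()

print(inner)
print(left_outer)
```

The order for customer 3 appears in neither result, because no such customer exists on the left side; a right or full outer join would keep it.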

67 of 94

Partition

Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data.
There are two types of partitioning in Apache Hive:

Static Partitioning

Dynamic Partitioning

To decompose table data sets into more manageable parts, Hive also uses bucketing. Bucketing divides the data of a table or partition into a fixed number of buckets based on the hash of a column.
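
The two ideas can be sketched in Python: partitioning maps each distinct column value to its own `column=value` directory, and bucketing maps a key to one of a fixed number of buckets by hashing. The table name, the rows and the helper functions are hypothetical, and `zlib.crc32` is only a stand-in for the hash function Hive actually uses.

```python
import zlib
from collections import defaultdict

# Hypothetical rows of a customers table
rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Paris"},
    {"id": 3, "city": "Oslo"},
    {"id": 4, "city": "Oslo"},
]

def partition_dir(table, column, row):
    """Partitioning: each distinct column value gets its own column=value directory."""
    return f"{table}/{column}={row[column]}"

def bucket_of(key, num_buckets):
    """Bucketing: hash the key, then take it modulo the number of buckets.
    crc32 is a stand-in; Hive uses its own hash function."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

layout = defaultdict(list)
for row in rows:
    layout[partition_dir("customers", "city", row)].append(row["id"])

print(dict(layout))
print({row["id"]: bucket_of(row["id"], 4) for row in rows})
```

A query that filters on city only needs to read one directory, while bucketing keeps the number of parts fixed no matter how many distinct key values there are.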

68 of 94

HBase

HBasics

HBase is a distributed column-oriented database built on top of HDFS.
HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.

Why HBase:

An RDBMS gets exponentially slower as the data becomes large.
An RDBMS expects data to be highly structured, i.e. able to fit in a well-defined schema.
Any change in the schema of an RDBMS might require downtime.

70 of 94

HBase concepts

There are three types of servers in the master-slave HBase architecture. They are:

HBase HMaster

Region Server

ZooKeeper

Region servers serve data for reads and writes. That means clients communicate directly with HBase region servers when accessing data.
The HBase Master process handles region assignment as well as DDL (create, delete tables) operations. Finally, ZooKeeper, a distributed coordination service, maintains the live cluster state.

71 of 94

HMasterServer

The master server:
Assigns regions to the region servers, with the help of Apache ZooKeeper.
Handles load balancing of the regions across region servers. It unloads the busy servers and shifts regions to less occupied servers.
Maintains the state of the cluster by negotiating the load balancing.
Is responsible for schema changes and other metadata operations such as creation of tables and column families.

72 of 94

Regions

Regions are tables that are split up and spread across the region servers.

Region server

The region servers have regions that:
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size thresholds.

73 of 94

ZooKeeper

ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization.
ZooKeeper has nodes representing the different region servers. Master servers use these nodes to discover available servers.
In addition to availability, the nodes are also used to track server failures and network partitions.
Clients use ZooKeeper to locate region servers, and then communicate with them directly.
In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.

74 of 94

Regions

Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table’s rows.
A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive).
Initially, a table comprises a single region, but as the size increases, after it crosses a configurable size threshold, it splits at a row boundary into two new regions of approximately equal size.
Until this first split happens, all loading will be against the single server hosting the original region.

75 of 94

As the table grows, the number of its regions grows. Regions are the units that get distributed over an HBase cluster.
In this way, a table that is too big for any one server can be carried by a cluster of servers, with each node hosting a subset of the table’s total regions.
Load on a table gets distributed.
The online set of sorted regions comprises the table’s total content.
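
The splitting behaviour described above can be modelled with a toy Python class: a table starts as one region and splits a region in two at a row boundary once it exceeds a threshold. This is a simplified sketch, not HBase's actual implementation; the `Table` class is hypothetical, and a row count stands in for the configurable size threshold.

```python
import bisect

class Table:
    """Toy model: a table is a sorted list of regions; a region is a sorted list of row keys."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region   # stand-in for the configurable size threshold
        self.regions = [[]]                   # a new table starts as a single region

    def put(self, row_key):
        # Locate the region covering row_key: the last region whose first key <= row_key.
        starts = [region[0] for region in self.regions[1:]]
        idx = bisect.bisect_right(starts, row_key)
        region = self.regions[idx]
        bisect.insort(region, row_key)
        if len(region) > self.max_rows:
            mid = len(region) // 2            # split at a row boundary into two halves
            self.regions[idx:idx + 1] = [region[:mid], region[mid:]]

table = Table(max_rows_per_region=4)
for key in "abcdefghij":
    table.put(key)

print(table.regions)
```

Reading the regions in order yields every row key in sorted order, matching the statement that the sorted set of regions makes up the table's total content.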

77 of 94

To maintain server state in the HBase cluster, HBase uses ZooKeeper as a distributed coordination service.
ZooKeeper keeps track of which servers are alive and available, and provides server failure notification.
Moreover, ZooKeeper maintains and guarantees common shared state.

78 of 94

Clients

There are a number of client options for interacting with an HBase cluster.

Java
HBase, like Hadoop, is written in Java.

MapReduce
HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs.
The TableInputFormat class makes splits on region boundaries.
The HBase TableOutputFormat writes the result of the reduce into HBase.

79 of 94

HBase ships with Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.
In all cases, a Java server hosts an instance of the HBase client, brokering application Avro, REST, and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.

80 of 94

HBase Vs RDBMS

Database Type

HBase

HBase is a column-oriented database: each column is stored as a contiguous unit on a page.

RDBMS

An RDBMS is row-oriented: each row is stored as a contiguous unit on a page.

Schema type

HBase

The schema of HBase is less restrictive; adding columns on the fly is possible.

RDBMS

The schema of an RDBMS is more restrictive.

81 of 94

Sparse Tables

HBase

HBase is good with sparse tables.

RDBMS

An RDBMS is not optimized for sparse tables.

Scale up / Scale out

HBase

HBase supports scale-out. When we need more memory, processing power or disk, we add new servers to the cluster rather than upgrading existing ones.

RDBMS

An RDBMS supports scale-up. When we need more memory, processing power or disk, we upgrade the same server to a more powerful one rather than adding new servers.

82 of 94

Amount of data

HBase

In HBase, the amount of data does not depend on a particular machine but on the number of machines.

RDBMS

In an RDBMS, the amount of data depends on the configuration of the server.

Support

HBase

HBase has no built-in ACID support.

RDBMS

An RDBMS has ACID support.

83 of 94

Data type

HBase

HBase supports both structured and unstructured data.

RDBMS

An RDBMS is suited for structured data.

Transaction integrity

HBase

In HBase, there is no transaction guarantee.

RDBMS

Whereas, RDBMS mostly guarantees transaction integrity.

JOINs

HBase

HBase does not support JOINs natively.

RDBMS

An RDBMS supports JOINs.

84 of 94

Referential integrity

HBase

HBase has no built-in support for referential integrity.

RDBMS

An RDBMS supports referential integrity.

85 of 94

Big SQL

IBM Big SQL is a high-performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data across the organization an easy and secure experience.

86 of 94

A Big SQL query can quickly access a variety of data sources including HDFS, RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database connection or single query for best-in-class analytic capabilities.

With Big SQL, your organization can derive significant value from your enterprise data.

87 of 94

Big SQL provides tools to help you manage your system and your databases, and you can use popular analytic tools to visualize your data.
Big SQL includes several tools and interfaces that are largely comparable to tools and interfaces that exist with most relational database management systems.

88 of 94

Big SQL's robust engine executes complex queries for relational data and Hadoop data.

Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query execution.

Combining these with a massively parallel processing (MPP) engine helps distribute query execution across nodes in a cluster.

89 of 94

The Big SQL architecture uses the latest relational database technology from IBM.

The database infrastructure provides a logical view of the data (by allowing storage and management of metadata) and a view of the query compilation, plus the optimization and runtime environment for optimal SQL processing.

91 of 94

Applications connect on a specific node based on specific user configurations.
SQL statements are routed through this node to the Big SQL management node, also called the coordinating node.
There can be one or many management nodes, but there is only one Big SQL management node. SQL statements are compiled and optimized to generate a parallel execution query plan.

92 of 94

Then, a runtime engine distributes the tasks (queries) to worker nodes on the compute nodes and manages the consumption and return of the result set.

The compute node is a node that can be a physical server or operating system.

The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes.
The DataNode holds the data.

93 of 94

When a worker node receives a query, it dispatches special processes that know how to read and write HDFS data natively.
Big SQL uses native and Java open-source readers (and writers) that are able to ingest different file formats.
The Big SQL engine pushes predicates down to these processes so that they can, in turn, apply projection and selection closer to the data. These processes also transform input data into an appropriate format for consumption inside Big SQL.
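
The idea of pushing predicates down to the readers can be sketched in Python: the reader applies selection and projection while it scans, so only matching rows and the requested columns travel further up. This is an illustration of the general technique, not Big SQL's internals; the `read_rows` function and the `stored` data are hypothetical.

```python
def read_rows(source, columns, predicate):
    """Reader-side scan: apply selection and projection while reading, so only
    matching rows and the requested columns leave the reader."""
    for row in source:
        if predicate(row):                              # selection, close to the data
            yield {c: row[c] for c in columns}          # projection, close to the data

# Hypothetical stored rows
stored = [
    {"id": 1, "city": "Oslo", "amount": 10},
    {"id": 2, "city": "Rome", "amount": 50},
    {"id": 3, "city": "Oslo", "amount": 30},
]

# The engine hands the predicate and the column list to the reader instead of
# fetching everything and filtering afterwards.
result = list(read_rows(stored, ["id", "amount"], lambda r: r["city"] == "Oslo"))
print(result)
```

Filtering at the reader reduces how much data has to be moved and converted before the rest of the query runs.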

94 of 94

All of these nodes can be on one management node, or each part can be on a separate management node.
We can separate the Big SQL management node from the other Hadoop master nodes.
This arrangement allows the Big SQL management node to have enough resources to store intermediate data from the Big SQL data nodes.
