HBase

Limitations of Hadoop

  • Hadoop can perform only batch processing, and data is accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
  • A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).

Hadoop Random Access Databases

  • Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are databases that store huge amounts of data and access the data in a random manner.

What is HBase?

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.

HBase has a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.

One can store data in HDFS either directly or through HBase. Data consumers read and access the data in HDFS randomly using HBase, which sits on top of the Hadoop file system and provides read and write access.

HBase and HDFS

HDFS:

  • A distributed file system suitable for storing large files.
  • Does not support fast individual record lookups.
  • Provides high-latency batch processing.
  • Provides only sequential access to data.

HBase:

  • A database built on top of HDFS.
  • Provides fast lookups for larger tables.
  • Provides low-latency access to single rows from billions of records (random access).
  • Internally uses hash tables, provides random access, and stores the data in indexed HDFS files for faster lookups.
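The sequential-versus-random contrast can be modeled with plain Java collections. The sketch below is not HBase code; the TreeMap merely stands in for HBase's sorted, indexed storage, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// A stdlib model only: a flat file must be scanned, a sorted index is looked up.
public class LookupDemo {

    // "Flat file": records can only be found by scanning front to back.
    static final List<String[]> file = new ArrayList<>();

    // "Index": a sorted map keyed by row key, answered by direct lookup.
    static final TreeMap<String, String> index = new TreeMap<>();

    static {
        for (int i = 0; i < 1000; i++) {
            file.add(new String[] {"row" + i, "value" + i});
            index.put("row" + i, "value" + i);
        }
    }

    // Sequential access: worst case touches every record.
    static String sequentialGet(String rowKey) {
        for (String[] record : file) {
            if (record[0].equals(rowKey)) {
                return record[1];
            }
        }
        return null;
    }

    // Random access: the sorted map locates the key in O(log n).
    static String randomGet(String rowKey) {
        return index.get(rowKey);
    }

    public static void main(String[] args) {
        System.out.println(sequentialGet("row999")); // scans all 1000 records first
        System.out.println(randomGet("row999"));     // goes straight to the key
    }
}
```

Both calls return the same value; the difference is how much data each must touch to find it.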

Storage Mechanism in HBase

HBase is a column-oriented database, and its tables are sorted by row key.

The table schema defines only column families; within a family, columns are stored as key-value pairs.

Row ID | Column Family      | Column Family      | Column Family
       | col1  col2  col3   | col1  col2  col3   | col1  col2  col3
-------+--------------------+--------------------+------------------
   1   |                    |                    |
   2   |                    |                    |
   3   |                    |                    |

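This layout can be modeled as a nested sorted map: row key → column family → column qualifier → value. The following is a plain-Java illustration of that logical model, not the HBase client API; row keys, families, and values are made-up examples:

```java
import java.util.TreeMap;

// A toy model of HBase's logical layout:
// row key -> column family -> column qualifier -> value.
public class DataModelDemo {

    static final TreeMap<String, TreeMap<String, TreeMap<String, String>>> table = new TreeMap<>();

    // Rows stay sorted by row key; only the families are fixed up front,
    // so different rows may populate entirely different columns.
    static void put(String row, String family, String qualifier, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .put(qualifier, value);
    }

    static String get(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<String, String>> families = table.get(row);
        if (families == null) return null;
        TreeMap<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(qualifier);
    }

    public static void main(String[] args) {
        put("1", "personal data", "name", "raju");
        put("1", "professional data", "designation", "manager");
        put("2", "personal data", "city", "hyderabad");

        System.out.println(get("1", "personal data", "name"));
        System.out.println(get("2", "professional data", "salary")); // absent cell -> null
    }
}
```

Note how row 2 simply lacks the cells row 1 has: empty cells cost nothing, which is why sparse, wide tables fit this model well.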
Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they store column families.

Row-Oriented Database:

  • Suitable for Online Transaction Processing (OLTP).
  • Designed for a small number of rows and columns.

Column-Oriented Database:

  • Suitable for Online Analytical Processing (OLAP).
  • Designed for huge tables.
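The OLTP/OLAP split follows from how the bytes are laid out. A minimal sketch of the same small table in both layouts (plain Java, purely illustrative) shows why a column aggregate is cheap in a column store:

```java
import java.util.Arrays;

// Row-oriented vs. column-oriented layout of the same small table.
public class LayoutDemo {

    // Row-oriented: each record is stored together, which suits
    // OLTP-style work ("fetch everything about row 1").
    static final int[][] rows = {
        {1, 100, 10},
        {2, 200, 20},
        {3, 300, 30},
    };

    // Column-oriented: each column is stored together, which suits
    // OLAP-style work ("aggregate column 1 over all rows").
    static final int[][] columns = {
        {1, 2, 3},
        {100, 200, 300},
        {10, 20, 30},
    };

    // A column aggregate over the column store reads one contiguous array.
    static int sumColumnFromColumnStore(int c) {
        return Arrays.stream(columns[c]).sum();
    }

    // The same aggregate over the row store must visit every row.
    static int sumColumnFromRowStore(int c) {
        int sum = 0;
        for (int[] row : rows) {
            sum += row[c];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumColumnFromColumnStore(1)); // 600
        System.out.println(sumColumnFromRowStore(1));    // 600
    }
}
```

Both layouts give the same answer; the column store simply reads only the data the query needs.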

HBase and RDBMS

HBase:

  • Schema-less: no fixed-column schema, it defines only column families.
  • Built for wide tables; horizontally scalable.
  • No transactions.
  • Holds de-normalized data.
  • Good for semi-structured as well as structured data.

RDBMS:

  • Governed by its schema, which describes the whole structure of the tables.
  • Thin and built for small tables; hard to scale.
  • Transactional.
  • Holds normalized data.
  • Good for structured data.

Features of HBase

  • HBase is linearly scalable.
  • It has automatic failure support.
  • It provides consistent reads and writes.
  • It integrates with Hadoop, both as a source and a destination.
  • It has an easy Java API for clients.
  • It provides data replication across clusters.

Where to Use HBase

  • Apache HBase is used when you need random, real-time read/write access to Big Data.

  • It hosts very large tables on top of clusters of commodity hardware.

  • Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable runs on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

  • It is used for write-heavy applications.

  • HBase is used whenever we need to provide fast random access to available data.

  • Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase - Architecture

Master Server

The master server -

  • Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task.
  • Handles load balancing of the regions across region servers: it unloads the busy servers and shifts the regions to less-occupied servers.
  • Maintains the state of the cluster by negotiating the load balancing.
  • Is responsible for schema changes and other metadata operations, such as creation of tables and column families.

Regions

Regions are tables that have been split up and spread across the region servers.

Region Server

The region servers have regions that -

  • Communicate with the client and handle data-related operations.
  • Handle read and write requests for all the regions under them.
  • Decide the size of the region by following the region size thresholds.

When we take a deeper look into a region server, it contains regions and stores.

ZooKeeper

ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, and distributed synchronization.

ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to discover available servers.

In addition to availability, the nodes are also used to track server failures or network partitions.

Clients communicate with region servers via ZooKeeper.

In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.

HBase Shell

HBase contains a shell through which you can communicate with HBase.

General Commands:

  • status - Provides the status of HBase, for example, the number of servers.
  • version - Provides the version of HBase being used.
  • table_help - Provides help for table-reference commands.
  • whoami - Provides information about the user.

Data Definition Language:

These are the commands that operate on the tables in HBase.

  • create - Creates a table.

  • list - Lists all the tables in HBase.

  • disable - Disables a table.

  • is_disabled - Verifies whether a table is disabled.

  • enable - Enables a table.

  • is_enabled - Verifies whether a table is enabled.

  • describe - Provides the description of a table.

  • alter - Alters a table.

  • exists - Verifies whether a table exists.

  • drop - Drops a table from HBase.

  • drop_all - Drops the tables matching the ‘regex’ given in the command.

  • Java Admin API - In addition to the above commands, Java provides an Admin API to achieve DDL functionalities through programming. HBaseAdmin (in the org.apache.hadoop.hbase.client package) and HTableDescriptor (in the org.apache.hadoop.hbase package) are two important classes that provide DDL functionalities.
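A short shell session exercising the DDL commands might look like the following (prompt numbers and command output are trimmed and illustrative; exact output varies by HBase version):

```
hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> list
TABLE
test
hbase(main):003:0> is_enabled 'test'
true
hbase(main):004:0> disable 'test'
hbase(main):005:0> is_disabled 'test'
true
hbase(main):006:0> drop 'test'
hbase(main):007:0> exists 'test'
Table test does not exist
```

Note that a table must be disabled before it can be dropped.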

Data Manipulation Language:

  • put - Puts a cell value at a specified column in a specified row in a particular table.
  • get - Fetches the contents of row or a cell.
  • delete - Deletes a cell value in a table.
  • deleteall - Deletes all the cells in a given row.
  • scan - Scans and returns the table data.
  • count - Counts and returns the number of rows in a table.
  • truncate - Disables, drops, and recreates a specified table.

  • Java client API - In addition to the above commands, Java provides a client API to achieve DML functionalities, CRUD (Create, Retrieve, Update, Delete) operations, and more through programming, under the org.apache.hadoop.hbase.client package. HTable, Put, and Get are important classes in this package.
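A shell session exercising the DML commands against a table with a 'personal data' column family might look like this (output trimmed and illustrative; the timestamp placeholder stands in for a real value):

```
hbase(main):001:0> put 'emp', '1', 'personal data:name', 'raju'
hbase(main):002:0> get 'emp', '1'
COLUMN                  CELL
 personal data:name     timestamp=..., value=raju
hbase(main):003:0> scan 'emp'
ROW                     COLUMN+CELL
 1                      column=personal data:name, timestamp=..., value=raju
hbase(main):004:0> count 'emp'
1 row(s)
hbase(main):005:0> deleteall 'emp', '1'
```

Cells are addressed as 'family:qualifier', so put and delete always name both the column family and the column.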

Table Creation in HBase 

Creating a Table using HBase Shell

Creating a Table Using java API

Creating a Table using HBase Shell

You can create a table using the create command; you must specify the table name and the column family names.

create '<table name>', '<column family>'

create 'emp', 'personal data', 'professional data'

The resulting table has a row key and two column families: personal data and professional data.

Creating a Table Using java API

You can create a table in HBase using the createTable() method of the HBaseAdmin class. This class belongs to the org.apache.hadoop.hbase.client package.

Steps to create a table in HBase using the Java API:

Step 1: Instantiate HBaseAdmin

Step 2: Create a TableDescriptor

Step 3: Execute through Admin

Step 1: Instantiate HBaseAdmin

This class requires a Configuration object as a parameter, so first instantiate the Configuration class and pass that instance to HBaseAdmin.

Configuration conf = HBaseConfiguration.create();

HBaseAdmin admin = new HBaseAdmin(conf);

Step 2: Create a TableDescriptor

HTableDescriptor is a class that belongs to the org.apache.hadoop.hbase package. This class is a container for table names and column families.

// creating table descriptor
HTableDescriptor table = new HTableDescriptor(TableName.valueOf("Table name"));

// creating column family descriptor
HColumnDescriptor family = new HColumnDescriptor("column family");

// adding column family to table descriptor
table.addFamily(family);

Step 3: Execute through Admin

Using the createTable() method of the HBaseAdmin class, you execute the table creation in admin mode:

admin.createTable(table);

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTable {

    public static void main(String[] args) throws IOException {

        // Instantiating configuration class
        Configuration con = HBaseConfiguration.create();

        // Instantiating HBaseAdmin class
        HBaseAdmin admin = new HBaseAdmin(con);

        // Instantiating table descriptor class
        HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("emp"));

        // Adding column families to table descriptor
        tableDescriptor.addFamily(new HColumnDescriptor("personal"));
        tableDescriptor.addFamily(new HColumnDescriptor("professional"));

        // Execute the table creation through admin
        admin.createTable(tableDescriptor);
        System.out.println("Table created");
    }
}
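As an aside, HBaseAdmin and HTableDescriptor were deprecated in the HBase 2.x client in favor of the Connection/Admin and builder APIs. A hedged sketch of the equivalent program under that newer API follows; it needs the hbase-client dependency and a running cluster, so it is shown for orientation only:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableModern {

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();

        // Connection and Admin replace the direct HBaseAdmin instantiation.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Builders replace HTableDescriptor/HColumnDescriptor.
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("emp"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                .build();

            admin.createTable(desc);
            System.out.println("Table created");
        }
    }
}
```

The try-with-resources block also closes the connection cleanly, which the older example omits.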