1 of 21

Apache Hive

2 of 21

What is Hive?

  • Hive is a data warehouse system used to analyze structured data.
  • It is built on top of Hadoop and was originally developed at Facebook.
  • Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage.
  • It runs SQL-like queries, called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
  • Using Hive, we can avoid the traditional approach of writing complex MapReduce programs by hand.
  • Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User-Defined Functions (UDFs).
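For instance, an HQL aggregation reads just like SQL but is compiled into MapReduce jobs behind the scenes (the `sales` table and its columns here are illustrative, not from the slides):

```sql
-- Hypothetical table; Hive turns this GROUP BY into map and reduce stages
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;
```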

3 of 21

Features of Hive

  • Hive is fast and scalable.

  • It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.

  • It is capable of analyzing large datasets stored in HDFS.

  • It allows different storage types such as plain text, RCFile, and HBase.
  • It uses indexing to accelerate queries.
  • It can operate on compressed data stored in the Hadoop ecosystem.
  • It supports user-defined functions (UDFs), through which users can plug in their own processing logic.
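As a sketch of the storage-type flexibility mentioned above, the STORED AS clause selects the file format when a table is created (table and column names are illustrative):

```sql
-- Plain-text storage, the simplest format
CREATE TABLE logs_text (ts STRING, msg STRING)
STORED AS TEXTFILE;

-- Columnar RCFile storage for the same schema
CREATE TABLE logs_rc (ts STRING, msg STRING)
STORED AS RCFILE;
```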

4 of 21

Limitations of Hive

  • Hive is not capable of handling real-time data.

  • It is not designed for online transaction processing.

  • Hive queries have high latency.

5 of 21

Differences between Hive and Pig

  • Hive is commonly used by data analysts; Pig is commonly used by programmers.
  • Hive uses SQL-like queries (HQL); Pig uses a data-flow language (Pig Latin).
  • Hive handles structured data; Pig can also handle semi-structured data.
  • Hive works on the server side of an HDFS cluster; Pig works on the client side.
  • Hive is slower; Pig is comparatively faster than Hive.

6 of 21

Hive Architecture

7 of 21

Working of Hive

8 of 21

Step 1. Execute Query: A Hive interface such as the Command Line or Web UI sends the query to the Driver (via a database driver such as JDBC or ODBC) for execution.

Step 2. Get Plan: The driver takes the help of the query compiler, which parses the query to check its syntax and to build the query plan.

Step 3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).

Step 4. Send Metadata: The Metastore sends the metadata as a response to the compiler.

Step 5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to this point, the parsing and compiling of the query is complete.

Step 6. Execute Plan: The driver sends the execution plan to the execution engine.

Step 7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker (on the Name node), which assigns it to a TaskTracker (on a Data node). Here, the query runs as a MapReduce job.

Step 7.1. Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.

Step 8. Fetch Result: The execution engine receives the results from the Data nodes.

Step 9. Send Results: The execution engine sends those result values to the driver.

Step 10. Send Results: The driver sends the results to the Hive interfaces.

9 of 21

HIVE Data Types

  • Hive data types are categorized as:

  • Numeric types, such as INT and DECIMAL

  • String types, such as STRING, VARCHAR (variable length), and CHAR (fixed length)

  • Date/Time types, such as DATE (YYYY-MM-DD) and TIMESTAMP

  • Complex types, such as STRUCT, MAP, and ARRAY
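The categories above can all appear in a single table definition; the following sketch uses an illustrative `employee` table (names and columns are not from the slides):

```sql
-- Illustrative table mixing primitive and complex types
CREATE TABLE employee (
  id      INT,
  name    STRING,
  salary  DECIMAL(10,2),
  hired   DATE,                          -- stored as YYYY-MM-DD
  skills  ARRAY<STRING>,                 -- e.g. ['hive', 'hadoop']
  phones  MAP<STRING, STRING>,           -- e.g. 'home' -> '555-1234'
  address STRUCT<city:STRING, zip:STRING>
);
```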

10 of 21

Hive - Create Database

  • In Hive, a database is considered a catalog or namespace of tables. So, we can maintain multiple tables within a database, where a unique name is assigned to each table. Hive also provides a default database named default.

  • Initially, we check the default database provided by Hive. To list the existing databases, use the following command:

hive> show databases;  

11 of 21

  • hive> create database cse;  

  • hive> create database if not exists cse;

  • hive> create database cse with dbproperties ('creator' = 'Srinu', 'date' = '2021-02-05');
  • hive> describe database extended cse; 
  • hive> drop database cse;  

  • hive> drop database if exists cse;  

  • hive> drop database if exists cse cascade;
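Putting the commands above together, a short session might look like this (the `cse` name and properties follow the slide; the reported location depends on your warehouse configuration):

```sql
CREATE DATABASE IF NOT EXISTS cse
WITH DBPROPERTIES ('creator' = 'Srinu', 'date' = '2021-02-05');

USE cse;                          -- make cse the current database
DESCRIBE DATABASE EXTENDED cse;   -- shows the HDFS location plus the DBPROPERTIES

DROP DATABASE IF EXISTS cse CASCADE;  -- CASCADE also drops any tables inside
```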

12 of 21

Hive - Create Table

  • Internal table/Managed Tables
  • External table

  • By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e., /user/hive/warehouse).
  • Internal Table/Managed Tables:
  • Internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive.  

  • Internal tables are not flexible enough to share their data with other tools such as Pig.
  • If we drop an internal table, Hive deletes both the table schema and the data.
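A managed table needs no special keyword; dropping it removes both the metadata and the files under the warehouse directory (the `students` table is illustrative):

```sql
CREATE TABLE students (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Deletes the schema AND the data under /user/hive/warehouse/students
DROP TABLE students;
```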

13 of 21

Hive - Create Table

  • Internal table/Managed Tables
  • External table

  • The external keyword is used to specify an external table, whereas the location keyword is used to specify where the loaded data resides.
  • External Table:
  • An external table allows us to create and access a table whose data is stored externally. 

  • As the table is external, the data is not present in the Hive warehouse directory. 
  • Therefore, if we drop the table, only the metadata of the table is deleted; the data still exists.
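The two keywords described above combine as follows (the table name and HDFS path are illustrative):

```sql
-- EXTERNAL plus LOCATION: Hive only tracks metadata for this table
CREATE EXTERNAL TABLE students_ext (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/students';   -- illustrative HDFS path outside the warehouse

-- Removes only the metadata; the files under /data/students remain
DROP TABLE students_ext;
```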

14 of 21

Hive Services

Hive provides many services  such as

  • Hive CLI (Command Line Interface): This is the default shell provided by Hive, where you can execute Hive queries and commands directly.
  • Apache Hive Web Interface: Apart from the command line interface, Hive also provides a web-based GUI for executing Hive queries and commands.
  • Hive Server: Hive Server is built on Apache Thrift and is therefore also referred to as Thrift Server; it allows different clients to submit requests to Hive and retrieve the final result.
  • Apache Hive Driver: It is responsible for receiving the queries submitted by a client through the CLI, the web UI, or the Thrift, ODBC, or JDBC interfaces. The driver then passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema present in the metastore. 

  • In the next step, an optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of MapReduce tasks and HDFS tasks.

  • Finally, the execution engine executes these tasks in the order of their dependencies, using Hadoop.

15 of 21

  • Metastore: You can think of the metastore as a central repository for storing all Hive metadata.
  • Hive metadata includes various types of information, such as the structure of tables and partitions, along with the columns, column types, and the serializer and deserializer required for read/write operations on the data present in HDFS.

The metastore comprises two fundamental units:

  • A service that provides metastore access to other Hive services.

  • Disk storage for the metadata, which is separate from HDFS storage.

16 of 21

Hive Metastore

  • The metastore stores metadata using an RDBMS and an open-source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.
  • The reason for choosing an RDBMS instead of HDFS is to achieve low latency. We can deploy the metastore in the following three configurations:

1. Embedded Metastore

2. Local Metastore

3. Remote Metastore

17 of 21

1. Embedded Metastore:

By default, both the metastore service and the Hive service run in the same JVM, using an embedded Derby database instance that stores the metadata on the local disk. This is called the embedded metastore configuration.

  • In this case, only one user can connect to the metastore database at a time. 
  • If you start a second instance of the Hive driver, you will get an error.
  • This is good for unit testing, but not for practical deployments.

18 of 21

2. Local Metastore:

This configuration allows us to have multiple Hive sessions, i.e., multiple users can use the metastore database at the same time.

This is achieved by using any JDBC-compliant database, such as MySQL, which runs in a separate JVM or on a different machine than the Hive service and metastore service (which still run together in the same JVM).

In general, the most popular choice is to use a MySQL server as the metastore database.
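A local-metastore setup is typically configured in hive-site.xml by pointing the JDO connection at the MySQL server; the host name, database name, and user below are placeholders:

```xml
<!-- Sketch of a hive-site.xml fragment for a MySQL-backed local metastore -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
```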

19 of 21

3. Remote Metastore:

In the remote metastore configuration, the metastore service runs in its own separate JVM, not in the Hive service JVM.

Other processes communicate with the metastore server using Thrift network APIs. You can run one or more metastore servers in this case to provide higher availability.

The main advantage of using a remote metastore is that you do not need to share JDBC login credentials with each Hive user to access the metastore database.
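In this setup, clients locate the standalone metastore service through a single property in hive-site.xml (the host name below is a placeholder; 9083 is the conventional metastore port):

```xml
<!-- Clients reach the remote metastore service over Thrift -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```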

20 of 21

Difference between RDBMS and Hive

  • Hive is a data warehouse software system that provides data query and analysis; an RDBMS is a database management system specifically designed for relational databases.

  • Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop; a relational database stores data in a structured format of rows and columns, known as a table.

  • Hive is used to maintain a data warehouse; an RDBMS is used to maintain a database.

  • Hive uses HQL (Hive Query Language); an RDBMS uses SQL (Structured Query Language).

  • In Hive, the schema can vary (schema-on-read); in an RDBMS, the schema is fixed (schema-on-write).

21 of 21

Difference between RDBMS and Hive

  • Hive stores both normalized and denormalized data; an RDBMS stores normalized data.

  • Tables in Hive are dense; tables in an RDBMS are sparse.

  • Hive supports automatic partitioning; an RDBMS typically does not.

  • Hive uses a sharding method for partitioning data; an RDBMS uses no partitioning method.
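As a sketch of the partitioning support mentioned above (the `sales` and `staging_sales` tables are illustrative, not from the slides):

```sql
-- Each distinct (year, month) pair becomes its own HDFS subdirectory
CREATE TABLE sales (id INT, amount DECIMAL(10,2))
PARTITIONED BY (year INT, month INT);

-- Dynamic partitioning lets Hive create partitions automatically on insert
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO sales PARTITION (year, month)
SELECT id, amount, year, month FROM staging_sales;
```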