Apache Hive
What is Hive?
Features of Hive
Limitations of Hive
Differences between Hive and Pig
Hive | Pig |
Hive is commonly used by Data Analysts. | Pig is commonly used by programmers. |
It uses a SQL-like query language (HiveQL). | It uses a data-flow language (Pig Latin). |
It handles structured data. | It handles structured and semi-structured data. |
It works on the server side of the HDFS cluster. | It works on the client side of the HDFS cluster. |
Hive is slower than Pig. | Pig is comparatively faster than Hive. |
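To make the "SQL-like" row above concrete, here is a minimal HiveQL sketch. The table page_views and its columns are hypothetical and used only for illustration.

```sql
-- Hypothetical table: page_views(user_id STRING, url STRING, view_date STRING)
-- Count views per URL for one day, written declaratively as in SQL.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

In Pig, the same result would instead be written in Pig Latin as a sequence of data-flow steps (load, filter, group, aggregate).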
Hive Architecture
Working of Hive
Step No. | Operation |
1 | Execute Query: A Hive interface, such as the command line or web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) for execution. |
2 | Get Plan: The driver asks the query compiler to parse the query, check its syntax, and work out the query plan, that is, what the query requires. |
3 | Get Metadata: The compiler sends a metadata request to the Metastore (any database). |
4 | Send Metadata: The Metastore sends the metadata back to the compiler as a response. |
5 | Send Plan: The compiler checks the requirements and sends the plan back to the driver. At this point, parsing and compilation of the query are complete. |
6 | Execute Plan: The driver sends the execution plan to the execution engine. |
7 | Execute Job: Internally, the plan is executed as a MapReduce job. The execution engine sends the job to the JobTracker, which runs on the Name node, and the JobTracker assigns it to a TaskTracker, which runs on a Data node. There, the query runs as a MapReduce job. |
7.1 | Metadata Ops: During execution, the execution engine can also perform metadata operations with the Metastore. |
8 | Fetch Result: The execution engine receives the results from the Data nodes. |
9 | Send Results: The execution engine sends the results to the driver. |
10 | Send Results: The driver sends the results to the Hive interfaces. |
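The plan produced in the compile steps above (steps 2 to 5) can be inspected with Hive's EXPLAIN statement. This is a small sketch; the table page_views is hypothetical, and the exact output depends on the Hive version and the configured execution engine.

```sql
-- Hypothetical table: page_views(user_id STRING, url STRING)
-- EXPLAIN prints the plan (stages and operators) without running the query.
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Running the same SELECT without EXPLAIN takes the query through the remaining steps (execute plan, execute job, fetch and send results).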
Hive Data Types
Hive - Create Database
To list the existing databases, run the following command:
hive> show databases;
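For completeness, a short sketch of creating a database and then checking that it is listed. The name demo_db is hypothetical.

```sql
-- demo_db is a hypothetical database name.
CREATE DATABASE IF NOT EXISTS demo_db
COMMENT 'Scratch database for examples';

SHOW DATABASES;   -- demo_db should now appear in the list

USE demo_db;      -- make it the current database for later statements
```

IF NOT EXISTS keeps the statement from failing when the database has already been created.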
Hive - Create Table
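A minimal sketch of a table definition that uses several Hive data types. The table name, columns, and delimiters are assumptions chosen for illustration, not taken from the slides.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE IF NOT EXISTS employee (
  id        INT,
  name      STRING,
  salary    DOUBLE,
  hired_on  DATE,
  skills    ARRAY<STRING>
)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Show the column names and types recorded in the metastore.
DESCRIBE employee;
```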
Hive Services
Hive provides many services such as
The metastore comprises two fundamental units:
Hive Metastore
1. Embedded Metastore
2. Local Metastore
3. Remote Metastore
1. Embedded Metastore:
By default, the metastore service and the Hive service run in the same JVM, using an embedded Derby database instance that stores the metadata on the local disk. This is called the embedded metastore configuration.
2. Local Metastore:
This configuration allows multiple Hive sessions, i.e. multiple users can use the metastore database at the same time.
This is achieved by using any JDBC-compliant database, such as MySQL, that runs in a separate JVM or on a different machine from the Hive service and the metastore service, which still run together in the same JVM.
In general, the most popular choice is a MySQL server as the metastore database.
3. Remote Metastore:
In the remote metastore configuration, the metastore service runs in its own separate JVM and not in the Hive service JVM.
Other processes communicate with the metastore server using Thrift Network APIs. You can run one or more metastore servers in this case to provide higher availability.
The main advantage of using a remote metastore is that you do not need to share JDBC login credentials with each Hive user to access the metastore database.
Difference between RDBMS and Hive
Hive | RDBMS |
Hive is data warehouse software that provides data query and analysis. | An RDBMS is a database management system designed specifically for relational databases. |
Hive provides a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop. | A relational database stores data in a structured format using rows and columns; that structure is known as a table. |
It is used to maintain a data warehouse. | It is used to maintain a database. |
It uses HQL (Hive Query Language). | It uses SQL (Structured Query Language). |
The schema can vary. | The schema is fixed. |
Both normalized and denormalized data are stored. | Normalized data is stored. |
Tables in Hive are dense. | Tables in an RDBMS are sparse. |
Hive supports automatic partitioning. | Partitioning is not inherent to an RDBMS; support varies by product. |
Hive uses a sharding method for partitioning. | No partitioning method is used by default. |
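One way to read the "automatic partitioning" point is Hive's dynamic partitioning, where partitions are created from the values found in the data. The sketch below assumes two hypothetical tables, sales and sales_raw; the two SET properties are standard Hive settings, but their defaults differ between versions.

```sql
-- Hypothetical tables: sales_raw is an unpartitioned staging table with
-- columns (order_id INT, amount DOUBLE, country STRING).
CREATE TABLE IF NOT EXISTS sales (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING)
STORED AS ORC;

-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT INTO TABLE sales PARTITION (country)
SELECT order_id, amount, country
FROM sales_raw;

-- List the partitions Hive created, one per distinct country value.
SHOW PARTITIONS sales;
```

Each partition is stored as its own directory under the table location, so queries that filter on country only need to read the matching directories.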