Hadoop ecosystem
HBase & HCatalog
Apache HBase:
HCatalog:
Apache Mahout
Algorithms of Mahout are:
Apache Sqoop & Flume
Apache Flume
Query languages for Hadoop
HIVE & PIG
Hive:
Pig:
STREAM COMPUTING
Stream computing
PIG
Pig is made up of two pieces:
Apache Pig Components
There are several components in the Apache Pig framework.
Parser
Optimizer
Compiler
Execution engine
Important points
Execution Types
Pig has two execution types or modes:
Local mode:
MapReduce mode
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:
Script:
Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.
Grunt:
Embedded:
PigLatin
grouped_records = GROUP records BY year;
Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it. In Grunt, however, no error is reported immediately: the shell waits for more input until the semicolon is entered.
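To make the effect of the GROUP statement concrete, here is a minimal sketch in plain Python (not the Pig API) of what grouping a relation by one field produces: one (group key, bag of tuples) pair per distinct key. The sample `records` data and the `group_by` helper are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (year, temperature) tuples standing in
# for the `records` relation used in the Pig Latin statement above.
records = [(1950, 0), (1950, 22), (1949, 111), (1949, 78), (1950, -11)]


def group_by(relation, key_index):
    """Model of GROUP ... BY: return (group_key, bag) pairs.

    Pig does not promise any particular group order; the result is
    sorted by key here only to make the output deterministic.
    """
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    return sorted(bags.items())


grouped_records = group_by(records, 0)
for year, bag in grouped_records:
    print(year, bag)
```

Each output row pairs a year with all the original tuples for that year, which is the shape later statements would aggregate over.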
Data Model in Pig Latin
Statements in Pig Latin
Pig Latin Datatypes
Complex Types
Pig Latin Operators
Arithmetic Operators
Comparison Operators
Type Construction Operators
Data Processing Operators
Loading and Storing
LOAD: loads data from the file system (local/HDFS) into a relation.
STORE: saves a relation to the file system (local/HDFS).
Filtering
FILTER: removes unwanted rows from a relation.
DISTINCT: removes duplicate rows from a relation.
FOREACH ... GENERATE: transforms the data based on the columns of a relation.
STREAM: transforms a relation using an external program.
Diagnostic Operators
DUMP: prints the contents of a relation on the console.
DESCRIBE: describes the schema of a relation.
EXPLAIN: shows the logical and physical execution plans used to compute a relation.
ILLUSTRATE: displays the step-by-step execution of a series of statements.
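The row-removal, de-duplication, and column-transformation operators listed under Filtering above (FILTER, DISTINCT, and FOREACH ... GENERATE in Pig Latin) can be modelled in a few lines of plain Python. This is only a sketch of their semantics, not Pig code; the sample relation and the 9999 "missing value" marker are assumptions for illustration.

```python
# Hypothetical relation of (year, temperature, quality) tuples.
records = [
    (1950, 0, 1),
    (1950, 22, 1),
    (1950, 22, 1),    # exact duplicate of the previous row
    (1949, 9999, 1),  # 9999 is an assumed marker for a missing reading
    (1949, 78, 4),
]

# FILTER: keep only the rows that satisfy a condition.
filtered = [r for r in records if r[1] != 9999]

# DISTINCT: drop duplicate rows. dict.fromkeys keeps first-seen order,
# which Pig itself does not guarantee.
distinct = list(dict.fromkeys(filtered))

# FOREACH ... GENERATE: build a new relation from chosen columns.
projected = [(year, temp) for year, temp, _quality in distinct]

print(projected)
```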
Grouping and Joining
Sorting
ORDER: arranges a relation in sorted order based on one or more fields.
LIMIT: returns a limited number of tuples from a relation.
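As a rough illustration of the two sorting operators above (ORDER and LIMIT in Pig Latin), the following plain-Python sketch sorts a relation on two fields and then keeps the first few tuples. The data and the choice of sort directions are assumptions for the example.

```python
# Hypothetical relation of (year, temperature) tuples.
records = [(1950, 0), (1949, 111), (1950, 22), (1949, 78)]

# ORDER: sort by year ascending, then by temperature descending.
ordered = sorted(records, key=lambda r: (r[0], -r[1]))

# LIMIT: keep only the first two tuples of the sorted relation.
limited = ordered[:2]

print(ordered)
print(limited)
```

Applying the limit after the ordering step is what makes the result predictable; limiting an unsorted relation would return an arbitrary subset.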
Combining and Splitting
UNION: combines two or more relations into one relation.
SPLIT: splits a single relation into two or more relations.
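The combining and splitting operators above (UNION and SPLIT in Pig Latin) can be sketched in plain Python as follows. This models the semantics only: the relations `a` and `b` and the split conditions are made up for illustration.

```python
# Two hypothetical relations with the same (id, label) schema.
a = [(1, "x"), (2, "y")]
b = [(3, "z"), (1, "x")]

# UNION: concatenate the relations; duplicates are kept.
union = a + b

# SPLIT: route tuples into separate relations, one per condition.
# The conditions are evaluated independently for every tuple.
small = [t for t in union if t[0] < 2]
large = [t for t in union if t[0] >= 2]

print(union)
print(small, large)
```

Because each condition is tested on its own, a tuple may land in more than one output relation, or in none, depending on how the conditions are written.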
Hive
Metastore
Driver
Compiler
Hive Shell
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
Hive Client
1. Thrift Clients
2. JDBC client
3. ODBC client
Hive Service
cli
Hive server
hwi
jar
Metastore
We can configure the metastore in either of two modes:
Embedded
SQL vs HiveQL
HiveQL
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
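The DISTRIBUTE BY, SORT BY, and CLUSTER BY clauses in the syntax above are the least familiar to SQL users, so here is a small plain-Python model of what they do: DISTRIBUTE BY decides which reducer receives each row, SORT BY orders rows within each reducer only, and CLUSTER BY is shorthand for using the same columns for both. The function name, the sample rows, and the use of a simple modulo in place of Hive's hash partitioning are all assumptions for illustration.

```python
def distribute_and_sort(rows, dist_key, sort_key, num_reducers):
    """Model DISTRIBUTE BY + SORT BY with num_reducers reducers.

    dist_key(row) must return an int; a modulo stands in for the
    hash partitioning that Hive would apply to the column values.
    """
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[dist_key(row) % num_reducers].append(row)
    # SORT BY orders each reducer's rows independently; there is no
    # total order across reducers (that would need ORDER BY).
    return [sorted(part, key=sort_key) for part in reducers]


# Hypothetical (year, temperature) rows.
rows = [(1950, 22), (1949, 111), (1950, 0), (1949, 78)]

# DISTRIBUTE BY year SORT BY temperature
parts = distribute_and_sort(rows, lambda r: r[0], lambda r: r[1], 2)

# CLUSTER BY year: distribute and sort on the same column.
clustered = distribute_and_sort(rows, lambda r: r[0], lambda r: r[0], 2)

print(parts)
```

All rows for a given year end up on the same reducer, and each reducer's output is sorted, which is usually enough when the next step works on one group at a time.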
Creating a Database:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Data Types in Hive
Complex Data Types
Operators
Hive DDL commands
Hive DML Commands
Joins
Partition
Static Partitioning
Dynamic Partitioning
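The difference between the two partitioning styles can be sketched in plain Python. Hive stores each partition in its own subdirectory named `column=value`; with static partitioning the user names the partition when loading, while with dynamic partitioning the partition is derived from column values in each row. The table directory, column names, and helper functions below are hypothetical and only model the resulting layout.

```python
def static_partition_path(table_dir, spec):
    """Path for a partition the user names explicitly (static)."""
    return table_dir + "/" + "/".join(f"{k}={v}" for k, v in spec.items())


def dynamic_partition(rows, table_dir, partition_cols):
    """Route each row to a partition derived from its own values."""
    out = {}
    for row in rows:
        spec = {c: row[c] for c in partition_cols}
        path = static_partition_path(table_dir, spec)
        # Partition columns live in the directory name, not the data.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        out.setdefault(path, []).append(data)
    return out


# Static: the loader states the target partition up front.
static_path = static_partition_path("/warehouse/logs", {"country": "US"})

# Dynamic: partitions are discovered from the rows themselves.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
layout = dynamic_partition(rows, "/warehouse/logs", ["country"])

print(static_path)
print(layout)
```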
HBase
HBasics
Why HBase:
HBase concepts
The master-slave HBase architecture has three types of servers:
HBase HMaster
Region Server
ZooKeeper
HMaster Server
Regions
Region server
ZooKeeper
Regions
Clients
There are a number of client options for interacting with an HBase cluster.
HBase Vs RDBMS
Database Type
HBase
RDBMS
Schema-type
Schema of RDBMS is more restrictive.
Sparse Tables
HBase
RDBMS
Scale up/ Scale out
HBase
RDBMS
Amount of data
HBase
RDBMS
Support
HBase
RDBMS
Data type
HBase
RDBMS
Transaction integrity
HBase
RDBMS
JOINs
HBase
RDBMS
Referential integrity
HBase
RDBMS
Big SQL