1 of 207

Hadoop

2 of 207

  • It all started with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could index 1 billion pages.
  • They estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive.
  • They came across a paper, published in 2003, that described the architecture of Google’s distributed file system, called GFS.
  • Later, in 2004, Google published another paper that introduced MapReduce to the world.
  • These two papers led to the foundation of the framework called “Hadoop”.

3 of 207

What is Hadoop?

  • Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
  • Hadoop is written in Java and is not OLAP (online analytical processing).
  • It is used for batch/offline processing.
  • It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
  • Moreover, it can be scaled up just by adding nodes to the cluster.

19 of 207

By now, you should have an idea of why Big Data is a problem statement and how Hadoop solves it.

  • The first problem is storing the colossal amount of data: storing huge data in a traditional system is not possible.
  • The reason is obvious: storage is limited to one system, while the data is increasing at a tremendous rate.

20 of 207

  • The second problem is storing heterogeneous data: we know that storing is a problem, but the data is not only huge, it is also present in various formats, i.e. unstructured, semi-structured, and structured. So you need a system that can store the different types of data generated from various sources.

21 of 207

  • The third problem is processing speed:

  • The time taken to process this huge amount of data is quite high, as the data to be processed is too large.

22 of 207

The first problem is storing the colossal amount of data:

  • HDFS provides a distributed way to store Big Data.
  • Your data is stored in blocks on DataNodes, and you can specify the size of each block.

For example, with a block size of 128 MB, a 512 MB file is split into four 128 MB blocks.
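The block-splitting arithmetic above can be sketched in a few lines; this is an illustration only (sizes are simplified to whole megabytes, and the function name is made up, not part of HDFS):

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size
    would occupy in HDFS; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # leftover bytes go into a final partial block
    return blocks

print(split_into_blocks(512))  # [128, 128, 128, 128]
```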

23 of 207

Second problem was storing a variety of data.

  • In HDFS you can store all kinds of data whether it is structured, semi-structured or unstructured.

  • In HDFS, there is no schema validation before dumping data. HDFS also follows the write-once, read-many model. Due to this, you can write any kind of data once and read it multiple times to find insights.

24 of 207

The third challenge was processing the data faster:

  • We move the processing unit to data instead of moving data to the processing unit.

  • It means that instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where the data is stored, so that each node can process a part of the data in parallel.

  • Finally, all of the intermediary output produced by each node is merged together and the final response is sent back to the client.
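The flow above (each node processes its local data, then the intermediate outputs are merged) can be sketched as a toy word count. This is a simulation only; in real MapReduce the map tasks actually run on the DataNodes holding the blocks:

```python
from collections import defaultdict

def run_mapreduce(node_data):
    """node_data: one list of text lines per node (hypothetical layout)."""
    intermediate = []
    for lines in node_data:                 # map phase: per-node processing
        for line in lines:
            for word in line.split():
                intermediate.append((word, 1))
    counts = defaultdict(int)               # shuffle + reduce: merge output
    for word, one in intermediate:
        counts[word] += one
    return dict(counts)                     # final response sent to the client

result = run_mapreduce([["big data"], ["big cluster"]])
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```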

26 of 207

Hadoop Architecture

59 of 207

  • No more than one replica is placed on any one node.

  • No more than two replicas are placed on the same rack.

  • Also, the number of racks used for block replication should always be smaller than the number of replicas.
  • Two copies are placed on one rack and the third on a different rack; this is called the “Replica Placement Policy”.

Replica placement via rack awareness in Hadoop
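A toy model of this placement policy, with made-up rack and node names (the real NameNode logic also weighs node load and network distance):

```python
def place_replicas(racks, replication=3):
    """racks: {rack_name: [node, ...]} -> list of (rack, node) choices."""
    rack_names = list(racks)
    placement = [(rack_names[0], racks[rack_names[0]][0])]   # first replica
    for node in racks[rack_names[1]][:replication - 1]:      # rest on 2nd rack
        placement.append((rack_names[1], node))
    return placement

layout = place_replicas({"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]})
print(layout)  # [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]

# Check the three rules: one replica per node, at most two per rack,
# and fewer racks used than replicas.
racks_used = [r for r, _ in layout]
assert len({n for _, n in layout}) == len(layout)
assert max(racks_used.count(r) for r in set(racks_used)) <= 2
assert len(set(racks_used)) < len(layout)
```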

64 of 207

Advantages of Implementing Rack Awareness

  • Preventing data loss in case of rack failure

  • Minimizing the write cost and maximizing the read speed

  • Maximizing network bandwidth and minimizing latency

114 of 207

YARN

YARN comprises two major components: ResourceManager and NodeManager.

115 of 207

ResourceManager

  • It is a cluster-level component (one per cluster) and runs on the master machine
  • It manages resources and schedules applications running on top of YARN
  • It has two components: the Scheduler and the ApplicationManager
  • The Scheduler is responsible for allocating resources to the various running applications
  • The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application
  • It keeps track of the heartbeats from the NodeManagers

116 of 207

NodeManager

  • It is a node-level component (one on each node) and runs on each slave machine
  • It is responsible for managing containers and monitoring resource utilization in each container
  • It also keeps track of node health and log management
  • It continuously communicates with ResourceManager to remain up-to-date
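The heartbeat relationship between the two components can be sketched as a toy liveness monitor. The class and node names are made up; the 600-second default mirrors YARN's 10-minute NodeManager liveness expiry interval:

```python
class ResourceManagerSketch:
    """Toy view of heartbeat tracking: a NodeManager is considered dead
    if no heartbeat arrives within the expiry interval."""

    def __init__(self, expiry_seconds=600):
        self.expiry = expiry_seconds
        self.last_heartbeat = {}

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now  # NodeManager checks in

    def live_nodes(self, now):
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.expiry]

rm = ResourceManagerSketch()
rm.heartbeat("nm1", now=0)
rm.heartbeat("nm2", now=0)
rm.heartbeat("nm1", now=500)
print(rm.live_nodes(now=700))  # ['nm1'] (nm2 has not reported in time)
```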

117 of 207

Below is a list of a few Java data types along with their Hadoop equivalents:

  • Integer → IntWritable: the Hadoop variant of Integer, used to pass integer numbers as key or value.
  • Float → FloatWritable: the Hadoop variant of Float, used to pass floating-point numbers as key or value.
  • Long → LongWritable: the Hadoop variant of Long, used to store long values.
  • Short → ShortWritable: the Hadoop variant of Short, used to store short values.
  • Double → DoubleWritable: the Hadoop variant of Double, used to store double values.
  • String → Text: the Hadoop variant of String, used to pass string characters as key or value.
  • Byte → ByteWritable: the Hadoop variant of byte, used to store a sequence of bytes.
  • null → NullWritable: the Hadoop variant of null, used to pass null as a key or value. NullWritable is usually used as the data type for the reducer's output key when the output key is not important in the final result.
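As an aside on what these Writable wrappers do on the wire: IntWritable serializes its value as four big-endian bytes (it delegates to Java's DataOutput.writeInt). A rough illustration in Python terms; the helper name is made up, not Hadoop code:

```python
import struct

def int_writable_bytes(value):
    """Mimic IntWritable.write(): a 4-byte big-endian signed int."""
    return struct.pack(">i", value)

print(int_writable_bytes(1).hex())   # 00000001
print(int_writable_bytes(-1).hex())  # ffffffff
```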

118 of 207

Hadoop

  • How did Hadoop get its name?
  • Who developed Hadoop?
  • The story of the yellow elephant.

119 of 207

Hadoop Ecosystem

135 of 207

Conclusion

  • The Hadoop Ecosystem owes its success to the whole developer community; many big companies like Facebook, Google, Yahoo, the University of California (Berkeley), etc. have contributed their part to increasing Hadoop’s capabilities.
  • Inside the Hadoop Ecosystem, knowledge of one or two tools (Hadoop components) will not help in building a solution. You need to learn a set of Hadoop components that work together to build a solution.
  • Based on the use cases, we can choose a set of services from the Hadoop Ecosystem and create a tailored solution for an organization.

136 of 207

Top Big Data Technologies

  • Top big data technologies are divided into 4 fields:
  • Data Storage
  • Data Mining
  • Data Analytics
  • Data Visualization

138 of 207

Big Data Technologies in Data Storage

  • Hadoop: The Hadoop framework was designed to store and process data in a distributed environment on commodity hardware with a simple programming model. It can store and analyze data present on different machines at high speed and low cost.
          • Developed by: Apache Software Foundation in the year 2011
          • Written in: Java
          • Current stable version: Hadoop 3.3.0

          • Companies Using Hadoop:

139 of 207

Big Data Technologies in Data Storage.

  • MongoDB: NoSQL document databases like MongoDB offer a direct alternative to the rigid schema used in relational databases. This allows MongoDB to offer flexibility while handling a wide variety of datatypes at large volumes and across distributed architectures.
          • Developed by: MongoDB in the year 2009
          • Written in: C++, Go, JavaScript, Python
          • Current stable version: MongoDB 4.0.10

          • Companies Using MongoDB:

140 of 207

Big Data Technologies in Data Storage.

  • RainStor: RainStor is a software company that developed a database management system of the same name, designed to manage and analyze Big Data for large enterprises. It uses deduplication techniques to organize the process of storing large amounts of data for reference.
          • Developed by: RainStor Software company in the year 2004.
          • Works like: SQL
          • Current stable version: RainStor 5.5

          • Companies Using RainStor:

141 of 207

Big Data Technologies in Data Storage.

  • Hunk: Hunk lets you access data in remote Hadoop clusters through virtual indexes and lets you use the Splunk Search Processing Language to analyze your data. With Hunk, you can report on and visualize large amounts of data from your Hadoop and NoSQL data sources.

          • Developed by: Splunk INC in the year 2013.
          • Written in: JAVA
          • Current stable version: Splunk Hunk 6.2

142 of 207

Big Data Technologies used in Data Mining.

  • Presto: Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto allows querying data in Hive, Cassandra, relational databases, and proprietary data stores.
          • Developed by: Facebook in the year 2013.
          • Written in: JAVA
          • Current stable version: Presto 0.22

          • Companies Using Presto:

143 of 207

Big Data Technologies used in Data Mining.

  • RapidMiner: RapidMiner is a centralized solution featuring a very powerful and robust graphical user interface that enables users to create, deliver, and maintain predictive analytics. It allows creating very advanced workflows and offers scripting support in several languages.
          • Developed by: RapidMiner in the year 2001
          • Written in: JAVA
          • Current stable version: RapidMiner 9.2

          • Companies Using RapidMiner:

144 of 207

Big Data Technologies used in Data Mining.

  • Elasticsearch: Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable, full-text search engine with an HTTP web interface and schema-free JSON documents.

          • Developed by: Elastic NV in the year 2012.
          • Written in: JAVA
          • Current stable version: ElasticSearch 7.1

          • Companies Using Elasticsearch:

145 of 207

Big Data Technologies used in Data Analytics.

  • Kafka: Apache Kafka is a distributed streaming platform. A streaming platform has three key capabilities, which are as follows:
          • Publisher
          • Subscriber
          • Consumer
          • This is similar to a Message Queue or an Enterprise Messaging System.

  • Developed by: Apache Software Foundation in the year 2011
  • Written in: Scala, JAVA
  • Current stable version: Apache Kafka 2.2.0
  • Companies Using Kafka:

146 of 207

Big Data Technologies used in Data Analytics.

  • Splunk: Splunk captures, indexes, and correlates real-time data in a searchable repository, from which it can generate graphs, reports, alerts, dashboards, and data visualizations. It is also used for application management, security and compliance, as well as business and web analytics.

          • Developed by: Splunk Inc. on May 6th, 2014
          • Written in: AJAX, C++, Python, XML
          • Current stable version: Splunk 7.3

          • Companies Using Splunk:

147 of 207

Big Data Technologies used in Data Analytics.

  • KNIME: KNIME allows users to visually create Data Flows, Selectively execute some or All Analysis steps, and Inspect the Results, Models, and Interactive views. KNIME is written in Java and based on Eclipse and makes use of its Extension mechanism to add Plugins providing Additional Functionality.

          • Developed by: KNIME in the year 2008
          • Written in: JAVA
          • Current stable version: KNIME 3.7.2

          • Companies Using KNIME:

148 of 207

Big Data Technologies used in Data Analytics.

  • Spark: Spark provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

          • Developed by: Apache Software Foundation
          • Written in: Java, Scala, Python, R
          • Current stable version: Apache Spark 2.4.3

          • Companies Using Spark:

149 of 207

Big Data Technologies used in Data Analytics.

  • R Language: R is a programming language and free software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software, and primarily for data analysis.

          • Developed by: R Foundation; R 1.0 was released on February 29th, 2000
          • Written in: C, Fortran, R
          • Current stable version: R-3.6.0

          • Companies Using R-Language:

150 of 207

Big Data Technologies used in Data Analytics.

  • Blockchain: Blockchain is used in essential functions such as payment, escrow, and title; it can also reduce fraud, increase financial privacy, speed up transactions, and internationalize markets.
  • BlockChain can be used for achieving the following in a Business Network Environment:

          • Shared Ledger: Here we can append the Distributed System of records across a Business network.
          • Smart Contract: Business terms are embedded in the transaction Database and Executed with transactions.
          • Privacy: Ensuring appropriate Visibility, Transactions are Secure, Authenticated and Verifiable
          • Consensus: All parties in a Business network agree to network verified transactions.

  • Developed by: Bitcoin
  • Written in: JavaScript, C++, Python
  • Current stable version: Blockchain 4.0
  • Companies Using Blockchain:

151 of 207

Big Data Technologies in Data Visualization

  • Tableau: Tableau is a powerful and fast-growing data visualization tool used in the business intelligence industry. Data analysis is very fast with Tableau, and the visualizations created are in the form of dashboards and worksheets.

          • Developed by: Tableau on May 17th, 2013
          • Written in: Java, C++, Python, C
          • Current stable version: Tableau 8.2

          • Companies Using Tableau:

152 of 207

Big Data Technologies in Data Visualization

  • Plotly: Plotly is mainly used to make creating graphs faster and more efficient. It offers API libraries for Python, R, MATLAB, Node.js, Julia, and Arduino, as well as a REST API. Plotly can also be used to style interactive graphs in Jupyter notebooks.

          • Developed by: Plotly in the year 2012
          • Written in: JavaScript
          • Current stable version: Plotly 1.47.4

          • Companies Using Plotly:

153 of 207

Emerging Big Data Technologies

  • TensorFlow: TensorFlow has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in machine learning and lets developers easily build and deploy machine-learning-powered applications.

          • Developed by: Google Brain Team in the year 2015
          • Written in: Python, C++, CUDA
          • Current stable version: TensorFlow 2.0 beta

          • Companies Using TensorFlow:

154 of 207

Emerging Big Data Technologies

  • Beam: Apache Beam provides a portable API layer for building sophisticated parallel data-processing pipelines that may be executed across a variety of execution engines, or runners.

          • Developed by: Apache Software Foundation in the year 2016 June 15th
          • Written in: JAVA, Python
          • Current stable version: Apache Beam 0.1.0 incubating.

          • Companies Using Beam:

155 of 207

Emerging Big Data Technologies

  • Docker: Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all the parts it needs, such as libraries and other dependencies, and ship it all out as one package.

          • Developed by: Docker Inc. on March 13th, 2013
          • Written in: Go
          • Current stable version: Docker 18.09

          • Companies Using Docker:

156 of 207

Emerging Big Data Technologies

  • Airflow: Apache Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines. Airflow uses workflows made of directed acyclic graphs (DAGs) of tasks. Defining workflows in code provides easier maintenance, testing, and versioning.

          • Developed by: Apache Software Foundation on May 15th 2019
          • Written in: Python
          • Current stable version: Apache AirFlow 1.10.3

          • Companies Using AirFlow:

157 of 207

Emerging Big Data Technologies

  • Kubernetes: Kubernetes is a vendor-agnostic cluster and container management tool, open-sourced by Google in 2014. It provides a platform for automation, deployment, scaling, and operation of application containers across clusters of hosts.

          • Developed by: Cloud Native Computing Foundation in the year 2015 21st of July
          • Written in: Go
          • Current stable version: Kubernetes 1.14

          • Companies Using Kubernetes:

158 of 207

Parallel copying with "distcp"

  • The distributed copy command is distcp.
  • distcp is a general utility for copying large data sets between distributed filesystems within and across clusters.
  • You can also use distcp to copy data to and from the cloud.
  • The distcp command submits a regular MapReduce job that performs a file-by-file copy.

$ hadoop distcp

159 of 207

  • DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. 
  •  It uses MapReduce to effect its distribution, error handling and recovery, and reporting. 
  •  It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. 
  • The most common invocation of DistCp is an inter-cluster copy:

  • $ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/f

  • Note that DistCp expects absolute paths.
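The "expand the file list, partition it, and copy file by file in parallel map tasks" behavior described above can be sketched with local files. This is a toy model only (real distcp runs as a MapReduce job against HDFS; the function name and file names here are made up):

```python
import os, shutil, tempfile
from concurrent.futures import ThreadPoolExecutor

def distcp_sketch(files, src, dst, workers=2):
    """Toy distcp: partition the source file list and let each 'map task'
    copy its partition file by file."""
    partitions = [files[i::workers] for i in range(workers)]
    def copy_partition(part):
        for name in part:
            shutil.copy(os.path.join(src, name), os.path.join(dst, name))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_partition, partitions))  # copies run in parallel

src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.txt"):
    with open(os.path.join(src, name), "w") as f:
        f.write(name)
distcp_sketch(["a.txt", "b.txt", "c.txt"], src, dst)
print(sorted(os.listdir(dst)))  # ['a.txt', 'b.txt', 'c.txt']
```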

160 of 207

HDFS Balancers

  • HDFS data might not always be distributed uniformly across DataNodes.
  • One common reason is addition of new DataNodes to an existing cluster.
  • HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes.
  • The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode differs from the utilization of the cluster by no more than a given threshold percentage.
  • The balancer does not balance between individual volumes on a single DataNode.
  • The Balancer utility is generally run and managed by the Hadoop Administrator.

161 of 207

HDFS Balancers

  • In HDFS, new blocks are allocated evenly among all the DataNodes. But in a large cluster, nodes have different capacities, and you will often need to add new nodes or remove old ones. How, then, does Hadoop balance the data usage across all DataNodes?
  • The answer is the HDFS Balancer, which rebalances data among the cluster's DataNodes in unbalanced situations, such as after new nodes are added or old ones are removed.
  • The HDFS Balancer does not run in the background; it has to be run manually:

$ hdfs balancer [-threshold <threshold>]    (threshold: percentage of disk capacity)

The threshold parameter is a number between 0 and 100. Starting from the average cluster utilization, the balancer process will try to converge every DataNode's usage into the range [average – threshold, average + threshold].

162 of 207

HDFS Balancers

  • Default threshold is 10%
  • For example, if the cluster's average utilization is 50%, then higher-usage DataNodes will start moving data to lower-usage nodes:

– Upper bound (average + threshold): 60%
– Lower bound (average – threshold): 40%

Cluster balancing algorithm: The HDFS Balancer runs in iterations. Each iteration contains the following four steps: 

  • Storage group classification.
  • Storage group pairing.
  • Block move scheduling.
  • Block move execution.
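The threshold rule above can be sketched numerically; DataNode names and utilization percentages are made up for illustration, and this only classifies nodes (the real balancer then schedules block moves between them):

```python
def classify_datanodes(utilization, threshold=10.0):
    """Split DataNodes into over-/under-utilized sets relative to the
    cluster average +/- threshold (in percentage points)."""
    average = sum(utilization.values()) / len(utilization)
    over = [n for n, u in utilization.items() if u > average + threshold]
    under = [n for n, u in utilization.items() if u < average - threshold]
    return average, over, under

avg, over, under = classify_datanodes({"dn1": 80, "dn2": 50, "dn3": 20})
print(avg, over, under)  # 50.0 ['dn1'] ['dn3']
```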

205 of 207

  • Configuration: This class provides access to configuration parameters on a client or server machine.

This class is present in org.apache.hadoop.conf package.

  • FileSystem: FileSystem is an abstract base class for a generic file system. It may be implemented as a distributed or a local file system.

The local implementation is LocalFileSystem and the distributed implementation is DistributedFileSystem.

  • static FileSystem get(URI uri, Configuration conf) 

                          — Returns the FileSystem for this URI.

  • public FSDataOutputStream create(Path f)

206 of 207

  • FSDataInputStream: The FSDataInputStream class is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream. It is a utility that wraps an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream.

  • public FSDataInputStream open(Path f)

  • FSDataOutputStream: The FSDataOutputStream class is the counterpart of FSDataInputStream, used to open a stream for output. It is a utility that wraps an OutputStream in a DataOutputStream, buffers output through a BufferedOutputStream, and creates a checksum file.

  • public void write(byte[] b, int off, int len) throws IOException 

  • IOUtils: IOUtils is a utility class (a handy tool) for I/O-related functionality on HDFS.

It is present in org.apache.hadoop.io package.

  • IOUtils.copyBytes(InputStream in, OutputStream out, int buffSize, boolean close) 

  • IOUtils.closeStream(Closeable stream) 

207 of 207

  • URI stands for Uniform Resource Identifier. A Uniform Resource Identifier is a sequence of characters used to identify a particular resource. It enables interaction with a representation of the resource over the network using specific protocols.

  • java.net.URI: This class provides methods for creating URI instances from their components or by parsing their string form, and for accessing and retrieving the different components of a URI instance.

  • create(): creates a new URI object.