1 of 207

Hadoop

2 of 207

  • It all started with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could index 1 billion pages.
  • They estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive.
  • They came across a paper, published in 2003, that described the architecture of Google’s distributed file system, called GFS.
  • Later, in 2004, Google published another paper that introduced MapReduce to the world.
  • These two papers led to the foundation of the framework called “Hadoop”.

3 of 207

What is Hadoop?

  • Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
  • Hadoop is written in Java and is not OLAP (online analytical processing).
  • It is used for batch/offline processing.
  • It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
  • Moreover, it can be scaled up just by adding nodes to the cluster.

19 of 207

By now, you should have an idea of why Big Data is a problem statement and how Hadoop solves it.

  • The first problem is storing the colossal amount of data: storing huge data in a traditional system is not possible.
  • The reason is obvious: storage is limited to one system, while the data is increasing at a tremendous rate.

20 of 207

  • The second problem is storing heterogeneous data: we know that storing is a problem, but the data is not only huge, it is also present in various formats, i.e. unstructured, semi-structured, and structured. So you need a system that can store the different types of data generated from various sources.

21 of 207

  • The third problem is processing speed:

  • The time taken to process this huge amount of data is quite high, as the data to be processed is too large.

22 of 207

The first problem is storing the colossal amount of data:

  • HDFS provides a distributed way to store Big Data.
  • Your data is stored in blocks on DataNodes, and you can specify the size of each block.

For example, with a block size of 128 MB, a 512 MB file is split into four 128 MB blocks.
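The block-splitting arithmetic above can be sketched in a few lines; this is an illustration only (sizes are simplified to whole megabytes, and the function name is made up, not part of HDFS):

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size
    would occupy in HDFS; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # leftover bytes go into a final partial block
    return blocks

print(split_into_blocks(512))  # [128, 128, 128, 128]
```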

23 of 207

Second problem was storing a variety of data.

  • In HDFS you can store all kinds of data whether it is structured, semi-structured or unstructured.

  • In HDFS, there is no schema validation before dumping data. HDFS also follows the write-once, read-many model. Due to this, you can write any kind of data once and read it multiple times to find insights.

24 of 207

The third challenge was processing the data faster:

  • We move the processing unit to data instead of moving data to the processing unit.

  • It means that instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where the data is stored, so that each node can process a part of the data in parallel.

  • Finally, all of the intermediary output produced by each node is merged together and the final response is sent back to the client.
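The flow above (each node processes its local data, then the intermediate outputs are merged) can be sketched as a toy word count. This is a simulation only; in real MapReduce the map tasks actually run on the DataNodes holding the blocks:

```python
from collections import defaultdict

def run_mapreduce(node_data):
    """node_data: one list of text lines per node (hypothetical layout)."""
    intermediate = []
    for lines in node_data:                 # map phase: per-node processing
        for line in lines:
            for word in line.split():
                intermediate.append((word, 1))
    counts = defaultdict(int)               # shuffle + reduce: merge output
    for word, one in intermediate:
        counts[word] += one
    return dict(counts)                     # final response sent to the client

result = run_mapreduce([["big data"], ["big cluster"]])
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```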

26 of 207

Hadoop Architecture

59 of 207

  • No more than one replica is placed on any one node.

  • No more than two replicas are placed on the same rack.

  • Also, the number of racks used for block replication should always be smaller than the number of replicas.
  • Two copies are placed on one rack and the third on a different rack; this is called the “Replica Placement Policy”.

Replica placement via rack awareness in Hadoop
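A toy model of this placement policy, with made-up rack and node names (the real NameNode logic also weighs node load and network distance):

```python
def place_replicas(racks, replication=3):
    """racks: {rack_name: [node, ...]} -> list of (rack, node) choices."""
    rack_names = list(racks)
    placement = [(rack_names[0], racks[rack_names[0]][0])]   # first replica
    for node in racks[rack_names[1]][:replication - 1]:      # rest on 2nd rack
        placement.append((rack_names[1], node))
    return placement

layout = place_replicas({"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]})
print(layout)  # [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]

# Check the three rules: one replica per node, at most two per rack,
# and fewer racks used than replicas.
racks_used = [r for r, _ in layout]
assert len({n for _, n in layout}) == len(layout)
assert max(racks_used.count(r) for r in set(racks_used)) <= 2
assert len(set(racks_used)) < len(layout)
```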

64 of 207

Advantages of Implementing Rack Awareness

  • Preventing data loss in case of rack failure

  • Minimizing the write cost and maximizing the read speed

  • Maximizing network bandwidth and minimizing latency

114 of 207

YARN

YARN comprises two major components: ResourceManager and NodeManager.

115 of 207

ResourceManager

  • It is a cluster-level component (one per cluster) and runs on the master machine
  • It manages resources and schedules applications running on top of YARN
  • It has two components: the Scheduler and the ApplicationManager
  • The Scheduler is responsible for allocating resources to the various running applications
  • The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application
  • It keeps track of the heartbeats from the NodeManagers

116 of 207

NodeManager

  • It is a node-level component (one on each node) and runs on each slave machine
  • It is responsible for managing containers and monitoring resource utilization in each container
  • It also keeps track of node health and log management
  • It continuously communicates with ResourceManager to remain up-to-date
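The heartbeat relationship between the two components can be sketched as a toy liveness monitor. The class and node names are made up; the 600-second default mirrors YARN's 10-minute NodeManager liveness expiry interval:

```python
class ResourceManagerSketch:
    """Toy view of heartbeat tracking: a NodeManager is considered dead
    if no heartbeat arrives within the expiry interval."""

    def __init__(self, expiry_seconds=600):
        self.expiry = expiry_seconds
        self.last_heartbeat = {}

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now  # NodeManager checks in

    def live_nodes(self, now):
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.expiry]

rm = ResourceManagerSketch()
rm.heartbeat("nm1", now=0)
rm.heartbeat("nm2", now=0)
rm.heartbeat("nm1", now=500)
print(rm.live_nodes(now=700))  # ['nm1'] (nm2 has not reported in time)
```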

117 of 207

Below is a list of a few Java data types along with their Hadoop equivalents:

  • Integer → IntWritable: the Hadoop variant of Integer, used to pass integer numbers as key or value.
  • Float → FloatWritable: the Hadoop variant of Float, used to pass floating-point numbers as key or value.
  • Long → LongWritable: the Hadoop variant of Long, used to store long values.
  • Short → ShortWritable: the Hadoop variant of Short, used to store short values.
  • Double → DoubleWritable: the Hadoop variant of Double, used to store double values.
  • String → Text: the Hadoop variant of String, used to pass string characters as key or value.
  • Byte → ByteWritable: the Hadoop variant of byte, used to store a sequence of bytes.
  • null → NullWritable: the Hadoop variant of null, used to pass null as a key or value. NullWritable is usually used as the data type for the reducer's output key when the output key is not important in the final result.
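As an aside on what these Writable wrappers do on the wire: IntWritable serializes its value as four big-endian bytes (it delegates to Java's DataOutput.writeInt). A rough illustration in Python terms; the helper name is made up, not Hadoop code:

```python
import struct

def int_writable_bytes(value):
    """Mimic IntWritable.write(): a 4-byte big-endian signed int."""
    return struct.pack(">i", value)

print(int_writable_bytes(1).hex())   # 00000001
print(int_writable_bytes(-1).hex())  # ffffffff
```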

118 of 207

Hadoop

  • How did Hadoop get its name?
  • Who developed Hadoop?
  • The story of the yellow elephant.

119 of 207

Hadoop Ecosystem

135 of 207

Conclusion

  • The Hadoop Ecosystem owes its success to the whole developer community; many big companies like Facebook, Google, Yahoo, the University of California (Berkeley), etc. have contributed their part to increasing Hadoop’s capabilities.
  • Inside the Hadoop Ecosystem, knowledge of one or two tools (Hadoop components) will not help in building a solution. You need to learn a set of Hadoop components that work together to build a solution.
  • Based on the use cases, we can choose a set of services from the Hadoop Ecosystem and create a tailored solution for an organization.

136 of 207

Top Big Data Technologies

  • Top big data technologies are divided into 4 fields:
  • Data Storage
  • Data Mining
  • Data Analytics
  • Data Visualization

138 of 207

Big Data Technologies in Data Storage

  • Hadoop: The Hadoop framework was designed to store and process data in a distributed environment on commodity hardware with a simple programming model. It can store and analyze data present on different machines at high speed and low cost.
          • Developed by: Apache Software Foundation in the year 2011
          • Written in: Java
          • Current stable version: Hadoop 3.3.0

          • Companies Using Hadoop:

139 of 207

Big Data Technologies in Data Storage.

  • MongoDB: NoSQL document databases like MongoDB offer a direct alternative to the rigid schema used in relational databases. This allows MongoDB to offer flexibility while handling a wide variety of datatypes at large volumes and across distributed architectures.
          • Developed by: MongoDB in the year 2009
          • Written in: C++, Go, JavaScript, Python
          • Current stable version: MongoDB 4.0.10

          • Companies Using MongoDB:

140 of 207

Big Data Technologies in Data Storage.

  • RainStor: RainStor is a software company that developed a database management system of the same name, designed to manage and analyze Big Data for large enterprises. It uses deduplication techniques to organize the process of storing large amounts of data for reference.
          • Developed by: RainStor Software company in the year 2004.
          • Works like: SQL
          • Current stable version: RainStor 5.5

          • Companies Using RainStor:

141 of 207

Big Data Technologies in Data Storage.

  • Hunk: Hunk lets you access data in remote Hadoop clusters through virtual indexes and lets you use the Splunk Search Processing Language to analyze your data. With Hunk, you can report on and visualize large amounts of data from your Hadoop and NoSQL data sources.

          • Developed by: Splunk INC in the year 2013.
          • Written in: JAVA
          • Current stable version: Splunk Hunk 6.2

142 of 207

Big Data Technologies used in Data Mining.

  • Presto: Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto allows querying data in Hive, Cassandra, relational databases, and proprietary data stores.
          • Developed by: Facebook in the year 2013.
          • Written in: JAVA
          • Current stable version: Presto 0.22

          • Companies Using Presto:

143 of 207

Big Data Technologies used in Data Mining.

  • RapidMiner: RapidMiner is a centralized solution featuring a very powerful and robust graphical user interface that enables users to create, deliver, and maintain predictive analytics. It allows creating very advanced workflows and offers scripting support in several languages.
          • Developed by: RapidMiner in the year 2001
          • Written in: JAVA
          • Current stable version: RapidMiner 9.2

          • Companies Using RapidMiner:

144 of 207

Big Data Technologies used in Data Mining.

  • Elasticsearch: Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable, full-text search engine with an HTTP web interface and schema-free JSON documents.

          • Developed by: Elastic NV in the year 2012.
          • Written in: JAVA
          • Current stable version: ElasticSearch 7.1

          • Companies Using Elasticsearch:

145 of 207

Big Data Technologies used in Data Analytics.

  • Kafka: Apache Kafka is a distributed streaming platform. A streaming platform has three key capabilities, which are as follows:
          • Publisher
          • Subscriber
          • Consumer
          • This is similar to a Message Queue or an Enterprise Messaging System.

  • Developed by: Apache Software Foundation in the year 2011
  • Written in: Scala, JAVA
  • Current stable version: Apache Kafka 2.2.0
  • Companies Using Kafka:

146 of 207

Big Data Technologies used in Data Analytics.

  • Splunk: Splunk captures, indexes, and correlates real-time data in a searchable repository, from which it can generate graphs, reports, alerts, dashboards, and data visualizations. It is also used for application management, security and compliance, as well as business and web analytics.

          • Developed by: Splunk Inc. on May 6th, 2014
          • Written in: AJAX, C++, Python, XML
          • Current stable version: Splunk 7.3

          • Companies Using Splunk:

147 of 207

Big Data Technologies used in Data Analytics.

  • KNIME: KNIME allows users to visually create Data Flows, Selectively execute some or All Analysis steps, and Inspect the Results, Models, and Interactive views. KNIME is written in Java and based on Eclipse and makes use of its Extension mechanism to add Plugins providing Additional Functionality.

          • Developed by: KNIME in the year 2008
          • Written in: JAVA
          • Current stable version: KNIME 3.7.2

          • Companies Using KNIME:

148 of 207

Big Data Technologies used in Data Analytics.

  • Spark: Spark provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development.

          • Developed by: Apache Software Foundation
          • Written in: Java, Scala, Python, R
          • Current stable version: Apache Spark 2.4.3

          • Companies Using Spark:

149 of 207

Big Data Technologies used in Data Analytics.

  • R Language: R is a programming language and free software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software, and primarily for data analysis.

          • Developed by: R Foundation; R 1.0 was released on February 29th, 2000
          • Written in: C, Fortran, R
          • Current stable version: R-3.6.0

          • Companies Using R-Language:

150 of 207

Big Data Technologies used in Data Analytics.

  • Blockchain: Blockchain is used in essential functions such as payment, escrow, and title; it can also reduce fraud, increase financial privacy, speed up transactions, and internationalize markets.
  • BlockChain can be used for achieving the following in a Business Network Environment:

          • Shared Ledger: Here we can append the Distributed System of records across a Business network.
          • Smart Contract: Business terms are embedded in the transaction Database and Executed with transactions.
          • Privacy: Ensuring appropriate Visibility, Transactions are Secure, Authenticated and Verifiable
          • Consensus: All parties in a Business network agree to network verified transactions.

  • Developed by: Bitcoin
  • Written in: JavaScript, C++, Python
  • Current stable version: Blockchain 4.0
  • Companies Using Blockchain:

151 of 207

Big Data Technologies in Data Visualization

  • Tableau: Tableau is a powerful and fast-growing data visualization tool used in the business intelligence industry. Data analysis is very fast with Tableau, and the visualizations created are in the form of dashboards and worksheets.

          • Developed by: Tableau on May 17th, 2013
          • Written in: Java, C++, Python, C
          • Current stable version: Tableau 8.2

          • Companies Using Tableau:

152 of 207

Big Data Technologies in Data Visualization

  • Plotly: Plotly is mainly used to make creating graphs faster and more efficient. It offers API libraries for Python, R, MATLAB, Node.js, Julia, and Arduino, as well as a REST API. Plotly can also be used to style interactive graphs in Jupyter notebooks.

          • Developed by: Plotly in the year 2012
          • Written in: JavaScript
          • Current stable version: Plotly 1.47.4

          • Companies Using Plotly:

153 of 207

Emerging Big Data Technologies

  • TensorFlow: TensorFlow has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in machine learning and lets developers easily build and deploy machine-learning-powered applications.

          • Developed by: Google Brain Team in the year 2015
          • Written in: Python, C++, CUDA
          • Current stable version: TensorFlow 2.0 beta

          • Companies Using TensorFlow:

154 of 207

Emerging Big Data Technologies

  • Beam: Apache Beam provides a portable API layer for building sophisticated parallel data-processing pipelines that may be executed across a variety of execution engines, or runners.

          • Developed by: Apache Software Foundation in the year 2016 June 15th
          • Written in: JAVA, Python
          • Current stable version: Apache Beam 0.1.0 incubating.

          • Companies Using Beam:

155 of 207

Emerging Big Data Technologies

  • Docker: Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all the parts it needs, such as libraries and other dependencies, and ship it all out as one package.

          • Developed by: Docker Inc. on March 13th, 2013
          • Written in: Go
          • Current stable version: Docker 18.09

          • Companies Using Docker:

156 of 207

Emerging Big Data Technologies

  • Airflow: Apache Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines. Airflow uses workflows made of directed acyclic graphs (DAGs) of tasks. Defining workflows in code provides easier maintenance, testing, and versioning.

          • Developed by: Apache Software Foundation on May 15th 2019
          • Written in: Python
          • Current stable version: Apache AirFlow 1.10.3

          • Companies Using AirFlow:

157 of 207

Emerging Big Data Technologies

  • Kubernetes: Kubernetes is a vendor-agnostic cluster and container management tool, open-sourced by Google in 2014. It provides a platform for automation, deployment, scaling, and operation of application containers across clusters of hosts.

          • Developed by: Cloud Native Computing Foundation in the year 2015 21st of July
          • Written in: Go
          • Current stable version: Kubernetes 1.14

          • Companies Using Kubernetes:

158 of 207

Parallel copying with "distcp"

  • The distributed copy command is distcp.
  • distcp is a general utility for copying large data sets between distributed filesystems within and across clusters.
  • You can also use distcp to copy data to and from the cloud.
  • The distcp command submits a regular MapReduce job that performs a file-by-file copy.

$ hadoop distcp

159 of 207

  • DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. 
  •  It uses MapReduce to effect its distribution, error handling and recovery, and reporting. 
  •  It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. 
  • The most common invocation of DistCp is an inter-cluster copy:

  • $ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/f

  • Note that DistCp expects absolute paths.
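The "expand the file list, partition it, and copy file by file in parallel map tasks" behavior described above can be sketched with local files. This is a toy model only (real distcp runs as a MapReduce job against HDFS; the function name and file names here are made up):

```python
import os, shutil, tempfile
from concurrent.futures import ThreadPoolExecutor

def distcp_sketch(files, src, dst, workers=2):
    """Toy distcp: partition the source file list and let each 'map task'
    copy its partition file by file."""
    partitions = [files[i::workers] for i in range(workers)]
    def copy_partition(part):
        for name in part:
            shutil.copy(os.path.join(src, name), os.path.join(dst, name))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_partition, partitions))  # copies run in parallel

src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.txt"):
    with open(os.path.join(src, name), "w") as f:
        f.write(name)
distcp_sketch(["a.txt", "b.txt", "c.txt"], src, dst)
print(sorted(os.listdir(dst)))  # ['a.txt', 'b.txt', 'c.txt']
```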

160 of 207

HDFS Balancers

  • HDFS data might not always be distributed uniformly across DataNodes.
  • One common reason is addition of new DataNodes to an existing cluster.
  • HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes.
  • The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode differs from the utilization of the cluster by no more than a given threshold percentage.
  • The balancer does not balance between individual volumes on a single DataNode.
  • The Balancer utility is generally run and managed by the Hadoop Administrator.

161 of 207

HDFS Balancers

  • In HDFS, new blocks are allocated evenly among all the DataNodes. But in a large cluster, nodes have different capacities, and you will often need to add new nodes or remove old ones. How, then, does Hadoop balance the data usage across all DataNodes?
  • The answer is the HDFS Balancer, which rebalances data among the cluster's DataNodes in unbalanced situations, such as after new nodes are added or old ones are removed.
  • The HDFS Balancer does not run in the background; it has to be run manually:

$ hdfs balancer [-threshold <threshold>]    (threshold: percentage of disk capacity)

The threshold parameter is a number between 0 and 100. Starting from the average cluster utilization, the balancer process will try to converge every DataNode's usage into the range [average – threshold, average + threshold].

162 of 207

HDFS Balancers

  • Default threshold is 10%
  • For example, if the cluster's average utilization is 50%, then higher-usage DataNodes will start moving data to lower-usage nodes:

– Upper bound (average + threshold): 60%
– Lower bound (average – threshold): 40%

Cluster balancing algorithm: The HDFS Balancer runs in iterations. Each iteration contains the following four steps: 

  • Storage group classification.
  • Storage group pairing.
  • Block move scheduling.
  • Block move execution.
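The threshold rule above can be sketched numerically; DataNode names and utilization percentages are made up for illustration, and this only classifies nodes (the real balancer then schedules block moves between them):

```python
def classify_datanodes(utilization, threshold=10.0):
    """Split DataNodes into over-/under-utilized sets relative to the
    cluster average +/- threshold (in percentage points)."""
    average = sum(utilization.values()) / len(utilization)
    over = [n for n, u in utilization.items() if u > average + threshold]
    under = [n for n, u in utilization.items() if u < average - threshold]
    return average, over, under

avg, over, under = classify_datanodes({"dn1": 80, "dn2": 50, "dn3": 20})
print(avg, over, under)  # 50.0 ['dn1'] ['dn3']
```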

205 of 207

  • Configuration: This class provides access to configuration parameters on a client or server machine.

This class is present in org.apache.hadoop.conf package.

  • FileSystem: FileSystem is an abstract base class for a generic file system. It may be implemented as a distributed or a local file system.

The local implementation is LocalFileSystem and the distributed implementation is DistributedFileSystem.

  • static FileSystem get(URI uri, Configuration conf) 

                          — Returns the FileSystem for this URI.

  • public FSDataOutputStream create(Path f)

206 of 207

  • FSDataInputStream: The FSDataInputStream class is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream. It is a utility that wraps an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream.

  • public FSDataInputStream open(Path f)

  • FSDataOutputStream: The FSDataOutputStream class is the counterpart of FSDataInputStream, used to open a stream for output. It is a utility that wraps an OutputStream in a DataOutputStream, buffers output through a BufferedOutputStream, and creates a checksum file.

  • public void write(byte[] b, int off, int len) throws IOException 

  • IOUtils: IOUtils is a utility class (a handy tool) for I/O-related functionality on HDFS.

It is present in org.apache.hadoop.io package.

  • IOUtils.copyBytes(InputStream in, OutputStream out, int buffSize, boolean close) 

  • IOUtils.closeStream(Closeable stream) 

207 of 207

  • URI stands for Uniform Resource Identifier. A Uniform Resource Identifier is a sequence of characters used to identify a particular resource. It enables interaction with a representation of the resource over the network using specific protocols.

  • java.net.URI: This class provides methods for creating URI instances from their components or by parsing their string form, and for accessing and retrieving the different components of a URI instance.

  • create(): creates a new URI object.