UNIT – III
Map Reduce Technique
What is MapReduce?
MapReduce is a data processing model used to process data in parallel in a distributed form. It was introduced in 2004 in the paper "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
MapReduce is a paradigm with two phases: the map phase and the reduce phase. The mapper takes its input in the form of key-value pairs, and its output is fed to the reducer as input. The reducer runs only after the mapper has finished. The reducer also takes its input in key-value format, and the output of the reducer is the final output of the job.
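The two phases described above can be sketched in plain Python as a toy, in-memory word count. This is an illustration of the paradigm, not the Hadoop API; the function names (`mapper`, `reducer`, `run_job`) are hypothetical:

```python
from collections import defaultdict

def mapper(_key, line):
    """Map phase: turn one input record into (key, value) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Reduce phase: combine all values seen for one key."""
    return (key, sum(values))

def run_job(records):
    # Collect mapper output, grouped by key.
    groups = defaultdict(list)
    for key, record in enumerate(records):
        for k, v in mapper(key, record):
            groups[k].append(v)
    # Run the reducer once per key, only after the map phase is over.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["the cat", "the dog"]))  # {'cat': 1, 'dog': 1, 'the': 2}
```

Note that the reducer never runs until every mapper output for its key has been collected, mirroring the rule that the reduce phase starts only after the map phase is over.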
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application.
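The chunk-and-aggregate idea can be sketched on a single machine with the standard library, using threads in place of separate commodity servers. The chunking strategy and worker count here are illustrative assumptions, not how Hadoop actually partitions input:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def count_words(chunk):
    # Each worker processes its own chunk of lines independently.
    c = Counter()
    for line in chunk:
        c.update(line.split())
    return c

def mapreduce_wordcount(lines, workers=2):
    # Split the input into roughly equal chunks.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Process the chunks in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Aggregate the partial results into one consolidated output.
    total = Counter()
    for p in partials:
        total += p
    return total
```

On a real cluster the chunks live on different servers and the aggregation happens in the reduce phase; this sketch only shows the split / parallel-process / consolidate shape of the computation.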
How MapReduce Works
Anatomy of a Map Reduce Job Run
The MapReduce job run process mainly depends on five independent entities: the client that submits the job, the YARN resource manager, the YARN node managers, the MapReduce application master, and the distributed filesystem (normally HDFS).
You can run a MapReduce job with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it hasn’t been submitted already and then waits for it to finish).
Let’s understand the components –
How to Submit a Job
Failures
In the real world, user code can be buggy, processes can crash, and machines can fail. Hadoop's ability to handle such failures and still allow the job to complete successfully is one of the biggest benefits of using it. Any of the following components can fail: the task, the application master, the node manager, or the resource manager.
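The core of task-failure handling is simply re-running a failed task attempt up to a maximum number of attempts. The toy sketch below illustrates that idea; the default of 4 attempts mirrors Hadoop's `mapreduce.map.maxattempts` setting, while the function and variable names are hypothetical:

```python
def run_with_retries(task, max_attempts=4):
    """Re-run a failing task attempt, as the framework does for task failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: the job as a whole is marked failed

# Simulate a flaky task attempt that crashes twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("task attempt crashed")
    return "done"

print(run_with_retries(flaky))  # done
```

In real Hadoop the rescheduling is done by the application master, which may place the new attempt on a different node; this sketch only shows the retry-until-limit logic.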
Shuffle and Sorting
Shuffling in MapReduce is the process of transferring the intermediate output of the mappers to the reducers, so that all values belonging to the same key end up at the same reducer.
Sorting in MapReduce guarantees that keys are sorted before they are presented to the reducer, so each reducer receives its input grouped and ordered by key.
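The shuffle and sort steps between the two phases can be sketched as: sort the mapper's (key, value) pairs by key, then group the values for each key together. This is an in-memory illustration with a hypothetical helper name, not Hadoop's actual disk-based merge:

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(map_output):
    """Sort (key, value) pairs by key, then group values per key,
    roughly what happens between the map and reduce phases."""
    pairs = sorted(map_output, key=itemgetter(0))  # the "sort" step
    # The "shuffle" step: all values for one key end up together,
    # destined for the same reducer.
    return [(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))]

print(shuffle_and_sort([("b", 1), ("a", 1), ("b", 2)]))
# [('a', [1]), ('b', [1, 2])]
```

The grouped lists are exactly the `values` iterables the reducer receives, one call per distinct key.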
Map Reduce Types and Formats
The Mapper class’s KEYIN type must be consistent with the job’s configured input format (mapreduce.job.inputformat.class).
The Mapper class’s KEYOUT type must be consistent with the job’s configured map output key class (mapreduce.map.output.key.class).
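In the Java API these constraints come from the Mapper&lt;KEYIN, VALUEIN, KEYOUT, VALUEOUT&gt; type parameters. The same consistency idea can be sketched with Python type hints; the `input_format` helper below is an illustrative stand-in that behaves like Hadoop's default TextInputFormat (key = byte offset of the line, value = the line itself):

```python
from typing import Iterator, List, Tuple

# Input format yields (KEYIN=int byte offset, VALUEIN=str line),
# like Hadoop's default TextInputFormat.
def input_format(text: str) -> Iterator[Tuple[int, str]]:
    offset = 0
    for line in text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

# Mapper: (KEYIN=int, VALUEIN=str) -> (KEYOUT=str, VALUEOUT=int).
# Its input types must match what the input format produces.
def mapper(key: int, value: str) -> Iterator[Tuple[str, int]]:
    for word in value.split():
        yield (word, 1)

# The reducer's input types must equal the mapper's output types.
def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    return (key, sum(values))
```

If any link in this chain disagrees (for example, a mapper emitting `str` keys while the job is configured for `int` map output keys), a Hadoop job fails at runtime; the type parameters exist to keep the stages consistent.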
Formats
Hadoop provides a family of input and output formats. TextInputFormat is the default input format: each record is one line of the input, the key is the byte offset of the line within the file, and the value is the contents of the line.
Key Features of MapReduce
The following advanced features characterize MapReduce:
1. Highly scalable
Apache Hadoop MapReduce is a highly scalable framework because it can distribute and store large amounts of data across numerous servers. These servers are reasonably priced and can all run simultaneously.
2. Versatile
Businesses can use MapReduce programming to access new data sources. It makes it possible for companies to work with many forms of data. Enterprises can access both organized and unstructured data with this method and acquire valuable insights from the various data sources.
3. Secure
The MapReduce programming model uses the HBase and HDFS security mechanisms, and only authenticated users are permitted to view and manipulate the data. In Hadoop 2, HDFS uses replication to provide fault tolerance: depending on the replication factor, it stores a copy of each block on several different machines.
4. Affordability
With the MapReduce programming framework and Hadoop’s scalable design, big data volumes can be stored and processed very affordably. Such a system is particularly cost-effective and highly scalable, making it ideal for businesses whose stored data is constantly growing.
5. Fast-paced
MapReduce uses the Hadoop Distributed File System (HDFS), a distributed storage layer that keeps track of where data lives in the cluster. MapReduce programs typically run on the same servers that store the data, which enables quicker processing by moving the computation to the data.