
UNIT – III

Map Reduce Technique


What is MapReduce?

MapReduce is a data processing framework used to process data in parallel in a distributed form. It was introduced in 2004, on the basis of the Google paper titled "MapReduce: Simplified Data Processing on Large Clusters."

MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. The mapper receives its input in the form of key-value pairs, and the output of the mapper is fed to the reducer as input. The reducer runs only after the mapper has finished. The reducer likewise takes its input in key-value format, and the output of the reducer is the final output.
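
To make the two phases concrete, here is a minimal word-count sketch using the standard Hadoop Java API (the class names WordCountMapper and WordCountReducer are illustrative, not from any particular source):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper phase: input arrives as (byte offset, line of text) pairs;
    // we emit one (word, 1) pair per word.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);  // intermediate key-value pair
            }
        }
    }

    // Reducer phase: runs after all mappers finish; receives each word
    // together with all of its counts and emits the total.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // final output pair
        }
    }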


MapReduce

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from the multiple servers and returns a consolidated output to the application.


How Does MapReduce Work?

  • The MapReduce algorithm contains two important tasks, namely Map and Reduce.
  • The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
  • The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
  • The reduce task is always performed after the map job.


  • Input Phase − Here a Record Reader translates each record in the input file and sends the parsed data to the mapper in the form of key-value pairs.
  • Map − Map is a user-defined function that takes a series of key-value pairs and processes each of them to generate zero or more key-value pairs.
  • Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
  • Combiner − A combiner is a type of local reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not part of the main MapReduce algorithm; it is optional (see the driver excerpt after this list).
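
The combiner is enabled explicitly in the driver. A sketch assuming the hypothetical word-count classes above, where the reducer doubles as the combiner (valid because integer addition is associative and commutative):

    import org.apache.hadoop.mapreduce.Job;

    public class CombinerSetup {
        static void configure(Job job) {
            job.setMapperClass(WordCountMapper.class);
            // Optional: run the reducer locally over each mapper's output
            // to shrink the data shuffled across the network.
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
        }
    }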


  • Shuffle and Sort − The reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups equivalent keys together so that their values can be iterated easily in the reducer task.
  • Reducer − The reducer takes the grouped key-value data as input and runs a reducer function on each group. Here the data can be aggregated, filtered, and combined in a number of ways, which can require a wide range of processing. Once execution is over, it emits zero or more key-value pairs to the final step.
  • Output Phase − In the output phase, an output formatter translates the final key-value pairs from the reducer function and writes them to a file using a record writer.


Anatomy of a Map Reduce Job Run

A MapReduce job run mainly consists of the following stages:

  • JOB SUBMISSION
  • JOB INITIALIZATION
  • TASK ASSIGNMENT
  • TASK EXECUTION
  • PROGRESS AND STATUS UPDATES
  • JOB COMPLETION
  • FAILURES


Anatomy of a Map Reduce Job Run

You can run a MapReduce job with a single method call: submit() on a Job object. (You can also call waitForCompletion(), which submits the job if it hasn't been submitted already and then waits for it to finish.)
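
A typical driver, sketched with the hypothetical word-count classes from earlier, looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion() submits the job if it has not been
            // submitted already, then polls until it finishes;
            // submit() alone would return immediately.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }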


Let’s understand the components –

  • Client: Submits the MapReduce job.
  • YARN node manager: Monitors and launches the compute containers on machines in the cluster.
  • YARN resource manager: Handles the allocation of compute resources across the cluster.
  • MapReduce application master: Coordinates the tasks running the MapReduce job.
  • Distributed filesystem (HDFS): Shares job files between the other entities.


How Is a Job Submitted?

  • The submit() method creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's progress once per second.
  • The resource manager is asked for a new application ID, which is used as the MapReduce job ID.
  • The output specification of the job is checked. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
  • The input splits for the job are computed. If the splits cannot be computed, the job is not submitted and an error is thrown to the MapReduce program.


  • The resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, are copied to the shared filesystem in a directory named after the job ID.
  • The job JAR is copied with a high replication factor, controlled by the mapreduce.client.submit.file.replication property, so that there are plenty of copies across the cluster for the node managers to access (see the sketch after this list).
  • Finally, the job is submitted to the resource manager by calling submitApplication().
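
For illustration, a sketch of setting this property on the job configuration (the class and method names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmissionConfigExample {
        static Job newJob() throws Exception {
            Configuration conf = new Configuration();
            // Replication factor for job resources (JAR, config, splits)
            // copied to the shared filesystem at submission time; a high
            // value (the default is 10) lets node managers fetch the JAR
            // from a nearby replica.
            conf.setInt("mapreduce.client.submit.file.replication", 10);
            return Job.getInstance(conf, "submission config example");
        }
    }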


Failures

In the real world, user code can be buggy, processes can crash, and machines can fail. Hadoop's ability to handle such failures and still allow the job to complete successfully is one of its biggest benefits. Any of the following components can fail:

  • Application master
  • Node manager
  • Resource manager
  • Task


Shuffle and Sorting

  • In this lesson, we will look at MapReduce shuffling and sorting in detail: first what shuffling is, then sorting, and finally secondary sorting.
  • Shuffling is the process of transferring the mapper's intermediate output to the reducers; each reducer receives one or more keys and their associated values. The intermediate key-value pairs generated by the mapper are sorted automatically by key, and in the sort phase the map outputs are merged and sorted.


Shuffling in MapReduce

  • Shuffling is the process of moving data from the mappers to the reducers. It is also the phase in which the system performs the sort and then moves the map output to the reducers as input. The shuffle phase is necessary for the reducers; without it, they would have no input. Shuffling can begin even before the map phase has finished, which saves time and lets the job complete sooner. During the shuffle, a partitioner decides which reducer each intermediate key goes to, as in the sketch below.
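
A minimal sketch of a custom partitioner (FirstLetterPartitioner is an illustrative name; the default is HashPartitioner, which assigns key.hashCode() modulo the number of reducers):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each intermediate key to a reducer by its first letter, so
    // every reducer receives a contiguous alphabetical slice of the keys.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
            return first % numPartitions;  // char is non-negative in Java
        }
    }

    // Enabled in the driver with:
    //   job.setPartitionerClass(FirstLetterPartitioner.class);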


Sorting in MapReduce

  • The MapReduce framework automatically sorts the keys generated by the mapper. Therefore, before the reducer starts, all intermediate key-value pairs are sorted by key, not by value. The values transferred to each reducer are not sorted and can arrive in any order. The sort order of the keys themselves can be overridden, as in the sketch below.
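
A minimal sketch of a custom sort comparator, assuming Text keys (DescendingTextComparator is an illustrative name):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Reverses the natural (ascending) ordering of Text keys, so the
    // reducer sees keys in descending order.
    public class DescendingTextComparator extends WritableComparator {
        public DescendingTextComparator() {
            super(Text.class, true);  // true: instantiate keys for comparing
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);  // negate to reverse the order
        }
    }

    // Enabled in the driver with:
    //   job.setSortComparatorClass(DescendingTextComparator.class);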


Map Reduce Types and Formats

  • MapReduce is the processing unit of Hadoop, with which the data stored in Hadoop can be processed.
  • A MapReduce task works on <Key, Value> pairs.
  • Two main features of MapReduce are its parallel programming model and its large-scale distributed model.
  • MapReduce allows for the distributed processing of the map and reduce operations.
    • Map procedure (transform): Performs a filtering and sorting operation.
    • Reduce procedure (aggregate): Performs a summary operation.
  • MapReduce Workflow:


The Mapper class's KEYIN and VALUEIN types must be consistent with the job's input format, and its KEYOUT and VALUEOUT types must be consistent with the map output key and value classes configured on the job, as in the sketch below.
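
For example (a sketch; TypedMapper is an illustrative name):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // KEYIN/VALUEIN match what TextInputFormat produces; KEYOUT/VALUEOUT
    // match the map output classes declared on the job below.
    public class TypedMapper
            extends Mapper<LongWritable, Text,   // KEYIN, VALUEIN
                           Text, IntWritable> {  // KEYOUT, VALUEOUT
        // map() implementation omitted for brevity.
    }

    // Matching driver configuration (excerpt):
    //   job.setInputFormatClass(TextInputFormat.class);  // emits (LongWritable, Text)
    //   job.setMapOutputKeyClass(Text.class);            // matches KEYOUT
    //   job.setMapOutputValueClass(IntWritable.class);   // matches VALUEOUT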


Formats

  • MapReduce formats are broadly classified into two types:
    • Input formats (see the excerpt after this list)
      • Text input format
      • Binary input format
      • Multiple input formats
      • DB input format
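
A driver excerpt showing how input formats are selected (the paths and placeholder mapper classes are hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatExamples {
        // Hypothetical placeholder mappers for the two input sources.
        static class TextMapper extends Mapper<LongWritable, Text, Text, Text> {}
        static class SeqMapper extends Mapper<Text, Text, Text, Text> {}

        static void configure(Job job) {
            // Text input: each line becomes a (byte offset, line) pair.
            job.setInputFormatClass(TextInputFormat.class);

            // Multiple inputs: different paths, each with its own format
            // and mapper, combined in a single job. SequenceFileInputFormat
            // reads binary sequence files; DBInputFormat (in
            // org.apache.hadoop.mapreduce.lib.db) reads from a database.
            MultipleInputs.addInputPath(job, new Path("/in/text"),
                    TextInputFormat.class, TextMapper.class);
            MultipleInputs.addInputPath(job, new Path("/in/binary"),
                    SequenceFileInputFormat.class, SeqMapper.class);
        }
    }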


    • Output formats (see the excerpt after this list)
      • Text output format
      • Binary output format
      • Multiple output formats
      • Lazy output format
      • DB output format
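
Similarly for output formats (a sketch; the configure method name is illustrative):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class OutputFormatExamples {
        static void configure(Job job) {
            // Text output (the default): one "key<TAB>value" line per record.
            job.setOutputFormatClass(TextOutputFormat.class);

            // Binary output would use SequenceFileOutputFormat instead;
            // DBOutputFormat (org.apache.hadoop.mapreduce.lib.db) writes to
            // a database, and MultipleOutputs writes to several named outputs.

            // Lazy output: wraps a format so a part file is created only
            // when the first record is actually written (no empty files).
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        }
    }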


Key Features of MapReduce

The following advanced features characterize MapReduce:

  • Highly scalable
  • Versatile
  • Secure
  • Affordability
  • Fast-paced


1. Highly scalable

Apache Hadoop MapReduce is a framework with excellent scalability, thanks to its capacity for distributing and storing large amounts of data across numerous servers. These servers can all run simultaneously and are all reasonably priced.

2. Versatile

Businesses can use MapReduce programming to access new data sources. It makes it possible for companies to work with many forms of data. Enterprises can access both structured and unstructured data with this method and acquire valuable insights from the various data sources.

3. Secure

The MapReduce programming model uses the HBase and HDFS security approaches, and only authenticated users are permitted to view and manipulate the data. In Hadoop 2, HDFS uses replication to provide fault tolerance: depending on the replication factor, it makes a clone of each block on different machines.

4. Affordability

With the help of the MapReduce programming framework and Hadoop’s scalable design, big data volumes may be stored and processed very affordably. Such a system is particularly cost-effective and highly scalable, making it ideal for business models that must store data that is constantly expanding to meet the demands of the present.

5. Fast-paced

MapReduce uses the Hadoop Distributed File System (HDFS), a distributed storage system that maps where data resides in a cluster. Data processing technologies such as MapReduce programming are typically placed on the same servers as the data, which enables quicker data processing.