
Introduction to Data Science

S.V.V.D.Jagadeesh

Sr. Assistant Professor

Dept of Artificial Intelligence & Data Science

LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING


At the end of this unit, students will be able to:

CO3: Choose the appropriate databases for handling big data. (Understand-L2)

S.V.V.D.Jagadeesh

Wednesday, February 19, 2025

Unit-III Outcomes

LBRCE

IDS


New big data technologies such as Hadoop and Spark make it much easier to work with and control a cluster of computers.
Hadoop can scale up to thousands of computers, creating a cluster with petabytes of storage.
This enables businesses to grasp the value of the massive amount of data available.


Introduction to Hadoop


Apache Hadoop is a framework that simplifies working with a cluster of computers.
It aims to be all of the following things and more:

■ Reliable: it automatically creates multiple copies of the data and redeploys processing logic when a failure occurs.

■ Fault tolerant: it detects faults and applies automatic recovery.

■ Scalable: data and its processing are distributed over clusters of computers (horizontal scaling).

■ Portable: it can be installed on all kinds of hardware and operating systems.
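
The reliability and fault-tolerance points can be illustrated with a toy model. The sketch below is not how HDFS actually places or repairs blocks; it assumes a replication factor of 3, random placement, and hypothetical node and block names, and only shows the idea that every block lives on several machines and is copied again when one of them fails.

```python
import random

def place_replicas(block_ids, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement policy)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}

def recover(placement, failed_node, nodes, replication=3):
    """Forget the failed node and copy each affected block to a healthy node."""
    healthy = [node for node in nodes if node != failed_node]
    for holders in placement.values():
        holders.discard(failed_node)
        # Nodes that are alive and do not yet hold this block.
        candidates = [node for node in healthy if node not in holders]
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return placement

# Hypothetical cluster of five machines storing three blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_a", "blk_b", "blk_c"], nodes)

# Simulate the loss of one machine and the automatic re-replication.
placement = recover(placement, "node2", nodes)
for block, holders in placement.items():
    print(block, sorted(holders))
```

After the simulated failure, every block is again held by three healthy nodes, so no data is lost as long as at least one copy survives.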


Hadoop Framework for Large Datasets


At the heart of Hadoop we find:

■ A distributed file system (HDFS)

■ A method to execute programs on a massive scale (MapReduce)

■ A system to manage the cluster resources (YARN)


Components of Hadoop Framework


Hadoop uses a programming method called MapReduce to achieve parallelism.
A MapReduce algorithm splits up the data, processes it in parallel, and then sorts, combines, and aggregates the results.
However, MapReduce isn’t well suited to interactive analysis or iterative programs because it writes the data to disk between each computational step.
This disk I/O is expensive when working with large data sets.
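
The split, parallel-process, then sort-and-aggregate pattern can be sketched on a single machine. This is a minimal illustration and not Hadoop's API: the function names are hypothetical, a thread pool stands in for the cluster's worker nodes, and everything stays in memory, whereas Hadoop would write intermediate results to disk between steps.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split(records, n_splits):
    """Cut the input into n_splits chunks, one per worker."""
    return [records[i::n_splits] for i in range(n_splits)]

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def map_reduce(records, n_splits=2):
    chunks = split(records, n_splits)
    # Process the chunks concurrently; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    # Combine: bring all values for the same key together.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    # Sort by key and aggregate each group into a single total.
    return {key: sum(values) for key, values in sorted(grouped.items())}

lines = ["big data needs big clusters", "clusters store data"]
totals = map_reduce(lines)
print(totals)
```

Because each chunk is mapped independently, adding more workers (or, on a real cluster, more machines) lets the map step handle more data without changing the program's logic.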


MapReduce: For Achieving Parallelism


You’re the director of a toy company.
Every toy has two colors. When a client orders a toy from the web page, the web page puts an order file on Hadoop containing the colors of that toy.
Your task is to find out how many units of each color you need to prepare. You’ll use a MapReduce-style algorithm to count the colors.


MapReduce: Case Study


As the name suggests, the process roughly boils down to two big phases:

■ Map phase: the documents are split up into key-value pairs. Until we reduce, we can have many duplicates.

■ Reduce phase: it’s not unlike a SQL “group by.” The different unique occurrences are grouped together, and depending on the reducing function, a different result can be created.

Here we wanted a count per color, so that’s what the reduce function returns.
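
The two phases can be sketched in memory to show how the choice of reducing function changes the result. The order lines below are hypothetical, as are the function names; the second reducer (flagging colors ordered at least twice) is only an illustrative alternative to the count.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(orders):
    """Split each order into (color, 1) pairs; duplicate keys are expected here."""
    return [(color.strip(), 1) for order in orders for color in order.split(",")]

def reduce_phase(pairs, reducer):
    """Group the pairs by key (like SQL GROUP BY), then reduce each group's values."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: reducer([value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    }

# Hypothetical order lines, two colors per toy.
orders = ["green, blue", "blue, orange", "green, red", "blue, orange"]
pairs = map_phase(orders)

# Same grouped data, two different reducing functions.
counts = reduce_phase(pairs, sum)
popular = reduce_phase(pairs, lambda values: sum(values) >= 2)
print(counts)
print(popular)
```

The map output still contains repeated keys such as `("blue", 1)`; only the reduce phase collapses them into one entry per color.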


The whole process is described in the following six steps:

1 Reading the input files.

2 Passing each line to a mapper job.

3 The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has been encountered (value). More technically, it maps a key (the color) to a value (the number of occurrences).


4 The keys get shuffled and sorted to facilitate the aggregation.

5 The reduce phase sums the number of occurrences per color and outputs one file per key with the total number of occurrences for each color.

6 The keys are collected in an output file.
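
The six steps can be traced end to end on local files. This is a simplified sketch under stated assumptions: the file names, the comma-separated order format, and the tab-separated output are hypothetical, and the intermediate key-value pairs are kept in memory rather than written out as one file per color as a Hadoop job would do.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def run_color_count(input_dir, output_file):
    # Step 1: read the input files.
    lines = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        lines.extend(path.read_text().splitlines())
    # Steps 2-3: pass each line to the mapper, which emits (color, 1) pairs.
    mapped = [
        (color.strip(), 1)
        for line in lines
        for color in line.split(",")
        if color.strip()
    ]
    # Step 4: shuffle and sort so that identical keys end up together.
    shuffled = defaultdict(list)
    for color, one in sorted(mapped):
        shuffled[color].append(one)
    # Step 5: reduce by summing the occurrences per color.
    reduced = {color: sum(ones) for color, ones in shuffled.items()}
    # Step 6: collect the keys and their totals in one output file.
    rows = "".join(f"{color}\t{total}\n" for color, total in reduced.items())
    Path(output_file).write_text(rows)
    return reduced

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical order files, one line per ordered toy.
    Path(workdir, "order1.txt").write_text("green, blue\nblue, orange\n")
    Path(workdir, "order2.txt").write_text("green, red\nblue, orange\n")
    result = run_color_count(workdir, Path(workdir, "colors.tsv"))
    report = Path(workdir, "colors.tsv").read_text()

print(report)
```

The returned dictionary and the output file hold the same totals, which is the number of units of each color to prepare.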


Unit-III Outcomes
Introduction to Hadoop
Hadoop Framework for Large Datasets
Components of Hadoop Framework
MapReduce: For Achieving Parallelism
MapReduce: Case Study


Summary
