Introduction to Data Science

By

S.V.V.D.Jagadeesh

Sr. Assistant Professor

Dept of Artificial Intelligence & Data Science

LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

Unit-III Outcomes

At the end of this unit, students will be able to:

  • CO3: Choose the appropriate databases for handling big data. (Understand-L2)

Wednesday, February 19, 2025

LBRCE

IDS

Introduction to Hadoop

  • Big data technologies such as Hadoop and Spark make it much easier to work with and control a cluster of computers.
  • Hadoop can scale up to thousands of computers, creating a cluster with petabytes of storage.
  • This enables businesses to extract value from the massive amount of data available.

Hadoop Framework for Large Datasets

  • Apache Hadoop is a framework that simplifies working with a cluster of computers.
  • It aims to be all of the following things and more:

■ Reliable: it automatically creates multiple copies of the data and redeploys processing logic in case of failure.

■ Fault tolerant: it detects faults and applies automatic recovery.

■ Scalable: data and its processing are distributed over clusters of computers (horizontal scaling).

■ Portable: it can be installed on all kinds of hardware and operating systems.
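The reliability point can be illustrated with a toy model of data replication. This is only a sketch: the round-robin placement, the node names, and the helper functions `place_blocks` and `readable_after_failure` are assumptions made for illustration, not Hadoop's actual placement policy. The block size of 128 MB and the replication factor of 3 are the HDFS defaults.

```python
BLOCK_SIZE_MB = 128   # HDFS default block size (Hadoop 2 and later)
REPLICATION = 3       # HDFS default replication factor


def place_blocks(file_size_mb, nodes, replication=REPLICATION):
    """Assign each block of a file to `replication` distinct nodes (round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Start each block on a different node so replicas spread over the cluster.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(file_size_mb=300, nodes=nodes)
print(placement)
print(readable_after_failure(placement, "node2"))
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block available.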

Components of Hadoop Framework

At the heart of Hadoop we find:

■ A distributed file system (HDFS)

■ A method to execute programs on a massive scale (MapReduce)

■ A system to manage the cluster resources (YARN)

MapReduce: For Achieving Parallelism

  • Hadoop uses a programming method called MapReduce to achieve parallelism.
  • A MapReduce algorithm splits up the data, processes it in parallel, and then sorts, combines, and aggregates the results back together.
  • However, MapReduce isn’t well suited to interactive analysis or iterative programs because it writes the data to disk between computational steps.
  • These disk writes are expensive when working with large data sets.
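The split, process in parallel, and aggregate pattern can be sketched on a single machine. This is a minimal illustration, not Hadoop code: threads stand in for the computers of a cluster, and the function names and the sum-of-squares task are assumptions chosen for the example. In standard CPython, threads do not speed up CPU-bound work, so the sketch shows the shape of the computation rather than a real performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split a list into at most n_chunks roughly equal parts."""
    if not data:
        return []
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """Work done independently on each chunk: a partial sum of squares."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each chunk is processed by its own worker.
        partials = list(pool.map(process_chunk, chunks))
    # Aggregate the partial results back into one answer.
    return sum(partials)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```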

MapReduce: Case Study

  • You’re the director of a toy company.
  • Every toy has two colors, and when a client orders a toy from the web page, the web page puts an order file on Hadoop with the colors of the toy.
  • Your task is to find out how many color units you need to prepare. You’ll use a MapReduce-style algorithm to count the colors.

  • As the name suggests, the process roughly boils down to two big phases:

■ Mapping phase: the documents are split up into key-value pairs. Until we reduce, we can have many duplicates.

■ Reduce phase: this is not unlike a SQL “group by.” The unique keys are grouped together, and the result depends on the reducing function.

Here we wanted a count per color, so that’s what the reduce function returns.
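The two phases can be sketched in plain Python. The order lines, the comma-separated format, and the function names `map_phase` and `reduce_phase` are assumptions made for illustration; the slides do not specify the file layout. The reduce step uses `itertools.groupby` to mirror the SQL "group by" comparison.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical order lines: each order lists the two colors of one toy.
orders = ["green,blue", "blue,orange", "green,red", "blue,orange"]


def map_phase(lines):
    """Emit one (color, 1) pair per color; duplicate keys are expected here."""
    return [(color.strip(), 1) for line in lines for color in line.split(",")]


def reduce_phase(pairs):
    """Group pairs by key, like SQL GROUP BY, and sum the values per color."""
    ordered = sorted(pairs, key=itemgetter(0))  # groupby needs sorted input
    return {
        color: sum(value for _, value in group)
        for color, group in groupby(ordered, key=itemgetter(0))
    }


pairs = map_phase(orders)
counts = reduce_phase(pairs)
print(pairs)
print(counts)  # {'blue': 3, 'green': 2, 'orange': 2, 'red': 1}
```

Swapping `sum` for another function, such as `max`, would give a different result from the same grouped pairs, which is what "depending on the reducing function" refers to.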

  • The whole process is described in the following six steps:

1 The input files are read.

2 Each line is passed to a mapper job.

3 The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has been encountered (value). More technically, it maps a key (the color) to a value (the number of occurrences).

4 The keys get shuffled and sorted to facilitate the aggregation.

5 The reduce phase sums the number of occurrences per color and outputs one file per key with the total number of occurrences for each color.

6 The keys are collected in an output file.
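The six steps can be traced end to end on a local machine. This is a simplified sketch under stated assumptions: the file names, the comma-separated order format, and the function `run_color_count` are hypothetical, and intermediate results are kept in memory instead of being written to one file per key as a real Hadoop job would do.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def run_color_count(input_dir, output_file):
    # Step 1: read the input files.
    lines = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        lines.extend(path.read_text().splitlines())

    # Steps 2-3: pass each line to a mapper that emits (color, 1) pairs.
    mapped = []
    for line in lines:
        for color in line.split(","):
            if color.strip():
                mapped.append((color.strip(), 1))

    # Step 4: shuffle and sort, so all values for one key end up together.
    shuffled = defaultdict(list)
    for color, value in mapped:
        shuffled[color].append(value)

    # Step 5: reduce by summing the occurrences per color.
    reduced = {color: sum(values) for color, values in sorted(shuffled.items())}

    # Step 6: collect the keys and their totals in one output file.
    Path(output_file).write_text(
        "".join(f"{color}\t{total}\n" for color, total in reduced.items())
    )
    return reduced


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical order files, one line per toy order.
    Path(tmp, "orders1.txt").write_text("green,blue\nblue,orange\n")
    Path(tmp, "orders2.txt").write_text("green,red\nblue,orange\n")
    result = run_color_count(tmp, Path(tmp, "totals.tsv"))
    output_text = Path(tmp, "totals.tsv").read_text()

print(output_text)
```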

Summary

  • Unit-III Outcomes
  • Introduction to Hadoop
  • Hadoop Framework for Large Datasets
  • Components of Hadoop Framework
  • MapReduce: For Achieving Parallelism
  • MapReduce: Case Study