Introduction to Data Science
By
S.V.V.D.Jagadeesh
Sr. Assistant Professor
Dept of Artificial Intelligence & Data Science
LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING
Unit-III Outcomes
At the end of this unit, the student will be able to:
S.V.V.D.Jagadeesh
Wednesday, February 19, 2025
LBRCE
IDS
Introduction to Hadoop
Hadoop Framework for Large Datasets
■ Reliable: it automatically creates multiple copies of the data and redeploys processing logic in case of failure.
■ Fault tolerant: it detects faults and applies automatic recovery.
■ Scalable: data and its processing are distributed over clusters of computers (horizontal scaling).
■ Portable: it can be installed on all kinds of hardware and operating systems.
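The reliability and fault-tolerance points above can be illustrated with a toy sketch. This is not Hadoop code: the node names, the replication factor of 3, and the round-robin placement are all assumptions made for illustration, and the placement policy of a real cluster is more involved. The sketch only shows why keeping several copies lets data survive node failures.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for start, block in enumerate(block_ids):
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a live node."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items() if set(holders) - failed]


# Hypothetical four-node cluster holding three blocks of one file.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["b0", "b1", "b2"], nodes)

# With two nodes down, every block is still readable from a surviving copy.
still_ok = readable_blocks(placement, ["node1", "node2"])
```

Adding more nodes to `nodes` spreads the blocks over more machines, which is the horizontal scaling idea in miniature.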
Components of Hadoop Framework
At the heart of Hadoop we find:
■ A distributed file system (HDFS)
■ A method to execute programs on a massive scale (MapReduce)
■ A system to manage the cluster resources (YARN)
MapReduce: For Achieving Parallelism
MapReduce: Case Study
■ Mapping phase: the documents are split up into key-value pairs. Until we reduce, we can have many duplicate keys.
■ Reduce phase: this is similar to a SQL “group by.” The pairs are grouped by unique key, and the reducing function determines what result is computed for each group.
Here we wanted a count per color, so that is what the reduce function returns.
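The two phases above can be sketched in plain Python. This is a minimal single-machine illustration rather than Hadoop code; the function names `map_colors` and `reduce_pairs` and the sample documents are made up for the example. It shows the duplicates that exist after mapping and how the choice of reducing function changes the result.

```python
from collections import defaultdict


def map_colors(document):
    """Map: emit one (color, 1) pair per word; duplicates are expected."""
    return [(color, 1) for color in document.split()]


def reduce_pairs(pairs, reducer=sum):
    """Reduce: group values by key, then apply the reducing function."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


documents = ["green red blue", "blue green green"]  # made-up input
pairs = [pair for doc in documents for pair in map_colors(doc)]

counts = reduce_pairs(pairs)              # count per color, like GROUP BY + SUM
seen = reduce_pairs(pairs, reducer=max)   # a different reducer: 1 if the color occurs
```

Swapping `sum` for `max` leaves the grouping untouched and changes only what is computed per group, which is the sense in which the reduce phase resembles a "group by".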
1 The input files are read.
2 Each line is passed to a mapper job.
3 The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has been encountered (value). More technically, it maps a key (the color) to a value (the number of occurrences).
4 The keys are shuffled and sorted to facilitate the aggregation.
5 The reduce phase sums the number of occurrences per color and outputs one file per key with the total number of occurrences for each color.
6 The keys are collected in an output file.
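The six steps can be traced end to end in a small single-process sketch. It makes several simplifying assumptions: the input "files" are in-memory strings with invented names and contents, colors are comma-separated, the mapper emits a `(color, 1)` pair per occurrence instead of writing intermediate files, and the final output is a dictionary rather than a file. A real job would distribute the mappers and reducers across the cluster.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input files, held in memory to keep the sketch self-contained.
input_files = {
    "file1.txt": "green,blue,blue,orange\ngreen,red,blue,orange",
    "file2.txt": "green,red,blue,orange",
}


def mapper(line):
    """Step 3: parse the colors out of one line and emit (color, 1) pairs."""
    return [(color.strip(), 1) for color in line.split(",") if color.strip()]


def run_job(files):
    # Steps 1-2: read every file and pass each line to the mapper.
    mapped = []
    for text in files.values():
        for line in text.splitlines():
            mapped.extend(mapper(line))
    # Step 4: shuffle and sort so that equal keys sit next to each other.
    mapped.sort(key=itemgetter(0))
    # Step 5: reduce by summing the occurrences of each color.
    reduced = [
        (color, sum(value for _, value in group))
        for color, group in groupby(mapped, key=itemgetter(0))
    ]
    # Step 6: collect all keys into a single output.
    return dict(reduced)


result = run_job(input_files)
```

The sort in step 4 is what lets `groupby` see each color as one contiguous run, which mirrors why the framework shuffles and sorts before reducing.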
Summary