BIG DATA ANALYTICS
22SCS21
MODULE 1:
Dr. Madhuri J
1
8/5/2024
Maximilien Brice, © CERN
The Earthscope
Data
BIG DATA
Big Data: Definition
Big Data: A definition
Source: Harness the Power of Big Data: The IBM Big Data Platform
Type of Data
Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
Challenges
How to transfer Big Data?
Characteristics of Big Data: 1 - Scale (Volume)
Exponential increase in collected/generated data
•A typical PC might have had 10 gigabytes of storage in 2000.
•Today, Facebook ingests 500 terabytes of new data every day.
•A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
•Smartphones and the data they create and consume, together with sensors embedded into everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
Characteristics of Big Data: 2 - Complexity (Variety)
To extract knowledge, all these types of data need to be linked together.
Characteristics of Big Data: 3 - Speed (Velocity)
Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, and streaming.
Big Data
What is Big Data
Why HADOOP?
Hadoop
Hadoop!
What we’ve got in Hadoop
Hadoop
HDFS:
MapReduce
RDBMS compared to MapReduce
| | Traditional RDBMS | MapReduce |
| Data size | Gigabytes | Petabytes |
| Access | Interactive and batch | Batch |
| Updates | Read and write many times | Write once, read many times |
| Transactions | ACID | None |
| Structure | Schema-on-write | Schema-on-read |
| Integrity | High | Low |
| Scaling | Nonlinear | Linear |
Grid Computing
Volunteer Computing
Hadoop Ecosystem
MapReduce
Weather Dataset
Example: Weather Dataset
Brute Force approach – Bash:
(each year’s logs are compressed to a single yearXXXX.gz file)
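The brute-force idea can be sketched as a serial scan over the yearly files. This is a minimal sketch in Python rather than Bash, and it rests on assumptions that are not on the slide: the fixed-width offsets (signed temperature in tenths of a degree at columns 87-91, quality code at column 92), the 9999 missing-value sentinel, and the accepted quality codes follow the commonly described NCDC record layout. The function names are hypothetical.

```python
import gzip
from pathlib import Path


def max_temp_in_file(path):
    """Scan one gzip-compressed year file and return its highest valid reading."""
    best = None
    with gzip.open(path, "rt", encoding="ascii") as fh:
        for line in fh:
            if len(line) < 93:
                continue  # too short to hold the assumed fields
            temp = int(line[87:92])   # assumed offset: signed temperature
            quality = line[92]        # assumed offset: quality code
            if temp != 9999 and quality in "01459":
                best = temp if best is None else max(best, temp)
    return best


def max_temp_per_year(directory):
    """Process the year files one after another on a single machine."""
    results = {}
    for path in sorted(Path(directory).glob("*.gz")):
        # Key each result by the file name without its ".gz" suffix.
        results[path.name[:-3]] = max_temp_in_file(path)
    return results
```

Every file is read in sequence by one process, so the run time grows with the total data size; this is the limitation that motivates moving the same computation to MapReduce.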
Analyzing the Data with Hadoop MapReduce
Map Function
MapReduce logical data flow
MapReduce
Weather Dataset with MapReduce
Input formatting phase
HDFS data -> Formatting -> <key, value> collection
MapReduce
Input formatting phase
Output (to MR framework):
HDFS data -> Formatting -> <key, value> collection
MapReduce
Map phase
<key, value> collection -> Map -> <key, value> collection
MapReduce
MR framework processing phase
<key, value> collection -> MR framework processing -> <key, values> collection
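The framework processing step, which turns a <key, value> collection into a <key, values> collection, can be illustrated with a small grouping function. This is a sketch only: the function name `shuffle` and the sample pairs (year, temperature in tenths of a degree) are illustrative assumptions, and the real framework performs this step across machines with sorting and partitioning.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group map output by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


# Illustrative map output: (year, temperature) pairs in arrival order.
map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]
grouped = shuffle(map_output)
print(grouped)  # [(1949, [111, 78]), (1950, [0, 22, -11])]
```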
MapReduce
Reduce phase
<key, values> collection -> Reduce -> <key, value> collection
MapReduce
Data output phase
<key, value> collection -> Output -> HDFS data
MapReduce
Some code.. Map function
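The slide's own code is not reproduced in this text, so what follows is a stand-in sketch of a map function in Python, not the original listing. It assumes the commonly described NCDC fixed-width layout (year at columns 15-18, signed temperature at columns 87-91, quality code at column 92) and a 9999 missing-value sentinel; the function name is hypothetical.

```python
MISSING = 9999  # assumed sentinel meaning "no reading"


def map_max_temperature(offset, line):
    """Map step: (offset, raw line) -> zero or one (year, temperature) pairs."""
    year = line[15:19]          # assumed offset of the year field
    temp = int(line[87:92])     # int() accepts the leading "+" or "-" sign
    quality = line[92]
    if temp != MISSING and quality in "01459":
        yield year, temp


# A synthetic 93-character record built to match the assumed layout.
sample = "0" * 15 + "1950" + "0" * 68 + "+0022" + "1"
print(list(map_max_temperature(0, sample)))  # [('1950', 22)]
```

The input key (the line's offset) is ignored; the map step only filters bad readings and projects each record down to the two fields the reducer needs.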
MapReduce
Some code.. Reduce function
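As with the map step, the original listing is not available here, so this is a minimal Python stand-in under the assumption that the job computes the maximum temperature per year. The function name is hypothetical.

```python
def reduce_max_temperature(year, temperatures):
    """Reduce step: (year, iterable of readings) -> (year, maximum reading)."""
    return year, max(temperatures)


print(reduce_max_temperature("1950", [0, 22, -11]))  # ('1950', 22)
```

The reducer sees all values for one key together, which is why a single pass with `max` is enough.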
MapReduce
Some code.. Putting it all together
And running:
hadoop MaxTemperature input/ncdc/sample.txt output
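To see how the pieces fit without a cluster, the job can be imitated in a single process. This is a sketch, not Hadoop's job API: `run_job`, `mapper`, and `reducer` are hypothetical names, and the input uses a simplified "year,temperature" line format instead of real NCDC records.

```python
from itertools import groupby
from operator import itemgetter


def run_job(lines, mapper, reducer):
    """Run map, then sort and group by key, then reduce, all in one process."""
    mapped = []
    for index, line in enumerate(lines):
        # The line index stands in for the byte offset Hadoop would pass.
        mapped.extend(mapper(index, line))
    mapped.sort(key=itemgetter(0))
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


def mapper(_offset, line):
    year, temp = line.strip().split(",")
    yield year, int(temp)


def reducer(year, temps):
    return year, max(temps)


sample = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]
result = run_job(sample, mapper, reducer)
print(result)  # [('1949', 111), ('1950', 22)]
```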
Scaling Out
Data Flow
MapReduce data flow with a single reduce task
MapReduce
Job Tracker
Task Trackers (three shown in the diagram)
Hadoop Streaming
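Hadoop Streaming lets map and reduce steps be written in any language that reads standard input and writes standard output, with keys and values exchanged as tab-separated lines. Below is a sketch of a reducer in that style, assuming the job is the maximum-temperature example and that input lines arrive sorted by key; `stream_reduce` and `main` are hypothetical names.

```python
import sys


def stream_reduce(lines):
    """Consume sorted 'key<TAB>value' lines and yield 'key<TAB>max' per key."""
    last_key, max_val = None, None
    for line in lines:
        key, val = line.rstrip("\n").split("\t")
        if last_key is not None and key != last_key:
            # Key changed: the previous group is complete, so emit it.
            yield f"{last_key}\t{max_val}"
            max_val = None
        last_key = key
        max_val = int(val) if max_val is None else max(max_val, int(val))
    if last_key is not None:
        yield f"{last_key}\t{max_val}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a Streaming script would call: stdin in, stdout out."""
    for out in stream_reduce(stdin):
        stdout.write(out + "\n")
```

Unlike a reduce function that receives one key with all its values, a Streaming reducer receives a flat stream of lines and has to detect key boundaries itself, which is what the `last_key` bookkeeping does.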
Hadoop Pipes
Design of HDFS
HDFS is designed for:
HDFS cannot be used for:
HDFS Concepts
Namenode and Datanodes
Master/slave architecture
HDFS Architecture
Secondary Namenode
Limitations of Current HDFS Architecture
HDFS Federation
HDFS Federation Architecture
HDFS Federation Architecture
Block pool and Namespace Volume
HDFS High Availability
Availability if NameNode fails
Quorum Journal Nodes
Failover and fencing
Hadoop Filesystem
Data replication
Replica Placement
Replica Selection
Anatomy of a File Read
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
Step 2: DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.
Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections to datanodes. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close().
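The read path in Steps 1 to 6 can be mimicked with a toy in-memory model. This is a sketch under loud assumptions: `ToyNameNode`, `read_file`, the dictionary-based datanodes, and the per-node distance values are all hypothetical stand-ins, and none of this is the HDFS client API. It only shows the division of labour: the namenode serves block locations, and the data itself comes from the closest datanode holding each block.

```python
class ToyNameNode:
    """Holds only metadata: which datanodes store each block of a file."""

    def __init__(self, block_locations):
        # {path: [[datanode name, ...] for block 0, [...] for block 1, ...]}
        self.block_locations = block_locations

    def get_block_locations(self, path):
        return self.block_locations[path]


def read_file(namenode, datanodes, distance, path):
    """Read every block in order, each from the closest datanode holding it."""
    data = b""
    contacted = []
    for index, holders in enumerate(namenode.get_block_locations(path)):
        closest = min(holders, key=distance)       # pick the best datanode
        contacted.append(closest)
        data += datanodes[closest][(path, index)]  # stream that block
    return data, contacted


namenode = ToyNameNode({"/f": [["dn1", "dn2"], ["dn2", "dn3"]]})
datanodes = {
    "dn1": {("/f", 0): b"hello "},
    "dn2": {("/f", 0): b"hello ", ("/f", 1): b"world"},
    "dn3": {("/f", 1): b"world"},
}
distance = {"dn1": 4, "dn2": 2, "dn3": 0}.get  # made-up distances from the client
data, contacted = read_file(namenode, datanodes, distance, "/f")
print(data, contacted)  # b'hello world' ['dn2', 'dn3']
```

Because the namenode never handles file contents, many clients can read concurrently without the namenode becoming a data bottleneck.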
Network Topology and Hadoop
Network Topology and Hadoop
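One way to make "closest datanode" concrete is to treat the network as a tree and measure distance as the number of hops from each node up to their closest common ancestor. The sketch below assumes locations are written as paths of the form /datacenter/rack/node; that path format and the function name are assumptions for illustration, not something stated on the slide.

```python
def distance(a, b):
    """Hops from a and b up to their closest common ancestor, summed."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center, other rack
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data center
```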
Anatomy of a File Write
Step 1: The client creates the file by calling create(). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace.
Step 2: The namenode performs various checks (for example, that the file does not already exist and that the client has permission to create it) and, if they pass, makes a record of the new file.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which asks the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
Step 4: The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline.
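Step 5: The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.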
Step 6: When the client has finished writing data, it calls close().
Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.
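The interplay of the data queue, the datanode pipeline, and the ack queue can be sketched with a toy model. Everything here is a hypothetical stand-in: datanodes are plain Python lists, every store succeeds, and acknowledgments are counted synchronously, whereas the real client streams packets and handles failures asynchronously.

```python
from collections import deque


def write_through_pipeline(data, packet_size, pipeline):
    """Send data packet by packet through a list of datanode stores."""
    data_queue = deque(
        data[i:i + packet_size] for i in range(0, len(data), packet_size)
    )
    ack_queue = deque()
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)   # held until every datanode acknowledges
        acks = 0
        for store in pipeline:     # each datanode stores, then forwards
            store.append(packet)
            acks += 1
        if acks == len(pipeline):
            ack_queue.popleft()    # acknowledged by the whole pipeline
    return len(ack_queue) == 0     # True when nothing is left unacknowledged


dn1, dn2, dn3 = [], [], []
ok = write_through_pipeline(b"abcdefgh", 3, [dn1, dn2, dn3])
print(ok, dn1)  # True [b'abc', b'def', b'gh']
```

Keeping a packet in the ack queue until all replicas confirm it is what would allow unacknowledged packets to be resent if a datanode in the pipeline failed.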
Replica Placement
Hadoop Architecture
Thank You
Hadoop Distributed File System
MapReduce engine
Scaling out!
Uses of Hadoop
Big Data Analytics (BDA)
Why Big Data and BI
4 types of Big Data BI
So, in a nutshell
Problems Associated with reading and writing data from multiple disks
Hadoop is the solution
-- Reliable, scalable platform for storage and analysis
-- Open source
An OS for Networks
Software-Defined Networking (SDN)
(Diagram labels: Control Programs; Global Network View; Network Operating System; Control via forwarding interface; Protocols)
Big Data Conundrum
Source: IBM http://www-01.ibm.com/software/data/bigdata/
The Big Data Platform Manifesto: imperatives and underlying technologies
HADOOP
Manage & store huge volume of any data
Hadoop File System
MapReduce
Hadoop
Hadoop-Related Names to Know
Some concepts
Resources
Thank you