
BIG DATA ANALYTICS - 22SCS21 - MODULE 1

Dr. Madhuri J


The Earthscope

  • The Earthscope is the world's largest science project.
  • Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data.
  • It analyzes seismic slips in the San Andreas fault as well as the plume of magma underneath Yellowstone, and much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)



Big Data - Definition

  • Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools
  • The challenges include capture, storage, search, sharing, analysis, and visualization.


Big Data: A definition

  • Put another way, big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies


Source: Harness the Power of Big Data: The IBM Big Data Platform


Types of Data

  • Relational Data (Tables/Transaction/Legacy Data)
  • Text Data (Web)
  • Semi-structured Data (XML)
  • Graph Data
    • Social Network, Semantic Web (RDF), …

  • Streaming Data
    • You can only scan the data once


Who’s Generating Big Data

  • Progress and innovation are no longer hindered by the ability to collect data
  • but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion


  • Social media and networks (all of us are generating data)
  • Scientific instruments (collecting all sorts of data)
  • Mobile devices (tracking all objects all the time)
  • Sensor technology and networks (measuring all kinds of data)


Challenges


How to transfer Big Data?


Characteristics of Big Data: 1 - Scale (Volume)

  • Data Volume
    • 44x increase from 2009 to 2020
    • from 0.8 zettabytes to 35 zettabytes
  • Data volume is increasing exponentially


Exponential increase in collected/generated data


A typical PC might have had 10 gigabytes of storage in 2000.

• Today, Facebook ingests 500 terabytes of new data every day.

• A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.

• Smartphones and the data they create and consume, together with sensors embedded in everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.


Characteristics of Big Data: 2 - Complexity (Variety)

  • Various formats, types, and structures
  • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
  • Static data vs. streaming data
  • A single application can be generating/collecting many types of data


To extract knowledge 🡺 all these types of data need to be linked together


  • Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
  • Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure.
  • Big Data analysis includes different types of data


Characteristics of Big Data: 3 - Speed (Velocity)

  • Data is being generated fast and needs to be processed fast
  • Online Data Analytics
  • Late decisions 🡺 missing opportunities
  • Examples
    • E-Promotions: Based on your current location, your purchase history, what you like 🡺 send promotions right now for the store next to you

    • Healthcare monitoring: sensors monitoring your activities and body 🡺 any abnormal measurements require immediate reaction


  • Click streams and ad impressions capture user behavior at millions of events per second
  • High-frequency stock trading algorithms reflect market changes within microseconds
  • Machine-to-machine processes exchange data between billions of devices
  • Infrastructure and sensors generate massive log data in real time
  • Online gaming systems support millions of concurrent users, each producing multiple inputs per second


Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data

Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming


Big Data

  • Can you imagine running a query on a 20,980,000 GB file?
  • Every minute: email users send more than 204 million messages;
  • the mobile web receives 217 new users;
  • Google receives over 2 million search queries;
  • YouTube users upload 48 hours of new video;
  • Facebook users share 684,000 pieces of content;
  • Twitter users send more than 100,000 tweets.


What is Big Data

  • Collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
  • Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure. Big Data analysis includes different types of data


Why HADOOP?


Hadoop

  • Hadoop is open source software.
  • Hadoop is a platform/framework
    • Which allows the user to quickly write and test distributed systems
    • Which is efficient in automatically distributing the data and work across machines


Hadoop!

  • Created by Doug Cutting
  • Started as a module in Nutch and then matured into an Apache project
  • Named after his son's stuffed elephant


  • Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
  • It is a flexible and highly-available architecture for large scale computation and data processing on a network of commodity hardware.
  • Designed to answer the question: “How to process big data with reasonable cost and time?”


What we’ve got in Hadoop

  • Fault-tolerant file system
  • Hadoop Distributed File System (HDFS)
  • Modeled on Google File system
  • Takes computation to data
  • Data Locality
  •  Scalability:
    • Program remains same for 10, 100, 1000,… nodes
    • Corresponding performance improvement
  • Parallel computation using MapReduce
  • Other components: Pig, HBase, Hive, ZooKeeper


Hadoop

  • HDFS and MapReduce are two core components of Hadoop

HDFS:

  • Hadoop Distributed File System
  • makes it easy to store data on commodity hardware
  • Built to expect hardware failures
  • Intended for large files & batch inserts

MapReduce

  • For parallel processing


RDBMS compared to MapReduce


                 Traditional RDBMS            MapReduce
Data size        Gigabytes                    Petabytes
Access           Interactive and batch        Batch
Updates          Read and write many times    Write once, read many times
Transactions     ACID                         None
Structure        Schema-on-write              Schema-on-read
Integrity        High                         Low
Scaling          Nonlinear                    Linear


Grid Computing

  • The high-performance computing (HPC) and grid computing communities have been doing large-scale data processing for years, using such application program interfaces (APIs) as the Message Passing Interface (MPI).
  • This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes.
  • MPI gives great control to programmers.
  • Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature is known as data locality.


  • Coordinating the processes in a large-scale distributed computation is a challenge.
  • Distributed processing frameworks like MapReduce detect failed tasks and reschedule replacements on machines that are healthy.
  • MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.


Volunteer Computing

  • Volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside Earth. SETI@home is the most well known of many volunteer computing projects.
  • Volunteer computing projects work by breaking the problems they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.
  • When the analysis is completed, the results are sent back to the server, and the client gets another work unit.
  • The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world because the time to transfer the work unit is dwarfed by the time to run the computation on it.


Hadoop Ecosystem


MapReduce

    • A distributed data processing model and execution environment that runs on large clusters of commodity machines.
  • Can be used with Java, Ruby, Python, C++ and more
  • Inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal

Weather Dataset

  • Weather sensors collect data every hour at many locations across the globe and gather a large volume of log data.
  • The data we will use is from the National Climatic Data Center, or NCDC.
  • The data is stored using a line-oriented ASCII format, in which each line is a record.


  • Mission - calculate max temperature each year around the world
  • Problem - millions of temperature measurement records
  • Challenges
    • The file size for different years varies widely, so some processes will finish much earlier than others. A better approach is to split the input into fixed-size chunks and assign each chunk to a process.
    • Combining the results from independent processes may require further processing, as data for a particular year will typically be split into several chunks, each processed independently.
    • When we start using multiple machines, a whole host of other factors come into play, mainly falling into the categories of coordination and reliability. Who runs the overall job? How do we deal with failed processes?


Example: Weather Dataset

Brute Force approach – Bash:

(each year’s logs are compressed to a single yearXXXX.gz file)

  • The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large Instance.


Analyzing the Data with Hadoop MapReduce

  • MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
  • The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file.
  • Only the year and air temperature fields are required to find the maximum temperature for each year.


Map Function

  • The raw NCDC records are presented to the map function as key-value pairs, where the key is the line offset and the value is the line text.

  • The map function merely extracts the year and the air temperature and emits them as its output (see the illustrative sample below).
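
As an illustration (the records are abbreviated and the values made up for this example; temperatures in the NCDC format are in tenths of a degree Celsius), the map function turns input pairs such as

(0,   0067011990999991950051507004...+0000+...)
(106, 0043011990999991950051512004...+0022+...)
(212, 0043011990999991949032412004...+0111+...)

into output pairs of the form

(1950, 0), (1950, 22), (1949, 111)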


  • This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:

  • Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:
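
Continuing with the same illustrative values, the reduce function receives each year together with the list of its readings and emits the maximum:

(1949, [111, 78])    →  (1949, 111)
(1950, [0, 22, -11]) →  (1950, 22)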


MapReduce logical data flow


MapReduce

Weather Dataset with MapReduce

Input formatting phase

  • The input to the MR job is the raw NCDC data
  • Input format: we use Hadoop's text input format class:
    • When given a directory (HDFS URL), outputs a Hadoop <key,value> collection:
      • The key is the offset of the beginning of the line from the beginning of the file
      • The value is the line text


HDFS data → Formatting → <key, value> collection


MapReduce

Input formatting phase

  • In our example –
    • NCDC log file

Output (to MR framework):


HDFS data → Formatting → <key, value> collection


MapReduce

Map phase

  • The input to our map phase is the lines <offset, line_text> pairs
  • Map function pulls out the year and the air temperature, since these are the only fields we are interested in
  • Map function also drops bad records - filters out temperatures that are missing, suspect, or erroneous.
  • Map Output (<year, temp> pairs):


<key, value> collection → Map → <key, value> collection


MapReduce

MR framework processing phase

  • The output from the map function is processed by the MR framework before being sent to the reduce function
  • This processing sorts and groups the key-value pairs by key
  • MR framework processing output (<year, temperatures> pairs):


<key, value> collection → MR framework processing → <key, values> collection


MapReduce

Reduce phase

  • The input to our reduce phase is the <year, temperatures> pairs
  • All the reduce function has to do now is iterate through the list and pick up the maximum reading
  • Reduce output:


<key, values> collection → Reduce → <key, value> collection


MapReduce

Data output phase

  • The input to the data output class is the <year, max temperature> pairs from the reduce function
  • When using the default Hadoop output formatter, the output is written to a pre-defined directory, which contains one output file per reducer.
  • Output formatter file output:


<key, value> collection → output → HDFS data


MapReduce


Some code.. Map function
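
A minimal sketch of such a map function (new MapReduce API; the class name matches the MaxTemperature run command shown later, and the fixed-width field offsets are assumed to follow the NCDC record layout used in this example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);              // year field of the NCDC record
    int airTemperature;
    if (line.charAt(87) == '+') {                      // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // drop missing, suspect, or erroneous readings
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}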


MapReduce


Some code.. Reduce function
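
A matching reduce function sketch under the same assumptions: it simply keeps the largest temperature seen for each year.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {                 // iterate over all readings for this year
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));     // emit (year, max temperature)
  }
}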


MapReduce


Some code.. Putting it all together
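
A minimal driver sketch consistent with the run command below (class names as assumed above; input and output paths are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Max temperature");
    job.setJarByClass(MaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path, e.g. input/ncdc/sample.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}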

And running:

hadoop MaxTemperature input/ncdc/sample.txt output


Scaling Out

  • To scale out, we need to store the data in a distributed filesystem (HDFS).
  • This allows Hadoop to move the MapReduce computation to each machine hosting a part of the data, using Hadoop’s resource management system, called YARN


Data Flow

  • A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information
  • Hadoop runs the job by dividing it into tasks, of which there are two types:
    • map tasks and reduce tasks.
  • Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
  • If we are processing the splits in parallel, the processing is better load balanced when the splits are small.
  • On the other hand, if splits are too small, the overhead of managing the splits and map task creation begins to dominate the total job execution time


  • For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default.
  • Hadoop does its best to run the map task on a node where the input data resides in HDFS, because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization.


  • Data-local (a), rack-local (b), and off-rack (c) map tasks


MapReduce data flow with a single reduce task


  • Map tasks write their output to the local disk, not to HDFS.
  • Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away.
  • Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally the output from all mappers.
  • The sorted map outputs have to be transferred across the network to the node where the reduce task is running.
  • They are merged and then passed to the user-defined reduce function.
  • The output of the reduce is normally stored in HDFS for reliability.


MapReduce


  • There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers.

  • Jobtracker - coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
  • Tasktrackers - run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.

Job Tracker → Task Tracker, Task Tracker, Task Tracker


Hadoop Streaming

  • Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java.
  • Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program
  • Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output.
  • A map output key-value pair is written as a single tab-delimited line.
  • The reduce function reads lines from standard input, sorted by key, and writes its results to standard output.
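
An illustrative invocation (the streaming jar path varies by installation, and the mapper/reducer script names here are placeholders; any executable that reads lines on standard input and writes tab-delimited key-value lines on standard output will do):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files max_temperature_map.rb,max_temperature_reduce.rb \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper max_temperature_map.rb \
  -reducer max_temperature_reduce.rb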


Hadoop Pipes

  • Hadoop Pipes is the C++ interface to Hadoop MapReduce.
  • Unlike Streaming, which uses standard input and output, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function.


Design of HDFS

HDFS is designed for:

  • Very large Files
  • Streaming data access
  • To work with Commodity hardware

HDFS cannot be used for:

  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications


HDFS Concepts

  • A block is the minimum amount of data that a filesystem can read or write.
  • The default HDFS block size is 128 MB (64 MB in older releases).
  • If the block is large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block, minimizing the cost of seeks. For example, with a 10 ms seek time and a 100 MB/s transfer rate, keeping seek time to about 1% of transfer time requires a block of roughly 100 MB.


Namenode and Datanodes

Master/slave architecture

  • An HDFS cluster consists of a single Namenode, a master server that manages the filesystem namespace (the namespace image and edit log) and regulates access to files by clients.
  • Namespace image refers to the file names with their paths maintained by a name node
  • Namenode maintains the filesystem tree and the metadata for all the files and directories in the tree.
  • Namenode also knows the Datanode on which all the blocks for a given file are located.
  • There are a number of DataNodes, usually one per node in the cluster.


  • Main components:
    • Name node
    • Data node


  • The DataNodes manage storage attached to the nodes that they run on.
  • A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
  • DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
  • A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes


HDFS Architecture


Secondary Namenode

  • Periodically merges the namespace image with the edit log to prevent the edit log from becoming too large.

  • Runs on a separate physical machine, since it requires plenty of CPU and memory to perform the merge.

  • It can be used at the time of primary namenode failure.


Limitations of Current HDFS Architecture

  • Because there is a single NameNode, the number of DataNodes is limited to what one NameNode can handle.

  • Filesystem operations are also limited by the number of tasks a single NameNode can handle at a time, so the performance of the cluster depends on the NameNode's throughput.

  • Because of the single namespace, there is no isolation among the tenant organizations that share the cluster.


HDFS Federation

  • HDFS Federation feature introduced in Hadoop 2 enhances the existing HDFS architecture.

  • Multiple namenodes are added, each of which manages a portion of the filesystem namespace, which allows the cluster to scale horizontally.

  • Eg: one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share

  • Under federation, each namenode manages a namespace volume, and a block pool.


HDFS Federation Architecture

  • NN1, NN2 stand for Namenodes; NS1, NS2 stand for namespaces.
  • Each namespace has its own block pool (NS1 has block pool 1).


HDFS Federation Architecture

  • All the NameNodes use the DataNodes as common storage.

  • Every NameNode is independent of the others and does not require any coordination with them.

  • Each DataNode registers with all the NameNodes in the cluster and stores blocks for all the block pools in the cluster.

  • DataNodes periodically send heartbeats and block reports to all the NameNodes in the cluster and handle instructions from them.


Block pool and Namespace Volume

  • Block pool
    • A block pool is the collection of blocks belonging to a single namespace.
    • Each block pool is managed independently of the others.
  • Namespace Volume
    • A namespace together with its block pool is termed a namespace volume.
    • On deleting a NameNode or namespace, the corresponding block pool on the DataNodes is also deleted.


HDFS High Availability

  • High availability feature in Hadoop ensures the availability of the Hadoop cluster without any downtime, in unfavorable conditions like NameNode failure, DataNode failure, machine crash, etc

  • Availability if NameNode fails
    • Namenode is the sole repository of the metadata and the file-to-block mapping.
    • There are a pair of namenodes in an active-standby configuration.
    • In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption


Availability if NameNode fails

  • Architectural changes are needed to enable namenode availability.
    • The namenodes must use highly available shared storage to share the edit log.
    • Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.
    • Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.
  • The secondary namenode’s role is subsumed by the standby.


Quorum Journal Nodes

  • The active node and the standby nodes communicate with a group of separate daemons called Journal Nodes (JNs).
  • The active NameNode updates the edit log in the JNs.
  • The standby nodes continuously watch the JNs for edit log changes.
  • Standby nodes read the changes in the edit logs and apply them to their own namespace.
  • If the active NameNode fails, the standby will ensure that it has read all the edits from the Journal Nodes before promoting itself to the active state. This ensures synchronization.


Failover and fencing

  • The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller.
    • ZooKeeper is the default implementation.
    • Graceful failover and ungraceful failover.
    • A network partition can trigger a failover transition, even though the previously active namenode is still running.

  • Fencing ensures that the previously active namenode is prevented from doing any damage and causing corruption
    • STONITH or “shoot the other node in the head” is a fencing technique which uses a specialized power distribution unit to forcibly power down the NameNode machine.


Hadoop Filesystem

  • The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations (for example, the local filesystem, HDFS, WebHDFS, and S3A).
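
A minimal sketch showing the abstraction at work (the URIs passed on the command line are placeholders): FileSystem.get() returns whichever concrete implementation is registered for the URI scheme.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    for (String uri : args) {                         // e.g. file:///tmp, hdfs://namenode/user
      FileSystem fs = FileSystem.get(URI.create(uri), conf);
      // Prints e.g. LocalFileSystem for file:// URIs and DistributedFileSystem for hdfs:// URIs
      System.out.println(uri + " -> " + fs.getClass().getSimpleName());
    }
  }
}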


Data replication

  • HDFS is designed to store very large files across machines in a large cluster.
  • Each file is a sequence of blocks.
  • All blocks in the file except the last are of the same size.
  • Blocks are replicated for fault tolerance.
  • Block size and replicas are configurable per file.
  • The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
  • BlockReport contains all the blocks on a Datanode
  • The placement of the replicas is critical to HDFS reliability and performance.
  • Optimizing replica placement distinguishes HDFS from other distributed file systems.


Replica Placement

  • Rack-aware replica placement: Many racks, communication between racks are through switches.
  • Network bandwidth between machines on the same rack is greater than those in different racks.
  • Namenode determines the rack id for each DataNode.
  • Replicas are typically placed on unique racks
  • Replicas are placed: one on a node in a local rack, one on a different node in the local rack and one on a node in a different rack.
  • With this policy, one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are distributed evenly across the remaining racks.


Replica Selection

  • Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency.
  • If there is a replica on the Reader node then that is preferred.
  • HDFS cluster may span multiple data centers: replica in the local data center is preferred over the remote one


Anatomy of a File Read


Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object,which for HDFS is an instance of DistributedFileSystem.

Step 2: DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. The namenode returns the addresses of the datanodes that have a copy of that block.

The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.


Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.

Step 4:Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block.

Step 6: Blocks are read in order, with the DFSInputStream opening new connections to datanodes. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close().
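
A minimal client-side sketch of this read path (the URI is a placeholder); the namenode and datanode traffic described in the steps above happens inside the stream, hidden from the client:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                                  // e.g. an HDFS file URI
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));                         // steps 1-2: open() asks the namenode for block locations
      IOUtils.copyBytes(in, System.out, 4096, false);      // steps 3-5: read() streams data from the closest datanodes
      in.seek(0);                                          // FSDataInputStream supports seeks, so the file can be re-read
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);                             // step 6: close()
    }
  }
}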


Network Topology and Hadoop

  • The limiting factor is the rate at which we can transfer data between nodes—bandwidth is a scarce commodity.
  • the bandwidth available for each of the following scenarios becomes progressively less:
    • Processes on the same node
    • Different nodes on the same rack
    • Nodes on different racks in the same data center
    • Nodes in different data centers


Network Topology and Hadoop

  • For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1.
  • Using this notation, here are the distances for the four scenarios:
    • distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
    • distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
    • distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
    • distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
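
The following is not Hadoop's actual implementation, just a small sketch of the rule behind these numbers: the distance is the number of hops from each node up to the closest common ancestor in the /datacenter/rack/node tree.

public class TopologyDistance {

  static int distance(String a, String b) {
    String[] pa = a.split("/");
    String[] pb = b.split("/");
    int common = 0;
    while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
      common++;                                     // length of the shared path prefix
    }
    // hops from a up to the common ancestor, plus hops down to b
    return (pa.length - common) + (pb.length - common);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
  }
}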


Anatomy of a File Write


Step 1 :The client creates the file by calling create(). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace.

Step 2: The namenode performs various checks (for example, that the file does not already exist and that the client has permission) and then makes a record of the new file.

Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The DataStreamer asks the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.

Step 4: The DataStreamer streams the packets to the first datanode in the pipeline,which stores each packet and forwards it to the second datanode in the pipeline.


Step 5: A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.

Step 6: When the client has finished writing data, it calls close().

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.
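
A minimal client-side sketch of this write path (the URI and file contents are placeholders); the packet pipeline and ack queue described above are managed inside the output stream:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                                 // e.g. an HDFS path for the new file
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    FSDataOutputStream out = fs.create(new Path(uri));    // steps 1-2: create() asks the namenode to record the new file
    out.writeUTF("hello, hdfs");                          // steps 3-5: data is split into packets and pipelined to datanodes
    out.close();                                          // steps 6-7: flush remaining packets, wait for acks, signal completion
  }
}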


Replica Placement

  • How does the namenode choose which datanodes to store replicas?
    • placing all replicas on a single node incurs the lowest write bandwidth penalty.
    • the read bandwidth is high for off-rack reads.
    • placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth.
  • Hadoop’s default strategy
    • First replica on the same node as the client
    • The second replica is placed on a different rack from the first (off-rack), chosen at random.
    • The third replica is placed on the same rack as the second, but on a different node chosen at random.
    • Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.


Hadoop Architecture


Thank You


Hadoop Distributed File System


MapReduce engine


Scaling out!

  • Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
  • Splits normally correspond to (one or more) file blocks
  • Hadoop creates one map task for each split, which runs the user defined map function for each record in the split.
    • The processing is better load balanced when the splits are small.
    • If splits are too small, the overhead of managing the splits and map task creation begins to dominate the total job execution time.
  • A good split size tends to be the size of an HDFS block, which is 128 MB by default.


  • Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
  • If all the nodes hosting the HDFS blocks for a map task’s input split are running other map tasks, the job scheduler will look for a free map slot on a node in the same rack as one of the blocks.
  • If this is not possible, an off-rack node is used, which results in an inter-rack network transfer.
  • The three possibilities are data-local (a), rack-local (b), and off-rack (c) map tasks.


  • Map tasks write their output to the local disk, not to HDFS - Map output is intermediate output: it’s processed by reduce tasks to produce the final output
  • Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.


Uses of Hadoop

  • Scalable: It can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers (in thousands).
  • Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
  • Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.


Big Data Analytics (BDA)

  • Examining large amounts of data
  • Appropriate information
  • Competitive advantage
  • Identification of hidden patterns, unknown correlations
  • Better business decisions: strategic and operational
  • Effective marketing, customer satisfaction, increased revenue


Why Big Data and BI


4 types of Big Data BI

  • Prescriptive: Analysis of actions to be taken.
  • Predictive: Analysis of scenarios that may happen.
  • Diagnostic: Look at past performance.
  • Descriptive: Real time dashboard.


So, in a nutshell

  • Big Data is about better analytics!


Problems Associated with reading and writing data from multiple disks

  • Hardware failure
  • Combining data from different disks

Hadoop is the solution

-- reliable, scalable platform for storage and analysis

-- Open source


An OS for Networks

  • Towards an Operating System for Networks


Software-Defined Networking (SDN): control programs run on a network operating system that maintains a global network view and controls the data plane through a forwarding interface, rather than through per-device protocols.


Big Data Conundrum

  • Problems:
    • Although there is a massive spike in available data, the percentage of the data that an enterprise can understand is on the decline
    • The data that the enterprise is trying to understand is saturated with both useful signals and lots of noise


The Big Data Platform Manifesto: imperatives and underlying technologies


HADOOP


Manage & store huge volumes of any data: Hadoop File System (HDFS) + MapReduce


Hadoop

  • Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure.
  • Hadoop has two components:
    • The Hadoop distributed file system (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between
    • The MapReduce programming paradigm for managing applications on multiple distributed servers
  • The focus is on supporting redundancy, distributed architectures, and parallel processing


Hadoop-Related Names to Know

  • Apache Avro: designed for communication between Hadoop nodes through data serialization
  • Cassandra and HBase: non-relational databases designed for use with Hadoop
  • Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop
  • Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for analysis and exploration
  • Pig Latin: A data-flow language and execution framework for parallel computation
  • ZooKeeper: Keeps all the parts coordinated and working together


Some concepts

  • NoSQL (Not Only SQL): Databases that “move beyond” relational data models (i.e., no tables, limited or no use of SQL)
    • Focus on retrieval of data and appending new data (not necessarily tables)
    • Focus on key-value data stores that can be used to locate data objects
    • Focus on supporting storage of large quantities of unstructured data
    • SQL is not used for storage or retrieval of data
    • No ACID (atomicity, consistency, isolation, durability)


Resources

  • BigInsights Wiki
  • Udacity – Big data and data science
  • BigData University


Thank you
