Big Data Analytics (23MB54)
Unit-I-Introduction to Data Mining and Big Data
Name: R Pavitra
Designation: Assistant Professor
Department: IT
College: LBRCE
Course Outcomes:
CO1: Apply data mining algorithms for classification and clustering.
CO2: Understand the Big Data framework.
CO3: Understand the MapReduce way of solving analytic problems.
CO4: Illustrate the problem and its solutions using Data Analytics.
CO5: Analyze big data applications.
AGENDA:
Introduction to Data mining and Big Data
WHAT IS DATA?
WHAT IS DATA MINING?
Data Mining and Knowledge
Steps of Data Preprocessing
Preprocessing in Data Mining
Motivating Challenges:
Heterogeneous and Complex Data
Data Mining Techniques:
1. Association
Association analysis is the finding of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for a market basket or transaction data analysis.
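As a minimal sketch of how an association rule can be scored, the snippet below computes support and confidence for one rule over a handful of market-basket transactions. The transactions, the item names, and the helper names `count`, `support`, and `confidence` are hypothetical and chosen only for illustration.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both) / count(antecedent)


# Score the rule {diapers} -> {beer}.
rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is usually considered interesting only when both values exceed user-chosen thresholds.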
2. Classification
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
Data mining has different types of classifiers:
3. Prediction
Data prediction is a two-step process, similar to data classification. For prediction, however, we do not use the term “class label attribute”, because the attribute whose values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
4. Clustering
It is a technique used to group similar data instances together based on their intrinsic characteristics or similarities. It aims to discover natural patterns or structures in the data without any predefined classes or labels.
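One common way to group similar instances is k-means, which is used here as an assumed example since the slides do not name a specific algorithm. The sketch below runs k-means on one-dimensional points. The data, the starting centroids, and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Very small k-means for 1-D data: assign each point to its
    nearest centroid, then move each centroid to the mean of its points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Illustrative, unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 10.0])
print(centroids)
print(clusters)
```

No class labels are supplied. The two groups emerge only from the distances between the points.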
5. Regression
It is employed to predict numeric or continuous values based on the relationship between input variables and a target variable. It aims to find a mathematical function or model that best fits the data to make accurate predictions.
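As a sketch of finding a function that best fits the data, the snippet below fits a straight line by ordinary least squares and uses it to predict a new value. Simple linear regression is an assumed choice of model, and the data and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted target for input x = 5
print(slope, intercept, prediction)
```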
6. Anomaly Detection
Example: (Predicting the Type of a Flower)
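A flower-type prediction task of this kind can be sketched with a nearest-neighbour classifier. The choice of classifier is an assumption, since the slide does not state which one is used, and the measurements below are hypothetical values rather than records from a real dataset. The names `training` and `predict` are likewise illustrative.

```python
import math

# Hypothetical labeled examples: (petal length cm, petal width cm) -> flower type.
training = [
    ((1.4, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"),
    ((4.7, 1.4), "versicolor"),
    ((5.8, 2.2), "virginica"),
    ((6.0, 2.5), "virginica"),
]


def predict(sample):
    """Return the label of the training example closest to sample
    (1-nearest-neighbour with Euclidean distance)."""
    _, label = min(training, key=lambda item: math.dist(item[0], sample))
    return label


# Predict the class of flowers whose label is unknown.
print(predict((1.3, 0.25)))
print(predict((4.6, 1.5)))
print(predict((5.9, 2.3)))
```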
Transaction data collected at the checkout counters of a grocery store.
What is Big Data?
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, comments, etc.
Tabular Representation of Various Memory Sizes
Name | Equal To | Size (in Bytes) |
Bit | 1 bit | 1/8 |
Nibble | 4 bits | 1/2 (rare) |
Byte | 8 bits | 1 |
Kilobyte | 1,024 bytes | 1,024 |
Megabyte | 1,024 kilobytes | 1,048,576 |
Gigabyte | 1,024 megabytes | 1,073,741,824 |
Terabyte | 1,024 gigabytes | 1,099,511,627,776 |
Petabyte | 1,024 terabytes | 1,125,899,906,842,624 |
Exabyte | 1,024 petabytes | 1,152,921,504,606,846,976 |
Zettabyte | 1,024 exabytes | 1,180,591,620,717,411,303,424 |
Yottabyte | 1,024 zettabytes | 1,208,925,819,614,629,174,706,176 |
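Every row from Kilobyte onward is 1,024 times the row before it, so the sizes can be reproduced as powers of 1,024. The sketch below does that arithmetic. The helper names `size_in_bytes` and `human_readable` are hypothetical.

```python
# Units in increasing order; each one is 1,024 times the previous one.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]


def size_in_bytes(unit):
    """Size of one unit in bytes, e.g. Megabyte -> 1024 ** 2."""
    return 1024 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    index = 0
    while value >= 1024 and index < len(UNITS) - 1:
        value /= 1024
        index += 1
    return f"{value:.2f} {UNITS[index]}"


for unit in UNITS:
    print(f"{unit}: {size_in_bytes(unit):,}")

# The 500 terabytes per day mentioned above, expressed in bytes and back.
daily_bytes = 500 * size_in_bytes("Terabyte")
print(f"{daily_bytes:,} bytes = {human_readable(daily_bytes)}")
```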
Evolution of Big Data by Technology
Big Data Characteristics
The five V's of Big Data describe its characteristics.
5 V's of Big Data
Volume:
Variety:
The data is categorized as below:
Veracity:
For example, Facebook posts with hashtags.
Value:
Velocity:
TYPES OF BIG DATA
An 'Employee' table in a database is an example of Structured Data
Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs |
2365 | Rajesh Kulkarni | Male | Finance | 650000 |
3398 | Pratibha Joshi | Female | Admin | 650000 |
7465 | Shushil Roy | Male | Admin | 500000 |
7500 | Shubhojit Das | Male | Finance | 500000 |
7699 | Priya Sane | Female | Finance | 550000 |
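Because structured data has a fixed schema, it can be queried directly with SQL. As a minimal sketch, the snippet below loads the rows of the table above into an in-memory SQLite database and runs two queries. The use of SQLite and the variable names are assumptions made for illustration, and the column types are guessed from the values shown.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    "Employee_ID INTEGER PRIMARY KEY, Employee_Name TEXT, "
    "Gender TEXT, Department TEXT, Salary_In_lacs INTEGER)"
)

# Rows copied from the Employee table above.
rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Average salary per department.
avg_by_dept = dict(conn.execute(
    "SELECT Department, AVG(Salary_In_lacs) FROM Employee GROUP BY Department"
).fetchall())

# Names of everyone in Finance, ordered by employee ID.
finance_names = [name for (name,) in conn.execute(
    "SELECT Employee_Name FROM Employee WHERE Department = ? "
    "ORDER BY Employee_ID", ("Finance",)
)]
conn.close()

print(avg_by_dept)
print(finance_names)
```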
The output returned by 'Google Search'
Challenges with Big Data
2. Big Data in Healthcare
3. Big Data in Education
4. Big Data in E-commerce
5. Big Data in Media and Entertainment
6. Big Data in Finance
BIG DATA vs. HADOOP
Big Data | Apache Hadoop |
Big Data is a group of technologies. It is a collection of huge data that keeps multiplying continuously. | Apache Hadoop is an open-source, Java-based framework that implements some of the big data principles. |
It is a collection of assets that is complex, complicated, and ambiguous. | It achieves a set of goals and objectives for dealing with that collection of assets. |
It is a complicated problem, i.e., a huge amount of raw data. | It is a solution, i.e., a machine for processing that data. |
Big Data is harder to access. | It allows the data to be accessed and processed faster. |
It is hard to store such a huge amount of data because it consists of all forms of data, i.e., structured, unstructured, and semi-structured. | It implements the Hadoop Distributed File System (HDFS), which allows the storage of different varieties of data. |
Big Data has a wide range of applications in fields such as telecommunications, banking, healthcare, etc. | Hadoop is used for cluster resource management, parallel processing, and data storage. |
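The parallel-processing role of Hadoop is usually described through the MapReduce model named in CO3. The sketch below imitates the map, shuffle, and reduce phases of a word count in a single Python process. It does not use any Hadoop API, and the function names and input documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Illustrative input; on a real cluster each document could be
# mapped on a different node.
documents = ["big data needs big storage", "hadoop stores big data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Each map call depends only on its own document, which is what allows the work to be spread across many machines.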
What is Analytics Architecture?
Key components of Analytics Architecture:
Limitations of Analytics Architecture:
There are several limitations to consider when designing and implementing an analytical architecture: