Syllabus

Data Mining:

Data, Types of Data, Data Mining Functionalities, Interestingness Patterns, Classification of Data Mining Systems, Data Mining Task Primitives, Integration of a Data Mining System with a Data Warehouse, Major Issues in Data Mining, Data Preprocessing.

What is Data Mining?

Data mining is the extraction of information previously unknown to the user from a large collection of data.

Characteristics of Data Mining:

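Non-Trivial: the extracted information should be relevant to what the user requires and not obtainable by a simple lookup.

Novel: the information should be previously unknown, and the results should be consistent, that is, the same at all times even when a different algorithm is applied.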

Useful: information which is retrieved should be useful for decision making.

Data mining is used in business to make better managerial decisions by:

Automatic summarization of data
Extracting essence of information stored.
Discovering patterns in raw data.

KDD Process in Data Mining (Knowledge Discovery in Databases)

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.

Cleaning in case of missing values.
Cleaning noisy data, where noise is a random error or variance in a measured variable.
Cleaning with data discrepancy detection and data transformation tools.
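
As a minimal sketch of the missing-value case, the snippet below fills gaps with the mean of the observed values. Mean imputation is only one possible strategy, and the function name and the sample ages are illustrative assumptions, not part of the slides.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # no observed values, so there is nothing to estimate from
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing entries.
ages = [25, None, 35, 30, None]
cleaned = fill_missing_with_mean(ages)
print(cleaned)
```

Other choices, such as the median or a class-specific mean, would slot into the same structure.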

2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).

Data integration using data migration tools.
Data integration using data synchronization tools.
Data integration using the ETL (Extract-Transform-Load) process.

3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.

Data selection using neural networks.
Data selection using decision trees.
Data selection using naive Bayes.
Data selection using clustering, regression, etc.

4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure.

5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.

Transforms task relevant data into patterns.
Decides purpose of model using classification or characterization.

6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures.

Find interestingness score of each pattern.
Uses summarization and Visualization to make data understandable by user.

7. Knowledge Representation: Knowledge representation is defined as a technique that uses visualization tools to represent data mining results.

Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.

Note:

KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new data can be integrated and transformed in order to get different and more appropriate results.

Types of Data

Data: Data is information that has been translated into a form that is efficient for movement or processing.

The most basic forms of data for mining applications are:

Database data
Data warehouse data
Transactional data
Other kinds of data

1. Database Data

DBMS (database management system): contains a collection of interrelated databases.

Ex: Faculty database, student database, publications database

Each database contains a collection of tables and functions to manage and access the data.

Ex: student_bio, student_parking

Each table contains columns and rows, with columns as attributes of data and rows as records.
Tables can be used to represent the relationships between or among multiple tables.

Through the use of relational queries, you can ask things like, “Show me a list of all items that were sold in the last quarter.” Relational languages also use aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum). Using aggregates allows you to ask:

“Show me the total sales of the last month, grouped by branch,” or
“How many sales transactions occurred in the month of December?”
or “Which salesperson had the highest sales?”
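
The three example questions can be sketched as SQL aggregate queries. The snippet below uses Python's built-in sqlite3 module with an in-memory database; the `sales` table, its columns, and the rows are hypothetical and exist only to make the queries runnable.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, salesperson TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Asha", 120.0),
        ("North", "Ben", 80.0),
        ("South", "Chen", 200.0),
        ("South", "Asha", 50.0),
    ],
)

# "Show me the total sales, grouped by branch."
totals = conn.execute(
    "SELECT branch, SUM(amount) FROM sales GROUP BY branch ORDER BY branch"
).fetchall()

# "How many sales transactions occurred?"
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# "Which salesperson had the highest sales?"
top = conn.execute(
    "SELECT salesperson, SUM(amount) AS total FROM sales "
    "GROUP BY salesperson ORDER BY total DESC LIMIT 1"
).fetchone()

print(totals, count, top)
```

A real schema would also carry a date column so that the "last month" and "December" filters could be expressed in a WHERE clause.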

When mining relational databases, we can go further by searching for trends or data patterns.

For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.

Data mining systems may also detect deviations—that is, items with sales that are far from those expected in comparison with the previous year.

2. Data Warehouse Data

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum(sales amount). A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

Data are organized around major subjects

e.g. customer, item, supplier and activity.

Provide information from a historical perspective

e.g. from the past 5 – 10 years

Typically summarized to a higher level

e.g. a summary of the transactions per item type for each store

Users can perform drill-down or roll-up operations to view the data at different degrees of summarization.
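
Roll-up and drill-down can be sketched as the same aggregation run over fewer or more grouping columns. The rows, store names, and the `aggregate` helper below are illustrative assumptions, not a data warehouse API.

```python
from collections import defaultdict

# Hypothetical fact rows: (store, item_type, item, units_sold)
sales = [
    ("Store A", "TV", "TV-100", 3),
    ("Store A", "TV", "TV-200", 2),
    ("Store A", "Phone", "PH-1", 5),
    ("Store B", "TV", "TV-100", 4),
]

def aggregate(rows, key_columns):
    """Sum units_sold (column 3) grouped by the given column indexes."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_columns)
        totals[key] += row[3]
    return dict(totals)

detailed = aggregate(sales, (0, 1, 2))  # drill-down: store, item type, item
by_type = aggregate(sales, (0, 1))      # roll-up: store and item type
by_store = aggregate(sales, (0,))       # roll-up further: store only

print(by_type)
print(by_store)
```

Rolling up drops a grouping column (item, then item type); drilling down adds it back.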

3. Transactional Data

In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
A transactional database may have additional tables, which contain other information related to the transactions, such as item description, information about the salesperson or the branch, and so on.

4. Other Kinds of Data

Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data),
Data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
Spatial data (e.g., maps),
Engineering design data (e.g., the design of buildings, system components, or integrated circuits),
Hypertext and multimedia data (including text, image, video, and audio data),
Graph and networked data (e.g., social and information networks),
The Web (a huge, widely distributed information repository made available by the Internet).

A typical DM System Architecture

Data Mining Tasks:

Class/Concept Description

Mining of Frequent Patterns

Mining of Associations

Mining of Correlations

Mining of Clusters

Data Sources: Databases, the World Wide Web (WWW), and data warehouses are all data sources. The data in these sources may be in the form of plain text, spreadsheets, or other media such as photos or videos. The WWW is one of the biggest sources of data.

Database Server: The database server contains the actual data ready to be processed. It performs the task of handling data retrieval as per the request of the user.

Data Mining Engine: It is one of the core components of the data mining architecture that performs all kinds of data mining techniques like association, classification, characterization, clustering, prediction, etc.

Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data and sometimes they also interact with the database servers for producing the result of the user requests.

Graphical User Interface: Since the user cannot be expected to fully understand the complexity of the data mining process, a graphical user interface helps the user communicate effectively with the data mining system.

Knowledge Base: Knowledge Base is an important part of the data mining engine that is quite beneficial in guiding the search for the result patterns. Data mining engines may also sometimes get inputs from the knowledge base. This knowledge base may contain data from user experiences. The objective of the knowledge base is to make the result more accurate and reliable.

Data Mining Functionalities

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.

1. Descriptive mining

2. Predictive mining

1. Descriptive data mining identifies what happened in the past by analyzing stored past data.

Ex: research, business, economics, social sciences, and healthcare.

2. Predictive data mining is analysis done to predict a future event, other data, or trends.

Ex:

Predicting employee growth in HR.
Predicting performance in sports.
Forecasting patterns in weather.
Fraud detection.

Data Mining Functionalities

Classification: assigns data to different predefined classes.
Clustering & Anomaly Detection (Outlier/Change Detection): clustering groups a set of objects so that objects in the same group are more similar to each other than to those in other groups; anomaly detection identifies unusual data records.
Regression: predicts a continuous numeric value from the data.
Association Rules: discovering interesting relations between variables in large databases
Decision Trees: a model that uses a tree-like graph of decisions and their possible consequences.
Neural Networks: neural networks are a series of algorithms that attempt to recognise underlying relationships in a data set.
Data Visualization: Turning complex data sets into graphical representations that are easy to understand and interpret.
Text Mining: Utilizing techniques to extract qualitative information from text data sources.

Interestingness Patterns

A data mining system has the potential to generate thousands or even millions of patterns, or rules. Are all of the patterns interesting? Typically not: only a small fraction of the patterns potentially generated would be of interest to any given user.

What makes a pattern interesting? A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, and novel.

Can a data mining system generate all the interesting patterns? This question refers to the completeness of a data mining algorithm.

Can a data mining system generate only interesting patterns? It is highly desirable for data mining systems to generate only interesting patterns. An interesting pattern represents knowledge.

Classification of Data Mining Systems

Classification of the data mining system helps users to understand the system and match their requirements with such systems.

Data mining systems can be categorized according to various criteria, as follows:

i) Classification according to the kinds of databases mined:

Database systems can be classified according to data models, so we may have a relational, transactional, object-relational, or data warehouse mining system.
Each of these may require its own data mining technique.

ii) Classification according to the kinds of knowledge mined:

Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

iii) Classification according to the kinds of techniques utilized:

Data mining systems can be categorized according to the underlying data mining techniques employed.
These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems).

iv) Classification according to the applications adapted:

Data mining systems can also be categorized according to the applications to which they are adapted.
For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.

Data Mining Task primitives

A data mining task can be specified in the form of a data mining query, which is input to the data mining system.
A data mining query is defined in terms of data mining task primitives.
These primitives allow the user to interactively communicate with the data mining system during the mining process to discover interesting patterns.

Set of task relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested.

This portion includes the following

Database Attributes
Data Warehouse dimensions of interest

For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can restrict the mining to the data on Canadian customers and the attributes of interest. These are referred to as relevant attributes.

Kind of knowledge to be mined

This specifies the data mining functions to be performed, such as:

Characterization and discrimination
Association
Classification
Clustering
Prediction
Outlier analysis

For instance, if studying the buying habits of customers in Canada, you may choose to mine associations between customer profiles and the items that these customers like to buy.

Background knowledge to be used in discovery process: Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process and for evaluating the patterns found. User beliefs about relationships in the data are one example of such knowledge.

Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.

Interestingness measures and thresholds for pattern evaluation

Interestingness measures are used to separate interesting patterns from uninteresting ones.
They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures.

For example, interestingness measures for association rules include support and confidence.
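
To make support and confidence concrete, here is a small sketch using their usual definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule A => B is support(A and B) divided by support(A). The transactions and function names are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of consequent given antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread transactions
print(s, c)
```

A threshold on each measure (for example, a minimum support) is what lets the system discard uninteresting rules.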

Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for knowledge presentation.

These forms include rules, tables, reports, charts, graphs, decision trees, and cubes.

Integration of Data mining system with a Data warehouse

The data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database or data warehouse system.

No Coupling

No coupling means that a Data Mining system will not utilize any function of a Data Base or Data Warehouse system.
It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and then store the mining results in another file.

Drawbacks of No Coupling

First, without using a Database/Data Warehouse system, a Data Mining system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
Second, there are many tested, scalable algorithms and data structures implemented in Database and Data Warehouse systems, which a system with no coupling cannot take advantage of.

Loose Coupling

In loose coupling, the data mining system uses some facilities or services of a database or data warehouse system. The data is fetched from a data repository managed by these (DB/DW) systems.
Data mining approaches are used to process the data and then the processed data is saved either in a file or in a designated area in a database or data warehouse.
Loose coupling is better than no coupling because it can fetch any portion of data stored in Databases or Data Warehouses by using query processing, indexing, and other system facilities.

Drawbacks of Loose Coupling

It is difficult for loose coupling to achieve high scalability and good performance with large data sets.

Semi-Tight Coupling

Semi tight coupling means that besides linking a Data Mining system to a Data Base/Data Warehouse system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multi way join, and pre-computation of some essential statistical measures, such as sum, count, max, min, standard deviation.

Advantage of Semi-Tight Coupling

This Coupling will enhance the performance of Data Mining systems

Tight Coupling

Tight coupling means that a Data Mining system is smoothly integrated into the Data Base/Data Warehouse system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.

Major Issues in Data Mining

Data Reduction

Data reduction is a process used in data processing and analysis to reduce the amount of data without significantly affecting its integrity or quality. The goal is to simplify or compress the dataset to make it easier to store, process, and analyze while retaining the essential information.
Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Ex: Image Processing

Techniques in data reduction:

1. Data Compression

2. Dimensionality Reduction

3. Numerosity Reduction

Dimensionality Reduction

The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."
Ex: speech recognition, signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis, etc.

Wavelet Transformation

The signal is represented by wavelets, which are small, oscillating functions that capture both time and frequency information.
A wavelet transform decomposes a signal into a set of basis functions. These basis functions are called wavelets.
Wavelet Transformation works on both positive and negative areas.
Types of wavelet transforms
Continuous wavelet transforms
Discrete wavelet transforms

Wavelet Transformation (continued)

The discrete wavelet transform (DWT) is a linear signal processing technique.
When the DWT is applied, the data vector X is transformed into a numerically different vector, X', of wavelet coefficients.
The two vectors X and X' are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, …, xn), where n is the number of attributes in the relation of the data set.

Pyramid Method

Pyramid Algorithm

The length L of the input data vector must be an integer power of 2. If L is not a power of 2, zeroes can be appended to the end of the input data vector to pad it to the next power of 2.
Each transform applies two functions to the data vector. The first performs data smoothing, such as a weighted average of the data. The second performs a weighted difference, which brings out the detailed features of the input vector.
The two functions are applied to pairs of data points in X, that is, to all pairs (x2i, x2i+1). This yields two data sets of length L/2: the first is a smoothed, low-frequency version of the original data, and the second is its high-frequency content.
The two functions are applied recursively to the data sets obtained in the previous step until the resulting data sets are of length 2.

Finally, selected values from the data sets obtained in these iterations are designated the wavelet coefficients of the transformed data.
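
The pyramid algorithm can be sketched with the Haar wavelet, where smoothing is the pairwise average and the difference function is half the pairwise difference. This is an assumption for illustration: the slides do not name a specific wavelet, and this sketch recurses down to a single overall average rather than stopping at length 2.

```python
def haar_dwt(data):
    """Haar wavelet transform via the pyramid algorithm (unnormalized averages)."""
    values = [float(v) for v in data]

    # Pad with zeroes so the length is a power of 2.
    length = 1
    while length < len(values):
        length *= 2
    values += [0.0] * (length - len(values))

    coefficients = []
    while len(values) > 1:
        pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
        smooth = [(a + b) / 2 for a, b in pairs]  # low-frequency version
        detail = [(a - b) / 2 for a, b in pairs]  # high-frequency content
        coefficients = detail + coefficients      # coarser details go in front
        values = smooth
    return values + coefficients                  # same length as the padded input

x = [2, 2, 0, 2, 3, 5, 4, 4]
x_transformed = haar_dwt(x)
print(x_transformed)
```

For data reduction, small coefficients in the result can be set to zero and only the larger ones stored, which gives a compressed approximation of the original vector.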

Principal Component Analysis

Principal Component Analysis (PCA) is a data reduction technique used to simplify large datasets by reducing the number of variables (features) while preserving as much information as possible.
It does this by transforming the original variables into new, uncorrelated variables called principal components.

Steps of PCA:

Standardization: Ensure the data has a mean of 0 and standard deviation of 1.

Covariance Matrix: Calculate the covariance matrix to understand how variables are related.

Eigenvectors and Eigenvalues: Identify the principal components (eigenvectors) and the amount of variance they capture (eigenvalues).

Project Data: Transform the original data into the new principal components.
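
The four steps can be sketched in plain Python as follows. Two things here are assumptions rather than part of the slides: the eigenvector step uses power iteration, which recovers only the first principal component, and the five two-feature rows are made-up data. A numerical library would normally compute the full eigendecomposition instead.

```python
import math
from statistics import mean, stdev

def standardize(rows):
    """Step 1: rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    devs = [stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, devs)] for row in rows]

def covariance_matrix(rows):
    """Step 2: sample covariance of columns that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Step 3 (simplified): leading eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue

# Hypothetical data: two strongly correlated features.
data = [[2.0, 4.1], [3.0, 6.2], [4.0, 7.9], [5.0, 10.1], [6.0, 11.8]]
z = standardize(data)
cov = covariance_matrix(z)
component, variance = first_component(cov)

# Step 4: project each standardized row onto the first principal component.
projected = [sum(x * c for x, c in zip(row, component)) for row in z]
print(component, variance)
```

Because the two features are almost perfectly correlated, the first component captures nearly all of the variance, so the two columns can be replaced by the single projected column with little loss.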

Applications:

Data compression: Reduce the number of features while keeping essential information.
Visualization: PCA can reduce complex datasets (e.g., 10 features) to 2 or 3 components for easy visualization.

Numerosity Reduction

It is a technique that replaces the original data with alternative, smaller forms of data representation.

Types:

Parametric
Non-Parametric

1. Parametric

This method assumes a model into which the data fits. Data model parameters are estimated, and only those parameters are stored, and the rest of the data is discarded.

Regression
Log-linear model

Regression: Regression can be a simple linear regression or multiple linear regression. When there is only a single independent attribute, such a regression model is called simple linear regression. If there are multiple independent attributes, then such regression models are called multiple linear regression.
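
As a sketch of parametric numerosity reduction, the snippet below fits a simple linear regression by ordinary least squares and keeps only the two parameters (slope and intercept) in place of the data points. The function name and the sample points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one independent attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Storing `slope` and `intercept` instead of the five (x, y) pairs is the reduction; real data would not fit exactly, so the stored model is an approximation.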

Log-Linear Model: The log-linear model discovers the relationship between two or more discrete attributes.

2. Non-Parametric

A non-parametric numerosity reduction technique does not assume any model.

Histogram
Clustering
Sampling
Data Cube Aggregation

Histogram

A histogram represents data in terms of frequency.
A histogram for an attribute ‘A’ partitions the data into disjoint subsets referred to as buckets or bins.

1. Singleton bucket 2. Equal-width bucket

1. Singleton Bucket

2. Equal-Width Bucket
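
The two bucket types can be sketched as follows: a singleton bucket holds one distinct value with its frequency, while an equal-width bucket covers a fixed-size range of values. The price list, the bucket width of 10, and the helper name are assumptions for illustration, and the helper assumes integer values.

```python
from collections import Counter

# Hypothetical integer prices.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 15, 18, 20, 21, 25, 28, 30]

# Singleton buckets: one bucket per distinct value.
singleton = Counter(prices)

def equal_width_histogram(values, width, start):
    """Count integer values in buckets [start, start+width-1], [start+width, ...], ..."""
    counts = Counter()
    for v in values:
        low = start + ((v - start) // width) * width
        counts[(low, low + width - 1)] += 1
    return dict(sorted(counts.items()))

equal_width = equal_width_histogram(prices, width=10, start=1)
print(singleton[5])
print(equal_width)
```

Here 18 values are summarized by three (range, count) pairs, which is the numerosity reduction.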

Data Discretization and Concept Hierarchy

Data discretization is a method of converting a huge number of data values into a smaller number so that evaluating and managing the data becomes easy.
More precisely, data discretization converts the values of a continuous attribute into a finite set of intervals with minimum data loss.

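Supervised discretization refers to a method in which class information is used.

Unsupervised discretization refers to a method in which class information is not used. Either kind can be further categorized by the direction in which the operation proceeds: top-down or bottom-up.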

Top-down Discretization (splitting)

The process starts by first finding one or a few points, called split points or cut points, to split the entire attribute range, and then repeats this recursively on the resulting intervals.

Bottom-up Discretization (merging)

The process starts by considering all of the continuous values as potential split points.
It removes some by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals.

Concept Hierarchies

Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
This organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input and output operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.

Discretization and Concept Hierarchy Generation for Numerical Data

Binning
Histogram Analysis
Cluster Analysis
Discretization by Intuitive Partitioning

1] Binning

Binning is a top-down splitting technique based on a specified number of bins.
Binning is an unsupervised discretization technique because it does not use class information.
The sorted values are distributed into several buckets or bins, and each value in a bin is then replaced by the bin mean or median.
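
A minimal sketch of binning with smoothing by bin means is shown below. It assumes equal-frequency bins of a fixed size; the function name and the nine sample prices are illustrative.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical prices, three values per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)
```

Using `statistics.median` in place of `mean` gives smoothing by bin medians instead.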

2] Histogram Analysis

It is an unsupervised discretization technique because histogram analysis does not use class information.
Histograms partition the values for an attribute into disjoint ranges called buckets.

It is also further classified into

Equal-width histogram

Equal frequency histogram

The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached.

3] Cluster Analysis

Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization results.
Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.

4] Discretization by Intuitive Partitioning

Numerical ranges are partitioned into relatively uniform, easy-to-read intervals that appear intuitive or “natural.”
The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals.
In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals, recursively and level by level, based on the value range at the most significant digit.

The rule is as follows:

If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9, and 3 intervals in a 2-3-2 grouping for 7).
If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equal-width intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals.
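
One level of the 3-4-5 rule can be sketched as below. The function name and its arguments are hypothetical; it assumes the range has already been rounded at the most significant digit, and for 7 distinct values it uses the 2-3-2 grouping from the usual textbook statement of the rule.

```python
def partition_345(low, high, msd_unit):
    """Split (low, high] into 3, 4, or 5 intervals following the 3-4-5 rule."""
    distinct = round((high - low) / msd_unit)  # distinct values at the most significant digit
    if distinct == 7:
        steps = [2, 3, 2]                      # three intervals, grouped 2-3-2
    elif distinct in (3, 6, 9):
        steps = [distinct // 3] * 3            # three equal-width intervals
    elif distinct in (2, 4, 8):
        steps = [distinct / 4] * 4             # four equal-width intervals
    elif distinct in (1, 5, 10):
        steps = [distinct / 5] * 5             # five equal-width intervals
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")

    edges = [low]
    for step in steps:
        edges.append(edges[-1] + step * msd_unit)
    return list(zip(edges, edges[1:]))

# Hypothetical range rounded to the nearest million.
intervals = partition_345(-1_000_000, 2_000_000, 1_000_000)
print(intervals)
```

Applying the same function to each resulting interval, with a smaller `msd_unit`, would produce the next level of the hierarchy.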

Concept Hierarchy Generation for Nominal Data (Categorical Data)

Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes, but not of their partial ordering
Specification of only a partial set of attributes

Categorical data are discrete data.

• Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.

• Examples include geographic location, job category, and item type.

i) Specification of a partial ordering of attributes explicitly at the schema level by users or experts

Concept hierarchies for nominal attributes or dimensions typically involve a group of attributes.
A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
For example, suppose that a relational database contains the following group of attributes: street, city, province or state, and country.
A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < province or state < country.
An example of a partial order for the time dimension, based on the attributes day, week, month, quarter, and year, is “day < {month < quarter; week} < year”.

ii) Specification of a portion of a hierarchy by explicit data grouping

Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among groups of values.
Concept hierarchies may be provided manually by system users.
An example of a set-grouping hierarchy is shown in Figure for the dimension price, where an interval ($X…$Y] denotes the range from $X (exclusive) to $Y (inclusive).
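
A set-grouping hierarchy over price can be sketched as a lookup that maps each price to its interval, with the lower bound exclusive and the upper bound inclusive. The boundaries below are hypothetical and do not reproduce the values in the figure.

```python
import bisect

# Hypothetical interval boundaries in dollars.
BOUNDARIES = [0, 100, 200, 300]

def price_group(price):
    """Return the ($X...$Y] interval containing price (X exclusive, Y inclusive)."""
    index = bisect.bisect_left(BOUNDARIES, price)
    if index == 0 or index == len(BOUNDARIES):
        raise ValueError(
            f"price {price} is outside (${BOUNDARIES[0]}...${BOUNDARIES[-1]}]"
        )
    return f"(${BOUNDARIES[index - 1]}...${BOUNDARIES[index]}]"

print(price_group(100))  # upper bound is inclusive
print(price_group(150))
```

`bisect_left` is used so that a price equal to an upper bound stays in the lower interval, which matches the (exclusive, inclusive] convention.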
