Data Mining
Unit-1
Syllabus
Data Mining:
Data, Types of Data, Data Mining Functionalities, Interestingness Patterns, Classification of Data Mining Systems, Data Mining Task Primitives, Integration of a Data Mining System with a Data Warehouse, Major Issues in Data Mining, Data Preprocessing.
What is Data Mining?
Characteristics of Data Mining:
Non-Trivial: the discovery involves search or inference rather than a straightforward computation of a predefined quantity.
Novel: the patterns found should be previously unknown, not something the user already knew.
Useful: the information retrieved should be useful for decision making.
Data mining is used in business to make better managerial decisions by:
KDD Process in Data Mining (Knowledge Discovery in Databases)
1. Data Cleaning: the removal of noisy and irrelevant data from the collection.
2. Data Integration: the combination of heterogeneous data from multiple sources into a common source (data warehouse).
3. Data Selection: the process in which data relevant to the analysis is identified and retrieved from the data collection. Techniques such as neural networks can be used for data selection.
4. Data Transformation: the process of transforming data into the form required by the mining procedure.
5. Data Mining: the application of intelligent techniques to extract potentially useful patterns.
6. Pattern Evaluation: the identification of the truly interesting patterns that represent knowledge, based on given interestingness measures.
7. Knowledge Representation: the use of visualization tools to present data mining results to the user.
Note:
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new data can be integrated and transformed in order to get different and more appropriate results.
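The seven KDD steps above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the two record lists, the field names `item` and `qty`, and the `min_count` threshold are all hypothetical, and simple item counting stands in for a real mining technique.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"item": "Milk", "qty": 2}, {"item": "bread", "qty": None}]
store_b = [{"item": "milk ", "qty": 1}, {"item": "eggs", "qty": 12},
           {"item": "milk", "qty": 3}]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [r for src in sources for r in src]

def select(records, field):
    """Step 3: keep only the attribute relevant to the analysis."""
    return [r[field] for r in records]

def transform(values):
    """Step 4: bring values into the form the mining step expects."""
    return [v.strip().lower() for v in values]

def mine(values):
    """Step 5: extract patterns (here, plain item frequencies)."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Step 6: keep only patterns that reach the (hypothetical) threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{p}: {c}" for p, c in sorted(patterns.items())]

combined = integrate(clean(store_a), clean(store_b))
items = transform(select(combined, "item"))
report = present(evaluate(mine(items), min_count=2))
print(report)
```

Because KDD is iterative, any step can be rerun with different settings (for example a lower `min_count`) without changing the others.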
Types of Data
Data: Data is information that has been translated into a form that is efficient for movement or processing.
The most basic forms of data for mining applications are:
1. Database Data
Ex: Faculty database, student database, publications database
Ex: student_bio, student_parking
Example
Through the use of relational queries, you can ask things like, “Show me a list of all items that were sold in the last quarter.” Relational languages also use aggregate functions such as sum, avg (average), count, max (maximum), and min (minimum). Using aggregates allows you to ask:
When mining relational databases, we can go further by searching for trends or data patterns.
For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.
Data mining systems may also detect deviations—that is, items with sales that are far from those expected in comparison with the previous year.
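The relational queries and aggregate functions mentioned above can be tried with Python's built-in `sqlite3` module. This is a minimal sketch: the `sales` table, its columns, and its rows are hypothetical and exist only to show `SUM`, `AVG`, and `COUNT` grouped by branch.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, branch TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("tv", "north", 500.0), ("tv", "south", 700.0),
     ("phone", "north", 300.0), ("phone", "south", 200.0)],
)

# Aggregate query: total, average, and number of sales per branch.
rows = conn.execute(
    "SELECT branch, SUM(amount), AVG(amount), COUNT(*) "
    "FROM sales GROUP BY branch ORDER BY branch"
).fetchall()
conn.close()

for branch, total, average, count in rows:
    print(branch, total, average, count)
```

A query like this answers a question the user already knows how to ask. Mining goes further by looking for trends or deviations that were not asked for explicitly.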
2. Data Warehouse Data
e.g. customer, item, supplier and activity.
e.g. from the past 5 – 10 years
e.g. a summary of the transactions per item type for each store
3. Transactional Data
4. Other Kinds of Data
A typical DM System Architecture
Data Mining Tasks:
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Data Sources: Databases, the World Wide Web (WWW), and data warehouses are the data sources. Their data may be plain text, spreadsheets, or other media such as photos or videos. The WWW is one of the largest sources of data.
Database Server: The database server holds the actual data ready to be processed. It retrieves the relevant data in response to the user's request.
Data Mining Engine: A core component of the data mining architecture, it performs data mining tasks such as association, classification, characterization, clustering, and prediction.
Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data, and they sometimes interact with the database server to produce the result of a user request.
Graphical User Interface: Because the user cannot be expected to understand the full complexity of the data mining process, a graphical user interface helps the user communicate effectively with the data mining system.
Knowledge Base: The knowledge base guides the search for result patterns, and the data mining engine may take inputs from it. It may contain data from user experiences. Its objective is to make the results more accurate and reliable.
Data Mining Functionalities
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
1. Descriptive mining
2. Predictive mining
1. Descriptive data mining identifies what happened in the past by analyzing stored historical data.
Ex: research, business, economics, social sciences, and healthcare.
2. Predictive data mining analyzes existing data to predict a future event, unknown values, or trends.
Interestingness Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules. Are all of the patterns interesting? Typically not: only a small fraction of the patterns potentially generated would be of interest to any given user.
What makes a pattern interesting? A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, and novel.
Can a data mining system generate all the interesting patterns? This question refers to the completeness of a data mining algorithm.
Can a data mining system generate only interesting patterns? It is highly desirable for data mining systems to generate only interesting patterns. An interesting pattern represents knowledge.
Classification of Data Mining Systems
Classification of the data mining system helps users to understand the system and match their requirements with such systems.
Data mining systems can be categorized according to various criteria, as follows:
i) Classification according to the kinds of databases mined:
ii) Classification according to the kinds of knowledge mined:
iii) Classification according to the kinds of techniques utilized:
iv) Classification according to the applications adapted:
Data Mining Task primitives
Set of task relevant data to be mined
This specifies the portions of the database or the set of data in which the user is interested.
This portion includes the following
For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the data relating to customer purchases in Canada be retrieved, along with the attributes of interest. These are referred to as relevant attributes.
Kind of knowledge to be mined
This specifies the data mining functions to be performed, such as characterization and discrimination.
For instance, if studying the buying habits of customers in Canada, you may choose to mine associations between customer profiles and the items that these customers like to buy.
Background knowledge to be used in discovery process: Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process and for evaluating the patterns found. It can also include user beliefs about relationships in the data.
Interestingness measures and thresholds for pattern evaluation
For example, interestingness measures for association rules include support and confidence.
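Support and confidence can be computed directly from a list of transactions. The sketch below uses four hypothetical market baskets. It measures the support of {bread, milk} and the confidence of the rule bread => milk; any thresholds applied to these numbers would be chosen by the user.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is then reported only if both values reach the minimum support and minimum confidence thresholds given in the mining task.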
Representation for visualizing the discovered patterns
Discovered patterns can be displayed as rules, tables, reports, charts, graphs, decision trees, and cubes.
Integration of Data mining system with a Data warehouse
A data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. It operates in an environment where it must communicate with other data systems, such as a database or data warehouse system.
No Coupling
Drawbacks of No Coupling
Loose Coupling
Drawbacks of Loose Coupling
Semi-Tight Coupling
Advantage of Semi-Tight Coupling
This Coupling will enhance the performance of Data Mining systems
Tight Coupling
Tight coupling means that a data mining system is smoothly integrated into the database/data warehouse system. The data mining subsystem is treated as one functional component of the information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the DB or DW system.
Major Issues in Data Mining
Data Reduction
Ex: Image Processing
Techniques in Data Reduction:
1. Data Compression
2. Dimensionality Reduction
3. Numerosity Reduction
-> The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
-> Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."
Ex: speech recognition, signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis, etc.
Dimensionality Reduction
Wavelet Transformation
Pyramid Method
Pyramid Algorithm
Finally, the values obtained from the iterations are designated as the wavelet coefficients of the transformed data.
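The pyramid algorithm can be sketched as follows. The slides do not name a wavelet family, so this example assumes the unnormalised Haar wavelet, where smoothing is the pairwise average and differencing is half the pairwise difference. It also assumes the input length is a power of two; the function name `haar_pyramid` is hypothetical.

```python
def haar_pyramid(values):
    """Unnormalised Haar transform computed with the pyramid algorithm.

    Assumes len(values) is a power of two (pad with zeros otherwise).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    details = []  # detail coefficients, coarsest level first
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        smooth = [(a + b) / 2 for a, b in pairs]  # smoothing: pairwise average
        diff = [(a - b) / 2 for a, b in pairs]    # differencing: pairwise detail
        details = diff + details                  # coarser levels go in front
        current = smooth                          # recurse on the half-length data
    return current + details  # overall average followed by the details

coeffs = haar_pyramid([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)
```

For data reduction, coefficients whose magnitude falls below a user-chosen threshold are set to zero, and only the remaining few are stored.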
Principal Component Analysis
Steps of PCA:
Standardization: Ensure the data has a mean of 0 and standard deviation of 1.
Covariance Matrix: Calculate the covariance matrix to understand how variables are related.
Eigenvectors and Eigenvalues: Identify the principal components (eigenvectors) and the amount of variance they capture (eigenvalues).
Project Data: Project the original data onto the selected principal components to obtain the transformed, lower-dimensional data.
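The four PCA steps above can be sketched in plain Python for a small data set. This is an illustration only: the five data points are hypothetical, and power iteration is swapped in for a full eigen-decomposition, so only the first principal component is found. It assumes the starting vector is not orthogonal to that component.

```python
import math

# Hypothetical 2-D data set with two strongly correlated attributes.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]

def standardize(rows):
    """Step 1: rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def covariance_matrix(rows):
    """Step 2: sample covariance between every pair of zero-mean columns."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=200):
    """Step 3: dominant eigenvector and its eigenvalue by power iteration."""
    v = [1.0 / (i + 1) for i in range(len(cov))]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(c * x for c, x in zip(row, v)) for row in cov]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))  # Rayleigh quotient
    return v, eigenvalue

z = standardize(data)
cov = covariance_matrix(z)
pc1, variance = top_component(cov)
# Step 4: project each standardised row onto the first principal component.
projected = [sum(a * b for a, b in zip(row, pc1)) for row in z]
print(pc1, variance)
```

Here two attributes are reduced to the single column `projected`, and `variance` reports how much of the total variance that one component captures.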
Applications:
Numerosity Reduction
Types:
1. Parametric
This method assumes a model into which the data fits. Data model parameters are estimated, and only those parameters are stored, and the rest of the data is discarded.
Regression: Regression can be a simple linear regression or multiple linear regression. When there is only a single independent attribute, such a regression model is called simple linear regression. If there are multiple independent attributes, then such regression models are called multiple linear regression.
Log-Linear Model: The log-linear model discovers the relationship between two or more discrete attributes.
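Simple linear regression shows how a parametric method reduces data. The sketch below fits a line to five hypothetical (x, y) points by least squares and keeps only the two fitted parameters; the helper name `fit_line` is an assumption for illustration.

```python
# Hypothetical (x, y) observations that roughly follow a straight line.
points = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.0)]

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(points)

def estimate(x):
    """Approximate y from the two stored parameters alone."""
    return slope * x + intercept

print(slope, intercept, estimate(3))
```

Once `slope` and `intercept` are stored, the original points can be discarded and approximate values recovered with `estimate`.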
2. Non-Parametric
A non-parametric numerosity reduction technique does not assume any model.
Histogram
1. Singleton Bucket 2. Equal-Width Bucket
1. Singleton Bucket
2. Equal Width Bucket
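The two bucket types can be compared on the same data. In this sketch the price list is hypothetical and the values are assumed to be integers no smaller than `start`. A singleton bucket stores one value with its frequency, while an equal-width bucket stores the frequency of a fixed-size range.

```python
from collections import Counter

# Hypothetical list of item prices.
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10,
          12, 14, 14, 15, 18, 18, 20, 21, 25]

def singleton_buckets(values):
    """One bucket per distinct value, mapped to its frequency."""
    return dict(sorted(Counter(values).items()))

def equal_width_buckets(values, width, start=1):
    """Buckets of identical width: [start, start + width - 1], and so on."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width - 1): counts[k]
        for k in sorted(counts)
    }

singles = singleton_buckets(prices)
wide = equal_width_buckets(prices, width=10)
print(singles)
print(wide)
```

The equal-width histogram needs three buckets where the singleton histogram needs eleven, which is the reduction in stored data.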
Exercise
Data Discretization and Concept Hierarchy
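Supervised discretization refers to a method in which class information is used.
Unsupervised discretization refers to a method in which class information is not used. Discretization methods are further distinguished by the direction in which the operation proceeds: top-down or bottom-up.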
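Top-down Discretization -
If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
Bottom-up Discretization -
If the process starts by considering all of the continuous values as potential split points, removes some by merging neighboring values to form intervals, and then applies this recursively to the resulting intervals, it is called bottom-up discretization or merging.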
Concept Hierarchies
Discretization and Concept Hierarchy Generation for Numerical Data
1] Binning
2] Histogram Analysis
It is also further classified into
Equal-width histogram
Equal frequency histogram
The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached.
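The recursive use of histogram partitioning can be sketched with equal-frequency splits. The price list, the helper names, and the dictionary layout of the hierarchy are all hypothetical; the recursion stops once the requested number of concept levels is reached.

```python
def equal_frequency_split(ordered, n_bins):
    """Split sorted values into n_bins partitions of (nearly) equal size."""
    size, extra = divmod(len(ordered), n_bins)
    parts, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        parts.append(ordered[start:end])
        start = end
    return parts

def build_hierarchy(values, n_bins, levels):
    """Apply the partitioning recursively until `levels` concept levels exist."""
    ordered = sorted(values)
    node = {"range": (ordered[0], ordered[-1]), "children": []}
    # Stop at the last level, or when there are too few values to split.
    if levels > 1 and len(ordered) >= n_bins:
        for part in equal_frequency_split(ordered, n_bins):
            node["children"].append(build_hierarchy(part, n_bins, levels - 1))
    return node

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
tree = build_hierarchy(prices, n_bins=3, levels=2)
child_ranges = [child["range"] for child in tree["children"]]
print(tree["range"], child_ranges)
```

Each node's range is one concept, for example a price band, and its children are the narrower bands one level below.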
3] Cluster Analysis
4] Discretization by Intuitive Partitioning
The rule is as follows:
Concept Hierarchy Generation for Nominal Data (Categorical Data)
Categorical data are discrete data.
• Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.
• Examples include geographic location, job category, and item type.
i) Specification of a partial ordering of attributes explicitly at the schema level by users or experts
ii) Specification of a portion of a hierarchy by explicit data grouping
iii) Specification of a set of attributes, but not of their partial ordering
iv) Specification of only a partial set of attributes
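For case iii, where attributes are given without an ordering, a common heuristic (an assumption here, as the list above does not spell it out) is to order attributes by their number of distinct values: the attribute with the fewest distinct values goes at the top of the hierarchy. The location records and attribute names below are hypothetical.

```python
# Hypothetical location records with four nominal attributes.
records = [
    {"street": "Main St", "city": "Vancouver", "province_or_state": "BC", "country": "Canada"},
    {"street": "Oak St", "city": "Vancouver", "province_or_state": "BC", "country": "Canada"},
    {"street": "Fort St", "city": "Victoria", "province_or_state": "BC", "country": "Canada"},
    {"street": "King St", "city": "Toronto", "province_or_state": "Ontario", "country": "Canada"},
    {"street": "State St", "city": "Chicago", "province_or_state": "Illinois", "country": "USA"},
]

def order_attributes(records):
    """Order attributes from the top (fewest distinct values) to the bottom level."""
    distinct = {attr: len({r[attr] for r in records}) for attr in records[0]}
    return sorted(distinct, key=distinct.get), distinct

hierarchy, distinct_counts = order_attributes(records)
print(" < ".join(reversed(hierarchy)))
```

The heuristic is not foolproof, so the generated ordering should be checked by a user or expert and adjusted where it does not match the real semantics.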