
Origins of Data Mining

Data Mining Tasks

Types of Data

Unit - II

DWDM

The Origins of Data Mining

Data mining draws upon ideas, such as

(1) sampling, estimation, and hypothesis testing from statistics and
(2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning.

Data mining has also adopted ideas from other areas, including

optimization,
evolutionary computing,
information theory,
signal processing,
visualization, and
information retrieval

An optimization algorithm is a procedure that is executed iteratively, comparing candidate solutions until an optimum or a satisfactory solution is found.
Evolutionary computation is a field of optimization in which, instead of using classical numerical methods to solve optimization problems, we take inspiration from biological evolution to ‘evolve’ good solutions.

Evolution can be described as a process by which individuals become ‘fitter’ in different environments through adaptation, natural selection, and selective breeding.

[Figure: the finches Charles Darwin depicted in his journal]
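The mutate-and-select loop described above can be sketched in a few lines. This is a minimal illustration, not a standard library routine: the toy objective `sphere`, the function `evolve`, and all parameter values are assumptions chosen for the example.

```python
import random


def sphere(x):
    """Toy objective to minimize: sum of squares (optimum is 0 at the origin)."""
    return sum(v * v for v in x)


def evolve(objective, start, generations=200, step=0.1, seed=0):
    """Keep one candidate; mutate it each generation and keep the fitter one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    best = list(start)
    best_score = objective(best)
    for _ in range(generations):
        # Mutation: perturb every coordinate with Gaussian noise.
        child = [v + rng.gauss(0.0, step) for v in best]
        child_score = objective(child)
        # Selection: the child replaces the parent only if it is at least as fit.
        if child_score <= best_score:
            best, best_score = child, child_score
    return best, best_score


start = [2.0, -3.0]  # hypothetical starting solution
solution, score = evolve(sphere, start)
```

Because a child is accepted only when it is no worse than its parent, the score can never get worse from one generation to the next; practical evolutionary algorithms usually keep a whole population and also recombine solutions.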

Information theory is the scientific study of the quantification, storage, and communication of digital information.
The field was fundamentally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s.
The field is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.

Other key areas:

Database systems: provide support for efficient storage, indexing, and query processing.
High-performance (parallel) computing: techniques for addressing the massive size of some data sets.
Distributed techniques: also help address the issue of size and are essential when the data cannot be gathered in one location.

Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks: use some variables to predict unknown or future values of other variables.

Task objective: predict the value of a particular attribute based on the values of other attributes.
Target (dependent) variable: the attribute to be predicted.
Explanatory (independent) variables: the attributes used for making the prediction.

Descriptive tasks: find human-interpretable patterns that describe the data.

Task objective: derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data.
Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Correlation is a statistical term describing the degree to which two variables move in coordination with one another.
Trend: a general direction in which something is developing or changing.
Clusters: clustering is the task of dividing data points into a number of groups such that data points in the same group are more similar to one another than to those in other groups.

https://www.javatpoint.com/data-mining-cluster-analysis

Trajectory data mining makes it possible to predict the movement and location of humans, vehicles, animals, and so on.

Anomaly detection is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior.

[Figure: the four data mining tasks (predictive modeling, clustering, association rules, anomaly detection) applied to a data set. Source: Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar]

Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables.
Two types of predictive modeling tasks:

Classification: used for discrete target variables.
Regression: used for continuous target variables.

Example:

Classification task: predicting whether a Web user will make a purchase at an online bookstore, because the target variable is binary-valued.
Regression task: forecasting the future price of a stock, because price is a continuous-valued attribute.

Goal of both tasks: learn a model that minimizes the error between the predicted and true values of the target variable.
Predictive modeling can be used to:

identify customers that will respond to a marketing campaign,
predict disturbances in the Earth’s ecosystem, or
judge whether a patient has a particular disease based on the results of medical tests.
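To make "the error between the predicted and true values" concrete, the sketch below computes one common error measure for each task. The slides do not name specific measures, so the choice of misclassification rate and mean squared error is an assumption, and all data values are hypothetical.

```python
def error_rate(true_labels, predicted_labels):
    """Classification error: fraction of objects whose class is predicted wrongly."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def mean_squared_error(true_values, predicted_values):
    """Regression error: average squared difference between prediction and truth."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


# Hypothetical classification data: did a Web user buy (1) or not (0)?
bought = [1, 0, 0, 1]
predicted_bought = [1, 0, 1, 1]

# Hypothetical regression data: stock prices.
price = [10.0, 12.0, 11.0]
predicted_price = [11.0, 12.0, 9.0]

classification_error = error_rate(bought, predicted_bought)
regression_error = mean_squared_error(price, predicted_price)
```

A learning algorithm for either task would search for the model whose predictions make such an error as small as possible.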

Example (Predicting the Type of a Flower): the task of predicting the species of a flower based on its characteristics.
Iris species: Setosa, Versicolour, or Virginica.
Requirement: a data set containing the characteristics of various flowers of these three species.
Besides the species, the data set contains four other attributes: sepal width, sepal length, petal length, and petal width.
Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively.
Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.
Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
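The three rules above can be written directly as code. This is a minimal sketch: the interval boundaries come from the slide, while the function names and the decision to return `None` for combinations the rules do not cover are assumptions made for illustration.

```python
def categorize(value, low_upper, medium_upper):
    """Map a measurement to low / medium / high using half-open intervals."""
    if value < low_upper:
        return "low"
    if value < medium_upper:
        return "medium"
    return "high"


def predict_species(petal_width, petal_length):
    """Apply the three rules; return None when no rule matches."""
    width = categorize(petal_width, 0.75, 1.75)   # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = categorize(petal_length, 2.5, 5.0)   # [0, 2.5), [2.5, 5), [5, inf)
    rules = {
        ("low", "low"): "Setosa",
        ("medium", "medium"): "Versicolour",
        ("high", "high"): "Virginica",
    }
    # Mixed categories (for example high width, medium length) are not covered.
    return rules.get((width, length))
```

For example, `predict_species(0.2, 1.4)` falls in the low/low case, while `predict_species(1.8, 4.9)` has a high width but a medium length, so none of the three rules applies.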

Association analysis

Used to discover patterns that describe strongly associated features in the data.
The discovered patterns are represented in the form of implication rules or feature subsets.
Goal of association analysis: to extract the most interesting patterns in an efficient manner.

Examples:

finding groups of genes that have related functionality,
identifying Web pages that are accessed together, or
understanding the relationships between different elements of Earth’s climate system.

Example (Market Basket Analysis).

Aim: find items that are frequently bought together by customers.
Data: the transactions collected at the checkout counters of a grocery store.
The association rule {Diapers} → {Milk} suggests that customers who buy diapers also tend to buy milk.
This rule can be used to identify potential cross-selling opportunities among related items.
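One way to check how strongly a rule such as {Diapers} → {Milk} holds is to count transactions. The sketch below uses support and confidence, two measures commonly used for association rules; they are not defined on these slides, so treat them as an assumption, and the transactions are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also has the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical checkout transactions, invented for this example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

rule_support = support(transactions, {"Diapers", "Milk"})
rule_confidence = confidence(transactions, {"Diapers"}, {"Milk"})
```

In this toy data, three of the five transactions contain both items, and three of the four transactions with diapers also contain milk.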

Cluster analysis

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters.
Clustering has been used to

group sets of related customers,
find areas of the ocean that have a significant impact on the Earth’s climate, and
compress data.

Example 1.3 (Document Clustering)
Each article is represented as a set of word-frequency pairs (w, c),

where w is a word and
c is the number of times the word appears in the article.

There are two natural clusters in the data set.
First cluster: the first four articles (news about the economy).
Second cluster: the last four articles (news about health care).
A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
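The (w, c) representation maps naturally onto a dictionary from word to count. The sketch below compares articles using cosine similarity, which is one common choice rather than the measure prescribed by the slides; the three articles and their word counts are hypothetical.

```python
import math


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two articles given as {word: count} dicts."""
    shared = set(doc_a) & set(doc_b)
    dot = sum(doc_a[w] * doc_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in doc_a.values()))
    norm_b = math.sqrt(sum(c * c for c in doc_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical articles, invented for illustration.
economy_1 = {"dollar": 1, "industry": 4, "country": 2, "loan": 3}
economy_2 = {"job": 5, "inflation": 3, "industry": 2, "country": 1}
health_1 = {"patient": 4, "symptom": 2, "drug": 3, "health": 2}

same_topic = cosine_similarity(economy_1, economy_2)
different_topic = cosine_similarity(economy_1, health_1)
```

The two economy articles share words and therefore score higher than the economy and health pair, which share none; a clustering algorithm would use such pairwise similarities to form the groups.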

Anomaly Detection:

Task of identifying observations whose characteristics are significantly different from the rest of the data.
Such observations are known as anomalies or outliers.
A good anomaly detector must have a high detection rate and a low false alarm rate.
Applications of anomaly detection include

the detection of fraud,
network intrusions,
unusual patterns of disease, and
ecosystem disturbances

https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png

Example 1.4 (Credit Card Fraud Detection).
A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address.
Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users.
When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
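The profile-then-compare procedure can be sketched with a very simple profile. The version below is an illustration under strong assumptions: the profile is only the mean and standard deviation of past amounts, the threshold of three standard deviations is arbitrary, and the history is hypothetical. A real system would use many more characteristics.

```python
import statistics


def build_profile(amounts):
    """Summarize a user's legitimate transaction amounts as (mean, standard deviation)."""
    return statistics.mean(amounts), statistics.stdev(amounts)


def is_suspicious(amount, profile, threshold=3.0):
    """Flag an amount that lies more than `threshold` deviations from the mean."""
    mean, stdev = profile
    return abs(amount - mean) > threshold * stdev


# Hypothetical past transaction amounts for one card holder.
history = [42.0, 38.5, 51.0, 47.25, 40.0, 45.5]
profile = build_profile(history)

typical = is_suspicious(46.0, profile)    # close to past behavior
unusual = is_suspicious(900.0, profile)   # far from past behavior
```

Raising `threshold` lowers the false alarm rate but also lowers the detection rate, which is the trade-off a good anomaly detector has to balance.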

Types of Data

Data set: a collection of data objects.
Other names for a data object are:

record,
point,
vector,
pattern,
event,
case,
sample,
observation, or
entity.

Data objects are described by a number of attributes that capture the basic characteristics of an object.
Example:-

mass of a physical object or
time at which an event occurred.

Other names for an attribute are:-

variable,
characteristic,
field,
feature, or
dimension.

Example:-
Dataset - Student Information.
Each row corresponds to a student.
Each column is an attribute that describes some aspect of a student.

Attributes and Measurement

An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.
Example,

eye color varies from person to person, while the temperature of an object varies over time.

Eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.},
Temperature is a numerical attribute with a potentially unlimited number of values.

A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.

Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:

Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /

Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
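The cumulative relationship between the four attribute types and the four properties can be captured in a small lookup. This is only an illustrative sketch; the names `PROPERTIES`, `ATTRIBUTE_TYPES`, `properties_of`, and `supports` are hypothetical.

```python
# Properties in the order in which attribute types acquire them.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]
ATTRIBUTE_TYPES = ["nominal", "ordinal", "interval", "ratio"]


def properties_of(attribute_type):
    """Return the properties an attribute type possesses."""
    level = ATTRIBUTE_TYPES.index(attribute_type)
    # Each type has every property of the types before it, plus one more.
    return PROPERTIES[: level + 1]


def supports(attribute_type, prop):
    """True if values of this attribute type meaningfully support the property."""
    return prop in properties_of(attribute_type)
```

For instance, `supports("interval", "multiplication")` is false, which is why it makes no sense to say that 20 degrees Celsius is twice as warm as 10 degrees Celsius.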

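Nominal: attribute values only differentiate one object from another.

Examples: roll numbers, employee IDs

Ordinal: attribute values provide enough information to order the objects.

Examples: rankings, grades, height recorded as {tall, medium, short}

Interval: measured on a scale of equal-sized units, so differences between values are meaningful.

No true zero point
Examples: temperatures in Celsius or Fahrenheit, calendar dates

Ratio: numeric attribute with an inherent zero point.

A value can be expressed as a multiple (or ratio) of another value.
Examples: weight, number of staff, income/salary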

Properties of Attribute Values - Transformations

The operations that make sense for a particular type of attribute are those that yield the same results when the attribute is transformed using a transformation that preserves the attribute’s meaning.
Example:-

the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length.
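The meters-versus-feet example can be checked numerically. In this sketch the lengths are hypothetical and the conversion factor is approximate; the point is that the two averages are different numbers that describe the same physical length.

```python
import statistics

METERS_TO_FEET = 3.28084  # approximate conversion factor

lengths_m = [1.0, 2.5, 4.0]  # hypothetical object lengths in meters
lengths_ft = [m * METERS_TO_FEET for m in lengths_m]

mean_m = statistics.mean(lengths_m)
mean_ft = statistics.mean(lengths_ft)

# Converting the average in meters gives the average in feet.
same_length = abs(mean_m * METERS_TO_FEET - mean_ft) < 1e-9
```

Multiplying by a constant is a meaning-preserving transformation for a ratio attribute such as length, so the average is a meaningful operation for it.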

Attribute Types

Data

Qualitative / Categorical (no properties of integers): Nominal, Ordinal
Quantitative / Numeric (properties of integers): Interval, Ratio

Describing Attributes by the Number of Values

Discrete

finite or countably infinite set of values.
Categorical - zip codes or ID numbers, or
Numeric - counts.
Binary attributes (special case of discrete)

assume only two values,
e.g., true/false, yes/no, male/female, or 0/1.

Continuous

values are real numbers.
Ex:- temperature, height, or weight.

Any of the measurement scale types—nominal, ordinal, interval, and ratio—could be combined with any of the types based on the number of attribute values—binary, discrete, and continuous.

Types of Data - Types of Dataset

General Characteristics of Data Sets

3 characteristics that apply to many data sets are:-

dimensionality,
sparsity, and
resolution.

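Dimensionality: the number of attributes that the objects in the data set possess.

Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data.
The difficulties of analyzing high-dimensional data are known as the curse of dimensionality, which motivates dimensionality reduction.

Sparsity: in some data sets, such as those with asymmetric features, most attributes of an object have values of 0;

in many cases, fewer than 1% of the entries are non-zero.

Resolution: data can often be gathered at different levels of resolution, and its properties differ from one resolution to another.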

Example:- the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.

Record Data

The data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).
There are no explicit relationships between records.
Every record has the same set of attributes.
Record data is usually stored in flat files or relational databases.

Transaction or Market Basket Data

special type of record data
Each record (transaction) involves a set of items.
Also called market basket data because the items in each record are the products in a person’s “market basket.”
Can be viewed as a set of records whose fields are asymmetric attributes.

Data Matrix / Pattern Matrix

All data objects have the same fixed set of numeric attributes.
Data objects = points (vectors) in a multidimensional space.
Each dimension = a distinct attribute describing the object.
A set of such data objects can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Standard matrix operations can be applied to transform and manipulate the data.
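A data matrix can be held as a list of rows, with one row per object and one column per attribute. The sketch below uses plain Python lists and hypothetical values; centering each column is just one example of a standard operation applied to the whole matrix.

```python
# Hypothetical data matrix: m = 3 objects (rows), n = 2 numeric attributes (columns).
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
]

m, n = len(data), len(data[0])


def column_means(matrix):
    """Mean of each attribute (column)."""
    return [sum(col) / len(col) for col in zip(*matrix)]


def center(matrix):
    """Subtract each column mean so that every attribute has mean 0."""
    means = column_means(matrix)
    return [[value - mu for value, mu in zip(row, means)] for row in matrix]


centered = center(data)
```

In practice a numerical library would normally be used for such operations; lists are used here only to keep the example self-contained.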

Sparse Data Matrix:

A special case of a data matrix in which the attributes are of the same type and asymmetric; i.e., only non-zero values are important.

Example:-

Transaction data that has only 0–1 entries.
Document-term matrix: a collection of term vectors.

Each term vector represents one document (one row in the matrix).
Each term is an attribute of the vector (one column in the matrix).
The value in a term vector under an attribute is the number of times the corresponding term occurs in the document.
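A document-term matrix can be built by counting terms per document. This is a minimal sketch with two hypothetical documents; splitting on whitespace and sorting the vocabulary are simplifying assumptions.

```python
from collections import Counter

# Hypothetical documents, invented for illustration.
documents = [
    "data mining finds patterns in data",
    "mining team wins",
]

# One column per distinct term, in sorted order.
vocabulary = sorted({term for doc in documents for term in doc.split()})


def term_vector(document):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]


# One row per document, one column per term.
matrix = [term_vector(doc) for doc in documents]
```

Even in this tiny example several entries are 0; with a realistic vocabulary almost all entries are 0, which is why such matrices are stored in sparse form.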

Graph based Data:

Data can be represented in the form of Graph.
A graph is a convenient representation in two specific cases:

(1) the graph captures relationships among data objects and
(2) the data objects themselves are represented as graphs.

Data with Relationships among Objects

Relationships among objects also convey important information.
Relationships among objects are captured by the links between objects and link properties, such as direction and weight.
Example:

Web pages on the World Wide Web contain both text and links to other pages.
Web search engines collect and process Web pages to extract their contents.
Links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration.
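Linked Web pages can be stored as a directed graph in which each page maps to the pages it links to. The sketch below counts incoming links as a very rough signal of how much other pages refer to a page; the page names are hypothetical, and this is not a description of how any particular search engine ranks pages.

```python
# Hypothetical Web pages: each key links to the pages in its list.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["products"],
}


def in_link_counts(graph):
    """Count how many pages link to each page."""
    counts = {page: 0 for page in graph}
    for source, targets in graph.items():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


incoming = in_link_counts(links)
```

Link properties such as weight could be added by storing a number alongside each target; direction is already captured because each entry records which page links to which.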

Data with Objects That Are Graphs

When objects contain sub-objects that have relationships, then such objects are frequently represented as graphs.
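Example:- structure of chemical compounds

Atoms: nodes
Chemical bonds: links between nodes

[Figure: ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray)]

A graph representation makes substructure mining possible, i.e., determining which substructures occur frequently in a set of compounds.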

Ordered Data:

In some data, the attributes have relationships that involve order in time or space.
Sequential Data

Also called temporal data.
An extension of record data in which each record has a time associated with it.
Ex:- a retail transaction data set that also stores the time of each transaction.

The time information can be used to find patterns such as “candy sales peak before Halloween.”

A time can also be associated with each attribute.

Ex:- each record is the purchase history of a customer, with a listing of items purchased at different times.

This makes it possible to find patterns such as “people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”

Ordered Data: Sequence Data

Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters.
It is similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.

Example:

Genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.
Task: predicting similarities in the structure and function of genes from similarities in nucleotide sequences.

Ex:- Human genetic code expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.
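As a very rough illustration of comparing nucleotide sequences, the sketch below measures the fraction of positions at which two equal-length sequences agree. Both sequences are made up, and position-by-position matching is a simplifying assumption: practical sequence comparison uses alignment methods that allow for insertions and deletions.

```python
def fraction_matching(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Hypothetical nucleotide sequences over the alphabet A, T, G, C.
first = "GATTACA"
second = "GACTATA"

similarity = fraction_matching(first, second)
```

Here the two sequences agree at five of their seven positions.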

Ordered Data: Time Series Data

A special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time.
Example:

Financial data set might contain objects that are time series of the daily prices of various stocks.

When working with temporal data, it is important to consider temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar.

[Figure: time series of the average monthly temperature for Minneapolis during the years 1982 to 1994]
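Temporal autocorrelation can be quantified by correlating a series with a copy of itself shifted by one time step. The sketch below uses one common lag-1 estimator; both series are hypothetical, and the function name is chosen for this example.

```python
def lag1_autocorrelation(series):
    """Correlation between each measurement and the one that follows it."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator


# Hypothetical monthly temperatures that change smoothly over time.
smooth = [2.0, 4.0, 7.0, 11.0, 15.0, 18.0, 19.0, 17.0, 13.0, 8.0, 4.0, 2.0]
# Hypothetical series that jumps up and down between consecutive readings.
jumpy = [1.0, 9.0, 1.0, 9.0, 1.0, 9.0, 1.0, 9.0]

smooth_r = lag1_autocorrelation(smooth)
jumpy_r = lag1_autocorrelation(jumpy)
```

The smoothly varying series gives a clearly positive value, meaning neighboring measurements are similar, while the alternating series gives a negative one.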

Ordered Data: Spatial Data
Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
An example of spatial data is weather data (precipitation, temperature, pressure) collected for a variety of geographical locations.

An important aspect of spatial data is spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well.
Example

two points on the Earth that are close to each other usually have similar values for temperature and rainfall.

[Figure: average monthly temperature of land and ocean]