1 of 46

Origins of Data Mining, Data Mining Tasks & Types of Data

Unit - II

DWDM

2 of 46

The Origins of Data Mining

Data mining draws upon ideas, such as

  • (1) sampling, estimation, and hypothesis testing from statistics and
  • (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning.

3 of 46

The Origins of Data Mining

  • Data mining has also adopted ideas from other areas, including
    • optimization,
    • evolutionary computing,
    • information theory,
    • signal processing,
    • visualization, and
    • information retrieval

4 of 46

The Origins of Data Mining

  • An optimization algorithm is a procedure that is executed iteratively, comparing candidate solutions until an optimum or a satisfactory solution is found.
  • Evolutionary computation is a field of optimization in which, instead of using classical numerical methods, we take inspiration from biological evolution to ‘evolve’ good solutions.
    • Evolution can be described as a process by which individuals become ‘fitter’ in different environments through adaptation, natural selection, and selective breeding.

Figure: the famous finches Charles Darwin depicted in his journal.

6 of 46

The Origins of Data Mining

  • Other Key areas:
    • database systems
      • to provide support for efficient storage, indexing, and query processing.
    • Techniques from high performance (parallel) computing
      • addressing the massive size of some data sets.
    • Distributed techniques
      • also help address the issue of size and are essential when the data cannot be gathered in one location.

7 of 46

Data Mining Tasks

  • Data mining tasks are generally divided into two major categories:
    • Predictive tasks. - Use some variables to predict unknown or future values of other variables
      • Task Objective: predict the value of a particular attribute based on the values of other attributes.
      • Target/Dependent Variable: attribute to be predicted
      • Explanatory or independent variables: attributes used for making the prediction
    • Descriptive tasks. - Find human-interpretable patterns that describe the data.
      • Task objective: derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data.
      • Descriptive data mining tasks are often exploratory in nature and frequently require post processing techniques to validate and explain the results.

8 of 46

Data Mining Tasks

  • Correlation is a statistical term describing the degree to which two variables move in coordination with one another. 
  • Trend: the general direction in which something is developing or changing.
  • Clusters
    • Clustering is the task of dividing data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups.

https://www.javatpoint.com/data-mining-cluster-analysis

Trajectory data mining makes it possible to predict the future locations of moving objects such as humans, vehicles, and animals.

Anomaly detection is a step in data mining that identifies data points, events, and/or observations that deviate from a dataset’s normal behavior.

9 of 46

Introduction to Data Mining, 2nd Edition Tan, Steinbach, Karpatne, Kumar

Data Mining Tasks …

Figure: four core data mining tasks applied to a data set: Predictive Modeling, Clustering, Association Rules, and Anomaly Detection.

12 of 46

Data Mining Tasks

  • Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables.
  • 2 types of predictive modeling tasks:
    • Classification: Used for discrete target variables
    • Regression: used for continuous target variables.
    • Example:
      • Classification: predicting whether a Web user will make a purchase at an online bookstore, because the target variable is binary-valued.
      • Regression: forecasting the future price of a stock, because price is a continuous-valued attribute.
    • Goal of both tasks: learn a model that minimizes the error between the predicted and true values of the target variable.
    • Predictive modeling can be used to:
      • identify customers that will respond to a marketing campaign,
      • predict disturbances in the Earth’s ecosystem, or
      • judge whether a patient has a particular disease based on the results of medical tests.

13 of 46

Data Mining Tasks

  • Example: (Predicting the Type of a Flower): the task of predicting a species of flower based on the characteristics of the flower.
  • Iris species: Setosa, Versicolour, or Virginica.
  • Requirement: need a data set containing the characteristics of various flowers of these three species.
  • Four other attributes in the data set: sepal width, sepal length, petal length, and petal width.
  • Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively.
  • Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.
  • Based on these categories of petal width and length, the following rules can be derived:
    • Petal width low and petal length low implies Setosa.
    • Petal width medium and petal length medium implies Versicolour.
    • Petal width high and petal length high implies Virginica.
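The three rules above can be sketched as a small lookup. This is only an illustration: the function names `categorize` and `classify_iris` are hypothetical, and flowers that match none of the three rules are simply left unclassified (`None`).

```python
def categorize(value, low_upper, medium_upper):
    """Map a measurement to 'low', 'medium', or 'high' using half-open intervals."""
    if value < low_upper:
        return "low"
    if value < medium_upper:
        return "medium"
    return "high"


# (petal width category, petal length category) -> species, as stated in the rules
RULES = {
    ("low", "low"): "Setosa",
    ("medium", "medium"): "Versicolour",
    ("high", "high"): "Virginica",
}


def classify_iris(petal_width, petal_length):
    """Return the species implied by the rules, or None if no rule applies."""
    width_cat = categorize(petal_width, 0.75, 1.75)    # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length_cat = categorize(petal_length, 2.5, 5.0)    # [0, 2.5), [2.5, 5), [5, inf)
    return RULES.get((width_cat, length_cat))


print(classify_iris(petal_width=0.2, petal_length=1.4))
print(classify_iris(petal_width=1.5, petal_length=5.1))
```

The second call mixes a medium width with a high length, so none of the three rules fires; this shows that the rules do not cover every possible flower.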

16 of 46

Data Mining Tasks

  • Association analysis
    • used to discover patterns that describe strongly associated features in the data.
    • Discovered patterns are represented in the form of implication rules or feature subsets.
    • Goal of association analysis:
      • To extract the most interesting patterns in an efficient manner.
    • Example
      • finding groups of genes that have related functionality,
      • identifying Web pages that are accessed together, or
      • understanding the relationships between different elements of Earth’s climate system.

17 of 46

Data Mining Tasks

  • Association analysis
  • Example (Market Basket Analysis).
    • AIM: find items that are frequently bought together by customers.
    • Association rule {Diapers} → {Milk},
      • suggests that customers who buy diapers also tend to buy milk.
  • This rule can be used to identify potential cross-selling opportunities among related items.

Transaction data collected at the checkout counters of a grocery store.
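As a rough sketch of how "items frequently bought together" might be counted, the snippet below tallies item pairs over a handful of transactions. The transactions are hypothetical illustration data, not the grocery-store table from the slide, and the final fraction is just one simple way to gauge how strongly {Diapers} → {Milk} holds.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, for illustration only
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

# Count how many transactions contain each pair of items
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Among transactions containing diapers, what fraction also contain milk?
with_diapers = [b for b in transactions if "Diapers" in b]
with_both = [b for b in with_diapers if "Milk" in b]
fraction = len(with_both) / len(with_diapers)

print(pair_counts[("Diapers", "Milk")], fraction)
```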

18 of 46

Data Mining Tasks

  • Cluster analysis
    • Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters.
    • Clustering has been used to
      • group sets of related customers,
      • find areas of the ocean that have a significant impact on the Earth’s climate, and
      • compress data.

19 of 46

Data Mining Tasks

  • Cluster analysis
    • Example 1.3 (Document Clustering)
    • Each article is represented as a set of word-frequency pairs (w, c),
      • where w is a word and
      • c is the number of times the word appears in the article.
    • There are two natural clusters in the data set.
    • First cluster -> first four articles (news about the economy)
    • Second cluster-> last four articles ( news about health care)
    • A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
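One possible way to measure "similarity between words that appear in the articles" is cosine similarity over the (w, c) pairs. The slide does not name a similarity measure, so cosine similarity is an assumption here, and the three articles below are hypothetical.

```python
import math


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two articles given as {word: count} dictionaries."""
    shared = set(doc_a) & set(doc_b)
    dot = sum(doc_a[w] * doc_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in doc_a.values()))
    norm_b = math.sqrt(sum(c * c for c in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical articles as word-frequency pairs (w, c)
economy_1 = {"dollar": 1, "industry": 4, "country": 2, "loan": 3}
economy_2 = {"job": 5, "inflation": 3, "industry": 2, "country": 1}
health_1 = {"patient": 4, "symptom": 2, "drug": 3, "health": 2}

print(cosine_similarity(economy_1, economy_2))  # same topic: shares some words
print(cosine_similarity(economy_1, health_1))   # different topics: no shared words
```

A clustering algorithm could then group articles whose pairwise similarity is high; that grouping step is not shown here.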

20 of 46

Data Mining Tasks

  • Anomaly Detection:
    • Task of identifying observations whose characteristics are significantly different from the rest of the data.
    • Such observations are known as anomalies or outliers.
    • A good anomaly detector must have a high detection rate and a low false alarm rate.
    • Applications of anomaly detection include
      • the detection of fraud,
      • network intrusions,
      • unusual patterns of disease, and
      • ecosystem disturbances

https://commons.wikimedia.org/wiki/File:Anomalous_Web_Traffic.png

21 of 46

Data Mining Tasks

  • Anomaly Detection:
    • Example 1.4 (Credit Card Fraud Detection).
    • A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address.
    • Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users.
    • When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
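The profile-and-compare idea can be sketched with a deliberately simple profile: the mean and standard deviation of a user's past transaction amounts. Everything here is an assumption for illustration (the function names, the amounts, the threshold of 3 standard deviations); a real detector would use many more attributes than the amount alone.

```python
import statistics


def build_profile(amounts):
    """Summarize a user's legitimate transaction amounts (needs at least two values)."""
    return {"mean": statistics.mean(amounts), "stdev": statistics.stdev(amounts)}


def is_suspicious(amount, profile, threshold=3.0):
    """Flag an amount that lies more than `threshold` standard deviations from the mean."""
    if profile["stdev"] == 0:
        return amount != profile["mean"]
    z_score = abs(amount - profile["mean"]) / profile["stdev"]
    return z_score > threshold


# Hypothetical purchase history for one card holder
history = [25.0, 40.0, 32.5, 28.0, 35.0, 30.0]
profile = build_profile(history)

print(is_suspicious(33.0, profile))   # close to the usual amounts
print(is_suspicious(900.0, profile))  # very different from the profile
```

Lowering the threshold raises the detection rate but also the false alarm rate, which is the trade-off a good anomaly detector has to balance.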

22 of 46

Types of Data

  • Data set - collection of data objects.
  • Other names for a data object are:-
    • record,
    • point,
    • vector,
    • pattern,
    • event,
    • case,
    • sample,
    • observation, or
    • entity.

23 of 46

Types of Data

  • Data objects are described by a number of attributes that capture the basic characteristics of an object.
  • Example:-
    • mass of a physical object or
    • time at which an event occurred.
  • Other names for an attribute are:-
    • variable,
    • characteristic,
    • field,
    • feature, or
    • dimension.

24 of 46

Types of Data

  • Example:-
  • Dataset - Student Information.
  • Each row corresponds to a student.
  • Each column is an attribute that describes some aspect of a student.

25 of 46

Types of Data

  • Attributes and Measurement
    • An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.
    • Example,
      • eye color varies from person to person, while the temperature of an object varies over time.
    • Eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.},
    • Temperature is a numerical attribute with a potentially unlimited number of values.

26 of 46

Types of Data

  • Attributes and Measurement
    • A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
    • The process of measurement
      • is the application of a measurement scale to associate a value with a particular attribute of a specific object.

27 of 46

Properties of Attribute Values

  • The type of an attribute depends on which of the following properties it possesses:
      • Distinctness: = ≠
      • Order: < >
      • Addition: + -
      • Multiplication: * /

  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness & order
  • Interval attribute: distinctness, order & addition
  • Ratio attribute: all 4 properties
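The cumulative nature of these properties can be sketched as a small lookup table. The names `PROPERTY_ORDER`, `LEVEL`, `properties_of`, and `supports` are hypothetical and serve only to restate the list above in executable form.

```python
# Properties in the order in which attribute types acquire them
PROPERTY_ORDER = ["distinctness", "order", "addition", "multiplication"]

# Number of properties each attribute type possesses
LEVEL = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def properties_of(attribute_type):
    """Return the properties an attribute type possesses, per the list above."""
    return PROPERTY_ORDER[:LEVEL[attribute_type]]


def supports(attribute_type, prop):
    """True if values of this attribute type meaningfully support the property."""
    return prop in properties_of(attribute_type)


print(properties_of("interval"))
print(supports("interval", "multiplication"))  # ratios of interval values are not meaningful
```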

28 of 46

Types of Data

  • Properties of Attribute Values
    • Nominal - values are just names that differentiate one object from another.
            • Roll number, employee ID
    • Ordinal - values provide enough information to order the objects.
            • Rankings, grades, height recorded as short/medium/tall
    • Interval - measured on a scale of equal-sized units
            • no inherent zero point
            • Temperatures in °C and °F, calendar dates
    • Ratio - numeric attribute with an inherent zero point.
            • a value can be expressed as a multiple (or ratio) of another value.
            • Weight, number of staff, income/salary

30 of 46

Types of Data

Properties of Attribute Values - Transformations

    • The statistical operations that make sense for an attribute type are those that yield the same results when the attribute is transformed using a transformation that preserves the attribute’s meaning.
    • Example:-
      • the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length.
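The meters-versus-feet example can be checked with a little arithmetic. The lengths below are hypothetical; the only fixed fact is that one foot is exactly 0.3048 meters.

```python
import math

METERS_PER_FOOT = 0.3048  # exact by definition

# Hypothetical object lengths in meters
lengths_m = [1.0, 2.0, 3.0, 6.0]
lengths_ft = [x / METERS_PER_FOOT for x in lengths_m]

mean_m = sum(lengths_m) / len(lengths_m)
mean_ft = sum(lengths_ft) / len(lengths_ft)

# The two averages are different numbers...
print(mean_m, mean_ft)

# ...but converting the average in feet back to meters gives the same length
print(math.isclose(mean_ft * METERS_PER_FOOT, mean_m))
```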

32 of 46

Types of Data

Attribute Types

  • Data
    • Qualitative / Categorical (no properties of integers)
      • Nominal
      • Ordinal
    • Quantitative / Numeric (properties of integers)
      • Interval
      • Ratio

33 of 46

Types of Data

  • Describing Attributes by the Number of Values
    1. Discrete
      • finite or countably infinite set of values.
      • Categorical - zip codes or ID numbers, or
      • Numeric - counts.
      • Binary attributes (special case of discrete)
        • assume only two values,
        • e.g., true/false, yes/no, male/female, or 0/1.
    2. Continuous
      • values are real numbers.
      • Ex:- temperature, height, or weight.

Any of the measurement scale types—nominal, ordinal, interval, and ratio—could be combined with any of the types based on the number of attribute values—binary, discrete, and continuous.

34 of 46

Types of Data - Types of Dataset

General Characteristics of Data Sets

  • 3 characteristics that apply to many data sets are:-
    • dimensionality,
    • sparsity, and
    • resolution.
  • Dimensionality - number of attributes that the objects in the data set possess.
    • Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data.
    • Related topics: the curse of dimensionality and dimensionality reduction.
  • Sparsity - in some data sets, such as those with asymmetric features, most attributes of an object have values of 0;
    • in many cases, fewer than 1% of the entries are non-zero.
  • Resolution - data can often be obtained at different levels of resolution, and its properties differ from one resolution to another.
    • Example:- the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.

35 of 46

Types of Data - Types of Dataset

  • Record Data
    • data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).
    • No explicit relationships between records
    • Same set of attributes for every record
    • Usually stored in flat files or relational databases.

36 of 46

Types of Data - Types of Dataset

  • Transaction or Market Basket Data
    • special type of record data
    • Each record (transaction) involves a set of items.
    • Also called market basket data because the items in each record are the products in a person’s “market basket.”
    • Can be viewed as a set of records whose fields are asymmetric attributes.
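Viewing transactions as records with asymmetric attributes might look like the sketch below: each item becomes a 0/1 field, and only the 1s (item present) carry information. The transactions are hypothetical.

```python
# Hypothetical transactions, each a set of purchased items
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers"},
    {"Milk", "Diapers", "Cola"},
]

# One field (attribute) per distinct item, in a fixed order
items = sorted(set().union(*transactions))

# One record per transaction: 1 if the item was bought, 0 otherwise
records = [[1 if item in basket else 0 for item in items] for basket in transactions]

print(items)
for record in records:
    print(record)
```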

37 of 46

Types of Data - Types of Dataset

  • Data Matrix / Pattern Matrix
    • fixed set of numeric attributes,
    • Data objects = points (vectors) in a multidimensional space
    • each dimension = a distinct attribute describing the object.
    • A set of such data objects can be interpreted as
      • an m by n matrix,
        • where there are
        • m rows, one for each object,
        • and n columns, one for each attribute.
    • Standard matrix operations can be applied to transform and manipulate the data.
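A minimal sketch of a data matrix, using plain Python lists rather than a matrix library: m rows (objects) by n columns (attributes), followed by one simple transformation, subtracting each column's mean. The values are hypothetical.

```python
# Hypothetical data matrix: 3 objects (rows) described by 2 numeric attributes (columns)
data = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.0, 2.5],
]

m, n = len(data), len(data[0])  # m rows, n columns

# Mean of each attribute (column)
column_means = [sum(row[j] for row in data) / m for j in range(n)]

# Center the data: subtract the column mean from every entry
centered = [[row[j] - column_means[j] for j in range(n)] for row in data]

print(m, n)
print(column_means)
print(centered)
```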

38 of 46

Types of Data - Types of Dataset

  • Sparse Data Matrix:
    • Special case of a data matrix

    • attributes are of the
      • same type and
      • asymmetric; i.e., only non-zero values are important.
    • Example:-
      • Transaction data which has only 0–1 entries.
      • Document-term matrix - a collection of term vectors
        • Each term vector represents one document (one row in the matrix)
        • Each term is an attribute of the vector (one column in the matrix)
        • The value under an attribute is the number of times the corresponding term occurs in the document.
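A document-term matrix can be sketched as follows. The three "documents" are hypothetical, and storing only the non-zero entries in a dictionary keyed by (row, column) is just one simple way to exploit sparsity.

```python
from collections import Counter

# Hypothetical documents
documents = [
    "team play score team",
    "coach play game",
    "score game game win",
]

# One term vector (word counts) per document
term_vectors = [Counter(doc.split()) for doc in documents]

# Columns of the matrix: every distinct term, in a fixed order
vocabulary = sorted(set().union(*term_vectors))

# Dense document-term matrix: rows are documents, columns are terms
dense = [[vector[term] for term in vocabulary] for vector in term_vectors]

# Sparse form: keep only the non-zero entries
sparse = {
    (i, j): count
    for i, row in enumerate(dense)
    for j, count in enumerate(row)
    if count != 0
}

print(vocabulary)
print(dense)
print(len(sparse), "non-zero entries out of", len(dense) * len(vocabulary))
```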

39 of 46

Types of Data - Types of Dataset

  • Graph based Data:
    • Data can be represented in the form of Graph.
    • Graphs are used in two specific cases:
      • (1) the graph captures relationships among data objects and
      • (2) the data objects themselves are represented as graphs.
    • Data with Relationships among Objects
      • Relationships among objects also convey important information.
      • Relationships among objects are captured by the links between objects and link properties, such as direction and weight.
      • Example:
        • Web pages on the World Wide Web contain both text and links to other pages.
        • Web search engines collect and process Web pages to extract their contents.
        • Links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration.
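Linked Web pages can be sketched as a directed graph stored as an adjacency list. The page names are hypothetical, and counting incoming links is only a crude stand-in for the link information a real search engine would use.

```python
# Hypothetical pages: each key links to the pages in its list (directed links)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# Count incoming links for every page
in_links = {page: 0 for page in links}
for source, targets in links.items():
    for target in targets:
        in_links[target] = in_links.get(target, 0) + 1

# Page with the most incoming links
most_linked = max(in_links, key=in_links.get)

print(in_links)
print(most_linked)
```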

41 of 46

Types of Data - Types of Dataset

  • Graph based Data:
    • Data with Objects That Are Graphs
      • When objects contain sub-objects that have relationships, then such objects are frequently represented as graphs.
      • Example:-Structure of chemical compounds
          • Atoms are - nodes
          • Chemical Bonds - links between nodes
            • ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray).

Substructure mining

42 of 46

Types of Data - Types of Dataset

  • Ordered Data:
    • In some data, the attributes have relationships that involve order in time or space.
    • Sequential Data
      • Sequential data / temporal data
      • extension of record data - each record has a time associated with it.
      • Ex:- Retail transaction data set - stores the time of each transaction
        • the time information can be used to find patterns such as
          • “candy sales peak before Halloween.”
      • A time can also be associated with each attribute
        • Record - purchase history of a customer
          • with a listing of items purchased at different times.
        • this makes it possible to find patterns such as
          • “people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”

44 of 46

Types of Data - Types of Dataset

  • Ordered Data: Sequence Data
    • consists of a data set that is a sequence of individual entities,
    • Example
      • sequence of words or letters.
    • Example:
      • Genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.
      • Predicting similarities in the structure and function of genes from similarities in nucleotide sequences.
    • Ex:- Human genetic code expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.

45 of 46

Types of Data - Types of Dataset

  • Ordered Data: Time Series Data
    • Special type of sequential data in which each record is a time series,
    • A series of measurements taken over time.
    • Example:
      • Financial data set might contain objects that are time series of the daily prices of various stocks.
    • Temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar.

Time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.
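Temporal autocorrelation can be quantified with a lag-1 autocorrelation coefficient, sketched below. The two series are hypothetical (not the Minneapolis data), and the function name `lag_autocorrelation` is an assumption for illustration.

```python
def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        return 0.0  # constant series: no variation to correlate
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical smoothly varying measurements: neighbors in time are similar
smooth = [10, 12, 15, 19, 22, 24, 23, 20, 16, 12]

# Hypothetical series that jumps back and forth: neighbors in time are dissimilar
alternating = [10, 20] * 5

print(lag_autocorrelation(smooth))       # positive
print(lag_autocorrelation(alternating))  # negative
```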

46 of 46

Types of Data - Types of Dataset

  • Ordered Data: Spatial Data
  • Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
  • An example of spatial data is
    • weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations.
  • spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well.
  • Example
    • two points on the Earth that are close to each other usually have similar values for temperature and rainfall.

Average Monthly Temperature of land and ocean