1 of 58

Predictive Data Mining

2 of 58

Introduction

  • An observation, or record, is the set of recorded values of variables associated with a single entity
  • Supervised learning: Data-mining methods for predicting an outcome based on a set of input variables, or features
  • Supervised learning can be used for:
    • Estimation of a continuous outcome
    • Classification of a categorical outcome


3 of 58

Introduction

The data-mining process comprises the following steps:

    • Data sampling
    • Data preparation
    • Data partitioning
    • Data exploration
    • Model construction
    • Model assessment

4 of 58

Data Sampling

5 of 58

Data Sampling

  • When dealing with large volumes of data, it is best practice to extract a representative sample for analysis
  • A sample is representative if the analyst can reach the same conclusions from it as from the entire population of data
  • The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly
  • Data-mining algorithms typically are more effective given more data
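As a minimal sketch of this step in Python (the file name customers.csv and the sample size are hypothetical), pandas can draw a reproducible random sample from a large data set:

```python
import pandas as pd

# Load the full data set (customers.csv is a hypothetical file name).
full_data = pd.read_csv("customers.csv")

# Draw a simple random sample of 10,000 observations; fixing
# random_state makes the sample reproducible.
sample = full_data.sample(n=10_000, random_state=42)
```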


6 of 58

Data Sampling

  • When obtaining a representative sample, it is generally best to include as many variables as possible in the sample
  • After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest


7 of 58

Data Partitioning

8 of 58

Data Partitioning

  • Data-mining applications deal with an abundance of data that simplifies the process of assessing the accuracy of data-based estimates of variable effects
  • Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data
  • We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
      • The training set
      • The validation set
      • The test set


9 of 58

Data Partitioning

  • Training set: Consists of the data used to build the candidate models
  • Validation set: The data set to which a promising subset of candidate models is applied in order to identify which model is most accurate at predicting data that were not used to build the model
  • Test set: The data set to which the final model should be applied to estimate this model’s effectiveness when applied to data that have not been used to build or select the model
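One common way to implement this three-way partition is two successive random splits; the sketch below uses scikit-learn, assumes the observations are in a DataFrame named data, and applies an illustrative 60/20/20 split:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set, then split the remaining
# 80% into training (60% of all data) and validation (20% of all data).
train_val, test = train_test_split(data, test_size=0.20, random_state=1)
train, validation = train_test_split(train_val, test_size=0.25, random_state=1)
```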


10 of 58

Accuracy Measures

Evaluating the Classification of Categorical Outcomes

Evaluating the Estimation of Continuous Outcomes

11 of 58

Accuracy Measures

Evaluating the Classification of Categorical Outcomes

  • By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model’s classification performance
  • Classification confusion matrix: Displays a model’s correct and incorrect classifications


12 of 58

Accuracy Measures

  • The overall error rate is the percentage of observations that the model misclassifies
  • Letting nij denote the number of actual Class i observations that the model classifies as Class j, the entries of the confusion matrix give:

    overall error rate = (n01 + n10) / (n00 + n01 + n10 + n11)

13 of 58

Accuracy Measures

  • One minus the overall error rate is often referred to as the accuracy of the model
  • While the overall error rate conveys an aggregate measure of misclassification, it counts a false positive (misclassifying an actual Class 0 observation as Class 1) the same as a false negative (misclassifying an actual Class 1 observation as Class 0)
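To make these counts concrete, the sketch below builds a confusion matrix with scikit-learn from illustrative 0/1 label arrays and computes the overall error rate and accuracy:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative actual and predicted class labels for a validation set.
actual    = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 1])

# Rows are actual classes, columns are predicted classes:
# [[n00, n01],
#  [n10, n11]]
cm = confusion_matrix(actual, predicted)
n00, n01 = cm[0]
n10, n11 = cm[1]

overall_error_rate = (n01 + n10) / cm.sum()
accuracy = 1 - overall_error_rate
```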

14 of 58

Accuracy Measures

  • The Class 1 error rate is the percentage of actual Class 1 observations misclassified as Class 0: n10 / (n10 + n11)
  • The Class 0 error rate is the percentage of actual Class 0 observations misclassified as Class 1: n01 / (n00 + n01)
  • Raising or lowering the cutoff value used to classify observations trades off these two error rates (see Figure 9.1)

15 of 58

Accuracy Measures

  • Cumulative lift chart: Compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated Class 1 probability against the number identified by random selection
  • Decile-wise lift chart: Another way to view how much better a classifier is at identifying Class 1 observations than random classification
  • Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups
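Decile-wise lift can be computed directly from a model's estimated probabilities; this sketch assumes arrays actual (0/1 labels) and prob_class1 (estimated Class 1 probabilities):

```python
import numpy as np

def decile_lift(actual, prob_class1):
    """Lift of each decile relative to the overall Class 1 rate."""
    order = np.argsort(prob_class1)[::-1]        # highest probability first
    sorted_actual = np.asarray(actual)[order]
    deciles = np.array_split(sorted_actual, 10)  # 10 roughly equal groups
    overall_rate = np.mean(actual)
    return [group.mean() / overall_rate for group in deciles]
```

A first-decile lift of 2, for example, means the top 10% of observations by estimated probability contain twice as many actual Class 1 observations as a randomly selected 10% would.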


16 of 58

Table 9.2: Classification Probabilities


17 of 58

Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values


18 of 58

Figure 9.1: Classification Error Rates vs. Cutoff Value


19 of 58

Figure 9.2: Cumulative and Decile-Wise Lift Charts


20 of 58

Accuracy Measures

  • The ability to correctly predict Class 1 (positive) observations is commonly expressed as sensitivity, or recall, and is calculated as:

    sensitivity = n11 / (n11 + n10)

  • The ability to correctly predict Class 0 (negative) observations is commonly expressed as specificity and is calculated as:

    specificity = n00 / (n00 + n01)

21 of 58

Accuracy Measures

  • Precision is the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:

    precision = n11 / (n11 + n01)

  • The F1 Score combines precision and sensitivity into a single measure and is defined as:

    F1 Score = 2 × (precision × sensitivity) / (precision + sensitivity)
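All four measures are available in scikit-learn; this sketch reuses the actual and predicted arrays from the earlier confusion-matrix sketch:

```python
from sklearn.metrics import recall_score, precision_score, f1_score

# Sensitivity (recall): n11 / (n11 + n10)
sensitivity = recall_score(actual, predicted)

# Specificity: n00 / (n00 + n01), i.e., recall of the negative class
specificity = recall_score(actual, predicted, pos_label=0)

# Precision: n11 / (n11 + n01)
precision = precision_score(actual, predicted)

# F1 Score: harmonic mean of precision and sensitivity
f1 = f1_score(actual, predicted)
```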

22 of 58

Figure 9.3: Receiver Operating Characteristic (ROC) Curve

  • The receiver operating characteristic (ROC) curve is an alternative graphical approach for displaying the tradeoff between a classifier’s ability to correctly identify Class 1 observations and its Class 0 error rate
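An ROC curve can be traced by sweeping the cutoff value over a model's estimated Class 1 probabilities; this sketch assumes arrays actual and prob_class1 as in the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(actual, prob_class1)   # one point per cutoff value

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(actual, prob_class1):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")       # random-classifier baseline
plt.xlabel("1 − Specificity (Class 0 error rate)")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()
```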

23 of 58

Accuracy Measures

  • The accuracy of estimates of a continuous outcome is commonly measured with the root mean squared error (RMSE), the square root of the average squared error over the n observations evaluated:

    RMSE = √((e1² + e2² + … + en²) / n), where ei is the error in estimating the ith observation's outcome
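A minimal computation of these error measures, assuming arrays actual_values and estimates for the validation set:

```python
import numpy as np

errors = np.asarray(actual_values) - np.asarray(estimates)

average_error = errors.mean()         # near zero unless estimates are biased
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
```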

24 of 58

Table 9.4: Computed Error in Estimates of Average Balance for 10 Customers

25 of 58

Logistic Regression

26 of 58

Logistic Regression

  • Logistic regression attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables
  • A linear regression model fails to appropriately explain a categorical outcome variable


27 of 58

Figure 9.4: Scatter Chart and Simple Linear Regression Fit for Oscars Example


28 of 58

Figure 9.5: Residuals for Simple Linear Regression on Oscars Data

An unmistakable pattern of systematic misprediction suggests that the simple linear regression model is not appropriate

29 of 58

Logistic Regression

  • Rather than modeling the outcome y directly, logistic regression models p, the probability that y = 1, as a function of the explanatory variables
  • The odds of an observation belonging to Class 1 are p / (1 − p), and the log odds, or logit, is modeled as a linear function of the explanatory variables:

    ln(p / (1 − p)) = b0 + b1x1 + b2x2 + … + bqxq

30 of 58

Logistic Regression

  • Solving the logit equation for p yields the logistic function, whose S-shaped curve (Figure 9.6) keeps estimated probabilities between 0 and 1:

    p = 1 / (1 + e^−(b0 + b1x1 + … + bqxq))

31 of 58

Figure 9.6: Logistic S-Curve for Oscars Example

32 of 58

Logistic Regression

  • Logistic regression classifies an observation by using the logistic function to compute the probability that the observation belongs to Class 1 and then comparing this probability to a cutoff value
  • If the probability exceeds the cutoff value, the observation is classified as Class 1; otherwise, it is classified as Class 0
  • While a logistic regression model used for prediction should ultimately be judged by its classification accuracy on the validation and test sets, Mallow's Cp statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables
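A sketch of this classification rule with scikit-learn, assuming training data X_train, y_train and validation features X_valid, with an illustrative cutoff of 0.5:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns [P(Class 0), P(Class 1)] for each observation;
# classify as Class 1 when P(Class 1) exceeds the cutoff.
cutoff = 0.5
prob_class1 = model.predict_proba(X_valid)[:, 1]
predicted = (prob_class1 > cutoff).astype(int)
```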

33 of 58

Table 9.5: Predicted Probabilities by Logistic Regression for Oscars Example

34 of 58

k-Nearest Neighbors

Classifying Categorical Outcomes with k-Nearest Neighbors

Estimating Continuous Outcomes with k-Nearest Neighbors

35 of 58

k-Nearest Neighbors

  • k-Nearest Neighbors (k-NN): This method can be used either to classify a categorical outcome or to estimate a continuous outcome
  • k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance


36 of 58

k-Nearest Neighbors

Classifying Categorical Outcomes with k-Nearest Neighbors

  • A k-NN classifier is a “lazy learner”: rather than building a general model from the training data, it directly uses the entire training set to classify observations in the validation and test sets
  • The value of k can plausibly range from 1 to n, the number of observations in the training set
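A sketch of choosing k by validation accuracy with scikit-learn (X_train, y_train, X_valid, y_valid are assumed; features should be on comparable scales so that Euclidean distance is meaningful):

```python
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    knn.fit(X_train, y_train)                  # "fitting" just stores the training set
    print(k, knn.score(X_valid, y_valid))      # classification accuracy
```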

37 of 58

Table 9.6: Training Set Observations for k-NN Classifier

38 of 58

Figure 9.7: Scatter Chart for k-NN Classification

39 of 58

Table 9.7: Classification of Observation with Average Balance=900 and Age=28 for Different Values of k

40 of 58

k-Nearest Neighbors

Estimating Continuous Outcomes with k-Nearest Neighbors

  • When k-NN is used to estimate a continuous outcome, a new observation’s outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set
  • The value of k can plausibly range from 1 to n, the number of observations in the training set
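A sketch of k-NN estimation with scikit-learn, assuming the same kind of training and validation data as in the classification sketch:

```python
from sklearn.neighbors import KNeighborsRegressor

# The predicted outcome for a new observation is the average outcome
# of its k nearest neighbors in the training set.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
estimates = knn.predict(X_valid)
```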

Figure 9.8: Scatter Chart for k-NN Estimation

41 of 58

Table 9.8: Estimation of Average Balance for Observation with Age=28 for Different Values of k

42 of 58

Classification and Regression Trees

Classifying Categorical Outcomes with a Classification Tree

Estimating Continuous Outcomes with a Regression Tree

Ensemble Methods

43 of 58

Classification and Regression Trees

  • Classification and regression trees (CART) successively partition a data set of observations into increasingly smaller and more homogeneous subsets
  • At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable
  • The CART method can be viewed as a series of questions that successively narrow observations into smaller and smaller groups of decreasing impurity


44 of 58

Classification and Regression Trees

Classifying Categorical Outcomes with a Classification Tree

  • Classification trees: The impurity of a group of observations is based on the proportion of observations belonging to the same class
  • After a final tree is constructed, the classification of a new observation is then based on the final partition into which the new observation belongs
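A sketch of fitting and pruning a classification tree with scikit-learn (X_train, y_train, X_valid assumed; ccp_alpha is scikit-learn's cost-complexity pruning knob, used here as one plausible way to obtain a pruned tree):

```python
from sklearn.tree import DecisionTreeClassifier

# Larger ccp_alpha values prune more branches, trading training-set
# fit for better generalization on new data.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
predicted = tree.predict(X_valid)
```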


45 of 58

Classification and Regression Trees

Classifying a Categorical Outcome with a Classification Tree

  • To explain how a classification tree categorizes observations:
    • We use a small sample of data from HHI consisting of 46 observations
    • Only two variables from HHI are used—the percentage of the $ character (denoted Percent_$) and the percentage of the ! character (denoted Percent_!)


46 of 58

Figure 9.9: Construction Sequence of Branches in a Classification Tree

47 of 58

Figure 9.10: Geometric Illustration of Classification Tree Partitions

The final partitioning resulting from the sequence of variable splits


48 of 58

Figure 9.11: Classification Tree with One Pruned Branch


49 of 58

Table 9.9: Classification Error Rates on Sequence of Pruned Trees
Figure 9.12: Best-Pruned Classification Tree


50 of 58

Classification and Regression Trees

Estimating Continuous Outcomes with a Regression Tree

  • A regression tree successively partitions observations of the training set into smaller and smaller groups in a similar fashion as a classification tree
  • The differences are:
    • A regression tree bases the impurity of a partition on the variance of the outcome values of the observations in the group
    • After a final tree is constructed, the predicted outcome value of an observation is based on the mean outcome value of the partition into which the new observation belongs
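A sketch of a regression tree with scikit-learn, under the same data assumptions as the earlier sketches:

```python
from sklearn.tree import DecisionTreeRegressor

# Splits are chosen to reduce the variance (squared error) of the
# outcome within each partition; each leaf predicts the mean outcome
# of the training observations that fall in it.
reg_tree = DecisionTreeRegressor(criterion="squared_error", random_state=0)
reg_tree.fit(X_train, y_train)
estimates = reg_tree.predict(X_valid)
```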

51 of 58

Figure 9.13: Geometric Illustration of First Six Rules of a Regression Tree

52 of 58

Classification and Regression Trees

Ensemble Methods

  • In an ensemble method, predictions are made based on the combination of a collection of models
  • Two necessary conditions for an ensemble to perform better than a single model:
    1. Individual base models are constructed independently of each other
    2. Individual models perform better than just randomly guessing

53 of 58

Classification and Regression Trees

  • Two primary steps to an ensemble approach:
    1. The development of a committee of individual base models
    2. The combination of the individual base models’ predictions to form a composite prediction
  • A classification or estimation method is unstable if relatively small changes in the training set cause its predictions to fluctuate
  • Three different ways to construct an ensemble of classification or regression trees:
    • Bagging
    • Boosting
    • Random forests

54 of 58

Classification and Regression Trees

In the bagging approach, the committee of individual base models is generated by constructing multiple training sets through repeated random sampling, with replacement, of the n observations in the original data, as sketched below
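A sketch of bagging with scikit-learn, whose BaggingClassifier uses a decision tree as its default base model (X_train, y_train assumed):

```python
from sklearn.ensemble import BaggingClassifier

# Each of the 10 base trees is fit to a bootstrap sample (random
# sampling with replacement) of the training data; the ensemble
# classifies new observations by majority vote.
bagging = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
```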

Table 9.10: Original 10-Observation Training Data

55 of 58

Classification and Regression Trees

  • The boosting method generates its committee of individual base models by sampling multiple training sets
  • Boosting iteratively adapts how it samples the original data when constructing a new training set based on the prediction error of the models constructed on the previous training sets
  • Random forests (also called random trees) can be viewed as a variation of bagging specifically tailored for use with classification or regression trees
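Sketches of both approaches with scikit-learn, under the same data assumptions as before (AdaBoost is one widely used boosting algorithm):

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Boosting: each successive base model focuses on observations that
# earlier models mispredicted, and votes are weighted by accuracy.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)
boosting.fit(X_train, y_train)

# Random forest: bagging plus a random subset of variables considered
# at each split, which decorrelates the individual trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
```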

56 of 58

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees

57 of 58

Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble

58 of 58

Classification and Regression Trees

  • For most problems, the predictive accuracy of boosting ensembles exceeds the predictive accuracy of bagging ensembles
  • Boosting achieves its performance advantage because:
    • It evolves its committee of models by focusing on observations that are mispredicted
    • The member models’ votes are weighted by their accuracy
  • Boosting is more computationally expensive than bagging
  • There is no adaptive feedback in a bagging approach, so all m training sets and their corresponding models can be constructed simultaneously
  • The random forests approach has predictive performance similar to boosting but maintains the computational simplicity of bagging