1 of 58

Predictive Data Mining

2 of 58

Introduction

  • An observation, or record, is the set of recorded values of variables associated with a single entity
  • Supervised learning: Data-mining methods for predicting an outcome based on a set of input variables, or features
  • Supervised learning can be used for:
    • Estimation of a continuous outcome
    • Classification of a categorical outcome


3 of 58

Introduction

The data-mining process comprises the following steps:

    • Data sampling
    • Data preparation
    • Data partitioning
    • Data exploration
    • Model construction
    • Model assessment

4 of 58

Data Sampling

5 of 58

Data Sampling

  • When dealing with large volumes of data, it is best practice to extract a representative sample for analysis
  • A sample is representative if the analyst can reach the same conclusions from it as from the entire population of data
  • The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly
  • Data-mining algorithms typically are more effective given more data
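As a minimal sketch of this step in Python (the file name customers.csv and the sample size are hypothetical), pandas can draw a reproducible random sample from a large data set:

```python
import pandas as pd

# Load the full data set (customers.csv is a hypothetical file name).
full_data = pd.read_csv("customers.csv")

# Draw a simple random sample of 10,000 observations; fixing
# random_state makes the sample reproducible.
sample = full_data.sample(n=10_000, random_state=42)
```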


6 of 58

Data Sampling

  • When obtaining a representative sample, it is generally best to include as many variables as possible in the sample
  • After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest


7 of 58

Data Partitioning

8 of 58

Data Partitioning

  • Data-mining applications deal with an abundance of data that simplifies the process of assessing the accuracy of data-based estimates of variable effects
  • Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data
  • We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
      • The training set
      • The validation set
      • The test set


9 of 58

Data Partitioning

  • Training set: Consists of the data used to build the candidate models
  • Validation set: The data set to which a promising subset of candidate models is applied in order to identify which model is most accurate at predicting data that were not used to build the model
  • Test set: The data set to which the final model should be applied to estimate this model’s effectiveness when applied to data that have not been used to build or select the model
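One common way to implement this three-way partition is two successive random splits; the sketch below uses scikit-learn, assumes the observations are in a DataFrame named data, and applies an illustrative 60/20/20 split:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set, then split the remaining
# 80% into training (60% of all data) and validation (20% of all data).
train_val, test = train_test_split(data, test_size=0.20, random_state=1)
train, validation = train_test_split(train_val, test_size=0.25, random_state=1)
```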


10 of 58

Accuracy Measures

Evaluating the Classification of Categorical Outcomes

Evaluating the Estimation of Continuous Outcomes

11 of 58

Accuracy Measures

Evaluating the Classification of Categorical Outcomes

  • By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of the model’s classification performance
  • Classification confusion matrix: Displays a model’s correct and incorrect classifications


12 of 58

Accuracy Measures

  • The overall error rate is the percentage of observations that the model misclassifies
  • Letting nij denote the number of actual Class i observations that the model classifies as Class j, the entries of the confusion matrix give:

    overall error rate = (n01 + n10) / (n00 + n01 + n10 + n11)

13 of 58

Accuracy Measures

  • One minus the overall error rate is often referred to as the accuracy of the model
  • While the overall error rate conveys an aggregate measure of misclassification, it counts a false positive (misclassifying an actual Class 0 observation as Class 1) the same as a false negative (misclassifying an actual Class 1 observation as Class 0)
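To make these counts concrete, the sketch below builds a confusion matrix with scikit-learn from illustrative 0/1 label arrays and computes the overall error rate and accuracy:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative actual and predicted class labels for a validation set.
actual    = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 1])

# Rows are actual classes, columns are predicted classes:
# [[n00, n01],
#  [n10, n11]]
cm = confusion_matrix(actual, predicted)
n00, n01 = cm[0]
n10, n11 = cm[1]

overall_error_rate = (n01 + n10) / cm.sum()
accuracy = 1 - overall_error_rate
```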

14 of 58

Accuracy Measures

  • The Class 1 error rate is the percentage of actual Class 1 observations misclassified as Class 0: n10 / (n10 + n11)
  • The Class 0 error rate is the percentage of actual Class 0 observations misclassified as Class 1: n01 / (n00 + n01)
  • Raising or lowering the cutoff value used to classify observations trades off these two error rates (see Figure 9.1)

15 of 58

Accuracy Measures

  • Cumulative lift chart: Compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated Class 1 probability against the number identified by random selection
  • Decile-wise lift chart: Another way to view how much better a classifier is at identifying Class 1 observations than random classification
  • Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups
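Decile-wise lift can be computed directly from a model's estimated probabilities; this sketch assumes arrays actual (0/1 labels) and prob_class1 (estimated Class 1 probabilities):

```python
import numpy as np

def decile_lift(actual, prob_class1):
    """Lift of each decile relative to the overall Class 1 rate."""
    order = np.argsort(prob_class1)[::-1]        # highest probability first
    sorted_actual = np.asarray(actual)[order]
    deciles = np.array_split(sorted_actual, 10)  # 10 roughly equal groups
    overall_rate = np.mean(actual)
    return [group.mean() / overall_rate for group in deciles]
```

A first-decile lift of 2, for example, means the top 10% of observations by estimated probability contain twice as many actual Class 1 observations as a randomly selected 10% would.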


16 of 58

Table 9.2: Classification Probabilities


17 of 58

Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values


18 of 58

Figure 9.1: Classification Error Rates vs. Cutoff Value


19 of 58

Figure 9.2: Cumulative and Decile-Wise Lift Charts


20 of 58

Accuracy Measures

  • The ability to correctly predict Class 1 (positive) observations is commonly expressed as sensitivity, or recall, and is calculated as:

    sensitivity = n11 / (n11 + n10)

  • The ability to correctly predict Class 0 (negative) observations is commonly expressed as specificity and is calculated as:

    specificity = n00 / (n00 + n01)

21 of 58

Accuracy Measures

  • Precision is the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:

    precision = n11 / (n11 + n01)

  • The F1 Score combines precision and sensitivity into a single measure and is defined as:

    F1 Score = 2 × (precision × sensitivity) / (precision + sensitivity)
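All four measures are available in scikit-learn; this sketch reuses the actual and predicted arrays from the earlier confusion-matrix sketch:

```python
from sklearn.metrics import recall_score, precision_score, f1_score

# Sensitivity (recall): n11 / (n11 + n10)
sensitivity = recall_score(actual, predicted)

# Specificity: n00 / (n00 + n01), i.e., recall of the negative class
specificity = recall_score(actual, predicted, pos_label=0)

# Precision: n11 / (n11 + n01)
precision = precision_score(actual, predicted)

# F1 Score: harmonic mean of precision and sensitivity
f1 = f1_score(actual, predicted)
```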

22 of 58

Figure 9.3: Receiver Operating Characteristic (ROC) Curve

  • The receiver operating characteristic (ROC) curve is an alternative graphical approach for displaying the tradeoff between a classifier’s ability to correctly identify Class 1 observations and its Class 0 error rate
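An ROC curve can be traced by sweeping the cutoff value over a model's estimated Class 1 probabilities; this sketch assumes arrays actual and prob_class1 as in the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(actual, prob_class1)   # one point per cutoff value

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(actual, prob_class1):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")       # random-classifier baseline
plt.xlabel("1 − Specificity (Class 0 error rate)")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()
```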

23 of 58

Accuracy Measures

  • The accuracy of estimates of a continuous outcome is commonly measured with the root mean squared error (RMSE), the square root of the average squared error over the n observations evaluated:

    RMSE = √((e1² + e2² + … + en²) / n), where ei is the error in estimating the ith observation's outcome
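A minimal computation of these error measures, assuming arrays actual_values and estimates for the validation set:

```python
import numpy as np

errors = np.asarray(actual_values) - np.asarray(estimates)

average_error = errors.mean()         # near zero unless estimates are biased
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
```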

24 of 58

Table 9.4: Computed Error in Estimates of Average Balance for 10 Customers

25 of 58

Logistic Regression

26 of 58

Logistic Regression

  • Logistic regression attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables
  • A linear regression model fails to appropriately explain a categorical outcome variable


27 of 58

Figure 9.4: Scatter Chart and Simple Linear Regression Fit for Oscars Example


28 of 58

Figure 9.5: Residuals for Simple Linear Regression on Oscars Data

An unmistakable pattern of systematic misprediction suggests that the simple linear regression model is not appropriate

29 of 58

Logistic Regression

  • Rather than modeling the outcome y directly, logistic regression models p, the probability that y = 1, as a function of the explanatory variables
  • The odds of an observation belonging to Class 1 are p / (1 − p), and the log odds, or logit, is modeled as a linear function of the explanatory variables:

    ln(p / (1 − p)) = b0 + b1x1 + b2x2 + … + bqxq

30 of 58

Logistic Regression

  • Solving the logit equation for p yields the logistic function, whose S-shaped curve (Figure 9.6) keeps estimated probabilities between 0 and 1:

    p = 1 / (1 + e^−(b0 + b1x1 + … + bqxq))

31 of 58

Figure 9.6: Logistic S-Curve for Oscars Example

32 of 58

Logistic Regression

  • Logistic regression classifies an observation by using the logistic function to compute the probability that the observation belongs to Class 1 and then comparing this probability to a cutoff value
  • If the probability exceeds the cutoff value, the observation is classified as Class 1; otherwise, it is classified as Class 0
  • While a logistic regression model used for prediction should ultimately be judged by its classification accuracy on the validation and test sets, Mallow's Cp statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables
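A sketch of this classification rule with scikit-learn, assuming training data X_train, y_train and validation features X_valid, with an illustrative cutoff of 0.5:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns [P(Class 0), P(Class 1)] for each observation;
# classify as Class 1 when P(Class 1) exceeds the cutoff.
cutoff = 0.5
prob_class1 = model.predict_proba(X_valid)[:, 1]
predicted = (prob_class1 > cutoff).astype(int)
```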

33 of 58

Table 9.5: Predicted Probabilities by Logistic Regression for Oscars Example

34 of 58

k-Nearest Neighbors

Classifying Categorical Outcomes with k-Nearest Neighbors

Estimating Continuous Outcomes with k-Nearest Neighbors

35 of 58

k-Nearest Neighbors

  • k-Nearest Neighbors (k-NN): This method can be used either to classify a categorical outcome or to estimate a continuous outcome
  • k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance


36 of 58

k-Nearest Neighbors

Classifying Categorical Outcomes with k-Nearest Neighbors

  • A k-NN classifier is a “lazy learner”: rather than building a general model from the training data, it directly uses the entire training set to classify observations in the validation and test sets
  • The value of k can plausibly range from 1 to n, the number of observations in the training set
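A sketch of choosing k by validation accuracy with scikit-learn (X_train, y_train, X_valid, y_valid are assumed; features should be on comparable scales so that Euclidean distance is meaningful):

```python
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    knn.fit(X_train, y_train)                  # "fitting" just stores the training set
    print(k, knn.score(X_valid, y_valid))      # classification accuracy
```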

37 of 58

Table 9.6: Training Set Observations for k-NN Classifier

38 of 58

Figure 9.7: Scatter Chart for k-NN Classification

39 of 58

Table 9.7: Classification of Observation with Average Balance=900 and Age=28 for Different Values of k

40 of 58

k-Nearest Neighbors

Estimating Continuous Outcomes with k-Nearest Neighbors

  • When k-NN is used to estimate a continuous outcome, a new observation’s outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set
  • The value of k can plausibly range from 1 to n, the number of observations in the training set
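A sketch of k-NN estimation with scikit-learn, assuming the same kind of training and validation data as in the classification sketch:

```python
from sklearn.neighbors import KNeighborsRegressor

# The predicted outcome for a new observation is the average outcome
# of its k nearest neighbors in the training set.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
estimates = knn.predict(X_valid)
```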

Figure 9.8: Scatter Chart for k-NN Estimation

41 of 58

Table 9.8: Estimation of Average Balance for Observation with Age=28 for Different Values of k

42 of 58

Classification and Regression Trees

Classifying Categorical Outcomes with a Classification Tree

Estimating Continuous Outcomes with a Regression Tree

Ensemble Methods

43 of 58

Classification and Regression Trees

  • Classification and regression trees (CART) successively partition a data set of observations into increasingly smaller and more homogeneous subsets
  • At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable
  • The CART method can be viewed as a series of questions that successively narrow observations into smaller and smaller groups of decreasing impurity


44 of 58

Classification and Regression Trees

Classifying Categorical Outcomes with a Classification Tree

  • Classification trees: The impurity of a group of observations is based on the proportion of observations belonging to the same class
  • After a final tree is constructed, the classification of a new observation is then based on the final partition into which the new observation belongs
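A sketch of fitting and pruning a classification tree with scikit-learn (X_train, y_train, X_valid assumed; ccp_alpha is scikit-learn's cost-complexity pruning knob, used here as one plausible way to obtain a pruned tree):

```python
from sklearn.tree import DecisionTreeClassifier

# Larger ccp_alpha values prune more branches, trading training-set
# fit for better generalization on new data.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
predicted = tree.predict(X_valid)
```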


45 of 58

Classification and Regression Trees

Classifying a Categorical Outcome with a Classification Tree

  • To explain how a classification tree categorizes observations:
    • We use a small sample of data from HHI consisting of 46 observations
    • Only two variables from HHI are used—the percentage of the $ character (denoted Percent_$) and the percentage of the ! character (denoted Percent_!)


46 of 58

Figure 9.9: Construction Sequence of Branches in a Classification Tree

47 of 58

Figure 9.10: Geometric Illustration of Classification Tree Partitions

The final partitioning resulting from the sequence of variable splits


48 of 58

Figure 9.11: Classification Tree with One Pruned Branch


49 of 58

Table 9.9: Classification Error Rates on Sequence of Pruned Trees
Figure 9.12: Best-Pruned Classification Tree


50 of 58

Classification and Regression Trees

Estimating Continuous Outcomes with a Regression Tree

  • A regression tree successively partitions observations of the training set into smaller and smaller groups in a similar fashion as a classification tree
  • The differences are:
    • A regression tree bases the impurity of a partition on the variance of the outcome values of the observations in the group
    • After a final tree is constructed, the predicted outcome value of an observation is based on the mean outcome value of the partition into which the new observation belongs
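A sketch of a regression tree with scikit-learn, under the same data assumptions as the earlier sketches:

```python
from sklearn.tree import DecisionTreeRegressor

# Splits are chosen to reduce the variance (squared error) of the
# outcome within each partition; each leaf predicts the mean outcome
# of the training observations that fall in it.
reg_tree = DecisionTreeRegressor(criterion="squared_error", random_state=0)
reg_tree.fit(X_train, y_train)
estimates = reg_tree.predict(X_valid)
```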

51 of 58

Figure 9.13: Geometric Illustration of First Six Rules of a Regression Tree

52 of 58

Classification and Regression Trees

Ensemble Methods

  • In an ensemble method, predictions are made based on the combination of a collection of models
  • Two necessary conditions for an ensemble to perform better than a single model:
    1. Individual base models are constructed independently of each other
    2. Individual models perform better than just randomly guessing

53 of 58

Classification and Regression Trees

  • Two primary steps to an ensemble approach:
    1. The development of a committee of individual base models
    2. The combination of the individual base models’ predictions to form a composite prediction
  • A classification or estimation method is unstable if relatively small changes in the training set cause its predictions to fluctuate
  • Three different ways to construct an ensemble of classification or regression trees:
    • Bagging
    • Boosting
    • Random forests

54 of 58

Classification and Regression Trees

In the bagging approach, the committee of individual base models is generated by constructing multiple training sets through repeated random sampling, with replacement, of the n observations in the original data, as sketched below
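A sketch of bagging with scikit-learn, whose BaggingClassifier uses a decision tree as its default base model (X_train, y_train assumed):

```python
from sklearn.ensemble import BaggingClassifier

# Each of the 10 base trees is fit to a bootstrap sample (random
# sampling with replacement) of the training data; the ensemble
# classifies new observations by majority vote.
bagging = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
```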

Table 9.10: Original 10-Observation Training Data

55 of 58

Classification and Regression Trees

  • The boosting method generates its committee of individual base models by sampling multiple training sets
  • Boosting iteratively adapts how it samples the original data when constructing a new training set based on the prediction error of the models constructed on the previous training sets
  • Random forests (also called random trees) can be viewed as a variation of bagging specifically tailored for use with classification or regression trees
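Sketches of both approaches with scikit-learn, under the same data assumptions as before (AdaBoost is one widely used boosting algorithm):

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Boosting: each successive base model focuses on observations that
# earlier models mispredicted, and votes are weighted by accuracy.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)
boosting.fit(X_train, y_train)

# Random forest: bagging plus a random subset of variables considered
# at each split, which decorrelates the individual trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
```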

56 of 58

Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees

57 of 58

Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble

58 of 58

Classification and Regression Trees

  • For most problems, the predictive accuracy of boosting ensembles exceeds the predictive accuracy of bagging ensembles
  • Boosting achieves its performance advantage because:
    • It evolves its committee of models by focusing on observations that are mispredicted
    • The member models’ votes are weighted by their accuracy
  • Boosting is more computationally expensive than bagging
  • There is no adaptive feedback in a bagging approach, so all m training sets and their corresponding models can be constructed simultaneously
  • The random forests approach has predictive performance similar to boosting but maintains the computational simplicity of bagging