Predictive Data Mining
Introduction
The data-mining process comprises the following steps:
Data Sampling
Data Partitioning
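Since the chapter's later examples refer to training and validation sets, a minimal sketch of a random partition may help. The 60/40 split fraction and the seed below are illustrative assumptions, not values from the text.

```python
import random

def partition(observations, train_frac=0.6, seed=42):
    """Randomly partition observations into training and validation sets.

    train_frac and seed are illustrative choices, not values from the text.
    """
    rng = random.Random(seed)
    shuffled = observations[:]          # copy so the original order is kept
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Example: partition 10 observation IDs into training and validation sets.
train, validation = partition(list(range(1, 11)))
print("training:", train)
print("validation:", validation)
```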
Accuracy Measures
Evaluating the Classification of Categorical Outcomes
Table 9.2: Classification Probabilities
Table 9.3: Classification Confusion Matrices and Error Rates for Various Cutoff Values
Figure 9.1: Classification Error Rates vs. Cutoff Value
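To make the cutoff mechanics behind Table 9.3 and Figure 9.1 concrete, here is a minimal sketch: each observation is classified as Class 1 when its predicted probability meets the cutoff, and the confusion matrix counts the four possible outcomes. The probabilities and labels below are made up for illustration, not the values in Table 9.2.

```python
def confusion_matrix(probs, actual, cutoff):
    """Count true/false positives and negatives at a given cutoff.

    probs  : predicted probabilities of Class 1
    actual : true class labels (1 or 0)
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probs, actual):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Illustrative data, not the textbook's values.
probs = [0.92, 0.81, 0.65, 0.48, 0.33, 0.21]
actual = [1, 1, 0, 1, 0, 0]
for cutoff in (0.25, 0.50, 0.75):
    tp, fp, tn, fn = confusion_matrix(probs, actual, cutoff)
    error_rate = (fp + fn) / len(actual)
    print(f"cutoff={cutoff}: TP={tp} FP={fp} TN={tn} FN={fn} error={error_rate:.2f}")
```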
Figure 9.2: Cumulative and Decile-Wise Lift Charts
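A cumulative lift chart compares the number of actual Class 1 observations captured in the top-ranked fraction of the data (ranked by predicted probability) with the number expected from random selection. A rough sketch of that computation, on made-up data:

```python
def cumulative_lift(probs, actual):
    """Return, for each depth d, the ratio of Class 1 cases found among the
    d highest-probability observations to the count expected by chance."""
    ranked = [y for _, y in sorted(zip(probs, actual), key=lambda t: -t[0])]
    total_pos = sum(ranked)
    n = len(ranked)
    lifts = []
    found = 0
    for d, y in enumerate(ranked, start=1):
        found += y
        expected = total_pos * d / n      # positives expected at random
        lifts.append(found / expected)
    return lifts

probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # illustrative
actual = [1, 1, 0, 1, 0, 1, 0, 0]
print([round(lift, 2) for lift in cumulative_lift(probs, actual)])
```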
Figure 9.3: Receiver Operating Characteristic (ROC) Curve
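The ROC curve plots sensitivity (the true positive rate) against 1 − specificity (the false positive rate) as the cutoff sweeps from high to low. A minimal sketch, reusing the confusion-matrix idea above on made-up data:

```python
def roc_points(probs, actual):
    """Yield (false positive rate, true positive rate) pairs as the
    cutoff sweeps across the distinct predicted probabilities."""
    pos = sum(actual)
    neg = len(actual) - pos
    points = []
    for cutoff in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, actual) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, actual) if p >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points

probs = [0.92, 0.81, 0.65, 0.48, 0.33, 0.21]   # illustrative
actual = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(probs, actual):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```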
Evaluating the Estimation of Continuous Outcomes
Table 9.4: Computed Error in Estimates of Average Balance for 10 Customers
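For continuous outcomes such as average balance, accuracy is typically summarized by error measures such as mean absolute error and root mean squared error. A minimal sketch with made-up balances, not the figures in Table 9.4:

```python
import math

def error_measures(actual, estimated):
    """Compute mean absolute error and root mean squared error."""
    errors = [a - e for a, e in zip(actual, estimated)]
    mae = sum(abs(err) for err in errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return mae, rmse

# Illustrative balances, not the textbook's data.
actual = [1200, 950, 1430, 800, 1010]
estimated = [1100, 990, 1380, 870, 1050]
mae, rmse = error_measures(actual, estimated)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}")
```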
Logistic Regression
Figure 9.4: Scatter Chart and Simple Linear Regression Fit for Oscars Example
Figure 9.5: Residuals for Simple Linear Regression on Oscars Data
An unmistakable pattern of systematic misprediction suggests that the simple linear regression model is not appropriate.
Figure 9.6: Logistic S-Curve for Oscars Example
Table 9.5: Predicted Probabilities by Logistic Regression for Oscars Example
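The S-curve in Figure 9.6 comes from the logistic function, which maps a linear combination of the predictors into a probability between 0 and 1: p = 1 / (1 + e^-(b0 + b1*x)). A minimal sketch; the coefficients below are made up for illustration, not the ones estimated for the Oscars example.

```python
import math

def logistic(x, b0, b1):
    """Predicted probability from a simple logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# b0 and b1 are hypothetical coefficients, not the Oscars estimates.
b0, b1 = -4.0, 0.8
for x in range(0, 11):
    print(f"x={x:2d}  p={logistic(x, b0, b1):.3f}")
```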
k-Nearest Neighbors
Classifying Categorical Outcomes with k-Nearest Neighbors
Table 9.6: Training Set Observations for k-NN Classifier
Figure 9.7: Scatter Chart for k-NN Classification
Table 9.7: Classification of Observation with Average Balance=900 and Age=28 for Different Values of k
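To mirror the comparison across values of k in Table 9.7, here is a minimal k-NN classifier sketch. The training observations are made up, and in practice the variables would first be standardized so that average balance does not dominate the distance calculation.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k):
    """Classify new_point by majority vote among its k nearest
    training observations (Euclidean distance)."""
    by_distance = sorted(training, key=lambda obs: math.dist(obs[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (scaled) observations: ((average balance, age), class).
training = [((0.9, 0.3), 1), ((0.8, 0.4), 1), ((0.2, 0.7), 0),
            ((0.3, 0.2), 0), ((0.7, 0.6), 1), ((0.1, 0.5), 0)]
new_point = (0.6, 0.35)
for k in (1, 3, 5):
    print(f"k={k}: class {knn_classify(training, new_point, k)}")
```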
Estimating Continuous Outcomes with k-Nearest Neighbors
Figure 9.8: Scatter Chart for k-NN Estimation
Table 9.8: Estimation of Average Balance for Observation with Age=28 for Different Values of k
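For a continuous outcome, the only change from the classifier above is that the k nearest neighbors' outcome values are averaged instead of voted on. A minimal sketch on made-up data:

```python
def knn_estimate(training, new_age, k):
    """Estimate the outcome as the mean outcome of the k nearest
    training observations."""
    by_distance = sorted(training, key=lambda obs: abs(obs[0] - new_age))
    nearest = by_distance[:k]
    return sum(value for _, value in nearest) / k

# Hypothetical observations: (age, average balance).
training = [(22, 700), (25, 850), (27, 980), (31, 1100), (35, 1250), (40, 1400)]
for k in (1, 3, 5):
    print(f"k={k}: estimated balance {knn_estimate(training, 28, k):.0f}")
```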
Classification and Regression Trees
Classifying Categorical Outcomes with a Classification Tree
Figure 9.9: Construction Sequence of Branches in a Classification Tree
Figure 9.10: Geometric Illustration of Classification Tree Partitions
The final partitioning results from the sequence of variable splits.
Figure 9.11: Classification Tree with One Pruned Branch
Table 9.9: Classification Error Rates on Sequence of Pruned Trees
Figure 9.12: Best-Pruned Classification Tree
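A classification tree encodes a sequence of variable splits as nested rules, with each leaf assigning the majority class of the training observations that fall in its partition. A minimal hand-coded sketch; the split variables and values are invented for illustration, not the splits in Figures 9.9-9.12:

```python
def classify(balance, age):
    """Hypothetical classification tree: the split points (1000, 30, 45)
    are invented for illustration."""
    if balance < 1000:
        if age < 30:
            return 1        # leaf: majority class 1
        return 0            # leaf: majority class 0
    if age < 45:
        return 0
    return 1

print(classify(balance=900, age=28))    # -> 1
print(classify(balance=1500, age=50))   # -> 1
```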
Estimating Continuous Outcomes with a Regression Tree
Figure 9.13: Geometric Illustration of First Six Rules of a Regression Tree
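A regression tree chooses each split to reduce the sum of squared deviations of the outcome within the resulting partitions. A minimal sketch of finding the best single split on one variable, on made-up data:

```python
def best_split(xs, ys):
    """Find the split value on x that minimizes the total within-partition
    sum of squared deviations of y."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for cut in sorted(set(xs))[1:]:                  # candidate split points
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best

# Hypothetical data: age vs. average balance.
ages = [22, 25, 27, 31, 35, 40]
balances = [700, 850, 980, 1100, 1250, 1400]
print(best_split(ages, balances))   # (best cut value, resulting SSE)
```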
Ensemble Methods
In the bagging approach, the committee of individual base models is generated by first constructing multiple training sets through repeated random sampling, with replacement, of the n observations in the original data.
Table 9.10: Original 10-Observation Training Data
Table 9.11: Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees
Table 9.12: Classification of 10 Observations from Validation Set with Bagging Ensemble
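A minimal sketch of the bagging procedure described above: bootstrap samples are drawn with replacement from the training data, a base classifier is fit to each, and the ensemble classifies new observations by majority vote. The base learner here is a one-nearest-neighbor rule chosen only to keep the sketch short, and the data are made up, not the observations in Tables 9.10-9.12.

```python
import random
from collections import Counter

def nearest_label(sample, x):
    """1-NN base classifier: label of the closest training x."""
    return min(sample, key=lambda obs: abs(obs[0] - x))[1]

def bagging_classify(training, x, n_models=10, seed=1):
    """Majority vote of base classifiers fit to bootstrap resamples."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        boot = rng.choices(training, k=len(training))  # sample with replacement
        votes.append(nearest_label(boot, x))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical training data: (predictor value, class label).
training = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1),
            (6, 1), (7, 0), (8, 1), (9, 1), (10, 1)]
print(bagging_classify(training, x=4.5))
```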