Machine Learning I:
Model Assessment and Selection
Patrick Hall
Visiting Faculty, Department of Decision Sciences
George Washington University
Lecture 3 Agenda
Where are we in the modeling lifecycle?
Data Collection & ETL
Feature Selection & Engineering
Supervised Learning
Unsupervised Learning
Deployment
Cost Intensive
Revenue Generating
Assessment & Validation
Regression Assessment
Sum of Squares, RMSE, & R2
One-Variable Linear Regression
[Scatter plot: (Logarithm of) Price vs. Avg Growing Season Temp (Celsius), with three candidate regression lines scored by SSE = 10.15, SSE = 6.03, and SSE = 5.73]
Adapted from MIT Sloan Analytics Edge
SSM = Σ(ŷ(i) − ȳ)² estimates how different the current model is from ȳ, the "naive" or "null" model.
SSE = Σ(y(i) − ŷ(i))² estimates how well the current model fits the training data; it is the quantity minimized in OLS regression.
SST = Σ(y(i) − ȳ)² measures total variation around ȳ, and SST = SSE + SSM.
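These quantities are straightforward to compute directly. Below is a minimal sketch in Python, where the y and y_hat arrays are illustrative stand-ins for actual values and model predictions (not data from the slides):

```python
import numpy as np

# Illustrative actuals and predictions -- stand-ins, not the wine data from the plot
y = np.array([7.2, 7.8, 8.1, 6.9, 7.5])      # actual values, y(i)
y_hat = np.array([7.0, 7.9, 8.0, 7.1, 7.4])  # model predictions, y-hat(i)
y_bar = y.mean()                              # the "naive" or "null" model

sse = np.sum((y - y_hat) ** 2)      # error remaining after fitting the model; minimized by OLS
ssm = np.sum((y_hat - y_bar) ** 2)  # how far the model's predictions move away from y-bar
sst = np.sum((y - y_bar) ** 2)      # total variation around y-bar; equals SSE + SSM for OLS fits

rmse = np.sqrt(sse / len(y))        # RMSE: SSE rescaled back to the units of y
r2 = 1 - sse / sst                  # R2: share of total variation explained by the model
print(sse, ssm, sst, rmse, r2)
```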
Error Measures & Assessment
Adapted from MIT Sloan Analytics Edge
Coefficient of Determination: R2
[Scatter plot: (Logarithm of) Price vs. Avg Growing Season Temp (Celsius)]
Adapted from MIT Sloan Analytics Edge
Interpreting R2: Goodness of Fit
R2 captures the value added from using a linear model instead of the naive mean model ȳ: R2 = 1 − SSE/SST = SSM/SST
Unitless and universally interpretable, R2 ranges from 0 (no improvement over predicting ȳ) to 1 (a perfect fit on the training data)
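As a quick cross-check, scikit-learn's r2_score computes the same 1 − SSE/SST quantity (continuing with the y and y_hat arrays assumed in the sketch above):

```python
from sklearn.metrics import r2_score

# Should match the manual r2 = 1 - sse / sst calculation above
print(r2_score(y, y_hat))
```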
Adapted from MIT Sloan Analytics Edge
Classification Assessment
Confusion Matrix, ROC and AUC, & Lift
Confusion Matrix
Source: https://en.wikipedia.org/wiki/Confusion_matrix
Confusion Matrix Metrics
Source: https://en.wikipedia.org/wiki/Confusion_matrix
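A minimal sketch of building a confusion matrix and the usual metrics from it with scikit-learn; the label vectors are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Illustrative actual labels and predicted classes at some fixed cutoff
actual    = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# For labels [0, 1], sklearn arranges the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)    # positive predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, sensitivity, specificity, precision, accuracy)
```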
ROC Curve
Image: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:Roccurves.png
ROC Calculation
Actual | Predicted Probability | Cutoff | Predicted Class | Outcome |
1 | 0.85 | 0.0 | 1 | TP |
1 | 0.75 | 0.0 | 1 | TP |
1 | 0.7 | 0.0 | 1 | TP |
1 | 0.65 | 0.0 | 1 | TP |
0 | 0.65 | 0.0 | 1 | FP |
1 | 0.55 | 0.0 | 1 | TP |
0 | 0.55 | 0.0 | 1 | FP |
0 | 0.45 | 0.0 | 1 | FP |
0 | 0.3 | 0.0 | 1 | FP |
0 | 0.1 | 0.0 | 1 | FP |
Outcome | Calculation |
TP | 5 |
FP | 5 |
FN | 0 |
TN | 0 |
Sensitivity | 5/5=1 |
Specificity | 0/5=0 |
1-Specificity | 1-0=1 |
Sensitivity (TP/(TP+FN))
Specificity (TN/(TN+FP))
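The bookkeeping above can be reproduced in a few lines of Python. A sketch using the same ten actual/score pairs, assuming the decision rule the tables imply (predict 1 when the score is at or above the cutoff):

```python
def roc_point(actuals, scores, cutoff):
    """Return (1 - specificity, sensitivity) for a single probability cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]  # rule implied by the tables
    tp = sum(1 for a, p in zip(actuals, preds) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actuals, preds) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actuals, preds) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actuals, preds) if a == 0 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 1 - specificity, sensitivity

actuals = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1]
print(roc_point(actuals, scores, 0.0))  # (1.0, 1.0), matching the cutoff = 0.0 table
```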
ROC Calculation
Actual | Predicted Probability | Cutoff | Predicted Class | Outcome |
1 | 0.85 | 0.2 | 1 | TP |
1 | 0.75 | 0.2 | 1 | TP |
1 | 0.7 | 0.2 | 1 | TP |
1 | 0.65 | 0.2 | 1 | TP |
0 | 0.65 | 0.2 | 1 | FP |
1 | 0.55 | 0.2 | 1 | TP |
0 | 0.55 | 0.2 | 1 | FP |
0 | 0.45 | 0.2 | 1 | FP |
0 | 0.3 | 0.2 | 1 | FP |
0 | 0.1 | 0.2 | 0 | TN |
Outcome | Calculation |
TP | 5 |
FP | 4 |
FN | 0 |
TN | 1 |
Sensitivity | 5/5=1 |
Specificity | 1/5=0.2 |
1-Specificity | 1-0.2=0.8 |
Sensitivity (TP/(TP+FN))
Specificity (TN/(TN+FP))
ROC Calculation
Actual | Predicted Probability | Cutoff | Predicted Class | Outcome |
1 | 0.85 | 0.4 | 1 | TP |
1 | 0.75 | 0.4 | 1 | TP |
1 | 0.7 | 0.4 | 1 | TP |
1 | 0.65 | 0.4 | 1 | TP |
0 | 0.65 | 0.4 | 1 | FP |
1 | 0.55 | 0.4 | 1 | TP |
0 | 0.55 | 0.4 | 1 | FP |
0 | 0.45 | 0.4 | 1 | FP |
0 | 0.3 | 0.4 | 0 | TN |
0 | 0.1 | 0.4 | 0 | TN |
Outcome | Calculation |
TP | 5 |
FP | 3 |
FN | 0 |
TN | 2 |
Sensitivity | 5/5=1 |
Specificity | 2/5=0.4 |
1-Specificity | 1-0.4=0.6 |
Sensitivity (TP/(TP+FN))
Specificity (TN/(TN+FP))
ROC Calculation
Actual | Predicted Probability | Cutoff | Predicted Class | Outcome |
1 | 0.85 | 0.6 | 1 | TP |
1 | 0.75 | 0.6 | 1 | TP |
1 | 0.7 | 0.6 | 1 | TP |
1 | 0.65 | 0.6 | 1 | TP |
0 | 0.65 | 0.6 | 1 | FP |
1 | 0.55 | 0.6 | 0 | FN |
0 | 0.55 | 0.6 | 0 | TN |
0 | 0.45 | 0.6 | 0 | TN |
0 | 0.3 | 0.6 | 0 | TN |
0 | 0.1 | 0.6 | 0 | TN |
Outcome | Calculation |
TP | 4 |
FP | 1 |
FN | 1 |
TN | 4 |
Sensitivity | 4/5=0.8 |
Specificity | 4/5=0.8 |
1-Specificity | 1-0.8=0.2 |
Sensitivity (TP/(TP+FN))
Specificity (TN/(TN+FP))
ROC Calculation
Actual | Predicted Probability | Cutoff | Predicted Class | Outcome |
1 | 0.85 | 0.8 | 1 | TP |
1 | 0.75 | 0.8 | 0 | FN |
1 | 0.7 | 0.8 | 0 | FN |
1 | 0.65 | 0.8 | 0 | FN |
0 | 0.65 | 0.8 | 0 | TN |
1 | 0.55 | 0.8 | 0 | FN |
0 | 0.55 | 0.8 | 0 | TN |
0 | 0.45 | 0.8 | 0 | TN |
0 | 0.3 | 0.8 | 0 | TN |
0 | 0.1 | 0.8 | 0 | TN |
Outcome | Calculation |
TP | 1 |
FP | 0 |
FN | 4 |
TN | 5 |
Sensitivity | 1/5=0.2 |
Specificity | 5/5=1 |
1-Specificity | 1-1=0 |
Sensitivity (TP/(TP+FN))
Specificity (TN/(TN+FP))
ROC Calculation
Actual | Predicted Probability | Cutoff | Predicted Class | Outcome |
1 | 0.85 | 1 | 0 | FN |
1 | 0.75 | 1 | 0 | FN |
1 | 0.7 | 1 | 0 | FN |
1 | 0.65 | 1 | 0 | FN |
0 | 0.65 | 1 | 0 | TN |
1 | 0.55 | 1 | 0 | FN |
0 | 0.55 | 1 | 0 | TN |
0 | 0.45 | 1 | 0 | TN |
0 | 0.3 | 1 | 0 | TN |
0 | 0.1 | 1 | 0 | TN |
Outcome | Calculation |
TP | 0 |
FP | 0 |
FN | 5 |
TN | 5 |
Sensitivity | 0/5=0 |
Specificity | 5/5=1 |
1-Specificity | 1-1=0 |
Sensitivity (TP/(TP+FN))
Specificity (TN/(TN+FP))
ROC CURVE and AUC
Cutoff | 1-Specificity | Sensitivity |
0.0 | 1 | 1 |
0.2 | 0.8 | 1 |
0.4 | 0.6 | 1 |
0.6 | 0.2 | 0.8 |
0.8 | 0 | 0.2 |
1.0 | 0 | 0 |
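Sweeping the cutoffs traces the ROC curve, and the area under it is the AUC. A sketch reusing the roc_point helper and data assumed in the earlier sketch, cross-checked against scikit-learn (which sweeps every distinct score as a cutoff, so its AUC can differ slightly from the coarse six-point grid):

```python
from sklearn.metrics import auc, roc_auc_score

cutoffs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
points = sorted(roc_point(actuals, scores, c) for c in cutoffs)  # (1 - specificity, sensitivity)

fpr = [p[0] for p in points]  # x-axis: 1 - specificity (false positive rate)
tpr = [p[1] for p in points]  # y-axis: sensitivity (true positive rate)

# Trapezoidal area under the six swept points vs. scikit-learn's exact AUC
print(auc(fpr, tpr), roc_auc_score(actuals, scores))
```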
LIFT Calculation and Plot
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html
Source: https://www.geeksforgeeks.org/understanding-gain-chart-and-lift-chart/
LIFT PLOT CALCULATIONS
Actual | Predicted |
1 | 0.85 |
1 | 0.75 |
1 | 0.7 |
1 | 0.65 |
0 | 0.65 |
1 | 0.55 |
0 | 0.55 |
0 | 0.45 |
0 | 0.3 |
0 | 0.1 |
Depth | Lift | Cumulative Lift |
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2 |
40 | | |
60 | | |
80 | | |
100 | | |
LIFT PLOT CALCULATIONS
Actual | Predicted |
1 | 0.85 |
1 | 0.75 |
1 | 0.7 |
1 | 0.65 |
0 | 0.65 |
1 | 0.55 |
0 | 0.55 |
0 | 0.45 |
0 | 0.3 |
0 | 0.1 |
Depth | Lift | Cumulative Lift |
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2 |
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4 |
60 | | |
80 | | |
100 | | |
LIFT PLOT CALCULATIONS
Actual | Predicted |
1 | 0.85 |
1 | 0.75 |
1 | 0.7 |
1 | 0.65 |
0 | 0.65 |
1 | 0.55 |
0 | 0.55 |
0 | 0.45 |
0 | 0.3 |
0 | 0.1 |
Depth | Lift | Cumulative Lift |
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2 |
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4 |
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6 |
80 | | |
100 | | |
LIFT PLOT CALCULATIONS
Actual | Predicted |
1 | 0.85 |
1 | 0.75 |
1 | 0.7 |
1 | 0.65 |
0 | 0.65 |
1 | 0.55 |
0 | 0.55 |
0 | 0.45 |
0 | 0.3 |
0 | 0.1 |
Depth | Lift | Cumulative Lift |
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2 |
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4 |
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6 |
80 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0)/5)/0.8 |
100 | | |
LIFT PLOT CALCULATIONS
Actual | Predicted |
1 | 0.85 |
1 | 0.75 |
1 | 0.7 |
1 | 0.65 |
0 | 0.65 |
1 | 0.55 |
0 | 0.55 |
0 | 0.45 |
0 | 0.3 |
0 | 0.1 |
Depth | Lift | Cumulative Lift |
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2 |
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4 |
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6 |
80 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0)/5)/0.8 |
100 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0+0+0)/5)/1.0 |
LIFT PLOT
Depth | Lift | Cumulative Lift |
20 | 2 | 2 |
40 | 2 | 2 |
60 | 1 | 1.67 |
80 | 0 | 1.25 |
100 | 0 | 1 |
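A short sketch of the same depth-wise lift bookkeeping in Python, reusing the actuals and scores assumed in the ROC sketch; ties in the scores are broken so that actual positives rank first, matching the row order in the tables above:

```python
def lift_table(actuals, scores, n_bins=5):
    """Lift and cumulative lift by depth, after sorting records by descending score."""
    # Sorting (score, actual) pairs in reverse puts actual 1s ahead of 0s on tied scores
    ranked = [a for _, a in sorted(zip(scores, actuals), reverse=True)]
    total_pos = sum(ranked)
    bin_size = len(ranked) // n_bins
    rows = []
    for i in range(n_bins):
        depth = (i + 1) / n_bins
        bin_pos = sum(ranked[i * bin_size:(i + 1) * bin_size])  # positives in this bin
        cum_pos = sum(ranked[:(i + 1) * bin_size])              # positives captured so far
        lift = (bin_pos / total_pos) / (1 / n_bins)
        cum_lift = (cum_pos / total_pos) / depth
        rows.append((int(depth * 100), lift, cum_lift))
    return rows

for depth, lift, cum_lift in lift_table(actuals, scores):
    print(depth, round(lift, 2), round(cum_lift, 2))  # 20 2.0 2.0, 40 2.0 2.0, 60 1.0 1.67, ...
```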
Model Selection
Via the Bias-Variance Trade-off
Model Selection and Assessment
Adapted from An Introduction to Statistical Learning
Bias-Variance Decomposition
Adapted from An Introduction to Statistical Learning
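For reference, the standard decomposition from An Introduction to Statistical Learning, written out in LaTeX; f̂ is the fitted model evaluated at a new point x₀ and ε is the irreducible noise:

```latex
\mathbb{E}\!\left[\big(y_0 - \hat{f}(x_0)\big)^2\right]
  = \operatorname{Var}\!\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(\varepsilon)
```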
The Bias-Variance Trade-off
Adapted from An Introduction to Statistical Learning
The Bias-Variance Trade-off
[Plot: error vs. number of parameters/rules; the reported validation error is the sum of the decomposed components (bias, variance, and random error), and the best model sits at its minimum]
Model Performance and Assessment
Adapted from An Introduction to Statistical Learning
[Iteration plot: error vs. number of input variables for training, validation, and test data; the validation curve identifies the best number of variables, and the test error gives the best guess at real-world performance]
Bias-Variance Trade-off in Practice: Honest assessment
Approach 1 (train / validate / test): available labeled data is partitioned into Train (estimate parameters or rules), Validate (model selection and hyper-parameter tuning), and Test (final honest assessment). Best suited for big data.
Approach 2 (train and cross-validate, then test): available labeled data is partitioned into a Train and Cross-Validate set, used both to estimate parameters or rules and for model selection and hyper-parameter tuning, and a held-out Test set for the final honest assessment. Nearly always a more generalizable approach, but computationally intensive.
In either approach, leakage between partitions results in overly optimistic test error measurements.
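A minimal sketch of both partitioning schemes with scikit-learn; the dataset, estimator, and split fractions are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for available labeled data

# Approach 1: train / validate / test split (well suited to big data)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # estimate parameters or rules
valid_score = model.score(X_valid, y_valid)                      # model selection / tuning
test_score = model.score(X_test, y_test)                         # final honest assessment

# Approach 2: cross-validate on the non-test data, hold the test set out for honest assessment
X_cv, X_hold, y_cv, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_cv, y_cv, cv=5)  # selection / tuning
final_model = LogisticRegression(max_iter=1000).fit(X_cv, y_cv)
hold_score = final_model.score(X_hold, y_hold)                   # final honest assessment

print(valid_score, test_score, cv_scores.mean(), hold_score)
```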
Reading