1 of 34

Machine Learning I:

Model Assessment and Selection

Patrick Hall

Visiting Faculty, Department of Decision Sciences

George Washington University

2 of 34

Lecture 3 Agenda

  • Regression Assessment
  • Classification Assessment
  • Model Selection via the Bias-Variance Trade-off
  • Reading

3 of 34

Where are we in the modeling lifecycle?

Data Collection & ETL

Feature Selection & Engineering

Supervised Learning

Unsupervised Learning

Deployment

Cost Intensive

Revenue Generating

Assessment & Validation

4 of 34

Regression Assessment

Sum of Squares, RMSE, & R²

5 of 34

One-Variable Linear Regression

[Scatter plot: (Logarithm of) Price vs. Avg Growing Season Temp (Celsius), showing fitted lines with decreasing error: SSE = 10.15, SSE = 6.03, SSE = 5.73]

Adapted from MIT Sloan Analytics Edge

SSM = Σ(ŷ(i) − ȳ)²
Estimates how different the current model is from ȳ, the “naive” or “null” model.

SSE = Σ(y(i) − ŷ(i))²
Estimates how well the current model fits the training data; minimized in OLS regression.

SST = SSE + SSM
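A minimal NumPy sketch of these three quantities; the data below are made-up numbers, not the wine data from the plot:

```python
import numpy as np

# Hypothetical one-variable data (not the wine data from the plot)
x = np.array([15.2, 15.8, 16.1, 16.7, 17.3])   # e.g., growing-season temperature
y = np.array([6.8, 7.1, 7.4, 7.9, 8.3])        # e.g., log price

# Ordinary least squares fit of a one-variable line
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
y_bar = y.mean()

sse = np.sum((y - y_hat) ** 2)      # fit to the training data; minimized by OLS
ssm = np.sum((y_hat - y_bar) ** 2)  # distance from the "null" model y_bar
sst = np.sum((y - y_bar) ** 2)      # total sum of squares

print(sse, ssm, sst)                # for an OLS fit with intercept, SST = SSE + SSM
```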

6 of 34

Error Measures & Assessment

  • SSE can be hard to interpret
    • Depends on N
    • Units are hard to understand

  • Root-Mean-Square Error (RMSE)

  • Normalized by N and expressed in the units of the dependent variable

  • What if RMSE = $50 for the wine model?
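For reference, the RMSE definition these bullets rely on, written in the same notation as the sum-of-squares formulas above (a standard definition, not spelled out on the slide):

RMSE = √(SSE / N) = √( (1/N) Σ(y(i) − ŷ(i))² )

Because of the division by N and the square root, RMSE is expressed in the units of the dependent variable, which is what makes a value like $50 directly interpretable.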

Adapted from MIT Sloan Analytics Edge

7 of 34

Coefficient of Determination: R²

  • R² = 1 − (SSE/SST)

  • Compares the fitted model’s error (SSE) to a “baseline” model’s error (SST)

  • The baseline model does not use any variables, just the average of the target

  • It predicts the same outcome (price) regardless of the independent variable (temperature)
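A minimal sketch of the R² computation with hypothetical data (not the wine data); it fits the one-variable line with NumPy and compares it to the mean-only baseline:

```python
import numpy as np

# Hypothetical one-variable data (not the wine data)
x = np.array([15.2, 15.8, 16.1, 16.7, 17.3])
y = np.array([6.8, 7.1, 7.4, 7.9, 8.3])

slope, intercept = np.polyfit(x, y, 1)     # fitted one-variable model
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)             # fitted model error
sst = np.sum((y - y.mean()) ** 2)          # baseline (mean-only) model error
r2 = 1.0 - sse / sst
print(r2)                                  # 0 = no better than the mean, 1 = perfect fit
```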

[Scatter plot: (Logarithm of) Price vs. Avg Growing Season Temp (Celsius)]

Adapted from MIT Sloan Analytics Edge

8 of 34

Interpreting R²: Goodness of Fit

R² captures the value added from using a linear model:

    • R² = 0 means no improvement over the baseline
    • R² = 1 means a perfect predictive model

Although unitless and universally interpretable, R²:

    • Can still be hard to compare between problems
    • Useful models for easy problems will have R² ≈ 1
    • Useful models for hard problems can still have R² ≈ 0

Adapted from MIT Sloan Analytics Edge

9 of 34

Classification Assessment

Confusion Matrix, ROC and AUC, & Lift

10 of 34

Confusion Matrix

Source: https://en.wikipedia.org/wiki/Confusion_matrix

11 of 34

Confusion Matrix Metrics

  • Actual Condition
    • P: Presence of actual/true condition
    • N: Absence of actual/true condition
  • Predicted condition (at a specified probability cutoff)
    • PP: Model predicts presence of a condition
    • PN: Model predicts absence of a condition
  • True Positive (TP) - model correctly predicts the presence of the actual condition
  • True Negative (TN) - model correctly predicts the absence of the condition
  • False Positive (FP) - model incorrectly predicts the presence of a condition that is actually absent
  • False Negative (FN) - model incorrectly predicts the absence of a condition that is actually present
  • True positive rate (TPR) or sensitivity or recall or hit rate…
    • TP/P = TP/(TP+FN)
  • True negative rate (TNR) or specificity or selectivity…
    • TN/N = TN/(TN+FP)
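A minimal NumPy sketch of these counts and rates; the actual labels below reuse the ten-observation example from the ROC slides that follow, and the prediction vector is made up for illustration:

```python
import numpy as np

# Hypothetical labels and thresholded predictions (1 = condition present)
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

tpr = tp / (tp + fn)  # sensitivity / recall / hit rate
tnr = tn / (tn + fp)  # specificity / selectivity
print(tp, tn, fp, fn, tpr, tnr)
```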

Source: https://en.wikipedia.org/wiki/Confusion_matrix

12 of 34

ROC Curve

  • Graphical assessment of a binary classifier as its decision threshold is varied.
  • Construct the ROC curve by plotting 1 − specificity (false positive rate) on the x-axis and sensitivity (true positive rate) on the y-axis at various probability cutoff thresholds.
  • ROC AUC (area under the curve) is used as an aggregate measure of classification performance and for model comparison.
  • AUC ranges from 0.5 to 1: a model whose predictions are completely accurate has an AUC of 1.0, while one whose predictions are completely random has an AUC of 0.5.
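A minimal scikit-learn sketch (assuming scikit-learn is installed) of the ROC curve and AUC for the ten-observation example used on the next slides:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Ten-observation example used on the ROC calculation slides
y_true  = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # 1 - specificity, sensitivity
auc = roc_auc_score(y_true, y_score)
print(np.column_stack([thresholds, fpr, tpr]))
print(auc)  # scikit-learn evaluates every distinct score as a threshold, so this can
            # differ slightly from a coarse, hand-picked grid of cutoffs
```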

Image: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:Roccurves.png

13 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.0 | 1 | TP
1 | 0.75 | 0.0 | 1 | TP
1 | 0.7 | 0.0 | 1 | TP
1 | 0.65 | 0.0 | 1 | TP
0 | 0.65 | 0.0 | 1 | FP
1 | 0.55 | 0.0 | 1 | TP
0 | 0.55 | 0.0 | 1 | FP
0 | 0.45 | 0.0 | 1 | FP
0 | 0.3 | 0.0 | 1 | FP
0 | 0.1 | 0.0 | 1 | FP

Outcome | Count
TP | 5
FP | 5
FN | 0
TN | 0

Sensitivity (TP/(TP+FN)) = 5/5 = 1
Specificity (TN/(TN+FP)) = 0/5 = 0
1 − Specificity = 1 − 0 = 1

14 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.2 | 1 | TP
1 | 0.75 | 0.2 | 1 | TP
1 | 0.7 | 0.2 | 1 | TP
1 | 0.65 | 0.2 | 1 | TP
0 | 0.65 | 0.2 | 1 | FP
1 | 0.55 | 0.2 | 1 | TP
0 | 0.55 | 0.2 | 1 | FP
0 | 0.45 | 0.2 | 1 | FP
0 | 0.3 | 0.2 | 1 | FP
0 | 0.1 | 0.2 | 0 | TN

Outcome | Count
TP | 5
FP | 4
FN | 0
TN | 1

Sensitivity (TP/(TP+FN)) = 5/5 = 1
Specificity (TN/(TN+FP)) = 1/5 = 0.2
1 − Specificity = 1 − 0.2 = 0.8

15 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.4 | 1 | TP
1 | 0.75 | 0.4 | 1 | TP
1 | 0.7 | 0.4 | 1 | TP
1 | 0.65 | 0.4 | 1 | TP
0 | 0.65 | 0.4 | 1 | FP
1 | 0.55 | 0.4 | 1 | TP
0 | 0.55 | 0.4 | 1 | FP
0 | 0.45 | 0.4 | 1 | FP
0 | 0.3 | 0.4 | 0 | TN
0 | 0.1 | 0.4 | 0 | TN

Outcome | Count
TP | 5
FP | 3
FN | 0
TN | 2

Sensitivity (TP/(TP+FN)) = 5/5 = 1
Specificity (TN/(TN+FP)) = 2/5 = 0.4
1 − Specificity = 1 − 0.4 = 0.6

16 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.6 | 1 | TP
1 | 0.75 | 0.6 | 1 | TP
1 | 0.7 | 0.6 | 1 | TP
1 | 0.65 | 0.6 | 1 | TP
0 | 0.65 | 0.6 | 1 | FP
1 | 0.55 | 0.6 | 0 | FN
0 | 0.55 | 0.6 | 0 | TN
0 | 0.45 | 0.6 | 0 | TN
0 | 0.3 | 0.6 | 0 | TN
0 | 0.1 | 0.6 | 0 | TN

Outcome | Count
TP | 4
FP | 1
FN | 1
TN | 4

Sensitivity (TP/(TP+FN)) = 4/5 = 0.8
Specificity (TN/(TN+FP)) = 4/5 = 0.8
1 − Specificity = 1 − 0.8 = 0.2

17 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.8 | 1 | TP
1 | 0.75 | 0.8 | 0 | FN
1 | 0.7 | 0.8 | 0 | FN
1 | 0.65 | 0.8 | 0 | FN
0 | 0.65 | 0.8 | 0 | TN
1 | 0.55 | 0.8 | 0 | FN
0 | 0.55 | 0.8 | 0 | TN
0 | 0.45 | 0.8 | 0 | TN
0 | 0.3 | 0.8 | 0 | TN
0 | 0.1 | 0.8 | 0 | TN

Outcome | Count
TP | 1
FP | 0
FN | 4
TN | 5

Sensitivity (TP/(TP+FN)) = 1/5 = 0.2
Specificity (TN/(TN+FP)) = 5/5 = 1
1 − Specificity = 1 − 1 = 0

18 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 1.0 | 0 | FN
1 | 0.75 | 1.0 | 0 | FN
1 | 0.7 | 1.0 | 0 | FN
1 | 0.65 | 1.0 | 0 | FN
0 | 0.65 | 1.0 | 0 | TN
1 | 0.55 | 1.0 | 0 | FN
0 | 0.55 | 1.0 | 0 | TN
0 | 0.45 | 1.0 | 0 | TN
0 | 0.3 | 1.0 | 0 | TN
0 | 0.1 | 1.0 | 0 | TN

Outcome | Count
TP | 0
FP | 0
FN | 5
TN | 5

Sensitivity (TP/(TP+FN)) = 0/5 = 0
Specificity (TN/(TN+FP)) = 5/5 = 1
1 − Specificity = 1 − 1 = 0

19 of 34

ROC CURVE and AUC

Cutoff | 1 − Specificity | Sensitivity
0.0 | 1 | 1
0.2 | 0.8 | 1
0.4 | 0.6 | 1
0.6 | 0.2 | 0.8
0.8 | 0 | 0.2
1.0 | 0 | 0

  • ROC AUC is bounded between 0 and 1. Values at or below 0.5 indicate serious problems with the model; values above 0.5, approaching 1, indicate a better model.
  • AUC = 0.7 interpretation: “The probability that this model ranks a uniformly drawn random positive above a uniformly drawn random negative is 70%.”
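A short sketch that sweeps the same cutoffs as the preceding slides, recomputes each row of this table, and adds a trapezoidal area estimate over these six plotted points (a coarse AUC, which can understate the AUC computed over all distinct score thresholds):

```python
import numpy as np

y_true  = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1])

points = []
for cutoff in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    y_pred = (y_score > cutoff).astype(int)   # predict 1 when the score exceeds the cutoff
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)                     # sensitivity
    spec = tn / (tn + fp)                     # specificity
    points.append((1 - spec, sens))
    print(cutoff, 1 - spec, sens)

# Trapezoidal area under the six plotted points (a coarse AUC estimate)
xs = [p[0] for p in points][::-1]             # ascending 1 - specificity
ys = [p[1] for p in points][::-1]
auc_coarse = sum((x1 - x0) * (y0 + y1) / 2
                 for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))
print(auc_coarse)
```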

20 of 34

LIFT Calculation and Plot

  • Lift is a measure of the effectiveness of a classifier when compared against a random guess.
  • The lift plot displays model performance against a baseline.
    • The greater the area between the lift curve and the baseline, the better the model.
  • Lift calculation - the following quantile example illustrates the steps.
    • Arrange the observations in decreasing order of predicted probability.
    • Divide the data set into quantiles. Calculate the number of positives in each quantile and the cumulative number of positives up to each quantile.
    • Lift is the ratio of the number of positive observations up to a quantile using the model to the expected number of positives up to that quantile under a random model.
    • The lift plot charts lift on the vertical axis against the corresponding quantile on the horizontal axis.
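A minimal sketch of this quantile lift calculation for the ten-observation example on the slides that follow (NumPy only; 20-percentile bins as in the example):

```python
import numpy as np

# Ten-observation example, already sorted by decreasing predicted probability
actual    = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
predicted = np.array([0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1])

n = len(actual)
total_pos = actual.sum()
bin_size = n // 5                                 # 20-percentile bins, as on the slides

for i, depth in enumerate([0.2, 0.4, 0.6, 0.8, 1.0]):
    lo, hi = i * bin_size, (i + 1) * bin_size
    bin_pos = actual[lo:hi].sum()                 # positives captured in this bin
    cum_pos = actual[:hi].sum()                   # positives captured up to this depth
    lift = (bin_pos / total_pos) / 0.2            # vs. a random 20% of the data
    cum_lift = (cum_pos / total_pos) / depth      # vs. a random depth fraction of the data
    print(int(depth * 100), lift, round(cum_lift, 2))
```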

Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html

Source: https://www.geeksforgeeks.org/understanding-gain-chart-and-lift-chart/

21 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 |  |
60 |  |
80 |  |
100 |  |

  • In this example, the divisor for lift is 0.2 because depth increases in 20-percentile increments. That is, drawing randomly from the data, we expect to capture 20% of the total responses in 20% of the data.
  • The divisor for cumulative lift changes with depth because we accumulate both the expected responses and the actual responses.

22 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 |  |
80 |  |
100 |  |

23 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6
80 |  |
100 |  |

24 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6
80 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0)/5)/0.8
100 |  |

25 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6
80 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0)/5)/0.8
100 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0+0+0)/5)/1.0

26 of 34

LIFT PLOT

Depth (%) | Lift | Cumulative Lift
20 | 2 | 2
40 | 2 | 2
60 | 1 | 1.67
80 | 0 | 1.25
100 | 0 | 1

  • Lift decreases to 0 and cumulative lift decreases to 1 as depth approaches 100%.
  • The better the model, the higher the lift, particularly at low depths.
  • In this example, lift is measured in 20-percentile increments of depth.
  • Interpretation: “In the top 20% of predicted probabilities, this model captures 2 times more events than a random selection of 20% of the data.”

27 of 34

Model Selection

Via the Bias-Variance Trade-off

28 of 34

Model Selection and Assessment

  • The generalization performance of a learning method relates to its prediction capability on new, unseen data. We approximate this with independent test data.
  • Performance assessment is extremely important in practice, since it guides the choice of learning method or model and gives us a measure of the quality of the chosen model.
  • Two separate goals:
    • Model selection: Estimating the performance of different models in order to choose the best one.
    • Model assessment: Having chosen a final model, estimate its prediction error (generalization error) on new data.

Adapted from An Introduction to Statistical Learning

29 of 34

Bias-Variance Decomposition

  • Randomly divide the dataset into three parts: a training set (to fit the model), a validation set (to estimate prediction error for model selection), and a test set (to assess the generalization error of the final chosen model).
  • Typical split:
    • 50% training, 25% validation, and 25% testing
  • Validation methods:
    • Analytical: AUC, R², RMSE
    • Sample reuse: cross-validation
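A minimal sketch of the 50/25/25 split using scikit-learn's train_test_split; the data here are random placeholders, not from any example in the deck:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 rows, 10 features
X = np.random.randn(1000, 10)
y = np.random.randn(1000)

# 50% training, then split the remaining 50% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 500, 250, 250
```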

Adapted from An Introduction to Statistical Learning

30 of 34

The Bias-Variance Trade-off

  • In order to minimize the expected test error, we need to select a method that simultaneously achieves low variance and low bias.
  • Variance is the amount by which our model would change if we estimated it using a different training data set. In general, more flexible methods have higher variance.
  • Bias is the error introduced by approximating a real-life problem with a much simpler model. In general, simpler methods result in more bias.
  • Generally, more flexible methods increase variance and decrease bias, while simpler methods decrease variance and increase bias.
  • The bias-variance trade-off enables us to pick useful models that balance simplicity and complexity, and to do so in a quantitative manner.
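For reference, the standard decomposition these bullets describe, for squared error at a point x₀ (not written out on the slide):

E[(y − f̂(x₀))²] = [Bias(f̂(x₀))]² + Var(f̂(x₀)) + Var(ε)

where Var(ε) is the irreducible (random) error. More flexible models typically lower the bias term while raising the variance term; their sum is what validation error estimates.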

Adapted from An Introduction to Statistical Learning

31 of 34

The Bias-Variance Trade-off

[Figure: validation error plotted against the number of parameters/rules. The reported error curve is the sum of the decomposed error components: bias, variance, and random error. The best model sits where the validation error is lowest.]

32 of 34

Model Performance and Assessment

Adapted from An Introduction to Statistical Learning

[Iteration plot: error vs. number of input variables for training, validation, and test data. The best number of variables is chosen where validation error is lowest; test error gives the best guess at real-world performance.]

33 of 34

Bias-Variance Trade-off in Practice: Honest assessment

Two schemes for partitioning the available labeled data:

  • Train / Validate / Test: the training partition estimates parameters or rules, the validation partition drives model selection and hyper-parameter tuning, and the test partition provides the final honest assessment. Best suited for big data.
  • Train and cross-validate / Test: cross-validation on the training partition handles parameter estimation, model selection, and hyper-parameter tuning, while a held-out test partition provides the final honest assessment. Nearly always a more generalizable approach, but computationally intensive.

In either scheme, leakage between partitions results in overly optimistic test error measurements.
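A minimal sketch of the second scheme (cross-validate on the training partition, hold the test partition back for the final honest assessment), assuming scikit-learn; the estimator, hyper-parameter grid, and data are placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data; scores on pure random noise will hover near zero
X = np.random.randn(500, 10)
y = np.random.randn(500)

# Hold out a test partition that is never touched until the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection / hyper-parameter tuning via cross-validation on the training partition only
for alpha in [0.1, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5, scoring="r2")
    print(alpha, scores.mean())

# Final honest assessment: refit the chosen model and score it once on the test partition
chosen = Ridge(alpha=1.0).fit(X_train, y_train)   # alpha picked from the loop above (illustrative)
print(chosen.score(X_test, y_test))               # estimate of generalization performance
```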

34 of 34

Reading

  • Elements of Statistical Learning:
    • Sections 7.1 - 7.5
    • Section 7.10
  • Introduction to Data Mining
    • Sections 3.4 - 3.6
  • KDD-Cup 2004: Results and Analysis