1 of 34

Machine Learning I:

Model Assessment and Selection

Patrick Hall

Visiting Faculty, Department of Decision Sciences

George Washington University

2 of 34

Lecture 3 Agenda

  • Regression Assessment
  • Classification Assessment
  • Model Selection via the Bias-Variance Trade-off
  • Reading

3 of 34

Where are we in the modeling lifecycle?

Data Collection & ETL

Feature Selection & Engineering

Supervised Learning

Unsupervised Learning

Deployment

Cost Intensive

Revenue Generating

Assessment & Validation

4 of 34

Regression Assessment

Sum of Squares, RMSE, & R²

5 of 34

One-Variable Linear Regression

[Scatter plot: (Logarithm of) Price vs. Avg Growing Season Temp (Celsius), showing fitted lines with decreasing error: SSE = 10.15, SSE = 6.03, SSE = 5.73]

Adapted from MIT Sloan Analytics Edge

SSM = Σ(ŷ(i) − ȳ)²
Estimates how different the current model is from ȳ, the “naive” or “null” model.

SSE = Σ(y(i) − ŷ(i))²
Estimates how well the current model fits the training data; minimized in OLS regression.

SST = SSE + SSM
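A minimal NumPy sketch of these three quantities; the data below are made-up numbers, not the wine data from the plot:

```python
import numpy as np

# Hypothetical one-variable data (not the wine data from the plot)
x = np.array([15.2, 15.8, 16.1, 16.7, 17.3])   # e.g., growing-season temperature
y = np.array([6.8, 7.1, 7.4, 7.9, 8.3])        # e.g., log price

# Ordinary least squares fit of a one-variable line
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
y_bar = y.mean()

sse = np.sum((y - y_hat) ** 2)      # fit to the training data; minimized by OLS
ssm = np.sum((y_hat - y_bar) ** 2)  # distance from the "null" model y_bar
sst = np.sum((y - y_bar) ** 2)      # total sum of squares

print(sse, ssm, sst)                # for an OLS fit with intercept, SST = SSE + SSM
```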

6 of 34

Error Measures & Assessment

  • SSE can be hard to interpret
    • Depends on N
    • Units are hard to understand

  • Root-Mean-Square Error (RMSE)

  • Normalized by N and expressed in the units of the dependent variable

  • What if RMSE = $50 for the wine model?
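For reference, the RMSE definition these bullets rely on, written in the same notation as the sum-of-squares formulas above (a standard definition, not spelled out on the slide):

RMSE = √(SSE / N) = √( (1/N) Σ(y(i) − ŷ(i))² )

Because of the division by N and the square root, RMSE is expressed in the units of the dependent variable, which is what makes a value like $50 directly interpretable.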

Adapted from MIT Sloan Analytics Edge

7 of 34

Coefficient of Determination: R²

  • R² = 1 − (SSE/SST)

  • Compares the fitted model’s error (SSE) to a “baseline” model’s error (SST)

  • The baseline model does not use any variables, just the average of the target

  • It predicts the same outcome (price) regardless of the independent variable (temperature)
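A minimal sketch of the R² computation with hypothetical data (not the wine data); it fits the one-variable line with NumPy and compares it to the mean-only baseline:

```python
import numpy as np

# Hypothetical one-variable data (not the wine data)
x = np.array([15.2, 15.8, 16.1, 16.7, 17.3])
y = np.array([6.8, 7.1, 7.4, 7.9, 8.3])

slope, intercept = np.polyfit(x, y, 1)     # fitted one-variable model
y_hat = slope * x + intercept

sse = np.sum((y - y_hat) ** 2)             # fitted model error
sst = np.sum((y - y.mean()) ** 2)          # baseline (mean-only) model error
r2 = 1.0 - sse / sst
print(r2)                                  # 0 = no better than the mean, 1 = perfect fit
```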

[Scatter plot: (Logarithm of) Price vs. Avg Growing Season Temp (Celsius)]

Adapted from MIT Sloan Analytics Edge

8 of 34

Interpreting R²: Goodness of Fit

R² captures the value added from using a linear model:

    • R² = 0 means no improvement over the baseline
    • R² = 1 means a perfect predictive model

Although unitless and universally interpretable, R²:

    • Can still be hard to compare between problems
    • Useful models for easy problems will have R² ≈ 1
    • Useful models for hard problems can still have R² ≈ 0

Adapted from MIT Sloan Analytics Edge

9 of 34

Classification Assessment

Confusion Matrix, ROC and AUC, & Lift

10 of 34

Confusion Matrix

Source: https://en.wikipedia.org/wiki/Confusion_matrix

11 of 34

Confusion Matrix Metrics

  • Actual Condition
    • P: Presence of actual/true condition
    • N: Absence of actual/true condition
  • Predicted condition (at a specified probability cutoff)
    • PP: Model predicts presence of a condition
    • PN: Model predicts absence of a condition
  • True Positive (TP) - model correctly predicts the presence of the actual condition
  • True Negative (TN) - model correctly predicts the absence of the condition
  • False Positive (FP) - model incorrectly predicts the presence of a condition that is actually absent
  • False Negative (FN) - model incorrectly predicts the absence of a condition that is actually present
  • True positive rate (TPR) or sensitivity or recall or hit rate…
    • TP/P = TP/(TP+FN)
  • True negative rate (TNR) or specificity or selectivity…
    • TN/N = TN/(TN+FP)
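A minimal NumPy sketch of these counts and rates; the actual labels below reuse the ten-observation example from the ROC slides that follow, and the prediction vector is made up for illustration:

```python
import numpy as np

# Hypothetical labels and thresholded predictions (1 = condition present)
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

tpr = tp / (tp + fn)  # sensitivity / recall / hit rate
tnr = tn / (tn + fp)  # specificity / selectivity
print(tp, tn, fp, fn, tpr, tnr)
```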

Source: https://en.wikipedia.org/wiki/Confusion_matrix

12 of 34

ROC Curve

  • Graphical assessment of a binary classifier as its decision threshold is varied.
  • Construct the ROC curve by plotting 1 − specificity (false positive rate) on the x-axis and sensitivity (true positive rate) on the y-axis at various probability cutoff thresholds.
  • ROC AUC (area under the curve) is used as an aggregate measure of classification performance and for model comparison.
  • AUC ranges from 0.5 to 1: a model whose predictions are completely accurate has an AUC of 1.0, while one whose predictions are completely random has an AUC of 0.5.
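A minimal scikit-learn sketch (assuming scikit-learn is installed) of the ROC curve and AUC for the ten-observation example used on the next slides:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Ten-observation example used on the ROC calculation slides
y_true  = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # 1 - specificity, sensitivity
auc = roc_auc_score(y_true, y_score)
print(np.column_stack([thresholds, fpr, tpr]))
print(auc)  # scikit-learn evaluates every distinct score as a threshold, so this can
            # differ slightly from a coarse, hand-picked grid of cutoffs
```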

Image: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:Roccurves.png

13 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.0 | 1 | TP
1 | 0.75 | 0.0 | 1 | TP
1 | 0.7 | 0.0 | 1 | TP
1 | 0.65 | 0.0 | 1 | TP
0 | 0.65 | 0.0 | 1 | FP
1 | 0.55 | 0.0 | 1 | TP
0 | 0.55 | 0.0 | 1 | FP
0 | 0.45 | 0.0 | 1 | FP
0 | 0.3 | 0.0 | 1 | FP
0 | 0.1 | 0.0 | 1 | FP

Outcome | Count
TP | 5
FP | 5
FN | 0
TN | 0

Sensitivity (TP/(TP+FN)) = 5/5 = 1
Specificity (TN/(TN+FP)) = 0/5 = 0
1 − Specificity = 1 − 0 = 1

14 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.2 | 1 | TP
1 | 0.75 | 0.2 | 1 | TP
1 | 0.7 | 0.2 | 1 | TP
1 | 0.65 | 0.2 | 1 | TP
0 | 0.65 | 0.2 | 1 | FP
1 | 0.55 | 0.2 | 1 | TP
0 | 0.55 | 0.2 | 1 | FP
0 | 0.45 | 0.2 | 1 | FP
0 | 0.3 | 0.2 | 1 | FP
0 | 0.1 | 0.2 | 0 | TN

Outcome | Count
TP | 5
FP | 4
FN | 0
TN | 1

Sensitivity (TP/(TP+FN)) = 5/5 = 1
Specificity (TN/(TN+FP)) = 1/5 = 0.2
1 − Specificity = 1 − 0.2 = 0.8

15 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.4 | 1 | TP
1 | 0.75 | 0.4 | 1 | TP
1 | 0.7 | 0.4 | 1 | TP
1 | 0.65 | 0.4 | 1 | TP
0 | 0.65 | 0.4 | 1 | FP
1 | 0.55 | 0.4 | 1 | TP
0 | 0.55 | 0.4 | 1 | FP
0 | 0.45 | 0.4 | 1 | FP
0 | 0.3 | 0.4 | 0 | TN
0 | 0.1 | 0.4 | 0 | TN

Outcome | Count
TP | 5
FP | 3
FN | 0
TN | 2

Sensitivity (TP/(TP+FN)) = 5/5 = 1
Specificity (TN/(TN+FP)) = 2/5 = 0.4
1 − Specificity = 1 − 0.4 = 0.6

16 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.6 | 1 | TP
1 | 0.75 | 0.6 | 1 | TP
1 | 0.7 | 0.6 | 1 | TP
1 | 0.65 | 0.6 | 1 | TP
0 | 0.65 | 0.6 | 1 | FP
1 | 0.55 | 0.6 | 0 | FN
0 | 0.55 | 0.6 | 0 | TN
0 | 0.45 | 0.6 | 0 | TN
0 | 0.3 | 0.6 | 0 | TN
0 | 0.1 | 0.6 | 0 | TN

Outcome | Count
TP | 4
FP | 1
FN | 1
TN | 4

Sensitivity (TP/(TP+FN)) = 4/5 = 0.8
Specificity (TN/(TN+FP)) = 4/5 = 0.8
1 − Specificity = 1 − 0.8 = 0.2

17 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 0.8 | 1 | TP
1 | 0.75 | 0.8 | 0 | FN
1 | 0.7 | 0.8 | 0 | FN
1 | 0.65 | 0.8 | 0 | FN
0 | 0.65 | 0.8 | 0 | TN
1 | 0.55 | 0.8 | 0 | FN
0 | 0.55 | 0.8 | 0 | TN
0 | 0.45 | 0.8 | 0 | TN
0 | 0.3 | 0.8 | 0 | TN
0 | 0.1 | 0.8 | 0 | TN

Outcome | Count
TP | 1
FP | 0
FN | 4
TN | 5

Sensitivity (TP/(TP+FN)) = 1/5 = 0.2
Specificity (TN/(TN+FP)) = 5/5 = 1
1 − Specificity = 1 − 1 = 0

18 of 34

ROC Calculation

Actual | Predicted Probability | Cutoff | Predicted Class | Outcome
1 | 0.85 | 1.0 | 0 | FN
1 | 0.75 | 1.0 | 0 | FN
1 | 0.7 | 1.0 | 0 | FN
1 | 0.65 | 1.0 | 0 | FN
0 | 0.65 | 1.0 | 0 | TN
1 | 0.55 | 1.0 | 0 | FN
0 | 0.55 | 1.0 | 0 | TN
0 | 0.45 | 1.0 | 0 | TN
0 | 0.3 | 1.0 | 0 | TN
0 | 0.1 | 1.0 | 0 | TN

Outcome | Count
TP | 0
FP | 0
FN | 5
TN | 5

Sensitivity (TP/(TP+FN)) = 0/5 = 0
Specificity (TN/(TN+FP)) = 5/5 = 1
1 − Specificity = 1 − 1 = 0

19 of 34

ROC CURVE and AUC

Cutoff | 1 − Specificity | Sensitivity
0.0 | 1 | 1
0.2 | 0.8 | 1
0.4 | 0.6 | 1
0.6 | 0.2 | 0.8
0.8 | 0 | 0.2
1.0 | 0 | 0

  • ROC AUC is bounded between 0 and 1. Values at or below 0.5 indicate serious problems with the model; values above 0.5, approaching 1, indicate a better model.
  • AUC = 0.7 interpretation: “The probability that this model ranks a uniformly drawn random positive above a uniformly drawn random negative is 70%.”
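A short sketch that sweeps the same cutoffs as the preceding slides, recomputes each row of this table, and adds a trapezoidal area estimate over these six plotted points (a coarse AUC, which can understate the AUC computed over all distinct score thresholds):

```python
import numpy as np

y_true  = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1])

points = []
for cutoff in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    y_pred = (y_score > cutoff).astype(int)   # predict 1 when the score exceeds the cutoff
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)                     # sensitivity
    spec = tn / (tn + fp)                     # specificity
    points.append((1 - spec, sens))
    print(cutoff, 1 - spec, sens)

# Trapezoidal area under the six plotted points (a coarse AUC estimate)
xs = [p[0] for p in points][::-1]             # ascending 1 - specificity
ys = [p[1] for p in points][::-1]
auc_coarse = sum((x1 - x0) * (y0 + y1) / 2
                 for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))
print(auc_coarse)
```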

20 of 34

LIFT Calculation and Plot

  • Lift is a measure of the effectiveness of a classifier when compared against a random guess.
  • The lift plot displays model performance against a baseline.
    • The greater the area between the lift curve and the baseline, the better the model.
  • Lift calculation - the following quantile example illustrates the steps.
    • Arrange the observations in decreasing order of predicted probability.
    • Divide the data set into quantiles. Calculate the number of positives in each quantile and the cumulative number of positives up to each quantile.
    • Lift is the ratio of the number of positive observations up to a quantile using the model to the expected number of positives up to that quantile under a random model.
    • The lift plot charts lift on the vertical axis against the corresponding quantile on the horizontal axis.
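A minimal sketch of this quantile lift calculation for the ten-observation example on the slides that follow (NumPy only; 20-percentile bins as in the example):

```python
import numpy as np

# Ten-observation example, already sorted by decreasing predicted probability
actual    = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
predicted = np.array([0.85, 0.75, 0.7, 0.65, 0.65, 0.55, 0.55, 0.45, 0.3, 0.1])

n = len(actual)
total_pos = actual.sum()
bin_size = n // 5                                 # 20-percentile bins, as on the slides

for i, depth in enumerate([0.2, 0.4, 0.6, 0.8, 1.0]):
    lo, hi = i * bin_size, (i + 1) * bin_size
    bin_pos = actual[lo:hi].sum()                 # positives captured in this bin
    cum_pos = actual[:hi].sum()                   # positives captured up to this depth
    lift = (bin_pos / total_pos) / 0.2            # vs. a random 20% of the data
    cum_lift = (cum_pos / total_pos) / depth      # vs. a random depth fraction of the data
    print(int(depth * 100), lift, round(cum_lift, 2))
```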

Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html

Source: https://www.geeksforgeeks.org/understanding-gain-chart-and-lift-chart/

21 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 |  |
60 |  |
80 |  |
100 |  |

  • In this example, the divisor for lift is 0.2 because depth increases in 20-percentile increments. That is, drawing randomly from the data, we expect to capture 20% of the total responses in 20% of the data.
  • The divisor for cumulative lift changes with depth because we accumulate both the expected responses and the actual responses.

22 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 |  |
80 |  |
100 |  |

23 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6
80 |  |
100 |  |

24 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6
80 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0)/5)/0.8
100 |  |

25 of 34

LIFT PLOT CALCULATIONS

Actual | Predicted Probability
1 | 0.85
1 | 0.75
1 | 0.7
1 | 0.65
0 | 0.65
1 | 0.55
0 | 0.55
0 | 0.45
0 | 0.3
0 | 0.1

Depth (%) | Lift | Cumulative Lift
20 | ((1+1)/5)/0.2 | ((1+1)/5)/0.2
40 | ((1+1)/5)/0.2 | ((1+1+1+1)/5)/0.4
60 | ((0+1)/5)/0.2 | ((1+1+1+1+0+1)/5)/0.6
80 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0)/5)/0.8
100 | ((0+0)/5)/0.2 | ((1+1+1+1+0+1+0+0+0+0)/5)/1.0

26 of 34

LIFT PLOT

Depth (%) | Lift | Cumulative Lift
20 | 2 | 2
40 | 2 | 2
60 | 1 | 1.67
80 | 0 | 1.25
100 | 0 | 1

  • Lift decreases to 0 and cumulative lift decreases to 1 as depth approaches 100%.
  • The better the model, the higher the lift, particularly at low depths.
  • In this example, lift is measured in 20-percentile increments of depth.
  • Interpretation: “In the top 20% of predicted probabilities, this model captures 2 times more events than a random selection of 20% of the data.”

27 of 34

Model Selection

Via the Bias-Variance Trade-off

28 of 34

Model Selection and Assessment

  • The generalization performance of a learning method relates to its prediction capability on new, unseen data. We approximate this with independent test data.
  • Performance assessment is extremely important in practice, since it guides the choice of learning method or model and gives us a measure of the quality of the chosen model.
  • Two separate goals:
    • Model selection: Estimating the performance of different models in order to choose the best one.
    • Model assessment: Having chosen a final model, estimate its prediction error (generalization error) on new data.

Adapted from An Introduction to Statistical Learning

29 of 34

Bias-Variance Decomposition

  • Randomly divide the dataset into three parts: a training set (to fit the model), a validation set (to estimate prediction error for model selection), and a test set (to assess the generalization error of the final chosen model).
  • Typical split:
    • 50% training, 25% validation, and 25% testing
  • Validation methods:
    • Analytical: AUC, R², RMSE
    • Sample reuse: cross-validation
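A minimal sketch of the 50/25/25 split using scikit-learn's train_test_split; the data here are random placeholders, not from any example in the deck:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 rows, 10 features
X = np.random.randn(1000, 10)
y = np.random.randn(1000)

# 50% training, then split the remaining 50% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 500, 250, 250
```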

Adapted from An Introduction to Statistical Learning

30 of 34

The Bias-Variance Trade-off

  • In order to minimize the expected test error, we need to select a method that simultaneously achieves low variance and low bias.
  • Variance is the amount by which our model would change if we estimated it using a different training data set. In general, more flexible methods have higher variance.
  • Bias is the error introduced by approximating a real-life problem with a much simpler model. In general, simpler methods result in more bias.
  • Generally, more flexible methods increase variance and decrease bias, while simpler methods decrease variance and increase bias.
  • The bias-variance trade-off enables us to pick useful models that balance simplicity and complexity, and to do so in a quantitative manner.
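For reference, the standard decomposition these bullets describe, for squared error at a point x₀ (not written out on the slide):

E[(y − f̂(x₀))²] = [Bias(f̂(x₀))]² + Var(f̂(x₀)) + Var(ε)

where Var(ε) is the irreducible (random) error. More flexible models typically lower the bias term while raising the variance term; their sum is what validation error estimates.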

Adapted from An Introduction to Statistical Learning

31 of 34

The Bias-Variance Trade-off

[Figure: validation error plotted against the number of parameters/rules. The reported error curve is the sum of the decomposed error components: bias, variance, and random error. The best model sits where the validation error is lowest.]

32 of 34

Model Performance and Assessment

Adapted from An Introduction to Statistical Learning

[Iteration plot: error vs. number of input variables for training, validation, and test data. The best number of variables is chosen where validation error is lowest; test error gives the best guess at real-world performance.]

33 of 34

Bias-Variance Trade-off in Practice: Honest assessment

Two schemes for partitioning the available labeled data:

  • Train / Validate / Test: the training partition estimates parameters or rules, the validation partition drives model selection and hyper-parameter tuning, and the test partition provides the final honest assessment. Best suited for big data.
  • Train and cross-validate / Test: cross-validation on the training partition handles parameter estimation, model selection, and hyper-parameter tuning, while a held-out test partition provides the final honest assessment. Nearly always a more generalizable approach, but computationally intensive.

In either scheme, leakage between partitions results in overly optimistic test error measurements.
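A minimal sketch of the second scheme (cross-validate on the training partition, hold the test partition back for the final honest assessment), assuming scikit-learn; the estimator, hyper-parameter grid, and data are placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data; scores on pure random noise will hover near zero
X = np.random.randn(500, 10)
y = np.random.randn(500)

# Hold out a test partition that is never touched until the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection / hyper-parameter tuning via cross-validation on the training partition only
for alpha in [0.1, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5, scoring="r2")
    print(alpha, scores.mean())

# Final honest assessment: refit the chosen model and score it once on the test partition
chosen = Ridge(alpha=1.0).fit(X_train, y_train)   # alpha picked from the loop above (illustrative)
print(chosen.score(X_test, y_test))               # estimate of generalization performance
```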

34 of 34

Reading

  • Elements of Statistical Learning:
    • Sections 7.1 - 7.5
    • Section 7.10
  • Introduction to Data Mining
    • Sections 3.4 - 3.6
  • KDD-Cup 2004: Results and Analysis