
Lecture 33

Residuals

DATA 8

Fall 2020


Regression roadmap

  • Last Monday:
    • Least squares: finding the “best” line for a dataset
  • Wednesday: holiday - no class
  • Today:
    • Residuals: analyzing mistakes and errors
  • Next Monday:
    • Regression inference: understanding uncertainty


Errors and Residuals


Error in Estimation

  • error = actual value - estimate

  • Some errors are positive and some negative

  • To measure the rough size of the errors
    • square the errors to eliminate cancellation
    • take the mean of the squared errors
    • take the square root to fix the units
    • root mean square error (rmse)
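The steps above can be sketched in a few lines of NumPy (the actual/estimate numbers here are made up for illustration, not from the lecture demo):

```python
import numpy as np

# Hypothetical actual values and estimates (illustrative numbers only)
actual = np.array([3.0, 5.0, 7.0, 9.0])
estimate = np.array([2.5, 5.5, 6.0, 9.5])

errors = actual - estimate    # some positive, some negative
mse = np.mean(errors ** 2)    # squaring eliminates cancellation
rmse = np.sqrt(mse)           # square root restores the original units
```

Note that `np.mean(errors)` can be small even when individual errors are large, which is exactly why the squaring step matters.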


Residuals

  • Error in regression estimate

  • One residual corresponding to each point (x, y)

  • residual = observed y - regression estimate of y
             = observed y - height of regression line at x
             = vertical distance between the point and the best line

(Demo)
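A minimal sketch of the computation, using the standard-units formulas for the least-squares slope and intercept (the x/y values are made-up sample data, not the demo's dataset):

```python
import numpy as np

# Illustrative data (hypothetical, not the lecture demo)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1])

# Least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

fitted = intercept + slope * x    # height of the regression line at each x
residuals = y - fitted            # observed y minus the regression estimate
```

Each entry of `residuals` is the vertical distance (signed) from one point to the line.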


Regression Diagnostics


Example: Dugongs

(Demo)


Residual Plot

A scatter diagram of residuals

  • Should look like an unassociated blob for linear relations
  • But will show patterns for non-linear relations
  • Used to check whether linear regression is appropriate
  • Look for curves, trends, changes in spread, outliers, or any other patterns

(Demo)
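To see why a residual plot exposes non-linearity, fit a line to clearly curved synthetic data: the line misses the curve, and the residuals inherit it. A scatter of (x, residual) for this data would show a U-shape rather than a patternless blob (synthetic data, not the dugong demo):

```python
import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2                        # a clearly non-linear relation

# Least-squares line
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (intercept + slope * x)

# The residuals are uncorrelated with x (always true), yet they
# track the quadratic curve almost perfectly:
curve_corr = np.corrcoef(residuals, x ** 2)[0, 1]
```

So "zero correlation with x" does not mean "no pattern" -- the residual plot is what reveals the leftover curve.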


Properties of residuals

  • Residuals from a linear regression always have
    • Zero mean
      • (so rmse = SD of residuals)
    • Zero correlation with x
    • Zero correlation with the fitted values

  • These are all true no matter what the data look like
    • Just like deviations from the mean always average to zero

(Demo)
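All three properties can be checked numerically on arbitrary synthetic data (random numbers here; any dataset gives the same result, up to rounding):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)    # any data works; the properties hold regardless

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = intercept + slope * x
residuals = y - fitted

mean_resid = np.mean(residuals)                          # zero mean
corr_with_x = np.corrcoef(residuals, x)[0, 1]            # zero correlation with x
corr_with_fitted = np.corrcoef(residuals, fitted)[0, 1]  # zero correlation with fitted values
```

All three quantities come out at floating-point zero, regardless of the seed.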


Discussion Questions

How would we adjust our regression line…

  • if the average residual were 10?

  • if the residuals were positively correlated with x?

  • if the residuals were above 0 in the middle and below 0 on the left and right?


A Measure of Clustering


Correlation, Revisited

  • “The correlation coefficient measures how clustered the points are about a straight line.”

  • We can now quantify this statement.

(Demo)


SD of Fitted Values

  • SD of fitted values / SD of y = |r|

  • SD of fitted values = |r| * (SD of y)
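A quick numerical check of this identity on synthetic data (made-up random data with a negative slope, to exercise the absolute value):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = -2.0 * x + rng.normal(size=500)   # negative slope, so r < 0

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = intercept + slope * x

# SD of the fitted values equals |r| times the SD of y
lhs = np.std(fitted)
rhs = abs(r) * np.std(y)
```

The identity follows from fitted = intercept + slope * x, so SD(fitted) = |slope| * SD(x) = |r| * SD(y).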


Variance of Fitted Values

  • Variance = square of the SD = mean square of the deviations

  • Variance has weird units, but good math properties

  • Variance of fitted values / Variance of y = r²


A Variance Decomposition

By definition,

y = fitted values + residuals

Tempting (but wrong) to think that:

SD(y) = SD(fitted values) + SD(residuals)

But it is true that:

Var(y) = Var(fitted values) + Var(residuals)

(a result of the Pythagorean theorem!)
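The contrast between the wrong SD identity and the true variance identity is easy to verify on synthetic data (random numbers here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1.5 * x + rng.normal(size=300)

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = intercept + slope * x
residuals = y - fitted

# SDs do NOT add up, but variances do
sd_sum = np.std(fitted) + np.std(residuals)
var_sum = np.var(fitted) + np.var(residuals)
```

`var_sum` matches `np.var(y)` exactly (up to rounding), while `sd_sum` overshoots `np.std(y)` -- just as the legs of a right triangle sum to more than the hypotenuse.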


A Variance Decomposition

Var(y) = Var(fitted values) + Var(residuals)

  • Variance of fitted values / Variance of y = r²

  • Variance of residuals / Variance of y = 1 - r²


Residual Average and SD

  • The average of residuals is always 0

  • Variance of residuals / Variance of y = 1 - r²

  • SD of residuals = √(1 - r²) * (SD of y)

(Demo)
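This last formula is the one the discussion questions use, so it is worth one more numerical check on synthetic data (random numbers, not the lecture demo):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = 0.5 * x + rng.normal(size=400)

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (intercept + slope * x)

# SD of residuals versus the formula sqrt(1 - r^2) * SD of y
sd_resid = np.std(residuals)
formula = np.sqrt(1 - r ** 2) * np.std(y)
```

The two values agree to floating-point precision for any dataset.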


Discussion Question 1

Midterm: Average 70, SD 10

Final: Average 60, SD 15

r = 0.6

Fill in the blank:

The SD of the residuals is _______.


Discussion Question 2

Midterm: Average 70, SD 10

Final: Average 60, SD 15

r = 0.6

Fill in the blank:

For at least 75% of the students, the regression estimate of final score based on midterm score will be correct to within ___________ points.