
Lecture 33

Residuals

DATA 8

Fall 2020


Regression roadmap

  • Last Monday:
    • Least squares: finding the “best” line for a dataset
  • Wednesday: holiday - no class
  • Today:
    • Residuals: analyzing mistakes and errors
  • Next Monday:
    • Regression inference: understanding uncertainty


Errors and Residuals


Error in Estimation

  • error = actual value - estimate

  • Some errors are positive and some negative

  • To measure the rough size of the errors
    • square the errors to eliminate cancellation
    • take the mean of the squared errors
    • take the square root to fix the units
    • root mean square error (rmse)
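The steps above can be sketched in a few lines of NumPy (the actual/estimate numbers here are made up for illustration, not from the lecture demo):

```python
import numpy as np

# Hypothetical actual values and estimates (illustrative numbers only)
actual = np.array([3.0, 5.0, 7.0, 9.0])
estimate = np.array([2.5, 5.5, 6.0, 9.5])

errors = actual - estimate    # some positive, some negative
mse = np.mean(errors ** 2)    # squaring eliminates cancellation
rmse = np.sqrt(mse)           # square root restores the original units
```

Note that `np.mean(errors)` can be small even when individual errors are large, which is exactly why the squaring step matters.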


Residuals

  • Error in regression estimate

  • One residual corresponding to each point (x, y)

  • residual = observed y - regression estimate of y
             = observed y - height of regression line at x
             = vertical distance between the point and the best line

(Demo)
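A minimal sketch of the computation, using the standard-units formulas for the least-squares slope and intercept (the x/y values are made-up sample data, not the demo's dataset):

```python
import numpy as np

# Illustrative data (hypothetical, not the lecture demo)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1])

# Least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

fitted = intercept + slope * x    # height of the regression line at each x
residuals = y - fitted            # observed y minus the regression estimate
```

Each entry of `residuals` is the vertical distance (signed) from one point to the line.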


Regression Diagnostics


Example: Dugongs

(Demo)


Residual Plot

A scatter diagram of residuals

  • Should look like an unassociated blob for linear relations
  • But will show patterns for non-linear relations
  • Used to check whether linear regression is appropriate
  • Look for curves, trends, changes in spread, outliers, or any other patterns

(Demo)
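To see why a residual plot exposes non-linearity, fit a line to clearly curved synthetic data: the line misses the curve, and the residuals inherit it. A scatter of (x, residual) for this data would show a U-shape rather than a patternless blob (synthetic data, not the dugong demo):

```python
import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2                        # a clearly non-linear relation

# Least-squares line
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (intercept + slope * x)

# The residuals are uncorrelated with x (always true), yet they
# track the quadratic curve almost perfectly:
curve_corr = np.corrcoef(residuals, x ** 2)[0, 1]
```

So "zero correlation with x" does not mean "no pattern" -- the residual plot is what reveals the leftover curve.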


Properties of residuals

  • Residuals from a linear regression always have
    • Zero mean
      • (so rmse = SD of residuals)
    • Zero correlation with x
    • Zero correlation with the fitted values

  • These are all true no matter what the data look like
    • Just like deviations from the mean always average to zero

(Demo)
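All three properties can be checked numerically on arbitrary synthetic data (random numbers here; any dataset gives the same result, up to rounding):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)    # any data works; the properties hold regardless

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = intercept + slope * x
residuals = y - fitted

mean_resid = np.mean(residuals)                          # zero mean
corr_with_x = np.corrcoef(residuals, x)[0, 1]            # zero correlation with x
corr_with_fitted = np.corrcoef(residuals, fitted)[0, 1]  # zero correlation with fitted values
```

All three quantities come out at floating-point zero, regardless of the seed.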


Discussion Questions

How would we adjust our regression line…

  • if the average residual were 10?

  • if the residuals were positively correlated with x?

  • if the residuals were above 0 in the middle and below 0 on the left and right?


A Measure of Clustering


Correlation, Revisited

  • “The correlation coefficient measures how clustered the points are about a straight line.”

  • We can now quantify this statement.

(Demo)


SD of Fitted Values

  • SD of fitted values / SD of y = |r|

  • SD of fitted values = |r| * (SD of y)
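A quick numerical check of this identity on synthetic data (made-up random data with a negative slope, to exercise the absolute value):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = -2.0 * x + rng.normal(size=500)   # negative slope, so r < 0

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = intercept + slope * x

# SD of the fitted values equals |r| times the SD of y
lhs = np.std(fitted)
rhs = abs(r) * np.std(y)
```

The identity follows from fitted = intercept + slope * x, so SD(fitted) = |slope| * SD(x) = |r| * SD(y).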


Variance of Fitted Values

  • Variance = square of the SD = mean square of the deviations

  • Variance has weird units, but good math properties

  • Variance of fitted values / Variance of y = r²


A Variance Decomposition

By definition,

y = fitted values + residuals

Tempting (but wrong) to think that:

SD(y) = SD(fitted values) + SD(residuals)

But it is true that:

Var(y) = Var(fitted values) + Var(residuals)

(a result of the Pythagorean theorem!)
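The contrast between the wrong SD identity and the true variance identity is easy to verify on synthetic data (random numbers here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 1.5 * x + rng.normal(size=300)

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = intercept + slope * x
residuals = y - fitted

# SDs do NOT add up, but variances do
sd_sum = np.std(fitted) + np.std(residuals)
var_sum = np.var(fitted) + np.var(residuals)
```

`var_sum` matches `np.var(y)` exactly (up to rounding), while `sd_sum` overshoots `np.std(y)` -- just as the legs of a right triangle sum to more than the hypotenuse.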


A Variance Decomposition

Var(y) = Var(fitted values) + Var(residuals)

  • Variance of fitted values / Variance of y = r²

  • Variance of residuals / Variance of y = 1 - r²


Residual Average and SD

  • The average of residuals is always 0

  • Variance of residuals / Variance of y = 1 - r²

  • SD of residuals = √(1 - r²) * (SD of y)

(Demo)
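This last formula is the one the discussion questions use, so it is worth one more numerical check on synthetic data (random numbers, not the lecture demo):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = 0.5 * x + rng.normal(size=400)

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (intercept + slope * x)

# SD of residuals versus the formula sqrt(1 - r^2) * SD of y
sd_resid = np.std(residuals)
formula = np.sqrt(1 - r ** 2) * np.std(y)
```

The two values agree to floating-point precision for any dataset.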


Discussion Question 1

Midterm: Average 70, SD 10

Final: Average 60, SD 15

r = 0.6

Fill in the blank:

The SD of the residuals is _______.


Discussion Question 2

Midterm: Average 70, SD 10

Final: Average 60, SD 15

r = 0.6

Fill in the blank:

For at least 75% of the students, the regression estimate of final score based on midterm score will be correct to within ___________ points.