1 of 26

Lecture 31

Linear Regression

DATA 8

Fall 2020

2 of 26

Regression roadmap

  • Monday:
    • Association and correlation
  • Today:
    • Prediction, scatterplots and lines
  • Next Monday:
    • Least squares: finding the “best” line for a dataset
  • Next Wednesday: no class
  • Next Friday:
    • Residuals: analyzing mistakes and errors

3 of 26

Correlation (Review)

4 of 26

The Correlation Coefficient r

  • Measures linear association
  • Based on standard units
  • -1 ≤ r ≤ 1
    • r = 1: scatter is perfect straight line sloping up
    • r = -1: scatter is perfect straight line sloping down
  • r = 0: No linear association; uncorrelated

[Figure: example scatter plots with r = 0, 0.2, 0.5, 0.8, 0.99, and -0.5]

5 of 26

Definition of r

Correlation Coefficient (r) = average of the product of (x in standard units) and (y in standard units)

Measures how clustered the scatter is around a straight line
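
The definition above can be sketched in a few lines of Python. This is a minimal illustration using NumPy; the helper names `standard_units` and `correlation` and the small data arrays are assumptions made for this example, not taken from the slides.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """r = average of the product of x and y, both in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up illustrative data (not from the lecture)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 3, 1, 5, 6, 5])

r = correlation(x, y)
print(r)
```

Because both variables are converted to standard units first, r does not change if x or y is rescaled or shifted.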

6 of 26

Discussion Question

For each pair, which one will have a higher* value of r?

[Figure: scatter plots labeled a), b), c), and d)]

* here, “higher” means “bigger on the number line”

7 of 26

Care in Interpretation

8 of 26

Watch Out For ...

  • False conclusions of causation
  • Nonlinearity
  • Outliers
  • Ecological Correlations

(Demo)
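
Two of the hazards listed above, nonlinearity and outliers, can be illustrated with a short sketch. The data here are made up for the example, and `correlation` is a hypothetical helper that applies the standard-units definition of r.

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Nonlinearity: y is completely determined by x, yet r is (essentially) 0
# because the relationship is a parabola, not a line.
x = np.arange(-4, 5)
r_parabola = correlation(x, x ** 2)

# Outlier: ten points on a perfect line give r = 1 ...
line_x = np.arange(1, 11)
line_y = line_x.copy()
r_line = correlation(line_x, line_y)

# ... but adding a single extreme point makes r negative.
r_with_outlier = correlation(np.append(line_x, 30), np.append(line_y, -30))

print(r_parabola, r_line, r_with_outlier)
```

In both cases the number r alone would be misleading, which is why the scatter plot should be inspected before interpreting r.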

9 of 26

Chocolate and Nobel Prizes

10 of 26

11 of 26

Discussion question

True or False?

  1. If x and y have a correlation of 1, then one must cause the other.

  2. If the correlation of x and y is close to 0, then knowing one will never help us predict the other.

  3. If x and y have a correlation of -0.8, then they have a negative association.

12 of 26

Prediction

13 of 26

Galton's Heights

  • Oval shaped

  • Moderate positive correlation

  • How can we predict child height from mid-parent height?

14 of 26

Galton's Heights

15 of 26

Galton's Heights

16 of 26

Nearest Neighbor Regression

A method for prediction:

  • Group each x with similar (nearby) x values
  • Average the corresponding y values for each group

For each x value, the prediction is the average of the y values in its nearby group.

The graph of these predictions is the “graph of averages”.

If the association between x and y is linear, then points in the graph of averages tend to fall on a line.
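
The method above might be sketched as follows. The function name `nearest_neighbor_prediction`, the `width` parameter that defines "nearby", and the data are all assumptions made for this illustration; the slides do not specify how close a neighbor must be.

```python
import numpy as np

def nearest_neighbor_prediction(x, y, new_x, width=0.5):
    """Predict y at new_x as the average y of points whose x is within
    `width` of new_x. Returns nan if no point is that close."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nearby = np.abs(x - new_x) <= width
    if not nearby.any():
        return np.nan
    return y[nearby].mean()

# Made-up heights in inches (not Galton's data)
x = np.array([64.0, 64.3, 64.5, 68.0, 68.2, 72.0])
y = np.array([65.0, 66.0, 67.0, 68.0, 70.0, 71.0])

# One prediction per observed x value: the "graph of averages"
graph_of_averages = [nearest_neighbor_prediction(x, y, value) for value in x]

print(nearest_neighbor_prediction(x, y, 64.2))
print(nearest_neighbor_prediction(x, y, 68.0))
```

A wider window averages over more points and gives a smoother graph of averages; a narrower one follows the data more closely but is noisier.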

17 of 26

Where is the prediction line?

r = 0.99

18 of 26

Where is the prediction line?

r = 0.0

(Demo)

19 of 26

Linear Regression

20 of 26

Linear Regression

A statement about x and y pairs

  • Measured in standard units
  • Describing the deviation of x from 0 (the average of x's)
  • And the deviation of y from 0 (the average of y's)

On average, y deviates from 0 less than x deviates from 0

Not true for all points — a statement about averages
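
In symbols, the statement above can be written as follows. The subscript "su" for standard units and the hat for an estimate are notation assumed here, not taken from the slides.

```latex
\hat{y}_{\text{su}} = r \cdot x_{\text{su}},
\qquad
|r| \le 1 \;\Longrightarrow\; |\hat{y}_{\text{su}}| \le |x_{\text{su}}|
```

Since r is between -1 and 1, the estimated y is never further from 0 than x is, when both are measured in standard units.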

[Figure labels: regression line; correlation]

21 of 26

Slope & Intercept

22 of 26

Regression Line Equation

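In original units, the regression line has this equation:

(estimate of y - average of y) / SD of y = r * (given x - average of x) / SD of x

The left side is the estimated y in standard units; the right side is r times x in standard units.

Lines can be expressed by slope & intercept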

23 of 26

Regression Line

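Standard Units: the line passes through (0, 0); a run of 1 gives a rise of r.

Original Units: the line passes through (average x, average y); a run of SD x gives a rise of r * SD y.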

24 of 26

Slope and Intercept

estimate of y = slope * x + intercept

(Demo)
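
As a rough sketch, the slope and intercept can be computed from r, the averages, and the SDs. In original units the line rises r * SD y for every SD x of run, so the slope is r * SD y / SD x, and the line passes through (average x, average y), which fixes the intercept. The function names and the data below are assumptions made for this example.

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

def slope(x, y):
    """Slope of the regression line: r * SD of y / SD of x."""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept: average of y - slope * average of x."""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Made-up illustrative data (not from the lecture)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 3.0, 1.0, 5.0, 6.0, 5.0])

# estimate of y = slope * x + intercept
estimates = slope(x, y) * x + intercept(x, y)
print(slope(x, y), intercept(x, y))
```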

25 of 26

Discussion Question

Suppose we use linear regression to predict candy prices (in dollars) from sugar content (in grams). What are the units of each of the following?

  • r

  • The slope

  • The intercept

26 of 26

Discussion Question

A course has a midterm (average 70; standard deviation 10) and a really hard final (average 50; standard deviation 12).

If the scatter diagram comparing midterm & final scores for students has an oval shape with correlation 0.75, then...

What do you expect the average final score would be for students who scored 90 on the midterm?

How about 60 on the midterm?

(Demo)
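
One way to work through this question is sketched below: convert the midterm score to standard units, multiply by r, and convert back to final-exam points. The function name `predict_final` and its parameter names are hypothetical; the numbers are the ones given in the question.

```python
def predict_final(midterm, mid_avg=70, mid_sd=10, fin_avg=50, fin_sd=12, r=0.75):
    """Regression estimate of the final score for a given midterm score."""
    midterm_su = (midterm - mid_avg) / mid_sd   # midterm in standard units
    final_su = r * midterm_su                   # estimated final in standard units
    return final_su * fin_sd + fin_avg          # back to final-exam points

print(predict_final(90))  # 2 SDs above average -> 1.5 SDs above -> 68.0
print(predict_final(60))  # 1 SD below average -> 0.75 SDs below -> 41.0
```

In both cases the estimated final score is closer to its average, in standard units, than the midterm score was.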