1 of 18

Lecture 30

Linear Regression

DATA 8

Spring 2022

2 of 18

Announcements

3 of 18

Correlation Coefficient

4 of 18

The Correlation Coefficient r

  • Measures linear association
  • Based on standard units
  • -1 ≤ r ≤ 1
    • r = 1: scatter is perfect straight line sloping up
    • r = -1: scatter is perfect straight line sloping down
  • r = 0: No linear association; uncorrelated

(Scatter plots for r = 0, r = 0.2, r = 0.5, r = 0.8, r = 0.99, and r = -0.5)

5 of 18

Definition of r

Correlation Coefficient (r) = average of the product of (x in standard units) and (y in standard units)

Measures how clustered the scatter is around a straight line
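
The definition above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, not the course's own code; the function names `standard_units` and `correlation` and the small example arrays are assumptions chosen for this sketch.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    values = np.asarray(values, dtype=float)
    # np.std uses the population SD by default, which matches
    # the "average of products" definition of r.
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """r = average of the product of x and y, both in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up example data: two perfectly linear relationships.
x = np.array([1, 2, 3, 4, 5])
y_up = 2 * x + 1      # straight line sloping up
y_down = -3 * x + 10  # straight line sloping down

print(correlation(x, y_up))    # very close to 1
print(correlation(x, y_down))  # very close to -1
```

Because both variables are converted to standard units first, r does not depend on the units in which x and y were originally measured.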

6 of 18

Care in Interpretation

7 of 18

Watch Out For ...

  • False conclusions of causation
  • Nonlinearity
  • Outliers
  • Ecological Correlations

(Demo)
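
As one illustration of the nonlinearity caveat, the sketch below builds made-up data in which y is completely determined by x, yet r is essentially 0. The data and variable names are assumptions for this example only; `np.corrcoef` is NumPy's built-in correlation function.

```python
import numpy as np

# Made-up data: y depends on x exactly, but the relationship is a
# parabola rather than a line.
x = np.arange(-5, 6)   # -5, -4, ..., 5 (symmetric around 0)
y = x ** 2

# The off-diagonal entry of the correlation matrix is r.
r = np.corrcoef(x, y)[0, 1]
print(r)  # essentially 0, even though y is a function of x
```

Here r measures only linear association, so a value near 0 does not rule out a strong nonlinear relationship.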

8 of 18

Discussion question

True or False?

If the correlation of x and y is close to 0, then knowing one cannot help us predict the other.

9 of 18

Chocolate and Nobel Prizes

https://www.biostat.jhsph.edu/courses/bio621/misc/Chocolate%20consumption%20cognitive%20function%20and%20nobel%20laurates%20(NEJM).pdf

10 of 18

Prediction

11 of 18

Predicting Heights

  • Oval shaped

  • Moderate positive correlation

  • How can we predict child height from the parents’ average height?

(Scatter plot of child’s (adult) height against average of parents’ heights)

12 of 18

Approach to Prediction

(Scatter plot of child’s (adult) height against average of parents’ heights)

13 of 18

Predicted Heights

(Scatter plot of child’s (adult) height against average of parents’ heights)

14 of 18

Nearest Neighbor Regression

A method for prediction:

  • Group each x with similar (nearby) x values
  • Average the corresponding y values for each group

For each x value, the prediction is the average of the y values in its nearby group.

The graph of these predictions is the “graph of averages”.

If the association between x and y is linear, then points in the graph of averages tend to fall on a line.
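
The method described above might be sketched as follows. This is a hedged illustration, not the lecture's demo code: the function name `nearest_neighbor_prediction`, the `window` of 0.5 units that defines "nearby", and the synthetic height data are all assumptions made for this example.

```python
import numpy as np

def nearest_neighbor_prediction(x, y, new_x, window=0.5):
    """Predict y at new_x as the average of the y values whose x is
    within `window` of new_x. The window size is an assumed choice."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nearby = np.abs(x - new_x) <= window
    if not nearby.any():
        return np.nan  # no nearby points to average
    return y[nearby].mean()

# Synthetic (made-up) heights in inches, roughly linear with noise.
rng = np.random.default_rng(0)
parent_avg = rng.normal(68, 2, size=500)
child = 68 + 0.6 * (parent_avg - 68) + rng.normal(0, 2, size=500)

# One prediction per observed x value: the "graph of averages".
graph_of_averages = np.array(
    [nearest_neighbor_prediction(parent_avg, child, px) for px in parent_avg]
)
print(nearest_neighbor_prediction(parent_avg, child, 70))
```

Plotting `graph_of_averages` against `parent_avg` would show the predictions; with roughly linear data like this, they tend to fall near a line.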

15 of 18

Where is the prediction line?

r = 0.99

16 of 18

Where is the prediction line?

r = 0.0

(Demo)

17 of 18

Linear Regression

18 of 18

Linear Regression

A statement about x and y pairs

  • Measured in standard units (su)
  • Describing the deviation of x from 0 (the average of x's)
  • And the deviation of the corresponding y from 0 (the average of y's)

On average, y deviates from 0 less than x deviates from 0

Not true for all points — a statement about averages

Regression line (in standard units): predicted y = r × x, where r is the correlation
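
The statement about averages can be checked numerically. The sketch below assumes the standard-units form of the regression line, predicted y (su) = r × x (su); the helper names and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Synthetic (made-up) data with a moderate positive association.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

r = correlation(x, y)
x_su = standard_units(x)

# Regression prediction in standard units: multiply x (su) by r.
predicted_y_su = r * x_su

# Since -1 <= r <= 1, each prediction is no farther from 0
# than the corresponding x is.
print(r)
print(np.all(np.abs(predicted_y_su) <= np.abs(x_su)))
```

Individual y values can still lie farther from 0 than their x values; the claim concerns the predictions, which are averages.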