1 of 28

Lecture 30

Linear Regression

DATA 8

Fall 2017

Slides created by John DeNero (denero@berkeley.edu) and Ani Adhikari (adhikari@berkeley.edu)

2 of 28

Announcements

3 of 28

Correlation (Review)

4 of 28

The Correlation Coefficient r

  • Measures linear association
  • Based on standard units
  • -1 ≤ r ≤ 1
    • r = 1: scatter is perfect straight line sloping up
    • r = -1: scatter is perfect straight line sloping down
  • r = 0: No linear association; uncorrelated

[Figure: six scatter plots with r = 0, r = 0.2, r = 0.5, r = 0.8, r = 0.99, and r = -0.5]

5 of 28

Definition of r

Correlation Coefficient (r) = average of the product of (x in standard units) and (y in standard units)

Measures how clustered the scatter is around a straight line
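
A minimal sketch of this definition in Python with NumPy. The helper names `standard_units` and `correlation` and the small data set are illustrative assumptions, not taken from the slides; the SD used here is the population SD (NumPy's default).

```python
import numpy as np

def standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """r = average of the product of x and y, both in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up data for illustration only.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
r = correlation(x, y)
print(r)
```

The result should agree with `np.corrcoef(x, y)[0, 1]`, which computes the same quantity.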

6 of 28

Properties of Correlation

7 of 28

Properties of r

  • r is a pure number, with no units
  • r is not affected by changing units of measurement
  • r is not affected by switching the horizontal and vertical axes
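
These properties can be checked numerically. The sketch below uses made-up heights and weights (an assumption for illustration) and converts inches to centimeters and pounds to kilograms; the `correlation` helper is a hypothetical name.

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Made-up data for illustration only.
height_in = np.array([60.0, 62.0, 65.0, 68.0, 72.0])
weight_lb = np.array([110.0, 125.0, 140.0, 155.0, 190.0])

r_original = correlation(height_in, weight_lb)
# Changing units of measurement (inches -> cm, pounds -> kg).
r_metric = correlation(height_in * 2.54, weight_lb * 0.4536)
# Switching the horizontal and vertical axes.
r_swapped = correlation(weight_lb, height_in)

print(r_original, r_metric, r_swapped)  # all three should match
```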

8 of 28

Interpreting r

Watch out for:

  • Jumping to conclusions about causality
  • Non-linearity
  • Outliers
  • Ecological correlations, based on aggregates or averaged data

9 of 28

Interpreting r

Don't jump to conclusions about causality

10 of 28

Interpreting r

Watch out for non-linearity.

r = 0.0
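
As an illustration (with data chosen for this sketch, not taken from the slide), y can be completely determined by x and still have r = 0 when the relation is not linear:

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# A perfect parabola: y depends entirely on x, but not linearly.
x = np.arange(-3, 4)   # -3, -2, ..., 3
y = x ** 2
r = correlation(x, y)
print(r)  # essentially zero
```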

11 of 28

Interpreting r

Watch out for outliers.

r = 0.0
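
A small sketch of how much one point can matter. The five points below are made up for illustration: four lie on a perfect line, and a single added outlier brings r down to zero.

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Four points on a perfect straight line sloping up.
x_line = np.array([0.0, 1.0, 2.0, 3.0])
y_line = np.array([0.0, 1.0, 2.0, 3.0])
r_without = correlation(x_line, y_line)

# The same points plus one outlier at (4, -1).
x_all = np.append(x_line, 4.0)
y_all = np.append(y_line, -1.0)
r_with = correlation(x_all, y_all)

print(r_without, r_with)
```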

12 of 28

Interpreting r

Watch out for ecological correlations, based on aggregates or averaged data.

r = 0.98
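
The sketch below shows the effect on made-up data (three groups of three individuals, chosen for illustration): the group averages fall on a straight line even though the individual points show no linear association.

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Made-up individuals, three per group.
group = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
x = np.array([-1.0, 1.0, 0.0, 0.0, 2.0, 1.0, 1.0, 3.0, 2.0])
y = np.array([1.0, -1.0, 0.0, 2.0, 0.0, 1.0, 3.0, 1.0, 2.0])

r_individual = correlation(x, y)

# Replace each group by its average x and average y.
group_ids = np.unique(group)
x_means = np.array([x[group == g].mean() for g in group_ids])
y_means = np.array([y[group == g].mean() for g in group_ids])
r_group = correlation(x_means, y_means)

print(r_individual, r_group)
```

Averaging removes the spread within each group, so the correlation of the averages can greatly overstate the association among individuals.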

13 of 28

Attendance

14 of 28

Prediction

15 of 28

Galton's Heights

16 of 28

Galton's Heights

17 of 28

Galton's Heights

18 of 28

Where is the prediction line?

r = 0.99

19 of 28

Where is the prediction line?

r = 0.0

20 of 28

Where is the prediction line?

r = 0.5

21 of 28

Where is the prediction line?

r = 0.2

22 of 28

Nearest Neighbor Regression

A method for prediction:

  • Group each x with a representative x value (rounding)
  • Average the corresponding y values for each group

For each representative x value, the corresponding prediction is the average of the y values in the group.

Graph these predictions.

If the association between x and y is linear, then points in the graph of averages tend to fall on the regression line.
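
The method above can be sketched as follows. Rounding to the nearest whole number is an assumed choice of grouping, and the function name `graph_of_averages` and the data are hypothetical.

```python
import numpy as np

def graph_of_averages(x, y):
    """Group x by rounding, then average the y values in each group."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    representatives = np.round(x)          # representative x for each point
    groups = np.unique(representatives)    # sorted distinct representatives
    averages = np.array([y[representatives == g].mean() for g in groups])
    return groups, averages

# Made-up data for illustration only.
x = np.array([63.8, 64.1, 64.4, 66.7, 67.2, 67.9, 68.3])
y = np.array([64.0, 66.0, 65.0, 67.0, 69.0, 68.0, 70.0])

groups, averages = graph_of_averages(x, y)
for g, a in zip(groups, averages):
    print(g, a)   # prediction for representative x = g is a
```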

23 of 28

Regression to the Mean

A statement about x and y pairs

  • Measured in standard units
  • Describing the deviation of x from 0 (the average of x's)
  • And the deviation of y from 0 (the average of y's)

On average, y deviates from 0 less than x deviates from 0

Not true for all points — a statement about averages

Regression Line

Correlation
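
A short numerical sketch of this statement, assuming the regression prediction in standard units is r times x in standard units (the line through (0, 0) with slope r). The value r = 0.5 and the x values are chosen for illustration.

```python
import numpy as np

def predict_standard_units(r, x_su):
    """Regression prediction for y in standard units: r times x in standard units."""
    return r * np.asarray(x_su, dtype=float)

r = 0.5
x_su = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
predicted_y_su = predict_standard_units(r, x_su)

print(predicted_y_su)
# Each prediction is no farther from 0 than the corresponding x.
print(np.all(np.abs(predicted_y_su) <= np.abs(x_su)))
```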

24 of 28

Linear Regression

(Demo)

25 of 28

Slope & Intercept

26 of 28

Regression Line Equation

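In original units, the regression line has this equation:

(estimate of y - average of y) / SD y = r * (given x - average of x) / SD x

The left side is the estimate of y in standard units; the right side is r times x in standard units.

Lines can be expressed by slope & intercept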

27 of 28

Regression Line

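Standard Units: the line passes through (0, 0); a run of 1 gives a rise of r.

Original Units: the line passes through (Average x, Average y); a run of SD x gives a rise of r * SD y.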

28 of 28

Slope and Intercept

estimate of y = slope * x + intercept

(Demo)
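
A minimal sketch of computing the slope and intercept, assuming slope = r * SD y / SD x and intercept = average y - slope * average x (so that the line passes through the point of averages). The function names and data are illustrative, not the demo's code.

```python
import numpy as np

def correlation(x, y):
    """r = average of the product of x and y in standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

def slope(x, y):
    """Slope of the regression line in original units: r * SD y / SD x."""
    return correlation(x, y) * np.std(y) / np.std(x)

def intercept(x, y):
    """Intercept chosen so the line passes through (average x, average y)."""
    return np.mean(y) - slope(x, y) * np.mean(x)

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

m = slope(x, y)
b = intercept(x, y)
estimates = m * x + b   # estimate of y = slope * x + intercept
print(m, b)
```

For this data the result should match a least-squares fit such as `np.polyfit(x, y, 1)`.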