Lecture 30
Linear Regression
DATA 8
Fall 2017
Slides created by John DeNero (denero@berkeley.edu) and Ani Adhikari (adhikari@berkeley.edu)
Correlation (Review)
The Correlation Coefficient r
[Example scatter plots: r = 0, r = 0.2, r = 0.5, r = 0.8, r = 0.99, r = -0.5]
Definition of r
Correlation Coefficient (r) = average of the product of (x in standard units) and (y in standard units)
Measures how clustered the scatter is around a straight line
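The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names `standard_units` and `correlation` are chosen here, the data is made up, and the SD is assumed to be the population SD (dividing by n) so that the "average of products" formula applies directly.

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    avg = mean(values)
    sd = pstdev(values)  # population SD (divides by n); an assumption of this sketch
    return [(v - avg) / sd for v in values]

def correlation(x, y):
    """r = average of the products of x and y, both in standard units."""
    products = [a * b for a, b in zip(standard_units(x), standard_units(y))]
    return mean(products)

# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # points on an upward-sloping line
y_down = [10, 8, 6, 4, 2]  # points on a downward-sloping line

print(correlation(x, y_up))
print(correlation(x, y_down))
```

Points that fall exactly on an upward-sloping line give r of 1 (up to floating-point rounding), and points exactly on a downward-sloping line give r of -1.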
Properties of Correlation
Properties of r
Interpreting r
Watch out for:
Interpreting r
Don't jump to conclusions about causality
Interpreting r
Watch out for non-linearity.
r = 0.0
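A small sketch of why non-linearity matters, using made-up data: below, y is completely determined by x (y is x squared), yet r comes out as zero because the relationship is not linear. The helper `correlation` is defined here for illustration and assumes the population SD.

```python
from statistics import mean, pstdev

def correlation(x, y):
    """r as the average product of x and y in standard units (population SD)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

# Made-up data: a perfect parabola, symmetric around x = 0
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(abs(r) < 1e-9)  # r is zero up to floating-point rounding
```

An r of zero says only that there is no linear association; it does not say that x and y are unrelated.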
Interpreting r
Watch out for outliers.
r = 0.0
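To illustrate how much a single outlier can move r, here is a sketch with made-up data: four points lying exactly on a line, then the same four points plus one outlier. The helper `correlation` is an illustrative definition that assumes the population SD.

```python
from statistics import mean, pstdev

def correlation(x, y):
    """r as the average product of x and y in standard units (population SD)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

# Made-up data: four points exactly on the line y = x
x_line = [1, 2, 3, 4]
y_line = [1, 2, 3, 4]

# The same points plus a single outlier at (5, 0)
x_out = x_line + [5]
y_out = y_line + [0]

r_line = correlation(x_line, y_line)
r_out = correlation(x_out, y_out)
print(r_line, r_out)
```

Without the outlier r is 1 (up to rounding); with it, r drops to 0, even though four of the five points still lie on a straight line.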
Interpreting r
Watch out for ecological correlations, based on aggregates or averaged data.
r = 0.98
Prediction
Galton's Heights
Where is the prediction line?
[Scatter plots: r = 0.99, r = 0.0, r = 0.5, r = 0.2]
Nearest Neighbor Regression
A method for prediction:
For each representative x value, the corresponding prediction is the average of the y values in the group.
Graph these predictions.
If the association between x and y is linear, then points in the graph of averages tend to fall on the regression line.
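The method above could be sketched as follows. This is one possible reading, not the lecture's demo code: the "group" for a given x is assumed to be all points whose x value lies within a fixed `window` of it, and the names `nn_prediction`, `graph_of_averages`, and `window` are hypothetical. The data is made up.

```python
from statistics import mean

def nn_prediction(xs, ys, x_new, window=0.5):
    """Predict y at x_new as the average y of points with x within `window` of x_new."""
    nearby = [y for x, y in zip(xs, ys) if abs(x - x_new) <= window]
    return mean(nearby) if nearby else None  # None when the group is empty

def graph_of_averages(xs, ys, window=0.5):
    """One (x, predicted y) pair per distinct x value, in increasing x order."""
    return [(x_rep, nn_prediction(xs, ys, x_rep, window)) for x_rep in sorted(set(xs))]

# Made-up data for illustration only
xs = [1, 1.2, 2, 2.1, 3]
ys = [10, 12, 20, 22, 30]

print(nn_prediction(xs, ys, 1.0))   # average of the y values for x = 1 and x = 1.2
print(nn_prediction(xs, ys, 10.0))  # no nearby points
print(graph_of_averages(xs, ys))
```

The width of the window is a design choice: a narrow window follows the data closely but averages few points, while a wide window gives smoother but coarser predictions.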
Regression to the Mean
A statement about x and y pairs
Measured in standard units: on average, y deviates from 0 less than x deviates from 0
Not true for all points — a statement about averages
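A short numeric sketch of this effect, assuming the regression line in standard units passes through (0, 0) with slope r, so the predicted y is r times x. The function name `predict_standard_units` and the value r = 0.5 are chosen here for illustration only.

```python
def predict_standard_units(r, x_su):
    """Regression prediction in standard units: predicted y = r * x."""
    return r * x_su

r = 0.5  # illustrative correlation, not taken from any data set
x_values_su = [-2.0, -1.0, 0.0, 1.0, 2.0]
predictions = [predict_standard_units(r, x) for x in x_values_su]

for x, y_hat in zip(x_values_su, predictions):
    print(x, y_hat)
```

Whenever r is strictly between -1 and 1, each predicted y is closer to 0 than the corresponding x. Individual points can still land farther from 0 than their x values; the statement is about the averages.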
Regression Line
Correlation
Linear Regression
(Demo)
Slope & Intercept
Regression Line Equation
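In original units, the regression line has this equation:
(estimate of y - average of y) / SD of y = r * (given x - average of x) / SD of x
The left side is the estimate of y in standard units; the right side is r times x in standard units.
Lines can be expressed by slope & intercept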
Regression Line
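Standard Units: the line passes through (0, 0); a run of 1 gives a rise of r
Original Units: the line passes through (average x, average y); a run of SD x gives a rise of r * SD y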
Slope and Intercept
estimate of y = slope * x + intercept
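A minimal sketch of computing the slope and intercept, assuming the standard relationships slope = r * SD of y / SD of x and intercept = average of y - slope * average of x. The function names and the data are made up for illustration, and the population SD is assumed throughout.

```python
from statistics import mean, pstdev

def correlation(x, y):
    """r as the average product of x and y in standard units (population SD)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

def slope(x, y):
    """Slope of the regression line: r * SD of y / SD of x."""
    return correlation(x, y) * pstdev(y) / pstdev(x)

def intercept(x, y):
    """Intercept of the regression line: average of y - slope * average of x."""
    return mean(y) - slope(x, y) * mean(x)

def fitted_value(x, y, new_x):
    """estimate of y = slope * x + intercept"""
    return slope(x, y) * new_x + intercept(x, y)

# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(slope(x, y), intercept(x, y))
print(fitted_value(x, y, 6))
```

For this data the slope is about 0.6 and the intercept about 2.2 (up to floating-point rounding), so the estimate at x = 6 is about 5.8.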
(Demo)