Inference: Linear Regression
Data Analysis
Two variables: an independent variable and a dependent variable.
Linear regression can be used to describe this relationship.
The independent variable goes on the x-axis; the dependent variable goes on the y-axis.
Linear regression finds the best-fitting line through the data.
[Figure: two panels contrasting a best-fitting line with a line that is NOT a best-fitting line]
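What makes a line "best-fitting" can be checked directly: the least-squares line minimizes the sum of squared vertical distances to the points. A minimal sketch using R's built-in trees dataset (the same data used in the code later in these slides); the nudged slope is just an arbitrary alternative line for comparison:

```r
## fit the least-squares line: girth as a function of height
fit <- lm(Girth ~ Height, data = trees)

## sum of squared residuals for the fitted line
sse_best <- sum(resid(fit)^2)

## an alternative line: same intercept, slope nudged off the estimate
b0 <- coef(fit)[1]
b1 <- coef(fit)[2] + 0.1
sse_other <- sum((trees$Girth - (b0 + b1 * trees$Height))^2)

sse_best < sse_other  # TRUE: no other line has a smaller sum of squares
```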
Positive relationship:
- The larger the independent variable value, the larger the dependent variable tends to be
- The smaller the independent variable value, the smaller the dependent variable tends to be
Example: father’s height (independent) and son’s height (dependent) show a positive relationship:
- The taller the father, the taller the son tends to be
- The shorter the father, the shorter the son tends to be
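A quick illustration with simulated data (the heights below are hypothetical numbers, not real measurements): a positive relationship shows up as a positive correlation.

```r
set.seed(42)
## hypothetical father heights (inches); sons tend to track their fathers
father <- rnorm(100, mean = 69, sd = 3)
son    <- 35 + 0.5 * father + rnorm(100, sd = 2)

cor(father, son)  # positive: taller fathers, taller sons
```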
Negative relationship:
- The larger the independent variable value, the smaller the dependent variable tends to be
- The smaller the independent variable value, the larger the dependent variable tends to be
Example: number of absences (independent) and students’ grades (dependent) show a negative relationship:
- The more absences, the lower the grades tend to be
- The fewer the absences, the higher the grades tend to be
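The same check with simulated data (hypothetical absence counts and grades): a negative relationship shows up as a negative correlation.

```r
set.seed(1)
## hypothetical absence counts and grades that fall as absences rise
absences <- rpois(100, lambda = 3)
grades   <- 95 - 4 * absences + rnorm(100, sd = 5)

cor(absences, grades)  # negative: more absences, lower grades
```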
Relationships can be weaker (points scattered widely around the line) or stronger (points close to the line).
A stronger relationship means a higher correlation.
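This can be illustrated with simulated data: two outcomes built from the same underlying line, one with little scatter and one with a lot. The tighter cloud has the higher correlation.

```r
set.seed(7)
x <- rnorm(200)
## same underlying line, different amounts of scatter around it
y_tight <- 2 * x + rnorm(200, sd = 0.5)  # points close to the line
y_noisy <- 2 * x + rnorm(200, sd = 3)    # points far from the line

c(tight = cor(x, y_tight), noisy = cor(x, y_noisy))
```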
Example: Temperature plotted against Time.
Not a linear relationship. Do not use linear regression.
Not homoscedastic: points at one end of the plot are much further from the line than at the other end.
Do not use linear regression.
Residuals are the vertical distances from each observed data point to the regression line.
Linear regression assumes normality of the residuals.
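Residuals are easy to compute by hand: the observed value minus the value predicted by the line. A sketch with the trees data used later in these slides:

```r
fit <- lm(Girth ~ Height, data = trees)

## a residual is the observed value minus the predicted value
manual <- trees$Girth - predict(fit)

## matches what resid() returns
all.equal(as.numeric(manual), as.numeric(resid(fit)))  # TRUE
```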
𝛃 (beta) is the slope of the best-fitting line:
- Increasing 𝛃: a steeper line
- Positive 𝛃: the line slopes upward (positive relationship)
- Negative 𝛃: the line slopes downward (negative relationship)
𝛃 = 0: a flat line; no linear relationship between the variables.
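With simulated data where x and y are generated independently (so the true 𝛃 is 0), the estimated slope comes out near zero:

```r
set.seed(3)
## x and y generated independently, so the true slope is 0
x <- rnorm(1000)
y <- rnorm(1000)

beta_hat <- coef(lm(y ~ x))[2]
beta_hat  # estimated slope lands near 0
```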
The standard error (SE) measures the uncertainty in the 𝛃 estimate.
The closer the points are to the regression line, the less uncertain we are in our estimate (smaller SE).
p-value: the probability of getting the observed results (or results more extreme) by chance alone
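Both the SE and the p-value can be read from the coefficient table of summary(). A sketch using the trees data from the code below:

```r
fit <- lm(Girth ~ Height, data = trees)
coefs <- summary(fit)$coefficients

se_slope <- coefs["Height", "Std. Error"]
p_slope  <- coefs["Height", "Pr(>|t|)"]

p_slope < 0.05  # TRUE here: the slope is statistically significant
```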
## scatterplot of tree girth against height (built-in trees dataset)
library(ggplot2)
ggplot(trees) +
  geom_point(aes(Height, Girth))

## add the best-fitting line (no confidence band)
ggplot(trees, aes(Height, Girth)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## fit the regression
fit <- lm(Girth ~ Height, data = trees)

## diagnostic plots
par(mfrow = c(2, 2))
plot(fit)
Residuals vs Fitted: looking for a horizontal red line that does not deviate from the dotted gray line; points should be random throughout the plot with no clear pattern or clustering.
Scale-Location: looking for a horizontal red line to indicate the data are homoscedastic (equal variance across the distribution).
Normal Q-Q: points falling along the gray line would indicate that the residuals are Normally distributed.
library(ggplot2)
## histogram of the residuals
ggplot(data.frame(residuals = resid(fit)), aes(residuals)) +
  geom_histogram(bins = 5)
Like the QQ Plot, the histogram shows the residuals are not Normally distributed
Standardized residuals greater than 3 or less than -3 indicate possible outliers in your dataset
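In R, standardized residuals come from rstandard(). A quick check on the trees fit:

```r
fit <- lm(Girth ~ Height, data = trees)

## standardized residuals; magnitudes above 3 flag possible outliers
std_res <- rstandard(fit)
which(abs(std_res) > 3)  # indices of any flagged points
```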
The regression summary reports, for each coefficient, the 𝛃 estimate, its standard error (SE), and a p-value.
For every one-inch increase in height, the girth is expected to increase by 0.255 inches.
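This interpretation can be verified from the fitted model: predictions at two heights one inch apart differ by exactly the slope estimate.

```r
fit <- lm(Girth ~ Height, data = trees)

## predicted girth at two heights one inch apart
p <- predict(fit, newdata = data.frame(Height = c(70, 71)))

unname(diff(p))  # equals the slope estimate, about 0.255
```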
The R² value in the summary describes the strength of the correlation.