1 of 30

Inference: Linear Regression

Data Analysis

2 of 30

Independent Variable

Dependent Variable

Linear regression can be used to describe this relationship

3 of 30

Independent Variable (x-axis)

Dependent Variable (y-axis)

4 of 30

Independent Variable

Dependent Variable

best-fitting line

5 of 30

Independent Variable

Dependent Variable

Best-fitting line

NOT a best-fitting line

6 of 30

Independent Variable

Dependent Variable

positive relationship

The larger the independent variable value, the larger the dependent variable tends to be

The smaller the independent variable value, the smaller the dependent variable tends to be

7 of 30

Father’s Height

Son’s Height

positive relationship

The taller the father, the taller the son tends to be

The shorter the father, the shorter the son tends to be

8 of 30

Independent Variable

Dependent Variable

negative relationship

The larger the independent variable value, the smaller the dependent variable tends to be

The smaller the independent variable value, the larger the dependent variable tends to be

9 of 30

Number of absences

students’ grades

negative relationship

The more absences, the lower the students’ grades tend to be

The lower the number of absences, the higher the students’ grades tend to be

10 of 30

Independent Variable

Dependent Variable

weaker relationship

stronger relationship

stronger relationship = higher correlation
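
The strength of a linear relationship can be put into a number with Pearson's correlation coefficient. The following is a minimal sketch, assuming R's built-in `trees` dataset (the same data plotted in the code slides of this deck); the variable name `r` is chosen here for illustration.

```r
# Pearson correlation between tree height and girth (built-in `trees` data)
r <- cor(trees$Height, trees$Girth)

r       # the sign gives the direction of the relationship
abs(r)  # the magnitude gives the strength: closer to 1 means stronger
```

A value near 0 would correspond to the "weaker relationship" panel, and a value near 1 or -1 to the "stronger relationship" panel.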

11 of 30

Time

Temperature

Not a linear relationship.

Do not use linear regression

12 of 30

Independent Variable

Dependent Variable

Not homoscedastic: points at this end are much further from the line than at the other end

Do not use linear regression

13 of 30

Independent Variable

Dependent Variable

Residuals (green vertical lines) are the vertical distances from each observed data point (purple closed circles) to the regression line (the pink solid line)

residual
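
To make the definition concrete, a residual can be computed by hand as the observed value minus the value predicted by the line. This is a small sketch, assuming the built-in `trees` data and the same `Girth ~ Height` model fitted in the code slides of this deck; `manual_resid` is a name introduced here for illustration.

```r
# Fit the regression of girth on height
fit <- lm(Girth ~ Height, data = trees)

# Residual = observed value - value predicted by the regression line
manual_resid <- trees$Girth - fitted(fit)

# The hand-computed residuals match the ones stored in the model object
all.equal(unname(manual_resid), unname(residuals(fit)))

# Positive residuals lie above the line, negative ones below it
head(manual_resid)
```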

14 of 30

Linear regression assumes normality of residuals

residuals

15 of 30

Independent Variable

Dependent Variable

Increasing 𝛃

Positive 𝛃

Negative 𝛃

𝛃 = 0
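
For a regression with one independent variable, the slope 𝛃 of the least-squares line equals the covariance of the two variables divided by the variance of the independent variable. The sketch below checks this against `lm()`, assuming the built-in `trees` data; `beta_manual` is an illustrative name.

```r
x <- trees$Height  # independent variable
y <- trees$Girth   # dependent variable

# Slope of the least-squares line: cov(x, y) / var(x)
beta_manual <- cov(x, y) / var(x)

# The same slope as estimated by lm()
fit <- lm(Girth ~ Height, data = trees)
beta_lm <- coef(fit)[["Height"]]

all.equal(beta_manual, beta_lm)
sign(beta_lm)  # +1: positive relationship, -1: negative, 0: no linear relationship
```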

16 of 30

Independent Variable

Dependent Variable

increasing standard error (SE)

The closer the points are to the regression line, the less uncertain we are in our estimate
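
A quick simulation can illustrate this. The data below are made up for the purpose (they are not from the slides): both datasets follow the same underlying line, but one has points scattered much further from it. The helper `slope_se()` is a hypothetical name introduced here.

```r
set.seed(1)

# Simulated data: same true line, different amounts of scatter around it
x       <- seq(1, 20, length.out = 50)
noise   <- rnorm(50)
y_tight <- 2 + 0.5 * x + 0.5 * noise  # points close to the line
y_loose <- 2 + 0.5 * x + 5 * noise    # points far from the line

# Standard error of the estimated slope for a given response vector
slope_se <- function(y) {
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

slope_se(y_tight)  # smaller SE: less uncertainty in the slope estimate
slope_se(y_loose)  # larger SE: more uncertainty in the slope estimate
```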

17 of 30

p-value: the probability of getting the observed results (or results more extreme) by chance alone, that is, if there were truly no relationship between the variables
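
As a sketch of where this number comes from in a regression, the p-value for the slope can be reproduced from its t statistic. This assumes the built-in `trees` data and the `Girth ~ Height` model used in the code slides; `coefs`, `t_stat`, and `p_manual` are illustrative names.

```r
fit   <- lm(Girth ~ Height, data = trees)
coefs <- summary(fit)$coefficients

# t statistic for the slope: estimate divided by its standard error
t_stat <- coefs["Height", "Estimate"] / coefs["Height", "Std. Error"]

# Two-sided p-value: probability of a t statistic at least this extreme
# when the true slope is zero
p_manual <- 2 * pt(-abs(t_stat), df = df.residual(fit))

# Matches the p-value reported by summary()
all.equal(p_manual, coefs["Height", "Pr(>|t|)"])
```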

18 of 30

19 of 30

library(ggplot2)

ggplot(trees) +
  geom_point(aes(x = Height, y = Girth))

20 of 30

ggplot(trees, aes(Height, Girth)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

21 of 30

## fit the regression
fit <- lm(Girth ~ Height, data = trees)

22 of 30

## arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit)

23 of 30

Look for a roughly horizontal red line that stays close to the dotted gray line

Points should be scattered randomly throughout the plot, with no clear pattern or clustering

24 of 30

Look for a roughly horizontal red line, which indicates the data are homoscedastic (the residuals have equal variance across the range of fitted values)

25 of 30

Points falling along the gray line would indicate that the residuals are Normally distributed

26 of 30

library(ggplot2)

## histogram of the model residuals
ggplot(data.frame(residual = residuals(fit)), aes(x = residual)) +
  geom_histogram(bins = 5)

Like the Q-Q plot, the histogram suggests the residuals are not Normally distributed

27 of 30

Standardized residuals greater than 3 or less than -3 indicate possible outliers in your dataset
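
The same check can be done numerically rather than by eye. This is a minimal sketch, assuming the `Girth ~ Height` model on the built-in `trees` data; `std_resid` and `possible_outliers` are illustrative names.

```r
fit <- lm(Girth ~ Height, data = trees)

# Standardized residuals, one per observation
std_resid <- rstandard(fit)

# Row indices of observations beyond the +/- 3 rule of thumb
# (an empty result means no observation is flagged)
possible_outliers <- which(abs(std_resid) > 3)
possible_outliers
```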

28 of 30

𝛃 estimate

SE

p-value

For every one-foot increase in height, the girth is expected to increase by 0.255 inches on average (in the trees dataset, Height is recorded in feet and Girth in inches)
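
One way to see what the 𝛃 estimate means is to compare the model's predictions for two trees whose heights differ by exactly one unit. This sketch assumes the `Girth ~ Height` model on the built-in `trees` data; the heights 75 and 76 are arbitrary values picked for illustration.

```r
fit   <- lm(Girth ~ Height, data = trees)
slope <- coef(fit)[["Height"]]

# Predicted girth for two trees that differ by one unit of height
# (per the dataset documentation, Height is in feet and Girth in inches)
pred <- predict(fit, newdata = data.frame(Height = c(75, 76)))

# The difference between the two predictions is the slope itself
all.equal(unname(diff(pred)), slope)
slope
```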

29 of 30

Describes the strength of the correlation
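
Assuming this slide points at the R-squared value in the `summary()` output (the slide text itself does not say which number is highlighted), the link to correlation can be checked directly: with a single independent variable, R-squared equals the squared Pearson correlation. The names `r_squared` and `r` are illustrative.

```r
fit <- lm(Girth ~ Height, data = trees)

# R-squared as reported by summary()
r_squared <- summary(fit)$r.squared

# Pearson correlation between the two variables
r <- cor(trees$Height, trees$Girth)

# With one independent variable, R-squared is the squared correlation
all.equal(r_squared, r^2)
```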

30 of 30