Meeting #16: Introduction to regression
Tom Sasani
December 2, 2019
In biology, we’re often interested in the relationship between two (or more) variables
Relationship between mid-parent height and child's height
Data from "Regression toward mediocrity in hereditary stature," by Francis Galton (1885)
To determine whether these relationships are “interesting,” quantify them, and use them for prediction, we fit models
Goals for today’s talk
Covariance measures the “joint variability of two random variables”
Do large values of one variable tend to correspond with large values of the other variable?
Do they “co-vary?”
cov(X, Y) = E[(X - E[X]) (Y - E[Y])], where E[X] and E[Y] are the expected (mean) values of X and Y
Covariance is easy to calculate in R: first, define some X and Y values
set.seed(123)
X = rnorm(10, c(1:10)) # take one draw from a normal distribution
Y = rnorm(10, c(1:10)) # centered on all integers 1 through 10
(Note: the set.seed(123) line is necessary to generate the same "draws" from a normal distribution shown in my slides. This function is used to define the "seed" number used in R's pseudo-random number generation.)
Next, calculate the “residuals” for each variable: the difference between each individual value and the mean of the values
X_resid = X - mean(X)
Y_resid = Y - mean(Y)
> X_resid  # inspect the residuals of X
> Y_resid  # ...and of Y
Then, take the product of the residuals
prod = X_resid * Y_resid
What are we actually assessing by taking the product of the residuals?
If smaller-than-average X values correspond with smaller-than-average Y values (and vice versa), the products of the residuals will be positive!
> prod  # inspect the products of the residuals
Finally, find the “unbiased” mean of the products of the residuals
The “unbiased” mean is the sum of N values divided by (N - 1) rather than N
covariance = sum(prod) / (length(prod) - 1)
Our calculation is identical to the one performed by the “cov” function in R
> covariance
[1] 7.808618
> cov(X, Y)
[1] 7.808618
Positive covariance values indicate that the variables show "similar behavior," and negative values indicate that they show "opposing behavior."
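(A quick hedged illustration, reusing the X and Y defined above; Y_neg is a new variable introduced just for this example.)
Y_neg = -1 * Y    # a variable that moves in the opposite direction of Y
cov(X, Y)         # positive: X and Y tend to increase together
cov(X, Y_neg)     # negative: large X values now pair with small Y_neg values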
Covariance is not an “adjusted” or “normalized” metric: if we multiply our X and Y values by a constant, the covariance changes
X_new = X * 2
Y_new = Y * 2
X_resid_new = X_new - mean(X_new)
Y_resid_new = Y_new - mean(Y_new)
prod_new = X_resid_new * Y_resid_new
covariance_new = sum(prod_new) / (length(prod_new) - 1)
> covariance_new
[1] 31.23447
> cov(X_new, Y_new)
[1] 31.23447
(Multiplying both X and Y by 2 scales the covariance by 2 * 2 = 4: 7.808618 * 4 = 31.23447.)
The Pearson’s correlation coefficient is a normalized version of the covariance
r = cov(X, Y) / (sd(X) * sd(Y)), where sd(X) and sd(Y) are the standard deviations of X and Y
Again, we can calculate the Pearson’s correlation coefficient by hand, or with a built-in R function
pearson_cc = covariance / (sd(X) * sd(Y))
> pearson_cc
[1] 0.9495822
> cor(X, Y)
[1] 0.9495822
Pearson's CC can take on any value from -1 to 1. The former indicates perfect "negative correlation" and the latter indicates perfect "positive correlation."
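(A small sketch, reusing X, Y, X_new, and Y_new from above, to show both the bounds and the normalization.)
cor(X, Y)         # 0.9495822
cor(X_new, Y_new) # identical: rescaling changed the covariance, but not the correlation
cor(X, X)         # exactly 1 (perfect positive correlation)
cor(X, -X)        # exactly -1 (perfect negative correlation)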
A quick summary
Covariance: do two variables co-vary?
Correlation: to what degree do those variables co-vary?
But what if we want to better understand and “model” that relationship?
For example, maybe we want to predict future Y values based on the data we already have...
Goals for today’s talk
Basic linear regression in R assumes that the outcome we’re modeling (the dependent variable) is continuous
If we want to model data that come in other flavors...
...we’ll turn to generalized linear models.
(But that will come in a later lecture)
We’ll focus on the “mtcars” dataset for a regression introduction
The dataset contains 11 auto metrics (horsepower, MPG, etc.) measured for 32 makes and models
The “mtcars” dataset contains both continuous variables (e.g., car weight) and discrete variables (e.g., # of cylinders)
Can we predict a car’s fuel efficiency by measuring its weight?
The “mtcars” dataset is included with R
> data(mtcars)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Descriptions of each column can be found here: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
Can we predict a car’s fuel efficiency by measuring its weight?
data(mtcars)
library(ggplot2)
library(cowplot)
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point()
We can easily add regression lines to data using ggplot2
data(mtcars)
library(ggplot2)
library(cowplot)
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point() +
geom_smooth(method="lm")
(But how is that regression line created, and how do we interpret it?)
Adding geom_smooth(method="lm") to a plot invokes “ordinary least-squares” regression
In OLS, we’re trying to fit a line to our data such that the real data points deviate as little as possible from the line.
The “deviation” is calculated using residuals.
Each of our “real” dependent variable measurements is denoted y_i, and each of our predictions is denoted ŷ_i.
To get an estimate of how well our line fits, we calculate the residual sum of squares: RSS = Σ (y_i - ŷ_i)²
In plain(ish) English: the sum of the squared differences between each of our predicted values and its corresponding "true" value.
Visually, calculating the residual sum of squares looks something like this:
We want to draw the line that minimizes the RSS (hence, “least-squares”).
The segments between the real data and the predicted line of best fit are our “residuals.”
The residual sum of squares (RSS) comes from squaring the difference between “real” and “predicted” values for each point, and summing the resulting values.
Fitting linear models in R is very straightforward
ols_mod = lm(mpg ~ wt, data=mtcars)
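(A hedged sketch, not from the original slides: we can check that lm() stored exactly the residuals described above and compute the RSS by hand. "predicted" and "resids" are just illustrative variable names.)
predicted = predict(ols_mod)     # the model's predicted mpg for each car
resids = mtcars$mpg - predicted  # "real" minus "predicted" values
sum(resids^2)                    # residual sum of squares (~278.3 for this model)
sum(resid(ols_mod)^2)            # same value, using the residuals stored by lm()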
(not so) hot take alert: regression is machine learning
By fitting linear models, we are learning parameters from the data we already have and using them to make predictions about new data.
Goals for today’s talk
The model summary tells us how well our model performed
> summary(ols_mod)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
First, we reiterate the exact model we specified
Next, a summary of the residuals calculated from our data (the lengths of the lines between the “real” and “predicted” data)
We then see the slope and intercept point estimates for our regression
Finally, there’s a more detailed summary of the model fit
Let’s focus a bit on the coefficients and significance tests
> summary(ols_mod)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
...
We can see that the point estimate of the slope is -5.3445. For each unit increase in "wt" (an increase of 1,000 lbs), we expect a decrease of about 5.3 miles per gallon.
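(A hedged sketch of that interpretation; the two weights below are arbitrary values chosen for illustration.)
predict(ols_mod, newdata=data.frame(wt=c(3, 4)))
# the two predicted mpg values differ by ~5.34, i.e. by the slope estimate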
The standard error of the slope point estimate is a measure of our uncertainty surrounding the estimate.
Regression is a signal to noise problem
The signal is our slope estimate: here, -5.3445.
The noise is the standard error associated with that slope: here, 0.5591.
The reported t-value is a measure of signal to noise
To measure the ratio between the signal and the noise, we divide the slope estimate by its standard error: t = -5.3445 / 0.5591 ≈ -9.56.
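(A sketch verifying that ratio from the fitted model; coef(summary(...)) returns the coefficient table as a matrix, and "coef_table" is just an illustrative name.)
coef_table = coef(summary(ols_mod))
coef_table["wt", "Estimate"] / coef_table["wt", "Std. Error"]
# ~ -9.559, the t value reported for wt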
Then, we can ask if the “signal-to-noise” ratio we observe is more extreme than expected, given our null hypothesis.
Our null hypothesis is that there is no relationship between x and y: the slope point estimate, and therefore the signal-to-noise ratio, is 0.
If the slope point estimate were 0 (or extremely close to 0), the t-value would also be 0 (or extremely close to 0).
Given the null hypothesis, is our t-value more extreme than expected by chance?
Given the noise in our slope estimate, is the slope significantly different from 0?
Let’s look up our t-value on a t-distribution.
We need to calculate our t-distribution based on the degrees of freedom in our regression
There are 32 pairs of (x,y) values in our dataset.
Degrees of freedom represent the number of values in our dataset that are free to vary. So, 32 DoF?
> length(mtcars$mpg)
[1] 32
> length(mtcars$wt)
[1] 32
A brief note: how do we calculate degrees of freedom in regression analysis?
In regression, degrees of freedom are calculated as n-k, where n is the number of (x,y) pairs and k is the number of parameters we’re estimating.
Why?
If we know k parameters (in this case, slope and intercept), only n-k of the y values are “free to vary.”
This is because we can always infer the values for k dependent variable measurements as long as we know:
a) the values of k parameters
b) the values of n-k y-values.
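(A small sketch: R stores these quantities with the fitted model.)
nobs(ols_mod)         # n = 32 observations
length(coef(ols_mod)) # k = 2 estimated parameters (intercept and slope)
df.residual(ols_mod)  # n - k = 30 degrees of freedom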
First, let’s visualize a t-distribution on 30 DoF
In our null hypothesis, the slope point estimate is 0, and the t-value is 0
A t-value of 0 is expected if our slope estimate is 0.
The observed t-value in our regression is well outside the bulk of the t-distribution (on 30 DoF)
But our observed t-value is at the fringes of the t-distribution.
The p-value reported in the summary reflects the probability we’d observe a t-value this extreme given the null
If the null is a slope of 0 (and thus a t-value of 0), then our observed t-value is much more extreme than expected by chance.
The p-value reported in the summary is simply the result of a t-test
In fact, we can see that the p-value in our model summary is reported as the "probability of observing a t-value this extreme given the null."
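(A hedged sketch reproducing that p-value from the t-distribution; "t_obs" is just an illustrative name, and the two-sided p-value doubles the tail area beyond the observed t-value.)
t_obs = -9.559              # the t value reported for wt
2 * pt(-abs(t_obs), df=30)  # ~1.29e-10, matching the reported p-value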
The same exact calculation is done for the y-intercept
A small note: in R output, the smallest p-value reported is "< 2e-16"!
What about that final part of the model output?
> summary(ols_mod)
...
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The multiple R-squared is simply the squared Pearson’s correlation coefficient.
> cor(mtcars$wt, mtcars$mpg) ** 2
[1] 0.7528328
The multiple R-squared is also the fraction of variance in the dependent variable that's explained by the independent variables.
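(A sketch of that interpretation: R-squared equals 1 minus the ratio of the residual sum of squares to the total sum of squares of mpg. "rss" and "tss" are illustrative names.)
rss = sum(resid(ols_mod)^2)                  # residual sum of squares (~278.3)
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2) # total sum of squares (~1126)
1 - rss / tss                                # ~0.7528, the multiple R-squared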
Adjusted R-squared corrects for the number of predictors in the model
We won’t get into the details of the adjusted R-squared, but it can be useful for comparing two models (when the second model uses more predictors than the first).
Generally, as more predictors are added, the “raw” multiple R-squared will improve. But if the adjusted R-squared doesn’t improve, then the additional predictors aren’t that great.
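(A hedged sketch of such a comparison; the extra predictor, qsec, and the model names are chosen arbitrarily for illustration.)
m_small = lm(mpg ~ wt, data=mtcars)
m_big = lm(mpg ~ wt + qsec, data=mtcars)  # one extra (arbitrarily chosen) predictor
summary(m_small)$adj.r.squared            # adjusted R-squared of the simpler model
summary(m_big)$adj.r.squared              # compare: did the extra predictor earn its keep?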
Finally, there is an F-test for the hypothesis that the slope is 0
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
In a simple linear regression, the F-statistic is simply the squared t-statistic!
The F-statistic is essentially another measure of signal to noise.
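(A sketch checking that relationship for our model; summary(ols_mod)$fstatistic stores the F-statistic and its degrees of freedom.)
coef(summary(ols_mod))["wt", "t value"]^2  # (-9.559)^2, ~91.38
summary(ols_mod)$fstatistic                # value ~91.38 on 1 and 30 DF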
The F-test is an ANOVA comparing two models
> m1 = lm(mpg ~ 1, data=mtcars) # model in which the “line of best fit”
                                # is a straight horizontal line extending
                                # out from the mean mtcars$mpg value
> m2 = lm(mpg ~ wt, data=mtcars) # the “true” model
> anova(m2, m1)
Analysis of Variance Table
Model 1: mpg ~ wt
Model 2: mpg ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 278.32
2 31 1126.05 -1 -847.73 91.375 1.294e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Essentially, we’re asking whether the reduction in the residual sum of squares between the two models is “significant”
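(A sketch of that arithmetic, using the RSS values from the ANOVA table above.)
drop_in_rss = 1126.05 - 278.32    # reduction in RSS when wt is added to the model
(drop_in_rss / 1) / (278.32 / 30) # ~91.38: the F statistic on 1 and 30 DF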
Closing notes
Next time: