Meeting #16: Introduction to regression
Tom Sasani
December 2, 2019
In biology, we’re often interested in the relationship between two (or more) variables
Relationship between mid-parent height and child's height
Data from "Regression toward mediocrity in hereditary stature," by Francis Galton (1885)
To determine whether these relationships are “interesting,” quantify them, and use them for prediction, we fit models
Goals for today’s talk
Covariance measures the “joint variability of two random variables”
Do large values of one variable tend to correspond with large values of the other variable?
Do they “co-vary?”
cov(X, Y) = E[(X - E[X]) (Y - E[Y])], where E[X] and E[Y] are the expected (mean) values of X and Y
Covariance is easy to calculate in R: first, define some X and Y values
set.seed(123)
X = rnorm(10, c(1:10)) # take one draw from a normal distribution
Y = rnorm(10, c(1:10)) # centered on all integers 1 through 10
(Note: the set.seed(123) line is necessary to generate the same "draws" from a normal distribution shown in my slides. This function is used to define the "seed" number used in R's pseudo-random number generation.)
Next, calculate the “residuals” for each variable: the difference between each individual value and the mean of the values
X_resid = X - mean(X)
Y_resid = Y - mean(Y)
> X_resid  # inspect the residuals of X
> Y_resid  # ...and of Y
Then, take the product of the residuals
prod = X_resid * Y_resid
What are we actually assessing by taking the product of the residuals?
If smaller-than-average X values correspond with smaller-than-average Y values (and vice versa), the products of the residuals will be positive!
> prod  # inspect the products of the residuals
Finally, find the “unbiased” mean of the products of the residuals
The “unbiased” mean is the sum of N values divided by (N - 1) rather than N
covariance = sum(prod) / (length(prod) - 1)
Our calculation is identical to the one performed by the “cov” function in R
> covariance
[1] 7.808618
> cov(X, Y)
[1] 7.808618
Positive covariance values indicate that the variables show "similar behavior," and negative values indicate that they show "opposing behavior."
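(A quick hedged illustration, reusing the X and Y defined above; Y_neg is a new variable introduced just for this example.)
Y_neg = -1 * Y    # a variable that moves in the opposite direction of Y
cov(X, Y)         # positive: X and Y tend to increase together
cov(X, Y_neg)     # negative: large X values now pair with small Y_neg values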
Covariance is not an “adjusted” or “normalized” metric: if we multiply our X and Y values by a constant, the covariance changes
X_new = X * 2
Y_new = Y * 2
X_resid_new = X_new - mean(X_new)
Y_resid_new = Y_new - mean(Y_new)
prod_new = X_resid_new * Y_resid_new
covariance_new = sum(prod_new) / (length(prod_new) - 1)
> covariance_new
[1] 31.23447
> cov(X_new, Y_new)
[1] 31.23447
(Multiplying both X and Y by 2 scales the covariance by 2 * 2 = 4: 7.808618 * 4 = 31.23447.)
The Pearson’s correlation coefficient is a normalized version of the covariance
r = cov(X, Y) / (sd(X) * sd(Y)), where sd(X) and sd(Y) are the standard deviations of X and Y
Again, we can calculate the Pearson’s correlation coefficient by hand, or with a built-in R function
pearson_cc = covariance / (sd(X) * sd(Y))
> pearson_cc
[1] 0.9495822
> cor(X, Y)
[1] 0.9495822
Pearson's CC can take on any value from -1 to 1. The former indicates perfect "negative correlation" and the latter indicates perfect "positive correlation."
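(A small sketch, reusing X, Y, X_new, and Y_new from above, to show both the bounds and the normalization.)
cor(X, Y)         # 0.9495822
cor(X_new, Y_new) # identical: rescaling changed the covariance, but not the correlation
cor(X, X)         # exactly 1 (perfect positive correlation)
cor(X, -X)        # exactly -1 (perfect negative correlation)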
A quick summary
Covariance: do two variables co-vary?
Correlation: to what degree do those variables co-vary?
But what if we want to better understand and “model” that relationship?
For example, maybe we want to predict future Y values based on the data we already have...
Goals for today’s talk
Basic linear regression in R assumes that the outcome we’re modeling (the dependent variable) is continuous
If we want to model data that come in other flavors...
...we’ll turn to generalized linear models.
(But that will come in a later lecture)
We’ll focus on the “mtcars” dataset for a regression introduction
The dataset contains 11 auto metrics (horsepower, MPG, etc.) measured for 32 makes and models
The “mtcars” dataset contains both continuous variables (e.g., car weight) and discrete variables (e.g., # of cylinders)
Can we predict a car’s fuel efficiency by measuring its weight?
The “mtcars” dataset is included with R
> data(mtcars)
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Descriptions of each column can be found here: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
Can we predict a car’s fuel efficiency by measuring its weight?
data(mtcars)
library(ggplot2)
library(cowplot)
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point()
We can easily add regression lines to data using ggplot2
data(mtcars)
library(ggplot2)
library(cowplot)
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point() +
geom_smooth(method="lm")
(But how is that regression line created, and how do we interpret it?)
Adding geom_smooth(method="lm") to a plot invokes “ordinary least-squares” regression
In OLS, we’re trying to fit a line to our data such that the real data points deviate as little as possible from the line.
The “deviation” is calculated using residuals.
Each of our “real” dependent variable measurements is denoted y_i, and each of our predictions is denoted ŷ_i.
To get an estimate of how well our line fits, we calculate the residual sum of squares: RSS = Σ (y_i - ŷ_i)²
In plain(ish) English: the sum of the squared differences between each of our predicted values and its corresponding "true" value.
Visually, calculating the residual sum of squares looks something like this:
We want to draw the line that minimizes the RSS (hence, “least-squares”).
The segments between the real data and the predicted line of best fit are our “residuals.”
The residual sum of squares (RSS) comes from squaring the difference between “real” and “predicted” values for each point, and summing the resulting values.
Fitting linear models in R is very straightforward
ols_mod = lm(mpg ~ wt, data=mtcars)
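(A hedged sketch, not from the original slides: we can check that lm() stored exactly the residuals described above and compute the RSS by hand. "predicted" and "resids" are just illustrative variable names.)
predicted = predict(ols_mod)     # the model's predicted mpg for each car
resids = mtcars$mpg - predicted  # "real" minus "predicted" values
sum(resids^2)                    # residual sum of squares (~278.3 for this model)
sum(resid(ols_mod)^2)            # same value, using the residuals stored by lm()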
(not so) hot take alert: regression is machine learning
By fitting linear models, we are learning parameters from the data we already have and using them to make predictions about new data.
Goals for today’s talk
The model summary tells us how well our model performed
> summary(ols_mod)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
First, we reiterate the exact model we specified
Next, a summary of the residuals calculated from our data (the lengths of the lines between the “real” and “predicted” data)
We then see the slope and intercept point estimates for our regression
Finally, there’s a more detailed summary of the model fit
Let’s focus a bit on the coefficients and significance tests
> summary(ols_mod)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
...
We can see that the point estimate of the slope is -5.3445. For each unit increase in "wt" (an increase of 1,000 lbs), we expect a decrease of about 5.3 miles per gallon.
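(A hedged sketch of that interpretation; the two weights below are arbitrary values chosen for illustration.)
predict(ols_mod, newdata=data.frame(wt=c(3, 4)))
# the two predicted mpg values differ by ~5.34, i.e. by the slope estimate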
The standard error of the slope point estimate is a measure of our uncertainty surrounding the estimate.
Regression is a signal to noise problem
The signal is our slope estimate: here, -5.3445.
The noise is the standard error associated with that slope: here, 0.5591.
The reported t-value is a measure of signal to noise
To measure the ratio between the signal and the noise, we divide the slope estimate by its standard error: t = -5.3445 / 0.5591 ≈ -9.56.
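(A sketch verifying that ratio from the fitted model; coef(summary(...)) returns the coefficient table as a matrix, and "coef_table" is just an illustrative name.)
coef_table = coef(summary(ols_mod))
coef_table["wt", "Estimate"] / coef_table["wt", "Std. Error"]
# ~ -9.559, the t value reported for wt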
Then, we can ask if the “signal-to-noise” ratio we observe is more extreme than expected, given our null hypothesis.
Our null hypothesis is that there is no relationship between x and y: the slope point estimate, and therefore the signal-to-noise ratio, is 0.
If the slope point estimate were 0 (or extremely close to 0), the t-value would also be 0 (or extremely close to 0).
Given the null hypothesis, is our t-value more extreme than expected by chance?
Given the noise in our slope estimate, is the slope significantly different from 0?
Let’s look up our t-value on a t-distribution.
We need to calculate our t-distribution based on the degrees of freedom in our regression
There are 32 pairs of (x,y) values in our dataset.
Degrees of freedom represent the number of values in our dataset that are free to vary. So, 32 DoF?
> length(mtcars$mpg)
[1] 32
> length(mtcars$wt)
[1] 32
A brief note: how do we calculate degrees of freedom in regression analysis?
In regression, degrees of freedom are calculated as n-k, where n is the number of (x,y) pairs and k is the number of parameters we’re estimating.
Why?
If we know k parameters (in this case, slope and intercept), only n-k of the y values are “free to vary.”
This is because we can always infer the values for k dependent variable measurements as long as we know:
a) the values of k parameters
b) the values of n-k y-values.
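(A small sketch: R stores these quantities with the fitted model.)
nobs(ols_mod)         # n = 32 observations
length(coef(ols_mod)) # k = 2 estimated parameters (intercept and slope)
df.residual(ols_mod)  # n - k = 30 degrees of freedom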
First, let’s visualize a t-distribution on 30 DoF
In our null hypothesis, the slope point estimate is 0, and the t-value is 0
A t-value of 0 is expected if our slope estimate is 0.
The observed t-value in our regression is well outside the bulk of the t-distribution (on 30 DoF)
But our observed t-value is at the fringes of the t-distribution.
The p-value reported in the summary reflects the probability we’d observe a t-value this extreme given the null
If the null is a slope of 0 (and thus a t-value of 0), then our observed t-value is much more extreme than expected by chance.
The p-value reported in the summary is simply the result of a t-test
In fact, we can see that the p-value in our model summary is reported as the "probability of observing a t-value this extreme given the null."
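(A hedged sketch reproducing that p-value from the t-distribution; "t_obs" is just an illustrative name, and the two-sided p-value doubles the tail area beyond the observed t-value.)
t_obs = -9.559              # the t value reported for wt
2 * pt(-abs(t_obs), df=30)  # ~1.29e-10, matching the reported p-value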
The same exact calculation is done for the y-intercept
A small note: in R output, the smallest p-value reported is "< 2e-16"!
What about that final part of the model output?
> summary(ols_mod)
...
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The multiple R-squared is simply the squared Pearson’s correlation coefficient.
> cor(mtcars$wt, mtcars$mpg) ** 2
[1] 0.7528328
The multiple R-squared is also the fraction of variance in the dependent variable that's explained by the independent variables.
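(A sketch of that interpretation: R-squared equals 1 minus the ratio of the residual sum of squares to the total sum of squares of mpg. "rss" and "tss" are illustrative names.)
rss = sum(resid(ols_mod)^2)                  # residual sum of squares (~278.3)
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2) # total sum of squares (~1126)
1 - rss / tss                                # ~0.7528, the multiple R-squared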
Adjusted R-squared corrects for the number of predictors in the model
We won’t get into the details of the adjusted R-squared, but it can be useful for comparing two models (when the second model uses more predictors than the first).
Generally, as more predictors are added, the “raw” multiple R-squared will improve. But if the adjusted R-squared doesn’t improve, then the additional predictors aren’t that great.
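(A hedged sketch of such a comparison; the extra predictor, qsec, and the model names are chosen arbitrarily for illustration.)
m_small = lm(mpg ~ wt, data=mtcars)
m_big = lm(mpg ~ wt + qsec, data=mtcars)  # one extra (arbitrarily chosen) predictor
summary(m_small)$adj.r.squared            # adjusted R-squared of the simpler model
summary(m_big)$adj.r.squared              # compare: did the extra predictor earn its keep?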
Finally, there is an F-test for the hypothesis that the slope is 0
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
In a simple linear regression, the F-statistic is simply the squared t-statistic!
The F-statistic is essentially another measure of signal to noise.
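(A sketch checking that relationship for our model; summary(ols_mod)$fstatistic stores the F-statistic and its degrees of freedom.)
coef(summary(ols_mod))["wt", "t value"]^2  # (-9.559)^2, ~91.38
summary(ols_mod)$fstatistic                # value ~91.38 on 1 and 30 DF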
The F-test is an ANOVA comparing two models
> m1 = lm(mpg ~ 1, data=mtcars) # model in which the “line of best fit”
                                # is a straight horizontal line extending
                                # out from the mean mtcars$mpg value
> m2 = lm(mpg ~ wt, data=mtcars) # the “true” model
> anova(m2, m1)
Analysis of Variance Table
Model 1: mpg ~ wt
Model 2: mpg ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 278.32
2 31 1126.05 -1 -847.73 91.375 1.294e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Essentially, we’re asking whether the reduction in the residual sum of squares between the two models is “significant”
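(A sketch of that arithmetic, using the RSS values from the ANOVA table above.)
drop_in_rss = 1126.05 - 278.32    # reduction in RSS when wt is added to the model
(drop_in_rss / 1) / (278.32 / 30) # ~91.38: the F statistic on 1 and 30 DF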
Closing notes
Next time: