Linear regression
BIOM285
May 5 2025
Outline for this week
Mon: Linear regression (quantitative ~ quantitative)
Wed: Contingency tables (categorical ~ categorical)
Fri: Examples of linear regression models
Regression
Regression is one of the most powerful and flexible statistical techniques that you can learn
Today we will learn about linear regression:
Linear regression models the relationship between a quantitative response (dependent) variable and one or more predictor (independent) variables
Linear regression with one predictor variable
Let’s start with the most straightforward case - which is linear regression of a dependent variable using a single predictor
Some examples could be:
- Growth of cell line versus concentration of glucose
- Gene expression level versus dose of drug
- Tumor size versus duration of chemotherapy
Linear regression with one predictor variable
The simple case is linear regression of a dependent variable using a single predictor
Back to our example from last week: happiness index versus the number of hours spent on Zoom
hours <- c(0,1,2,3,4,5,6,7,8,9)
happiness <- c(10,9.5,9,8.5,8,7.5,7,6.5,6,5.5)
plot(happiness~hours,ylim=c(5,10),xlim=c(0,9))
Linear regression with one predictor variable
Given these data which are perfectly correlated (R=-1) - we can add a straight line to the plot that models the relationship exactly
The ‘intercept’ of the line is 10 - because we start at hour 0 with a happiness index of 10
The ‘slope’ of the line is -0.5 because we lose 0.5 units of happiness per hour on Zoom
hours <- c(0,1,2,3,4,5,6,7,8,9)
happiness <- c(10,9.5,9,8.5,8,7.5,7,6.5,6,5.5)
plot(happiness~hours,ylim=c(5,10),xlim=c(0,9))
abline(10,-.5)
Linear regression with one predictor variable
The ‘intercept’ of the line is 10 - because we start at hour 0 with a happiness index of 10
The ‘slope’ of the line is -0.5 because we lose 0.5 units of happiness per hour
We can write the model for this line, where slope and intercept are ‘coefficients’ of the model, as:
happiness = 10 + -0.5 * Zoom hours
e.g. for 5 hours on Zoom: happiness = 10 + -0.5 * 5 = 7.5
Often the model will be written as:
y = b0 + b1*x
y - dependent variable, x - independent variable, b0 - intercept, b1 - slope
Fitting a regression line
We can evaluate a linear model using the R function ‘lm()’
Recall from the ANOVA lectures that the way to specify a linear model such as y = b0 + b1*x in an R formula is: y ~ x
hours <- c(0,1,2,3,4,5,6,7,8,9)
happiness <- c(10,9.5,9,8.5,8,7.5,7,6.5,6,5.5)
my_model <- lm(happiness~hours) ### happiness = b0 + b1*hours
my_model
Call:
lm(formula = happiness ~ hours)
Coefficients:
(Intercept) hours
10.0 -0.5 ### same coefficients we saw before
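As a quick visual check (not shown on the original slide), we can overlay the fitted line directly from the model object:
plot(happiness~hours,ylim=c(5,10),xlim=c(0,9))
abline(my_model)   ### abline() accepts an lm object and draws its fitted line
coef(my_model)     ### named vector of the intercept and slope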
Linear regression with one predictor variable
Let’s go back to another example - the amount of money we spend on coffee per hour worked in lab
hours <- c(0,1,2,3,4,5,6,7,8,9)
coffee <- c(0,2,4,6,8,10,12,14,16,18)
plot(coffee~hours,ylim=c(0,18),xlim=c(0,9))
Linear regression with one predictor variable
Let’s go back to another example - the amount of money we spend on coffee per hour worked in lab
We want to fit a line to these data... which line is the better fit?
Linear regression with one predictor variable
Let’s go back to another example - the amount of money we spend on coffee per hour worked in lab
For the second line - why isn’t this a good fit? Because the line is farther away from the points. The distance between each point and the line is called the ‘residual’ - the residuals from this line are larger, so it is a worse fit for the data
[Plot: the residuals are the vertical distances between the data points and the line]
Linear regression with one predictor variable
In linear regression we attempt to find a line that best fits the data, which means minimizing the residual values
The calculation linear regression performs is to find the line that minimizes the sum of the squared residuals from the line - and is therefore called ‘least squares’
Even though it is a pretty simple concept - it is one of the most powerful statistical tools you can use
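As a rough sketch of the least-squares idea (using the coffee data above and two hypothetical candidate lines), we can compare the sum of squared residuals directly:
hours <- c(0,1,2,3,4,5,6,7,8,9)
coffee <- c(0,2,4,6,8,10,12,14,16,18)
line1 <- 0 + 2*hours     ### candidate line: intercept 0, slope 2
line2 <- 1 + 1.5*hours   ### hypothetical alternative line: intercept 1, slope 1.5
sum((coffee - line1)^2)  ### 0 - this line fits the data perfectly
sum((coffee - line2)^2)  ### much larger - a worse fit by the least-squares criterion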
Linear regression with one predictor variable
Let’s look at simple examples to see how linear regression works, and how we can interpret the results and perform hypothesis testing
Say we want to determine whether the concentration of thyroid hormone in culture media affects differentiation of neurons derived from stem cells
exp1 <- data.frame(hormone=c(100,120,140,160,180,200,220,240),diff=c(20,25,24,28,30,27,32,30))
plot(diff~hormone,data=exp1,xlim=c(80,260),ylim=c(18,34))
Linear regression with one predictor variable
We evaluate a linear model using these data to first determine the coefficients of the model
The coefficients of the model are intercept=15.26 and slope=0.069:
Differentiation = 15.26 + 0.069*hormone
Meaning that for every unit of hormone added, differentiation increases by 0.069 units
exp1.lm <- lm(diff~hormone,data=exp1)
exp1.lm
Call:
lm(formula = diff ~ hormone, data = exp1)
Coefficients:
(Intercept) hormone
15.26190 0.06905
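To visualize the fit (a minimal sketch, not on the original slide), we can add the fitted line to the scatterplot:
plot(diff~hormone,data=exp1,xlim=c(80,260),ylim=c(18,34))
abline(exp1.lm)   ### overlay the least-squares line on the data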
Linear regression with one predictor variable
The line appears to be a good fit for our data - so let’s evaluate the hypothesis that hormone is a significant predictor of differentiation
Recall from the ANOVA lectures that we can use the ‘summary()’ function
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
We get lots of information in the summary
What we care most about is whether hormone is a significant predictor of differentiation, and what its estimated effect is
We can see from the coefficients table that hormone is a significant predictor (P=0.00505)
Therefore we can reject the null hypothesis of no effect of the hormone
We also get the estimated effect of hormone, the standard error of this estimate, and the test statistic (t statistic)
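If you want to pull these numbers out programmatically rather than reading them from the printout, one option (a sketch) is the coefficient matrix returned by coef() on the summary:
coef(summary(exp1.lm))                          ### matrix of estimates, SEs, t values and P values
coef(summary(exp1.lm))["hormone","Pr(>|t|)"]    ### P value for the hormone coefficient (~0.00505)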
Linear regression with one predictor variable
The line appears to be a good fit for our data - so let’s evaluate the hypothesis that hormone is a significant predictor of differentiation
Recall from the ANOVA lectures that we can use the ‘summary()’ function
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
We get lots of information in the summary
We also learn about whether the model as a whole is a good fit to the data
Even though we care mostly about the coefficients of the individual variables (e.g. hormone) it is good to check that the model overall is a good fit
Linear regression with one predictor variable
The line appears to be a good fit for our data - so let’s evaluate the hypothesis that hormone is a significant predictor of differentiation
Recall from the ANOVA lectures that we can use the ‘summary()’ function
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
The summary also reports R-squared - the squared correlation between the actual and predicted values from the fitted model - which we will come back to later
Linear regression with one predictor variable
How well does our model fit the data?
A fitted value is the value of y on the fitted regression line corresponding to a particular value of x - i.e. it is the expected value of y based on the model
From our fitted model:
Differentiation = 15.26 + 0.069*hormone
We can predict differentiation from a given hormone value:
Hormone=125:
Differentiation = 15.26 + 0.069*125 = 23.885
Hormone=175:
Differentiation = 15.26 + 0.069*175 = 27.335
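The same values can be obtained with predict() on the fitted model (small differences from the hand calculations are due to rounding the coefficients):
predict(exp1.lm,data.frame(hormone=c(125,175)))   ### ~23.9 and ~27.3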
Linear regression with one predictor variable
How well does our model fit the data?
A fitted value is the value of y on the fitted regression line corresponding to a particular value of x - i.e. it is the predicted value of y based on the model
We can view the fitted (expected) values of differentiation given our hormone data, as well as the residuals - the part of the differentiation values not explained by the hormone variable
fitted(exp1.lm)
1 2 3 4 5 6 7 8
22.16667 23.54762 24.92857 26.30952 27.69048 29.07143 30.45238 31.83333
resid(exp1.lm)
1 2 3 4 5 6 7 8
-2.1666667 1.4523810 -0.9285714 1.6904762 2.3095238 -2.0714286 1.5476190 -1.8333333
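As a sanity check (a minimal sketch), the fitted values and residuals should add back up to the observed data:
fitted(exp1.lm) + resid(exp1.lm)   ### reconstructs the observed differentiation values
exp1$diff                          ### the original data, for comparison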
Linear regression with one predictor variable
Residuals represent the left over information in the outcome not explained by our predictor (error)
Residuals should be roughly normally distributed and the average should be ~0
Residuals that show systematic patterns may indicate problems - e.g. the model may be over- or under-fitting (see the diagnostic-plot sketch after the summary output below)
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
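A few quick diagnostic plots of the residuals (a sketch, not shown on the slide) can help judge whether these assumptions are reasonable:
hist(resid(exp1.lm))                    ### should look roughly symmetric around 0
qqnorm(resid(exp1.lm))                  ### points near a straight line suggest ~normal residuals
qqline(resid(exp1.lm))
plot(fitted(exp1.lm),resid(exp1.lm))    ### look for systematic patterns (curvature, funneling)
abline(h=0,lty=2)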
Linear regression with one predictor variable
How well does our model fit the data?
A fitted value is the value of y on the fitted regression line corresponding to a particular value of x - i.e. it is the predicted value of y based on the model
summary(exp1.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
cor(exp1$diff,fitted(exp1.lm)) ### R
[1] 0.8692614
cor(exp1$diff,fitted(exp1.lm))^2 ### R2
[1] 0.7556155
We get lots of information in the summary
The summary also reports R-squared - the squared correlation between the actual and predicted values from the fitted model
An R-squared of 1 means that the regression line perfectly fits the data (as in our first example of happiness and Zoom calls)
Linear regression with one predictor variable
What about predicting other unseen values?
Say we want to know what the effects of different values for thyroid hormone would be on differentiation. Instead of running multiple additional experiments, we can leverage our model to predict them
new <- data.frame(hormone=c(134,223,300))
predict(exp1.lm,new)
1 2 3
24.51429 30.65952 35.97619
Can you see any potential issues with this?
Linear regression with one predictor variable
Summarizing the linear model gives you the standard error of the estimated effect of each predictor (independent) variable
What if you wanted to report the confidence interval instead?
confint(exp1.lm,level=.95)
2.5 % 97.5 %
(Intercept) 8.35539376 22.168416
hormone 0.02982127 0.108274
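For reference, the same interval can be computed by hand as the estimate plus or minus the appropriate t quantile times its standard error (using the 6 residual degrees of freedom from the summary):
0.06905 + c(-1,1)*qt(0.975,df=6)*0.01603   ### ~0.0298 to 0.1083, matching confint()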
Multiple regression
The examples we just went over used one predictor variable - but what if you want to know the effects of multiple predictor variables on an outcome variable? For example, race time as a function of both distance and climb, or lung function as a function of age, BMI and drug dose (both examples are covered below)
In multiple regression the null hypothesis for each coefficient is that the effect of that independent variable is 0
Multiple regression
In regression using a single independent variable we wrote the model as:
y = b0 + b1*x
Where the value of y is the intercept b0 plus the value of x multiplied by the effect of x, b1
In multiple regression, we extend this to consider two or more independent variables:
y = b0 + b1*x1 + b2*x2 + … bn*xn
Where the value of y is the intercept b0 plus the value of all of the independent variables (x1, x2 … xn) each multiplied by their effect (b1, b2, … bn)
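In R the formula simply gains extra terms, e.g. y ~ x1 + x2. A minimal sketch with simulated (hypothetical) data, just to show the syntax before the real examples below:
set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
y <- 2 + 1.5*x1 - 0.8*x2 + rnorm(20)   ### hypothetical data with known coefficients
summary(lm(y ~ x1 + x2))               ### estimates should come out close to 2, 1.5 and -0.8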
Multiple regression
Let’s look at another example to see how multiple regression works, and how we can interpret the results and perform hypothesis testing
Who knows what Scottish hill racing is? We want to know whether the length of the race and the steepness of the climb affect the time it takes to complete the race
scottish <- read.table("scottish-hill-races.txt",header=T,sep="\t")
summary(scottish)
Race Distance.km. Climb..m. Mens.Time..min. Womens.Time..min.
Alva Games Hill Race : 1 Min. : 2.40 Min. : 160.0 Min. : 12.30 Min. : 14.80
Aonach Mor Uphill Race: 1 1st Qu.: 6.40 1st Qu.: 327.0 1st Qu.: 26.88 1st Qu.: 32.45
Arrochar Alps : 1 Median : 8.75 Median : 445.0 Median : 36.42 Median : 41.58
Beinn Lora Hill Race : 1 Mean :11.18 Mean : 672.9 Mean : 57.24 Mean : 67.63
Ben Aigan Hill Race : 1 3rd Qu.:13.90 3rd Qu.: 793.8 3rd Qu.: 70.99 3rd Qu.: 85.80
Ben Lomond Hill Race : 1 Max. :42.00 Max. :2420.0 Max. :206.73 Max. :253.52
(Other) :88 NA's :4 NA's :4
Multiple regression
We evaluate the linear model using both distance (km) and climb (m), which shows that both variables are highly significant predictors of men’s time
scot1 <- lm(Mens.Time..min. ~ Climb..m. + Distance.km.,data=scottish)
summary(scot1)
Call:
lm(formula = Mens.Time..min. ~ Climb..m. + Distance.km., data = scottish)
Residuals:
Min 1Q Median 3Q Max
-16.9062 -4.3878 0.3393 3.1767 22.1092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.372265 1.244506 -8.334 1.03e-12 ***
Climb..m. 0.034227 0.002174 15.744 < 2e-16 ***
Distance.km. 4.042043 0.144761 27.922 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.623 on 87 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.9802, Adjusted R-squared: 0.9798
F-statistic: 2157 on 2 and 87 DF, p-value: < 2.2e-16
Multiple regression
We evaluate the linear model using both distance (km) and climb (m), which shows that both variables are highly significant predictors of men’s time
scot1 <- lm(Mens.Time..min. ~ Climb..m. + Distance.km.,data=scottish)
Call:
lm(formula = Mens.Time..min. ~ Climb..m. + Distance.km., data = scottish)
Residuals:
Min 1Q Median 3Q Max
-16.9062 -4.3878 0.3393 3.1767 22.1092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.372265 1.244506 -8.334 1.03e-12 ***
Climb..m. 0.034227 0.002174 15.744 < 2e-16 ***
Distance.km. 4.042043 0.144761 27.922 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.623 on 87 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.9802, Adjusted R-squared: 0.9798
F-statistic: 2157 on 2 and 87 DF, p-value: < 2.2e-16
High correlation between actual and predicted values
Multiple regression
Residuals of the model appear normally distributed
scot1 <- lm(Mens.Time..min. ~ Climb..m. + Distance.km.,data=scottish)
qqnorm(resid(scot1))
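Another useful check (a sketch) is to plot the residuals against the fitted values and look for systematic patterns:
plot(fitted(scot1),resid(scot1))   ### residuals vs fitted values
abline(h=0,lty=2)                  ### residuals should scatter evenly around zero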
Multiple regression
What is the expected value of race time based on the model? A function of both distance and climb
From our fitted model:
Time = -10.37 + 0.034*climb + 4.04*distance
Say there are two new courses added this year - we can predict the time to complete each course based on its distance and climb
New course 1: distance=5km, climb=300m:
Time = -10.37 + 0.034*300 + 4.04*5 = 20 min
New course 2: distance=8km, climb=200m:
Time = -10.37 + 0.034*200 + 4.04*8 = 28.75 min
new <- data.frame(Distance.km.=5, Climb..m.=300)
predict(scot1,new)
1
20.1061
new <- data.frame(Distance.km.=8, Climb..m.=200)
predict(scot1,new)
1
28.80952
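predict() can also return interval estimates around these point predictions - for example (a sketch using the second new course defined above):
predict(scot1,new,interval="prediction")   ### 95% prediction interval for a single new race
predict(scot1,new,interval="confidence")   ### narrower interval for the mean time at these values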
Multiple regression
Let’s look at another example to see how multiple regression works, and how we can interpret the results and perform hypothesis testing
Say we want to determine whether the dose of a new drug, age, and BMI (body mass index) affect a measure of lung function in individuals with COPD
exp2 <- data.frame(lung=c(37.8,25.9,34.4,37.1,34.2,35.7,49.5,50.6,28.5,34.8),age=c(65,80,52,50,64,55,47,40,75,54),bmi=c(22,31,22,23,27,29,31,21,29,27),drug=c(4,5,7,3,5,4,6,3,6,3))
summary(exp2)
lung age bmi drug
Min. :25.90 Min. :40.00 Min. :21.00 Min. :3.00
1st Qu.:34.25 1st Qu.:50.50 1st Qu.:22.25 1st Qu.:3.25
Median :35.25 Median :54.50 Median :27.00 Median :4.50
Mean :36.85 Mean :58.20 Mean :26.20 Mean :4.60
3rd Qu.:37.62 3rd Qu.:64.75 3rd Qu.:29.00 3rd Qu.:5.75
Max. :50.60 Max. :80.00 Max. :31.00 Max. :7.00
Multiple regression
We evaluate the linear model using age, BMI and drug dose, which shows that only age is a significant predictor of lung function in COPD
exp2.lm <- lm(lung ~ age + bmi + drug,data=exp2)
summary(exp2.lm)
Call:
lm(formula = lung ~ age + bmi + drug, data = exp2)
Residuals:
Min 1Q Median 3Q Max
-5.1787 -3.5773 0.3358 3.6308 5.6531
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.84079 11.94922 5.343 0.00176 **
age -0.56025 0.15575 -3.597 0.01141 *
bmi 0.20565 0.50588 0.407 0.69847
drug 0.04954 1.27331 0.039 0.97023
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.081 on 6 degrees of freedom
Multiple R-squared: 0.7225, Adjusted R-squared: 0.5837
F-statistic: 5.207 on 3 and 6 DF, p-value: 0.04158
Model comparison
Which predictor variables should we include in our model?
exp2.lm <- lm(lung ~ age + bmi + drug,data=exp2)
summary(exp2.lm)
Call:
lm(formula = lung ~ age + bmi + drug, data = exp2)
Residuals:
Min 1Q Median 3Q Max
-5.1787 -3.5773 0.3358 3.6308 5.6531
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.84079 11.94922 5.343 0.00176 **
age -0.56025 0.15575 -3.597 0.01141 *
bmi 0.20565 0.50588 0.407 0.69847
drug 0.04954 1.27331 0.039 0.97023
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.081 on 6 degrees of freedom
Multiple R-squared: 0.7225, Adjusted R-squared: 0.5837
F-statistic: 5.207 on 3 and 6 DF, p-value: 0.04158
exp2.lm2 <- lm(lung ~ age,data=exp2)
summary(exp2.lm2)
Call:
lm(formula = lung ~ age, data = exp2)
Residuals:
Min 1Q Median 3Q Max
-5.7263 -3.7727 0.4714 3.2418 6.7315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.6052 7.0229 9.626 1.13e-05 ***
age -0.5284 0.1182 -4.471 0.00208 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.466 on 8 degrees of freedom
Multiple R-squared: 0.7141, Adjusted R-squared: 0.6784
F-statistic: 19.99 on 1 and 8 DF, p-value: 0.002082
anova(exp2.lm,exp2.lm2)
  Res.Df      RSS Df Sum of Sq          F    Pr(>F)
1      6 154.9134
2      8 159.5704 -2 -4.656958 0.09018506 0.9149773
The non-significant P value (0.91) tells us the full model does not fit the data significantly better than the model with age alone - so the simpler model is preferred
Feature selection
Which variables best explain the data?
library(MASS)
full.model <- lm(lung ~., data = exp2)
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
step.model
Call:
lm(formula = lung ~ age, data = exp2)
Coefficients:
(Intercept) age
67.6052 -0.5284
Feature selection - e.g. forward/backward/stepwise regression, LASSO
Assumptions of linear regression
As we have seen multiple times over - linear regression makes several assumptions that we should consider in our analyses:
1. The data generally appear to have a linear relationship
2. The residuals should be approx. normally distributed and have no obvious patterns
Visual inspection of the input data and of the residuals can help identify potential issues - and, as with other tests, transforming your data (e.g. log, exponential, square root, inverse, or more extreme transformations such as quantile normalization) may help
Unlike some other tests, there are no commonly used non-parametric versions of linear regression (at least that I know of)
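For example, a minimal sketch of refitting after a log transformation of the response (purely illustrative - not implying the exp1 data need it):
exp1.log <- lm(log(diff)~hormone,data=exp1)   ### hypothetical re-fit on the log scale
qqnorm(resid(exp1.log))                       ### re-check the residuals after transforming
qqline(resid(exp1.log))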
Other issues - influential points
Your data may contain outlying values that could have a disproportionate effect on the linear model.
We saw similar issues with correlation - where a single outlying value can by itself drive a high Pearson correlation
In regression analyses it is prudent to identify and remove outlying values in hypothesis testing, as they can complicate the interpretation of your model
There are many methods for identifying influential points in a model - the R function influence.measures() is a good place to start
We will go over this in more detail on Friday
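A minimal sketch of these diagnostics applied to our earlier model:
influence.measures(exp1.lm)   ### table of influence diagnostics (DFBETAS, Cook's distance, hat values)
cooks.distance(exp1.lm)       ### Cook's distance alone; unusually large values flag influential points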
Topics we didn’t get a chance to cover
Interactions in regression
Correlated variables in multiple regression
Non-linear regression (e.g. quadratic)
If any specific topics are of interest to you we can find time to cover them in week 8