Linear regression
BIOM285
May 5 2025
Outline for this week
Mon: Linear regression (quantitative ~ quantitative)
Wed: Contingency tables (categorical ~ categorical)
Fri: Examples of linear regression models
Regression
Regression is one of the most powerful and flexible statistical techniques that you can learn
Today we will learn about linear regression:
Linear regression models the relationship between a quantitative response (dependent) variable and one or more predictor (independent) variables
Linear regression with one predictor variable
Let’s start with the most straightforward case - which is linear regression of a dependent variable using a single predictor
Some examples could be:
- Growth of cell line versus concentration of glucose
- Gene expression level versus dose of drug
- Tumor size versus duration of chemotherapy
Linear regression with one predictor variable
The simple case is linear regression of a dependent variable using a single predictor
Back to our example from last week: happiness index versus the number of hours spent on Zoom
hours <- c(0,1,2,3,4,5,6,7,8,9)
happiness <- c(10,9.5,9,8.5,8,7.5,7,6.5,6,5.5)
plot(happiness~hours,ylim=c(5,10),xlim=c(0,9))
Linear regression with one predictor variable
Given these data which are perfectly correlated (R=-1) - we can add a straight line to the plot that models the relationship exactly
The ‘intercept’ of the line is 10 - because we start at hour 0 with a happiness index of 10
The ‘slope’ of the line is -0.5 because we lose 0.5 units of happiness per hour on Zoom
hours <- c(0,1,2,3,4,5,6,7,8,9)
happiness <- c(10,9.5,9,8.5,8,7.5,7,6.5,6,5.5)
plot(happiness~hours,ylim=c(5,10),xlim=c(0,9))
abline(10,-.5)
Linear regression with one predictor variable
The ‘intercept’ of the line is 10 - because we start at hour 0 with a happiness index of 10
The ‘slope’ of the line is -0.5 because we lose 0.5 units of happiness per hour
We can write the model for this line, where slope and intercept are ‘coefficients’ of the model, as:
happiness = 10 + -0.5 * Zoom hours
e.g. for 5 hours on Zoom: happiness = 10 + -0.5 * 5 = 7.5
Often the model will be written as:
y = b0 + b1*x
y - dependent variable, x - independent variable, b0 - intercept, b1 - slope
Fitting a regression line
We can evaluate a linear model using the R function ‘lm()’
Recall from the ANOVA lectures that the way to specify a linear model such as y = b0 + b1*x in an R formula is: y ~ x
hours <- c(0,1,2,3,4,5,6,7,8,9)
happiness <- c(10,9.5,9,8.5,8,7.5,7,6.5,6,5.5)
my_model <- lm(happiness~hours) ### happiness = b0 + b1*hours
my_model
Call:
lm(formula = happiness ~ hours)
Coefficients:
(Intercept) hours
10.0 -0.5 ### same coefficients we saw before
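As a quick visual check (not shown on the original slide), we can overlay the fitted line directly from the model object:
plot(happiness~hours,ylim=c(5,10),xlim=c(0,9))
abline(my_model)   ### abline() accepts an lm object and draws its fitted line
coef(my_model)     ### named vector of the intercept and slope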
Linear regression with one predictor variable
Let’s go back to another example - the amount of money we spend on coffee per hour worked in lab
hours <- c(0,1,2,3,4,5,6,7,8,9)
coffee <- c(0,2,4,6,8,10,12,14,16,18)
plot(coffee~hours,ylim=c(0,18),xlim=c(0,9))
Linear regression with one predictor variable
Let’s go back to another example - the amount of money we spend on coffee per hour worked in lab
We want to fit a line to these data... which line is the better fit?
Linear regression with one predictor variable
Let’s go back to another example - the amount of money we spend on coffee per hour worked in lab
For the second line - why isn’t this a good fit? Because the line is farther away from the points. The distance between each point and the line is called the ‘residual’ - the residuals from this line are larger, so it is a worse fit for the data
[Plot: the residuals are the vertical distances between the data points and the line]
Linear regression with one predictor variable
In linear regression we attempt to find a line that best fits the data, which means minimizing the residual values
The calculation linear regression performs is to find the line that minimizes the sum of the squared residuals from the line - and is therefore called ‘least squares’
Even though it is a pretty simple concept - it is one of the most powerful statistical tools you can use
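As a rough sketch of the least-squares idea (using the coffee data above and two hypothetical candidate lines), we can compare the sum of squared residuals directly:
hours <- c(0,1,2,3,4,5,6,7,8,9)
coffee <- c(0,2,4,6,8,10,12,14,16,18)
line1 <- 0 + 2*hours     ### candidate line: intercept 0, slope 2
line2 <- 1 + 1.5*hours   ### hypothetical alternative line: intercept 1, slope 1.5
sum((coffee - line1)^2)  ### 0 - this line fits the data perfectly
sum((coffee - line2)^2)  ### much larger - a worse fit by the least-squares criterion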
Linear regression with one predictor variable
Let’s look at simple examples to see how linear regression works, and how we can interpret the results and perform hypothesis testing
Say we want to determine whether the concentration of thyroid hormone in culture media affects differentiation of neurons derived from stem cells
exp1 <- data.frame(hormone=c(100,120,140,160,180,200,220,240),diff=c(20,25,24,28,30,27,32,30))
plot(diff~hormone,data=exp1,xlim=c(80,260),ylim=c(18,34))
Linear regression with one predictor variable
We evaluate a linear model using these data to first determine the coefficients of the model
The coefficients of the model are intercept=15.26 and slope=0.069:
Differentiation = 15.26 + 0.069*hormone
Meaning that for every unit of hormone added, differentiation increases by 0.069 units
exp1.lm <- lm(diff~hormone,data=exp1)
exp1.lm
Call:
lm(formula = diff ~ hormone, data = exp1)
Coefficients:
(Intercept) hormone
15.26190 0.06905
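To visualize the fit (a minimal sketch, not on the original slide), we can add the fitted line to the scatterplot:
plot(diff~hormone,data=exp1,xlim=c(80,260),ylim=c(18,34))
abline(exp1.lm)   ### overlay the least-squares line on the data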
Linear regression with one predictor variable
The line appears to be a good fit for our data - so let’s evaluate the hypothesis that hormone is a significant predictor of differentiation
Recall from the ANOVA lectures that we can use the ‘summary()’ function
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
We get lots of information in the summary
What we care most about is whether hormone is a significant predictor of differentiation, and what its estimated effect is
We can see from the coefficients table that hormone is a significant predictor (P=0.00505)
Therefore we can reject the null hypothesis of no effect of the hormone
We also get the estimated effect of hormone, the standard error of this estimate, and the test statistic (t statistic)
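If you want to pull these numbers out programmatically rather than reading them from the printout, one option (a sketch) is the coefficient matrix returned by coef() on the summary:
coef(summary(exp1.lm))                          ### matrix of estimates, SEs, t values and P values
coef(summary(exp1.lm))["hormone","Pr(>|t|)"]    ### P value for the hormone coefficient (~0.00505)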
Linear regression with one predictor variable
The line appears to be a good fit for our data - so let’s evaluate the hypothesis that hormone is a significant predictor of differentiation
Recall from the ANOVA lectures that we can use the ‘summary()’ function
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
We get lots of information in the summary
We also learn about whether the model as a whole is a good fit to the data
Even though we care mostly about the coefficients of the individual variables (e.g. hormone) it is good to check that the model overall is a good fit
Linear regression with one predictor variable
The line appears to be a good fit for our data - so let’s evaluate the hypothesis that hormone is a significant predictor of differentiation
Recall from the ANOVA lectures that we can use the ‘summary()’ function
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
The summary also reports R-squared - the squared correlation between the actual and predicted values from the fitted model - which we will come back to later
Linear regression with one predictor variable
How well does our model fit the data?
A fitted value is the value of y on the fitted regression line corresponding to a particular value of x - i.e. it is the expected value of y based on the model
From our fitted model:
Differentiation = 15.26 + 0.069*hormone
We can predict differentiation from a given hormone value:
Hormone=125:
Differentiation = 15.26 + 0.069*125 = 23.885
Hormone=175:
Differentiation = 15.26 + 0.069*175 = 27.335
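The same values can be obtained with predict() on the fitted model (small differences from the hand calculations are due to rounding the coefficients):
predict(exp1.lm,data.frame(hormone=c(125,175)))   ### ~23.9 and ~27.3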
Linear regression with one predictor variable
How well does our model fit the data?
A fitted value is the value of y on the fitted regression line corresponding to a particular value of x - i.e. it is the predicted value of y based on the model
We can view the fitted (expected) values of differentiation given our hormone data, as well as the residuals - the part of the differentiation values not explained by the hormone variable
fitted(exp1.lm)
1 2 3 4 5 6 7 8
22.16667 23.54762 24.92857 26.30952 27.69048 29.07143 30.45238 31.83333
resid(exp1.lm)
1 2 3 4 5 6 7 8
-2.1666667 1.4523810 -0.9285714 1.6904762 2.3095238 -2.0714286 1.5476190 -1.8333333
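As a sanity check (a minimal sketch), the fitted values and residuals should add back up to the observed data:
fitted(exp1.lm) + resid(exp1.lm)   ### reconstructs the observed differentiation values
exp1$diff                          ### the original data, for comparison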
Linear regression with one predictor variable
Residuals represent the left over information in the outcome not explained by our predictor (error)
Residuals should be roughly normally distributed and the average should be ~0
Residuals that show systematic patterns may indicate problems - e.g. the model may be over- or under-fitting (see the diagnostic-plot sketch after the summary output below)
summary(exp1.lm)
Call:
lm(formula = diff ~ hormone, data = exp1)
Residuals:
Min 1Q Median 3Q Max
-2.1667 -1.8929 0.2619 1.5833 2.3095
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
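A few quick diagnostic plots of the residuals (a sketch, not shown on the slide) can help judge whether these assumptions are reasonable:
hist(resid(exp1.lm))                    ### should look roughly symmetric around 0
qqnorm(resid(exp1.lm))                  ### points near a straight line suggest ~normal residuals
qqline(resid(exp1.lm))
plot(fitted(exp1.lm),resid(exp1.lm))    ### look for systematic patterns (curvature, funneling)
abline(h=0,lty=2)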
Linear regression with one predictor variable
How well does our model fit the data?
A fitted value is the value of y on the fitted regression line corresponding to a particular value of x - i.e. it is the predicted value of y based on the model
summary(exp1.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.26190 2.82254 5.407 0.00165 **
hormone 0.06905 0.01603 4.307 0.00505 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.078 on 6 degrees of freedom
Multiple R-squared: 0.7556, Adjusted R-squared: 0.7149
F-statistic: 18.55 on 1 and 6 DF, p-value: 0.005053
cor(exp1$diff,fitted(exp1.lm)) ### R
[1] 0.8692614
cor(exp1$diff,fitted(exp1.lm))^2 ### R2
[1] 0.7556155
We get lots of information in the summary
The summary also reports R-squared - the squared correlation between the actual and predicted values from the fitted model
An R-squared of 1 means that the regression line perfectly fits the data (as in our first example of happiness and Zoom calls)
Linear regression with one predictor variable
What about predicting other unseen values?
Say we want to know what the effects of different values for thyroid hormone would be on differentiation. Instead of running multiple additional experiments, we can leverage our model to predict them
new <- data.frame(hormone=c(134,223,300))
predict(exp1.lm,new)
1 2 3
24.51429 30.65952 35.97619
Can you see any potential issues with this?
Linear regression with one predictor variable
Summarizing the linear model gives you the standard error of the estimated effect of each predictor (independent) variable
What if you wanted to report the confidence interval instead?
confint(exp1.lm,level=.95)
2.5 % 97.5 %
(Intercept) 8.35539376 22.168416
hormone 0.02982127 0.108274
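For reference, the same interval can be computed by hand as the estimate plus or minus the appropriate t quantile times its standard error (using the 6 residual degrees of freedom from the summary):
0.06905 + c(-1,1)*qt(0.975,df=6)*0.01603   ### ~0.0298 to 0.1083, matching confint()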
Multiple regression
The examples we just went over used one predictor variable - but what if you want to know the effects of multiple predictor variables on an outcome variable? For example, race time as a function of both distance and climb, or lung function as a function of age, BMI and drug dose (both examples are covered below)
In multiple regression the null hypothesis for each coefficient is that the effect of that independent variable is 0
Multiple regression
In regression using a single independent variable we wrote the model as:
y = b0 + b1*x
Where the value of y is the intercept b0 plus the value of x multiplied by the effect of x, b1
In multiple regression, we extend this to consider two or more independent variables:
y = b0 + b1*x1 + b2*x2 + … bn*xn
Where the value of y is the intercept b0 plus the value of all of the independent variables (x1, x2 … xn) each multiplied by their effect (b1, b2, … bn)
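In R the formula simply gains extra terms, e.g. y ~ x1 + x2. A minimal sketch with simulated (hypothetical) data, just to show the syntax before the real examples below:
set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
y <- 2 + 1.5*x1 - 0.8*x2 + rnorm(20)   ### hypothetical data with known coefficients
summary(lm(y ~ x1 + x2))               ### estimates should come out close to 2, 1.5 and -0.8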
Multiple regression
Let’s look at another example to see how multiple regression works, and how we can interpret the results and perform hypothesis testing
Who knows what Scottish hill racing is? We want to know whether the length of the race and the steepness of the climb affect the time it takes to complete the race
scottish <- read.table("scottish-hill-races.txt",header=T,sep="\t")
summary(scottish)
Race Distance.km. Climb..m. Mens.Time..min. Womens.Time..min.
Alva Games Hill Race : 1 Min. : 2.40 Min. : 160.0 Min. : 12.30 Min. : 14.80
Aonach Mor Uphill Race: 1 1st Qu.: 6.40 1st Qu.: 327.0 1st Qu.: 26.88 1st Qu.: 32.45
Arrochar Alps : 1 Median : 8.75 Median : 445.0 Median : 36.42 Median : 41.58
Beinn Lora Hill Race : 1 Mean :11.18 Mean : 672.9 Mean : 57.24 Mean : 67.63
Ben Aigan Hill Race : 1 3rd Qu.:13.90 3rd Qu.: 793.8 3rd Qu.: 70.99 3rd Qu.: 85.80
Ben Lomond Hill Race : 1 Max. :42.00 Max. :2420.0 Max. :206.73 Max. :253.52
(Other) :88 NA's :4 NA's :4
Multiple regression
We evaluate the linear model using both distance (km) and climb (m), which shows that both variables are highly significant predictors of men’s time
scot1 <- lm(Mens.Time..min. ~ Climb..m. + Distance.km.,data=scottish)
summary(scot1)
Call:
lm(formula = Mens.Time..min. ~ Climb..m. + Distance.km., data = scottish)
Residuals:
Min 1Q Median 3Q Max
-16.9062 -4.3878 0.3393 3.1767 22.1092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.372265 1.244506 -8.334 1.03e-12 ***
Climb..m. 0.034227 0.002174 15.744 < 2e-16 ***
Distance.km. 4.042043 0.144761 27.922 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.623 on 87 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.9802, Adjusted R-squared: 0.9798
F-statistic: 2157 on 2 and 87 DF, p-value: < 2.2e-16
Multiple regression
We evaluate the linear model using both distance (km) and climb (m), which shows that both variables are highly significant predictors of men’s time
scot1 <- lm(Mens.Time..min. ~ Climb..m. + Distance.km.,data=scottish)
Call:
lm(formula = Mens.Time..min. ~ Climb..m. + Distance.km., data = scottish)
Residuals:
Min 1Q Median 3Q Max
-16.9062 -4.3878 0.3393 3.1767 22.1092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.372265 1.244506 -8.334 1.03e-12 ***
Climb..m. 0.034227 0.002174 15.744 < 2e-16 ***
Distance.km. 4.042043 0.144761 27.922 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.623 on 87 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.9802, Adjusted R-squared: 0.9798
F-statistic: 2157 on 2 and 87 DF, p-value: < 2.2e-16
High correlation between actual and predicted values
Multiple regression
Residuals of the model appear normally distributed
scot1 <- lm(Mens.Time..min. ~ Climb..m. + Distance.km.,data=scottish)
qqnorm(resid(scot1))
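Another useful check (a sketch) is to plot the residuals against the fitted values and look for systematic patterns:
plot(fitted(scot1),resid(scot1))   ### residuals vs fitted values
abline(h=0,lty=2)                  ### residuals should scatter evenly around zero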
Multiple regression
What is the expected value of race time based on the model? A function of both distance and climb
From our fitted model:
Time = -10.37 + 0.034*climb + 4.04*distance
Say there are two new courses added this year - we can predict the time to complete each course based on its distance and climb
New course 1: distance=5km, climb=300m:
Time = -10.37 + 0.034*300 + 4.04*5 = 20 min
New course 2: distance=8km, climb=200m:
Time = -10.37 + 0.034*200 + 4.04*8 = 28.75 min
new <- data.frame(Distance.km.=5, Climb..m.=300)
predict(scot1,new)
1
20.1061
new <- data.frame(Distance.km.=8, Climb..m.=200)
predict(scot1,new)
1
28.80952
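predict() can also return interval estimates around these point predictions - for example (a sketch using the second new course defined above):
predict(scot1,new,interval="prediction")   ### 95% prediction interval for a single new race
predict(scot1,new,interval="confidence")   ### narrower interval for the mean time at these values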
Multiple regression
Let’s look at another example to see how multiple regression works, and how we can interpret the results and perform hypothesis testing
Say we want to determine whether the dose of a new drug, age, and BMI (body mass index) affect a measure of lung function in individuals with COPD
exp2 <- data.frame(lung=c(37.8,25.9,34.4,37.1,34.2,35.7,49.5,50.6,28.5,34.8),age=c(65,80,52,50,64,55,47,40,75,54),bmi=c(22,31,22,23,27,29,31,21,29,27),drug=c(4,5,7,3,5,4,6,3,6,3))
summary(exp2)
lung age bmi drug
Min. :25.90 Min. :40.00 Min. :21.00 Min. :3.00
1st Qu.:34.25 1st Qu.:50.50 1st Qu.:22.25 1st Qu.:3.25
Median :35.25 Median :54.50 Median :27.00 Median :4.50
Mean :36.85 Mean :58.20 Mean :26.20 Mean :4.60
3rd Qu.:37.62 3rd Qu.:64.75 3rd Qu.:29.00 3rd Qu.:5.75
Max. :50.60 Max. :80.00 Max. :31.00 Max. :7.00
Multiple regression
We evaluate the linear model using age, BMI and drug dose, which shows that only age is a significant predictor of lung function in COPD
exp2.lm <- lm(lung ~ age + bmi + drug,data=exp2)
summary(exp2.lm)
Call:
lm(formula = lung ~ age + bmi + drug, data = exp2)
Residuals:
Min 1Q Median 3Q Max
-5.1787 -3.5773 0.3358 3.6308 5.6531
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.84079 11.94922 5.343 0.00176 **
age -0.56025 0.15575 -3.597 0.01141 *
bmi 0.20565 0.50588 0.407 0.69847
drug 0.04954 1.27331 0.039 0.97023
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.081 on 6 degrees of freedom
Multiple R-squared: 0.7225, Adjusted R-squared: 0.5837
F-statistic: 5.207 on 3 and 6 DF, p-value: 0.04158
Model comparison
Which predictor variables should we include in our model?
exp2.lm <- lm(lung ~ age + bmi + drug,data=exp2)
summary(exp2.lm)
Call:
lm(formula = lung ~ age + bmi + drug, data = exp2)
Residuals:
Min 1Q Median 3Q Max
-5.1787 -3.5773 0.3358 3.6308 5.6531
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.84079 11.94922 5.343 0.00176 **
age -0.56025 0.15575 -3.597 0.01141 *
bmi 0.20565 0.50588 0.407 0.69847
drug 0.04954 1.27331 0.039 0.97023
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.081 on 6 degrees of freedom
Multiple R-squared: 0.7225, Adjusted R-squared: 0.5837
F-statistic: 5.207 on 3 and 6 DF, p-value: 0.04158
exp2.lm2 <- lm(lung ~ age,data=exp2)
summary(exp2.lm2)
Call:
lm(formula = lung ~ age, data = exp2)
Residuals:
Min 1Q Median 3Q Max
-5.7263 -3.7727 0.4714 3.2418 6.7315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 67.6052 7.0229 9.626 1.13e-05 ***
age -0.5284 0.1182 -4.471 0.00208 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.466 on 8 degrees of freedom
Multiple R-squared: 0.7141, Adjusted R-squared: 0.6784
F-statistic: 19.99 on 1 and 8 DF, p-value: 0.002082
anova(exp2.lm,exp2.lm2)
  Res.Df      RSS Df Sum of Sq          F    Pr(>F)
1      6 154.9134
2      8 159.5704 -2 -4.656958 0.09018506 0.9149773
The non-significant P value (0.91) tells us the full model does not fit the data significantly better than the model with age alone - so the simpler model is preferred
Feature selection
Which variables best explain the data?
library(MASS)
full.model <- lm(lung ~., data = exp2)
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
step.model
Call:
lm(formula = lung ~ age, data = exp2)
Coefficients:
(Intercept) age
67.6052 -0.5284
Feature selection - e.g. forward/backward/stepwise regression, LASSO
Assumptions of linear regression
As we have seen multiple times over - linear regression makes several assumptions that we should consider in our analyses:
1. The data generally appear to have a linear relationship
2. The residuals should be approx. normally distributed and have no obvious patterns
Visual inspection of the input data and of the residuals can help identify potential issues - and, as with other tests, transforming your data (e.g. log, exponential, square root, inverse, or more extreme transformations such as quantile normalization) may help
Unlike some other tests, there are no commonly used non-parametric versions of linear regression (at least that I know of)
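For example, a minimal sketch of refitting after a log transformation of the response (purely illustrative - not implying the exp1 data need it):
exp1.log <- lm(log(diff)~hormone,data=exp1)   ### hypothetical re-fit on the log scale
qqnorm(resid(exp1.log))                       ### re-check the residuals after transforming
qqline(resid(exp1.log))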
Other issues - influential points
Your data may contain outlying values that could have a disproportionate effect on the linear model.
We saw similar issues with correlation - where a single outlying value can by itself drive a high Pearson correlation
In regression analyses it is prudent to identify and remove outlying values in hypothesis testing, as they can complicate the interpretation of your model
There are many methods for identifying influential points in a model - the R function influence.measures() is a good place to start
We will go over this in more detail on Friday
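A minimal sketch of these diagnostics applied to our earlier model:
influence.measures(exp1.lm)   ### table of influence diagnostics (DFBETAS, Cook's distance, hat values)
cooks.distance(exp1.lm)       ### Cook's distance alone; unusually large values flag influential points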
Topics we didn’t get a chance to cover
Interactions in regression
Correlated variables in multiple regression
Non-linear regression (e.g. quadratic)
If any specific topics are of interest to you we can find time to cover them in week 8