1 of 38

Linear mixed models

BIOM285

May 16 2025

2 of 38

Outline

Linear mixed models - Friday May 16

Biological examples using mixed models - Mon May 19

3 of 38

Generalized linear models

Generalized linear models (GLMs) allow us to define a linear relationship between one or more predictor variables and an outcome variable that can follow a variety of distributions

A key assumption of GLMs is that the effect is ‘constant’ across samples for each predictor - which is also known as a ‘fixed effect’

4 of 38

Linear models

The ‘intercept’ of the line is 10 - because we start at hour 0 with a happiness of 10

The ‘slope’ of the line is -0.5 because we lose 0.5 units of happiness per hour

We can write the model for this line as:

happiness = 10 + -0.5 * Zoom hours

happiness = 10 + -0.5 * 5 = 7.5

y = b0 + b1*x

b0 - intercept, b1 - slope

In a linear model the effects are the same across individuals - i.e. everyone is assumed to lose happiness at the same rate
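The arithmetic of the line can be checked directly; a minimal sketch (in Python, using the slide's values):

```python
def happiness(zoom_hours, b0=10.0, b1=-0.5):
    # y = b0 + b1*x: start at happiness 10 and lose 0.5 units per Zoom hour
    return b0 + b1 * zoom_hours

print(happiness(0))  # 10.0 (the intercept)
print(happiness(5))  # 10 - 0.5*5 = 7.5
```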

5 of 38

Limitations of GLMs

The constraint of fixed effects for each predictor is a limitation of GLMs that prohibits their application to certain questions

For example:

- Repeated measures such as across a time course

- Longitudinal measurements from a cohort

- Technical replicates of measurements within independent samples

- Experimental measurements of varying effect within blocks/groups

- ‘Pseudo-replication’, e.g. profiling many cells in a sample, such as from single cell genomics or imaging

6 of 38

Overcoming these limitations

For these and other questions we can use linear mixed models (aka mixed-effect models)

Linear mixed models contain a mixture of effects including both ‘fixed’ effects and ‘random’ effects to account for additional patterns, or non-independence, in aspects of your data

This enables modeling of nested - or hierarchical - measurements organized in groups, e.g. within a sample or another categorical variable

What are random effects?

  • Effects that can vary within groups either in the intercept (starting point) or the slope (effect) or both
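The idea of group-varying effects can be sketched as a toy simulation (Python; the group counts and standard deviations are made-up illustration values): each group draws its own intercept and slope around the shared fixed effects.

```python
import random

random.seed(1)  # reproducible toy example
n_groups, n_per_group = 4, 10
fixed_intercept, fixed_slope = 10.0, -0.5
data = []
for g in range(n_groups):
    # Each group deviates from the fixed effects: a random intercept and a random slope
    group_intercept = fixed_intercept + random.gauss(0, 2.0)
    group_slope = fixed_slope + random.gauss(0, 0.2)
    for x in range(n_per_group):
        y = group_intercept + group_slope * x + random.gauss(0, 0.5)  # residual noise
        data.append((g, x, y))

print(len(data))  # 40 observations across 4 groups
```

Fitting a mixed model to data like this estimates the fixed effects plus the variances of the group-level deviations, rather than a separate free parameter for every group.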

7 of 38

Overcoming these limitations

Fixed effect:

Colors represent different groups of a categorical variable

- in this case university department

8 of 38

Overcoming these limitations

Random intercept:

9 of 38

Overcoming these limitations

Random slope:

10 of 38

Overcoming these limitations

Random intercept and slope:

11 of 38

Random effects

Random effects can vary by intercept, by slope, or both:

12 of 38

Mixed/random effects

Mixed effects can vary by intercept, by slope, or both:

The choice of whether to vary the intercept or the slope or both depends on the question being asked

A fully random effect (varying both intercept and slope) is the most flexible and generalizable (and in that sense preferred), but it requires more levels/data to estimate the per-group parameters accurately

13 of 38

Mixed/random effects

Mixed effects can vary by intercept, by slope, or both:

What is an example of where a random intercept and fixed slope might be appropriate?

14 of 38

Mixed/random effects

Mixed effects can vary by intercept, by slope, or both:

What is an example of where a fixed intercept and random slope might be appropriate?

15 of 38

Creating mixed effects models

Mixed effects can vary by intercept, by slope, or both:

The ‘lmer’ function in the lme4 package can be used to fit mixed-effects linear models

For linear regression/GLMs - the syntax for predictor x and response y with a fixed effect is: y ~ x

For mixed models, the syntax for considering a fixed and random effect is:

y ~ x + (1|z) fixed effect x, random intercept for z

y ~ x + (0 + x|z) fixed effect x, random slope of x within z

y ~ x + (x|z) fixed effect x, random effect of x within z
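In equation form, in the slides' b notation (u and v denote the group-level deviations from the fixed intercept and slope for group z; this notation is added here for clarity):

y ~ x + (1|z): y = (b0 + u_z) + b1*x

y ~ x + (0 + x|z): y = b0 + (b1 + v_z)*x

y ~ x + (x|z): y = (b0 + u_z) + (b1 + v_z)*x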

16 of 38

Example of where linear mixed models can be effective

What is the effect of increasing doses of a new drug on the weight of mice when using different strains of mice?

model <- lm(Weight ~ Dose,data=mouse_data)

summary(model)

Call:

lm(formula = Weight ~ Dose, data = mouse_data)

Residuals:

Min 1Q Median 3Q Max

-5.6239 -1.5714 -0.3714 2.1863 4.8653

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 26.9636 0.9371 28.77 < 2e-16 ***

Dose -2.4387 0.2949 -8.27 8.63e-11 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.711 on 48 degrees of freedom

Multiple R-squared: 0.5876, Adjusted R-squared: 0.579

F-statistic: 68.4 on 1 and 48 DF, p-value: 8.627e-11

17 of 38

Example of where linear mixed models can be effective

What is the effect of increasing doses of a new drug on weight of mice? Consider that there are multiple strains of mice

library(lme4)

lmm <- lmer(Weight ~ Dose + (1|Strain),data=mouse_data)

Linear mixed model fit by REML ['lmerMod']

Formula: Weight ~ Dose + (1 | Strain)

Data: mouse_data

REML criterion at convergence: 156.5756

Random effects:

Groups Name Std.Dev.

Strain (Intercept) 2.842

Residual 0.939

Number of obs: 50, groups: Strain, 5

Fixed Effects:

(Intercept) Dose

25.839 -2.051

How can we model the effect of drug dose on weight considering different starting weights of each strain?

18 of 38

Example of where linear mixed models can be effective

What is the effect of increasing doses of a new drug on weight of mice? Consider that there are multiple strains of mice

coef(lmm)

$Strain

(Intercept) Dose

m1 27.93008 -2.050958

m2 25.17448 -2.050958

m3 22.13342 -2.050958

m4 24.65351 -2.050958

m5 29.30412 -2.050958

How can we model the effect of drug dose on weight considering different starting weights of each strain?
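As a numerical sanity check (values copied from the coef(lmm) output above): with a random intercept and fixed slope, every strain shares the slope, and the strain intercepts average back to the fixed-effect intercept.

```python
# Per-strain intercepts from coef(lmm); the slope (-2.051) is common to all strains
intercepts = [27.93008, 25.17448, 22.13342, 24.65351, 29.30412]
mean_intercept = sum(intercepts) / len(intercepts)
print(round(mean_intercept, 3))  # 25.839, matching the fixed-effect intercept
```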

19 of 38

Example of where linear mixed models can be effective

What is the effect of increasing doses of a new drug on weight of mice? Consider that there are multiple strains of mice

library(lme4)

lmm2 <- lmer(Weight ~ Dose + (Dose|Strain),data=mouse_data)

Linear mixed model fit by REML ['lmerMod']

Formula: Weight ~ Dose + (Dose | Strain)

Data: mouse_data

REML criterion at convergence: 153.4312

Random effects:

Groups Name Std.Dev. Corr

Strain (Intercept) 3.4405

Dose 0.2004 -1.00

Residual 0.9053

Number of obs: 50, groups: Strain, 5

Fixed Effects:

(Intercept) Dose

25.795 -2.052

optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings

How can we model the effect of drug dose on weight considering different starting weights of each strain and different effects within each strain?

20 of 38

Example of where linear mixed models can be effective

What is the effect of increasing doses of a new drug on weight of mice? Consider that there are multiple strains of mice

coef(lmm2)

$Strain

(Intercept) Dose

m1 28.37558 -2.202011

m2 25.06797 -2.009368

m3 21.20428 -1.784338

m4 24.43676 -1.972606

m5 29.88961 -2.290192

attr(,"class")

[1] "coef.mer"

How can we model the effect of drug dose on weight considering different starting weights of each strain and different effects within each strain?
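A similar check (values copied from the coef(lmm2) output above): with a fully random effect, both the intercept and slope vary by strain, and each column averages back to the corresponding fixed effect.

```python
# Per-strain intercepts and slopes from coef(lmm2)
intercepts = [28.37558, 25.06797, 21.20428, 24.43676, 29.88961]
slopes = [-2.202011, -2.009368, -1.784338, -1.972606, -2.290192]
print(round(sum(intercepts) / len(intercepts), 3))  # 25.795, the fixed intercept
print(round(sum(slopes) / len(slopes), 3))          # -2.052, the fixed slope
```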

21 of 38

Example of where linear mixed models can be effective

What is the effect of increasing doses of a new drug on weight of mice? Consider that there are multiple strains of mice

library(lme4)

lmm2 <- lmer(Weight ~ Dose + (Dose|Strain),data=mouse_data)

Linear mixed model fit by REML ['lmerMod']

Formula: Weight ~ Dose + (Dose | Strain)

Data: mouse_data

REML criterion at convergence: 153.4312

Random effects:

Groups Name Std.Dev. Corr

Strain (Intercept) 3.4405

Dose 0.2004 -1.00

Residual 0.9053

Number of obs: 50, groups: Strain, 5

Fixed Effects:

(Intercept) Dose

25.795 -2.052

optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings

A lower REML criterion corresponds to a higher (restricted) likelihood - the fully random model (REML 153.4) is a better fit here than the random intercept model (REML 156.6)

22 of 38

Example of where linear mixed models can be effective

What about data which contains both biological and technical replicates?

Here we measure enzymatic activity across 10 samples and want to test for association with a disease phenotype, performing 5 technical replicates of the experiment for each sample

md <- lm(Enzyme ~ Pheno, data=enz)

summary(md)

Call:

lm(formula = Enzyme ~ Pheno, data = enz)

Residuals:

Min 1Q Median 3Q Max

-4.3232 -1.5239 -0.0828 1.7889 4.0199

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -2.56206 2.30423 -1.112 0.272

Pheno 0.47312 0.06109 7.745 5.36e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.33 on 48 degrees of freedom

Multiple R-squared: 0.5555, Adjusted R-squared: 0.5462

F-statistic: 59.98 on 1 and 48 DF, p-value: 5.364e-10

[Scatter plot: enzymatic activity vs. phenotype]

23 of 38

Example of where linear mixed models can be effective

What about data which contains both biological and technical replicates?

We could ‘collapse’ the technical replicates (average values) and that would allow us to use linear regression

summary(md)

Call:

lm(formula = Enzyme ~ Pheno, data = enz_mean)

Residuals:

Min 1Q Median 3Q Max

-5.142 -2.780 -1.456 2.427 9.973

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 19.7986 2.3410 8.457 4.52e-11 ***

Pheno 1.1611 0.1511 7.683 6.67e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.687 on 48 degrees of freedom

Multiple R-squared: 0.5515, Adjusted R-squared: 0.5422

F-statistic: 59.02 on 1 and 48 DF, p-value: 6.666e-10

[Scatter plot: mean enzymatic activity per sample vs. phenotype]
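The 'collapse' step described above is just a per-sample mean of the technical replicates; a minimal sketch (Python, with made-up replicate values):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sample, enzymatic activity) technical replicates
replicates = [("s1", 24.0), ("s1", 25.5), ("s1", 24.9),
              ("s2", 30.1), ("s2", 29.4), ("s2", 30.8)]

by_sample = defaultdict(list)
for sample, value in replicates:
    by_sample[sample].append(value)

# One averaged value per sample, ready for ordinary linear regression
collapsed = {s: mean(v) for s, v in by_sample.items()}
print(collapsed["s1"])  # 24.8
```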

24 of 38

Example of where linear mixed models can be effective

What about data which contains both biological and technical replicates?

Or we could model the replicate experiments using a random effect

md2 <- lmer(Enzyme ~ Pheno + (1|Sample),data=enz)

summary(md2)

Linear mixed model fit by REML ['lmerMod']

Formula: Enzyme ~ Pheno + (1 | Sample)

Data: enz

REML criterion at convergence: 271.8

Scaled residuals:

Min 1Q Median 3Q Max

-1.3948 -0.7540 -0.3951 0.6584 2.7051

Random effects:

Groups Name Variance Std.Dev.

Sample (Intercept) 0.00 0.000

Residual 13.59 3.687

Number of obs: 50, groups: Sample, 5

Fixed effects:

Estimate Std. Error t value

(Intercept) 19.7986 2.3410 8.457

Pheno 1.1611 0.1511 7.683

Correlation of Fixed Effects:

(Intr)

Pheno -0.975

[Scatter plot: enzymatic activity vs. phenotype, colored by sample]

25 of 38

Example of where linear mixed models can be effective

Here we collect data on the effects of a new drug on weight gain in a cohort at a hospital clinic

m1 <- lm(Value ~ Dose,data=clinic_data)

summary(m1)

Call:

lm(formula = Value ~ Dose, data = clinic_data)

Residuals:

Min 1Q Median 3Q Max

-79.02 -23.39 8.50 18.39 71.03

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 126.64 14.97 8.460 2.62e-13 ***

Dose 149.23 38.53 3.873 0.000194 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.94 on 98 degrees of freedom

Multiple R-squared: 0.1328, Adjusted R-squared: 0.1239

F-statistic: 15 on 1 and 98 DF, p-value: 0.0001938

26 of 38

Example of where linear mixed models can be effective

Here we collect data on the effects of a new drug on weight gain in a cohort at a hospital clinic - what if we look at samples separated by gender? The effects are very different

m2 <- lmer(Value ~ Dose + (Dose|Gender),data=clinic_data)

summary(m2)

Linear mixed model fit by REML ['lmerMod']

Formula: Value ~ Dose + (Dose | Gender)

Data: clinic_data

REML criterion at convergence: 907.8

Scaled residuals:

Min 1Q Median 3Q Max

-2.2807 -0.4143 0.0302 0.3313 3.5559

Random effects:

Groups Name Variance Std.Dev. Corr

Gender (Intercept) 4532 67.32

Dose 58513 241.89 -1.00

Residual 564 23.75

Number of obs: 100, groups: Gender, 2

Fixed effects:

Estimate Std. Error t value

(Intercept) 126.64 48.81 2.594

Dose 149.23 173.29 0.861

coef(m2)

$Gender

(Intercept) Dose

F 173.98091 -20.87315

M 79.29753 319.34242

27 of 38

Other examples of linear mixed models

In this example we determine whether levels of a new biomarker of interest are associated with a quantitative measure of disease

mod <- lm(Disease ~ Biomarker,data=biomarker_data)

summary(mod)

Call:

lm(formula = Disease ~ Biomarker, data = biomarker_data)

Residuals:

Min 1Q Median 3Q Max

-2.40826 -0.90334 -0.04954 0.64218 2.40477

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 130.5009 9.0963 14.347 <2e-16 ***

Biomarker -0.2382 0.1150 -2.071 0.0429 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.254 on 58 degrees of freedom

Multiple R-squared: 0.06883, Adjusted R-squared: 0.05278

F-statistic: 4.287 on 1 and 58 DF, p-value: 0.04286

28 of 38

Other examples of linear mixed models

How about if we account for the cohort that the samples come from? The pattern is opposite, as the effects are positive within cohort - what is this phenomenon called?

mod2 <- lmer(Disease ~ Biomarker + (Biomarker|Cohort),data=biomarker_data)

summary(mod2)

Linear mixed model fit by REML ['lmerMod']

Formula: Disease ~ Biomarker + (Biomarker | Cohort)

Data: biomarker_data

REML criterion at convergence: 137.4

Scaled residuals:

Min 1Q Median 3Q Max

-2.21860 -0.76260 -0.05759 0.83586 2.32448

Random effects:

Groups Name Variance Std.Dev. Corr

Cohort (Intercept) 0.6711521 0.81924

Biomarker 0.0001034 0.01017 0.98

Residual 0.4601512 0.67834

Number of obs: 60, groups: Cohort, 3

Fixed effects:

Estimate Std. Error t value

(Intercept) 83.90322 6.49943 12.909

Biomarker 0.35127 0.08219 4.274
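The reversal on this slide - a negative pooled slope but positive within-cohort slopes - is Simpson's paradox, and it can be reproduced with a few made-up points (Python; the numbers are illustrative, not the slide's data):

```python
def ols_slope(xs, ys):
    # Ordinary least-squares slope: cov(x, y) / var(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

cohort_a = [(1, 10), (2, 11), (3, 12)]  # within-cohort slope = +1
cohort_b = [(6, 5), (7, 6), (8, 7)]     # within-cohort slope = +1, but shifted down

xs = [x for x, _ in cohort_a + cohort_b]
ys = [y for _, y in cohort_a + cohort_b]
print(ols_slope([x for x, _ in cohort_a], [y for _, y in cohort_a]))  # 1.0
print(round(ols_slope(xs, ys), 2))  # -0.81: the pooled slope reverses sign
```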

29 of 38

Other examples of linear mixed models

In this example we determine whether levels of a new biomarker of interest are associated with a quantitative measure of disease

coef(mod2)

$Cohort

(Intercept) Biomarker

A 83.01421 0.3402402

B 84.08135 0.3534811

C 84.61410 0.3600939

30 of 38

Repeated measures and longitudinal data

Both are examples of multiple measurements performed within a sample over short or long periods of time

In this example we collected measurements of abundance of a protein biomarker on repeated visits every year - do the levels of the biomarker vary linearly with year?

mod2 <- lmer(Value ~ Year + (Year|Sample),data=rmdata)

summary(mod2)

Linear mixed model fit by REML ['lmerMod']

Formula: Value ~ Year + (Year | Sample)

Data: rmdata

REML criterion at convergence: 95.2

Scaled residuals:

Min 1Q Median 3Q Max

-1.51418 -0.62538 -0.02284 0.42124 1.62411

Random effects:

Groups Name Variance Std.Dev. Corr

Sample (Intercept) 27.089 5.2048

Year 1.805 1.3436 -0.56

Residual 0.200 0.4472

Number of obs: 30, groups: Sample, 6

Fixed effects:

Estimate Std. Error t value

(Intercept) 48.4253 2.1334 22.698

Year 1.5750 0.5516 2.856

The effect of year on value is an increase of ~1.5 units per year

Some samples show an effect and some don't, but the overall effect is positive

31 of 38

Diagnosing linear mixed effects models

Residuals (errors) from the model should be normally distributed and centered around 0

There should be no clear relationship between the residuals and the fitted values

32 of 38

Hypothesis testing with linear mixed models

The LMM will estimate the fixed and random effects but does not provide significance testing for whether these effects are different from the null expectation (i.e. true effect = 0)

One way to perform hypothesis testing of the fixed effect using LMMs is to perform model comparison of models with and without the fixed effect variable

From the previous example, is year a significant predictor of biomarker levels?

Biomarker levels increase by

~1.5 units/year

mod2 <- lmer(Value ~ Year + (Year|Sample),data=rmdata)

summary(mod2)

. . . .

Fixed effects:

Estimate Std. Error t value

(Intercept) 48.4253 2.1334 22.698

Year 1.5750 0.5516 2.856

33 of 38

Hypothesis testing with linear mixed models

The LMM will estimate the fixed and random effects but does not provide significance testing for whether these effects are different from the null expectation (effect = 0)

One way to perform hypothesis testing of the fixed effect using LMMs is to perform model comparison of models with and without the fixed effect variable

From the previous example, is year a significant predictor of biomarker levels?

Biomarker levels increase by

~1.5 units/year

Can reject the null (p=.01) and conclude that yes year affects biomarker levels

library(lmtest)

mod2 <- lmer(Value ~ Year + (Year|Sample),data=rmdata)

mod3 <- lmer(Value ~ (Year|Sample),data=rmdata)

lrtest(mod3,mod2)

Likelihood ratio test

Model 1: Value ~ (Year | Sample)

Model 2: Value ~ Year + (Year | Sample)

#Df LogLik Df Chisq Pr(>Chisq)

1 5 -50.775

2 6 -47.596 1 6.3578 0.01169 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
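The likelihood-ratio test arithmetic above can be reproduced from the reported log-likelihoods (Python; the chi-square survival function with 1 df is computed via erfc):

```python
import math

loglik_reduced = -50.775  # Value ~ (Year | Sample)
loglik_full = -47.596     # Value ~ Year + (Year | Sample)

chisq = 2 * (loglik_full - loglik_reduced)  # test statistic, df = 1
p_value = math.erfc(math.sqrt(chisq / 2))   # P(chi2 with 1 df > chisq)
print(round(chisq, 3))    # 6.358
print(round(p_value, 4))  # 0.0117
```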

34 of 38

Hypothesis testing with linear mixed models

The LMM will estimate the fixed and random effects but does not provide significance testing for whether these effects are different from the null expectation (effect = 0)

Another way to perform hypothesis testing of the fixed effect using LMMs is to use the lmerTest package (same lmer function/interface), which provides significance testing

From the previous example, is year a significant predictor of biomarker levels?

Biomarker levels increase by

~1.5 units/year

Can reject the null (p=.04) and conclude that yes, year affects levels

library(lmerTest)

mod5 <- lmer(Value ~ Year + (Year|Sample),data=rmdata)

summary(mod5)

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']

Formula: Value ~ Year + (Year | Sample)

Data: rmdata

REML criterion at convergence: 95.2

Scaled residuals:

Min 1Q Median 3Q Max

-1.51418 -0.62538 -0.02284 0.42124 1.62411

Random effects:

Groups Name Variance Std.Dev. Corr

Sample (Intercept) 27.089 5.2048

Year 1.805 1.3436 -0.56

Residual 0.200 0.4472

Number of obs: 30, groups: Sample, 6

Fixed effects:

Estimate Std. Error df t value Pr(>|t|)

(Intercept) 48.4253 2.1334 4.9999 22.698 3.09e-06 ***

Year 1.5750 0.5516 4.9996 2.856 0.0356 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

35 of 38

Linear mixed models with different outcome distributions

Just as GLMs are a generalization of LMs, we can use GLMMs to model outcomes that follow different distributions

For example, gene count data from cells (e.g. scRNA-seq) - cells are ‘pseudo-replicates’ within each sample and likely not truly independent measurements

[Figure: histograms of gene counts across cells (n=50 cells/sample), shown for samples s4 and s5]

36 of 38

Linear mixed models with different outcome distributions

Just as GLMs are a generalization of LMs, we can use GLMMs to model outcomes that follow different distributions

For example, gene count data from cells (e.g. scRNA-seq)

n=50 cells/sample

out <- glmer(counts ~ disease + (1|sample), data=tab, family="poisson")

summary(out)

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']

Family: poisson ( log )

Formula: counts ~ disease + (1 | sample)

Data: tab

AIC BIC logLik deviance df.resid

1637.8 1650.4 -815.9 1631.8 497

Scaled residuals:

Min 1Q Median 3Q Max

-1.7608 -1.0119 -0.0238 0.5116 4.4865

Random effects:

Groups Name Variance Std.Dev.

sample (Intercept) 1.121e-05 0.003348

Number of obs: 500, groups: sample, 10

Fixed effects:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.13140 0.03600 31.43 <2e-16 ***

diseaseND -1.10769 0.07212 -15.36 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Effect of disease on gene counts:

The coefficient for diseaseND is -1.1 on the log scale

exp(1.1) ≈ 3

i.e. counts differ ~3-fold between the disease groups (higher in disease)
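Because the Poisson GLMM uses a log link, the coefficient converts to a fold change by exponentiation; checking the slide's arithmetic (Python):

```python
import math

beta_diseaseND = -1.10769                # log-scale effect from the model output
fold_change = math.exp(-beta_diseaseND)  # counts in disease relative to non-disease
print(round(fold_change, 2))             # 3.03, i.e. ~3-fold higher counts in disease
```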

37 of 38

Assumptions/limitations of linear mixed models

Linear relationship between predictors and outcome variable

Residuals (error) are normally distributed and have constant variance

Random effects can only be specified for categorical (grouping) variables, and ideally each grouping factor should have more than 2-3 levels so that its variance can be estimated accurately

38 of 38

Summary

Linear regression and GLMs are limited by the effects of each predictor being fixed across samples/units

Mixed-effects linear models can estimate both a fixed effect as well as random effects within a unit/group and enable considering both independent and non-independent data

They can capture nested patterns within groups, including patterns that are not apparent - or are even reversed - in the pooled data

Mixed-effects models can be generalized to model outcome data following different distributions in the same manner as GLMs