1 of 18

Generalised Linear Models - II

Saket Choudhary

saketc@iitb.ac.in

Biostatistics in Healthcare

DH 801

Lecture 13 || Friday, 7th November 2025

2 of 18

GLMs

3 of 18

Revisiting OLS

  • In ordinary least squares (OLS, or linear regression), we model the “response” (dependent) variable y as a linear combination of the “explanatory” (independent) variables x
  • The fit predicts the expected value of the “response” variable as a linear combination of the “explanatory” variables
  • The assumptions of linear regression are usually reasonable when the variation in the mean of the response variable is small compared to the change in the explanatory variables
  • However, when the expected response changes by large amounts (say, exponentially) with a small change in the explanatory variables, the usual assumptions of OLS (which ones?) start to fail
  • Example: a change in the level of snoring can increase the probability of developing heart disease by 4x

4 of 18

Problem

Survey data:

  • visits: Number of doctor consultations (outcome variable)
  • gender: Male or Female
  • age: Age in years (divided by 100 in raw data)
  • income: Annual income in tens of thousands
  • illness: Number of illnesses in past 2 weeks
  • health: Self-assessed health score (0-12, higher values = better health)
  • nchronic: Chronic condition indicator (no/yes)
  • private: Private health insurance indicator (no/yes)
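
A minimal sketch of loading and preparing such data in R; the file name, the object name health_data, and the derived variable names (chosen to match the model on the OLS slide) are assumptions for illustration:

# Hypothetical file and column names, for illustration only
health_data <- read.csv("health_survey.csv")

# Derived variables used in the models on later slides (assumed names)
health_data$age_years         <- health_data$age * 100        # age was stored as age/100
health_data$chronic_condition <- factor(health_data$nchronic) # no/yes
health_data$private_insurance <- factor(health_data$private)  # no/yes

str(health_data)            # variable types
summary(health_data$visits) # the outcome is a non-negative count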

5 of 18

Problem

[Slide content not extracted]

6 of 18

Problem

[Slide content not extracted]

7 of 18

Problem

[Slide content not extracted]

8 of 18

OLS

# Ordinary least squares fit of the number of doctor visits
ls_model <- lm(visits ~ health + chronic_condition + gender + age_years +
                 private_insurance + illness,
               data = health_data)
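
A quick way to see the trouble with OLS for a count outcome is to inspect the fit (a sketch, continuing with ls_model):

summary(ls_model)        # coefficients, standard errors, R-squared
range(fitted(ls_model))  # check: nothing stops OLS from fitting negative counts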

9 of 18

Linear regression: Assumptions

  1. L (linear) - There is a linear relationship between the outcome variable and each covariate.
  2. I (independent) - The outcomes for individual observations are independent of one another, given the covariates in the model.
  3. N (normal) - The residuals (errors) are normally distributed. Note that the variables themselves do not need to be normally distributed.
  4. E (equal variances) - The variance of the residuals is constant across covariate groups. (See the R sketch below for visual checks.)
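
These assumptions can be checked visually in R; a sketch using ls_model from the OLS slide:

# Standard diagnostic plots: residuals vs fitted (L, E), Q-Q plot (N),
# scale-location (E), residuals vs leverage
par(mfrow = c(2, 2))
plot(ls_model)
# Independence (I) follows from the study design and cannot be read off a plot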

10 of 18

What does the linear regression really model?

Independent variables (x) are assumed to be measured without error, while the dependent variable (y) is measured with error

  • The probability of observing yᵢ is then given by:

    P(yᵢ | xᵢ) = (1 / √(2π𝝈²)) · exp( −(yᵢ − m·xᵢ)² / (2𝝈²) )

  • The likelihood of observing the data (x₁, ..., xₙ) and (y₁, ..., yₙ) is given by:

    L(m, 𝝈²) = ∏ᵢ₌₁ⁿ P(yᵢ | xᵢ)

  • The conditional distribution of Y given X = x is Gaussian with mean m·x and variance 𝝈²: Y | X = x ~ N(m·x, 𝝈²)
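
Taking logs makes the connection to least squares explicit; a sketch of the standard derivation in the notation above:

\log L(m, \sigma^2) = \sum_{i=1}^{n} \log P(y_i \mid x_i)
                    = -\frac{n}{2} \log(2\pi\sigma^2)
                      - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - m x_i)^2
% Maximising \log L over m is therefore equivalent to minimising the
% residual sum of squares \sum_i (y_i - m x_i)^2, i.e. ordinary least squares.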

11 of 18

Generalized linear models

  • Linear models assume a linear relationship between the mean “response” variable (dependent) Y and the set of “covariates” or “independent” or “explanatory” variables.
    • Y is assumed to be normal
    • Models linear function of the mean
  • Generalized linear models: a superclass of linear models that allows the response (dependent variable) to follow a non-normal distribution and allows for non-linear functions of the mean:
    • Response variable: specifies the response variable and its probability distribution
    • Explanatory variables: specifies the p explanatory variables that form a linear predictor
      𝞰ᵢ = β₀ + β₁xᵢ₁ + … + βₚxᵢₚ, where xᵢⱼ is the value of explanatory variable j for observation i
    • Link function: a function that maps the mean of the response to the linear predictor

12 of 18

Components of a GLM

  • Random Component:
    • specifies the probability distribution of the response variable Y
    • Example:
      • Y ~ N(Xβ, 𝝈²) in OLS
      • Y ~ Bernoulli in the (binary) logistic regression model
    • This is the only random component in the GLM → no separate error term
  • Systematic Component: specifies the explanatory variables in the model, more specifically, their linear combination → 𝞰 = Xβ = β₀ + β₁x₁ + … + βₚxₚ
  • Link function 𝞰 = g(𝞵):
    • specifies the link between the random and the systematic components
    • Interpretation: how the expected value of the response relates to the linear combination of explanatory variables
    • OLS → 𝞰 = g(𝞵) = g(E[Y]) = E[Y] (identity link)
    • Poisson → 𝞰 = g(𝞵) = g(E[Y]) = ln(𝞵) (log link)
    • Logistic → 𝞰 = g(𝞵) = log(𝞵/(1−𝞵)) = logit(𝞵) (logit link; see the glm() sketch below)
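
In R, the three components map directly onto the glm() call; a generic sketch (y, x1, x2, and dat are placeholder names):

# Random component:     family =        (distribution of Y)
# Systematic component: the formula     (linear predictor Xβ)
# Link function:        link =          (g relating E[Y] to Xβ)
pois_fit  <- glm(y ~ x1 + x2, family = poisson(link = "log"),    data = dat)
logit_fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)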

13 of 18

Advantage of GLM rather than transforming

  • Before GLMs, a common way to handle response variables that violate the assumptions of OLS was to transform the response so that the transformed variable was approximately normal (log / square root / …)
  • However, finding a transformation that achieves this is itself tricky
  • It would forcefully model E[g(Y)], while we ideally want to model g(E[Y])
  • GLMs solve this problem by allowing Y to have a flexible distribution (the contrast is sketched below)
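
The difference is visible in code: transforming then applying OLS models the mean of log Y, while the GLM models the log of the mean (a sketch; visits and health_data as before, with an illustrative subset of covariates):

# Transform-then-OLS: models E[log(Y + 1)] as linear (the +1 handles zero counts)
log_model  <- lm(log(visits + 1) ~ health + illness, data = health_data)

# Poisson GLM: models log(E[Y]) as linear, no ad-hoc transformation needed
pois_model <- glm(visits ~ health + illness, family = poisson,
                  data = health_data)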

14 of 18

Link function

  • The probability distribution function captures how Y is distributed
  • The linear predictor captures the combined influence of the independent variables
  • The link function captures the relationship between the linear predictor and the mean of the distribution function (made concrete in the R sketch below)
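
In R, each family object carries its link as a pair of functions, which makes the mapping concrete (a small sketch):

fam <- poisson(link = "log")
fam$linkfun(2.5)    # g(mu): mean -> linear predictor, log(2.5) ≈ 0.916
fam$linkinv(0.916)  # g^(-1)(eta): linear predictor -> mean, exp(0.916) ≈ 2.5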

15 of 18

Ughhhh, this is so confusing…

  • A linear model fits Y = Xβ + e, where e is normally distributed: e ~ N(0, 𝝈²)
    • lm(y ~ x)
  • A GLM fits g(E[Y]) = Xβ, where g() is the “link” function and Y can follow a non-normal distribution (there is no separate error term)
    • glm(y ~ x, family = family(link = "link"))

[Table: choice of g(Y) and distribution of e for common model families]

16 of 18

Back to the problem
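
A sketch of refitting the earlier OLS specification as a Poisson GLM, a natural choice for a count outcome (same variable names as on the OLS slide):

pois_model <- glm(visits ~ health + chronic_condition + gender + age_years +
                    private_insurance + illness,
                  family = poisson(link = "log"), data = health_data)
summary(pois_model)
exp(coef(pois_model))  # log-link coefficients exponentiate to multiplicative
                       # effects on the expected number of visits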

17 of 18

Visualising GLM predictions
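
A sketch of plotting predictions on the response scale (pois_model from the previous slide; the covariate values held fixed are an illustrative choice):

# Vary the health score, hold the other covariates at typical values
newdat <- data.frame(health = seq(0, 12, by = 0.5),
                     chronic_condition = "no", gender = "Female",
                     age_years = 40, private_insurance = "yes", illness = 1)
newdat$pred <- predict(pois_model, newdata = newdat, type = "response")

plot(newdat$health, newdat$pred, type = "l",
     xlab = "Self-assessed health score", ylab = "Expected number of visits")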

18 of 18

Questions