1 of 18

Generalised Linear Models - II

Saket Choudhary

saketc@iitb.ac.in

Biostatistics in Healthcare

DH 801

Lecture 13 || Friday, 7th November 2025

2 of 18

GLMs

3 of 18

Revisiting OLS

  • In ordinary least squares (OLS, or linear regression), we model the “response” (dependent) variable y as a linear combination of the “explanatory” (independent) variables x
  • The fit predicts the expected value of the “response” variable as a linear combination of the “explanatory” variables
  • The assumptions of linear regression are usually reasonable when the variation in the mean of the response variable is small compared to the change in the explanatory variables
  • However, when the expected response changes by large amounts (say, exponentially) with a small change in the explanatory variables, the usual assumptions of OLS (which ones?) start to fail
  • Example: a change in the level of snoring can increase the probability of developing heart disease by 4x

4 of 18

Problem

Survey data:

  • visits: Number of doctor consultations (outcome variable)
  • gender: Male or Female
  • age: Age in years (divided by 100 in raw data)
  • income: Annual income in tens of thousands
  • illness: Number of illnesses in past 2 weeks
  • health: Self-assessed health score (0-12, higher values = better health)
  • nchronic: Chronic condition indicator (no/yes)
  • private: Private health insurance indicator (no/yes)
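
A minimal sketch of loading and preparing such data in R; the file name, the object name health_data, and the derived variable names (chosen to match the model on the OLS slide) are assumptions for illustration:

# Hypothetical file and column names, for illustration only
health_data <- read.csv("health_survey.csv")

# Derived variables used in the models on later slides (assumed names)
health_data$age_years         <- health_data$age * 100        # age was stored as age/100
health_data$chronic_condition <- factor(health_data$nchronic) # no/yes
health_data$private_insurance <- factor(health_data$private)  # no/yes

str(health_data)            # variable types
summary(health_data$visits) # the outcome is a non-negative count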

5 of 18

Problem

[Slide content not extracted]

6 of 18

Problem

[Slide content not extracted]

7 of 18

Problem

[Slide content not extracted]

8 of 18

OLS

# Ordinary least squares fit of the number of doctor visits
ls_model <- lm(visits ~ health + chronic_condition + gender + age_years +
                 private_insurance + illness,
               data = health_data)
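
A quick way to see the trouble with OLS for a count outcome is to inspect the fit (a sketch, continuing with ls_model):

summary(ls_model)        # coefficients, standard errors, R-squared
range(fitted(ls_model))  # check: nothing stops OLS from fitting negative counts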

9 of 18

Linear regression: Assumptions

  1. L (linear) - There is a linear relationship between the outcome variable and each covariate.
  2. I (independent) - The outcomes for individual observations are independent of one another, given the covariates in the model.
  3. N (normal) - The residuals (errors) are normally distributed. Note that the variables themselves do not need to be normally distributed.
  4. E (equal variances) - The variance of the residuals is constant across covariate groups. (See the R sketch below for visual checks.)
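
These assumptions can be checked visually in R; a sketch using ls_model from the OLS slide:

# Standard diagnostic plots: residuals vs fitted (L, E), Q-Q plot (N),
# scale-location (E), residuals vs leverage
par(mfrow = c(2, 2))
plot(ls_model)
# Independence (I) follows from the study design and cannot be read off a plot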

10 of 18

What does the linear regression really model?

Independent variables (x) are assumed to be measured without error, while the dependent variable (y) is measured with error

  • The probability of observing yᵢ is then given by:

    P(yᵢ | xᵢ) = (1 / √(2π𝝈²)) · exp( −(yᵢ − m·xᵢ)² / (2𝝈²) )

  • The likelihood of observing the data (x₁, ..., xₙ) and (y₁, ..., yₙ) is given by:

    L(m, 𝝈²) = ∏ᵢ₌₁ⁿ P(yᵢ | xᵢ)

  • The conditional distribution of Y given X = x is Gaussian with mean m·x and variance 𝝈²: Y | X = x ~ N(m·x, 𝝈²)
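
Taking logs makes the connection to least squares explicit; a sketch of the standard derivation in the notation above:

\log L(m, \sigma^2) = \sum_{i=1}^{n} \log P(y_i \mid x_i)
                    = -\frac{n}{2} \log(2\pi\sigma^2)
                      - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - m x_i)^2
% Maximising \log L over m is therefore equivalent to minimising the
% residual sum of squares \sum_i (y_i - m x_i)^2, i.e. ordinary least squares.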

11 of 18

Generalized linear models

  • Linear models assume a linear relationship between the mean “response” variable (dependent) Y and the set of “covariates” or “independent” or “explanatory” variables.
    • Y is assumed to be normal
    • Models linear function of the mean
  • Generalized linear models: a superclass of linear models that allows the response (dependent variable) to follow a non-normal distribution and allows for non-linear functions of the mean:
    • Response variable: specifies the response variable and its probability distribution
    • Explanatory variables: specifies the p explanatory variables that form a linear predictor
      𝞰ᵢ = β₀ + β₁xᵢ₁ + … + βₚxᵢₚ, where xᵢⱼ is the value of explanatory variable j for observation i
    • Link function: a function that maps the mean of the response to the linear predictor

12 of 18

Components of a GLM

  • Random Component:
    • specifies the probability distribution of the response variable Y
    • Example:
      • Y ~ N(Xβ, 𝝈²) in OLS
      • Y ~ Bernoulli in the (binary) logistic regression model
    • This is the only random component in the GLM → no separate error term
  • Systematic Component: specifies the explanatory variables in the model, more specifically, their linear combination → 𝞰 = Xβ = β₀ + β₁x₁ + … + βₚxₚ
  • Link function 𝞰 = g(𝞵):
    • specifies the link between the random and the systematic components
    • Interpretation: how the expected value of the response relates to the linear combination of explanatory variables
    • OLS → 𝞰 = g(𝞵) = g(E[Y]) = E[Y] (identity link)
    • Poisson → 𝞰 = g(𝞵) = g(E[Y]) = ln(𝞵) (log link)
    • Logistic → 𝞰 = g(𝞵) = log(𝞵/(1−𝞵)) = logit(𝞵) (logit link; see the glm() sketch below)
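
In R, the three components map directly onto the glm() call; a generic sketch (y, x1, x2, and dat are placeholder names):

# Random component:     family =        (distribution of Y)
# Systematic component: the formula     (linear predictor Xβ)
# Link function:        link =          (g relating E[Y] to Xβ)
pois_fit  <- glm(y ~ x1 + x2, family = poisson(link = "log"),    data = dat)
logit_fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)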

13 of 18

Advantage of GLM rather than transforming

  • Before GLMs, a common way to handle response variables that violate the assumptions of OLS was to transform the response so that the transformed variable was approximately normal (log / square root / …)
  • However, finding a transformation that achieves this is itself tricky
  • It would forcefully model E[g(Y)], while we ideally want to model g(E[Y])
  • GLMs solve this problem by allowing Y to have a flexible distribution (the contrast is sketched below)
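
The difference is visible in code: transforming then applying OLS models the mean of log Y, while the GLM models the log of the mean (a sketch; visits and health_data as before, with an illustrative subset of covariates):

# Transform-then-OLS: models E[log(Y + 1)] as linear (the +1 handles zero counts)
log_model  <- lm(log(visits + 1) ~ health + illness, data = health_data)

# Poisson GLM: models log(E[Y]) as linear, no ad-hoc transformation needed
pois_model <- glm(visits ~ health + illness, family = poisson,
                  data = health_data)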

14 of 18

Link function

  • The probability distribution function captures how Y is distributed
  • The linear predictor captures the combined influence of the independent variables
  • The link function captures the relationship between the linear predictor and the mean of the distribution function (made concrete in the R sketch below)
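
In R, each family object carries its link as a pair of functions, which makes the mapping concrete (a small sketch):

fam <- poisson(link = "log")
fam$linkfun(2.5)    # g(mu): mean -> linear predictor, log(2.5) ≈ 0.916
fam$linkinv(0.916)  # g^(-1)(eta): linear predictor -> mean, exp(0.916) ≈ 2.5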

15 of 18

Ughhhh, this is so confusing…

  • A linear model fits Y = Xβ + e, where e is normally distributed: e ~ N(0, 𝝈²)
    • lm(y ~ x)
  • A GLM fits g(E[Y]) = Xβ, where g() is the “link” function and Y can follow a non-normal distribution (there is no separate error term)
    • glm(y ~ x, family = family(link = "link"))

[Table: choice of g(Y) and distribution of e for common model families]

16 of 18

Back to the problem
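
A sketch of refitting the earlier OLS specification as a Poisson GLM, a natural choice for a count outcome (same variable names as on the OLS slide):

pois_model <- glm(visits ~ health + chronic_condition + gender + age_years +
                    private_insurance + illness,
                  family = poisson(link = "log"), data = health_data)
summary(pois_model)
exp(coef(pois_model))  # log-link coefficients exponentiate to multiplicative
                       # effects on the expected number of visits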

17 of 18

Visualising GLM predictions
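
A sketch of plotting predictions on the response scale (pois_model from the previous slide; the covariate values held fixed are an illustrative choice):

# Vary the health score, hold the other covariates at typical values
newdat <- data.frame(health = seq(0, 12, by = 0.5),
                     chronic_condition = "no", gender = "Female",
                     age_years = 40, private_insurance = "yes", illness = 1)
newdat$pred <- predict(pois_model, newdata = newdat, type = "response")

plot(newdat$health, newdat$pred, type = "l",
     xlab = "Self-assessed health score", ylab = "Expected number of visits")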

18 of 18

Questions