1 of 31

Model fitting and hypothesis testing

Saket Choudhary

saketc@iitb.ac.in

Introduction to Public Health Informatics

DH 302

Lecture 05 || Wednesday, 22nd January 2025

2 of 31

Goals for today

  • How to decide what model best explains the observation
  • Chi-squared test and G-test
  • Expectations, Variances, CLT, Normal approximation

3 of 31

Question: What is going on in this plot?

4 of 31

Question: What is going on in this plot?

5 of 31

Some potential questions we would like to answer with respect to the plots

  • Are certain points “outliers”?
  • Has the trend of deaths changed “significantly” with time?

To answer these questions, we first need to ask: what is the best model that represents our data?

Even before we answer that, we should learn about some (important) continuous distributions

6 of 31

Some (important) continuous distributions

  • Normal
  • Student’s t
  • Gamma
  • Chi-square (mean = k, variance = 2k)

7 of 31

Gaussian a.k.a Normal distribution

Normal

Quantities that are sums of a large number of subprocesses tend to be normally distributed

Example: the distribution of heights or blood pressures in a sample
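A quick sketch of this idea (an illustration, not from the lecture data): summing many independent Uniform(0, 1) draws already produces an approximately normal distribution, with mean n/2 and variance n/12.

```r
# CLT sketch: each value is the sum of 48 independent Uniform(0,1) draws.
# The sums are approximately Normal with mean 48 * 0.5 = 24
# and sd sqrt(48 / 12) = 2.
set.seed(42)
sums <- replicate(10000, sum(runif(48)))
c(mean = mean(sums), sd = sd(sums))  # close to (24, 2)
# hist(sums) shows the familiar bell shape
```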

8 of 31

What is normal about normal?

“The literature gives conflicting evidence about the origin of the term Normal distribution. Karl Pearson (1920) claimed to have introduced it ‘many years ago’, in order to avoid an old dispute over priority between Gauss and Legendre; but he gave no reference. Hilary Seal (1967) attributes it instead to Galton; but again fails to give a reference, so it would require a new historical study to decide this. However, the term had long been associated with the general topic: given a linear model y = Xβ + e where the vector y and the matrix X are known, the vector of parameters and the noise vector e unknown, Gauss (1823) called the system of equations which give the least squares parameter estimates ‘the normal equations’, and the ellipsoid of constant probability density was called the ‘normal surface’. It appears that somehow the name got transferred from the equations to the sampling distribution that leads to those equations.”

Standard normal has mean 0, variance 1

9 of 31

Chi-square distribution

Mean = k

Variance = 2k

Oi = an observed count for bin i

Ei = an expected count for bin i

χ² = Σi (Oi − Ei)² / Ei

Example: estimate the parameters by curve fitting and check how “good” the fit is at explaining the observations
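The stated mean and variance can be sanity-checked by simulation (an illustration, not part of the slides):

```r
# Simulated chi-squared draws with k degrees of freedom:
# the sample mean should be near k and the sample variance near 2k.
set.seed(1)
k <- 5
x <- rchisq(1e5, df = k)
c(mean = mean(x), var = var(x))  # approximately (5, 10)
```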

10 of 31

Student’s t

Student’s t

Gaussian-like distribution with “heavier tails”

As ν → ∞, it becomes Gaussian

ν = 1 gives the Cauchy distribution

Often used as a “statistic” to compare the difference in means of two populations (samples)

Why the name “Student”?

11 of 31

Why student’s t?

Why the name “Student”?

  • Student = William Sealy Gosset, who worked for the Guinness Brewery in Dublin
  • Gosset wrote the paper describing the “t-test” and published under the pseudonym “Student” → probably following a company mandate against employees publishing under their own names, or possibly because Guinness did not want competitors to learn that it was using the t-test to assess barley quality from the chemical properties of samples

William Sealy Gosset

12 of 31

T-test primer

X1, ..., Xn are independent realizations of a normally-distributed random variable X with mean μ

Sample mean: X̄ = (1/n) Σi Xi

Sample variance: S² = (1/(n − 1)) Σi (Xi − X̄)²

T = (X̄ − μ) / (S/√n) follows Student’s t-distribution with n − 1 degrees of freedom

13 of 31

Gamma, the ‘versatile’ distribution

Gamma

  • Versatile two-parameter distribution
  • Often used to model multistep processes where each step has the same rate
  • Examples: waiting time until a system needs to be repaired, cell-division events, the number of insurance claims, or the age distribution of cancer events
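The waiting-time interpretation can be sketched in R (parameter values are illustrative assumptions): the time until the k-th event of a rate-λ process is a sum of k exponential waits, which is exactly Gamma(shape = k, rate = λ).

```r
# Waiting time until the 4th repair event, each step occurring at rate 0.5:
# a sum of 4 Exponential(0.5) waits is Gamma(shape = 4, rate = 0.5).
set.seed(3)
k <- 4; lambda <- 0.5
waits <- replicate(5e4, sum(rexp(k, rate = lambda)))
c(mean = mean(waits), var = var(waits))  # near (k/lambda, k/lambda^2) = (8, 16)
```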

14 of 31

Question: What is going on in this plot?

15 of 31

Back to the question: What is going on in this plot?

16 of 31

Trauma- and bite-related deaths are seasonal

Is this a rare event?

17 of 31

Does the historical model fit the latest observations?

Does the Feb 2015 - Feb 2019 model explain the observations from Mar 2019 - Mar 2023?

18 of 31

Goodness of fit - Chi-squared test

Oi = an observed count for bin i

Ei = an expected count for bin i, asserted by the null hypothesis

χ² = Σi (Oi − Ei)² / Ei

Problem: what distribution should I fit?

Solution: quantify how well the expected model (frequencies) explains the observations

Calculate the p-value

19 of 31

Goodness of fit - Chi-squared test

Problem: What distribution should I fit?

20 of 31

Goodness of fit - Chi-squared test

Problem: What distribution should I fit?

Use a pseudocount of +1 in the frequencies

χ² = 5.744762

Is 5.7 high, low, or medium?

21 of 31


How to evaluate whether the prior model explains the observations?

  • We don’t know the truth, so we start with a “null hypothesis”: the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as those observed in Feb 2015 - Feb 2019
  • Summarize the discrepancy between expected and observed values (using a chi-squared test)
  • Select a testing threshold (α) - the probability threshold below which the null hypothesis can be rejected [this corresponds to the extreme ends of the distribution]
  • Compute the test statistic and reject the null hypothesis if it lands in the extremes (the probability of observing a test statistic at least this extreme is < α)
  • We never “accept” the null hypothesis: we only find evidence against it, probabilistically

22 of 31

Hypothesis testing

  • In hypothesis testing, we investigate whether the observations could have arisen by chance, i.e. whether random processes are sufficient to explain them.
  • Null hypothesis H0: the hypothesis that chance alone is responsible for the observations
  • Example: the observations in Mar 2019 - Mar 2023 can be explained by the observations in Feb 2015 - Feb 2019 alone.
  • Requires constructing a statistical model of what the observations would look like if chance or a random process alone could explain them
  • A “test statistic” measures the deviation of the observations from the expectations (i.e. from the null hypothesis)
  • We check whether the test statistic exceeds a pre-decided threshold. If it does → we “reject” the null hypothesis; if it does not, we “fail to reject” the null hypothesis

23 of 31


Visualizing the p-values region

[Figure: the distribution of T under H0, with “significant findings” rejection regions of area α/2 in each tail beyond the critical values Tα/2 and T1-α/2, a central “null findings” region, and the p-value shown as the tail area beyond the observed statistic Tobs.]

P-value = Probability of sampling a test statistic at least as extreme as the observed test statistic if the null hypothesis is true

We “reject” the null hypothesis (H0) if the p-value is below the threshold (α)

24 of 31


Type I,II errors and Power

  • Type I error:
    • Probability that the test incorrectly rejects the null hypothesis (H0) when the null H0 is true
    • Often denoted by α
  • Type II error:
    • Probability that the test incorrectly fails to reject the null hypothesis (H0) when H0 is false
    • Often denoted by β
  • Power:
    • Probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true
    • Commonly denoted by 1- β where β is the probability of making a Type II error by incorrectly failing to reject the null hypothesis.
    • As β increases, the power of a test decreases.
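These quantities can be computed directly in R with power.t.test; the sample size, effect size, and standard deviation below are illustrative assumptions, not values from the lecture.

```r
# Power of a two-sample t-test for detecting a mean difference of 1
# (sd = 2, n = 30 per group, alpha = 0.05); all values are illustrative.
pw <- power.t.test(n = 30, delta = 1, sd = 2, sig.level = 0.05)
pw$power      # 1 - beta: probability of correctly rejecting H0
1 - pw$power  # beta: the Type II error rate
```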

25 of 31


Type I,II errors and Power

[Figure: the distributions of T under H0 and under HA, showing the false-positive regions (tails of the H0 distribution beyond Tα/2 and T1-α/2), the false-negative region under HA, and the power.]

The false-positive rate is the probability of incorrectly rejecting H0.

The false-negative rate is the probability of incorrectly failing to reject H0.

Power = 1 − false-negative rate = probability of correctly rejecting H0.

26 of 31


Types of error

27 of 31

What is a p-value?

  • The p-value is NOT the probability of the alternate hypothesis being correct.
  • The p-value is NOT the probability of observing the result by chance.
  • P-value = probability of observing a result at least as extreme as the one observed, if the null hypothesis holds true.

28 of 31

Example of Chi-square in R

# observed, expected: monthly counts for the two periods (see earlier slides)
chi_square_stat <- sum((observed - expected)^2 / expected)
dof <- length(observed) - 1  # degrees of freedom = number of bins - 1
p_value <- pchisq(chi_square_stat, dof, lower.tail = FALSE)  # upper tail

alpha <- 0.05  # significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis")
} else {
  cat("Fail to reject the null hypothesis")
}

P-value = 0.33 (> 0.05)

Thus, we fail to reject the null hypothesis that the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as those observed in Feb 2015 - Feb 2019.

29 of 31

Another goodness of fit test - Likelihood ratio test (or G-test)

Oi = an observed count for bin i

Ei = an expected count for bin i, asserted by the null hypothesis

G = 2 Σi Oi ln(Oi / Ei)

G follows a chi-squared distribution with degrees of freedom = (number of bins − 1)

30 of 31

Example of G-test in R

# observed, expected: monthly counts for the two periods (see earlier slides)
G_stat <- 2 * sum(observed * log(observed / expected), na.rm = TRUE)
dof <- length(observed) - 1  # degrees of freedom = number of bins - 1
p_value <- pchisq(G_stat, df = dof, lower.tail = FALSE)  # upper tail, as for chi-squared

alpha <- 0.05  # significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis")
} else {
  cat("Fail to reject the null hypothesis")
}

P-value = 0.59 (> 0.05)

Thus, we fail to reject the null hypothesis that the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as those observed in Feb 2015 - Feb 2019.

31 of 31


Questions?