1 of 31

Model fitting and hypothesis testing

Saket Choudhary

saketc@iitb.ac.in

Introduction to Public Health Informatics

DH 302

Lecture 05 || Wednesday, 22nd January 2025

2 of 31

Goals for today

  • How to decide what model best explains the observation
  • Chi-squared test and G-test
  • Expectations, Variances, CLT, Normal approximation

3 of 31

Question: What is going on in this plot?

4 of 31

Question: What is going on in this plot?

5 of 31

Some potential questions we would like to answer with respect to the plots

  • Are certain points “outliers”?
  • Has the trend of deaths changed “significantly” with time?

To answer these questions, we first need to ask: what is the best model that represents our data?

Even before we answer that, we should learn about some (important) continuous distributions

6 of 31

Some (important) continuous distributions

  • Normal
  • Student’s t
  • Gamma
  • Chi-square (mean = k, variance = 2k)

7 of 31

Gaussian a.k.a Normal distribution

Normal

Quantities that are sums of a large number of subprocesses tend to be normally distributed

Example: the distribution of heights or blood pressures in a sample
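A quick sketch of this idea (an illustration, not from the lecture data): summing many independent Uniform(0, 1) draws already produces an approximately normal distribution, with mean n/2 and variance n/12.

```r
# CLT sketch: each value is the sum of 48 independent Uniform(0,1) draws.
# The sums are approximately Normal with mean 48 * 0.5 = 24
# and sd sqrt(48 / 12) = 2.
set.seed(42)
sums <- replicate(10000, sum(runif(48)))
c(mean = mean(sums), sd = sd(sums))  # close to (24, 2)
# hist(sums) shows the familiar bell shape
```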

8 of 31

What is normal about normal?

“The literature gives conflicting evidence about the origin of the term Normal distribution. Karl Pearson (1920) claimed to have introduced it ‘many years ago’, in order to avoid an old dispute over priority between Gauss and Legendre; but he gave no reference. Hilary Seal (1967) attributes it instead to Galton; but again fails to give a reference, so it would require a new historical study to decide this. However, the term had long been associated with the general topic: given a linear model y = Xβ + e where the vector y and the matrix X are known, the vector of parameters and the noise vector e unknown, Gauss (1823) called the system of equations which give the least squares parameter estimates ‘the normal equations’, and the ellipsoid of constant probability density was called the ‘normal surface’. It appears that somehow the name got transferred from the equations to the sampling distribution that leads to those equations.”

Standard normal has mean 0, variance 1

9 of 31

Chi-square distribution

Mean = k

Variance = 2k

Oi = an observed count for bin i

Ei = an expected count for bin i

χ² = Σi (Oi − Ei)² / Ei

Example: estimate the parameters by curve fitting and check how “good” the fit is at explaining the observations
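The stated mean and variance can be sanity-checked by simulation (an illustration, not part of the slides):

```r
# Simulated chi-squared draws with k degrees of freedom:
# the sample mean should be near k and the sample variance near 2k.
set.seed(1)
k <- 5
x <- rchisq(1e5, df = k)
c(mean = mean(x), var = var(x))  # approximately (5, 10)
```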

10 of 31

Student’s t

Student’s t

Gaussian-like distribution with “heavier tails”

As ν → ∞, it becomes Gaussian

ν = 1 gives the Cauchy distribution

Often used as a “statistic” to compare the difference in means of two populations (samples)

Why the name “Student”?

11 of 31

Why student’s t?

Why the name “Student”?

  • Student = William Sealy Gosset, who worked for the Guinness Brewery in Dublin
  • Gosset wrote the paper describing the “t-test” and published under the pseudonym “Student” → probably following a company mandate against employees publishing under their own names, or possibly because Guinness did not want competitors to learn that it was using the t-test to assess barley quality from the chemical properties of samples

William Sealy Gosset

12 of 31

T-test primer

X1, ..., Xn are independent realizations of a normally-distributed random variable X with mean μ

Sample mean: X̄ = (1/n) Σi Xi

Sample variance: S² = (1/(n − 1)) Σi (Xi − X̄)²

T = (X̄ − μ) / (S/√n) follows Student’s t-distribution with n − 1 degrees of freedom

13 of 31

Gamma, the ‘versatile’ distribution

Gamma

  • Versatile two-parameter distribution
  • Often used to model multistep processes where each step has the same rate
  • Examples: waiting time until a system needs to be repaired, cell-division events, the number of insurance claims, or the age distribution of cancer events
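The waiting-time interpretation can be sketched in R (parameter values are illustrative assumptions): the time until the k-th event of a rate-λ process is a sum of k exponential waits, which is exactly Gamma(shape = k, rate = λ).

```r
# Waiting time until the 4th repair event, each step occurring at rate 0.5:
# a sum of 4 Exponential(0.5) waits is Gamma(shape = 4, rate = 0.5).
set.seed(3)
k <- 4; lambda <- 0.5
waits <- replicate(5e4, sum(rexp(k, rate = lambda)))
c(mean = mean(waits), var = var(waits))  # near (k/lambda, k/lambda^2) = (8, 16)
```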

14 of 31

Question: What is going on in this plot?

15 of 31

Back to the question: What is going on in this plot?

16 of 31

Trauma- and bite-related deaths are seasonal

Is this a rare event?

17 of 31

Does the historical model fit the latest observations?

Does the Feb 2015 - Feb 2019 model explain the observations from Mar 2019 - Mar 2023?

18 of 31

Goodness of fit - Chi-squared test

Oi = an observed count for bin i

Ei = an expected count for bin i, asserted by the null hypothesis

χ² = Σi (Oi − Ei)² / Ei

Problem: what distribution should I fit?

Solution: quantify how well the expected model (frequencies) explains the observations

Calculate the p-value

19 of 31

Goodness of fit - Chi-squared test

Problem: What distribution should I fit?

20 of 31

Goodness of fit - Chi-squared test

Problem: What distribution should I fit?

Use a pseudocount of +1 in the frequencies

χ² = 5.744762

Is 5.7 high, low, or medium?

21 of 31


How to evaluate whether the prior model explains the observations?

  • We don’t know the truth, so we start with a “null hypothesis”: the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as those observed in Feb 2015 - Feb 2019
  • Summarize the discrepancy between expected and observed values (using a chi-squared test)
  • Select a testing threshold (α) - the probability threshold below which the null hypothesis can be rejected [this corresponds to the extreme ends of the distribution]
  • Compute the test statistic and reject the null hypothesis if it lands in the extremes (the probability of observing a test statistic at least this extreme is < α)
  • We never “accept” the null hypothesis: we only find evidence against it, probabilistically

22 of 31

Hypothesis testing

  • In hypothesis testing, we investigate whether the observations could have arisen by chance, i.e. whether random processes are sufficient to explain them.
  • Null hypothesis H0: the hypothesis that chance alone is responsible for the observations
  • Example: the observations in Mar 2019 - Mar 2023 can be explained by the observations in Feb 2015 - Feb 2019 alone.
  • Requires constructing a statistical model of what the observations would look like if chance or a random process alone could explain them
  • A “test statistic” measures the deviation of the observations from the expectations (i.e. from the null hypothesis)
  • We check whether the test statistic exceeds a pre-decided threshold. If it does → we “reject” the null hypothesis; if it does not, we “fail to reject” the null hypothesis

23 of 31


Visualizing the p-values region

[Figure: the distribution of T under H0, with “significant findings” rejection regions of area α/2 in each tail beyond the critical values Tα/2 and T1-α/2, a central “null findings” region, and the p-value shown as the tail area beyond the observed statistic Tobs.]

P-value = Probability of sampling a test statistic at least as extreme as the observed test statistic if the null hypothesis is true

We “reject” the null hypothesis (H0) if the p-value is below the threshold (α)

24 of 31


Type I,II errors and Power

  • Type I error:
    • Probability that the test incorrectly rejects the null hypothesis (H0) when the null H0 is true
    • Often denoted by α
  • Type II error:
    • Probability that the test incorrectly fails to reject the null hypothesis (H0) when H0 is false
    • Often denoted by β
  • Power:
    • Probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true
    • Commonly denoted by 1- β where β is the probability of making a Type II error by incorrectly failing to reject the null hypothesis.
    • As β increases, the power of a test decreases.
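These quantities can be computed directly in R with power.t.test; the sample size, effect size, and standard deviation below are illustrative assumptions, not values from the lecture.

```r
# Power of a two-sample t-test for detecting a mean difference of 1
# (sd = 2, n = 30 per group, alpha = 0.05); all values are illustrative.
pw <- power.t.test(n = 30, delta = 1, sd = 2, sig.level = 0.05)
pw$power      # 1 - beta: probability of correctly rejecting H0
1 - pw$power  # beta: the Type II error rate
```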

25 of 31


Type I,II errors and Power

[Figure: the distributions of T under H0 and under HA, showing the false-positive regions (tails of the H0 distribution beyond Tα/2 and T1-α/2), the false-negative region under HA, and the power.]

The false-positive rate is the probability of incorrectly rejecting H0.

The false-negative rate is the probability of incorrectly failing to reject H0.

Power = 1 − false-negative rate = probability of correctly rejecting H0.

26 of 31


Types of error

27 of 31

What is a p-value?

  • The p-value is NOT the probability of the alternate hypothesis being correct.
  • The p-value is NOT the probability of observing the result by chance.
  • P-value = probability of observing a result at least as extreme as the one observed, if the null hypothesis holds true.

28 of 31

Example of Chi-square in R

# observed, expected: monthly counts for the two periods (see earlier slides)
chi_square_stat <- sum((observed - expected)^2 / expected)
dof <- length(observed) - 1  # degrees of freedom = number of bins - 1
p_value <- pchisq(chi_square_stat, dof, lower.tail = FALSE)  # upper tail

alpha <- 0.05  # significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis")
} else {
  cat("Fail to reject the null hypothesis")
}

P-value = 0.33 (> 0.05)

Thus, we fail to reject the null hypothesis that the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as those observed in Feb 2015 - Feb 2019.

29 of 31

Another goodness of fit test - Likelihood ratio test (or G-test)

Oi = an observed count for bin i

Ei = an expected count for bin i, asserted by the null hypothesis

G = 2 Σi Oi ln(Oi / Ei)

G follows a chi-squared distribution with degrees of freedom = (number of bins − 1)

30 of 31

Example of G-test in R

# observed, expected: monthly counts for the two periods (see earlier slides)
G_stat <- 2 * sum(observed * log(observed / expected), na.rm = TRUE)
dof <- length(observed) - 1  # degrees of freedom = number of bins - 1
p_value <- pchisq(G_stat, df = dof, lower.tail = FALSE)  # upper tail, as for chi-squared

alpha <- 0.05  # significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis")
} else {
  cat("Fail to reject the null hypothesis")
}

P-value = 0.59 (> 0.05)

Thus, we fail to reject the null hypothesis that the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as those observed in Feb 2015 - Feb 2019.

31 of 31


Questions?