Model fitting and hypothesis testing
Saket Choudhary
Introduction to Public Health Informatics
DH 302
Lecture 05 || Wednesday, 22nd January 2025
Goals for today
Question: What is going on in this plot?
Some potential questions we would like to answer with respect to the plots
To answer these questions, we first need to answer: What is the best model that represents our data?
Even before we answer that, we should learn about some (important) continuous distributions
Some (important) continuous distributions
Normal
Student’s t
Gamma
Chi-square (mean = k, variance = 2k)
Gaussian a.k.a Normal distribution
Quantities that are the sum of a large number of independent subprocesses tend to be normally distributed
Example: Height/blood pressure distribution of a sample
What is normal about normal?
“The literature gives conflicting evidence about the origin of the term ‘Normal distribution’. Karl Pearson (1920) claimed to have introduced it ‘many years ago’, in order to avoid an old dispute over priority between Gauss and Legendre; but he gave no reference. Hilary Seal (1967) attributes it instead to Galton; but again fails to give a reference, so it would require a new historical study to decide this. However, the term had long been associated with the general topic: given a linear model y = Xβ + e where the vector y and the matrix X are known, the vector of parameters β and the noise vector e unknown, Gauss (1823) called the system of equations X′Xβ̂ = X′y, which give the least squares parameter estimates β̂, ‘the normal equations’, and the ellipsoid of constant probability density was called the ‘normal surface’. It appears that somehow the name got transferred from the equations to the sampling distribution that leads to those equations.”
Standard normal has mean 0, variance 1
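The claim that sums of many subprocesses tend to be normally distributed can be checked with a small simulation. This is only a sketch: the uniform subprocesses, the sample sizes, and the seed are assumptions chosen for illustration and do not come from the lecture data.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
n_sub <- 50        # hypothetical number of subprocesses per quantity
n_obs <- 10000     # hypothetical number of simulated quantities

# Each simulated quantity is the sum of n_sub independent Uniform(0, 1) draws
sums <- rowSums(matrix(runif(n_obs * n_sub), nrow = n_obs))

# Standardize with the known mean (n_sub / 2) and variance (n_sub / 12) of the sum
z <- (sums - n_sub / 2) / sqrt(n_sub / 12)

# For a standard normal, about 95% of values lie within +/- qnorm(0.975)
frac_within <- mean(abs(z) < qnorm(0.975))
```

If the sums are close to normal, `z` should have mean near 0, variance near 1, and `frac_within` near 0.95.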
Chi-square distribution
Mean = k
Variance = 2k
Oi = an observed count for bin i
Ei = an expected count for bin i
Example: Estimate the parameters by curve fitting and check how well the fitted model explains the observations
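The observed and expected counts defined above combine into Pearson's chi-squared statistic. The form below is the standard one and matches the R expression `sum((observed - expected)^2 / expected)` used later in these slides; the symbol n for the number of bins is an assumption of this sketch.

```latex
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
```

When the expected model is adequate, this statistic approximately follows a chi-square distribution with k degrees of freedom, so its typical size is about k (the mean), with variance 2k.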
Student’s t
Gaussian-like distribution with “heavier tails”
As ν → ∞ (ν = degrees of freedom), it becomes a Gaussian
For ν = 1, it becomes a Cauchy distribution
Often used as a test statistic to compare the means of two populations (samples)
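These properties of the t-distribution can be checked numerically with R's built-in density and distribution functions. The evaluation points and the choice of 5 degrees of freedom are arbitrary, picked only for this sketch.

```r
x <- c(-3, -1, 0, 2)   # arbitrary evaluation points

# With 1 degree of freedom, the t density coincides with the standard Cauchy density
cauchy_match <- isTRUE(all.equal(dt(x, df = 1), dcauchy(x)))

# With very many degrees of freedom, the t density approaches the standard normal density
max_gap_large_df <- max(abs(dt(x, df = 1e6) - dnorm(x)))

# Heavier tails: probability of a value beyond +/- 3, for t (5 df) versus the normal
tail_t5 <- 2 * pt(-3, df = 5)
tail_norm <- 2 * pnorm(-3)
```

The tail probability `tail_t5` is larger than `tail_norm`, which is what “heavier tails” means in practice.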
Why the name “Student”?
William Sealy Gosset
T-test primer
X1, ..., Xn are independent realizations of the normally-distributed random variable X
Sample mean
Sample variance
T follows Student’s t-distribution
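The quantities named above can be written out as follows. This is the standard one-sample form; the slide's own notation is not recoverable, so the symbol μ (the mean of X) is an assumption of this sketch.

```latex
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i,
\qquad
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2,
\qquad
T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}
```

Here T has n - 1 degrees of freedom because the sample variance is estimated from the same n observations.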
Gamma, the ‘versatile’ distribution
Back to the question: What is going on in this plot?
Trauma and bite related deaths are seasonal
Is this a rare event?
Does the historical model fit the latest observations?
Does the Feb 2015 - Feb 2019 model explain the observations from Mar 2019 - Mar 2023?
Goodness of fit - Chi-squared test
Oi = an observed count for bin i
Ei = an expected count for bin i, asserted by the null hypothesis
Problem: What distribution should I fit?
Solution: Quantify how well the expected model (frequencies) explains the observations
Calculate p-value
Use a pseudocount of +1 in frequencies
Resulting chi-squared statistic: χ² = 5.744762
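To see why a pseudocount helps, consider the sketch below. All counts are hypothetical, invented for illustration, so the statistic it produces is not the 5.744762 reported above. It also assumes the pseudocount is added to both periods and that expected counts are the historical proportions scaled to the observed total; the slides do not spell out either choice.

```r
# Hypothetical per-bin counts, invented purely for illustration
historical <- c(12, 0, 7, 9, 15, 4)    # stands in for the Feb 2015 - Feb 2019 period
observed   <- c(10, 2, 5, 11, 13, 6)   # stands in for the Mar 2019 - Mar 2023 period

# Without a pseudocount, an empty historical bin gives an expected count of 0,
# and the corresponding term (O - E)^2 / E is infinite
raw_expected <- sum(observed) * historical / sum(historical)
raw_stat <- sum((observed - raw_expected)^2 / raw_expected)

# With a pseudocount of +1 (ASSUMPTION: added to both periods), every expected count is positive
historical_pc <- historical + 1
observed_pc <- observed + 1
expected <- sum(observed_pc) * historical_pc / sum(historical_pc)
chi_square_stat <- sum((observed_pc - expected)^2 / expected)
```

Here `raw_stat` is infinite, while `chi_square_stat` is finite and can be compared against a chi-square distribution.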
Is 5.7 high/low/medium?
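One way to judge the size of the statistic is to compare it with the chi-square distribution it should follow under the null hypothesis. The sketch below assumes k = 5 degrees of freedom (six bins). The slides do not state the number of bins, so this is an assumption, although it gives a p-value of about 0.33, in line with the value reported later.

```r
chi_square_stat <- 5.744762   # value reported in the slides
dof <- 5                      # ASSUMPTION: six bins, so k = 5 (not stated in the slides)
alpha <- 0.05

# Under H0 the statistic has mean k and variance 2k
mean_under_h0 <- dof
sd_under_h0 <- sqrt(2 * dof)

# Value the statistic would need to exceed to be "high" at level alpha
critical_value <- qchisq(1 - alpha, dof)

# Probability of a statistic at least this large under H0
p_value <- pchisq(chi_square_stat, dof, lower.tail = FALSE)
is_high <- chi_square_stat > critical_value
```

Under this assumption the statistic sits close to its expected value of k and well below the critical value, so it would not count as high.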
How to evaluate whether the prior model explains the observations?
Hypothesis testing
Visualizing the p-value region
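[Figure: distribution of T under H0. The two tails beyond the critical values Tα/2 and T1-α/2, each of area α/2, are the “significant findings” regions; the central region holds the null findings. The p-value is the tail area beyond the observed statistic Tobs.]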
P-value = Probability of sampling a test statistic at least as extreme as the observed test statistic if the null hypothesis is true
We “reject” the null hypothesis (H0) if the p-value is below the threshold (α)
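For a t-distributed test statistic, the critical values and the two-sided p-value can be computed as sketched below. The degrees of freedom and the observed statistic are hypothetical values chosen only to illustrate the decision rule.

```r
alpha <- 0.05
dof <- 10       # hypothetical degrees of freedom
t_obs <- 2.5    # hypothetical observed test statistic

# Critical values T_{alpha/2} and T_{1 - alpha/2} under H0
t_lower <- qt(alpha / 2, dof)
t_upper <- qt(1 - alpha / 2, dof)

# Two-sided p-value: area in both tails at least as extreme as t_obs
p_value <- 2 * pt(-abs(t_obs), dof)

# Rejecting when p_value < alpha is the same as t_obs falling outside the critical values
reject <- p_value < alpha
outside_critical <- t_obs < t_lower || t_obs > t_upper
```

The two ways of stating the decision (`reject` and `outside_critical`) always agree for a two-sided test.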
Type I,II errors and Power
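[Figure: distributions of T under H0 and under HA, with the critical values Tα/2 and T1-α/2 marked. Shaded regions show the false-positive area (under H0, beyond the critical values), the false-negative area (under HA, between the critical values), and the power (under HA, beyond the critical values).]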
The false-positive rate is the probability of incorrectly rejecting H0.
The false-negative rate is the probability of failing to reject H0 when it is false.
Power = 1 – false-negative rate = probability of correctly rejecting H0.
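These three quantities can be computed directly in a simplified setting. The sketch below swaps in a two-sided z-test (known standard deviation) instead of a t-test so that the power has a closed form; the effect size and sample size are hypothetical.

```r
alpha <- 0.05
effect <- 0.5   # hypothetical true mean shift under HA, in units of the standard deviation
n <- 30         # hypothetical sample size

z_crit <- qnorm(1 - alpha / 2)   # two-sided critical value
shift <- effect * sqrt(n)        # mean of the test statistic under HA

# Under H0 the statistic is N(0, 1): probability of landing beyond the critical values
false_positive_rate <- 2 * pnorm(-z_crit)

# Under HA the statistic is N(shift, 1): probability of landing beyond the critical values
power <- pnorm(-z_crit - shift) + pnorm(shift - z_crit)
false_negative_rate <- 1 - power
```

By construction the false-positive rate equals α, while the power grows with both the effect size and the sample size.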
Types of error
What is a p-value?
Example of Chi-square in R
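# observed, expected: numeric vectors of per-bin counts (assumed to be defined earlier)
# Pearson chi-squared statistic: sum over bins of (O - E)^2 / E
chi_square_stat <- sum((observed - expected)^2 / expected)
dof <- length(observed) - 1  # degrees of freedom = number of bins - 1
# Upper-tail probability of a statistic at least this large under H0
p_value <- pchisq(chi_square_stat, dof, lower.tail = FALSE)
alpha <- 0.05  # Significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis\n")
} else {
  cat("Fail to reject the null hypothesis\n")
}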
p-value = 0.33 (> 0.05)
Thus, we fail to reject the null hypothesis that the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as the Feb 2015 - Feb 2019 ones.
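The manual calculation can be cross-checked against R's built-in `chisq.test()`. The counts below are hypothetical and chosen so that observed and expected totals match; they are not the lecture data.

```r
# Hypothetical per-bin counts with equal totals (113 each), invented for illustration
observed <- c(21, 13, 16, 22, 24, 17)
expected <- c(23, 11, 18, 20, 26, 15)

# Manual Pearson statistic and upper-tail p-value
manual_stat <- sum((observed - expected)^2 / expected)
manual_p <- pchisq(manual_stat, df = length(observed) - 1, lower.tail = FALSE)

# Built-in goodness-of-fit test: p gives the null-hypothesis proportions per bin
builtin <- chisq.test(x = observed, p = expected / sum(expected))
builtin_stat <- unname(builtin$statistic)
builtin_p <- builtin$p.value
```

Both routes give the same statistic and p-value, because `chisq.test()` computes expected counts as the total observed count times `p`.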
Another goodness of fit test - Likelihood ratio test (or G-test)
Oi = an observed count for bin i
Ei = an expected count for bin i, asserted by the null hypothesis
Under the null hypothesis, G approximately follows a chi-squared distribution with degrees of freedom = (number of bins - 1)
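The G statistic built from these counts has the standard form below, which matches the R expression `2 * sum(observed * log(observed / expected))` used in these slides; the symbol n for the number of bins is an assumption of this sketch.

```latex
G = 2 \sum_{i=1}^{n} O_i \ln\!\left( \frac{O_i}{E_i} \right)
```

Bins with O_i = 0 contribute nothing to the sum, by the convention that 0 ln 0 = 0.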
Example of G-test in R
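# observed, expected: numeric vectors of per-bin counts (assumed to be defined earlier)
# G statistic: 2 * sum over bins of O * log(O / E); na.rm = TRUE drops the NaN terms from empty bins (O = 0)
G_stat <- 2 * sum(observed * log(observed / expected), na.rm = TRUE)
dof <- length(observed) - 1  # degrees of freedom = number of bins - 1
# Upper-tail probability; pchisq() returns the lower tail by default, so lower.tail = FALSE is required
p_value <- pchisq(G_stat, df = dof, lower.tail = FALSE)
alpha <- 0.05  # Significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis\n")
} else {
  cat("Fail to reject the null hypothesis\n")
}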
P-value = 0.59 (>0.05)
Thus, we fail to reject the null hypothesis that the frequencies observed in Mar 2019 - Mar 2023 follow the same distribution as the Feb 2015 - Feb 2019 ones.
Questions?