Section 4
Sampling distributions
Confidence intervals
Hypothesis testing and p values
Population and sample
Samples drawn from a population
Population
sample
Sample is drawn “at random”. Everyone in the target population is equally eligible for sampling.
Similar to a blood sample: if we measure the glucose level in a blood sample, we assume it reflects the glucose level in all the blood in the body.
True population distribution of Y (individuals) - not Gaussian
Mean Y=μ= 1.5, SD=σ=1.12
Possible samples & statistics from the population (true mean=1.5)
sample (n=4) | mean (statistic) |
0,0,0,0 | 0.00 |
… | … |
1,1,3,2 | 1.75 |
… | … |
3,3,3,3 | 3.00 |
Central Limit Theorem
For a large enough n, the distribution of any sample statistic (mean, mean difference, OR, RR, hazard, correlation coeff, regression coeff, proportion, …) from sample to sample has a Gaussian (“Normal”) distribution centered at the true population value. The SD of this “sampling distribution” is called the standard error (SE). The SE is proportional to 1/√n.
But when n is small, the Central Limit Theorem may not hold, particularly if Y does not have a normal distribution.
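The CLT is easy to check by simulation. A minimal Python sketch (the exponential population, the seed, and the sizes are my choices for illustration, not from the slides):

```python
import random
import statistics

random.seed(0)

n = 50          # sample size per study
reps = 10_000   # number of repeated samples

# Skewed, clearly non-Gaussian population: exponential with mean 1 (sigma = 1)
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# CLT: the sample means cluster around the true mean (1.0), and their SD,
# the standard error, is close to sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141
grand_mean = statistics.fmean(sample_means)
se_observed = statistics.stdev(sample_means)
```

Even though each individual observation is far from Gaussian, a histogram of `sample_means` is close to a normal curve centered at the true value.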
Funnel plot - true difference is δ = 5
Each point is one study (meta-analysis)
Publication bias - non reproducibility
Studies with larger observed effects are more likely to be published, so published effects may be larger than the true average effect.
Science – 28 Aug 2015 (Nosek)
The mean effect size of the replication effects (M = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (M = 0.403, SD = 0.188). … Ninety-seven percent of original studies had significant results (p < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; …
Resampling estimation (“bootstrap”)
One does not repeatedly sample from the same population (one only carries out the study once). But a “simulation” of repeated sampling from the population can be obtained by repeatedly sampling from the sample, with replacement, and computing the statistic from each resample, creating an “estimated” sampling distribution. The SD of the statistics across all “resamples” is an estimate of the standard error (SE) for the statistic.
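The resampling recipe above can be sketched in a few lines of Python (the tiny example sample, the seed, and the number of resamples are my choices):

```python
import random
import statistics

random.seed(1)

# Original sample: the n=4 counts 1,1,3,2 from the earlier slide
sample = [1, 1, 3, 2]
n = len(sample)

# Resample WITH replacement, recomputing the statistic (here the mean)
boot_means = [
    statistics.fmean(random.choices(sample, k=n)) for _ in range(10_000)
]

# SD of the statistic across resamples estimates its standard error
boot_se = statistics.stdev(boot_means)

# Classical formula SEM = S / sqrt(n), for comparison
formula_se = statistics.stdev(sample) / n**0.5
```

For such a small n the bootstrap SE and the formula SEM differ noticeably; for larger samples they agree closely.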
Samples drawn from a sample
(figure: original sample drawn from the population; multiple resamples drawn from the original sample)
Sample is drawn “at random” with replacement. Everyone in the original sample is eligible for sampling.
Confidence interval (for μ)
(figure: lower and upper confidence bounds)
Confidence interval (for θ)
(figure: lower and upper confidence bounds)
95% Confidence intervals
95% of the intervals will contain the true population value
But which ones?
Z vs t (technical note)
Confidence intervals computed with normal (Z) percentiles assume that the population σ is known. Since σ is usually not known and is estimated with the sample SD, the Z percentiles need to be adjusted. The adjusted tables are called “t” tables instead of Gaussian (Z) tables (the t distribution). For n > 30, they are about the same (i.e., Z ≈ t).
Z vs t distribution (cont.)
The adjustment to the normal (Z) distribution that uses the sample SD instead of the unknown σ was first worked out by a statistician named Gosset, who published it under the pseudonym “Student”, so this is often called “Student's” t distribution.
Z distribution vs t distribution, n=5, n=30
Said t to Z: Does this distribution make my data look fat?
(t distribution is wider)
t vs Gaussian Z percentiles
%ile | 85th | 90th | 95th | 97.5th | 99.5th |
Confidence | 70% | 80% | 90% | 95% | 99% |
t, df=5 | 1.156 | 1.476 | 2.015 | 2.571 | 4.032 |
t, df=10 | 1.093 | 1.372 | 1.812 | 2.228 | 3.169 |
t, df=20 | 1.064 | 1.325 | 1.725 | 2.086 | 2.845 |
t, df=30 | 1.055 | 1.310 | 1.697 | 2.042 | 2.750 |
Gaussian Z | 1.036 | 1.282 | 1.645 | 1.960 | 2.576 |
What did the z distribution say to the t distribution?
You may look like me but you're not normal.
t degrees of freedom = df = n - number of groups = n - number of estimated parameters
Confidence Intervals (CI) *
CI for a proportion - “law” of small numbers
n=10, Proportion = 3/10 = 30%
What do you think are the 95% confidence bounds?
Is it ok to conclude that the “real” (true) proportion is less than 50%?
Answer: 95% CI: 6.7% to 65.3%
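For the curious, exact (Clopper-Pearson) bounds like these can be computed from the binomial distribution by a numeric search. A Python sketch (bisection is just one of several ways to invert the binomial tail; the helper names are mine):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a binomial proportion, by bisection."""
    def solve(f):
        lo, hi = 0.0, 1.0
        for _ in range(60):           # bisection: f is True below the root
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) else (lo, mid)
        return (lo + hi) / 2
    # lower bound: p where the upper tail P(X >= x) equals alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) < alpha / 2)
    # upper bound: p where the lower tail P(X <= x) equals alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(3, 10)
```

For 3 successes out of 10 this gives roughly (0.067, 0.652), matching the slide's 6.7% to 65.3% up to rounding.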
Standard error (SE) for the difference between two statistics from two different groups
Lancet 2010; 375: 1447–56
Statistics for HBA1c change from base to 26 weeks (Pratley et al, Lancet 2010)
Tx | n | Mean | SD | SE |
Liraglutide | 225 | -1.24% | 0.99% | 0.066% |
Sitagliptin | 219 | -0.90% | 0.98% | 0.066% |
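For two independent groups the SE of the difference combines the group SEs as √(SE1² + SE2²). A quick check in Python using the table's numbers:

```python
from math import sqrt

# Values from the HbA1c table above (Pratley et al, Lancet 2010)
n1, sd1 = 225, 0.99   # liraglutide
n2, sd2 = 219, 0.98   # sitagliptin

se1 = sd1 / sqrt(n1)              # ~0.066, as in the table
se2 = sd2 / sqrt(n2)              # ~0.066, as in the table
se_diff = sqrt(se1**2 + se2**2)   # SE of the mean difference, ~0.093
```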
Null hypothesis & p values
Hypothesis test statistics
Zobs = (Sample Statistic – null value) / Standard error
(figure: observed test statistic Z (or t) = 3.82 on the sampling distribution, whose mean is zero, the null hypothesis value)
Difference & Non inferiority (equivalence) hypothesis testing
Difference Testing (usual hypothesis testing):
Null Hyp: A=B (or A-B=0), Alternative: A≠B
Zobs = (observed stat – 0) / SE
Non inferiority (within δ) Testing:
Null Hyp: A > B + δ, Alternative: A <= B + δ
Zeq = (observed stat – δ )/ SE
Must specify δ for non inferiority testing
use t instead of Z if variance is not known
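As a sketch, the two test statistics above differ only in the null value subtracted (the function names and the example numbers are mine, not from the slides):

```python
def z_difference(stat, se):
    # Usual difference test: the null value is 0
    return (stat - 0) / se

def z_noninferiority(stat, delta, se):
    # Non-inferiority test: the null value is the margin delta
    return (stat - delta) / se

# Hypothetical numbers: observed difference 1.0, margin delta 0.4, SE 0.5
z_diff = z_difference(1.0, 0.5)
z_ni = z_noninferiority(1.0, 0.4, 0.5)
```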
Non inferiority testing-HBA1c data
Left ventricular assist device (LVAD)
Non inferiority testing-LVAD example
An LVAD (left ventricular assist device) keeps a weak heart pumping while the patient waits for a heart transplant. The primary outcome is non-failure at 90 days after LVAD placement. The current LVAD has a 90% non-failure rate. A new LVAD is under FDA consideration. The FDA will consider the new LVAD “non-inferior” to the current one if it is no more than δ = 4% worse than the current LVAD (i.e., the true value is 86% or better at 90 days).
What should the confidence interval look like?
What should the null hypothesis be?
LVAD example (continued)
Confidence interval:
Lower bound should be 86% or higher. Also preferred if 90% is within the confidence bound.
Hypothesis test p value:
Let δ be the true difference in failure percent between the current and the new LVAD. When computing the p value:
Should null δ = 0 ?
Should null δ = 4% ?
Confidence intervals versus hypothesis testing
Study equivalence demonstrated only from –δ to +δ
(1‑8) (brackets show 95% confidence intervals)
Stat Sig (statistically significant?)
1. Yes ----------------------------------------------------------------------------------------------- < not equivalent >
2. Yes -----------------------------------------------------------------------------< uncertain >--------------------
3. Yes ------------------------------------------------------------------< equivalent >-----------------------------------
4. No ---------------------------------------------------< equivalent >---------------------------------------------------
5. Yes ----------------------------------< equivalent > ----------------------------------------------------------------
6. Yes ---------------------< uncertain>----------------------------------------------------------------------------------
7. Yes -< not equivalent >-----------------------------------------------------------------------------------------------
8. No ---------<___________________________uncertain________________________________>------
| |
-δ 0 +δ
true difference
Ref: Statistics Applied to Clinical Trials - Cleophas, Zwinderman, Cleophas 2000, Kluwer Academic Pub, page 35
Non inferiority - JAMA 2006, Piaggio et al, p 1152-1160
Confidence intervals & p values
When the 95% confidence interval contains the null value, the two sided p value is > 0.05. When the 95% confidence interval excludes the null value, the two sided p value is < 0.05.
Above, α = 0.05 = 5% and 1 – α = 0.95 = 95%.
In general, the same applies for the 1-α level confidence interval and α.
When the (1-α) level confidence interval contains the null value, the corresponding two sided p value is > α.
When the (1-α) level confidence interval excludes the null value, the corresponding two sided p value is < α.
Many prefer confidence intervals to p values: confidence intervals are less prone to misinterpretation and do not depend on a null hypothesis. It is good practice to report both the confidence interval and the p value.
It is very bad practice to report only a p value without the statistic that goes with it (a “disembodied” p value).
Forest plot
Odds of different outcomes when exposed to a toxin
Paired Mean Comparisons
Example: Serum cholesterol in mmol/L
Difference (d) between baseline and end of 4 weeks
Subject | baseline | 4 wks | difference (di) |
1 | 9.0 | 6.5 | 2.5 |
2 | 7.1 | 6.3 | 0.8 |
3 | 6.9 | 5.9 | 1.0 |
4 | 6.9 | 4.9 | 2.0 |
5 | 5.9 | 4.0 | 1.9 |
6 | 5.4 | 4.9 | 0.5 |
mean | 6.87 | 5.42 | 1.45 |
SD | 1.24 | 0.97 | 0.79 |
SE | 0.51 | 0.40 | 0.32 |
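The paired analysis treats the differences as a single sample. A Python sketch reproducing the last column of the table above:

```python
import statistics
from math import sqrt

# Differences (baseline minus 4 weeks) from the table above, mmol/L
d = [2.5, 0.8, 1.0, 2.0, 1.9, 0.5]
n = len(d)

d_mean = statistics.fmean(d)   # 1.45, matching the table
d_sd = statistics.stdev(d)     # ~0.79
d_se = d_sd / sqrt(n)          # ~0.32

# Paired t statistic: test the differences against a null mean of 0
t_obs = d_mean / d_se          # df = n - 1 = 5
```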
Confidence Intervals, Hypothesis Tests
Confidence intervals are of the form
Sample Statistic +/- (Zpercentile*) (Standard error)
Lower bound = Sample Statistic- (Zpercentile)(Standard error)
Upper bound = Sample Statistic + (Zpercentile)(Standard error)
Hypothesis test statistics (Zobs*) are of the form
Zobs=(Sample Statistic – null value) / Standard error
* t percentile or tobs for continuous data when n is small. When the data do not follow the normal distribution and n is “small”, the above are only approximate.
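These two formulas can be sketched as small Python helpers (the names are mine; this is the large-sample Z case, so only approximate for small n):

```python
from statistics import NormalDist

def z_ci(stat, se, conf=0.95):
    # stat +/- (Z percentile) * SE  (large-sample / known-sigma case)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # 1.96 for 95%
    return stat - z * se, stat + z * se

def z_test(stat, null_value, se):
    # Zobs = (statistic - null value) / SE, with its two-sided p value
    z = (stat - null_value) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

lo, hi = z_ci(0.0, 1.0)        # (-1.96, +1.96)
z, p = z_test(1.0, 0.0, 0.5)   # z = 2.0
```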
Common sample statistics & their SEs
Sample Statistic | Symbol | Standard error (SE) |
Mean | y̅ | S/√n = √[S²/n] = SEM |
Mean difference | y̅1 – y̅2 = d̅ | √[S1²/n1 + S2²/n2] = SEd |
Proportion | P | √[P(1-P)/n] |
Proportion difference | P1 – P2 | √[P1(1-P1)/n1 + P2(1-P2)/n2] |
Log Odds ratio* | loge OR | √[1/a + 1/b + 1/c + 1/d] |
Log Risk ratio* | loge RR | √[1/a – 1/(a+c) + 1/b – 1/(b+d)] |
Slope (rate) | b | Serror / (Sx √(n-1)) |
Hazard rate (survival) | h | h/√[number of events] = h/√e |
Transform (z) of the correlation coefficient r* | z = ½ loge[(1+r)/(1-r)], back-transform r = (e^(2z) – 1)/(e^(2z) + 1) | SE(z) = 1/√(n-3) |
Log Hazard Ratio* (assumes constant h; e1 and e2 are numbers of events) | loge HR | √[(1/e1) + (1/e2)] |
*form CI bounds on log scale, then take anti-log
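As one worked example from the table, the correlation-coefficient row (transform r to z, form the CI on the z scale, back-transform) in Python (the function name and example values are mine):

```python
from math import log, exp, sqrt
from statistics import NormalDist

def correlation_ci(r, n, conf=0.95):
    """CI for a correlation via the Fisher z transform (table row above)."""
    z = 0.5 * log((1 + r) / (1 - r))             # transform r -> z
    se = 1 / sqrt(n - 3)                         # SE(z)
    zc = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    lo_z, hi_z = z - zc * se, z + zc * se
    back = lambda t: (exp(2 * t) - 1) / (exp(2 * t) + 1)  # z -> r
    return back(lo_z), back(hi_z)

# Hypothetical example: r = 0.5 with n = 103 observations
lo, hi = correlation_ci(0.5, 103)
```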
2 x 2 table – for OR, RR
| a | b |
| c | d |
Sample statistics & SEs (continued)
P1 = a/n1, P2=b/n2 , RR=P1/P2
SE(loge RR) =
√ [ 1/(P1n1) – 1/n1 + 1/(P2n2) – 1/n2 ]
confidence interval for true RR
lower=exp(loge(RR) – Z SE)
upper=exp(loge(RR) + Z SE)
| Y positive | Y negative |
X positive | a | b |
X negative | c | d |
total | a+c = n1 | b+d=n2 |
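Putting the RR formulas above together (a Python sketch; the function name and the example counts are mine):

```python
from math import log, exp, sqrt
from statistics import NormalDist

def risk_ratio_ci(a, b, c, d, conf=0.95):
    """RR and CI from the 2x2 table above (a, b are the Y-positive counts)."""
    n1, n2 = a + c, b + d
    p1, p2 = a / n1, b / n2
    rr = p1 / p2
    se = sqrt(1/a - 1/n1 + 1/b - 1/n2)           # SE of loge RR
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # form the bounds on the log scale, then take the anti-log
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Hypothetical counts: 30/100 exposed vs 15/100 unexposed
rr, lo, hi = risk_ratio_ci(30, 15, 70, 85)
```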
Confidence intervals for transformations
If “Q” is a statistic with confidence bounds (L, U) and f is a monotone (order-preserving) transformation, the confidence bounds for the new statistic f(Q) are ( f(L), f(U) ).
Example: log10 Estradiol (log E2) in 14 yr old females has a normal distribution with mean = 1.86 log pg/ml. The 95% confidence bounds for this mean log E2 are (1.76, 1.96). Median E2 is 10^1.86 = 72.4 pg/ml with 95% confidence bounds (10^1.76, 10^1.96) = (57.5 pg/ml, 91.2 pg/ml).
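The estradiol arithmetic, checked in Python:

```python
# Log-scale mean and 95% bounds from the estradiol example (log10 pg/ml)
mean_log, lo_log, hi_log = 1.86, 1.76, 1.96

# Back-transform: apply f(x) = 10**x to the estimate and to both bounds
median_e2 = 10 ** mean_log            # ~72.4 pg/ml
ci = (10 ** lo_log, 10 ** hi_log)     # ~(57.5, 91.2) pg/ml
```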
Guide to Testing- null values
Sample Statistic & Comparison | Population null hypothesis |
Comparing two means | True population mean difference is zero |
Comparing two proportions | True population difference is zero |
Comparing two medians | True population median difference is zero |
Odds ratio (comparing odds) | True population odds ratio is one |
Risk ratio=relative risk (comparing risks) | True population risk ratio is one |
Correlation coefficient (compare to zero) | True population correlation coefficient is zero |
Slope=rate of change=regression coefficient | True population slope is zero |
Comparing two survival curves | True difference in survival is zero at all times |
Hypothesis Test Guide
Statistic/type of comparison | Test/analysis procedure – gives p value |
Mean comparison-unpaired | t test (2 groups), ANOVA (3+ groups) |
Mean comparison-paired | paired t test, repeated measures ANOVA |
Median comparison-unpaired | Wilcoxon rank sum test, Kruskal-Wallis test* |
Median comparison-paired | Wilcoxon signed rank test on differences* |
Proportion comparison-unpaired | chi-square test (or Fisher's test) |
Proportion comparison-paired | McNemar's chi-square test |
Odds ratio | chi-square test, Fisher test |
Risk ratio | chi-square test, Fisher test |
Correlation, slope | regression, t statistic |
Survival curves, hazard rates | log rank test |
ANOVA = analysis of variance
* non parametric – Gaussian distribution theory is not used to get the p value
Hypothesis test guide- p values
Non parametric CIs & tests
When the sample size is small and the original distribution is not close to normal, the sampling distribution for the statistic may not follow the Gaussian, particularly in the distribution “tails”. Another method is then needed to generate the sampling distribution, the confidence bounds and the p value. The bootstrap (resampling) method can always be used. Other methods based on the ranks of the data have also been devised; these are called “non parametric” since they do not use the (parametric) Gaussian.
Parametric vs non parametric
Continuous data - non parametric methods compute p values using the ranks of the data.
They do not assume statistics follow a Gaussian distribution, particularly in the distribution “tails”.
Parametric | Nonparametric |
2 indep means - t test | 2 indep medians - Wilcoxon rank sum test (= Mann-Whitney) |
3+ indep means - ANOVA F test | 3+ indep medians - Kruskal-Wallis test |
Paired means - paired t test | Paired medians - Wilcoxon signed rank test |
Pearson correlation | Spearman correlation |
Any parametric test has a corresponding non parametric version.
Nomenclature for Testing
Delta (δ) = True difference or size of effect
Alpha (α) = Type I error = false positive
= Probability of rejecting the null hypothesis when it is true.
(Usually α is set to 0.05)
Beta (β) = Type II error = false negative
=Probability of not rejecting the null hypothesis when delta is not zero
( there is a real difference in the population)
Power = 1 – β
= Probability of getting a p value less than α
(ie declaring statistical significance)
when, in fact, there really is a non-zero delta.
We want small alpha levels and high power.
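For a large-sample two-sided z test, power has a simple closed form. A Python sketch (an approximation that ignores small-sample t corrections; the function name is mine):

```python
from statistics import NormalDist

def power_two_sided_z(delta, se, alpha=0.05):
    # Approximate power of a two-sided z test when the true effect is delta
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    shift = abs(delta) / se              # where Zobs is centered under H1
    # P(|Zobs| > z_crit) under the alternative = power = 1 - beta
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

When delta = 0 this returns alpha (the false-positive rate); a delta of 2.8 SEs gives roughly 80% power, a common design target.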