Section 5
Sample Size and Power
Multiple hypothesis testing
False discovery rate
Sample size (n) based on estimation precision-CI width
Can plan sample size so that standard errors (SEs) and the corresponding confidence intervals are sufficiently small.
(How small? How small do you need it to be?)
Does not need a formal comparison, unlike hypothesis testing.
Sample size based on size of the standard error (SE)
Want to estimate the true proportion π with sample proportion P. Sample size is n.
The standard error (SE) of P is
SE = √[P(1-P)/n]
SE² = P(1-P)/n
n = P(1-P)/SE²
If SE must be smaller, n must be larger
Sample size for precision/CIs
π = proportion with TB in the population (prevalence)
P = sample proportion with TB in a sample of size n
SE(P) = √[π(1- π)/n]
approximate 95% confidence interval for π: P ± 1.96(SE)
Precision: want to estimate true prevalence (π) within ± 6%
Solve for n: 1.96(SE) = 1.96 √[π(1- π)/n] = 0.06
n = 1.96² π(1- π)/(0.06)² = 3.84 π(1- π)/0.0036
Can estimate π using the observed p or use maximum at π=0.50
If π=0.15, n = 3.84 (0.15)(0.85)/0.0036 = 136
At π=0.50, n = 3.84 (0.50)(0.50)/0.0036 = 267 (worst case)
Rule of thumb: For a 95% CI for π, a conservative n for precision (half width) w is n = 1/w²
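The precision calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a validated calculator; the function name `n_for_half_width` and its parameters are made up for this example, and it uses the exact normal quantile rather than the rounded 1.96.

```python
from statistics import NormalDist

def n_for_half_width(pi, half_width, conf_level=0.95):
    """Approximate n so that the CI for a proportion has the given half width."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
    return z**2 * pi * (1 - pi) / half_width**2

n_tb = n_for_half_width(0.15, 0.06)     # prevalence guess of 15%
n_worst = n_for_half_width(0.50, 0.06)  # worst case, pi = 0.50
n_rule = 1 / 0.06**2                    # rule of thumb n = 1/w^2

print(round(n_tb), round(n_worst), round(n_rule))
```

The rule of thumb lands slightly above the worst-case value because it rounds 1.96² × 0.25 ≈ 0.96 up to 1.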
Figure: sample size (n) vs 95% CI half width, one proportion P
Power and Sample size based on hypothesis testing
Hypothesis test decision table
Test result | No difference in population (null is true) | Actual difference (null is false) |
Do not reject null (p value ≥ α) | 1-α (correct) | β (Type II error) |
Reject null (p value < α) | α (Type I error) | 1-β = power (correct) |
Statistical vs medical testing analogy
Medical testing truth: null – does not have disease
Person has disease (true pos ~ null hypothesis false) or
does not have disease (true negative ~ null hypothesis true)
Person tests positive or negative.
Test result is the observed data.
Test positive ~ p value < α; test negative ~ p value ≥ α
1-α: medical test specificity, prob test negative if truly negative
α: false positive rate, prob test positive if truly negative
1-β: medical test sensitivity = power, prob test positive if truly positive
β: false negative rate, prob test negative if truly positive
Determinants of power, n
Power (1-β) depends on
δ = delta = true difference
σ = sigma = true SD or true variation
α = alpha = significance criterion
n = sample size
Power increases as these increase except for σ
********
n depends on: δ, σ, α, 1-β=power.
n increases if σ, 1-β increase
n decreases if δ, α increase (α is more liberal)
Alpha versus Power (1-β)
Figure (two panels, horizontal axis Z):
Top panel: the sampling distribution of a test statistic Z under the assumption that delta (δ) is zero, so the null hypothesis is true (SE=1). Each tail area beyond the cutoff is α/2.
Bottom panel: the true sampling distribution (unknown at the time of testing), with a true population delta δ = 2.5, so the null hypothesis is false. The area to the right of the cutoff is the power, 1 - β.
Cutoff at Z=2, or α=0.05 (two sided)
Hypothesis testing & statistical power, true δ=5, SD=7, n=10,SE=2.21
Power calculation: Z_power = Z_obs – Z_(1-α/2) = (δ/SE) – Z_(1-α/2)
Z_obs = 0.34/0.66 = 0.516 (<1.96, so not statistically significant, p=0.622)
Z_power = Z_obs – Z_(1-α/2) = 0.516 – 1.96 = -1.44
From the Gaussian table or EXCEL, Z_power = -1.44 yields power of about 7%.
Treatment | n | Mean HbA1c chg | SD | SE |
Liraglutide | 5 | -1.24 | 0.99 | 0.44 |
Sitagliptin | 4 | -0.90 | 0.98 | 0.49 |
Difference | | 0.34=δ | | √[0.44² + 0.49²] = 0.66 |
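As a rough check of the power figure for this example, the Gaussian approximation can be coded directly. This is a sketch under the same normal approximation used on the slide; `approx_power` is an illustrative name, and the calculation ignores the small chance of rejecting in the opposite tail.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(delta, se, alpha=0.05):
    """Approximate power of a two-sided z test: Phi(delta/SE - z_(1-alpha/2))."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_power = delta / se - z_crit
    return NormalDist().cdf(z_power)

# SEs of the two group means, taken from the table above
se_diff = sqrt(0.44**2 + 0.49**2)
power = approx_power(0.34, se_diff)

print(f"SE of difference = {se_diff:.2f}, approximate power = {power:.3f}")
```

With so few patients per arm, a true difference of 0.34 would be detected only rarely, which is why the non-significant result is inconclusive rather than negative.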
Interpretation of power
If test is “statistically” significant (p < α), we have a “positive” or “significant” outcome & accept the false positive probability of α.
If test is not statistically significant (p > α) either there is no relationship (“negative” outcome) or sample size is inadequate (inconclusive).
If power is low for a given δ, results are inconclusive, not negative.
If power is high, results are affirmatively negative.
(But better to quote Confidence interval after the study is published)
Sample size to test difference between 2 means (this is NOT a universal formula)
Two independent groups, each with sample size
n = 2(Z_power + Z_(1-α/2))² (σ/δ)²
Z_0.975 = 1.96 and Z_power = 0.842 (for power of 80%), so
n = 2(0.842 + 1.96)² (σ/δ)² = 15.7 (σ/δ)²
or
n per group ≈ (range/δ)²
(since 15.7 ≈ 16, 16(σ/δ)² = (4σ/δ)² and the range ≈ 4σ)
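The formula for two independent groups is easy to evaluate in code. The sketch below is illustrative only: the function name `n_per_group` is hypothetical, and δ = 5 with SD = 7 are borrowed from the earlier power figure purely as example inputs.

```python
from statistics import NormalDist

def n_per_group(sigma, delta, power=0.80, alpha=0.05):
    """n per group for comparing two means (normal approximation, equal SDs)."""
    z_power = NormalDist().inv_cdf(power)          # 0.842 for 80% power
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for two-sided 0.05
    return 2 * (z_power + z_alpha) ** 2 * (sigma / delta) ** 2

multiplier = n_per_group(sigma=1.0, delta=1.0)  # the 15.7 in the formula
n_example = n_per_group(sigma=7.0, delta=5.0)   # example inputs, see lead-in
n_range_rule = (4 * 7.0 / 5.0) ** 2             # (range/delta)^2 with range = 4*sigma

print(round(multiplier, 1), round(n_example, 1), round(n_range_rule, 1))
```

The range rule gives a slightly larger n than the formula because it rounds 15.7 up to 16.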
Power for increasing delta
Areas under the curves and right of the vertical line are α for the black curve and power for the other curves.
The power is larger for the red curve than for the blue.
Curves shown: δ = 0, δ = 2.5 and δ = 3.5
Power Summary
Power increases as δ, α, or n increases, and as σ decreases.
Generally, we set α = 0.05 & power = 1 – β = 0.80. To determine n, we need to estimate δ and σ.
Often we use values of δ/σ for the calculation
For time to event outcomes (survival), n also depends on follow up time since “n” is the number of events. The sample size for comparing two survival curves is often computed based on comparing the corresponding two hazards.
Sample Size (n) Checklist
Sample size (n) depends on:
Effect size (δ) = smallest clinically important difference - n increases as δ decreases
Variability (σ) = patient heterogeneity - n increases as variability increases
Power (1-β) = probability of significantly detecting the effect (prob p value < α), often set at 80% or higher - n increases as required power increases
α level = probability of rejecting when δ is equal to its null value (often δ=0); two sided α is often set to 0.05 - n increases with smaller α - smaller α is more “stringent”
Must also consider the percent who will agree to participate and the accrual rate if all patients are not recruited at the same time.
*** for time to event (survival) outcomes ***
Follow up time = time each patient is followed - n decreases if patients are followed longer. In survival “n” is the number who have the outcome/event.
There are more events if follow up time is increased.
For time to event outcomes, must also consider the patient dropout / loss rate
Sample size per group by δ/σ: unpaired comparison of 2 means, mean difference=δ, SD=σ, two-sided α=0.05
δ/σ | 70% power | 80% power | 90% power |
0.10 | 1,234 | 1,570 | 2,102 |
0.15 | 549 | 698 | 934 |
0.20 | 309 | 392 | 525 |
0.25 | 198 | 251 | 336 |
0.50 | 49 | 63 | 84 |
0.75 | 22 | 28 | 37 |
1.00 | 12 | 16 | 21 |
1.25 | 8 | 10 | 13 |
1.50 | 5 | 7 | 9 |
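The table above can be regenerated, to within rounding, from the two-mean formula n = 2(Z_power + Z_(1-α/2))²/(δ/σ)². The sketch below assumes the table was built this way and rounds to the nearest integer. Because it uses exact normal quantiles rather than rounded ones, an entry may differ from the printed table by one.

```python
from statistics import NormalDist

def n_per_group(effect_size, power, alpha=0.05):
    """n per group for standardized effect size delta/sigma (normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)
    return 2 * z_sum**2 / effect_size**2

effect_sizes = [0.10, 0.15, 0.20, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50]
powers = [0.70, 0.80, 0.90]

# One row per effect size, one column per power level
table = {d: [round(n_per_group(d, p)) for p in powers] for d in effect_sizes}

print("delta/sigma | 70% power | 80% power | 90% power |")
for d, row in table.items():
    print(f"{d:.2f} | " + " | ".join(f"{n:,}" for n in row) + " |")
```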
Sample size per group for comparing two proportions, 80% power, alpha=0.05
Columns give the difference between P1 and P2, |P1 - P2| = δ.
Smaller of P1 & P2 | δ = 0.05 | δ = 0.10 | δ = 0.15 | δ = 0.20 |
0.05 | 434 | 140 | 71 | 45 |
0.10 | 685 | 199 | 99 | 62 |
0.15 | 904 | 250 | 120 | 72 |
sigma = sqrt(P (1-P))
Power/sample size calculators
Gpower – University of Dusseldorf
(free download)
Columbia University
http://www.biomath.info/power/ttest.htm
Harvard University (Hedwig)
http://hedwig.mgh.harvard.edu/sample_size/
Hypothesis testing limitations
Pseudo replication
Most variation is between persons, not within person.
Two blood samples on n=10 is not a sample size of 20.
Observed value = true population mean
+ between person variation (σp)
+ within person variation (σe)
Example: To estimate the mean
1. Compute a mean for each person using her “m” observations per person.
2. Compute the group mean from the “n” person means.
SEM = √[σp²/n + σe²/(nm)], usually σe < σp
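A small numeric illustration shows why two blood samples on n=10 persons is not a sample size of 20. The SD values below are hypothetical, chosen only so that σe < σp; `sem_repeated` is an illustrative name for the SEM formula above.

```python
from math import sqrt

def sem_repeated(sigma_p, sigma_e, n, m):
    """SEM of the group mean with n persons and m observations per person."""
    return sqrt(sigma_p**2 / n + sigma_e**2 / (n * m))

sigma_p, sigma_e = 1.0, 0.5  # hypothetical between- and within-person SDs

# 10 persons, 2 samples each (the pseudo-replication setting)
sem_true = sem_repeated(sigma_p, sigma_e, n=10, m=2)

# Naive SEM that wrongly treats the 20 samples as 20 independent persons
sem_naive = sqrt((sigma_p**2 + sigma_e**2) / 20)

# What 20 different persons with 1 sample each would actually give
sem_20_persons = sem_repeated(sigma_p, sigma_e, n=20, m=1)

print(round(sem_true, 3), round(sem_naive, 3), round(sem_20_persons, 3))
```

Under these assumed values the naive SEM matches 20 independent persons and understates the true SEM, because the second sample per person reduces only the within-person term.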
Statistical vs Medical “significance”
Average drop in weight (kg) after 3 months
Diet | Mean Drop | p | 95% CI |
I | 0.50 | <0.001 | (0.45,0.55) |
II | 10.0 | 0.16 | (-5.0, 25.0) |
(“A difference, in order to be a difference, must make a difference”–Gertrude Stein?).
p value limitations (ASA)
1. p values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone ignoring the model.
2. Conclusions should not be based only on whether a p-value passes a specific threshold (such as p < 0.05).
3. Proper inference requires full reporting & transparency.
4. A p-value does not measure the size of an effect or the importance of a result.
5. A p-value alone does not provide a good measure of evidence regarding a model or hypothesis.
R A Fisher on p values: Statistical Methods and Scientific Inference, Hafner, New York, ed. 1, 1956
“The concept that the scientific worker can regard himself as an inert item in a vast co-operative concern working according to accepted rules, is encouraged by directing attention away from his duty to form correct scientific conclusions,…and by stressing his supposed duty to mechanically make a succession of automatic “decisions”(ie p< 0.05). ...The idea that this responsibility can be delegated to a giant computer programmed with Decision Functions belongs to a phantasy of circles, rather remote from scientific research”. [pp. 104–105]
Multiple Hypothesis testing
Multiple efficacy endpoints /outcomes
Multiple safety endpoints/outcomes
Multiple treatment arms and/or doses
Multiple interim analyses
Multiple patient subgroups (subgroup analysis)
Multiple analyses
Exploratory vs confirmatory: protein example
Protein name | Atrial fib | Atherosclerosis | p value |
RAS guanyl-releasing protein 2 | 33.3% | 0.0% | 0.0000 |
Glutathione S-transferase P | 38.9% | 100.0% | 0.0000 |
Selenium-binding protein 1 | 22.2% | 0.0% | 0.0000 |
Nucleosome assembly protein 1-like 4 | 16.7% | 0.0% | 0.0000 |
Integrin beta;Integrin beta-2 | 11.1% | 50.0% | 0.0000 |
Spectrin alpha chain, non-erythrocytic 1 | 11.1% | 0.0% | 0.0000 |
Pituitary tumor-transforming gene 1 protein-interacting | 11.1% | 0.0% | 0.0000 |
WW domain-binding protein 2 | 16.7% | 50.0% | 0.0000 |
Syntaxin-4 | 5.6% | 0.0% | 0.0006 |
CD9 antigen | 27.8% | 50.0% | 0.0013 |
ATP synthase-coupling factor 6, mitochondrial | 27.8% | 50.0% | 0.0013 |
Flotillin-1 | 77.8% | 100.0% | 0.0037 |
Aconitate hydratase, mitochondrial | 38.9% | 50.0% | 0.1142 |
Fructose-bisphosphate aldolase C | 94.4% | 100.0% | 0.4402 |
Alpha-adducin | 50.0% | 50.0% | 1.0000 |
40S ribosomal protein SA | 1.0% | 1.0% | 1.0000 |
Abl interactor 1 | 1.0% | 1.0% | 1.0000 |
Bone marrow proteoglycan;Eosinophil granule major basic | 1.0% | 1.0% | 1.0000 |
Tubulin alpha-4A chain | 100.0% | 100.0% | 1.0000 |
… (750 proteins total) |
750 proteins are compared between two groups – 12 are significant at p < 0.05
Note: 750 x 0.05 = 37.5 tests would be expected to be significant by chance alone if all 750 null hypotheses were true.
Prevagen & multiple testing (Washington Post – 11 Sept 2021)
Does the supplement Prevagen improve memory (stop memory loss)? A court case asked that question.
Quincy Bioscience describes the study as a randomized, double-blinded, placebo-controlled trial.
But, according to the FTC and the New York attorney general, the trial involved 218 subjects taking either 10 milligrams of Prevagen or a placebo and “failed to show a statistically significant improvement in the treatment group over the placebo group on any of the nine computerized cognitive tasks.”
The complaint alleges that after the Madison Memory Study failed to “find a treatment effect for memory loss for the sample as a whole,” Quincy’s researchers broke down the data in more than 30 different ways.
“Given the sheer number of comparisons run and the fact that they were post hoc, the few positive findings on isolated tasks for small subgroups of the study population do not provide reliable evidence of a treatment effect,” the lawsuit said. Post hoc studies are not uncommon but are generally not regarded as proof until confirmed, scientific experts say. According to the Center for Science in the Public Interest, which filed an amicus brief in support of the agencies’ charges, the subsequent analyses produced “three results that were statistically significant (and more than 27 results that weren’t).”
Exploratory vs confirmatory: Who killed Tweety Bird?
Did Sylvester do it?
Motivation (class discussion)
Tweety Bird is murdered by a cat who left a DNA sample. The particular DNA profile found in the sample is known to occur in one of every one million cats. There is also about a 0.01% false positive rate for this test.
Is the level of evidence (guilt) equal in these two scenarios?
2. A DNA database on 100,000 cats (but not all cats), including Sylvester, is searched and Sylvester is a match, although not necessarily the only match. No prior belief that Sylvester is guilty.
Motivation (class discussion)
The “disease score” ranges from 2 (good) to 12 (worst).
Scenario A: Due to prior suspicion (prior information), only patients 19 and 47 are measured and both have scores of 12. We report that they are “significantly” ill.
Scenario B: The score is measured on 72 patients. Only patients 19 and 47 have scores of 12. We report that they are “significantly” ill.
Is the amount of “evidence” or “belief” that patients 19 and 47 “really” are very ill (have “true” score of 12) the same in both scenarios? The data for patients 19 and 47 are the same in both scenarios.
Most would agree that, if both patients were retested (confirmation step) and came out with lower scores, this would decrease the belief that their “true” score is 12. If they came out with 12 again, this would increase the belief that the true score is 12.
Multiple testing
“If you torture the data long enough, it will eventually confess”
Two different situations for new arthritis treatment compared to aspirin.
A. Only pain (0-10) and swelling (0-10) are measured. Both are significantly better at p < 0.05 on the new treatment compared to aspirin.
Confirmatory studies specify outcomes in advance. Misleading to report only statistically significant results.
How to really lie with stats for fun and profit
2. Send financial advice after you know how the market did (example in class)
Multiple Testing
If m independent tests are carried out and each is declared “significant” when p < 0.05, the table below gives the probability of at least one significant result by chance alone (the family wise error rate, FWER) when all null hypotheses are true.
# tests=m | Probability reject at least one=FWER |
1 | 0.0500 |
2 | 0.0975 |
3 | 0.1426 |
4 | 0.1855 |
5 | 0.2262 |
10 | 0.4013 |
20 | 0.6415 |
25 | 0.7226 |
50 | 0.9231 |
m | 1-(0.95)^m |
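The FWER column follows directly from 1-(0.95)^m. A short sketch, assuming independent tests as the table does (the function name `fwer` is illustrative):

```python
def fwer(m, alpha=0.05):
    """Probability of at least one false positive among m independent tests."""
    return 1 - (1 - alpha) ** m

print("# tests=m | FWER |")
for m in [1, 2, 3, 4, 5, 10, 20, 25, 50]:
    print(f"{m} | {fwer(m):.4f} |")
```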
Multiple testing-What to do?
Option 1: Use nominal alpha level for significance. Creates too many false positives-bad.
Option 2: Use Bonferroni criterion –Declare significance if p < α/m if “m” tests are made. Keeps overall false positive (type I) error ≤ α but has too many false negatives-bad.
Option 3: Use Holm/Hochberg criterion (or other adjustment criteria) – a compromise
Holm/Hochberg/Benjamini criterion
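Rule for m (not necessarily independent) significance tests. Keeps the overall false positive rate at ≤ α for all “m” tests.
1) Sort the “m” p values from lowest to highest.
2) Starting with the smallest, declare the ith ordered p value significant if it is less than α/(m+1-i). At the first p value with p ≥ α/(m+1-i), stop: this & all larger p values are declared non significant.
This makes the overall type I error rate (FWER) ≤ α.
(FWER = family wise error rate)
(The rule as stated is Holm's step-down procedure. Hochberg's procedure uses the same thresholds but works step-up from the largest p value.)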
Holm/Hochberg Example for m=5, α=0.05
i | p value | α/(6-i) | 0.05/(6-i) |
1 | p1-smallest | α/5 | 0.0100 |
2 | p2 | α/4 | 0.0125 |
3 | p3 | α/3 | 0.0167 |
4 | p4 | α/2 | 0.0250 |
5 | p5-largest | α | 0.0500 |
(Bonferroni is p < 0.05/5 = 0.01)
No adjustment vs Hochberg vs Bonferroni
m=5, alpha=0.05
i | no adjustment criterion | Bonferroni criterion | Hochberg criterion | actual p value |
1 | 0.05 | 0.01 | 0.0100 | 0.007 |
2 | 0.05 | 0.01 | 0.0125 | 0.011 |
3 | 0.05 | 0.01 | 0.0167 | 0.014 |
4 | 0.05 | 0.01 | 0.0250 | 0.044 |
5 | 0.05 | 0.01 | 0.0500 | 0.049 |
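The three criteria in the table can be compared on the five example p values. The sketch below implements the step-down rule as worded in the text (sort, compare the ith p value with α/(m+1-i), stop at the first failure), which is usually attributed to Holm. The function name `holm_significant` is illustrative.

```python
def holm_significant(p_values, alpha=0.05):
    """Step-down rule: returns True/False per test, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices, smallest p first
    significant = [False] * m
    for rank, k in enumerate(order, start=1):
        if p_values[k] < alpha / (m + 1 - rank):
            significant[k] = True
        else:
            break  # this and all larger p values are non significant
    return significant

p = [0.007, 0.011, 0.014, 0.044, 0.049]  # actual p values from the table

unadjusted = [x < 0.05 for x in p]
bonferroni = [x < 0.05 / len(p) for x in p]
holm = holm_significant(p)

print(sum(unadjusted), sum(bonferroni), sum(holm))  # 5 1 3
```

On these data the unadjusted criterion declares all five tests significant, Bonferroni only the smallest, and the step-down rule the three smallest.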
FWER vs FDR
If a “family” of “m” hypothesis tests is carried out, the family wise error rate (FWER) is the chance of at least one “false positive” (type I error), assuming that the null is true for all m tests (not looking at test results).
Rather than control the FWER, it may be preferable to control the proportion of “positive” tests (not all tests) that are false positives. This is called controlling the false discovery rate (FDR), a less stringent criterion.
For FDR, the ith ordered p value must be less than (i/m)α which is larger than α/(m+1-i) for FWER.
FDR vs FWER: errors committed when testing “m” null hypotheses
| Declare non sig | Declare sig | Total |
Truth-Null true | U | V | m0 |
Truth-Null false | T | S | m-m0 |
total | m-R | R | m |
FWER = Prob(V ≥ 1) = 1 - Prob(V=0)
FDR = average (expected) value of V/R
FDR is more liberal
Example-FDR vs FWER
| Declare non sig | Declare sig | Total |
Truth-Null true | 855 | 45 | 900 |
Truth-Null false | 20 | 80 | 100 |
total | 875 | 125 | 1000 |
alpha=45/900=0.05
power=80/100=0.80
FWER* = 1-(0.95)^900 > 0.999
FDR = 45/125=0.360
FDR is more liberal
*assuming independence
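The quantities in this example come straight from the 2x2 counts. A minimal sketch using the table's own letters (U, V, T, S); the other variable names are illustrative, and the FWER line assumes independent tests as the footnote states.

```python
# Counts from the example table
U, V = 855, 45   # null true: declared non sig, declared sig
T, S = 20, 80    # null false: declared non sig, declared sig

m0 = U + V       # number of true nulls (900)
m1 = T + S       # number of false nulls (100)
R = V + S        # total declared significant (125)

alpha_observed = V / m0                  # type I error rate
power_observed = S / m1                  # power
fdr_observed = V / R                     # false discoveries among discoveries
fwer_independent = 1 - (1 - 0.05) ** m0  # chance of at least one false positive

print(alpha_observed, power_observed, fdr_observed, fwer_independent > 0.999)
```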
FWER vs FDR significance criteria: m=5 hypotheses, 5 p values, α=0.05
p value | FDR criteria | FWER criteria | actual p value |
p1-smallest | (1/5) α=0.010 | α/5=0.010 | 0.007 |
p2 | (2/5) α=0.020 | α/4=0.0125 | 0.011 |
p3 | (3/5) α=0.030 | α/3=0.0167 | 0.014 |
p4 | (4/5) α=0.040 | α/2=0.025 | 0.044 |
p5-largest | α=0.050 | α=0.050 | 0.049 |
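The row-by-row comparison in the table and the step-up form of the Benjamini-Hochberg FDR procedure can give different answers on the same data. The sketch below assumes the step-up form is intended: find the largest i whose ordered p value is at or below (i/m)α and declare that test and all tests with smaller p values significant. The function name `bh_significant` is illustrative.

```python
def bh_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule; True/False per test, original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices, smallest p first
    last = 0  # largest rank i with p_(i) <= (i/m) * alpha
    for rank, k in enumerate(order, start=1):
        if p_values[k] <= rank / m * alpha:
            last = rank
    significant = [False] * m
    for k in order[:last]:
        significant[k] = True
    return significant

p = [0.007, 0.011, 0.014, 0.044, 0.049]

# Row-by-row reading of the table: is p_(i) below its own FDR criterion?
pointwise = [x < (i / 5) * 0.05 for i, x in enumerate(sorted(p), start=1)]
step_up = bh_significant(p)

print(pointwise)
print(step_up)
```

Row by row, p4 = 0.044 misses its criterion of 0.040. Under the step-up rule, p5 = 0.049 falls below 0.050, so all five tests (including the fourth) are declared significant.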
FDR – q values
When controlling the FDR at rate α, the ith ordered p value (of “m”) must be less than (i/m)α in order to be significant (i=1,2,3,…m).
Therefore, some report “q values” (adjusted p values) defined as:
q value (adjusted p value) = (p value) (m/i)
when i=1, q value = m p value
when i=m, q value = p value
FDR adjusted p values = q values: m=5 hypotheses, α=0.05
p value rank | original p value | FDR (m/i) | q value |
p1-smallest | 0.007 | 5/1=5.00 | 0.0350 |
p2 | 0.011 | 5/2=2.50 | 0.0275 |
p3 | 0.014 | 5/3=1.67 | 0.0233 |
p4 | 0.044 | 5/4=1.25 | 0.0550 |
p5-largest | 0.049 | 5/5=1.00 | 0.0490 |
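The q values in the table are the raw products p × (m/i), which need not increase with rank (here 0.0350, 0.0275, 0.0233 decrease). A commonly used refinement, shown here as an assumption about what an analyst would report, makes the adjusted values monotone by taking a running minimum from the largest p value downward. Function names are illustrative.

```python
def q_values_raw(p_sorted):
    """Raw q values p * (m/i) for p values sorted from smallest to largest."""
    m = len(p_sorted)
    return [p * m / i for i, p in enumerate(p_sorted, start=1)]

def q_values_monotone(p_sorted):
    """Monotone version: running minimum of the raw q values, from the largest p down."""
    adjusted = []
    running_min = 1.0
    for q in reversed(q_values_raw(p_sorted)):
        running_min = min(running_min, q)
        adjusted.append(running_min)
    return adjusted[::-1]

p_sorted = [0.007, 0.011, 0.014, 0.044, 0.049]
raw = q_values_raw(p_sorted)
monotone = q_values_monotone(p_sorted)

print([round(q, 4) for q in raw])
print([round(q, 4) for q in monotone])
```

With the monotone version, comparing each q value with α = 0.05 gives the same decisions as the step-up FDR rule, so a test with a smaller p value never has a larger q value than one with a larger p value.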
Courtesy of Graph Pad – don’t do post hoc adjustment
Specify and carry out the steps in the blue squares before, not after, computing p values.
Make a stat plan.
Multiple testing & primary outcomes
As “m”, the number of outcomes, increases, the individual αi for each outcome must be smaller, so n must be larger if the overall α is to stay constant (i.e., at α=0.05).
But not all outcomes are equally important. Designate important outcomes “primary” & the rest secondary so “m” is only the number of primary outcomes. Assumes less concern if there is a false positive finding among secondary outcomes.
Must designate primary vs secondary outcomes in advance, before study results are known. It is not fair to declare which outcomes are primary and which are secondary based on their p values.
Statistical Analysis Plan
Statistical models and methods to answer study questions
Conclusions = data + models (assumptions)
Each specific aim needs a stat analysis section.
Sample size and power follow the analysis plan.
Outline:
•Outcomes: denote primary & secondary
•Primary predictors or comparison groups
•Covariates/confounders/effect modifiers
•Methods for missing data, dropouts
•Interim analyses (for efficacy, for safety)
Common Methods
Univariate analysis
Continuous outcome: Means, SDs, medians
Time to event: Survival curves
Discrete: Proportions
Multivariate analysis
Continuous outcome: Linear regression, correlation
Positive integers: Poisson regression
Binary (yes/no): Logistic regression
Time to event: Proportional hazard regression
ANOVA and t-test are special cases of linear regression