Section 4
Sampling distributions
Confidence intervals
Hypothesis testing and p values
Population and sample
Samples drawn from a population
Population
sample
Sample is drawn “at random”. Everyone in the target population is equally eligible for sampling.
Similar to a blood sample: if we measure the glucose level in a blood sample, we assume it reflects the glucose level in all the blood in the body.
True population distribution of Y (individuals) - not Gaussian
Mean Y=μ= 1.5, SD=σ=1.12
Possible samples & statistics from the population (true mean=1.5)
sample (n=4) | mean (statistic) |
0,0,0,0 | 0.00 |
… | … |
1,1,3,2 | 1.75 |
… | … |
3,3,3,3 | 3.00 |
Central Limit Theorem
For a large enough n, the distribution of any sample statistic (mean, mean difference, OR, RR, hazard, correlation coeff, regression coeff, proportion, …) from sample to sample has a Gaussian (“Normal”) distribution centered at the true population value. The SD of this “sampling distribution” is called the standard error (SE). The SE is proportional to 1/√n.
But when n is small, the Central Limit Theorem may not hold, particularly if Y does not have a normal distribution.
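The CLT is easy to check by simulation. A minimal Python sketch (the exponential population, the seed, and the sizes are my choices for illustration, not from the slides):

```python
import random
import statistics

random.seed(0)

n = 50          # sample size per study
reps = 10_000   # number of repeated samples

# Skewed, clearly non-Gaussian population: exponential with mean 1 (sigma = 1)
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# CLT: the sample means cluster around the true mean (1.0), and their SD,
# the standard error, is close to sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141
grand_mean = statistics.fmean(sample_means)
se_observed = statistics.stdev(sample_means)
```

Even though each individual observation is far from Gaussian, a histogram of `sample_means` is close to a normal curve centered at the true value.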
Funnel plot - true difference is δ = 5
Each point is one study (meta-analysis)
Publication bias - non reproducibility
Studies with larger observed effects are more likely to be published, so published effects may be larger than the true average effect.
Science – 28 Aug 2015 (Nosek)
The mean effect size of the replication effects (M = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (M = 0.403, SD = 0.188). … Ninety-seven percent of original studies had significant results (p < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; …
Resampling estimation (“bootstrap”)
One does not repeatedly sample from the same population (one only carries out the study once). But a “simulation” of repeated sampling from the population can be obtained by repeatedly sampling from the sample, with replacement, and computing the statistic from each resample, creating an “estimated” sampling distribution. The SD of the statistics across all “resamples” is an estimate of the standard error (SE) for the statistic.
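The resampling recipe above can be sketched in a few lines of Python (the tiny example sample, the seed, and the number of resamples are my choices):

```python
import random
import statistics

random.seed(1)

# Original sample: the n=4 counts 1,1,3,2 from the earlier slide
sample = [1, 1, 3, 2]
n = len(sample)

# Resample WITH replacement, recomputing the statistic (here the mean)
boot_means = [
    statistics.fmean(random.choices(sample, k=n)) for _ in range(10_000)
]

# SD of the statistic across resamples estimates its standard error
boot_se = statistics.stdev(boot_means)

# Classical formula SEM = S / sqrt(n), for comparison
formula_se = statistics.stdev(sample) / n**0.5
```

For such a small n the bootstrap SE and the formula SEM differ noticeably; for larger samples they agree closely.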
Samples drawn from a sample
(figure: original sample drawn from the population; multiple resamples drawn from the original sample)
Sample is drawn “at random” with replacement. Everyone in the original sample is eligible for sampling.
Confidence interval (for μ)
(figure: lower and upper confidence bounds)
Confidence interval (for θ)
(figure: lower and upper confidence bounds)
95% Confidence intervals
95% of the intervals will contain the true population value
But which ones?
Z vs t (technical note)
Confidence intervals computed with normal (Z) percentiles assume that the population σ is known. Since σ is usually not known and is estimated with the sample SD, the Z percentiles need to be adjusted. The adjusted tables are called “t” tables instead of Gaussian (Z) tables (the t distribution). For n > 30, they are about the same (i.e., Z ≈ t).
Z vs t distribution (cont.)
The adjustment to the normal (Z) distribution that uses the sample SD instead of the unknown σ was first worked out by a statistician named Gosset, who published it under the pseudonym “Student”, so this is often called “Student's” t distribution.
Z distribution vs t distribution, n=5, n=30
Said t to Z: Does this distribution make my data look fat?
(t distribution is wider)
t vs Gaussian Z percentiles
%ile | 85th | 90th | 95th | 97.5th | 99.5th |
Confidence | 70% | 80% | 90% | 95% | 99% |
t, df=5 | 1.156 | 1.476 | 2.015 | 2.571 | 4.032 |
t, df=10 | 1.093 | 1.372 | 1.812 | 2.228 | 3.169 |
t, df=20 | 1.064 | 1.325 | 1.725 | 2.086 | 2.845 |
t, df=30 | 1.055 | 1.310 | 1.697 | 2.042 | 2.750 |
Gaussian Z | 1.036 | 1.282 | 1.645 | 1.960 | 2.576 |
What did the z distribution say to the t distribution?
You may look like me but you're not normal.
t degrees of freedom = df = n - number of groups = n - number of estimated parameters
Confidence Intervals (CI) *
CI for a proportion - “law” of small numbers
n=10, Proportion = 3/10 = 30%
What do you think are the 95% confidence bounds?
Is it ok to conclude that the “real” (true) proportion is less than 50%?
Answer: 95% CI: 6.7% to 65.3%
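For the curious, exact (Clopper-Pearson) bounds like these can be computed from the binomial distribution by a numeric search. A Python sketch (bisection is just one of several ways to invert the binomial tail; the helper names are mine):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a binomial proportion, by bisection."""
    def solve(f):
        lo, hi = 0.0, 1.0
        for _ in range(60):           # bisection: f is True below the root
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) else (lo, mid)
        return (lo + hi) / 2
    # lower bound: p where the upper tail P(X >= x) equals alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: 1 - binom_cdf(x - 1, n, p) < alpha / 2)
    # upper bound: p where the lower tail P(X <= x) equals alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(3, 10)
```

For 3 successes out of 10 this gives roughly (0.067, 0.652), matching the slide's 6.7% to 65.3% up to rounding.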
Standard error (SE) for the difference between two statistics from two different groups
Lancet 2010; 375: 1447–56
Statistics for HBA1c change from base to 26 weeks (Pratley et al, Lancet 2010)
Tx | n | Mean | SD | SE |
Liraglutide | 225 | -1.24% | 0.99% | 0.066% |
Sitagliptin | 219 | -0.90% | 0.98% | 0.066% |
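For two independent groups the SE of the difference combines the group SEs as √(SE1² + SE2²). A quick check in Python using the table's numbers:

```python
from math import sqrt

# Values from the HbA1c table above (Pratley et al, Lancet 2010)
n1, sd1 = 225, 0.99   # liraglutide
n2, sd2 = 219, 0.98   # sitagliptin

se1 = sd1 / sqrt(n1)              # ~0.066, as in the table
se2 = sd2 / sqrt(n2)              # ~0.066, as in the table
se_diff = sqrt(se1**2 + se2**2)   # SE of the mean difference, ~0.093
```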
Null hypothesis & p values
Hypothesis test statistics
Zobs = (Sample Statistic – null value) / Standard error
(figure: observed test statistic Z (or t) = 3.82 on the sampling distribution, whose mean is zero, the null hypothesis value)
Difference & Non inferiority (equivalence) hypothesis testing
Difference Testing (usual hypothesis testing):
Null Hyp: A=B (or A-B=0), Alternative: A≠B
Zobs = (observed stat – 0) / SE
Non inferiority (within δ) Testing:
Null Hyp: A > B + δ, Alternative: A <= B + δ
Zeq = (observed stat – δ )/ SE
Must specify δ for non inferiority testing
use t instead of Z if variance is not known
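As a sketch, the two test statistics above differ only in the null value subtracted (the function names and the example numbers are mine, not from the slides):

```python
def z_difference(stat, se):
    # Usual difference test: the null value is 0
    return (stat - 0) / se

def z_noninferiority(stat, delta, se):
    # Non-inferiority test: the null value is the margin delta
    return (stat - delta) / se

# Hypothetical numbers: observed difference 1.0, margin delta 0.4, SE 0.5
z_diff = z_difference(1.0, 0.5)
z_ni = z_noninferiority(1.0, 0.4, 0.5)
```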
Non inferiority testing-HBA1c data
Left ventricular assist device (LVAD)
Non inferiority testing-LVAD example
An LVAD (left ventricular assist device) keeps a weak heart pumping while the patient waits for a heart transplant. The primary outcome is non-failure at 90 days after LVAD placement. The current LVAD has a 90% non-failure rate. A new LVAD is under FDA consideration. The FDA will consider the new LVAD “non-inferior” to the current one if it is no more than δ = 4% worse than the current LVAD (i.e., the true value is 86% or better at 90 days).
What should the confidence interval look like?
What should the null hypothesis be?
LVAD example (continued)
Confidence interval:
Lower bound should be 86% or higher. Also preferred if 90% is within the confidence bound.
Hypothesis test p value:
Let δ be the true difference in failure percent between the current and the new LVAD. When computing the p value:
Should null δ = 0 ?
Should null δ = 4% ?
Confidence intervals versus hypothesis testing
Study equivalence demonstrated only from –δ to +δ
(1‑8) (brackets show 95% confidence intervals)
Stat Sig (statistically significant?)
1. Yes ----------------------------------------------------------------------------------------------- < not equivalent >
2. Yes -----------------------------------------------------------------------------< uncertain >--------------------
3. Yes ------------------------------------------------------------------< equivalent >-----------------------------------
4. No ---------------------------------------------------< equivalent >---------------------------------------------------
5. Yes ----------------------------------< equivalent > ----------------------------------------------------------------
6. Yes ---------------------< uncertain>----------------------------------------------------------------------------------
7. Yes -< not equivalent >-----------------------------------------------------------------------------------------------
8. No ---------<___________________________uncertain________________________________>------
| |
-δ 0 +δ
true difference
Ref: Statistics Applied to Clinical Trials - Cleophas, Zwinderman, Cleophas 2000, Kluwer Academic Pub, page 35
Non inferiority - JAMA 2006, Piaggio et al, p 1152-1160
Confidence intervals & p values
When the 95% confidence interval contains the null value, the two sided p value is > 0.05. When the 95% confidence interval excludes the null value, the two sided p value is < 0.05.
Above, α = 0.05 = 5% and 1 – α = 0.95 = 95%.
In general, the same applies for the 1-α level confidence interval and α.
When the (1-α) level confidence interval contains the null value, the corresponding two sided p value is > α.
When the (1-α) level confidence interval excludes the null value, the corresponding two sided p value is < α.
Many prefer confidence intervals to p values: confidence intervals are less prone to misinterpretation and do not depend on a null hypothesis. It is good practice to report both the confidence interval and the p value.
It is very bad practice to report only a p value without the statistic that goes with it (a “disembodied” p value).
Forest plot
Odds of different outcomes when exposed to a toxin
Paired Mean Comparisons
Example: Serum cholesterol in mmol/L
Difference (d) between baseline and end of 4 weeks
Subject | baseline | 4 wks | difference (di) |
1 | 9.0 | 6.5 | 2.5 |
2 | 7.1 | 6.3 | 0.8 |
3 | 6.9 | 5.9 | 1.0 |
4 | 6.9 | 4.9 | 2.0 |
5 | 5.9 | 4.0 | 1.9 |
6 | 5.4 | 4.9 | 0.5 |
mean | 6.87 | 5.42 | 1.45 |
SD | 1.24 | 0.97 | 0.79 |
SE | 0.51 | 0.40 | 0.32 |
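The paired analysis treats the differences as a single sample. A Python sketch reproducing the last column of the table above:

```python
import statistics
from math import sqrt

# Differences (baseline minus 4 weeks) from the table above, mmol/L
d = [2.5, 0.8, 1.0, 2.0, 1.9, 0.5]
n = len(d)

d_mean = statistics.fmean(d)   # 1.45, matching the table
d_sd = statistics.stdev(d)     # ~0.79
d_se = d_sd / sqrt(n)          # ~0.32

# Paired t statistic: test the differences against a null mean of 0
t_obs = d_mean / d_se          # df = n - 1 = 5
```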
Confidence Intervals, Hypothesis Tests
Confidence intervals are of the form
Sample Statistic +/- (Zpercentile*) (Standard error)
Lower bound = Sample Statistic- (Zpercentile)(Standard error)
Upper bound = Sample Statistic + (Zpercentile)(Standard error)
Hypothesis test statistics (Zobs*) are of the form
Zobs=(Sample Statistic – null value) / Standard error
* t percentile or tobs for continuous data when n is small. When the data do not follow the normal distribution and n is “small”, the above are only approximate.
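These two formulas can be sketched as small Python helpers (the names are mine; this is the large-sample Z case, so only approximate for small n):

```python
from statistics import NormalDist

def z_ci(stat, se, conf=0.95):
    # stat +/- (Z percentile) * SE  (large-sample / known-sigma case)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # 1.96 for 95%
    return stat - z * se, stat + z * se

def z_test(stat, null_value, se):
    # Zobs = (statistic - null value) / SE, with its two-sided p value
    z = (stat - null_value) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

lo, hi = z_ci(0.0, 1.0)        # (-1.96, +1.96)
z, p = z_test(1.0, 0.0, 0.5)   # z = 2.0
```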
Common sample statistics & their SEs
Sample Statistic | Symbol | Standard error (SE) |
Mean | y̅ | S/√n = √[S²/n] = SEM |
Mean difference | y̅1 – y̅2 = d̅ | √[S1²/n1 + S2²/n2] = SEd |
Proportion | P | √[P(1-P)/n] |
Proportion difference | P1 – P2 | √[P1(1-P1)/n1 + P2(1-P2)/n2] |
Log Odds ratio* | loge OR | √[1/a + 1/b + 1/c + 1/d] |
Log Risk ratio* | loge RR | √[1/a – 1/(a+c) + 1/b – 1/(b+d)] |
Slope (rate) | b | Serror / (Sx √(n-1)) |
Hazard rate (survival) | h | h/√[number of events] = h/√e |
Transform (z) of the correlation coefficient r* | z = ½ loge[(1+r)/(1-r)], back-transform r = (e^(2z) – 1)/(e^(2z) + 1) | SE(z) = 1/√(n-3) |
Log Hazard Ratio* (assumes constant h; e1 and e2 are numbers of events) | loge HR | √[(1/e1) + (1/e2)] |
*form CI bounds on log scale, then take anti-log
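As one worked example from the table, the correlation-coefficient row (transform r to z, form the CI on the z scale, back-transform) in Python (the function name and example values are mine):

```python
from math import log, exp, sqrt
from statistics import NormalDist

def correlation_ci(r, n, conf=0.95):
    """CI for a correlation via the Fisher z transform (table row above)."""
    z = 0.5 * log((1 + r) / (1 - r))             # transform r -> z
    se = 1 / sqrt(n - 3)                         # SE(z)
    zc = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    lo_z, hi_z = z - zc * se, z + zc * se
    back = lambda t: (exp(2 * t) - 1) / (exp(2 * t) + 1)  # z -> r
    return back(lo_z), back(hi_z)

# Hypothetical example: r = 0.5 with n = 103 observations
lo, hi = correlation_ci(0.5, 103)
```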
2 x 2 table – for OR, RR
| a | b |
| c | d |
Sample statistics & SEs (continued)
P1 = a/n1, P2=b/n2 , RR=P1/P2
SE(loge RR) =
√ [ 1/(P1n1) – 1/n1 + 1/(P2n2) – 1/n2 ]
confidence interval for true RR
lower=exp(loge(RR) – Z SE)
upper=exp(loge(RR) + Z SE)
| Y positive | Y negative |
X positive | a | b |
X negative | c | d |
total | a+c = n1 | b+d=n2 |
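Putting the RR formulas above together (a Python sketch; the function name and the example counts are mine):

```python
from math import log, exp, sqrt
from statistics import NormalDist

def risk_ratio_ci(a, b, c, d, conf=0.95):
    """RR and CI from the 2x2 table above (a, b are the Y-positive counts)."""
    n1, n2 = a + c, b + d
    p1, p2 = a / n1, b / n2
    rr = p1 / p2
    se = sqrt(1/a - 1/n1 + 1/b - 1/n2)           # SE of loge RR
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # form the bounds on the log scale, then take the anti-log
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Hypothetical counts: 30/100 exposed vs 15/100 unexposed
rr, lo, hi = risk_ratio_ci(30, 15, 70, 85)
```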
Confidence intervals for transformations
If “Q” is a statistic with confidence bounds (L, U) and f is a monotone (order-preserving) transformation, the confidence bounds for the new statistic f(Q) are ( f(L), f(U) ).
Example: log10 Estradiol (log E2) in 14 yr old females has a normal distribution with mean = 1.86 log pg/ml. The 95% confidence bounds for this mean log E2 are (1.76, 1.96). Median E2 is 10^1.86 = 72.4 pg/ml with 95% confidence bounds (10^1.76, 10^1.96) = (57.5 pg/ml, 91.2 pg/ml).
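The estradiol arithmetic, checked in Python:

```python
# Log-scale mean and 95% bounds from the estradiol example (log10 pg/ml)
mean_log, lo_log, hi_log = 1.86, 1.76, 1.96

# Back-transform: apply f(x) = 10**x to the estimate and to both bounds
median_e2 = 10 ** mean_log            # ~72.4 pg/ml
ci = (10 ** lo_log, 10 ** hi_log)     # ~(57.5, 91.2) pg/ml
```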
Guide to Testing- null values
Sample Statistic & Comparison | Population null hypothesis |
Comparing two means | True population mean difference is zero |
Comparing two proportions | True population difference is zero |
Comparing two medians | True population median difference is zero |
Odds ratio (comparing odds) | True population odds ratio is one |
Risk ratio=relative risk (comparing risks) | True population risk ratio is one |
Correlation coefficient (compare to zero) | True population correlation coefficient is zero |
Slope=rate of change=regression coefficient | True population slope is zero |
Comparing two survival curves | True difference in survival is zero at all times |
Hypothesis Test Guide
Statistic/type of comparison | Test/analysis procedure – gives p value |
Mean comparison-unpaired | t test (2 groups), ANOVA (3+ groups) |
Mean comparison-paired | paired t test, repeated measures ANOVA |
Median comparison-unpaired | Wilcoxon rank sum test, Kruskal-Wallis test* |
Median comparison-paired | Wilcoxon signed rank test on differences* |
Proportion comparison-unpaired | chi-square test (or Fisher's test) |
Proportion comparison-paired | McNemar's chi-square test |
Odds ratio | chi-square test, Fisher test |
Risk ratio | chi-square test, Fisher test |
Correlation, slope | regression, t statistic |
Survival curves, hazard rates | log rank test |
ANOVA = analysis of variance
* non parametric – Gaussian distribution theory is not used to get the p value
Hypothesis test guide- p values
Non parametric CIs & tests
When the sample size is small and the original distribution is not close to normal, the sampling distribution for the statistic may not follow the Gaussian, particularly in the distribution “tails”. Another method is then needed to generate the sampling distribution, the confidence bounds and the p value. The bootstrap (resampling) method can always be used. Other methods based on the ranks of the data have also been devised; these are called “non parametric” since they do not use the (parametric) Gaussian.
Parametric vs non parametric
Continuous data - non parametric methods compute p values using the ranks of the data.
They do not assume statistics follow a Gaussian distribution, particularly in the distribution “tails”.
Parametric | Nonparametric |
2 indep means - t test | 2 indep medians - Wilcoxon rank sum test (= Mann-Whitney) |
3+ indep means - ANOVA F test | 3+ indep medians - Kruskal-Wallis test |
Paired means - paired t test | Paired medians - Wilcoxon signed rank test |
Pearson correlation | Spearman correlation |
Any parametric test has a corresponding non parametric version.
Nomenclature for Testing
Delta (δ) = True difference or size of effect
Alpha (α) = Type I error = false positive
= Probability of rejecting the null hypothesis when it is true.
(Usually α is set to 0.05)
Beta (β) = Type II error = false negative
=Probability of not rejecting the null hypothesis when delta is not zero
( there is a real difference in the population)
Power = 1 – β
= Probability of getting a p value less than α
(ie declaring statistical significance)
when, in fact, there really is a non-zero delta.
We want small alpha levels and high power.
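For a large-sample two-sided z test, power has a simple closed form. A Python sketch (an approximation that ignores small-sample t corrections; the function name is mine):

```python
from statistics import NormalDist

def power_two_sided_z(delta, se, alpha=0.05):
    # Approximate power of a two-sided z test when the true effect is delta
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    shift = abs(delta) / se              # where Zobs is centered under H1
    # P(|Zobs| > z_crit) under the alternative = power = 1 - beta
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

When delta = 0 this returns alpha (the false-positive rate); a delta of 2.8 SEs gives roughly 80% power, a common design target.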