2 of 26

The Wolf River in Tennessee flows past an abandoned site once used by the pesticide industry for dumping wastes, including chlordane (pesticide), aldrin, and dieldrin (both insecticides)
These highly toxic organic compounds can cause various cancers and birth defects
The standard method to test whether these substances are present in a river is to take samples at six-tenths depth
But since these compounds are denser than water and their molecules tend to stick to particles of sediment, they are more likely to be found in higher concentrations near the bottom

3 of 26

Exploratory analysis

Aldrin concentration (nanograms per liter) at three levels of depth

aldrin  depth
3.8     bottom
4.8     bottom
4.9     bottom
5.3     bottom
5.4     bottom
5.7     bottom
5.1     surface
5.2     surface

4 of 26

Research question

Is there a difference between the mean aldrin concentrations among the three levels?

To compare means of 2 groups we used a t statistic
To compare means of 3+ groups we use a new test called ANOVA and a new statistic called F

5 of 26

ANOVA

ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable

H₀: The mean outcome is the same across all categories,

𝜇₁ = 𝜇₂ = … = 𝜇_k

where 𝜇_i represents the mean of the outcome for observations in category i

H_A: At least one mean is different from the others

6 of 26

Hypotheses

(a) H₀ : 𝜇_B = 𝜇_M = 𝜇_S
    H_A : 𝜇_B ≠ 𝜇_M ≠ 𝜇_S

(b) H₀ : 𝜇_B ≠ 𝜇_M ≠ 𝜇_S
    H_A : 𝜇_B = 𝜇_M = 𝜇_S

(c) H₀ : 𝜇_B = 𝜇_M = 𝜇_S
    H_A : At least one mean is different

(d) H₀ : 𝜇_B = 𝜇_M = 𝜇_S = 0
    H_A : At least one mean is different

(e) H₀ : 𝜇_B = 𝜇_M = 𝜇_S
    H_A : 𝜇_B > 𝜇_M > 𝜇_S

7 of 26

Hypotheses

The correct set of hypotheses is

H₀ : 𝜇_B = 𝜇_M = 𝜇_S

H_A : At least one mean is different

The null hypothesis states that all means are equal, and the alternative only claims that they are not all equal; it does not say which means differ or in what order

8 of 26

𝘵 test vs. ANOVA - Purpose

𝘵 test

Compare means from two groups to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability

H₀ : 𝜇₁ = 𝜇₂

ANOVA

Compare the means from two or more groups to see whether they are so far apart that the observed differences cannot all reasonably be attributed to sampling variability

H₀ : 𝜇₁ = 𝜇₂ = … = 𝜇_k

9 of 26

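z/𝘵 test vs. ANOVA - Method

z/𝘵 test

Compute a test statistic (a ratio): (difference between the sample means) / (standard error of that difference)

ANOVA

Compute a test statistic (a ratio): F = (variability between groups) / (variability within groups)

Large test statistics lead to small p-values
If the p-value is small enough, H₀ is rejected and we conclude that the population means are not all equal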

10 of 26

𝘵 test vs. ANOVA

With only two groups the t-test and ANOVA are equivalent, but only if we use a pooled variance in the denominator of the test statistic
With more than two groups, ANOVA compares the sample means to an overall grand mean
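The two-group equivalence can be checked numerically. The sketch below uses two small made-up samples (hypothetical values, not the aldrin data) and shows that the squared pooled t statistic matches the ANOVA F statistic.

```python
from statistics import mean, variance

# Hypothetical toy samples, made up for illustration only
group_a = [4.1, 5.0, 5.6, 6.2]
group_b = [3.0, 3.9, 4.4, 4.7, 5.1]

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = mean(group_a), mean(group_b)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)

# Two-sample t statistic using the pooled variance
t_stat = (mean_a - mean_b) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5

# One-way ANOVA on the same two groups
grand_mean = mean(group_a + group_b)
ssg = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
sse = sum((x - mean_a) ** 2 for x in group_a) + sum((x - mean_b) ** 2 for x in group_b)
msg = ssg / (2 - 1)              # df_G = k - 1 with k = 2
mse = sse / (n_a + n_b - 2)      # df_E = n - k
f_stat = msg / mse

print(t_stat ** 2, f_stat)       # the two values agree up to rounding error
```

Without pooling (for example with a Welch-style standard error) the two statistics would generally not match.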

11 of 26

Test statistic

Does there appear to be a lot of variability within groups? How about between groups?

12 of 26

𝑭 distribution and p-value

In order to be able to reject H₀, we need a small p-value, which requires a large F statistic
In order to obtain a large F statistic, variability between sample means needs to be greater than variability within groups

13 of 26

Conditions/Assumptions

The observations should be independent within and between groups
Always important, but sometimes difficult to check

The observations within each group should be nearly normal
Especially important when the sample sizes are small

How do we check for normality? (Remember previous lectures)

The variability across the groups should be about equal
Especially important when the sample sizes differ between groups

How can we check this condition?

14 of 26

Degrees of freedom associated with ANOVA

groups: df_G = k - 1, where k is the number of groups
total: df_T = n - 1, where n is the total sample size
error: df_E = df_T - df_G

df_G = k - 1 = 3 - 1 = 2
df_T = n - 1 = 30 - 1 = 29
df_E = 29 - 2 = 27
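As a quick check, the three degrees of freedom can be computed from k and n. The function name below is hypothetical, chosen only for this sketch.

```python
def anova_degrees_of_freedom(k, n):
    """Return (df_G, df_T, df_E) for k groups and n total observations."""
    df_g = k - 1          # groups
    df_t = n - 1          # total
    df_e = df_t - df_g    # error
    return df_g, df_t, df_e

# Values from the slide: 3 depth levels, 30 observations
print(anova_degrees_of_freedom(k=3, n=30))  # (2, 29, 27)
```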

15 of 26

Sum of squares between groups, SSG

Measures the variability between groups

SSG = Σ_{i=1}^{k} n_i (x̄_i - x̄)²

where n_i is each group size, x̄_i is the average for each group, and x̄ is the overall (grand) mean
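A minimal sketch of the SSG calculation follows. The full aldrin dataset is not reproduced in these slides, so the groups below are hypothetical toy values, and the function name is an assumption made for illustration.

```python
from statistics import mean

def sum_of_squares_groups(groups):
    """SSG: group size times squared distance of each group mean from the grand mean."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    return sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Hypothetical toy data, NOT the aldrin measurements
toy_groups = {
    "bottom": [5.0, 6.0, 7.0],
    "middepth": [4.0, 5.0, 6.0],
    "surface": [3.0, 4.0, 5.0],
}

# Group means are 6, 5, 4 and the grand mean is 5, so SSG = 3*1 + 3*0 + 3*1
print(sum_of_squares_groups(toy_groups))  # 6.0
```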

16 of 26

Sum of squares total, SST

Measures the total variability in the data

SST = Σ_{i=1}^{n} (x_i - x̄)²

where x_i represents each observation in the dataset and x̄ is the overall (grand) mean

aldrin  depth
3.8     bottom
4.8     bottom
4.9     bottom
5.3     bottom
5.4     bottom
5.7     bottom
5.1     surface
5.2     surface

SST = (3.8 - 5.1)² + (4.8 - 5.1)² + (4.9 - 5.1)² + … + (5.2 - 5.1)²

= (-1.3)² + (-0.3)² + (-0.2)² + … + (0.1)²

= 1.69 + 0.09 + 0.04 + … + 0.01

= 54.29
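The individual terms written out above can be reproduced with a few lines of code. This sketch only uses the four observations whose terms appear explicitly and the grand mean of 5.1 from the slide; the helper name is hypothetical.

```python
def squared_deviations(values, grand_mean):
    """Squared distance of each observation from the grand mean, rounded to 2 decimals."""
    return [round((x - grand_mean) ** 2, 2) for x in values]

grand_mean = 5.1                      # overall mean used on the slide
shown = [3.8, 4.8, 4.9, 5.2]          # observations whose terms are written out
terms = squared_deviations(shown, grand_mean)
print(terms)  # [1.69, 0.09, 0.04, 0.01]

# SST is the sum of such terms over all observations in the dataset
```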

17 of 26

Sum of squares error, SSE

Measures the variability within groups:

SSE = SST - SSG

SSE = 54.29 - 16.96 = 37.33
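The shortcut SSE = SST - SSG works because the total variability splits exactly into a between-group part and a within-group part. The sketch below checks this on hypothetical toy data (not the aldrin measurements) by also computing SSE directly from the within-group deviations.

```python
from statistics import mean

# Hypothetical toy data, NOT the aldrin measurements
toy_groups = {
    "bottom": [5.0, 6.0, 7.0],
    "middepth": [4.0, 5.0, 6.0],
    "surface": [3.0, 4.0, 5.0],
}

all_values = [x for values in toy_groups.values() for x in values]
grand_mean = mean(all_values)

sst = sum((x - grand_mean) ** 2 for x in all_values)
ssg = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in toy_groups.values())

# SSE computed directly: squared deviations from each observation's own group mean
sse_direct = sum((x - mean(v)) ** 2 for v in toy_groups.values() for x in v)

print(sst, ssg, sse_direct)           # 12.0 6.0 6.0
print(round(54.29 - 16.96, 2))        # 37.33, the value on the slide
```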

18 of 26

Mean squares

A mean square is calculated as a sum of squares divided by its degrees of freedom

MSG = SSG / df_G = 16.96/2 = 8.48

MSE = SSE / df_E = 37.33/27 = 1.38
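The arithmetic on this slide can be verified directly. The variable names are chosen for this sketch; the numbers are the ones given above.

```python
ssg, df_g = 16.96, 2     # between-group sum of squares and its degrees of freedom
sse, df_e = 37.33, 27    # within-group (error) sum of squares and its degrees of freedom

msg = ssg / df_g         # mean square between groups
mse = sse / df_e         # mean square error

print(round(msg, 2), round(mse, 2))  # 8.48 1.38
```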

19 of 26

Test statistic, F value

As we discussed before, the F statistic is the ratio of the between group and within group variability

F = MSG / MSE = 8.48/1.38 = 6.14
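Putting the pieces together, a one-way ANOVA F statistic could be computed along the following lines. This is a sketch, not the course's own code: the function name and the toy data are hypothetical, and only the final line uses numbers from the slides.

```python
from statistics import mean

def one_way_anova(groups):
    """Return SSG, SSE, MSG, MSE and F for a dict mapping group name -> list of values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    ssg = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    sse = sst - ssg

    msg = ssg / (k - 1)      # df_G = k - 1
    mse = sse / (n - k)      # df_E = (n - 1) - (k - 1) = n - k
    return {"SSG": ssg, "SSE": sse, "MSG": msg, "MSE": mse, "F": msg / mse}

# Hypothetical toy data, NOT the aldrin measurements
toy_groups = {
    "bottom": [5.0, 6.0, 7.0],
    "middepth": [4.0, 5.0, 6.0],
    "surface": [3.0, 4.0, 5.0],
}
result = one_way_anova(toy_groups)
print(result["F"])                    # 3.0

# F from the rounded mean squares reported in the slides
print(round(8.48 / 1.38, 2))          # 6.14
```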

20 of 26

p-value

The p-value is the probability of observing a ratio of “between group” to “within group” variability at least as large as the one observed, if in fact the means of all groups are equal. It’s calculated as the area under the F curve, with degrees of freedom df_G and df_E, above the observed F statistic.

df_G = 2; df_E = 27
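For the special case df_G = 2, the upper-tail area of the F distribution has a simple closed form, P(F > f) = (1 + 2f/df_E)^(-df_E/2), so the p-value can be sketched without any statistics library. This shortcut does not apply to other values of df_G; in general one would use a library routine such as `scipy.stats.f.sf`. The function name below is hypothetical, and F is taken as 8.48/1.38 from the rounded mean squares.

```python
def f_upper_tail_two_groups_df(f_stat, df_e):
    """P(F > f_stat) for an F distribution with 2 and df_e degrees of freedom.

    Closed form valid ONLY when the numerator degrees of freedom equal 2.
    """
    return (1 + 2 * f_stat / df_e) ** (-df_e / 2)

f_stat = 8.48 / 1.38                               # F from the rounded mean squares
p_value = f_upper_tail_two_groups_df(f_stat, df_e=27)
print(round(p_value, 4))                           # about 0.006
```

A p-value of this size is well below the usual significance levels, so the F statistic here is large enough to reject H₀.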

21 of 26

R, Please Save Me!

22 of 26

Conclusion

If p-value is small (less than α), reject H₀. The data provide convincing evidence that at least one mean is different from the others (but we can’t tell which one)
If p-value is large, fail to reject H₀. The data do not provide convincing evidence that at least one pair of means are different from each other; the observed differences in sample means are attributable to sampling variability (or chance)

23 of 26

Conclusion - in context

What is the conclusion of the hypothesis test?

The data provide convincing evidence that the average aldrin concentration

is different for all groups
on the surface is lower than the other levels
is different for at least one group
is the same for all groups

24 of 26

Conclusion - in context

What is the conclusion of the hypothesis test?

The data provide convincing evidence that the average aldrin concentration is different for at least one group

25 of 26

Which means differ?

We concluded that at least one pair of means differ. The natural question that follows is “which ones?”
We can do two sample 𝘵 tests for differences in each possible pair of groups

Can you see any pitfalls with this approach?

When we run too many tests, the Type 1 Error rate increases
This issue is resolved by using a modified significance level

We will see it later this week.
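To see why running many tests is a problem, the sketch below computes the chance of at least one Type 1 Error across all pairwise comparisons. It rests on assumptions not stated in the slides: α = 0.05, tests treated as independent (pairwise tests on shared groups are not strictly independent, so this is only an approximation), and the Bonferroni correction as one common choice of modified significance level.

```python
from math import comb

def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false rejection, ASSUMING independent tests."""
    return 1 - (1 - alpha) ** num_tests

k = 3                          # number of depth levels
alpha = 0.05                   # assumed significance level
num_pairs = comb(k, 2)         # number of pairwise comparisons: k(k - 1)/2

print(num_pairs)                                            # 3
print(round(familywise_error_rate(alpha, num_pairs), 4))    # 0.1426

# Bonferroni-style modified significance level (one common choice, assumed here)
modified_alpha = alpha / num_pairs
print(round(modified_alpha, 4))                             # 0.0167
```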

26 of 26

Which means differ?

Based on the box plots below, which means would you expect to be significantly different?

(a) bottom & surface
(b) bottom & mid-depth
(c) mid-depth & surface
(d) bottom & mid-depth; mid-depth & surface
(e) bottom & mid-depth; bottom & surface; mid-depth & surface