1 of 26

Introduction to ANOVA

Week 9 Lecture 1

2 of 26

  • The Wolf River in Tennessee flows past an abandoned site once used by the pesticide industry for dumping wastes, including chlordane (pesticide), aldrin, and dieldrin (both insecticides)
  • These highly toxic organic compounds can cause various cancers and birth defects
  • The standard method to test whether these substances are present in a river is to take samples at six-tenths depth
  • But since these compounds are denser than water and their molecules tend to stick to particles of sediment, they are more likely to be found in higher concentrations near the bottom

3 of 26

Exploratory analysis

Aldrin concentration (nanograms per liter) at three levels of depth

aldrin  depth
3.8     bottom
4.8     bottom
4.9     bottom
5.3     bottom
5.4     bottom
5.7     bottom
...
5.1     surface
5.2     surface

4 of 26

Research question

Is there a difference between the mean aldrin concentrations among the three levels?

  • To compare means of 2 groups we used a t statistic
  • To compare means of 3+ groups we use a new test called ANOVA and a new statistic called F

5 of 26

ANOVA

ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable

H0 : The mean outcome is the same across all categories,

𝜇1 = 𝜇2 = … = 𝜇k

where 𝜇i represents the mean of the outcome for observations in category i

HA : At least one mean is different from the others

6 of 26

Hypotheses

  1. H0 : 𝜇B = 𝜇M = 𝜇S
     HA : 𝜇B ≠ 𝜇M ≠ 𝜇S
  2. H0 : 𝜇B ≠ 𝜇M ≠ 𝜇S
     HA : 𝜇B = 𝜇M = 𝜇S
  3. H0 : 𝜇B = 𝜇M = 𝜇S
     HA : At least one mean is different
  4. H0 : 𝜇B = 𝜇M = 𝜇S = 0
     HA : At least one mean is different
  5. H0 : 𝜇B = 𝜇M = 𝜇S
     HA : 𝜇B > 𝜇M > 𝜇S

7 of 26

Hypotheses

  • H0 : 𝜇B = 𝜇M = 𝜇S
    HA : 𝜇B ≠ 𝜇M ≠ 𝜇S
  • H0 : 𝜇B ≠ 𝜇M ≠ 𝜇S
    HA : 𝜇B = 𝜇M = 𝜇S
  • H0 : 𝜇B = 𝜇M = 𝜇S
    HA : At least one mean is different (correct answer)
  • H0 : 𝜇B = 𝜇M = 𝜇S = 0
    HA : At least one mean is different
  • H0 : 𝜇B = 𝜇M = 𝜇S
    HA : 𝜇B > 𝜇M > 𝜇S

8 of 26

𝘵 test vs. ANOVA - Purpose

𝘵 test

Compare means from two groups to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability

H0 : 𝜇1 = 𝜇2

ANOVA

Compare the means from two or more groups to see whether they are so far apart that the observed differences cannot all reasonably be attributed to sampling variability

H0 : 𝜇1 = 𝜇2 = … = 𝜇k

9 of 26

z/𝘵 test vs. ANOVA - Method

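z/𝘵 test

Compute a test statistic (a ratio)

$z/t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)}$

ANOVA

Compute a test statistic (a ratio)

$F = \frac{\text{variability between groups}}{\text{variability within groups}}$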

  • Large test statistics lead to small p-values
  • If the p-value is small enough, H0 is rejected and we conclude that the population means are not all equal

10 of 26

𝘵 test vs. ANOVA

  • With only two groups, the 𝘵 test and ANOVA are equivalent, but only if we use a pooled variance in the denominator of the test statistic
  • With more than two groups, ANOVA compares the sample means to an overall grand mean
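The two-group equivalence can be checked numerically: with a pooled variance, the square of the t statistic equals the ANOVA F statistic. The sketch below is a minimal illustration using only the Python standard library. The data are hypothetical toy values made up for this example (not the aldrin data), and the helper names `pooled_t` and `anova_f` are illustrative.

```python
from statistics import mean

# Hypothetical toy data, made up purely for illustration (not the aldrin data).
group_a = [4.0, 5.0, 6.0, 5.0]
group_b = [6.0, 7.0, 8.0, 7.0]

def pooled_t(x, y):
    """Two-sample t statistic using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss_x = sum((v - mx) ** 2 for v in x)
    ss_y = sum((v - my) ** 2 for v in y)
    pooled_var = (ss_x + ss_y) / (nx + ny - 2)
    return (mx - my) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F statistic: MSG / MSE."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ssg / (k - 1)) / (sse / (n - k))

t_stat = pooled_t(group_a, group_b)
f_stat = anova_f([group_a, group_b])
print(t_stat ** 2, f_stat)  # the two values agree up to rounding error
```

If the t statistic used separate (unpooled) variances instead, the two values would generally differ unless the groups have equal sizes and the arithmetic happens to coincide.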

11 of 26

Test statistic

Does there appear to be a lot of variability within groups? How about between groups?

12 of 26

𝑭 distribution and p-value

  • In order to be able to reject H0, we need a small p-value, which requires a large F statistic
  • In order to obtain a large F statistic, the variability between the sample means needs to be large relative to the variability within the groups

13 of 26

Conditions/Assumptions

  1. The observations should be independent within and between groups
    • Always important, but sometimes difficult to check

  2. The observations within each group should be nearly normal
    • Especially important when the sample sizes are small

How do we check for normality? (Remember previous lectures)

  3. The variability across the groups should be about equal
    • Especially important when the sample sizes differ between groups

How can we check this condition?

14 of 26

Degrees of freedom associated with ANOVA

  • groups: dfG = k - 1, where k is the number of groups
  • total: dfT = n - 1, where n is the total sample size
  • error: dfE = dfT - dfG

  • dfG = k - 1 = 3 - 1 = 2
  • dfT = n - 1 = 30 - 1 = 29
  • dfE = 29 - 2 = 27

15 of 26

Sum of squares between groups, SSG

Measures the variability between groups

$SSG = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$

where $n_i$ is each group size, $\bar{x}_i$ is the average for each group, and $\bar{x}$ is the overall (grand) mean
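As a small illustration of the SSG definition, the sketch below weights each group's squared deviation from the grand mean by the group size. The function name `sum_of_squares_between` and the toy data are hypothetical, chosen only so the arithmetic is easy to follow by hand. The slides list only part of the aldrin data, so it is not used here.

```python
from statistics import mean

def sum_of_squares_between(groups):
    """SSG: sum over groups of n_i * (group mean - grand mean)^2."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    return sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Hypothetical toy data: group means are 2, 3, and 7; the grand mean is 4.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]

ssg = sum_of_squares_between(toy_groups)
print(ssg)  # 3*(2-4)^2 + 3*(3-4)^2 + 3*(7-4)^2 = 42
```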

16 of 26

aldrin  depth
3.8     bottom
4.8     bottom
4.9     bottom
5.3     bottom
5.4     bottom
5.7     bottom
...
5.1     surface
5.2     surface

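Sum of squares total, SST

Measures the total variability in the data, ignoring group membership

$SST = \sum_{i=1}^{n} (x_i - \bar{x})^2$

where $x_i$ represents each observation in the dataset and $\bar{x}$ is the overall (grand) mean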

$SST = (3.8 - 5.1)^2 + (4.8 - 5.1)^2 + (4.9 - 5.1)^2 + \dots + (5.2 - 5.1)^2$

$= (-1.3)^2 + (-0.3)^2 + (-0.2)^2 + \dots + (0.1)^2$

$= 1.69 + 0.09 + 0.04 + \dots + 0.01$

$= 54.29$

17 of 26

Sum of squares error, SSE

Measures the variability within groups:

SSE = SST - SSG

SSE = 54.29 - 16.96 = 37.33
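The shortcut SSE = SST - SSG works because the total variability splits exactly into a between-group part and a within-group part. The sketch below checks this on hypothetical toy data (not the aldrin data) by computing SSE two ways: directly from deviations around each group's own mean, and by subtraction.

```python
from statistics import mean

# Hypothetical toy data, made up purely for illustration.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
all_values = [v for g in toy_groups for v in g]
grand_mean = mean(all_values)

# Total: every observation against the grand mean.
sst = sum((v - grand_mean) ** 2 for v in all_values)

# Between groups: each group mean against the grand mean, weighted by group size.
ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in toy_groups)

# Within groups: every observation against its own group mean.
sse_direct = sum((v - mean(g)) ** 2 for g in toy_groups for v in g)
sse_by_subtraction = sst - ssg

print(sst, ssg, sse_direct, sse_by_subtraction)
```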

18 of 26

Mean square error

A mean square is calculated as a sum of squares divided by its associated degrees of freedom

MSG = 16.96/2 = 8.48

MSE = 37.33/27 = 1.38

19 of 26

Test statistic, F value

As we discussed before, the F statistic is the ratio of the between group and within group variability

$F = \frac{MSG}{MSE}$
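The pieces from the previous slides can be assembled into a small ANOVA table. The sketch below uses only the summary values quoted on the slides (k = 3, n = 30, SSG = 16.96, SST = 54.29). The variable names are illustrative.

```python
# Summary values quoted on the slides: k groups, n observations, SSG, SST.
k, n = 3, 30
ssg, sst = 16.96, 54.29

# Degrees of freedom
df_g = k - 1          # groups
df_t = n - 1          # total
df_e = df_t - df_g    # error

# Sum of squares error, mean squares, and the F statistic
sse = sst - ssg
msg = ssg / df_g
mse = sse / df_e
f_stat = msg / mse

# Print a small ANOVA table
print(f"{'source':<8}{'df':>4}{'SS':>8}{'MS':>8}{'F':>8}")
print(f"{'groups':<8}{df_g:>4}{ssg:>8.2f}{msg:>8.2f}{f_stat:>8.2f}")
print(f"{'error':<8}{df_e:>4}{sse:>8.2f}{mse:>8.2f}")
print(f"{'total':<8}{df_t:>4}{sst:>8.2f}")
```

Carrying the unrounded MSE through gives an F statistic of about 6.13. Dividing the rounded mean squares, 8.48 / 1.38, gives about 6.14.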

20 of 26

p-value

The p-value is the probability of observing a ratio of “between group” to “within group” variability at least as large as the one observed, if in fact the means of all groups are equal. It is calculated as the area under the F curve, with degrees of freedom dfG and dfE, above the observed F statistic.

dfG = 2; dfE = 27
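For this example the upper-tail area can be sketched without a statistics library, because an F distribution with exactly 2 numerator degrees of freedom has a closed-form tail: P(F ≥ f) = (1 + 2f/dfE)^(-dfE/2). This shortcut is an assumption specific to dfG = 2. For other degrees of freedom a statistics package (such as R, which these slides use) would be the usual route. The function name below is hypothetical.

```python
def f_upper_tail_df1_2(f_value, df2):
    """P(F >= f_value) for an F distribution with 2 and df2 degrees of freedom.

    Valid only when the numerator degrees of freedom equal 2.
    """
    return (1 + 2 * f_value / df2) ** (-df2 / 2)

# F statistic rebuilt from the slide values: MSG = 16.96/2, MSE = 37.33/27.
f_stat = (16.96 / 2) / (37.33 / 27)
p_value = f_upper_tail_df1_2(f_stat, 27)

print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")
```

The resulting p-value is well below 0.05, which is consistent with rejecting H0 at the 5% significance level.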

21 of 26

R, Please Save Me!

22 of 26

Conclusion

  • If the p-value is small (less than α), reject H0. The data provide convincing evidence that at least one mean is different from the others (but we can’t tell which one)
  • If the p-value is large, fail to reject H0. The data do not provide convincing evidence that at least one pair of means differ from each other; the observed differences in sample means can reasonably be attributed to sampling variability (or chance)

23 of 26

Conclusion - in context

What is the conclusion of the hypothesis test?

The data provide convincing evidence that the average aldrin concentration

  1. is different for all groups
  2. on the surface is lower than the other levels
  3. is different for at least one group
  4. is the same for all groups

24 of 26

Conclusion - in context

What is the conclusion of the hypothesis test?

The data provide convincing evidence that the average aldrin concentration

  • is different for all groups
  • on the surface is lower than the other levels
  • is different for at least one group (correct answer)
  • is the same for all groups

25 of 26

Which means differ?

  • We concluded that at least one pair of means differ. The natural question that follows is “which ones?”
  • We can do two sample 𝘵 tests for differences in each possible pair of groups

Can you see any pitfalls with this approach?

  • When we run too many tests, the Type 1 Error rate increases
  • This issue is resolved by using a modified significance level
    • We will see it later this week.
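A rough sketch of why many tests inflate the Type 1 Error rate: with three groups there are three pairwise comparisons, and if each is run at α = 0.05 the chance of at least one false rejection is noticeably larger than 0.05. The calculation below treats the tests as independent, which is a simplifying assumption (pairwise tests that share groups are not independent). The Bonferroni-style adjustment shown is one common choice of modified significance level, assumed here for illustration and not necessarily the one used later in the course.

```python
from math import comb

k = 3          # number of groups (bottom, mid-depth, surface)
alpha = 0.05   # assumed per-test significance level

# Number of pairwise comparisons among k groups: k choose 2.
n_pairs = comb(k, 2)

# Chance of at least one Type 1 Error across all tests,
# under the simplifying assumption that the tests are independent.
familywise_error = 1 - (1 - alpha) ** n_pairs

# Bonferroni-style modified significance level (an assumed choice).
adjusted_alpha = alpha / n_pairs

print(n_pairs, round(familywise_error, 4), round(adjusted_alpha, 4))
```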

26 of 26

Which means differ?

Based on the box plots below, which means would you expect to be significantly different?

  1. bottom & surface
  2. bottom & mid-depth
  3. mid-depth & surface
  4. bottom & mid-depth; mid-depth & surface
  5. bottom & mid-depth; bottom & surface; mid-depth & surface