1 of 124

Comparing means among groups

Working with continuous data: One-way ANOVAs and multiple comparisons, introducing the linear model, t-test connections, and non-parametric options

J. Stephen Gosnell

Baruch College

2 of 124

Goals

Extend tests of continuous data to comparing groups

ANOVA as first linear model
more multiple comparisons

Historical connections: t-tests
Non-parametric options

3 of 124

Frequency Distributions Or Histograms

A frequency distribution or histogram can be used to graphically analyze variability
Show skewness and shape
Fisher’s famous sepal length dataset
How often did each value occur in a sample?

These graphs look simple because they are ones YOU can replicate in R. See class files!

From earlier lecture!

4 of 124

Iris virginica, Frank Mayfield, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons

Flower morphology. Pearson Scott Foresman, Public domain, via Wikimedia Commons

5 of 124

Transition to Hypothesis Testing

What is a hypothesis we might want to test?

These graphs look simple because they are ones YOU can replicate in R. See class files!

11 of 124

What about multiple groups?

What do we test now (and how?)
We still focus on distributions!

Asking if the data are “equal” to something is not helpful!
focus on means!

12 of 124

Wide vs long aside

Note we have one “unit” per row

may have multiple measurements!

14 of 124

Explaining the data

Group means + noise
Overall average + noise

15 of 124

Null Hypothesis, Visualized

Sample from a single population and assign to each group
Under the null, the group means would fall exactly on the grand mean, apart from sampling error

16 of 124

Testing the Null Hypothesis

What is the test statistic?

3+ groups, so we can’t just subtract

Instead, we can calculate variance within and among groups and compare
Under the null, both give an estimate of the same σ²
So we can calculate both and make a ratio

17 of 124

Variance Estimated Within Each Group: Mean Square Error (MSE)

18 of 124

Variance Estimated Among Each Group

The variance among group means estimates σ²/n, so multiply by n to get an estimate of σ²
mean square treatment

19 of 124

We Can Approximate The Distribution Under The Null Hypothesis Using The F Distribution And Error Estimates

If you divide these two estimates of variance when H₀ is true, you should be close to 1

In other words, variance among groups = variance within groups
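The two variance estimates can be sketched by hand in R. This is a minimal sketch that assumes the response is sepal length from R's built-in iris data and that the design is balanced (50 flowers per species); the variable names are illustrative.

```r
# Hand calculation of the F ratio for sepal length among iris species.
# Assumes a balanced design (50 flowers per species), as in the built-in iris data.
n_per_group <- 50
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_vars  <- tapply(iris$Sepal.Length, iris$Species, var)

# Within groups: average the group variances (mean square error)
mse <- mean(group_vars)

# Among groups: the variance of the group means estimates sigma^2 / n,
# so multiply by n (mean square treatment)
mst <- n_per_group * var(group_means)

# Ratio of the two estimates; near 1 if the null hypothesis is true
f_ratio <- mst / mse
f_ratio
```

With unequal group sizes the simple average of group variances would need to be replaced by a weighted version.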

20 of 124

What are the ways we can find p-values?

21 of 124

P-value Via Simulation

Ratio is at >119 in our data!
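One way to sketch the simulation: draw all 150 values from a single normal population, keep the group labels, and record the F ratio each time. The seed, the number of simulations, and the use of a normal population are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# F ratio for a response y and grouping factor g
f_stat <- function(y, g) anova(lm(y ~ g))[1, "F value"]
observed_f <- f_stat(iris$Sepal.Length, iris$Species)

# Under the null every flower comes from one population:
# draw 150 values from a single normal distribution and keep the labels
grand_mean <- mean(iris$Sepal.Length)
grand_sd   <- sd(iris$Sepal.Length)
null_f <- replicate(1000, f_stat(rnorm(150, grand_mean, grand_sd), iris$Species))

# Proportion of simulated ratios at least as large as the observed one
p_sim <- mean(null_f >= observed_f)
p_sim
```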

22 of 124

But this requires sampling

and is slow
Given our earlier calculations, it turns out we can approximate this distribution using a ratio of χ² variables

remember those?

a standard Z squared!

but now, numerator has df = # of groups - 1 and denominator has df = n - # of groups

F ratio!

23 of 124

P-value Via Distribution

Ratio is at >119 in our data!
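The same p-value can be read from the F distribution. A minimal sketch, assuming 3 groups and 150 observations (the iris data) and the F ratio reported later in the deck:

```r
f_obs     <- 119.26   # F ratio reported for the iris example
df_groups <- 3 - 1    # number of groups - 1
df_error  <- 150 - 3  # n - number of groups

# Upper-tail area of the F distribution beyond the observed ratio
p_value <- pf(f_obs, df_groups, df_error, lower.tail = FALSE)
p_value
```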

24 of 124

Sums Of Squares Notes: Divide Variance Among Parts Of The Model

Big idea is we can break up the total sum of squares into parts!

The total sum of squares (squared distance of each point from the grand mean, summed over all points) equals the sum of squared distances of each point from its group mean plus the sum of squared distances of each point’s group mean from the grand mean
These add up

We then divide these by their degrees of freedom to get mean squares

Mean squares are not additive!

Is there more variance among or within groups?

25 of 124

Sums Of Squares Notes: Divide Variance Among Parts Of The Model

We looked at difference between estimating variance within and among groups under the null
Realize we can always break the data points into these components

deviation between observation and the mean of its group (called “error”)
deviation between observation’s group mean and the grand mean
These add up to total error (deviation from grand mean!)

Is there more variance among or within groups?
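The partition can be checked numerically. A minimal sketch, assuming sepal length in the built-in iris data is the response:

```r
y <- iris$Sepal.Length
grand_mean <- mean(y)
group_mean <- ave(y, iris$Species)  # each observation's own group mean

ss_total  <- sum((y - grand_mean)^2)           # observation vs grand mean
ss_error  <- sum((y - group_mean)^2)           # observation vs its group mean
ss_groups <- sum((group_mean - grand_mean)^2)  # group mean vs grand mean

# Sums of squares add up
c(total = ss_total, groups_plus_error = ss_groups + ss_error)

# Mean squares: divide by degrees of freedom (these do not add up)
ms_groups <- ss_groups / (3 - 1)
ms_error  <- ss_error / (150 - 3)
```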

26 of 124

Welcome to the linear model

First, make the model (minus intercept here) and check assumptions
Notice the formula interface and need to save to an object
Residuals/errors are basis for assumptions
Visual inspection
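A minimal sketch of this step, assuming sepal length by species in the built-in iris data; the object name `fit_means` is illustrative.

```r
# Formula interface: response ~ explanatory variable; "- 1" drops the intercept
# so the coefficients are simply the group means. Save the model to an object.
fit_means <- lm(Sepal.Length ~ Species - 1, data = iris)
coef(fit_means)

# Residuals (observation minus its group mean) are the basis for the assumptions
res <- residuals(fit_means)
summary(res)
```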

29 of 124

Remember q-q plots!

https://rpubs.com/mbh038/725314

32 of 124

Visually Checking Assumptions

4 plots from R
Note all red lines are fairly flat and variance appears similar around each group in Residuals vs Fitted plot
Note dots in qq-plot are close to line, which indicates normality

E are identically distributed

E follow a normal distribution

No outliers

Bands are ok! Why?

33 of 124

Assumptions

Explaining 4 plots from R

Check residuals vs fitted for any structure
Q-q plot should show basic normality
Scale-location plot is another way of looking at residuals vs fitted; check for increase with fitted values
Residuals vs leverage shows how far a point is from fitted points vs how much influence it has (leverage).

Cook's Distance compares the fitted response of the regression which uses every data point, against the fitted response of the regression where a particular data point has been dropped from the analysis (and then sums this difference across all data points).
Very influential data points (on the parameter estimates) are identified, and are labeled in this plot.

important if they fall outside dashed red Cook Distance marker (upper or lower right corners), but you may not even see this!
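The four plots come from calling `plot()` on the saved model object. A minimal sketch, assuming the iris model with an intercept; the numeric summaries at the end are an optional extra, not something the slides show.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# Four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# The quantities behind the last plot
lev   <- hatvalues(fit)       # leverage of each observation
cooks <- cooks.distance(fit)  # influence of each observation
max(cooks)
```

In a balanced one-way design every observation has the same leverage, which is why the points line up in vertical bands, one per group.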

34 of 124

Check Model Outcomes

Note R sets first group (alphabetically) as intercept for others
This is what you want to use (model with intercept)!
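A minimal sketch of the model with an intercept, assuming the iris example:

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# (Intercept) is the mean of the first level alphabetically (setosa);
# the other rows are differences from that level
coef(summary(fit))

intercept <- coef(fit)[["(Intercept)"]]
setosa_mean <- mean(iris$Sepal.Length[iris$Species == "setosa"])
```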

35 of 124

Problems with LM approach

We get a p-value for each level of the variable, but not an overall p-value

we could use model level for now, but not long-term/consistent answer

But other info here is useful!

36 of 124

Other lm Benefits

Full p-value for model is also available

Same as variable level p-value fit for one-way ANOVA!

37 of 124

Other benefits of LM: P-value Isn’t Everything!

You may have significant differences among groups; but how much variance does it explain?

R² answers this

R² = SS_groups/SS_total
Relies on partitioning of variance that we noted earlier!
If R² is near 0, nearly all of the variability is within groups (rather than among) and the group means are all similar

Must use model with intercept! Otherwise R sets the overall mean to 0 for the null model and inflates F and R² greatly!
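Both points can be sketched with the iris example: R² computed from the sums of squares, and the inflated value that appears when the intercept is dropped.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)
tab <- anova(fit)

# R^2 = SS_groups / SS_total
r2_by_hand <- tab["Species", "Sum Sq"] / sum(tab[["Sum Sq"]])
r2_model   <- summary(fit)$r.squared

# Without an intercept the comparison is to a null model with mean 0,
# so R^2 looks far better than it should
fit_no_int <- lm(Sepal.Length ~ Species - 1, data = iris)
r2_no_int  <- summary(fit_no_int)$r.squared

c(by_hand = r2_by_hand, with_intercept = r2_model, no_intercept = r2_no_int)
```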

38 of 124

Can Be Used To Compare Models As Well….

But we prefer smaller models
The adjusted R² penalizes larger models so you can compare them

More on this later!

For single model analysis/description, just use the multiple R-squared value

39 of 124

Model without Intercept Issues

Compare to model without intercept
Provides group means, but otherwise causes error for follow-up analysis

Overall F and R² are incorrect, as the model is compared to a null model with a zero intercept!

Can also get groups means from summarySE function to avoid this

41 of 124

How to Build Anova Tables

Traditional way of viewing outcomes
Provides overall p-value for if your treatment was significant

Same as overall model fit for one-way ANOVA!

In the traditional table, note SS add up
Make sure fit for model with intercept!

Reporting:

F_{2,147} = 119.26, p < .001; reject null hypothesis
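A minimal sketch of the table using base R's `anova()`. The slides appear to use a type III table from another function; for a one-way design all types give the same answer, so the base version is used here as a stand-in.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)  # model WITH intercept

aov_table <- anova(fit)
aov_table

# Pieces needed for reporting: F, both df, and the p-value
f_value <- aov_table["Species", "F value"]
dfs     <- aov_table[["Df"]]
p_value <- aov_table["Species", "Pr(>F)"]

# Sums of squares in the table add up to the total
ss_total <- sum((iris$Sepal.Length - mean(iris$Sepal.Length))^2)
```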

42 of 124

What Is Type “III”?

Partitioning variance is key to model comparison overall

How much is explained by null model?
How much “extra” is explained by larger model?
These questions can be answered by partitioning variance among different nested models and comparing fit using an F distribution

In other words, the big picture of linear model analysis is we can determine what adding a particular factor into a model does given other factors are already included in the model
We do this by considering sums of squares, but these can be calculated in multiple ways
For one-way ANOVA (what we are doing), they are all the same, but later on it matters
We’ll come back to this when we need it, but good tip is just to get in habit of specifying type III

and maybe something about contrasts


Post-hoc Tests

If we have a significant p-value from the Anova, we need to see what caused that to happen
For tests with more than 2 groups, any difference can invalidate null hypothesis?
Which one?

Running individual tests will inflate type 1 error rate!

Every test has a 5% chance of a false positive when its null hypothesis is true if α=.05
So the chance of at least one false positive grows with the number of tests, independent or not

Answer?

Control for the family-wise error rate by adjusting α
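The inflation is easy to see with a little arithmetic. This sketch assumes the tests are independent, which is the simplest case; the numbers of tests are chosen only for illustration.

```r
alpha   <- 0.05
n_tests <- c(1, 3, 10, 20)

# Chance of at least one false positive across independent tests
fwer <- 1 - (1 - alpha)^n_tests
data.frame(n_tests, fwer = round(fwer, 3))

# Bonferroni: a per-test alpha that keeps the family-wise rate at or below 0.05
alpha_bonf <- alpha / n_tests
fwer_bonf  <- 1 - (1 - alpha_bonf)^n_tests
```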

45 of 124

https://xkcd.com/882

49 of 124

Options (all seen before!)

Bonferroni

Divide alpha by total number of tests you need to complete
Each test must meet this new alpha level to be “significant”
For this problem, there are a lot of comparisons?

10! Let’s list them!

So the new α is .05/10 = .005

Small!

Adjusted/Sequential Bonferroni (Holm’s in R)

Rank post-hoc tests by p-value, smallest to largest
For each p-value, starting with smallest, compare to alpha/(number of remaining tests)

Maximize α (false discovery rate; the Benjamini-Hochberg method in R)
Rank post-hoc tests by p-value, smallest to largest
For each p-value, calculate a q-value ((total number of tests/rank of p-value)*p-value) and compare to alpha

Many more!
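All three options are available through `p.adjust()` in base R, which returns adjusted p-values to compare against the original α. The three raw p-values here are hypothetical, made up only to show how the methods differ.

```r
p_raw <- c(0.001, 0.02, 0.04)  # hypothetical p-values from three post-hoc tests

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # multiply by number of tests
p_holm <- p.adjust(p_raw, method = "holm")        # sequential Bonferroni
p_fdr  <- p.adjust(p_raw, method = "BH")          # false discovery rate

data.frame(p_raw, p_bonf, p_holm, p_fdr)

# Number of comparisons still significant at alpha = 0.05 under each method
n_sig <- c(bonferroni = sum(p_bonf < 0.05),
           holm       = sum(p_holm < 0.05),
           fdr        = sum(p_fdr < 0.05))
n_sig
```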

50 of 124

Multcomp Lets Us Do Post Hoc Tests With Lm And Extensions

Tukey’s means to compare all means

not exactly TukeyHSD

glht command just needs model and to specify which variable you are interested in

“Species” here
note need to save as object

Reject all null hypotheses under your original α level
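A minimal sketch of the comparison for the iris model. The `glht()` call mirrors what the slide describes and only runs if multcomp is installed; the base R `TukeyHSD()` call is a stand-in that, as noted above, is not exactly the same procedure.

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

# Base R stand-in: Tukey's honest significant differences on the aov version
tukey_base <- TukeyHSD(aov(Sepal.Length ~ Species, data = iris))
tukey_base$Species

# multcomp version: model object plus the variable of interest, saved to an object
if (requireNamespace("multcomp", quietly = TRUE)) {
  comp <- multcomp::glht(fit, linfct = multcomp::mcp(Species = "Tukey"))
  print(summary(comp))
}
```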

55 of 124

Options

Bonferroni

Divide alpha by total number of tests you need to complete
Each test must meet this new alpha level to be “significant”
For this problem, there are a lot of comparisons?

3! Let’s list them!

So the new α is .05/3 = .0167

3 is number of comparisons we are making!
Small!

Adjusted/Sequential Bonferroni (Holm’s in R)

Rank post-hoc tests by p-value, smallest to largest
Apportion α as needed to meet these until you run out

Maximize α
Rank post-hoc tests by p-value, smallest to largest
Find “largest” p-value you can accept; all before (smaller than) that are significant!

Many more!

58 of 124

Multcomp Lets Us Do Post Hoc Tests With Lm And Extensions

Can also specify contrasts
Set control method

59 of 124

A Little Linear Algebra Goes A Long Way

The general linear model! This is what unites t-tests, anovas, ancovas, regression, and can be extended in a number of ways!

60 of 124

A Little Linear Algebra Goes A Long Way

Y = XB + E
Y = response

N x 1 matrix

X = explanatory variables

N x #of explanatory variables

B = matrix of coefficients

Think of as slopes (numerical) or adjustments for groups (factors)
# of explanatory variables x 1

E is our error (N x 1) matrix

Remember, multiply row by columns to get an answer for matrices

Matrices must match dimensions to multiply!

Build the linear model for our iris example!

61 of 124

Model/design matrix

62 of 124

Coefficient/Beta matrix

63 of 124

Residuals
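The three matrices can be pulled out of a fitted model. A minimal sketch, assuming the iris model with an intercept, that checks Y = XB + E directly:

```r
fit <- lm(Sepal.Length ~ Species, data = iris)

Y <- matrix(iris$Sepal.Length, ncol = 1)  # response, N x 1
X <- model.matrix(fit)                    # model/design matrix, N x 3
B <- matrix(coef(fit), ncol = 1)          # coefficient/beta matrix, 3 x 1
E <- matrix(residuals(fit), ncol = 1)     # residuals, N x 1

dim(X)
head(X, 3)

# Largest gap between Y and XB + E (should be essentially zero)
max_gap <- max(abs(Y - (X %*% B + E)))
max_gap
```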

64 of 124

Why Do This?

Explains

Singularity messages

When a column is linear combination of others
connected to non-invertible matrix

we don’t divide matrices, instead we multiply by inverse

So to solve for B in XB = y, we use B = (X’X)⁻¹X’y

But the determinant of X’X is 0 when a column is a linear combination of the others, so there is no inverse, since the inverse is (1/determinant) × [rearranged matrix]

Similar logic eventually used to find leverage values via H (hat) matrix

https://www.mathsisfun.com/algebra/matrix-inverse.html
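A minimal sketch of both ideas with the iris design matrix: solving for B with the inverse, and what a redundant column does. The extra setosa indicator column is a made-up example of a column that is a linear combination of the others.

```r
y <- iris$Sepal.Length
X <- model.matrix(~ Species, data = iris)

# Multiply by the inverse instead of dividing: B = (X'X)^-1 X'y
B_hat <- solve(t(X) %*% X) %*% t(X) %*% y
B_lm  <- coef(lm(Sepal.Length ~ Species, data = iris))

# Add a setosa indicator: intercept = setosa + versicolor + virginica columns,
# so the new matrix has 4 columns but only 3 independent ones
X_bad <- cbind(X, Speciessetosa = as.numeric(iris$Species == "setosa"))
rank_bad <- qr(X_bad)$rank
c(columns = ncol(X_bad), rank = rank_bad)
```

When `lm()` meets a design like `X_bad` it reports the redundant coefficient as NA, which is where the singularity message comes from.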

65 of 124

Why Do This?

Degrees of freedom

how many parameters did you estimate?
equal to columns of X matrix (or rows of B matrix)

similarity among tests

Shared assumptions

Linear relationship among the response and predictor variables
Errors are identically and independently distributed and follow a normal distribution

NOTE THIS IS THE ERRORS/RESIDUALS, NOT THE MEANS!

Allows contrasts

Multiple comparisons actually test for differences among Betas!

Errors iid ~ N(0, σ²) is all you need!

66 of 124

Back To Linear Algebra: Orthogonal Contrasts Don’t Need Adjusting

User-defined contrasts may be useful for setting orthogonal contrasts
Orthogonal contrasts are ones that are independent, meaning you can’t form them from other columns in the X portion of the matrix

sum of multiplied coefficients is 0
k-1 orthogonal contrasts exist

Means you don’t have to correct for FWER
Can be specified in multcomp and then not corrected

If you combine groups, use fractions if you want estimates of differences
must add to 0

Makes no difference for p-values

69 of 124

Orthogonal Contrasts

Sum of the products of contrast coefficients is 0
Everything compared once!

set totals to 1 or -1 for appropriate estimates
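A minimal sketch of checking a pair of contrasts for the three iris species. The two contrasts (setosa vs the other two, then versicolor vs virginica) are chosen for illustration, and the simple orthogonality check assumes a balanced design.

```r
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Fractions so positive and negative parts total 1 and -1,
# which makes the estimate a difference in means
c1 <- c(setosa = 1, versicolor = -0.5, virginica = -0.5)
c2 <- c(setosa = 0, versicolor = 1,    virginica = -1)

sum(c1)       # each contrast must add to 0
sum(c2)
sum(c1 * c2)  # orthogonal if the sum of products is 0

# Estimated differences
est1 <- sum(c1 * group_means)  # setosa minus average of the other two
est2 <- sum(c2 * group_means)  # versicolor minus virginica
c(est1, est2)
```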

70 of 124

Graphical Representation Of Results

Groups which cannot be distinguished share the same letter
Done by adding column to output from summarySE function

72 of 124

Other options

73 of 124

t-test connections

Often used as bridge from 1 to many samples

easy statistics

Let D = the difference between the mean of the first group and the mean of the second group

74 of 124

Deriving The Test Statistics For 2 Samples

We have to estimate noise again
Null hypothesis is that they come from a single population, so draw from one urn
Variance estimate still does not matter!

75 of 124

t-test connections

2-sample t-test is special case of ANOVA
since numerator df is 1, only report denominator df
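The connection can be checked directly: with two groups and equal variances assumed, t² equals F and the p-values match. A minimal sketch using two of the iris species as the two samples:

```r
two <- droplevels(subset(iris, Species != "setosa"))

# Equal-variance two-sample t-test
t_res <- t.test(Sepal.Length ~ Species, data = two, var.equal = TRUE)

# One-way ANOVA on the same two groups
f_res <- anova(lm(Sepal.Length ~ Species, data = two))

t_squared <- unname(t_res$statistic)^2
f_value   <- f_res[1, "F value"]
c(t_squared = t_squared, F = f_value)
```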

77 of 124

Differences: Estimate Of Pooled Variance

If we assume both populations have the same variance, we can simply weight our estimates of the variance

Assumes that larger sample yields better estimate, so weight it more

Still controls for degrees of freedom

This simplifies to

Pooled variance formula
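The pooled variance weights each sample variance by its degrees of freedom. A minimal sketch of the standard formula, using two iris species as stand-in samples and checking the result against `t.test()`:

```r
x1 <- iris$Sepal.Length[iris$Species == "versicolor"]
x2 <- iris$Sepal.Length[iris$Species == "virginica"]
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance: each sample variance weighted by its degrees of freedom
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# Signal (difference in means) over noise (standard error from pooled variance)
t_hand <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

t_r <- t.test(x1, x2, var.equal = TRUE)
c(by_hand = t_hand, t.test = unname(t_r$statistic))
```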

78 of 124

Differences: Estimate Of Pooled Variance

However, we often don’t know this, and using an overall (pooled) estimate of the variance is highly biased if H₀ is false
Instead, we calculate each individual estimate and weight them
Known as Welch’s t-test (an approximate solution to the Behrens-Fisher problem)

Leads to an approximate following of the t-distribution
Degrees of freedom can be non-integer (decimal) and less than n1 + n2 - 2

Corrected variance formula – default used in R

Remember:

79 of 124

Degrees Of Freedom Odd For Unbalanced Design!

Cavity data

General formula (Welch modification for df)

Degrees of freedom can be non-integer (decimal) and less than n1 + n2 - 2
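A minimal sketch of the Welch degrees of freedom, computed by hand and by `t.test()` (which uses the Welch version by default). Two iris species stand in for the cavity data, which is not reproduced here; even with this balanced sample the df is non-integer.

```r
x1 <- iris$Sepal.Length[iris$Species == "versicolor"]
x2 <- iris$Sepal.Length[iris$Species == "virginica"]
n1 <- length(x1)
n2 <- length(x2)

# Per-group variance of the mean
v1 <- var(x1) / n1
v2 <- var(x2) / n2

# Welch (Satterthwaite) approximation for the degrees of freedom
df_hand <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

welch <- t.test(x1, x2)  # var.equal = FALSE is the default
c(by_hand = df_hand, t.test = unname(welch$parameter), pooled = n1 + n2 - 2)
```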

80 of 124

Welch/Behrens-Fisher t-test

81 of 124

The Signal/Noise Ratio

We usually assume no difference among means, but we can test any shift!
Note our graph shows +2.213

just reversed the means

82 of 124

What If We Violate Normality Assumption?

Wilcoxon/Mann-Whitney U test

compares the central tendencies of two groups using ranks.
Assumes distributions are the same shape

Sign test

not option for unpaired data

Bootstrapping

Needs large enough sample size

Permutation

Needs large enough sample size
not option for paired data

83 of 124

Performing A Mann-Whitney U Test

First, rank all individuals from both groups together in order (for example, smallest to largest)
Sum the ranks for all individuals in one of the groups

R₁ or R₂

Calculate the test statistic, U

U₁ is the number of times an individual from pop. 1 has a lower rank than an individual from pop. 2, out of all pairwise comparisons. (How many pairwise comparisons are possible?)
Use larger U value

U₂ = n₁n₂ – U₁

84 of 124

Assumptions Of Mann-Whitney U Test

Both samples are random samples
Both populations have the same shape of distribution
Mann-Whitney test is quite sensitive to data violating this second assumption

90 of 124

Example: Garter Snake Resistance To Newt Toxin

Rough skinned newt

By Don Loarie (Rough-skinned Newt Taricha granulosa) [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

Photo by Jessica Bolser/USFWS.

91 of 124

Comparing Snake Resistance To TTX (Tetrodotoxin)

Resistance is known to not be normally distributed within populations

Locality	Proportion of snakes resistant
Benton	0.29
Benton	0.77
Benton	0.96
Benton	0.64
Benton	0.70
Benton	0.99
Benton	0.34
Warrenton	0.17
Warrenton	0.28
Warrenton	0.20
Warrenton	0.20
Warrenton	0.37

Geffeney, S., E.D. Brodie, Jr., P.C. Ruben, and E.D. Brodie III. 2002. Mechanisms of adaptation in a predator-prey arms race: TTX-resistant sodium channels. Science 297: 1336-1339.

92 of 124

Hypotheses

H₀: The TTX resistance for snakes from Benton is the same as for snakes from Warrenton
H_A: The TTX resistance for snakes from Benton is different from snakes from Warrenton

93 of 124

Calculating The Ranks

Rank sum for Warrenton:

R = 1+4+2.5+2.5+7 = 17

Locality	Proportion of snakes resistant	Rank
Benton	0.29	5
Benton	0.77	10
Benton	0.96	11
Benton	0.64	8
Benton	0.70	9
Benton	0.99	12
Benton	0.34	6
Warrenton	0.17	1
Warrenton	0.28	4
Warrenton	0.20	2.5
Warrenton	0.20	2.5
Warrenton	0.37	7

94 of 124

Calculating U₁ And U₂

U₂ = n₁n₂ – U₁ = 5(7) – 33 = 2

For a two-tailed test, we choose the larger of U₁ or U₂: U = 33 and compare to critical U value (determined by sample sizes)

Large-sample approximation to normal exists (but not necessary with computers!)
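The rank and U calculations for the snake data can be sketched in R. The U₁ formula used here is the standard rank-sum version, with Warrenton taken as population 1; `exact = FALSE` is set because the tied values rule out an exact p-value.

```r
snakes <- data.frame(
  locality   = rep(c("Benton", "Warrenton"), times = c(7, 5)),
  resistance = c(0.29, 0.77, 0.96, 0.64, 0.70, 0.99, 0.34,
                 0.17, 0.28, 0.20, 0.20, 0.37)
)

# Rank everyone together; ties get the average rank
snakes$rank <- rank(snakes$resistance)

n_w <- sum(snakes$locality == "Warrenton")
n_b <- sum(snakes$locality == "Benton")
r_w <- sum(snakes$rank[snakes$locality == "Warrenton"])  # rank sum, Warrenton

# U1 from the rank sum, then U2 from U1
u1 <- n_w * n_b + n_w * (n_w + 1) / 2 - r_w
u2 <- n_w * n_b - u1
c(R = r_w, U1 = u1, U2 = u2)

# R reports the statistic as W, using the normal approximation here
mw <- wilcox.test(resistance ~ locality, data = snakes, exact = FALSE)
mw
```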

95 of 124

Compare U To The U Table

Critical value for U for n₁ = 5 and n₂ = 7 is 30
33 > 30, so we can reject the null hypothesis
Snakes from Benton have a different distribution of resistance to TTX than the Warrenton snakes

96 of 124

How To Deal With Ties In Rankings

Determine the ranks that the values would have gotten if they were slightly different
Average these ranks, and assign that average to each tied individual
Count all those individuals when deciding the rank of the next largest individual

Group	Y	Rank
2	12	1
2	14	2
1	17	3
1	19	4.5
2	19	4.5
1	24	6
2	27	7
1	28	8

97 of 124

In R

Same format as single-sample tests
Formula notation used here

Reporting:

W = 526, p < .01; reject null hypothesis

98 of 124

Nonparametric Version Of ANOVA

Kruskal-Wallis test
Uses the ranks of the data points (rather than their magnitudes)

Best for ranked variables

Assumes:

Group samples are randomly sampled from populations
Distribution of variable must have same shape for all groups

Data still must be fairly homoscedastic (have same variance/shape) but doesn’t need to be normal

Does not use correction!

99 of 124

In R

Same format as single-sample tests
Formula notation used here

101 of 124

In R

Post-hoc tests
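A minimal sketch of the Kruskal-Wallis test and one possible follow-up in base R, assuming the iris example. Pairwise Wilcoxon tests with a Holm correction are used here as an illustration; the slides may use a different post-hoc method.

```r
# Overall test, using formula notation
kw <- kruskal.test(Sepal.Length ~ Species, data = iris)
kw

# Post-hoc: pairwise rank-based tests with a sequential Bonferroni correction.
# exact = FALSE because ties in the data rule out exact p-values.
posthoc <- pairwise.wilcox.test(iris$Sepal.Length, iris$Species,
                                p.adjust.method = "holm", exact = FALSE)
posthoc$p.value
```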

102 of 124

Bootstrapping Option

If we can’t assume anything, bootstrapping is still an option

Deals with heteroscedastic data

Means groups can have different variances

Useful if your initial plot of assumptions show a “funnel” shape where variance increases with fitted value

Another option is to log transform the data

May use “trimmed” data

Remove top and bottom x% to minimize impact of outliers
In WRS2 package, not shown here

103 of 124

In R

Using MKinfer package for ease
Many other options!
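The idea behind the bootstrap can be sketched in base R. This is a simple percentile bootstrap of the difference in means, swapped in for illustration; it is not the MKinfer routine the slides use. The seed, the 5000 resamples, and the two iris species are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

x1 <- iris$Sepal.Length[iris$Species == "versicolor"]
x2 <- iris$Sepal.Length[iris$Species == "virginica"]
observed <- mean(x2) - mean(x1)

# Resample each group WITH replacement and recompute the difference
boot_diff <- replicate(5000, {
  mean(sample(x2, replace = TRUE)) - mean(sample(x1, replace = TRUE))
})

# Percentile 95% confidence interval for the difference in means
ci <- quantile(boot_diff, c(0.025, 0.975))
ci
```

Because each group is resampled separately, the groups are free to have different variances.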

105 of 124

In R

t1waybt for 3+ groups
Many other options!

107 of 124

Permutation is Same Idea, But You Sample Without Replacement and Combine the Data!

Remember permutations?
Logically, if groups don’t matter, you can just re-assign groups randomly among the data and calculate a signal

These should really be called (and sometimes are) combination or randomization tests

Do this lots of time to consider sampling distribution (noise)
Compare to your signal to get a p-value
Can be “exact” if you carry out every permutation!

108 of 124

Permutation Test in R

Use the coin package
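The logic can also be sketched in base R without the coin package. With only 12 snakes in the earlier TTX example, every possible relabeling can be listed, which gives an exact test. The difference in means is used as the signal here as an illustrative choice.

```r
resistance <- c(0.29, 0.77, 0.96, 0.64, 0.70, 0.99, 0.34,   # Benton
                0.17, 0.28, 0.20, 0.20, 0.37)               # Warrenton
is_warrenton <- rep(c(FALSE, TRUE), times = c(7, 5))

# Signal: Benton mean minus Warrenton mean for a given labeling
mean_diff <- function(w) mean(resistance[!w]) - mean(resistance[w])
observed <- mean_diff(is_warrenton)

# Every way to label 5 of the 12 snakes as "Warrenton"
all_labels <- combn(12, 5)
perm_diffs <- apply(all_labels, 2, function(idx) {
  mean_diff(seq_along(resistance) %in% idx)
})

# Two-sided p-value; small tolerance guards against floating-point ties
p_exact <- mean(abs(perm_diffs) >= abs(observed) - 1e-12)
p_exact
```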

111 of 124

Now That You’ve Seen These, Consider What They Require

Permutation tests also require that you assume data come from a similar distribution

Same as Mann-Whitney U test!

Bootstrapping only assumes independent data!

112 of 124

Summary

ANOVA allows for comparisons of means among groups

Significant finding requires post-hoc testing for follow-up
Main assumptions are based on residuals

Special case of linear model
Generalized form of t-test
Bootstrapping and Kruskal-Wallis test are options for data that don’t meet these assumptions
We’ll return to paired designs next week

113 of 124

Notes On Actually Doing These

Plots at beginning of this lecture show several ways to visualize data

Note you can add letters to identify specific differences after post-hoc tests using included code as well

To get results

Start with a linear model (lm) unless data is highly skewed

Initial plots of the data are good but only identify potential outliers; remember all the assumptions are based on residuals
Can also use summarySE command to get initial look at variance (sd) among groups to check for similarity

After creating an object with the lm command, plot it to check assumptions

If it passes, look at overall F-value and do post-hoc tests if needed

Make sure you note correction method used for post-hoc tests to control FWER!

If not, use bootstrapping techniques if sample size is large or Kruskal-Wallis test if you are dealing with ranks (or prefer) and then proceed

114 of 124

What To Report

Method used and why

ANOVA (lm), Kruskal-Wallis, Bootstrapping

Overall p-value

Usually denoted

F_{df groups, df error} = observed value, plus p-value, for lm
Chi-squared statistic, df, and p-value for Kruskal-Wallis
Test statistic and number of iterations for bootstrapping, along with p-value

Multiple-comparison methods and results if needed

Make sure you note correction method used for post-hoc tests to control FWER!
Generally provide table with adjusted p-values for each comparison

Generally accompanied by graph showing error bars around groups, sometimes with letters or other symbols to denote significant differences and numerical summaries to indicate amplitude of differences
Including R² value is also useful in explaining how “important” your variable is

115 of 124

R Aside: Working With Large Data

Data can be sent to you in multiple ways

We’ve covered inputting by hand or from csv from the web
You’ll see a .txt file used in assignment today

116 of 124

R Aside: Working With Large Data

Data can be sent to you in multiple formats

Long vs wide

Today you’ll see both, but it’s good to know how to convert between them

Long

Wide

117 of 124

reshape2 Package Is Useful

dcast the data into a wide format

dcast (rather than acast) gives you a data frame

melt into a long format
recast combines but can be tricky…

Long to wide: get the data

118 of 124

Long to wide: dcast; formula is row ~ columns, value.var is what you fill dataframe with

Note this only works when you have a unique identifier
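The same long-to-wide step can be sketched with base R's `reshape()`, used here as a stand-in so the example runs without reshape2. The small data frame and its column names (`id`, `measure`, `value`) are hypothetical.

```r
# Hypothetical long data: one row per flower per measurement
long <- data.frame(
  id      = rep(1:3, times = 2),
  measure = rep(c("length", "width"), each = 3),
  value   = c(5.1, 4.9, 4.7, 3.5, 3.0, 3.2)
)

# Long to wide: one row per unique identifier, one column per measurement
wide <- reshape(long, idvar = "id", timevar = "measure", direction = "wide")
wide
```

As with dcast, this relies on `id` uniquely identifying each unit.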
