1 of 43

Chi-Squared Goodness of Fit and Tests for Independence

Tests about Distributions for Categorical Variables with Multiple (Three or More) Classes, and Tests for Associations Between Pairs of Categorical Variables

2 of 43

A Reminder on Inference and Inferential Tools

  • We use statistical inference to make or test claims about population parameters which we cannot measure directly
    • We make claims by constructing confidence intervals
    • We test claims by conducting hypothesis tests
  • Confidence intervals provide a range of plausible values for a population parameter
    • They are centered at the point estimate (sample statistic)
    • They open up some “wiggle room” called a margin of error, which is influenced by the critical value and the standard error

(point estimate) ± (critical value) ⋅ (standard error)
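As a quick illustration of the formula above, the sketch below builds a 95% confidence interval for a single proportion. The sample values (540 "yes" responses out of 1000) are hypothetical and chosen only for illustration. The standard error is the usual one for a sample proportion.

```python
import math
from statistics import NormalDist

# Hypothetical sample (not from the slides): 540 of 1000 respondents say "yes"
successes, n = 540, 1000
confidence = 0.95

point_estimate = successes / n
# Critical value: the z-score that leaves (1 - confidence) / 2 in each tail
critical_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
# Standard error of a sample proportion
standard_error = math.sqrt(point_estimate * (1 - point_estimate) / n)

margin_of_error = critical_value * standard_error
lower = point_estimate - margin_of_error
upper = point_estimate + margin_of_error
print(f"{point_estimate:.3f} ± {margin_of_error:.3f} gives ({lower:.3f}, {upper:.3f})")
```

A larger critical value (higher confidence) or a larger standard error (smaller sample) widens the interval.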

3 of 43

Inference Reminders, Continued

  •  

4 of 43

Where We Are and Where We Are Going…

| Inference On… | Covered? |
| --- | --- |
| One Numerical Variable | ✔️ |
| One Binary Categorical Variable | ✔️ |
| Associations Between a Numerical Variable and a Binary Categorical Variable | ✔️ |
| Associations Between Two Binary Categorical Variables | ✔️ |
| One MultiClass Categorical Variable | Today |
| Associations Between Two MultiClass Categorical Variables | Today |
| Associations Between One Numerical Variable and One MultiClass Categorical Variable | |
| Associations Between Two Numerical Variables | |

5 of 43

Reminder: Inference on a Single Categorical Variable

We’ve been focused on binary (two-class) categorical variables

The single-variable questions we’ve asked are of the form:

    • Can we estimate the population proportion?
      • For example, with 95% confidence, what is the proportion of likely voters in New Hampshire who are planning to vote in favor of Amendment 1?
    • Is the population proportion greater/less/different than some proposed value?
      • For example, is the proportion of likely voters in New Hampshire who favor Amendment 1 at least 66.67%?

But what if we were interested in categorical variables that have more than just two levels?

Are ideological alignments of voting-aged citizens in the US uniformly distributed across the categories very liberal, liberal, moderate, conservative, and very conservative?

6 of 43

Reminder: Inference on Associations Between Categorical Variables

Our multivariable questions have been of the form:

    • Can we estimate the difference in population proportions between Group A and Group B?
      • For example, find a 90% confidence interval for the difference in the proportion of students who feel a sense of belonging at their university between first-year students and seniors.
    • Is the population proportion in Group A greater/less/different than the population proportion in Group B?
      • For example, is the proportion of students who feel a sense of belonging at their university greater for seniors than for first-year students?

What about associations between categorical variables where at least one has three or more levels?

Is there an association between an individual’s ideology and their perception of the state of their finances (better off, worse off, or about the same) relative to four years ago?

7 of 43

Outline of What’s to Come

  • Analysing the form of a test statistic
  • The need for a different test statistic
  • The need for a new probability distribution
  • Chi-Squared Tests for Goodness of Fit (inference on a single, multiclass categorical variable)
    • A Completed Example
  • Chi-Squared Tests for Independence (inference on associations between two potentially multiclass categorical variables)
    • A Probability Review
    • A Completed Example
  • Additional Examples

8 of 43

A Closer Look at a Test Statistic

  •  

9 of 43

A New Test Statistic

With binary categorical variables, measuring the proportion associated with just a single category was sufficient – if we know the proportion associated with one outcome, we also know the proportion associated with the other

For example, if we surveyed 100 voters in the US and asked them if they identify more closely with a liberal ideology or a conservative ideology and the results were

| Liberal | Conservative |
| --- | --- |
| 54 | ? |

Then you know what the full table looks like…

| Liberal | Conservative |
| --- | --- |
| 54 | 46 |

10 of 43

A New Test Statistic

With multiclass categorical variables, this is not the case

Consider a random sample of 500 individuals

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| ? | ? | ? | ? | ? |

11 of 43

A New Test Statistic

With multiclass categorical variables, this is not the case

Consider a random sample of 500 individuals

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | ? | ? | ? | ? |

12 of 43

A New Test Statistic

With multiclass categorical variables, this is not the case

Consider a random sample of 500 individuals

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | 105 | ? | ? | ? |

13 of 43

A New Test Statistic

With multiclass categorical variables, this is not the case

Consider a random sample of 500 individuals

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | 105 | 175 | ? | ? |

14 of 43

A New Test Statistic

With multiclass categorical variables, this is not the case

Consider a random sample of 500 individuals

Only at this point do you know what the remaining part of the table looks like

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | 105 | 175 | 140 | ? |

Note: We were free to fill in every group count except the last. Once the other four counts were known, the final count was forced.
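The idea in this note can be sketched in a few lines: with the sample size fixed at 500, the last count is whatever is left over, and the number of counts we were free to choose is one less than the number of groups. The variable names are illustrative only.

```python
sample_size = 500

# Counts we were free to fill in (taken from the table above)
free_counts = {"Very Liberal": 35, "Liberal": 105, "Moderate": 175, "Conservative": 140}

# The last category has no freedom left: it must bring the total up to 500
very_conservative = sample_size - sum(free_counts.values())

# Number of freely chosen counts = number of groups - 1
degrees_of_freedom = len(free_counts)

print(very_conservative, degrees_of_freedom)
```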

15 of 43

A New Test Statistic

Question (Reminder): Are political ideologies uniformly distributed within the United States?

If we collected data from a sample of 500 citizens and political ideology was uniformly distributed, then we would expect:

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 100 | 100 | 100 | 100 | 100 |

But what if we observed:

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | 105 | 175 | 140 | 45 |

16 of 43

A New Test Statistic

Expected results from a random sample of 500 people in the US

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 100 | 100 | 100 | 100 | 100 |

Observed results from our random sample of 500 people in the US

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | 105 | 175 | 140 | 45 |

Comparing a single observed value to a single expected value is no longer enough to describe our scenario. We could calculate the difference observed – expected for each group, though.

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 – 100 = -65 | 105 – 100 = 5 | 175 – 100 = 75 | 140 – 100 = 40 | 45 – 100 = -55 |

17 of 43

A New Test Statistic

  •  

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| -65 | 5 | 75 | 40 | -55 |
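One way to turn the table of differences into a single number is sketched below. Adding the raw differences is not useful, because the positive and negative values cancel. The chi-squared statistic instead squares each difference, divides by the corresponding expected count, and sums the results: χ² = Σ (observed – expected)² / expected. The code uses the counts from the table; the variable names are illustrative.

```python
observed = [35, 105, 175, 140, 45]
n = sum(observed)

# Under a uniform distribution each of the 5 groups expects the same share
expected = [n / len(observed)] * len(observed)

differences = [o - e for o, e in zip(observed, expected)]
# The raw differences cancel each other out, so their sum carries no information
print(sum(differences))

# Squaring removes the signs; dividing by the expected count puts each
# group's discrepancy on a comparable scale
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)
```

Larger values of the statistic indicate observed counts that sit further from what the claimed distribution predicts.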

 

18 of 43

The Chi-Squared Distribution

This new test statistic doesn’t follow a normal distribution. It instead follows a Chi-Squared distribution.

Actually, the Chi-Squared distribution isn’t a single distribution – it is a family of distributions defined by a single parameter: degrees of freedom.

We’ll talk more about degrees of freedom when we get to the actual tests but, for now, here are a few Chi-Squared distributions.

The Chi-Squared distributions are defined over non-negative numbers only, are right-skewed, and [for our course] we’ll only ever be interested in the area in the right tail.
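To make the right-tail areas concrete, here is a minimal sketch of how such an area can be computed for whole-number degrees of freedom. The function name `chi2_right_tail` is hypothetical, and the closed-form expressions it uses are standard identities for integer degrees of freedom rather than anything taken from these slides. In practice, a spreadsheet function such as Excel's CHISQ.DIST.RT does the same job.

```python
import math

def chi2_right_tail(x, df):
    """Area to the right of x under a chi-squared curve with integer df >= 1."""
    if x <= 0:
        return 1.0  # the whole distribution lies at or above 0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: a normal-tail piece plus a finite correction sum
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total

# Each of these cutoffs leaves roughly 5% of the area in the right tail
for df, cutoff in [(1, 3.841), (3, 7.815), (4, 9.488)]:
    print(df, cutoff, round(chi2_right_tail(cutoff, df), 3))
```

Notice that the cutoff needed to leave the same right-tail area grows with the degrees of freedom, which is why the degrees of freedom must be identified before a test statistic can be judged.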

 

19 of 43

A Completed Example: Chi-Squared Goodness of Fit

  •  

| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- |
| 35 | 105 | 175 | 140 | 45 |

 

20 of 43

A Completed Example: Chi-Squared Goodness of Fit

Scenario: We wonder if political ideologies (very liberal, liberal, moderate, conservative, or very conservative) are uniformly distributed. We collect data from a random sample of 500 individuals and observe the following results

Recall the table of differences from earlier

| | Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
| --- | --- | --- | --- | --- | --- |
| Observed | 35 | 105 | 175 | 140 | 45 |
| Expected | 100 | 100 | 100 | 100 | 100 |
| Difference | -65 | 5 | 75 | 40 | -55 |

 

 

 

df = groups - 1
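The full goodness-of-fit calculation for this scenario can be sketched as follows. Two things here are assumptions rather than content from the slides: the 5% significance level, and the p-value shortcut, which is a closed form valid only for 4 degrees of freedom. A general tool such as Excel's CHISQ.DIST.RT would replace that line for other cases.

```python
import math

observed = [35, 105, 175, 140, 45]
expected = [100, 100, 100, 100, 100]   # uniform distribution over 5 groups, n = 500

# Test statistic: sum of (observed - expected)^2 / expected
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = groups - 1
df = len(observed) - 1

# Right-tail area under the chi-squared curve.
# Closed form for df = 4 ONLY: exp(-x/2) * (1 + x/2)
half = chi_squared / 2
p_value = math.exp(-half) * (1 + half)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(chi_squared, df, p_value, reject_null)
```

With a p-value this far below the significance level, the sample gives strong evidence that ideologies are not uniformly distributed across the five categories.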

21 of 43

A Completed Example: Chi-Squared Goodness of Fit (Alternatively)

Scenario: We wonder if political ideologies (very liberal, liberal, moderate, conservative, or very conservative) are uniformly distributed. We collect data from a random sample of 500 individuals and observe the following results

 

 

 

Do it this way!

observed

expected

22 of 43

Recap: Chi-Squared Goodness of Fit

We can now test whether the levels of a categorical variable follow a particular distribution in a population

Here, we tested for a uniform distribution, but we can test any generic distribution we like – the choice just influences how we calculate our expected counts

23 of 43

Recap: Chi-Squared Goodness of Fit

  •  

24 of 43

Chi-Squared Tests for Independence

  •  

25 of 43

Reminder on Two-Way Tables

A two-way table summarizes data observed with respect to two categorical variables – its rows correspond to the levels of one of the categorical variables and its columns correspond to the levels of the other.

Consider the two-way table below which shows species of bird and the type of feeder they were observed feeding from.

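| | Seed Feeder | Nectar Feeder | Fruit Feeder | Total |
| --- | --- | --- | --- | --- |
| Finch | 60 | 15 | 10 | 85 |
| Hummingbird | 5 | 70 | 5 | 80 |
| Oriole | 20 | 5 | 60 | 85 |
| Woodpecker | 30 | 0 | 25 | 55 |
| Total | 115 | 90 | 100 | 305 |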
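As a small sketch of how this table can be held as data, the snippet below stores the interior counts by row and recovers the row totals, column totals, and overall total shown in the margins. The variable names are illustrative only.

```python
feeders = ["Seed Feeder", "Nectar Feeder", "Fruit Feeder"]

# Interior cells of the two-way table: one list of counts per bird species
counts = {
    "Finch":       [60, 15, 10],
    "Hummingbird": [5, 70, 5],
    "Oriole":      [20, 5, 60],
    "Woodpecker":  [30, 0, 25],
}

# Margins of the table
row_totals = {bird: sum(row) for bird, row in counts.items()}
column_totals = [sum(column) for column in zip(*counts.values())]
grand_total = sum(row_totals.values())

print(row_totals)
print(dict(zip(feeders, column_totals)))
print(grand_total)
```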

32 of 43

Calculating Expected Counts

We just saw that we can calculate expected counts by multiplying the corresponding row and column totals together and dividing by the overall total

That’s a job for Excel
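For readers working outside a spreadsheet, the same rule (row total times column total, divided by the overall total) can be sketched in Python using the bird and feeder counts from the two-way table. This is an illustrative alternative, not the course's prescribed workflow.

```python
# Observed counts: rows are Finch, Hummingbird, Oriole, Woodpecker;
# columns are Seed, Nectar, Fruit feeders
observed = [
    [60, 15, 10],
    [5, 70, 5],
    [20, 5, 60],
    [30, 0, 25],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell if species and feeder type were independent
expected = [[r * c / total for c in column_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```

The expected counts keep the same row and column totals as the observed table; only the interior cells are redistributed.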

34 of 43

Running the Test for Independence

  •  
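A minimal sketch of the whole test for the bird and feeder table follows, with the null hypothesis that species and feeder type are independent. The 5% significance level is an assumption, and the p-value line is a closed form that holds only for 6 degrees of freedom; Excel's CHISQ.TEST or CHISQ.DIST.RT would normally handle that step.

```python
import math

observed = [
    [60, 15, 10],   # Finch
    [5, 70, 5],     # Hummingbird
    [20, 5, 60],    # Oriole
    [30, 0, 25],    # Woodpecker
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
total = sum(row_totals)

# Expected counts under independence
expected = [[r * c / total for c in column_totals] for r in row_totals]

# Test statistic: sum over every cell of (observed - expected)^2 / expected
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# df = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail area, closed form for df = 6 ONLY: exp(-h) * (1 + h + h^2 / 2), h = x / 2
h = chi_squared / 2
p_value = math.exp(-h) * (1 + h + h ** 2 / 2)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(round(chi_squared, 2), df, p_value, reject_null)
```

Because the p-value falls far below the significance level, the data give strong evidence of an association between bird species and feeder type.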

35 of 43

Running the Test for Independence

 

 

36 of 43

Determining a Conclusion

 

 

37 of 43

Examples

The following slides contain additional example scenarios for you to try out

Write out the relevant hypotheses

Use Excel to conduct the test

Make and justify a conclusion in the context of the scenario

38 of 43

Example: Streaming Service Popularity

Scenario: A media company expects certain subscription preferences among viewers for various streaming services: Netflix, Hulu, Amazon Prime, and Disney+, with expected preferences of 40%, 25%, 20%, and 15%, respectively. A sample of 500 viewers reveals the following counts:

Conduct a test at the 5% level of significance to determine whether the sample provides evidence against this expected distribution.

| Streaming Service | Observed Count |
| --- | --- |
| Netflix | 210 |
| Hulu | 130 |
| Amazon Prime | 100 |
| Disney+ | 60 |
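One way to check your work on this example is sketched below. The expected counts come from applying the stated percentages to the sample of 500. The p-value line is a closed form that is valid only for 3 degrees of freedom; it stands in for a tool such as Excel's CHISQ.DIST.RT.

```python
import math

services = ["Netflix", "Hulu", "Amazon Prime", "Disney+"]
observed = [210, 130, 100, 60]
expected_percent = [40, 25, 20, 15]

n = sum(observed)
# A non-uniform claim: each expected count is that group's share of the sample
expected = [n * pct / 100 for pct in expected_percent]

chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Right-tail area, closed form for df = 3 ONLY
half = chi_squared / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * chi_squared / math.pi) * math.exp(-half)

alpha = 0.05  # significance level stated in the scenario
reject_null = p_value < alpha
print(expected, round(chi_squared, 3), df, round(p_value, 3), reject_null)
```

Here the p-value exceeds 0.05, so this sample does not provide significant evidence against the company's expected distribution.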

39 of 43

Example: Department and Work Arrangements

Scenario: A company wants to investigate if employees’ work arrangement preferences (In-office, Hybrid, or Remote) vary by department (IT, Marketing, HR). A survey of 300 employees yields the following results:

Conduct a test at the 1% level of significance to determine whether this data provides evidence to suggest that work arrangements are different across departments.

| | In-office | Hybrid | Remote | Total |
| --- | --- | --- | --- | --- |
| IT | 30 | 60 | 40 | 130 |
| Marketing | 25 | 35 | 30 | 90 |
| HR | 20 | 25 | 35 | 80 |
| Total | 75 | 120 | 105 | 300 |
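A sketch for checking your work on this example follows. It computes the expected counts from the table margins, then the test statistic and degrees of freedom. The p-value shortcut is a closed form that applies only when there are 4 degrees of freedom, as there are here; it is an illustrative stand-in for Excel's CHISQ.TEST.

```python
import math

observed = [
    [30, 60, 40],   # IT: In-office, Hybrid, Remote
    [25, 35, 30],   # Marketing
    [20, 25, 35],   # HR
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
total = sum(row_totals)

# Expected counts if work arrangement were independent of department
expected = [[r * c / total for c in column_totals] for r in row_totals]

chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail area, closed form for df = 4 ONLY: exp(-x/2) * (1 + x/2)
half = chi_squared / 2
p_value = math.exp(-half) * (1 + half)

alpha = 0.01  # significance level stated in the scenario
reject_null = p_value < alpha
print(round(chi_squared, 3), df, round(p_value, 3), reject_null)
```

Since the p-value is well above 0.01, this sample does not provide significant evidence that work arrangement preferences differ across departments.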

40 of 43

Example: Student Major and Preferred Social Media Platform

  •  

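| | Instagram | TikTok | Twitter | LinkedIn | Total |
| --- | --- | --- | --- | --- | --- |
| STEM | 20 | 15 | 30 | 35 | 100 |
| Arts | 15 | 25 | 20 | 5 | 65 |
| Business | 10 | 5 | 15 | 25 | 55 |
| Humanities | 5 | 5 | 10 | 5 | 25 |
| Total | 50 | 50 | 75 | 70 | 245 |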

41 of 43

Example: Voting Method Preferences

Scenario: An electoral commission wants to verify if voter preferences for different voting methods in a city (In-person, Mail-in, Drop-off, Online) match their pre-election forecast based on trends in similar areas. They projected that 50% would prefer in-person voting, 20% mail-in, 15% drop-off, and 15% online. A random sample of 1000 likely voters revealed the following preferences.

Does the sample provide evidence against their projections?

| Voting Method | Observed Count | Expected Percentage |
| --- | --- | --- |
| In-person | 520 | 50% |
| Mail-in | 200 | 20% |
| Drop-off | 150 | 15% |
| Online | 130 | 15% |
| Total | 1,000 | |

42 of 43

Inference: Where We’ve Been and Where We Are Headed

| Inference On… | Covered? |
| --- | --- |
| One Numerical Variable | ✔️ |
| One Binary Categorical Variable | ✔️ |
| Associations Between a Numerical Variable and a Binary Categorical Variable | ✔️ |
| Associations Between Two Binary Categorical Variables | ✔️ |
| One MultiClass Categorical Variable | ✔️ |
| Associations Between Two MultiClass Categorical Variables | ✔️ |
| Associations Between One Numerical Variable and One MultiClass Categorical Variable | |
| Associations Between Two Numerical Variables | |

43 of 43

Next Time…

  • What we’ll be doing…
    • Inference for Associations Between a Numerical Variable and a MultiClass Categorical Variable (ANOVA)
  • How to prepare…
    • Read sections 11.1, 11.2, and 11.4 in our textbook
  • Homework: Complete HW 9 (Chi-Squared Tests for Goodness of Fit and Independence) on MyOpenMath