Chi-Squared Goodness of Fit and Tests for Independence
Tests about Distributions for Categorical Variables with Multiple (Three or More) Classes, and Tests for Associations Between Pairs of Categorical Variables
A Reminder on Inference and Inferential Tools
(point estimate) ± (critical value)⋅(standard error)
Inference Reminders, Continued
Where We Are and Where We Are Going…
Inference On… | Covered? |
One Numerical Variable | ✔️ |
One Binary Categorical Variable | ✔️ |
Associations Between a Numerical Variable and a Binary Categorical Variable | ✔️ |
Associations Between Two Binary Categorical Variables | ✔️ |
One MultiClass Categorical Variable | Today |
Associations Between Two MultiClass Categorical Variables | Today |
Associations Between One Numerical Variable and One MultiClass Categorical Variable | |
Associations Between Two Numerical Variables | |
Reminder: Inference on a Single Categorical Variable
We’ve been focused on binary (two-class) categorical variables
The single-variable questions we’ve asked are of the form:
But what if we were interested in categorical variables that have more than just two levels?
Are ideological alignments of voting-aged citizens in the US uniformly distributed across the categories very liberal, liberal, moderate, conservative, and very conservative?
Reminder: Inference on Associations Between Categorical Variables
Our multivariable questions have been of the form:
What about associations between categorical variables where at least one has three or more levels?
Is there an association between an individual’s ideology and their perception of the state of their finances (better off, worse off, or about the same) relative to four years ago?
Outline of What’s to Come
A Closer Look at a Test Statistic
A New Test Statistic
With binary categorical variables, measuring the proportion associated with just a single category was sufficient – if we know the proportion associated with one outcome, we also know the proportion associated with the other
For example, if we surveyed 100 voters in the US and asked them if they identify more closely with a liberal ideology or a conservative ideology and the results were
Liberal | Conservative |
54 | ? |
Then you know what the full table looks like…
Liberal | Conservative |
54 | 46 |
With multiclass categorical variables, this is not the case
Consider a random sample of 500 individuals
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
? | ? | ? | ? | ? |
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | ? | ? | ? | ? |
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | 105 | ? | ? | ? |
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | 105 | 175 | ? | ? |
Only at this point do you know what the remaining part of the table looks like
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | 105 | 175 | 140 | ? |
Note: We were free to fill in all but the final group count; once the other four counts were set, the final count was forced.
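The bookkeeping behind that note can be sketched in a few lines of Python. The variable names are illustrative and do not come from the slides.

```python
# Sample size and the four counts we were free to fill in.
sample_size = 500
free_counts = {
    "Very Liberal": 35,
    "Liberal": 105,
    "Moderate": 175,
    "Conservative": 140,
}

# The last count is forced: it must bring the total up to the sample size.
forced_count = sample_size - sum(free_counts.values())

# Five groups, one of which is forced, leaves four "free" counts.
num_groups = len(free_counts) + 1
degrees_of_freedom = num_groups - 1

print(forced_count, degrees_of_freedom)
```

This is the idea that the name "degrees of freedom" refers to: the number of counts that can vary before the rest are determined.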
A New Test Statistic
Question (Reminder): Are political ideologies uniformly distributed within the United States?
If we collected data from a sample of 500 citizens and political ideology was uniformly distributed, then we would expect:
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
100 | 100 | 100 | 100 | 100 |
But what if we observed:
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | 105 | 175 | 140 | 45 |
A New Test Statistic
Expected results from a random sample of 500 people in the US
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
100 | 100 | 100 | 100 | 100 |
Observed results from our random sample of 500 people in the US
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | 105 | 175 | 140 | 45 |
Comparing a single observed value to a single expected value is no longer enough to describe our scenario. We could, though, calculate the difference (observed - expected) for each group:
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 - 100 = -65 | 105 - 100 = 5 | 175 - 100 = 75 | 140 - 100 = 40 | 45 - 100 = -55 |
A New Test Statistic
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
-65 | 5 | 75 | 40 | -55 |
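The raw differences cannot simply be added, because they always sum to zero. One standard way to combine them into a single number is Pearson's chi-squared statistic: square each difference, divide by the corresponding expected count, and add the results. A minimal Python sketch, assuming that is the statistic intended here (variable names are illustrative):

```python
observed = [35, 105, 175, 140, 45]
expected = [100, 100, 100, 100, 100]

# The raw differences cancel out, so their sum carries no information.
differences = [o - e for o, e in zip(observed, expected)]
sum_of_differences = sum(differences)

# Squaring keeps positive and negative differences from cancelling;
# dividing by the expected count puts each group on a comparable scale.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_squared = sum(contributions)

print(sum_of_differences)  # raw differences cancel
print(contributions)       # per-group contributions
print(chi_squared)         # the test statistic
```

Larger values of the statistic indicate observed counts that sit further from what the assumed distribution predicts.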
The Chi-Squared Distribution
This new test statistic doesn’t follow a normal distribution. It instead follows a Chi-Squared distribution.
Actually, the Chi-Squared distribution isn’t a single distribution – it is a family of distributions defined by a single parameter…degrees of freedom.
We’ll talk more about degrees of freedom when we get to the actual tests but, for now, here are a few Chi-Squared distributions.
The Chi-Squared distributions are defined over non-negative numbers only, are right-skewed, and [for our course] we’ll only ever be interested in the area in the right tail.
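For reference, the right-tail area can be computed with the Python standard library alone. The helper below, `chi2_right_tail`, is a hypothetical name introduced for illustration. It is a sketch that assumes a positive whole number of degrees of freedom: it starts from the exact tail areas for 1 or 2 degrees of freedom and applies a standard recurrence that steps up two degrees of freedom at a time.

```python
import math

def chi2_right_tail(x, df):
    """Area to the right of x under a chi-squared curve with df degrees of freedom."""
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x must be non-negative")
    half_x = x / 2
    if df % 2 == 0:
        k = 2
        tail = math.exp(-half_x)             # exact tail area for df = 2
    else:
        k = 1
        tail = math.erfc(math.sqrt(half_x))  # exact tail area for df = 1
    # Each step from k to k + 2 degrees of freedom adds one more term.
    while k < df:
        tail += half_x ** (k / 2) * math.exp(-half_x) / math.gamma(k / 2 + 1)
        k += 2
    return tail

# Tail area beyond 9.488 with 4 degrees of freedom (9.488 is the usual
# table value that leaves about 5% in the right tail).
print(chi2_right_tail(9.488, 4))
```

In practice a spreadsheet or statistics package supplies this area; the sketch only shows that it is an ordinary calculation on a right-skewed, non-negative distribution.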
A Completed Example: Chi-Squared Goodness of Fit
Scenario: We wonder if political ideologies (very liberal, liberal, moderate, conservative, or very conservative) are uniformly distributed. We collect data from a random sample of 500 individuals and observe the following results:
Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
35 | 105 | 175 | 140 | 45 |
Recall the table of differences from earlier
| Very Liberal | Liberal | Moderate | Conservative | Very Conservative |
Observed | 35 | 105 | 175 | 140 | 45 |
Expected | 100 | 100 | 100 | 100 | 100 |
Difference | -65 | 5 | 75 | 40 | -55 |
df = groups - 1 = 5 - 1 = 4
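The whole example can be sketched end to end in Python. This assumes the Pearson statistic (the sum of squared differences divided by expected counts) and a 5% significance level, which the slides do not state for this example. The right-tail area uses a closed form that holds only for 4 degrees of freedom.

```python
import math

observed = [35, 105, 175, 140, 45]
n = sum(observed)                  # 500 people
groups = len(observed)             # 5 ideology categories

# Under the uniform null hypothesis, each group expects an equal share.
expected = [n / groups] * groups   # 100 per group

chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = groups - 1

# Closed-form right-tail area, valid for df = 4 only.
p_value = math.exp(-chi_squared / 2) * (1 + chi_squared / 2)

alpha = 0.05  # assumed significance level, not taken from the slides
reject_null = p_value < alpha

print(chi_squared, df, p_value, reject_null)
```

With a statistic this far into the right tail, the p-value is essentially zero, so the sample gives strong evidence that ideologies are not uniformly distributed.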
A Completed Example: Chi-Squared Goodness of Fit (Alternatively)
Scenario: We wonder if political ideologies (very liberal, liberal, moderate, conservative, or very conservative) are uniformly distributed. We collect data from a random sample of 500 individuals and observe the following results
Do it this way!
observed
expected
Recap: Chi-Squared Goodness of Fit
We can now test whether the levels of a categorical variable over a population follow a particular distribution
Here, we tested for a uniform distribution, but we can test any generic distribution we like – the choice just influences how we calculate our expected counts
Chi-Squared Tests for Independence
Reminder on Two-Way Tables
A two-way table is a way to summarize data observed with respect to two categorical variables – its rows correspond to the levels of one of the categorical variables and its columns correspond to levels of the other categorical variable.
Consider the two-way table below which shows species of bird and the type of feeder they were observed feeding from.
| Seed Feeder | Nectar Feeder | Fruit Feeder | Total |
Finch | 60 | 15 | 10 | 85 |
Hummingbird | 5 | 70 | 5 | 80 |
Oriole | 20 | 5 | 60 | 85 |
Woodpecker | 30 | 0 | 25 | 55 |
Total | 115 | 90 | 100 | 305 |
Calculating Expected Counts
We just saw that we can calculate expected counts by multiplying the corresponding row and column totals together and dividing by the overall total
That’s a job for Excel
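The same row-total times column-total calculation can be sketched in Python for the bird and feeder table. The variable names are illustrative; the counts are the ones from the two-way table.

```python
# Observed counts from the bird/feeder two-way table (totals omitted).
observed = {
    "Finch":       [60, 15, 10],
    "Hummingbird": [5, 70, 5],
    "Oriole":      [20, 5, 60],
    "Woodpecker":  [30, 0, 25],
}
feeders = ["Seed", "Nectar", "Fruit"]

row_totals = {bird: sum(counts) for bird, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count = (row total * column total) / overall total.
expected = {
    bird: [row_totals[bird] * col_total / grand_total for col_total in col_totals]
    for bird in observed
}

for bird, counts in expected.items():
    print(bird, [round(c, 2) for c in counts])
```

The expected counts need not be whole numbers, and they add back up to the same row, column, and overall totals as the observed table.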
Running the Test for Independence
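As a rough sketch of how the test could be run on the bird and feeder table outside of Excel: the null hypothesis is that species and feeder type are independent, the statistic sums (observed - expected)² / expected over every cell, and the degrees of freedom are (rows - 1) × (columns - 1). The code below assumes that standard setup. Its closed-form tail area holds only for 6 degrees of freedom.

```python
import math

observed = [
    [60, 15, 10],   # Finch
    [5, 70, 5],     # Hummingbird
    [20, 5, 60],    # Oriole
    [30, 0, 25],    # Woodpecker
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Add up each cell's contribution to the statistic.
chi_squared = 0.0
for i, row in enumerate(observed):
    for j, observed_count in enumerate(row):
        expected_count = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (observed_count - expected_count) ** 2 / expected_count

# df = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form right-tail area, valid for df = 6 only.
half = chi_squared / 2
p_value = math.exp(-half) * (1 + half + half ** 2 / 2)

print(chi_squared, df, p_value)
```

A very small p-value here would count as evidence of an association between bird species and feeder type.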
Determining a Conclusion
Examples
The following slides contain additional example scenarios for you to try out
Write out the relevant hypotheses
Use Excel to conduct the test
Make and justify a conclusion in the context of the scenario
Example: Streaming Service Popularity
Scenario: A media company expects certain subscription preferences among viewers for various streaming services: Netflix, Hulu, Amazon Prime, and Disney+, with expected preferences of 40%, 25%, 20%, and 15%, respectively. A sample of 500 viewers reveals the following counts:
Conduct a test at the 5% level of significance to determine whether the sample provides evidence against this expected distribution.
Streaming Service | Observed Count |
Netflix | 210 |
Hulu | 130 |
Amazon Prime | 100 |
Disney+ | 60 |
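One way to check your work on this example is sketched below in Python. The expected counts come from multiplying the sample size by each claimed percentage rather than from an equal split. The critical value 7.815 is the standard chi-squared table value for 3 degrees of freedom with 5% in the right tail. Variable names are illustrative.

```python
observed = {"Netflix": 210, "Hulu": 130, "Amazon Prime": 100, "Disney+": 60}
claimed_share = {"Netflix": 0.40, "Hulu": 0.25, "Amazon Prime": 0.20, "Disney+": 0.15}

n = sum(observed.values())  # 500 viewers

# Expected count = sample size * claimed proportion for that service.
expected = {service: n * share for service, share in claimed_share.items()}

chi_squared = sum(
    (observed[s] - expected[s]) ** 2 / expected[s] for s in observed
)
df = len(observed) - 1

critical_value = 7.815  # table value for df = 3, right-tail area 0.05
reject_null = chi_squared > critical_value

print(expected)
print(chi_squared, df, reject_null)
```

Comparing the statistic to a critical value leads to the same decision as comparing a p-value to the significance level.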
Example: Department and Work Arrangements
Scenario: A company wants to investigate if employees’ work arrangement preferences (In-office, Hybrid, or Remote) vary by department (IT, Marketing, HR). A survey of 300 employees yields the following results:
Conduct a test at the 1% level of significance to determine whether this data provides evidence to suggest that work arrangements are different across departments.
| In-office | Hybrid | Remote | Total |
IT | 30 | 60 | 40 | 130 |
Marketing | 25 | 35 | 30 | 90 |
HR | 20 | 25 | 35 | 80 |
Total | 75 | 120 | 105 | 300 |
Example: Student Major and Preferred Social Media Platform
| TikTok | Total | |||
STEM | 20 | 15 | 30 | 35 | 100 |
Arts | 15 | 25 | 20 | 5 | 65 |
Business | 10 | 5 | 15 | 25 | 55 |
Humanities | 5 | 5 | 10 | 5 | 25 |
Total | 50 | 50 | 75 | 70 | 245 |
Example: Voting Method Preferences
Scenario: An electoral commission wants to verify if voter preferences for different voting methods in a city (In-person, Mail-in, Drop-off, Online) match their pre-election forecast based on trends in similar areas. They projected that 50% would prefer in-person voting, 20% mail-in, 15% drop-off, and 15% online. A random sample of 1000 likely voters revealed the following preferences.
Does the sample provide evidence against their projections?
Voting Method | Observed Count | Expected Percentage |
In-person | 520 | 50% |
Mail-in | 200 | 20% |
Drop-off | 150 | 15% |
Online | 130 | 15% |
Total | 1,000 | |
Inference: Where We’ve Been and Where We Are Headed
Inference On… | Covered? |
One Numerical Variable | ✔️ |
One Binary Categorical Variable | ✔️ |
Associations Between a Numerical Variable and a Binary Categorical Variable | ✔️ |
Associations Between Two Binary Categorical Variables | ✔️ |
One MultiClass Categorical Variable | ✔️ |
Associations Between Two MultiClass Categorical Variables | ✔️ |
Associations Between One Numerical Variable and One MultiClass Categorical Variable | |
Associations Between Two Numerical Variables | |
Next Time…