1 of 22

Lecture 17

Comparing Distributions

DATA 8

Spring 2023

2 of 22

Announcements

Homework 6 due Wednesday

Turn in on Tuesday for a bonus point

Lab 6 due Friday
Swupnil’s OH today @ FSM, 4-6pm

3 of 22

Weekly Goals

Today

Comparing distributions
Hypothesis tests

Wednesday

Making decisions when visualizations don’t suffice
Comparing numerical data

Friday

A/B testing
Permutation Test

4 of 22

Two Viewpoints

5 of 22

Model and Alternative

Jury selection:

Model: The people on the jury panels were selected at random from the eligible population
Alternative viewpoint: No, they were biased against black men

Genetics:

Model: Each plant has a 75% chance of having purple flowers
Alternative viewpoint: No, it doesn’t

6 of 22

Steps in Assessing a Model

Choose a statistic to measure “discrepancy” between model and data
Simulate the statistic under the model’s assumptions
Compare the data to the model’s predictions:

Draw a histogram of simulated values of the statistic
Compute the observed statistic from the real sample

If the observed statistic is far from the histogram, that is evidence against the model

7 of 22

Discussion Questions

In each of (a) and (b), choose a statistic that will help you decide between the two viewpoints.

Data: the results of 400 tosses of a coin

(a)

“This coin is fair.”
“No, it’s biased towards heads.”

(b)

“This coin is fair.”
“No, it’s not.”

8 of 22

“Fair”

For both (a) and (b),

The percent of heads in the 400 tosses is a good starting point, but might need adjustment

A percent of heads around 50% suggests “fair”

9 of 22

Answers

(a) Large values of the percent of heads suggest “biased towards heads”

Statistic: percent of heads

(b) Very large or very small values of the percent of heads suggest “not fair.”

The distance between percent of heads and 50% is the key
Statistic: | percent of heads − 50% |
Large values of the statistic suggest “not fair”

10 of 22

Comparing Distributions

11 of 22

Jury Selection in Alameda County

12 of 22

Jury Panels

Section 197 of California's Code of Civil Procedure says, "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."

Eligible jurors in a County

Jury

List of eligible residents

Jury panel

(Demo)

13 of 22

A New Statistic

14 of 22

Distance Between Distributions

People on the panels are of multiple ethnicities
Distribution of ethnicities is categorical

To see whether the distribution of ethnicities of the panels is “close” to that of the eligible jurors, we have to measure the “distance” between two categorical distributions

(Demo)

15 of 22

Total Variation Distance

Every distance has a computational recipe

Total Variation Distance (TVD):

For each category, compute the difference in proportions between two distributions
Take the absolute value of each difference
Sum, and then divide the sum by 2

(Demo)

16 of 22

Summary of the Method

To assess whether a sample was drawn randomly from a known categorical distribution:

Use TVD as the statistic because it measures the distance between categorical distributions
Sample at random from the population and compute the TVD from the random sample; repeat numerous times
Compare:

Empirical distribution of simulated TVDs
Actual TVD from the sample in the study

17 of 22

Testing Hypotheses

18 of 22

Testing Hypotheses

A test chooses between two views of how data were generated

The views are called hypotheses

The test picks the hypothesis that is better supported by the observed data

19 of 22

Null and Alternative

The method only works if we can simulate data under one of the hypotheses.

Null hypothesis

A well defined chance model about how the data were generated
We can simulate data under the assumptions of this model – “under the null hypothesis”

Alternative hypothesis

A different view about the origin of the data

20 of 22

Test Statistic

The statistic that we choose to simulate, to decide between the two hypotheses

Questions before choosing the statistic:

What values of the statistic will make us lean towards the null hypothesis?

What values will make us lean towards the alternative?

Preferably, the answer should be just “high”. Try to avoid “both high and low”.

21 of 22

Prediction Under the Null Hypothesis

Simulate the test statistic under the null hypothesis; draw the histogram of the simulated values
This displays the empirical distribution of the statistic under the null hypothesis
It is a prediction about the statistic, made by the null hypothesis

It shows all the likely values of the statistic
Also how likely they are (if the null hypothesis is true)

The probabilities are approximate, because we can’t generate all the possible random samples

22 of 22

Conclusion of the Test

Resolve choice between null and alternative hypotheses

Compare the observed test statistic and its empirical distribution under the null hypothesis
If the observed value is not consistent with the distribution, then the test favors the alternative (“data is consistent with the alternative”)

Whether a value is consistent with a distribution:

A visualization may be sufficient
If not, there are conventions about “consistency”