1 of 32

Statistics, Part II

CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LECTURE 10

PSTAT 100: Spring 2024

Instructor: Ethan P. Marzban

2 of 32

AGENDA

  • Continue our discussion of confidence intervals
  • Introduce the framework of hypothesis testing

3 of 32

Quick Recap


[Diagram: Population (µ, σ, m, etc.) —DRAW→ Sample (X1, X2, …, Xn) → Observed Instances (x1, x2, …, xn), with an INFER arrow running from the sample back to the population.]

  • We have a population, governed by a set of population parameters that are unobserved (but that we’d like to make claims about)
  • To make claims about the population parameters, we take a sample.
  • We then use our sample to make inferences (i.e. claims) about the population parameters.
  • Inference could either entail estimation (in which we seek to estimate the value of a population parameter) or hypothesis testing.

4 of 32

Estimation Terminology

[Diagram: Sample (X1, X2, …, Xn) → Observed Sample Values (x1, x2, …, xn)]

  • Estimator: θ̂n := g(X1, …, Xn), a function of the random sample
  • Estimate: θ̂n(x1, …, xn), the observed value of the estimator

  • The distribution of an estimator is called its sampling distribution.
  • An unbiased estimator of a parameter has its sampling distribution centered about the true population parameter value.
    • For example, the sample mean is an unbiased estimator of the population mean, as was shown on the chalkboard last lecture.
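  • As a quick simulation sketch of this fact (in Python, with hypothetical values µ = 5, σ = 2, and n = 30), the average of many simulated sample means lands essentially on top of µ:

    import numpy as np

    rng = np.random.default_rng(100)   # seeded for reproducibility
    mu, sigma, n = 5.0, 2.0, 30        # hypothetical population parameters
    # 10,000 samples of size n; one sample mean per row
    sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    print(sample_means.mean())         # ≈ 5.0: the sampling distribution is centered at mu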

5 of 32

Confidence Intervals

  • Now, even unbiased estimators are still random. Hence, it is a bit risky to trust a single point estimate when estimating the true value of the population parameter.
  • It would be ideal to take many samples and then average (like we did in our simulation). But this is not always possible; sometimes one sample is all we get.
  • As such, an alternative we can consider is to report an interval (i.e. set of values) that we are somewhat confident covers the true value of the parameter.
    • Such an interval is called a confidence interval.
  • Why use an interval? Well, here’s a nice quote from a pretty good introductory statistics textbook (OpenIntro Statistics):

Using only a point estimate is like fishing in a murky lake with a spear. We can throw a spear where we saw a fish, but we will probably miss. On the other hand, if we toss a net in that area, we have a good chance of catching the fish.

(page 181)

6 of 32

Confidence Intervals

  • As a somewhat concrete example, let’s return to a scenario we discussed last time.
  • Suppose we take a representative sample of 100 cats and observe 3 cats in this sample that have FIV (Feline Immunodeficiency Virus).
    • The value of 3% is an estimate; specifically, it is an observed value of the sample proportion (which we are using as an estimator of the population proportion).
  • Hence, we cannot simply say “the true proportion of cats that have FIV is 3%.”
  • Instead, we’d like to say something like “I am C% confident that the interval [L, U] covers the true value of the population proportion”, for some specified confidence level C.
  • Question: can I give an interval that I am 100% sure covers the true proportion of cats that have FIV?
    • Yup: [0%, 100%]. I’m fully confident that the true proportion of FIV-positive cats in the world is somewhere between 0% and 100%.
    • Okay… but what about, say, [2% , 4%]? Am I still 100% sure this interval covers the true value of p (the population proportion)? No, not really.
  • By the way, I didn’t pull that [2% , 4%] interval completely out of the blue. I found it by taking our observed sample proportion (3%) and adding and subtracting some “padding”.

7 of 32

Confidence Intervals

  • This leads us to the idea of constructing confidence intervals as:

(estimate) ± (margin of error)

  • Now, as illustrated by our silly confidence interval of [0% , 100%], the margin of error should depend on our confidence level.
    • The more confident we want to be that our interval covers the true value of the population parameter, the wider we need to make our interval.
  • But, we should also be aware that our estimate is simply an observed instance of an estimator, which is random.
    • Hence, we should also incorporate the uncertainty due to the randomness of our estimator into the margin of error.
  • An easy way to accomplish both of these things is to make the margin of error the product of a constant (called the confidence coefficient) and the standard deviation of the estimator we are using:

(estimate) ± c * sd(estimator)

8 of 32

Confidence Interval for a Proportion

  • For example, a confidence interval for a population proportion will take the form

p̂ ± c · √(p(1 − p)/n)

  • One glaring problem with this CI (confidence interval) is that it involves the population parameter p. We’ll fix that in a moment.
  • For now, let me outline how we find the value c based on a given confidence level.
  • Say we want to construct a 95% CI; i.e. we want to be 95% certain our interval covers the true value of p.
  • Probabilistically, this means we want to pick c such that the following equation is satisfied:

ℙ( p̂ − c·√(p(1 − p)/n) ≤ p ≤ p̂ + c·√(p(1 − p)/n) ) = 0.95

9 of 32

Confidence Interval for a Proportion

  • With a bit of work, we find that this is equivalent to asserting

ℙ( −c ≤ (p̂ − p) / √(p(1 − p)/n) ≤ c ) = 0.95

  • By the De Moivre–Laplace Theorem, if n is large enough the middle quantity above will be approximately standard normally distributed. As such, we find that c must satisfy

2Φ(c) − 1 = 0.95, i.e. c = Φ⁻¹(0.975) ≈ 1.96

and so our 95% CI for the population proportion takes the form

p̂ ± 1.96 · √(p(1 − p)/n)
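  • To see what “95% confident” means operationally, here is a small simulation sketch (with a hypothetical true p = 0.3 and n = 500, and using the plug-in version of the interval introduced on the next slide): across many repeated samples, roughly 95% of the resulting intervals cover the true p.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(100)
    p, n = 0.3, 500                              # hypothetical true proportion and sample size
    c = norm.ppf(0.975)                          # ≈ 1.96

    p_hat = rng.binomial(n, p, size=10_000) / n  # 10,000 simulated sample proportions
    moe = c * np.sqrt(p_hat * (1 - p_hat) / n)   # plug-in margin of error
    covered = (p_hat - moe <= p) & (p <= p_hat + moe)
    print(covered.mean())                        # close to the advertised 0.95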

10 of 32

Confidence Interval for a Proportion

  • Since p is unknown, we often simply replace p with p̂, the observed instance of the sample proportion:

p̂ ± 1.96 · √(p̂(1 − p̂)/n)

  • Also, for a general confidence level γ, the confidence coefficient c will be given by

c = Φ⁻¹((1 + γ)/2)

  • Example: Suppose we take a representative sample of 100 cats, and observe 3 of them as being FIV positive. Let’s construct a 90% CI together on the board.
  • Interpretation: we are 90% confident that the interval [0.19% , 5.8%] covers the true proportion of FIV-positive cats.
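  • As a sanity check, here is a minimal Python sketch (not from the lecture) that reproduces this interval using scipy’s standard normal quantile function:

    import numpy as np
    from scipy.stats import norm

    p_hat, n, gamma = 0.03, 100, 0.90
    c = norm.ppf((1 + gamma) / 2)               # confidence coefficient, ≈ 1.645
    moe = c * np.sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    print(p_hat - moe, p_hat + moe)             # ≈ (0.0019, 0.0581), i.e. [0.19%, 5.81%]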

11 of 32

Confidence Interval for a Mean

  • As we discussed before, an unbiased estimator of the population mean is the sample mean.

  • It therefore makes sense that our CIs for a population mean will take the form

x̄ ± c · (σ/√n)   or   x̄ ± c · (s/√n)

  • If we use the first form (with the population standard deviation σ), the confidence coefficient will be the appropriate quantile of the standard normal distribution. If we use the second form (with the sample standard deviation s), the confidence coefficient will be the appropriate quantile of the t distribution with (n – 1) degrees of freedom.
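  • Here is a minimal sketch of the second (t-based) form in Python, on a small hypothetical sample:

    import numpy as np
    from scipy.stats import t

    x = np.array([9.8, 10.4, 10.1, 9.6, 10.9, 10.2])  # hypothetical sample
    n, gamma = len(x), 0.95
    c = t.ppf((1 + gamma) / 2, df=n - 1)              # t quantile with n - 1 degrees of freedom
    moe = c * x.std(ddof=1) / np.sqrt(n)              # ddof=1 gives the sample sd s
    print(x.mean() - moe, x.mean() + moe)             # a 95% CI for the population mean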

12 of 32

Hypothesis Testing


13 of 32

Hypothesis Testing

  • Up until now, all of our discussions on statistical inference have surrounded estimation (in which we seek to estimate the value of a population parameter).
  • We saw last lecture, briefly, that another part of inferential statistics is that of hypothesis testing, where rather than seeking to estimate the value of a population parameter we seek to assess the validity of a claim about a given population parameter.
  • Keeping in line with our cat theme, let’s consider cats that are polydactyl (i.e. born with extra toes).
  • According to a Quora post, the average cat has about a 10% chance of being born with polydactyly.
  • Suppose we wish to assess the validity of this claim, using data.


  • Specifically, suppose we take a representative sample of, say, 100 cats and record the proportion of these cats that are polydactyl.
  • If we observed 9 polydactyl cats in our sample of 100, we wouldn’t necessarily question the Quora statistic above.

14 of 32

Hypothesis Testing

  • But if we instead observed 80 polydactyl cats in our sample of 100, we might start to question the Quora statistic that only 10% of cats are born with polydactyly.
  • Similarly, if we observed only 1 polydactyl cat in our sample of 100, we might also start to question the Quora statistic.
  • So, where do we make the cutoff? In other words, how few (or how many) polydactyl cats must we observe in our sample of 100 before we start to question the initial claim?
  • This is the general framework of hypothesis testing.
  • We start off with a pair of competing claims, which we call the null hypothesis and alternative hypothesis.
    • The null hypothesis is usually set to be the “status quo”. For instance, in our polydactyly example, we would set the null hypothesis (denoted H0, and read “H-naught”) to be “10% of cats are polydactyl.”
  • It’s customary to phrase the hypotheses not in words, but in terms of appropriately-defined parameters.

15 of 32

Null and Alternative Hypotheses

  • For example, we can define p to be the proportion of all cats that are polydactyl, and then write the null hypothesis as

H0: p = 0.1

  • For the purposes of this class, our null hypothesis will always be a statement of equality.
  • As previously stated, the alternative hypothesis is some sort of competing claim to the null.
  • For example, if our null hypothesis is p = 0.1, there are (broadly speaking) four possible alternatives:
    • Lower-Tailed: HA: p < 0.1 (“the true proportion of polydactyl cats is less than 10%”)
    • Upper-Tailed: HA: p > 0.1 (“the true proportion of polydactyl cats is greater than 10%”)
    • Two-Sided: HA: p ≠ 0.1 (“the true proportion of polydactyl cats is not equal to 10%”)
    • Simple-vs-Simple: HA: p = 0.2 (“the true proportion of polydactyl cats is 20%”)
  • Now, we would need to pick only one of these as our alternative hypothesis. Which one to pick is context-dependent; it’s typically considered “safer” to adopt a two-sided alternative in the absence of additional information. In certain cases, however, a lower-tailed, upper-tailed, or simple-vs-simple alternative may be more appropriate.

16 of 32

Null and Alternative Hypotheses

  • More generally, we call the value of the parameter specified by the null hypothesis the null value, and denote it p0:

H0: p = p0

  • Our four possible alternative hypotheses can thus be reformulated as
    • Lower-Tailed: HA: p < p0
    • Upper-Tailed: HA: p > p0
    • Two-Sided: HA: p ≠ p0
    • Simple-vs-Simple: HA: p = pA , for some pA ≠ p0
  • Also, note that there should be no overlap between the null and alternative hypotheses; for example, it is incorrect to write a lower-tailed alternative as p ≤ p0.

17 of 32

Null and Alternative Hypotheses

  • If we are testing for a mean:

H0: µ = µ0

  • Our four possible alternative hypotheses can thus be reformulated as
    • Lower-Tailed: HA: µ < µ0
    • Upper-Tailed: HA: µ > µ0
    • Two-Sided: HA: µ ≠ µ0
    • Simple-vs-Simple: HA: µ = µA , for some µA ≠ µ0
  • Again, there should be no overlap between the null and alternative hypotheses.

18 of 32

Test for a Proportion

  • Alright, so that’s how to set up our hypotheses. How do we actually test our hypotheses?
  • To start off with, let’s consider a two-sided test for a proportion:

H0: p = p0

HA: p ≠ p0

  • As previously indicated, our test should involve the sample proportion p̂ in some way.
  • If p̂ is much larger than p0 (e.g. if we observe a sample proportion of 80% polydactyl cats), we’d be tempted to reject the null hypothesis in favor of the alternative.
  • Similarly, if p̂ is much smaller than p0 (e.g. if we observe a sample proportion of 1% polydactyl cats), we’d also be tempted to reject the null hypothesis in favor of the alternative.
  • So, a reasonable test (in words) might be something like:

IF p̂ is far away from p0 (in either the negative or positive direction), REJECT H0 in favor of HA.

19 of 32

Test for a Proportion

  • We can make this a little more mathematical.
    • Specifically, saying “p̂ is far away from p0 (in either the negative or positive direction)” is equivalent to saying |p̂ − p0| is large.
  • Hence, we can reformulate our test as:

IF |p̂ − p0| > c , REJECT H0 in favor of HA.

  • Finally, it’s customary to standardize random quantities as much as possible. Hence, our test will take the form

IF | (p̂ − p0) / √(p0(1 − p0)/n) | > c , REJECT H0 in favor of HA.

22 of 32

Errors

  • The last piece of the puzzle is to figure out what value c should be.
    • To do so, we’ll need a bit more information.
  • Notice that we will either reject the null (in favor of the alternative) or fail to reject the null.
  • On the flipside, the null is either true or false (though we will never be able to tell for certain).
  • This leads to 4 states of the world:

Result of test      | State of H0: True | State of H0: False
--------------------+-------------------+-------------------
Reject H0           | Type I Error      | GOOD
Fail to Rej. H0     | GOOD              | Type II Error

  • Type I Error: Rejecting the null when it was true
  • Type II Error: Failing to reject the null when it was false

23 of 32

Errors

  • Type I and Type II errors are often phrased in the context of the judicial system.
  • The US judicial system is (supposedly) based on the ideal “innocent until proven guilty.”
    • Hence, the null hypothesis can be interpreted as “person is innocent.”
  • Committing a Type I error is akin to convicting an innocent person.
  • Committing a Type II error is akin to letting a guilty person go free.
  • The probability of committing a Type I error is called the level of significance.
  • Returning to our hypothesis testing problem, we select the critical value so as to control the chance of committing a Type I error.
  • So… we should set the level of significance to be 0 then, right?
  • Well, let’s think about things in the judicial context again. One way to decrease the probability of committing a Type I error is to simply convict fewer people.
    • But, this has the (unintended) effect of increasing our chances of letting a guilty person go free!
  • So, a compromise needs to be made.

24 of 32

Level of Significance

  • In practice, we begin by setting the level of significance (which is typically denoted by α) a priori.
    • Some common values for α are 0.01, 0.05, and 0.1, though other values may be desirable based on the context. (For instance, in some medical studies the level of significance can be set to a much higher value, like 0.4 or 0.5)
  • So, let’s say we set our significance level to be α, and see how we can find the critical value of our test.
  • Again, our test says
  • Hence, the probability of committing a Type I error is:
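  • To make this concrete, here is a small simulation sketch (hypothetical setup: p0 = 0.1, n = 100, α = 0.05, with the critical value c derived on the next slide): when the null is true, the test rejects in roughly an α fraction of repeated samples.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(100)
    p0, n, alpha = 0.1, 100, 0.05
    c = norm.ppf(1 - alpha / 2)                    # critical value, ≈ 1.96

    p_hat = rng.binomial(n, p0, size=10_000) / n   # sample proportions simulated under H0
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)  # standardized test statistics
    # empirical Type I error rate: close to alpha (the normal approximation is not exact)
    print(np.mean(np.abs(z) > c))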

25 of 32

Critical Value

  • Hence, c must satisfy the equation

ℙ( | (p̂ − p0) / √(p0(1 − p0)/n) | > c ) = α
  • Now, you can perhaps see why we standardized our random variable: by the De Moivre–Laplace Theorem, the RV above converges (in distribution) to a standard normal!
  • I find it useful to draw a picture:

[Figure: standard normal density curve with the two tails beyond −c and c shaded; total shaded area α.]

  • By construction, the blue shaded area is α (shown in the picture to be 0.05).
  • This means Φ(−c) = α/2, which tells us that

c = −Φ⁻¹(α/2) = Φ⁻¹(1 − α/2)
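  • Numerically, finding c is a single quantile call; a quick sketch for some common α levels:

    from scipy.stats import norm

    for alpha in (0.10, 0.05, 0.01):
        print(alpha, norm.ppf(1 - alpha / 2))  # c ≈ 1.645, 1.960, 2.576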

26 of 32

Two-Sided Test for a Hypothesis

  • So, we’re finally ready to put everything together and formulate our test!
  • Suppose our null and alternative hypotheses are

H0: p = p0

HA: p ≠ p0

  • Further suppose we take an α level of significance.
  • Our test then takes the form:

IF | (p̂ − p0) / √(p0(1 − p0)/n) | > Φ⁻¹(1 − α/2) , REJECT H0 in favor of HA.
  • The absolute value of the standardized sample proportion is what we call the test statistic, and the set of observed values of the test statistic that correspond to a rejection is called the rejection region.
    • Question for you to try on your own: what’s the rejection region for a two-sided test of a proportion?
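  • Here is a minimal sketch of this two-sided test in Python (the function name and inputs are mine, not from the lecture), run on the polydactyly scenarios discussed earlier:

    import numpy as np
    from scipy.stats import norm

    def two_sided_prop_test(p_hat, p0, n, alpha=0.05):
        """Return True if H0: p = p0 is rejected in favor of HA: p ≠ p0."""
        ts = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)  # standardized sample proportion
        c = norm.ppf(1 - alpha / 2)                     # critical value
        return abs(ts) > c

    print(two_sided_prop_test(0.80, 0.1, 100))  # True: 80/100 polydactyl cats → reject H0
    print(two_sided_prop_test(0.09, 0.1, 100))  # False: 9/100 is consistent with p = 0.1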

27 of 32

Demo: Are We Bikers?


Article Source: https://dailynexus.com/2024-02-16/navigating-the-bike-friendly-campus-challenges-and-solutions-at-ucsb/

28 of 32

Summary: Tests for Proportions

  • Two-Sided (HA: p ≠ p0): reject H0 if | (p̂ − p0) / √(p0(1 − p0)/n) | > Φ⁻¹(1 − α/2)
  • Upper-Tailed (HA: p > p0): reject H0 if (p̂ − p0) / √(p0(1 − p0)/n) > Φ⁻¹(1 − α)
  • Lower-Tailed (HA: p < p0): reject H0 if (p̂ − p0) / √(p0(1 − p0)/n) < −Φ⁻¹(1 − α)
  • My advice: draw a picture!

29 of 32

p-Value

  • Instead of phrasing our test in terms of critical values, we can equivalently formulate things in terms of what is known as a p-value.
  • The p-value of an observed value of a test statistic is the probability, under the null, of observing something as or more extreme (in the direction of the alternative) than what was observed.
    • Lower-tailed: ℙ(TS < ts)
    • Upper-tailed: ℙ(TS > ts)
    • Two-sided: ℙ(|TS| > |ts|)
  • A picture is worth a thousand words!
  • We reject the null whenever the p-value is smaller than the level of significance α.
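  • In Python, each of these probabilities is a single call to the (approximately) standard normal CDF; a sketch with a hypothetical observed statistic ts = 1.2:

    from scipy.stats import norm

    ts = 1.2                            # hypothetical observed test statistic
    print(norm.cdf(ts))                 # lower-tailed: P(TS < ts)
    print(1 - norm.cdf(ts))             # upper-tailed: P(TS > ts)
    print(2 * (1 - norm.cdf(abs(ts))))  # two-sided: P(|TS| > |ts|)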

30 of 32

Summary: Tests for Means

  • Two-Sided (HA: µ ≠ µ0): reject H0 if | (X̄ − µ0) / (s/√n) | > t_{n−1, 1−α/2}
  • Upper-Tailed (HA: µ > µ0): reject H0 if (X̄ − µ0) / (s/√n) > t_{n−1, 1−α}
  • Lower-Tailed (HA: µ < µ0): reject H0 if (X̄ − µ0) / (s/√n) < −t_{n−1, 1−α}
    (Here t_{n−1, q} denotes the q-quantile of the t distribution with n − 1 degrees of freedom; if σ is known, use σ in place of s and standard normal quantiles instead.)
  • (Still) my advice: draw a picture!
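  • These three tests are what scipy’s one-sample t-test implements; a minimal sketch on hypothetical data (the alternative argument selects the tail):

    import numpy as np
    from scipy import stats

    x = np.array([9.8, 10.4, 10.1, 9.6, 10.9, 10.2])  # hypothetical sample
    # H0: µ = 10 vs. HA: µ ≠ 10; use 'less' or 'greater' for one-tailed versions
    res = stats.ttest_1samp(x, popmean=10.0, alternative='two-sided')
    print(res.statistic, res.pvalue)                  # reject H0 if pvalue < alpha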

32 of 32

Caution!

  • Though p-values are still used very often throughout statistics, there have been mounting concerns over their use.
  • For one thing, they can be tricky to interpret.
    • For example, see this Wikipedia page on some common misinterpretations and misuses of p-values.
    • Make sure you don’t make any of these mistakes (and yes, I might test you on some of these in the future!)
  • If you go on to take PSTAT 120B/PSTAT 120C, you’ll learn a lot more about p-values (and hypothesis testing at large).
  • For now, let’s pause here and pick up next time with some comparisons across multiple samples.