4 of 33

FYI - Indicators and Sampling

Can we show:

E(Sample Average) = Population Average
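
As a quick sanity check of the claim above, here is a minimal sketch that enumerates every possible sample from a small box and averages the sample averages exactly. The box contents and the sample size are hypothetical, chosen only for illustration, and the draws are assumed to be made with replacement.

```python
from fractions import Fraction
from itertools import product

# Hypothetical box of tickets (an assumption for illustration only).
box = [1, 2, 2, 5]
n = 3  # sample size, drawing with replacement

box_average = Fraction(sum(box), len(box))

# Every ordered sample of size n is equally likely, so the expectation of the
# sample average is the plain average over all len(box)**n possible samples.
sample_averages = [Fraction(sum(s), n) for s in product(box, repeat=n)]
expected_sample_average = sum(sample_averages) / len(sample_averages)

print("box average:            ", box_average)
print("E(sample average), exact:", expected_sample_average)
```

Exact fractions are used so that the equality holds without floating-point rounding.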

5 of 33

Sample average is more robust than the histogram

7 of 33

Implications of CLT - Automation!!!

For any box (random variable) with finite SE, the sample average will approximately follow a Normal distribution when the sample size is large, with two parameters:

Location: expectation(Y)
Spread/Uncertainty/Variance: SE(Y)^2 / n

The spread is sometimes referred to as the “scale”, in which case it is SE(Y) / √n
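
The two parameters can be checked with a small simulation. This is only a sketch: the box, the sample size, the number of repetitions, and the seed are all hypothetical choices, and the draws are made with replacement.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

box = [0, 0, 1]  # hypothetical box (assumption)
n = 100          # sample size
reps = 5000      # number of simulated samples

mu = statistics.fmean(box)     # expectation(Y)
se_y = statistics.pstdev(box)  # SE(Y): the SD of the tickets in the box

# Draw many samples with replacement and record each sample average.
averages = [statistics.fmean(random.choices(box, k=n)) for _ in range(reps)]

print("location:", mu, "vs simulated", statistics.fmean(averages))
print("scale:   ", se_y / sqrt(n), "vs simulated", statistics.pstdev(averages))
```

The simulated values should land close to expectation(Y) and SE(Y) / √n, but they will not match exactly because the simulation is itself random.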

8 of 33

Tuning the Normal distribution

9 of 33

One subtle issue with continuous random variables

If Y ~ Normal(𝜇, 𝜎²)

The probability of equaling exactly one value is 0, i.e.

P(Y = k) = 0

It’s more natural to talk about an interval of values, e.g.

P(|Y| > 3)
P(-2 ≤ Y ≤ 3)
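
Interval probabilities like these come from differences of the cumulative distribution function. A minimal sketch, assuming for illustration that Y is standard Normal (𝜇 = 0, 𝜎 = 1; the slide leaves both unspecified):

```python
from statistics import NormalDist

# Assumption: Y ~ Normal(0, 1). Change mu and sigma for other cases.
Y = NormalDist(mu=0, sigma=1)

# P(|Y| > 3): two symmetric tails.
p_tail = 2 * (1 - Y.cdf(3))

# P(-2 <= Y <= 3): difference of the CDF at the two endpoints.
# For a continuous variable, <= and < give the same probability.
p_interval = Y.cdf(3) - Y.cdf(-2)

print(f"P(|Y| > 3)     = {p_tail:.4f}")
print(f"P(-2 <= Y <= 3) = {p_interval:.4f}")
```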

10 of 33

Motivating example

Census (Reporter)

11 of 33

3 Distributions to ALWAYS think about

Top:

The population
The distribution of the tickets in the box
The distribution of the random variable, Y

Middle:

The sample
The drawn tickets
The data, Y₁, Y₂, …, Y_n

Bottom:

Distribution for Sample Average
Sampling distribution (of the Sample Average, Ȳ)

14 of 33

CLT tells us how far sample average is from the box average

Key:

“Far” = “How likely are we off by k units or more?”
We don’t know the population!

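What is

P(|Sample Avg - Expectation| > 2 * SE(Sample Avg))

if the sample size is large enough?

= P(|Z| > 2 * SE(Z)) where Z ~ Normal(0, SE(Sample Avg)²)

≈ 0.05

So the sample average is within 2 SEs of the box average with probability ≈ 0.95.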

15 of 33

What if we don’t have enough samples? Chebychev!

Key:

“Far” = “How likely are we off by k units or more?”
We don’t know the population!

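What is

P(|Sample Avg - Expectation| > 2 * SE(Sample Avg))

if the sample size is small?

= P(|Z| > 2 * SE(Z)) where the distribution of Z is unknown, E(Z) = 0, and SE(Z) = SE(Sample Avg)

≤ 1/4

In general, Chebyshev gives P(|Z| > k * SE(Z)) ≤ 1/k².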
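
To see how much is given up by not relying on the Normal approximation, the sketch below compares the Chebyshev bound 1/k² with the exact two-sided Normal tail for a few values of k. The function names are hypothetical helpers introduced here for illustration.

```python
from statistics import NormalDist


def chebyshev_bound(k: float) -> float:
    """Upper bound on P(|Z| > k * SE(Z)) for any Z with finite SE."""
    return min(1.0, 1 / k**2)


def normal_tail(k: float) -> float:
    """P(|Z| > k * SE(Z)) when Z is exactly Normal with mean 0."""
    return 2 * (1 - NormalDist().cdf(k))


for k in (1, 2, 3):
    print(f"k={k}: Chebyshev <= {chebyshev_bound(k):.4f}, "
          f"Normal = {normal_tail(k):.4f}")
```

For k = 2 the Chebyshev bound is 0.25 while the Normal tail is about 0.05, so the bound is safe but loose.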

16 of 33

The boilerplate confidence interval calculation for population averages

Center the interval at the sample average, which estimates the population average (E(sample average) = E(Y))
Quantify the expected amount of chance-error (SE)
Trade off the guarantee (confidence level) against the width of the interval (CLT)
Report the final interval
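
The four steps above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the function name and the data are hypothetical, the sample SD is used as a plug-in for the unknown SE(Y), and the choice k = 2 relies on the sample being large enough for the CLT (the tiny sample here only keeps the example short).

```python
import statistics
from math import sqrt


def confidence_interval(sample, k=2.0):
    """Return (low, high) = sample average +/- k * SE(sample average)."""
    n = len(sample)
    center = statistics.fmean(sample)            # step 1: estimate
    # Step 2: chance-error, with the sample SD standing in for SE(Y).
    se_avg = statistics.stdev(sample) / sqrt(n)
    margin = k * se_avg                          # step 3: k sets the guarantee
    return center - margin, center + margin      # step 4: report


# Hypothetical data, for illustration only.
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.4, 5.6, 4.9]
low, high = confidence_interval(sample)
print(f"({low:.2f}, {high:.2f})")
```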

17 of 33

Example

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.

18 of 33

The procedure - check the assumptions + reasonableness

Does this population value matter?

Is there a clear decision from knowing the parameter value? (Yes: the majority wins elections in the US)

Randomly selected sample?

Population definition, sampling frame, etc

There are more complicated sampling schemes (beyond this class)

If not, treat it like a case study

19 of 33

The procedure - obtain the estimate for the population parameter

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.

The sample % (33%) estimates the population % support.

20 of 33

The procedure - quantify chance-error

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.
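
One way to quantify the chance-error for this poll is sketched below. It assumes a 0/1 box (1 = supports candidate A) and plugs the sample proportion in for the unknown population proportion. The finite population correction at the end is an optional refinement added here for comparison; it is an assumption that the draws were made without replacement.

```python
from math import sqrt

n = 400       # sample size
N = 50_000    # population size
p_hat = 0.33  # sample proportion supporting candidate A

# SD of a 0/1 box, using p_hat as a plug-in for the unknown proportion.
se_box = sqrt(p_hat * (1 - p_hat))

# SE of the sample average (here, the sample proportion).
se_avg = se_box / sqrt(n)

# Optional: finite population correction for draws without replacement.
fpc = sqrt((N - n) / (N - 1))
se_avg_fpc = se_avg * fpc

print(f"SE(sample %)           = {se_avg:.4f}")
print(f"with finite pop. corr. = {se_avg_fpc:.4f}")
```

Because the sample is under 1% of the population, the correction barely changes the answer.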

24 of 33

The procedure - determine a “confidence level”

The confidence level is the probability that the interval produced by this procedure will include the population parameter!

The common default is 95%, so k ≈ 2 (more precisely, 1.96)

P(|Y| ≤ k)	Confidence Level
P(|Y| ≤ 1)	0.68
P(|Y| ≤ 1.645)	0.90
P(|Y| ≤ 2)	0.95
P(|Y| ≤ 3)	0.997

Y ~ Normal(0, 1)
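
The table values can be reproduced from the standard Normal CDF. A minimal sketch; `confidence_level` is a hypothetical helper name introduced for illustration:

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, SD 1


def confidence_level(k: float) -> float:
    """P(|Y| <= k) for Y ~ Normal(0, 1)."""
    return 2 * Z.cdf(k) - 1


for k in (1, 1.645, 2, 3):
    print(f"k = {k:<5}  confidence level = {confidence_level(k):.4f}")

# Going the other way: the k that gives exactly 95%.
k_95 = Z.inv_cdf(0.975)
print(f"k for 95% = {k_95:.3f}")
```

The table entries are rounded: k = 2 gives slightly more than 95%, and exactly 95% corresponds to k of about 1.96.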

26 of 33

The procedure - construct the final interval

The X_k% confidence interval is then:

sample average ± k * SE(sample average)

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.
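
For the poll, the final interval can be worked out as in the sketch below. It assumes k = 2 (roughly 95% confidence), a 0/1 box with the sample proportion plugged in for the unknown population proportion, and no finite population correction.

```python
from math import sqrt

n = 400       # sample size
p_hat = 0.33  # sample proportion supporting candidate A
k = 2         # assumed multiplier, roughly 95% confidence

# SE of the sample proportion, using p_hat as a plug-in.
se_avg = sqrt(p_hat * (1 - p_hat) / n)

low = p_hat - k * se_avg
high = p_hat + k * se_avg

print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 28% to 38%, which lies entirely below 50%.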

27 of 33

Interpretation

What is a 95% confidence interval?

The 95% refers to the procedure: before the data are collected, it has a 95% chance of producing an interval that covers the population mean
After data collection, the calculated interval either contains the parameter or does not (100% or 0%)

28 of 33

Some notable choices

Symmetric confidence interval

May not make sense; see the Text on “Confidence Intervals for Restricted Parameters”. TL;DR: just replace the impossible values with the proper bounds.
Symmetric intervals are often the “shortest confidence interval”

The large-sample assumption allows the use of the Normal distribution for the trade-off between k and the confidence level
The choice of k = 2 is often arbitrary
“Usually” margin of error = k * SE
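
The advice on restricted parameters (replace impossible values with the proper bounds) can be sketched as a small clipping step. The helper name and the example numbers are hypothetical; the default bounds assume the parameter is a proportion between 0 and 1.

```python
def clip_interval(low, high, lower_bound=0.0, upper_bound=1.0):
    """Replace impossible endpoint values with the parameter's bounds."""
    return max(low, lower_bound), min(high, upper_bound)


# A symmetric interval for a small proportion can dip below 0.
print(clip_interval(-0.03, 0.11))  # (0.0, 0.11)
```

After clipping, the interval is no longer symmetric around the estimate.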

29 of 33

What happens if we have a small sample?

30 of 33

Simulations of confidence intervals

Figure 26.2 from Text

31 of 33

100 Simulations of varying n and the significance level

Our Box
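
A simulation in this spirit can be sketched as follows. The box contents, sample sizes, values of k, and seed are all hypothetical choices made for illustration; the box used in the lecture is not recoverable from the slide text.

```python
import random
import statistics
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical 0/1 box with 33% ones (assumption).
box = [1] * 33 + [0] * 67
truth = statistics.fmean(box)


def covers(n: int, k: float) -> bool:
    """Draw one sample of size n; does the k-SE interval cover the truth?"""
    sample = random.choices(box, k=n)  # draws with replacement
    center = statistics.fmean(sample)
    se_avg = statistics.pstdev(sample) / sqrt(n)
    return abs(center - truth) <= k * se_avg


def coverage(n: int, k: float, reps: int = 100) -> int:
    """Number of simulated intervals, out of reps, that cover the truth."""
    return sum(covers(n, k) for _ in range(reps))


for n in (25, 100, 400):
    for k in (1, 2):
        print(f"n={n:<4} k={k}: {coverage(n, k)} of 100 intervals cover")
```

The counts vary from run to run; with k = 2 and a large n, roughly 95 of 100 intervals should cover the box average.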

32 of 33

Interpretation

What is a 95% confidence interval?

The 95% refers to the procedure: before the data are collected, it has a 95% chance of producing an interval that covers the population mean
After data collection, the calculated interval either contains the parameter or does not (100% or 0%)
We do not know if we’re lucky or not!
Another view (will make sense soon!):

The interval contains all the values we cannot reject at the (100 - Confidence Level)% significance level

33 of 33

Proposed language

“We estimate the population parameter with an X% confidence interval; our sample suggests (a, b)”