1 of 33

Estimating a population parameter

Wayne Tai Lee

2 of 33

Review

  • Probability
  • Randomness in data and sample averages
  • Central Limit Theorem

3 of 33

Preview

  • Survey statistics estimate population parameters

4 of 33

FYI - Indicators and Sampling

Can we show:

E(Sample Average) = Population Average
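
One way to see this is linearity of expectation. The sketch below assumes the draws Y_1, …, Y_n each have the same distribution as a single ticket drawn from the box; the symbols \bar{Y} (sample average), \mu (population average), and n (sample size) are notation introduced here.

```latex
\begin{align*}
E(\bar{Y}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right)
           = \frac{1}{n}\sum_{i=1}^{n} E(Y_i)
           = \frac{1}{n}\cdot n\,\mu
           = \mu .
\end{align*}
```

Independence of the draws is not needed for this step; it only matters later, when computing the SE.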

5 of 33

Sample average is more robust than the histogram

7 of 33

Implications of CLT - Automation!!!

  • CLT
    • For any box (random variable) with finite SE, the sample average will approximately follow a Normal distribution (for a large enough sample size n) with two parameters:
      • Location: expectation(Y)
      • Spread/Uncertainty/Variance: SE(Y)^2 / n
        • Sometimes referred to as “scale”, in which case it is SE(Y) / √n
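
A small simulation can make the two parameters concrete. This is only a sketch: the box (a fair die), the sample size, and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

box = [1, 2, 3, 4, 5, 6]   # hypothetical box: one ticket per die face
n = 25                     # draws per sample (assumed)
reps = 20_000              # number of simulated samples (assumed)

box_mean = statistics.fmean(box)   # expectation(Y)
box_se = statistics.pstdev(box)    # SE(Y), the SD of the box

# Draw `reps` samples with replacement and record each sample average.
averages = [statistics.fmean(random.choices(box, k=n)) for _ in range(reps)]

sim_location = statistics.fmean(averages)
sim_scale = statistics.pstdev(averages)
clt_scale = box_se / n ** 0.5      # SE(Y) / sqrt(n)

print(f"location: simulated {sim_location:.3f} vs expectation(Y) {box_mean:.3f}")
print(f"scale:    simulated {sim_scale:.3f} vs SE(Y)/sqrt(n)  {clt_scale:.3f}")
```

The simulated location and scale should land close to expectation(Y) and SE(Y) / √n, even though the box itself is far from Normal.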

8 of 33

Tuning the Normal distribution

9 of 33

One subtle issue with continuous random variables

If Y ~ Normal(𝜇, 𝜎²)

The probability of equaling exactly one value is 0, i.e.

P(Y = k) = 0

It’s more natural to talk about an interval of values, e.g.

  • P(|Y| > 3)
  • P(-2 ≤ Y ≤ 3)
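
Interval probabilities like these come from differences of the cumulative distribution function. A minimal sketch using Python's standard library, assuming a standard Normal (𝜇 = 0, 𝜎 = 1) purely for illustration:

```python
from statistics import NormalDist

Y = NormalDist(mu=0.0, sigma=1.0)  # assumed: standard Normal, for illustration

# P(|Y| > 3) = P(Y < -3) + P(Y > 3)
p_outside = Y.cdf(-3) + (1 - Y.cdf(3))

# P(-2 <= Y <= 3) = F(3) - F(-2); including the endpoints changes nothing
# because any single value has probability 0.
p_between = Y.cdf(3) - Y.cdf(-2)

print(f"P(|Y| > 3)      = {p_outside:.4f}")
print(f"P(-2 <= Y <= 3) = {p_between:.4f}")
```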

10 of 33

Motivating example

11 of 33

3 Distributions to ALWAYS think about

Top:

  • The population
  • The distribution of the tickets in the box
  • The distribution of the random variable, Y

Middle:

  • The sample
  • The drawn tickets
  • The data, Y1, Y2, …, Yn

Bottom:

  • Distribution for Sample Average
  • Sampling distribution (of the Sample Average, Ȳ)

14 of 33

CLT tells us how far sample average is from the box average

Key:

  • “Far” = “How likely are we off by k units or more?”
  • We don’t know the population!

What is

P(|Sample Avg - Expectation| > 2 * SE(Sample Avg))

if sample size is large enough

= P(|Z| > 2 * SE(Z)) where Z ~ Normal(0, SE(Sample Avg)²)

≈ 0.05

(so the chance of being within 2 SEs is about 0.95)
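
A quick numerical check of the Normal calculation, showing both the chance of landing more than 2 SEs away and the complementary chance of landing within 2 SEs. The value of `se` is a hypothetical placeholder; the probabilities do not depend on it.

```python
from statistics import NormalDist

se = 1.5  # hypothetical SE(Sample Avg); the answer does not depend on it
Z = NormalDist(mu=0.0, sigma=se)

# P(|Z| > 2 * SE(Z)), using the symmetry of the Normal curve
p_far = 2 * (1 - Z.cdf(2 * se))

print(f"P(|Z| >  2 SE) = {p_far:.4f}")
print(f"P(|Z| <= 2 SE) = {1 - p_far:.4f}")
```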

15 of 33

What if we don’t have enough samples? Chebychev!

Key:

  • “Far” = “How likely are we off by k units or more?”
  • We don’t know the population!

What is

P(|Sample Avg - Expectation| > 2 * SE(Sample Avg))

if sample size is small

= P(|Z| > 2 * SE(Z)) where Z ~ ?, E(Z) = 0, and SE(Z) = SE(Sample Avg)

≤ 1/4
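
The bound of 1/4 comes from Chebyshev's inequality, which holds for any distribution with a finite SE. Stated with a generic multiplier k (a symbol introduced here) and then with k = 2:

```latex
\begin{align*}
P\bigl(|Z - E(Z)| \ge k \cdot SE(Z)\bigr) &\le \frac{1}{k^2}, \\
P\bigl(|Z| \ge 2 \cdot SE(Z)\bigr) &\le \frac{1}{4} \quad \text{when } E(Z) = 0 .
\end{align*}
```

The bound is loose compared with the Normal calculation, which is the price of assuming nothing about the shape of the distribution.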

16 of 33

The boilerplate confidence interval calculation for population averages

  • The interval centers at the sample average as an estimate for the population average (E(sample average) = E(Y))
  • Quantify the expected amount of chance-error (SE)
  • Trade off the guarantee (confidence level) against the width of the interval (CLT)
  • Report the final interval
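
The four steps can be sketched as a short function. This is an illustration under assumptions, not a definitive recipe: the function name `confidence_interval` is hypothetical, the sample SD is plugged in for the unknown box SE, and the data are made up (a real use would need a sample large enough for the CLT to apply).

```python
import statistics

def confidence_interval(sample, k=2.0):
    """Return (low, high) = sample average -/+ k * estimated SE of the average."""
    n = len(sample)
    center = statistics.fmean(sample)        # step 1: estimate of the population average
    sd = statistics.stdev(sample)            # sample SD as a stand-in for the box SE
    se_avg = sd / n ** 0.5                   # step 2: expected chance error of the average
    margin = k * se_avg                      # step 3: larger k, higher confidence, wider interval
    return center - margin, center + margin  # step 4: report the interval

data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]  # made-up data, for illustration only
low, high = confidence_interval(data, k=2.0)
print(f"({low:.2f}, {high:.2f})")
```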

17 of 33

Example

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.

18 of 33

The procedure - check the assumptions + reasonableness

  • Does this population value matter?
    • Is there a clear decision from knowing the parameter value? (YES, majority wins elections in US)
  • Randomly selected sample?
    • Population definition, sampling frame, etc
      • There are more complicated sampling schemes (beyond this class)
    • If not, treat it like a case study

19 of 33

The procedure - obtain the estimate for the population parameter

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.

The sample % (33%) estimates the population % support for candidate A.

20 of 33

The procedure - quantify chance-error

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.
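
One common way to quantify the chance error here, offered as a sketch: treat each resident as a 0/1 ticket (1 = supports A), plug the sample proportion in for the unknown box proportion, and ignore the small correction for drawing 400 out of 50,000 without replacement. The symbols \hat{p} (sample proportion) and n (sample size) are notation introduced here.

```latex
\begin{align*}
SE(\text{sample proportion}) \approx \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
  = \sqrt{\frac{0.33 \times 0.67}{400}}
  = \sqrt{0.00055275}
  \approx 0.0235 ,
\end{align*}
```

that is, about 2.35 percentage points.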

24 of 33

The procedure - determine a “confidence level”

The confidence level is the probability that this procedure will produce an interval that includes the population parameter!

The common default is 95%, so k ≈ 2 (more precisely, 1.96)

For Y ~ Normal(0, 1):

P(|Y| ≤ k)          Confidence Level
P(|Y| ≤ 1)          0.68
P(|Y| ≤ 1.645)      0.90
P(|Y| ≤ 2)          0.95
P(|Y| ≤ 3)          0.997
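
The confidence levels for different k can be reproduced from the standard Normal distribution, and the relationship can be inverted to find k for a desired level. A minimal sketch; the helper names `confidence_level` and `k_for_level` are hypothetical.

```python
from statistics import NormalDist

Y = NormalDist()  # standard Normal(0, 1)

def confidence_level(k):
    """P(|Y| <= k) for Y ~ Normal(0, 1)."""
    return Y.cdf(k) - Y.cdf(-k)

def k_for_level(level):
    """The k with P(|Y| <= k) = level (inverse of confidence_level)."""
    return Y.inv_cdf(0.5 + level / 2)

for k in (1, 1.645, 2, 3):
    print(f"k = {k:<5}  confidence level = {confidence_level(k):.4f}")

print(f"k for a 95% level = {k_for_level(0.95):.3f}")
```

Running it shows that k = 2 gives slightly more than 95%, which is why 1.96 is the more exact multiplier.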

25 of 33

The procedure - construct the final interval

The X_k% confidence interval is then:

sample average ± k * SE(sample average)

An election poll of 400 residents, randomly selected out of 50,000, shows 33% support for candidate A. Please estimate the population % support for candidate A.
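
For the poll, the whole calculation fits in a few lines. This sketch assumes simple random sampling, uses the plug-in SE for a proportion, takes k = 2 for roughly 95% confidence, and ignores the small finite-population correction for 400 out of 50,000.

```python
n = 400        # sample size from the poll
p_hat = 0.33   # sample proportion supporting candidate A
k = 2          # multiplier for roughly 95% confidence (assumed default)

# Plug-in SE of the sample proportion: sqrt(p_hat * (1 - p_hat) / n)
se = (p_hat * (1 - p_hat) / n) ** 0.5

low, high = p_hat - k * se, p_hat + k * se
print(f"SE = {se:.4f}")
print(f"interval = ({100 * low:.1f}%, {100 * high:.1f}%)")
```

The interval runs from roughly 28% to 38%, so it stays below 50% support.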

27 of 33

Interpretation

What is a 95% confidence interval?

  • 95% refers to the procedure: before the data are collected, the procedure has a 95% chance of producing an interval that covers the population mean
  • After data collection, the calculated interval either has the parameter or does not (100% or 0%)

28 of 33

Some notable choices

  • Symmetric confidence interval
    • May not make sense; see the text on “Confidence Intervals for Restricted Parameters”. TL;DR: just replace the impossible values with the proper bounds.
    • Symmetric intervals are often the “shortest confidence interval”
  • The large-sample assumption allows the use of the Normal distribution for the trade-off between k and confidence level
  • The choice of k = 2 is often arbitrary
  • “Usually” margin of error = k * SE
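
The advice to replace impossible values with the proper bounds can be sketched for a proportion, which must lie in [0, 1]. The helper `clip_interval` and the numbers (2 supporters out of 100) are hypothetical, chosen so that the symmetric interval dips below 0.

```python
def clip_interval(low, high, lower_bound=0.0, upper_bound=1.0):
    """Replace impossible endpoint values with the parameter's natural bounds."""
    return max(low, lower_bound), min(high, upper_bound)

# Hypothetical case: 2 supporters out of 100, with k = 2
p_hat, n, k = 0.02, 100, 2
se = (p_hat * (1 - p_hat) / n) ** 0.5     # plug-in SE of the proportion
raw = (p_hat - k * se, p_hat + k * se)    # lower endpoint is negative
clipped = clip_interval(*raw)

print(raw, "->", clipped)
```

Here margin of error = k * SE still holds for the raw interval, but the reported interval is no longer symmetric around the estimate.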

29 of 33

What happens if we have a small sample?

30 of 33

Simulations of confidence intervals

31 of 33

100 Simulations of varying n and the significance level

Our Box
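
A simulation of this kind can be sketched as follows. The box (33% of tickets equal to 1), the sample size, k, and the number of repetitions are assumptions for illustration, not the settings behind the original figure.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

box = [0] * 67 + [1] * 33      # hypothetical box: 33% of tickets are 1
truth = statistics.fmean(box)  # population average, known only in a simulation
n, k, reps = 400, 2, 100       # assumed sample size, multiplier, and repetitions

covered = 0
for _ in range(reps):
    sample = random.choices(box, k=n)          # draw n tickets with replacement
    avg = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5   # estimated SE of the sample average
    if avg - k * se <= truth <= avg + k * se:
        covered += 1

print(f"{covered} of {reps} intervals covered the population average {truth:.2f}")
```

With k = 2, roughly 95 of the 100 intervals are expected to cover the population average; changing `n` or `k` shows how the coverage and the interval widths respond.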

32 of 33

Interpretation

What is a 95% confidence interval?

  • 95% refers to the procedure: before the data are collected, the procedure has a 95% chance of producing an interval that covers the population mean
  • After data collection, the calculated interval either has the parameter or does not (100% or 0%)
  • We do not know if we’re lucky or not!
  • Another view (will make sense soon!):
    • The interval contains all the numbers we cannot reject at (100 - Confidence Level)% Significance Level

33 of 33

Proposed language

“We estimate the population parameter with an X% confidence interval; our sample suggests (a, b)”