1 of 15

Lecture 29

Designing Experiments

DATA 8

Summer 2017

Slides created by John DeNero (denero@berkeley.edu), Ani Adhikari (adhikari@berkeley.edu), Sam Lau (samlau95@berkeley.edu)

2 of 15

Announcements

3 of 15

Discussion Questions

4 of 15

Discussion Question 1

Key fact from last time: SD of the sample mean = population SD / √(sample size)
SD of the sample mean: $20k / √100 = $20k / 10 = $2k
So $14k is 2 SDs above the population mean
About 95% of sample means are within 2 SDs of the population mean
About 2.5% are above that range and about 2.5% are below, so the chance is about 2.5%

Population: Incomes with mean $10k & SD $20k

Sample: 100 chosen uniformly at random with replacement

Question: What's the chance that the sample average is above $14k?
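
A minimal sketch of this calculation in Python, assuming the normal approximation from the CLT applies to the sample average. It uses the standard library's `statistics.NormalDist`; the variable names are illustrative only. The exact normal tail beyond 2 SDs is slightly under the 2.5% given by the "95% within 2 SDs" rule of thumb.

```python
from math import sqrt
from statistics import NormalDist

pop_mean = 10_000      # population mean income, in dollars
pop_sd = 20_000        # population SD, in dollars
sample_size = 100

# SD of the sample mean = population SD / sqrt(sample size)
sd_of_sample_mean = pop_sd / sqrt(sample_size)

# How many SDs of the sample mean is $14k above the population mean?
z = (14_000 - pop_mean) / sd_of_sample_mean

# Area under the standard normal curve to the right of z
chance_above = 1 - NormalDist().cdf(z)

print(sd_of_sample_mean, z, round(chance_above, 4))
```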

5 of 15

Discussion Question 2

Population: A perfect bell shape. Mean 10; SD 20

Sample: 100 chosen uniformly at random with replacement

Question: What's the chance that all are below 50?

50 is 2 population SDs above the population mean
The chance of drawing one value below 50 is about 97.5%
The chance of drawing all 100 below 50 is 0.975 ** 100, or about 8%
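
The arithmetic can be checked with a short Python sketch. The first calculation uses the rounded 97.5% figure from the slide. The second is an added comparison, not part of the lecture: it uses the normal curve directly through `statistics.NormalDist`, which gives a slightly different per-draw chance because 97.5% comes from the "95% within 2 SDs" rule of thumb.

```python
from statistics import NormalDist

# Rule-of-thumb version: each draw is below 50 with chance about 0.975,
# and the 100 draws are independent because sampling is with replacement.
chance_one_below = 0.975
chance_all_below = chance_one_below ** 100

# Comparison using the normal curve with mean 10 and SD 20
exact_one_below = NormalDist(mu=10, sigma=20).cdf(50)
exact_all_below = exact_one_below ** 100

print(round(chance_all_below, 3), round(exact_all_below, 3))
```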

6 of 15

Discussion Question 3

Parameters that are highly influenced by a few values are difficult to estimate using bootstrap resampling
A 100-person sample is not similar enough to the population to estimate a CI with 99.9999% confidence

You want to estimate the height of the tallest person on campus. You sample 100 people at random and compute a 99.9999% confidence interval using the bootstrap. Its upper bound is 6'4".

A 6'5" person walks by! What might have gone wrong?
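
One way to see the problem is a small simulation, sketched here under stated assumptions: the population of heights is made up (normal with mean 67 inches and SD 4 inches, 40,000 people), and `bootstrap_maxes` is a hypothetical helper, not code from the lecture. A resample can only contain values already in the sample, so no bootstrap interval for the maximum can reach above the tallest person sampled.

```python
import random

random.seed(0)

# Hypothetical population of heights in inches (assumed, not from the lecture)
population = [random.gauss(67, 4) for _ in range(40_000)]

# One random sample of 100 people, as in the question
sample = random.sample(population, 100)

def bootstrap_maxes(sample, repetitions=2000):
    """Resample with replacement and record the maximum of each resample."""
    maxes = []
    for _ in range(repetitions):
        resample = random.choices(sample, k=len(sample))
        maxes.append(max(resample))
    return maxes

maxes = bootstrap_maxes(sample)

# Every bootstrap maximum is at most the sample maximum,
# which in turn is at most the true population maximum.
print(max(maxes) <= max(sample), max(sample) <= max(population))
```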

7 of 15

Discussion Question 4

You want to estimate the average compensation for SF public workers.

How many people should you sample at random in order to get a 95% confidence interval with a width of $10000 or less?

(Demo)

8 of 15

Choosing a Sample Size

9 of 15

Width of 95% Confidence Interval

CLT says the distribution of a sample mean is roughly normal, centered at population mean
95% confidence interval:

Center ± 2 SDs of the sample mean

Total width: ~4 SDs of the sample mean
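
Since the SD of the sample mean is the population SD / √(sample size), requiring 4 SDs of the sample mean to be at most a target width can be solved for the sample size. A minimal sketch, assuming some estimate of the population SD is available; the $60,000 figure below is a made-up placeholder, not a value from the lecture, and `sample_size_for_mean` is a hypothetical helper.

```python
from math import ceil

def sample_size_for_mean(sd_estimate, max_width):
    """Smallest n with 4 * sd_estimate / sqrt(n) <= max_width."""
    return ceil((4 * sd_estimate / max_width) ** 2)

# Hypothetical SD of compensation, in dollars (an assumption for illustration)
assumed_sd = 60_000

n = sample_size_for_mean(assumed_sd, max_width=10_000)
print(n)
```

The answer is only as good as the SD estimate that goes in, so the resulting interval is not guaranteed to be as narrow as planned.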

10 of 15

Problems:

We have to take a sample before we can decide how big a sample we need...
And we aren’t guaranteed that our interval will be as narrow as we want.
Can we address these issues?

11 of 15

Attendance

bit.ly/at-d

12 of 15

Discussion Question 5

You want to estimate what percent of voters will vote for Candidate A in an upcoming election.

How many opinions should you sample at random in order to get a 95% confidence interval with a width of 3% or less?

(Demo)

13 of 15

Width of 95% Confidence Interval

CLT says the distribution of a sample proportion is roughly normal, centered at population proportion
95% confidence interval:

Center ± 2 SDs of the sample proportion

Total width: 4 SDs of the sample proportion

= 4 x (SD of 0-1 population)/√(sample size)
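
A short sketch of this width formula in Python. It relies on the fact that a 0-1 population with a proportion p of ones has SD √(p(1 − p)); the function names are illustrative, not from the lecture.

```python
from math import sqrt

def sd_of_01_population(p):
    """SD of a population of 0s and 1s in which a proportion p are 1s."""
    return sqrt(p * (1 - p))

def ci_width(p, sample_size):
    """Approximate total width of a 95% CI for the proportion: 4 SDs."""
    return 4 * sd_of_01_population(p) / sqrt(sample_size)

# Example: if half the population are 1s and we sample 400,
# the interval is about 0.1 wide (10 percentage points).
print(ci_width(0.5, 400))
```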

14 of 15

Control the Width

Suppose you’re OK with the width being up to 3%

4 x (SD of 0-1 population) / √(sample size) ≤ 0.03

√(sample size) ≥ 4 x (SD of 0-1 population) / 0.03

(Demo)

15 of 15

Bound the 0-1 Population SD

√(sample size) ≥ 4 x (SD of 0-1 population)/0.03

SD of 0-1 population ≤ 0.5

√(sample size) ≥ 4 x 0.5 / 0.03 = 66.6666…

sample size ≥ (66.6666…)² = 4444.44…

sample size: 4445
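
The bound and the final answer can be checked numerically. This is a sketch: it scans proportions from 0% to 100% in steps of 1% to confirm that the SD of a 0-1 population peaks at 0.5, then rounds the required sample size up to a whole number. Variable names are illustrative.

```python
from math import ceil, sqrt

# SD of a 0-1 population for each proportion 0.00, 0.01, ..., 1.00
sds = [sqrt((k / 100) * (1 - k / 100)) for k in range(101)]
largest_sd = max(sds)   # reached when the proportion is 0.5

# Worst case: use SD = 0.5 and require total width <= 0.03
max_width = 0.03
min_size = ceil((4 * 0.5 / max_width) ** 2)

# Width actually achieved at that size in the worst case
width_at_min_size = 4 * 0.5 / sqrt(min_size)

print(largest_sd, min_size, width_at_min_size <= max_width)
```

Because 0.5 is the largest possible SD, this sample size works whatever the true proportion turns out to be.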