1 of 28

API CAN CODE: Data Science Practices

Lesson 3.5: Tests and Estimates for One Variable

This work was made possible through generous support from the National Science Foundation (Award # 2141655).

2 of 28

Warmup

Examine the graph to the right.
What kind of graph is this?
What variable(s) do you see represented? Is there more than one?
What do you notice? What do you wonder?
What story does this graph portray?


3 of 28

Lesson 3.4 Recap

We learned about using CODAP to create data visualizations in two variables

We typically use scatter plots to visualize relationships between quantitative variables
We also talked about using two-way dot plots and segmented bar graphs to visualize qualitative relationships


4 of 28

Zillow - ZHVI Data

We are going to examine data taken from Zillow Research’s Housing Data page.

Zillow is a website that you can use to search for properties for sale and find buying & selling resources.
The dataset we have gives a Zillow Home Value Index (ZHVI), which is supposed to reflect the “typical value for homes” in the middle range for a certain city.


5 of 28

Exploratory Practice

Open the Zillow dataset.
What variables are present?
What types of variables are they?
What states are represented?
Generate some graphs you think are interesting.
What do they show?


6 of 28

Comparing State City ZHVIs

If you haven’t already, create a graph with AvgHomeValue on the x-axis and State on the y-axis.
This should create a dotplot for MD and another for VA. What do you notice?
Do you think there are any outliers? Go to the Ruler icon and select “Box Plot,” then go back to the Ruler icon and select “Show Outliers.” Does CODAP identify any outliers? How do you think it’s deciding what is an outlier?
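
One common convention for flagging outliers in a box plot is the 1.5 x IQR rule: anything more than 1.5 interquartile ranges below the first quartile or above the third quartile is marked. Whether CODAP uses exactly this rule is an assumption here, and different tools compute quartiles slightly differently. A minimal Python sketch, using made-up home values rather than the Zillow data:

```python
import statistics

def iqr_outliers(values):
    """Return values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical home values in thousands of dollars (made up for illustration).
home_values = [250, 270, 300, 310, 330, 350, 360, 400, 950]
outliers = iqr_outliers(home_values)
print(outliers)
```

In this made-up sample, only the one very expensive home falls outside the fences.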


7 of 28


8 of 28

Comparing State City ZHVIs

Go to the Ruler icon and add a Mean to your comparative dotplot. Then, go back to the Ruler icon, click on “Measures of Spread,” and select “Standard Deviation.”
How could you use these metrics to help you decide whether Maryland or Virginia is more affordable?
What other information might you want to use to decide which state is more affordable? How would you get it?
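
For readers who want to see the same two metrics outside CODAP, here is a small Python sketch. The numbers are hypothetical (not taken from the Zillow dataset) and were chosen so that both states have the same mean but very different spreads, which is one reason the mean alone may not settle the affordability question.

```python
import statistics

# Hypothetical typical home values in thousands of dollars (made up, not Zillow data).
values_by_state = {
    "MD": [300, 320, 340, 360, 380],
    "VA": [200, 280, 340, 400, 480],
}

summary = {}
for state, values in values_by_state.items():
    summary[state] = {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }
    print(state, summary[state])
```

Here the larger standard deviation for the made-up VA values signals a wider range of prices, including both cheaper and more expensive cities.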


9 of 28

How Much Does a House Cost in DC?

Now, let’s focus on Washington, DC prices. Use this program (remember to clone, rename, and save!) to collect a sample of available properties in DC.

Go to this RapidAPI page that draws in data from Zillow. “Subscribe to Test,” choose the free option, then go back to “Endpoints” – make sure the “/search (Search for properties by neighborhood, city, or ZIP code)” endpoint is selected – and find your API-Key in the code snippets.

Copy your API-Key and paste it in the appropriate place in the program. Then, run the program!


10 of 28

Offline API Backup!

If the Zillow API is down, you can use a pre-loaded program with the same data. (It just won’t be live-updated.)
That program can be found here.


11 of 28

How Much Does a House Cost in DC?

Copy the output from your program, and then open CODAP and select “Create New Document.”

Go to “Tables” and select “-- New from Clipboard --” which will paste in the data you just copied from your program.
Look over the data. How big is your sample?
Create any graphs you want to explore the data.

What do you notice? What do you wonder?


12 of 28

The Role of “Confidence Intervals”

How would you find the average home price in the entire United States?
A census would be ideal, but expensive and time-consuming (you’d need the value of EVERY house in EVERY city!)
A confidence interval lets us start from the mean of a sample and build outward to get a range where we think the mean of the whole population falls


13 of 28

The Role of “Confidence Intervals”

We might also want to compare housing prices in two cities.
Here, we could make a confidence interval using a sample from each city and compare them.
If the intervals overlap, the two cities’ mean housing prices could be the same.
If the intervals don’t overlap, the two cities’ mean housing prices are likely different.
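
The overlap check described above can be sketched in a few lines of Python. The function name and the interval values are hypothetical, chosen only to show one overlapping pair and one non-overlapping pair.

```python
def intervals_overlap(interval_a, interval_b):
    """Return True when two (lower, upper) intervals share at least one value."""
    low_a, high_a = interval_a
    low_b, high_b = interval_b
    return low_a <= high_b and low_b <= high_a

# Hypothetical confidence intervals in thousands of dollars (made up for illustration).
city_one = (520, 610)
city_two = (590, 700)
city_three = (650, 720)

print(intervals_overlap(city_one, city_two))    # the ranges share 590 to 610
print(intervals_overlap(city_one, city_three))  # 610 is below 650, so no overlap
```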


14 of 28

How Much Does a House Cost in DC?

If you haven’t already, drag price on the x-axis of a new graph to create a dotplot.

Click on the Ruler icon and add Mean, then click on it again, go to “Measures of Spread,” and add “2 Standard Errors” (you’ll need to change the default “1” to a “2”).

What is the mean of this sample? Add the 2 Standard Errors value shown on the graph to the Mean, and then subtract the same amount from the Mean. What numbers does this give you? (These are the upper and lower limits of your confidence interval!) How could we interpret these numbers?
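
The arithmetic in this step can also be sketched in Python. This assumes the standard formula for the standard error of the mean (sample standard deviation divided by the square root of the sample size); the prices are hypothetical, not real DC listings, and the function name is made up for illustration.

```python
import math
import statistics

def two_se_interval(prices):
    """Return (mean, standard error, lower limit, upper limit) for a sample."""
    n = len(prices)
    mean = statistics.mean(prices)
    se = statistics.stdev(prices) / math.sqrt(n)  # standard error of the mean
    return mean, se, mean - 2 * se, mean + 2 * se

# Hypothetical listing prices in dollars (made up, not real DC data).
sample_prices = [400_000, 450_000, 500_000, 550_000, 600_000,
                 650_000, 700_000, 750_000, 900_000]
mean, se, lower, upper = two_se_interval(sample_prices)
print(f"mean = {mean:,.0f}, 2 SE = {2 * se:,.0f}")
print(f"interval: {lower:,.0f} to {upper:,.0f}")
```

The lower and upper values printed here play the same role as the numbers you compute by hand from the CODAP graph.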


15 of 28


16 of 28


17 of 28

Standard Deviation vs. Standard Error

Standard Deviation - the typical distance between each observation in a dataset and the mean of that dataset
Standard Error - the typical distance between the mean of each sample and the mean of the population

Key term:

“each observation”

Key term:

“each sample”
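
To see the difference between the two ideas, the following Python sketch simulates a hypothetical population of home prices, draws many samples, and compares the spread of individual prices with the spread of the sample means. The population, sample size, and number of samples are all assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the simulation is repeatable

# Hypothetical population of 10,000 home prices (in thousands of dollars).
population = [random.gauss(500, 100) for _ in range(10_000)]
population_sd = statistics.pstdev(population)

sample_size = 25
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(2_000)
]

# Spread of the sample means, compared with the usual SD / sqrt(n) formula.
spread_of_sample_means = statistics.pstdev(sample_means)
predicted_se = population_sd / math.sqrt(sample_size)

print(f"SD of individual prices: {population_sd:.1f}")
print(f"SD of sample means:      {spread_of_sample_means:.1f}")
print(f"SD / sqrt(n):            {predicted_se:.1f}")
```

The sample means vary much less than individual prices do, which is why the standard error is smaller than the standard deviation.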


18 of 28

Standard Deviation

For roughly bell-shaped (normal) data:
Most (~68%) of the data is within 1 standard deviation of the mean
The VAST majority (~95%) of the data is within 2 standard deviations of the mean
ALMOST ALL (~99.7%) of the data is within 3 standard deviations of the mean
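
These percentages can be checked with a quick simulation. The Python sketch below generates hypothetical bell-shaped data and counts how much of it lands within 1, 2, and 3 standard deviations of the mean; data that is skewed or has outliers will not follow this pattern as closely.

```python
import random
import statistics

random.seed(2)  # fixed seed so the simulation is repeatable

# Hypothetical bell-shaped (normal) data.
data = [random.gauss(0, 1) for _ in range(100_000)]
mean = statistics.mean(data)
sd = statistics.pstdev(data)

def share_within(k):
    """Fraction of the data within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```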


19 of 28

Standard Deviation as a Measure of Spread


Populations or samples with a higher standard deviation will be more spread out (see yellow distribution to the right!)
Populations or samples with a smaller standard deviation will be less spread out (see blue distribution to the right!)

20 of 28

Confidence Intervals

Confidence intervals are constructed using the mean of a sample and the standard error to get an estimate for the population mean


Confidence intervals let us cast a wider “net” of guesses to try to catch the true population mean, rather than relying on a single mean calculated from our sample.
Example: Imagine we were trying to predict the average streams per day for your favorite artist. Would we get a good estimate if we checked five days’ worth of streams and averaged them? Or should we build a broader interval prediction?

21 of 28

Confidence Intervals

The Mean + 2SE gives you an upper limit estimate; the Mean - 2SE gives you a lower limit estimate
These estimates form a range known as a 95% confidence interval
95% of intervals like this one, from samples like this one, will include the true DC mean housing cost (and 5% will not!)
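
The "95% of intervals" idea can be tested with a simulation. The Python sketch below invents a population whose true mean is known, builds a Mean ± 2SE interval from each of many random samples, and counts how often the interval catches the true mean. All of the numbers are hypothetical and are not DC housing data.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the simulation is repeatable

TRUE_MEAN = 600    # hypothetical population mean, in thousands of dollars
TRUE_SD = 150      # hypothetical population standard deviation
SAMPLE_SIZE = 40
TRIALS = 2_000

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    # Count the interval as a hit when it contains the true mean.
    if mean - 2 * se <= TRUE_MEAN <= mean + 2 * se:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of intervals caught the true mean")
```

The share should land close to 95%, with the remaining intervals missing the true mean purely by chance.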


Does this graphic appear to show a 95% confidence interval?

How do you know?

22 of 28

Are Housing Prices Different?

Samples can be good or bad at representing a population.

What could make a sample biased?

Even when samples are collected well, random chance might make them unrepresentative.


How could a randomly-selected sample of houses be unrepresentative of the population?

23 of 28

Are Housing Prices Different?

Go back to your program; this time, look for the line that says
querystring = {"location":"washington, dc"}

Replace the location with another location you’re interested in. If I wanted to look at San Francisco, I would change the line to read:
querystring = {"location":"san francisco, ca"}

Run the program again to generate a new set of values for your new chosen location!
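
The slides do not show how the program sends its request, so the following Python sketch is only an assumption about what a dictionary like querystring is for: its contents get attached to the search address as a URL query parameter. The address and the function name below are hypothetical placeholders; the real address comes from the RapidAPI code snippet.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; the real address comes from the RapidAPI code snippet.
BASE_URL = "https://example.com/search"

def build_search_url(location):
    """Attach the location to the search address as a URL query parameter."""
    querystring = {"location": location}
    return f"{BASE_URL}?{urlencode(querystring)}"

dc_url = build_search_url("washington, dc")
sf_url = build_search_url("san francisco, ca")
print(dc_url)
print(sf_url)
```

Only the location text changes between the two addresses, which is why editing that one line is enough to pull data for a different city.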


24 of 28

Offline API Backup!

If the Zillow API is down, you can use a pre-downloaded CSV of data from Baltimore, MD. (As before, this backup just won’t be live-updated from the Zillow API.)
Access that backup program here.


25 of 28

Are Housing Prices Different?

Copy the output from your program with the location modification, and then go back to your CODAP page.

Go to “Tables” and select “-- New from Clipboard --” which will paste in the data you just copied from your program.
Look over the data and explore your new sample.
Create a graph and drag price (from the new table this time!) onto the x-axis. What is the mean of this dataset? How about the standard deviation?


26 of 28

Are Housing Prices Different?

Using the Ruler icon, display the 2SE confidence interval like you did for the DC dataset.

Calculate the lower and upper limits for your chosen city’s confidence interval, and compare this to the confidence interval you found for DC.

What do you think it means if the confidence intervals overlap? What do you think it means if they do not overlap?


27 of 28

Exit Ticket

Open this CODAP file to see Ravens and Commanders NFL game scores from a past season, as well as a constructed confidence interval intended to capture each team’s mean points scored per game.

How could you use the information provided to decide who you think would win if the Ravens and Commanders played each other again this season?


28 of 28

Thanks!

apicancode@umd.edu
