Take time to discuss the four main themes of the course: I. collecting data, II. analyzing data, III. probability, and IV. making inference from the data.

Questions:

What do you know about M&Ms? Is it possible to predict which color will be most popular in a bag of M&M’s? Are the colors evenly distributed?

In this activity:

You will make individual Dot Plots, and we will create one master Dot Plot. You will compare your dot plots with the master and note similarities and differences.

Materials:

1.69 oz. bag of M&Ms, one per student.

Data collection table and dot plots, one per student.

Master dot plot for each color on the board.

An accurate scale.

Procedure:

Collecting Data: Take and open a bag of M&M’s. Weigh the bag (alone). Weigh the contents of the bag. Count the number that fall into each category: Blue, Brown, Green, Orange, Red and Yellow.
Displaying the Data: Record the numerical results on your individual dotplot. Then, properly dispose of the M&Ms by seeing whether or not they really do melt in your mouth, not in your hand.
Record your information on the Master dot plot on the board.
(Continued) I. Collecting Data: Calculate the percentage of orange M&Ms in your sample.
Record the class data using a chart like the one below. If you don’t have 40 data points, repeat the process with new bags until you do.

II. Organizing the Data: Use the table below to organize the data in a meaningful way.

Title: _______________________

Display the data of the relative frequencies using a dot plot.

III. Analyzing the Data:

Describe some general features of the data.
What would you consider to be a “normal” or “typical” percentage of orange M&Ms? Why?

IV. Making Inference from the Data:

Does our data reveal the true percentage of orange M&M’s? If so, what is the true percentage? If not, what DOES it reveal about the true percentage?
How confident are you in your conclusion? What would increase your confidence?

Record each bag’s weight and contents weight in Master tables like the ones below:

M&Ms Bag Weights

M&Ms Contents Weights

Calculate the arithmetic mean, or average weight of both.
Discuss the variability of the recorded weights. How can we describe the variability from bag to bag? How far away are the actual weights from the mean weight? How can we calculate a standard measure of this distance?
Save your recorded data for future discussion of Mean and Standard Deviation, and Standard Normal calculations.
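The questions about variability can be explored numerically before the formal discussion of standard deviation. The sketch below is only an illustration: the weights in `contents_weights` are made-up values in ounces, not class data, and `describe_spread` is a hypothetical helper name. It measures how far each weight is from the mean and averages those distances.

```python
from statistics import mean

# Hypothetical contents weights in ounces (made-up values, not class data).
contents_weights = [1.70, 1.68, 1.74, 1.66, 1.72]

def describe_spread(weights):
    """Return the mean, each weight's distance from the mean, and the average distance."""
    center = mean(weights)
    distances = [abs(w - center) for w in weights]
    return center, distances, mean(distances)

center, distances, average_distance = describe_spread(contents_weights)
print(f"mean weight: {center:.3f} oz")
for w, d in zip(contents_weights, distances):
    print(f"{w:.2f} oz is {d:.3f} oz from the mean")
print(f"average distance from the mean: {average_distance:.3f} oz")
```

The average distance used here is one simple measure of spread; the standard deviation introduced in a later lab is a different, more commonly used one.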

Individual Dotplot

Bag weight: _ Contents weight: ___

2. Hershey Kisses Lab

Before we begin:

In this lab, when new vocabulary is introduced, the word will be italicized, and the definition can be found in the glossary. This lab is adapted from:

What is the Probability of a Kiss? (It’s Not What You Think), Mary Richardson and Susan Haller, Journal of Statistics Education, Volume 10, Number 3 (2002).

www.amstat.org/publications/jse/v10n3/haller.html

Questions:

What is the chance that a HERSHEY’S KISS will land on its base when tossed out of a cup onto a table? What is the chance that it will not land on its base? Do we all agree that there are two possible outcomes when the kiss is tossed onto the table?

In this Activity:

Students work in groups of three collecting data, analyzing data and gaining experience with empirical probability and measures of the center of a sample of data.

Materials:

Pencils, ten plain HERSHEY’S KISSES candies, a 16-ounce plastic cup, a flat table or desktop, sticky notes and lab book.

Procedure:

With your group of three, discuss the subjective probabilities you assign to the kiss landing on its base or on its side, and record them in the table below:

HERSHEY’S KISS: Tosses Subjective (THEORETICAL) Probabilities

Probability of base landing	0.5	Percent chance of base landing	50%
Probability of side landing	0.5	Percent chance of side landing	50%
total	1		100%

Assign tasks within your group: the Spiller spills the 10 candies from the cup onto the table, the Counter counts the number of candies that land on their bases, and the Recorder records the results in the data table.
Let the Spiller spill the cup ten times, counting and recording each time, and then switch roles so that each person does each task once.

Toss #	HERSHEY’S KISS # on base
1
2
3
4
5
6
7
8
9
10
TOTAL

Refine your subjective probabilities based on the empirical evidence that you now have.

HERSHEY’S KISS Tosses Subjective Probabilities Refinement

Probability of base landing		Percent chance of base landing
Probability of side landing		Percent chance of side landing
total	1		100%

Combine the results from all of the groups on the board, using each group’s totals out of 100. Together, consider the best way to display the data.
Now, ask your instructor to help you create a stem-and-leaf plot of the combined totals of each student in the class. In later labs, you will learn about more ways to display your data.
The visual data is very helpful, and you may want to adjust the subjective probabilities you have assigned to base and side landings. You may also want to get some measurement of the center of this data. Find the mean and median of each of the three sets of data that your group collected.

HERSHEY’S KISS Statistics

	Data Set 1	Data set 2	Data set 3
median
mean

Combine the results of your calculations on the board again, and discuss your collective findings.
You are probably ready to come to a consensus on the empirical probability that a HERSHEY’S KISS will land on its base if tossed. Please record this probability, and then consider the following vocabulary regarding this candy toss. The kiss had a _____ percent chance of landing on its base, and so in the sample space of events for this variable L, which stands for Landing, there are two possible values, B, for base or S, for side, and the probability of landing on its base, P(B), is _____.
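As a rough illustration of how an empirical probability is obtained from tallies, the sketch below divides the number of base landings by the total number of candies tossed. The counts in `base_counts` are invented for the example, and `empirical_probability` is a hypothetical helper name; your own totals would replace them.

```python
# Hypothetical counts of kisses landing on their base in ten spills of 10 candies each.
base_counts = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
CANDIES_PER_SPILL = 10

def empirical_probability(counts, candies_per_spill):
    """Proportion of all tossed candies that landed on their base."""
    total_tossed = len(counts) * candies_per_spill
    return sum(counts) / total_tossed

p_base = empirical_probability(base_counts, CANDIES_PER_SPILL)
p_side = 1 - p_base  # B and S are the only two outcomes, so their probabilities sum to 1
print(f"P(B) = {p_base:.2f}, P(S) = {p_side:.2f}")
```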

3. Almond Hershey Kisses Lab

Before we begin:

This lab is meant to be completed after the Hershey Kisses Lab. This lab is adapted from:

What is the Probability of a Kiss? (It’s Not What You Think), Mary Richardson and Susan Haller, Journal of Statistics Education, Volume 10, Number 3 (2002).

www.amstat.org/publications/jse/v10n3/haller.html

Questions:

What is the chance that an almond HERSHEY’S KISS will land on its base when tossed out of a cup onto a table? What is the chance that it will not land on its base?

In this Activity:

Students work in groups of three collecting data, analyzing data and gaining more experience with empirical probability and measures of the center of a sample of data. They see firsthand what it means to generalize results from one population to another. They also learn about five-number summaries (minimum, first quartile, median, third quartile and maximum) and box plots.

Materials:

Pencils, ten plain HERSHEY’S KISSES, ten almond HERSHEY’S KISSES candies, a 16-ounce plastic cup, a flat table or desktop and sticky notes.

Procedure:

Look at both types of kisses, noting their differences and similarities below:

Estimate the probability that the almond KISS will land on its base when spilled and record your estimate here:

Estimate

As before, assign the jobs of tossing, counting and recording to different people in your group and then spill a cup with 10 almond candies and 10 plain candies onto the table 10 times recording the number of each that landed on its base.

Toss #	Almond KISS # on base	Plain KISS # on base
1
2
3
4
5
6
7
8
9
10
TOTAL

Rotate jobs and toss again. Rotate jobs and toss again. Each person in the group should have recorded 10 separate tosses for each type of candy.
Record your group’s data on sticky notes, one for each toss, using different colors for different candies and then collect all of the data from the class on the whiteboard. Try a back-to-back stem plot or two frequency plots with the same scale. Discuss your findings as a class.
Having numbers that capture what you see on the board can help you discuss the data and compare different data sets. Combine your group’s data and compute the median, minimum and maximum of both the almond KISS numbers and the plain KISS numbers. Begin filling in the table below:

	Almond KISS	Plain KISS
minimum
Q1
median
Q3
maximum
Interquartile Range

Just as the median divides our data into a top half and a bottom half, we can additionally divide those halves into halves. Q1 is the “middle” value in the first half of the rank-ordered data set. Q2 is the median value in the set. Q3 is the “middle” value in the second half of the rank-ordered data set. Compute Q1 and Q3, making sure to find the mean of the two middle values if you have an even number of terms. The interquartile range is simply Q3 − Q1. Compute this as well, add all the numbers to the table above and then compare your findings with the other groups.
The statistics you computed above can be displayed graphically in a box plot. The box plot has three main features: a box whose width is the interquartile range, points graphed for the outliers (something we will discuss in another lab), and whiskers at the left and right of the box which extend to the minimum and maximum of the data. You can choose the height of your box. An example box plot is shown below:

Construct your own box plots with two boxes, one that represents your plain KISS data and one that represents your almond KISS data.

Compare your findings with the class. Discuss the statistics you computed and the graphical representations of the data you created. Did they help you understand the sample space?
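The quartile procedure described in this lab (take the median, then take the median of each half) can be sketched in a few lines. This is one possible reading, not the only convention: the sketch assumes that when the number of values is odd, the overall median is left out of both halves. The counts in `almond_counts` are invented for illustration, and `five_number_summary` is a hypothetical helper name.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method.

    Assumption: when the number of values is odd, the overall median is left out
    of both halves. Other conventions exist and can give slightly different quartiles.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return values[0], median(lower), median(values), median(upper), values[-1]

# Hypothetical numbers of almond kisses landing on their base in ten spills.
almond_counts = [2, 5, 3, 4, 6, 3, 4, 5, 1, 4]
minimum, q1, med, q3, maximum = five_number_summary(almond_counts)
iqr = q3 - q1  # interquartile range, the width of the box in a box plot
print(minimum, q1, med, q3, maximum, iqr)
```

Calculators and software packages do not all use the same quartile rule, so small differences from this sketch are not necessarily mistakes.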

4. Every Graph Tells a Story

In this Activity:

We take a closer look at two stories behind two box and whisker plots.

Materials:

Graph paper and pencils.

Procedure:

The Grape Harvest The box-and-whisker plot, with nine separate plots, describes the fruits of the grape harvest in the course of nine years at a small vineyard on the shores of Lake Erie. In this graph Yield (in pounds per row) is graphed against Harvest Year. The caption 468 cases means that there were 468 separate pieces of data, not 468 cases of wine. Make five observations based on this graph, and record them below:

The Draft Lottery The first draft lottery of the Vietnam War era was held on December 1, 1969, and set the order in which men would be called up during 1970. The idea was to randomly match each of the 366 days in a (possibly leap) year with the integers 1 through 366. Eligible men whose birthday corresponded to 1, the first number picked, were the first to be drafted. The higher the number, the less likely you were to be drafted. To randomize the 366 possible birthdays, all the dates for January were put into small capsules, stirred vigorously, and poured into a large glass container. The capsules containing a birthday for each of the days in the subsequent months were added to the glass bowl in order, February followed by March, then April, etc., until December birthdays were added last.

Then one capsule was drawn at random by a person reaching into the glass and pulling out one capsule. This first capsule, September 14, was assigned draft number 001. The second date drawn, April 24, was assigned number 002. And so on through December, each date being matched with the order in which it was picked. This data is recorded in the 1970 Draft Lottery box plot on the previous page. Each of the box plots represents one month.

Notice that the minimum for September looks to be very close to 1 and that April’s minimum could well be 2.

Comment on this set of box plots which represents the outcome of an allegedly random process. Note especially the trend of the medians of each month as the year progresses.

Discuss your observations.

Read this article. Make additional comments.

5. Student Scores

Questions:

Though they contain a great deal of information, are box plots enough?

In this Activity:

Students will learn about dot plots and compare them with box plots. They should break up into groups of 2 or 3 to work on this lab.

Materials:

You will need a pencil.

Procedure:

Consider the following hypothetical exam score data presented below for three classes of students.

D-period 50 65 67 70 71 71 72 73 73 74 75 75 75 76 92

E-period 69 70 70 70 70 71 71 72 72 72 73 85 86 86 90

Discuss this data with your group. Does it look like data that could have come from three of your classes? Which data represents the most successful class? Which class needs the most work?
By now, you are very familiar with box-plots. Fill in the 5-number summary tables for the classes below:

	A-period	B-period	C-period	D-period	E-period
minimum
Q1
median
Q3
maximum
Interquartile Range

Consider the numbers in the table, and discuss with your group, whether this summary gives you any insight into your class scores.
Divide the task of creating box-plots for this data between the members of your group and compare results when you are all done. Do you have any new conclusions?
Another way to create a quick and helpful picture of your data is to create a dot plot of your data. A dot plot is simply a record of the frequency of each score. The variable that you care about is located on the horizontal axis, and the value of each data point is recorded as a dot located at its value. The dots accumulate vertically above the values. A dot plot for the A Period class is shown. Divide the task of creating dot plots for the remaining classes between the members of your group and compare results when you are done.
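The dot plot construction described above (one dot per data point, stacked vertically above its value) can be sketched as a small text plot. The example uses the D-period scores listed in this lab; `dot_plot` is a hypothetical helper name, and the plain-text layout is only a stand-in for a hand-drawn plot.

```python
from collections import Counter

# D-period exam scores as listed in this lab.
d_period = [50, 65, 67, 70, 71, 71, 72, 73, 73, 74, 75, 75, 75, 76, 92]

def dot_plot(scores):
    """Return a text dot plot: dots stack upward above each value on a horizontal axis."""
    counts = Counter(scores)
    low, high = min(scores), max(scores)
    values = range(low, high + 1)          # one column per integer score
    tallest = max(counts.values())
    lines = []
    for level in range(tallest, 0, -1):    # build rows from the top of the tallest stack down
        lines.append("".join("o" if counts[v] >= level else " " for v in values))
    lines.append("-" * len(values))        # the horizontal axis
    gap = len(values) - len(str(low)) - len(str(high))
    lines.append(str(low) + " " * max(gap, 1) + str(high))  # label only the two ends
    return "\n".join(lines)

print(dot_plot(d_period))
```

Only the minimum and maximum are labeled on the axis to keep the sketch short; a hand-drawn plot would normally label regular tick marks.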

Discuss as a class the difference between the three ways of displaying data: in a table, in a box plot and in a dot plot.

6. Matching Dotplots

Procedure:

The following dotplots represent the distributions of the eight variables listed below. The scales of the plots have been omitted intentionally and the order of the plots has been scrambled. Your task is to match the variable with the plot. Provide a brief explanation of your reasoning in each case.

Jersey numbers from the 2014 New England Patriots
Annual snowfall amounts for a sample of U.S. cities
Margin of victory in Red Sox 2014 season games
Prices of properties in the Monopoly board game
Weights of the New England Patriots 2014 team members
Ages at which sample mothers had their first child
Weights of a sample of 2014 cars
Scores on a Statistics exam

7. Puppies Lab

Questions:

We have learned about different ways to summarize and display your data and some of the standard vocabulary used when discussing data sets. The shape of the data is an important topic too.

In this Activity:

We use Labrador Retriever data to learn about histograms and the shape of a data set.

Materials:

Graph paper and pencils.

Procedure:

The table below shows the weight in ounces of puppies born at Meadowsweet Kennels in the last six months. Create a dot plot of the data and discuss the shape of the data with your group members. Look at each other’s dot plots and discuss their differences and similarities.

A histogram is another way to graphically display univariate data, and it can quickly communicate the shape of a data set. In a dot plot, once the data are ordered from minimum to maximum, the main decision is how long to make the number line, so the dot plots from your group were most likely very similar. A histogram, by contrast, involves a certain amount of design. The summary table below was used to create the histogram shown on the next page.

Fill in the summary table below, and use it to create your own histogram.

Ounces	# of Puppies
13
14
15
16
17
18
19
20
TOTAL	25

Discuss the differences between the two histograms with your group and the class as a whole. Does one more effectively represent the data than another? Compare them with the additional histogram below.

8. Meadowsweet Questions

In this Activity:

We take a closer look at mean, median and modal weights, first using the puppy data and then with other examples.

Materials:

Graph paper and pencils.

Procedure:

What are the mean, median and modal weights of the puppy data?
What are the range and interquartile range of the weights?
Graph these weights using a dot plot and a box plot and compare these graphs to the histogram above.

Describe the shape of the distribution of these weights, using words like skewed right or skewed left, or symmetrical.
What is the typical weight of a Labrador Retriever puppy born in the last six months at the Meadowsweet Kennels? Explain why you chose this value.
Describe how variable the data are about the center.
The Meadowsweet Labrador Retriever puppy data are slightly skewed. The mean and the median are different from each other. Is there any relationship between the skewness of the data and the relative size of the mean and median?

Additional Questions:

This example is adapted from How to Lie With Statistics, a classic book by Darrell Huff published in 1954. A company has 25 employees. The president earns $450,000, the financial officer earns $150,000, the two executives each earn $100,000, the bookkeeper earns $57,000, the three managers each earn $50,000, the four floor managers each earn $37,000, the timekeeper earns $30,000 and the 12 lowly production workers each earn $20,000. The mean, median and modal salaries give very different answers to the question “What is the average (center) pay at this company?”. Describe how these answers cast a very different light on the generosity of the company.
The distribution of salaries at a company is shown below. The median salary is $14,400, and the mean salary is $17,800.

If each of the supervisors is given a $1,000 raise, how will this affect the mean, median, mode, range and interquartile range? Think for a moment before you begin calculations.
If each of the employees gets a $1,000 raise how will this affect the mean, median, mode, range and interquartile range? Again, see if you can reason out the answer without calculations.
What is the total payroll for this company? In other words what is the sum of all the salaries? Once more, think for a moment before you compute the number.

In a boxplot, how much of the data lies inside the box?
(From Workshop Statistics) Using 10 integers from 0 to 100 (repeats allowed), construct three data sets as described below, one set for a, one set for b, one set for c.

90% of the data are above the mean.
The mean is greater than twice the mode.
The mean and median are different and none of the scores are between the mean and median.

(From Workshop Statistics) Are the following conclusions correct? Discuss with your neighbor.

A real estate agent notes that the mean housing price for an area is $225,700 and concludes that half of the houses in that area cost more than that.
A businesswoman calculates that the median cost of the five business trips she took last month is $750 and concludes that the total cost of the trips was $3,750.
A restaurant owner decides that more than half of her customers prefer chocolate ice cream because chocolate is the mode when customers are offered chocolate, vanilla and strawberry.

A previous president announced, truthfully, that the average net worth of an American family had risen 6%, to approximately $420,000. What he did not announce was that the median net worth was approximately $100,000, less than a quarter of this average. Comment on whether this was a misleading announcement.
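For the 25-employee company in the first additional question, the three measures of center can be checked with a short script once you have reasoned about them by hand. The sketch assumes each salary listed for a group is paid to every person in that group (for example, both executives earn $100,000); the variable names are illustrative.

```python
from statistics import mean, median, mode

# Salaries for the 25-employee company described in the first additional question.
# Assumption: each listed salary applies to every person in that group.
salaries = (
    [450_000]          # president
    + [150_000]        # financial officer
    + [100_000] * 2    # executives
    + [57_000]         # bookkeeper
    + [50_000] * 3     # managers
    + [37_000] * 4     # floor managers
    + [30_000]         # timekeeper
    + [20_000] * 12    # production workers
)

mean_salary = mean(salaries)
median_salary = median(salaries)
modal_salary = mode(salaries)
print(f"employees: {len(salaries)}")
print(f"mean: ${mean_salary:,}  median: ${median_salary:,}  mode: ${modal_salary:,}")
```

Comparing the three printed values is a useful starting point for the discussion of which "average" an employer or a union organizer might prefer to quote.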

9. Sudoku Experiment Part I

Before we begin:

Data can be gathered in many ways: in experiments, through surveys and through observational studies. This lab explores an experiment, reinforces the use of dot plots and box plots to describe data and introduces measures of center and spread.

This lab was adapted from:

Brophy and Hahn (2014) Engaging Students in a Large Lecture: An Experiment using Sudoku Puzzles. Journal of Statistics Education Volume 22, Number 1,

www.amstat.org/publications/jse/v22n1/brophy.pdf

Questions:

In statistics data can come from three places, observational studies, experiments and surveys. An experiment is a study in which some treatment is imposed on individuals in order to determine whether the treatment changes the outcome. How can a scientist create an experiment that will give meaningful results?

In this Activity:

Each participant will complete one of two 6 by 6 grid Sudoku puzzles and time how long it takes to finish. The class will collect the data. Then, using the class data, each participant will compile the data in various ways and test whether any conclusions can be drawn from the experiment.

Materials:

Enough Sudoku puzzles for all participants and a timer or clock.

Procedure:

Make sure that the pile of puzzles is shuffled. Make sure that participants can time themselves. The best case is to project a stopwatch on a screen so that all participants can see the same stopwatch.
Pass out the puzzle pages face down. Explain that each person will work to complete a different puzzle and that the puzzles vary in difficulty from easy to hard.
When everyone has a puzzle, tell them that when you say “begin”, they should turn the paper over, read the directions, then do the puzzle. Say “begin”, and start the stopwatch.
When finished, participants will turn their papers over and remain quiet until everyone is finished.
Collect and check the puzzles, recording the information on a chart on the next page.

Sudoku Data

Type of Puzzle	Correct y/n	Time to Finish min:sec	Experience y/n
numbers	y	3:05	y
letters	y	1:25	y
Greek	y	2:06	y
symbols	y	7:00	y

You may want to summarize your data in the following table:

Sudoku Data Summary

	Puzzle Type			Sudoku Experience
Correct	Symbols	Numbers	Total	Yes	No	Total
NO
YES
TOTAL

Discuss with your group why this activity is an experiment and not an observational study.
What were the “treatments” in this experiment?
Find the five-number summary for each type. Record the data in the table below:

	Number Puzzle	Symbol Puzzle
minimum
Q1
median
Q3
maximum
Interquartile Range

Draw side-by-side box plots. Do you think that the differences that you see are significant?
There are two other numbers that are used to describe the center of a data set. One is the mean or the arithmetic average, and the other is the mode or the value that appears with the greatest frequency. In this case, because time is continuous it makes more sense to compute the mode if we round the time to finish to the nearest minute and compute the mode of those numbers. Compute the mean and the mode of this data and begin filling in the table below:

Sudoku Summary Statistics

	Number Puzzle	Symbol Puzzle
mean
mode
Standard deviation

The standard deviation is, roughly, the average deviation of each data point from the mean. Except for very small data sets, it is cumbersome to compute. Every calculator and statistics software package will compute it very quickly. The number is used to describe the variability or spread of your data. Use your calculator to compute the standard deviation for the time to complete each type of sudoku puzzle, and add this number to the table of summary statistics.
As you gain more experience with mean and standard deviations, you will see how these two numbers can provide a great description of a data set, giving a snapshot of both center and variability. Discuss with your group and then with the class as a whole, which information you find most helpful in representing the data, the 5-number summary or the mean and standard deviation?
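The summary statistics above can also be computed with software. The sketch below only shows the mechanics: it converts times written as min:sec into seconds, then finds the mean, the sample standard deviation, and the times rounded to the nearest minute (the step suggested for finding a mode). It uses the four example rows from the Sudoku Data chart pooled together; in the lab you would run it separately on the class data for each puzzle type. The names `to_seconds` and `example_times` are illustrative.

```python
from statistics import mean, stdev

def to_seconds(time_text):
    """Convert a 'min:sec' string such as '3:05' into a number of seconds."""
    minutes, secs = time_text.split(":")
    return int(minutes) * 60 + int(secs)

# The four example rows from the Sudoku Data chart; real class data would replace them.
example_times = ["3:05", "1:25", "2:06", "7:00"]
times_in_seconds = [to_seconds(t) for t in example_times]

mean_seconds = mean(times_in_seconds)
sd_seconds = stdev(times_in_seconds)  # sample standard deviation (divides by n - 1)

# Round to the nearest minute before looking for a mode.
# Note: Python's round() sends exact halves (such as 2.5) to the nearest even integer.
rounded_minutes = [round(s / 60) for s in times_in_seconds]

print(times_in_seconds, mean_seconds, round(sd_seconds, 1), rounded_minutes)
```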

Sudoku Experiment Part I

Instructions:

On the other side of this sheet is a six by six grid of squares broken up into six outlined boxes with Greek letters placed in a handful of the thirty-six squares. The Greek letters α, β, δ, ε, λ and μ must each appear once in each of the six outlined boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Before you turn the page over and begin, please answer the following question:

Have you ever played Sudoku before today? Yes □ No □

Sudoku Experiment

Instructions:

The Greek letters α, β, δ, ε, λ and μ must each appear once in each of the six boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Time to completion: Minutes: __________ Seconds: ________

Sudoku Experiment Part I

Instructions:

On the other side of this sheet is a six by six grid of squares broken up into six outlined boxes with lowercase letters placed in a handful of the thirty-six squares. The lowercase letters a, b, c, d, e and f must each appear once in each of the six outlined boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Before you turn the page over and begin, please answer the following question:

Have you ever played Sudoku before today? Yes □ No □

Sudoku Experiment

Instructions:

The lowercase letters a, b, c, d, e and f must each appear once in each of the six boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Time to completion: Minutes: __________ Seconds: ________

Sudoku Experiment Part I

Instructions:

On the other side of this sheet is a six by six grid of squares broken up into six outlined boxes with numbers placed in a handful of the thirty-six squares. The numbers 1, 2, 3, 4, 5 and 6 must each appear once in each of the six outlined boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Before you turn the page over and begin, please answer the following question:

Have you ever played Sudoku before today? Yes □ No □

Sudoku Experiment

Instructions:

The numbers 1, 2, 3, 4, 5 and 6 must each appear once in each of the six boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Time to completion: Minutes: __________ Seconds: ________

Sudoku Experiment Part I

Instructions:

On the other side of this sheet is a six by six grid of squares broken up into six outlined boxes with symbols placed in a handful of the thirty-six squares. The symbols ◾, ▵, √, ↢, ⊝ and ♡ must each appear once in each of the six outlined boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Before you turn the page over and begin, please answer the following question:

Have you ever played Sudoku before today? Yes □ No □

Sudoku Experiment

Instructions:

The symbols ◾, ▵, √, ↢, ⊝ and ♡ must each appear once in each of the six boxes, once in each of the six rows and once in each of the six columns. Use logic (i.e. do not guess) to determine what goes in each empty space.

Time to completion: Minutes: __________ Seconds: ________

10. The Standard Deviation

Before we begin:

The standard deviation is a powerful and very commonly used tool to measure the spread or variability of a data set. It has features in common with the Mean Absolute Deviation. One feature it does not share, however, is ease of computation.

In this Activity:

This lab is designed to illustrate how to find the standard deviation using a very small and arbitrary data set.

Procedure:

Consider the data 1, 2, 4, 6 and 9. Calculate the mean of this set. The mean is denoted x̄ and is read “x bar”. (I write “xbar” for short.)
Use the expressions given in the column headings to complete the blanks in the table below.

Score, x	Mean, xbar	Deviation from the Mean (x - xbar)	Squared Deviation (x - xbar)^2
1
2
4
6
9

Next compute: S = Sum of the Deviations from the Mean, and SS = Sum of the Squared Deviations from the Mean, and record them below.

S = ________ and SS = ________

Discuss these values with your group. Are you surprised by the results?

This data set contains 5 points, and so the next step is to compute the following value:

SS / (5 − 1) = ________

The denominator is one less than the number of data points. This deserves an explanation, but it is a long one, and better left for another time. How do the units of the numbers you just computed compare to the units of the original data?

Take the square root of your answer to the previous questions, and record the number below:

________

Finally! This number represents the standard deviation. The square root is necessary because if the original units are, say, pounds or dollars then without the square root, the units would be square pounds or square dollars. This value is often denoted Sx and is the standard deviation of this particular data set. Can you retrace your steps and write formulas for Sx?

Now enter this small data set into your calculator to verify your answer. Compare your answers with the ones your calculator produced.
Without doing any calculations how would the standard deviation of a dataset change if you added 10 to each value? Explain your answer.
Without doing any calculations how would the standard deviation of a dataset change if you multiplied each value by 10? Explain your answer.
How would adding a value to the data set that is larger than the largest value affect the calculation?
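Once you have worked through the table by hand, the same steps can be retraced in code as a check. The sketch below follows the procedure of this lab for the data 1, 2, 4, 6 and 9 and compares the result with Python's built-in sample standard deviation; the variable names are illustrative. Because of floating-point rounding, the sum of the deviations may print as a number extremely close to zero rather than exactly zero.

```python
from math import sqrt
from statistics import stdev

data = [1, 2, 4, 6, 9]  # the data set used in this lab

x_bar = sum(data) / len(data)               # the mean, xbar
deviations = [x - x_bar for x in data]      # x - xbar for each score
squared = [d ** 2 for d in deviations]      # (x - xbar)^2 for each score

S = sum(deviations)                         # sum of the deviations
SS = sum(squared)                           # sum of the squared deviations
variance = SS / (len(data) - 1)             # divide by one less than the number of points
s_x = sqrt(variance)                        # the standard deviation, Sx

print(f"xbar = {x_bar}, S = {S:.10f}, SS = {SS:.2f}")
print(f"SS / (n - 1) = {variance:.2f}, Sx = {s_x:.4f}")
print(f"statistics.stdev gives {stdev(data):.4f}")
```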

11. The Normal Distribution Lab

Before we begin:

The normal distribution is probably the most important distribution in all of statistics because it appears so frequently when examining univariate data. If you graph the heights of a group of 100 female basketball players, the IQs of men aged 30 to 40, the birth weights of a large group of babies, the lengths of cod caught in the North Sea, you will find that the numbers very closely approximate a normal curve.

The normal curve accords with common sense. In the context of, say, heights of human females aged 20 to 25, there are very few extremely high or low heights. As we examine heights closer to the mean, there are more females with these heights.

Many completely unrelated data sets exhibit an approximately normal distribution.

Questions:

What is normal?

In this Activity:

We explore the features of a normal distribution.

Materials:

Graph paper, pencils and a graphing calculator or statistical software package.

Procedure:

Bradford College administers a placement test to incoming freshmen to determine their appropriate math placement. One year 216 freshmen take the test, which consists of 20 multiple choice questions. The results below display the possible score on the test, that is, the total number of correctly answered questions, and the count that represents the number of students who received each score. Make a histogram of the data using 20 boxes. Compare your results with your classmates.

Notice that your data is symmetrical and mounded. This score data approximates a normal, or bell-shaped, curve. In the context of the placement test we see that relatively few students got very low or very high scores. Compute the mean and standard deviation of the scores, and record them here:

mean = ______ standard deviation = ________

How many of the scores are within one standard deviation of the mean?

How many of the students’ scores are within two standard deviations of the mean? What proportion is this?
How many students scored within three standard deviations of the mean? What proportion is this?
Your answers to the previous questions should have given you what is sometimes called the empirical rule for normal distributions. In data that are approximately normally distributed 68% lie within one standard deviation of the mean, 95% lie within two standard deviations of the mean and practically all of the data (99.7% in theory) lie within three standard deviations of the mean. This tendency is sometimes called the 68-95-99.7 rule. Have you ever heard of it? Do the Bradford College test scores follow this rule? This, like the Pythagorean Theorem, is something to memorize. Below is a sketch of a normal distribution. What is the mean of the distribution shown?

Knowing that 99.7% of the data lies within three standard deviations of the mean, what do you guess is the standard deviation of this distribution?
The normal distribution in which the mean is 0 and the standard deviation is 1 is called the standard normal distribution. Estimate the area between the curve and the x-axis, and deduce another property of a standard normal distribution. Does the standard normal distribution accurately model the data given by Bradford College? (If not, how could you transform the given standard normal distribution so that it would?)
Statisticians have a name for the number of standard deviations from the mean: a z-score. A z-score of 1.5 means that the value lies 1.5 standard deviations above the mean. A z-score of −0.6 means that the value lies 0.6 standard deviations below the mean. Z-scores are very useful for comparing normal distributions with different means and standard deviations. What is the z-score of a score of 4 on the Bradford College test?
Sophie and Pascal are applying to college. Sophie takes the SAT and Pascal takes the ACT. The scores from both of these tests are normally distributed. The SAT has a mean of 896 and a standard deviation of 174. The ACT has a mean of 20.6 and a standard deviation of 5.2. Using the normal distribution above and the mean and standard deviation information, sketch a normal curve that would represent the scores from each of the two tests.
Even without the picture, you can use the mean and standard deviation information to compare Sophie’s score to Pascal’s score. Sophie scores 1080 on the SAT and Pascal scores 28 on the ACT.

Sophie’s score is how many standard deviations above the mean?
Pascal’s score is how many standard deviations above the mean?
Which score has a higher z-score?
Which person do you think performed better on their respective tests?
Mark Pascal’s and Sophie’s z-scores on their respective graphs and discuss your findings.

By now, you may have already decided that the formula for computing the z-score of a point with value x in an approximately normal distribution is:

$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$

What are the units of a z-score?

What would a z-score of 0 tell you about the value of a point?
What would a z-score of 4.2 tell you about the value of a point?
Using the normal curve that you drew in problem 9, decide what percentage of students who took the SAT test had a lower score than Sophie? Notice the connection between area under the curve to the left of Sophie’s score and your answer.
Again, using the curve you drew in problem 9, decide whether or not Pascal was successful in his quest to have a score that was better than the scores of 90% of the people taking the ACT.
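
The comparison of Sophie's and Pascal's scores can also be checked numerically. The following is a minimal Python sketch, assuming that both sets of scores are exactly normal with the means and standard deviations stated in this lab. The helper name z_score is illustrative; NormalDist is part of Python's standard statistics module.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Means and standard deviations as stated in the lab.
sat = NormalDist(mu=896, sigma=174)
act = NormalDist(mu=20.6, sigma=5.2)

z_sophie = z_score(1080, sat.mean, sat.stdev)
z_pascal = z_score(28, act.mean, act.stdev)

# Area under each normal curve to the left of the score, that is,
# the proportion of test takers who scored lower.
below_sophie = sat.cdf(1080)
below_pascal = act.cdf(28)

# The 68-95-99.7 rule, checked against the standard normal distribution.
standard = NormalDist(0, 1)
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

print(f"Sophie: z = {z_sophie:.2f}, proportion below = {below_sophie:.3f}")
print(f"Pascal: z = {z_pascal:.2f}, proportion below = {below_pascal:.3f}")
for k, area in within.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```

The areas computed here are what you estimate by eye when you shade the region to the left of a score on your sketched curve.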

12. Minimum Wage Lab

Questions:

A scatter plot of a data set gives, as they say, a thousand words of information about bivariate (two-variable) data. It is very helpful to have a common vocabulary to discuss those scatter plots.

In this Activity:

We introduce the vocabulary used to discuss scatter plots and look at many examples.

Materials:

Ruler and pencils

Procedure:

1. As a group, consider the scatterplot that shows the federal minimum wage at five-year increments. This is a graph of minimum wage versus time.

Here are some questions you might ask. What exactly does each data point represent? What shape are the data, linear, curved, clusters? Are there outliers? Do the data show a trend, positive, negative or none? How strong is the pattern, strong, weak, moderate, constant or varying? Does the pattern generalize? Is there an explanation for the pattern?
In your group, pick out the adjectives that best describe the minimum wage data.
Use a ruler to draw a line that seems to you to fit the data as well as possible. Compare with your group members.
Estimate the slope of the line, and then use coordinates of points on or very close to the line to compute the equation of this line. Let x stand for the number of years after 1960, and write the equation below, using y = mx + b form:

Interpret the value of the slope m and the y-intercept b in the context of this story. Could you safely use the line to estimate the minimum wage in the current year? (What is the actual federal minimum wage currently?)
Discuss your findings as a class.
Now consider the scatter plots shown on the next page and, working with your group, fill in the chart for the given scatter plots.

13. Least Squares Regression Lab

Before we begin:

This is a short worksheet on the Least Squares Regression line. The goal is to summarize and condense some of the ideas about linear regression that appear in a few workshops in this book.

In this Activity:

You will explore residuals with a very small data set.

Materials:

You will need graph paper, a ruler and a pencil.

Procedure:

Graph the three points (0,0), (2,3) and (4,3), and find the centroid of the triangle that they create. In a data set of any size, this point is called the point of averages and denoted (x̄, ȳ).
Add to your graph the three lines y1 = 2, y2 = - x + 4, and y3 = .
The residual of a point P = (xp, yp) with respect to a line y = f(x) is the vertical distance between the y-value of the point and the y-value of the point of the line given by the x-value of P. It looks simpler than it sounds:

Residual of P with respect to y = f(x): yp - f(xp).

The residual of (2,3) with respect to y2 is 3 - (-2 + 4) = 1. Statisticians describe this new data as fit. While all three lines pass through (x̄, ȳ), the line that passes closest to all of the points is the one that we choose to represent our data. Fill in the table below with the three residual values for each line.

Residuals

	y1	y2	y3
(0,0)
(2,3)
(4,3)		3
Sum of residuals

Discuss your findings. Did you learn anything new about the three lines relative to the data?
The squares of the residuals give a clearer picture of the ‘fitness’ of a line. Fill in the table this time with the squares of the residuals.

Squared Residuals

	y1	y2	y3
(0,0)
(2,3)
(4,3)		9
Sum of squared residuals

The Least Squares Regression Line is the line that makes the sum of the squares of the residuals as small as possible. The line will always pass through the point of averages. When, if ever, will the sum of the squares of the residuals equal zero?
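
The residual tables for this small data set can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original worksheet. It uses the three points and the lines y1 and y2 from this lab, and it computes the least squares line from the usual slope and intercept formulas. The function names are illustrative.

```python
points = [(0, 0), (2, 3), (4, 3)]


def residuals(line, pts):
    """Signed residuals y - f(x) of each point with respect to the line f."""
    return [y - line(x) for x, y in pts]


def least_squares(pts):
    """Slope and intercept of the line that minimizes the sum of squared residuals."""
    n = len(pts)
    x_bar = sum(x for x, _ in pts) / n
    y_bar = sum(y for _, y in pts) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pts)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def y1(x):
    return 2


def y2(x):
    return -x + 4


slope, intercept = least_squares(points)


def fitted(x):
    return slope * x + intercept


for name, line in [("y1", y1), ("y2", y2), ("least squares", fitted)]:
    res = residuals(line, points)
    print(name, res, "sum:", sum(res), "sum of squares:", sum(r * r for r in res))
```

Every line through the point of averages has residuals that sum to zero, so the plain sum cannot tell such lines apart. The sum of squares can.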

14. Mammal Lab

Before we begin:

A table of data containing the number of days in a gestation period and the life expectancy for different mammals is located on the back page of this lab.

Questions:

How can we summarize our data with more than numbers or a description of its shape?

In this Activity:

You will learn how to model your data with a line.

Materials:

Graph paper and a graphing device.

Procedure:

Enter the data from the Mammal Data table found on the back page of this lab into your graphing device, and create a scatter plot of your data. Add labels to the axes below and sketch your data on the grid.

Draw a line on your scatter plot that you think best summarizes your data. Estimate the slope and y-intercept and fill in below:

y = _____ x + _______

In this case, y represents the predicted life expectancy, and x represents a given gestation period. Interpret the value of the slope and the y-intercept in the context of this data.

Compare your summary line to the lines drawn by others in your class. Are they the same? Same slope? Same y-intercept?
Does your line go through all of the points on your scatter plot? Does it go through any points?
A residual is the error of the regression line. That is, it is the difference between the observed y value or height of the data point and the predicted y value or height on the summary line.

For each point on the scatter plot draw a vertical line from your data point to the point on the summary line that shares an x-value with your data point. The length of each vertical line that you drew represents the absolute value of the residual of each data point with respect to the summary line.

Draw squares using each residual as one side of the square. The area of each square represents the value of the squared residual. The sum of all of the areas of the squares represents the total sum of the squared residuals. Estimate the sum of the squares of your residuals:

Sum = _______

Compare your squares with the whole class. Which line produced the smallest sum of squares?

The Least Squares Regression Line is the line that produces the minimum sum of squared residuals. Use your graphing calculator’s LinReg feature to find the slope and intercept of the Least Squares Regression Line and fill them in below:

y = ____ x + _____

In this case, y represents the life expectancy predicted by the least squares regression line, and x represents a given gestation period.

In addition, record the value that your calculator gives you for the variable r:

r = ___________.

The significance of r will be discussed in the next lab.

Add the line given by LinReg to the scatter plot on your calculator. (You can ask your calculator to do this automatically.) Compare it with the lines you and your classmates drew.
Discuss this new line as a good predictor of life expectancy given gestation period. Would you accept this line as a good model for your data?
Homework. The three data points corresponding to the human, the hippo and the elephant are far away from the bulk of the data, and small changes in their positions will have a disproportionate effect on the equation of the least squares regression line. Statisticians call these points influential points. Remove these three points from your data set and, as you did in problem 2, estimate what you think is the line that best models this data. Next have your calculator find the exact least squares regression line. (Make sure to note the value of r for this new line.) Compare your new equation to the one you obtained in class with the three influential points still in the data set.

15. Scrabble Letter Lab

Before we begin:

This lab refers to the scores that each letter is assigned in a standard American English Scrabble game. A table of these values is given on the back page of this lab.

Questions:

Given a data set, we can choose a line that best matches the data in many ways. What qualities would you like the line to have?

In this Activity:

You will make a scatter plot of the data, choose a line that might best match the data and also find the least squares regression line.

Materials:

You need paper, pencils, a graphing device and the table of letter scores in the text.

Procedure:

Enter the letter point data from the Scrabble Letter Value table into your calculator in two columns of data and create a scatter plot of your data with tiles on the x-axis and points on the y-axis. In addition, plot the line y = −(1/3)x + 3.5 on your graph. The plotted line has a rational slope and intercept. How well does it match your data?
Add a third column of data to your table that consists of the residuals of each data point with respect to this line. Using summation notation, write an expression that you could use to compute the sum of all of these residuals.
It is easier to compare residuals if you consider the squares of the residuals rather than the signed residual or the absolute value of the residual. (This also favors a few small residuals rather than one large residual.) Add a new column to your data table that consists of the squared residual values. Using summation notation, write an expression that you could use to compute the sum of all of these squared residuals.
Your calculator can find the sum of the elements in a column of data. Compute the sums of both the signed residuals and the squared residuals with respect to the given line.

The LinReg feature on your calculator finds the line that minimizes this sum of squared residuals. This line, which minimizes the sum of the squared residuals, is called the best fit line. Use the LinReg feature on your calculator to compute the best fit line for this data. Graph the data, the line given earlier and the best fit line. Discuss your findings with your group.
Compute the residuals with respect to the best fit line and the squared residuals with respect to the best fit line, and discuss with the class how to compare those values with those of the original line.
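
The residual columns in this lab can also be built in code rather than on a calculator. The following is a minimal Python sketch under one stated assumption: the tile counts and point values are those of the standard English-language Scrabble set, with the two blank tiles left out. If the table on the back page of this lab differs, substitute its values. The names TILES, given_line and best_fit are illustrative.

```python
# letter: (number of tiles, point value). Assumed standard English-language
# Scrabble set; the two blank tiles are omitted.
TILES = {
    "A": (9, 1), "B": (2, 3), "C": (2, 3), "D": (4, 2), "E": (12, 1),
    "F": (2, 4), "G": (3, 2), "H": (2, 4), "I": (9, 1), "J": (1, 8),
    "K": (1, 5), "L": (4, 1), "M": (2, 3), "N": (6, 1), "O": (8, 1),
    "P": (2, 3), "Q": (1, 10), "R": (6, 1), "S": (4, 1), "T": (6, 1),
    "U": (4, 1), "V": (2, 4), "W": (2, 4), "X": (1, 8), "Y": (2, 4),
    "Z": (1, 10),
}

tiles = [count for count, _ in TILES.values()]    # x-values
points = [value for _, value in TILES.values()]   # y-values


def given_line(x):
    """The line suggested in the lab: y = -(1/3)x + 3.5."""
    return -x / 3 + 3.5


def least_squares(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def residual_sums(line):
    """Sum of signed residuals and sum of squared residuals for a line."""
    res = [y - line(x) for x, y in zip(tiles, points)]
    return sum(res), sum(r * r for r in res)


slope, intercept = least_squares(tiles, points)


def best_fit(x):
    return slope * x + intercept


sum_given, sse_given = residual_sums(given_line)
sum_fit, sse_fit = residual_sums(best_fit)

print(f"given line: sum = {sum_given:.3f}, sum of squares = {sse_given:.3f}")
print(f"best fit:   y = {slope:.3f}x + {intercept:.3f}")
print(f"best fit:   sum = {sum_fit:.3f}, sum of squares = {sse_fit:.3f}")
```

Comparing the two sums of squares is one way to answer the question of how the given line measures up against the best fit line.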

16. Correlation Lab

Before we begin:

Statisticians summarize the characteristics of a set of data in an important number called r, the correlation coefficient. This lab is meant to introduce this number and serve as reference in the future.

Questions:

Wouldn’t it be nice if we could quantify the notions of strength, positive association and the other words that we use to describe a data set?

In this Activity:

This lab involves reading about r and then playing a guessing game.

Materials:

Pencils.

Procedure:

“r” is a statistic that measures the direction and strength of a linear relationship. Data that is perfectly linear and has a positive slope has a measure of r = 1. Data that is perfectly linear but has a negative slope has a measure of r = −1. Data that has no discernible linear pattern has a correlation value near r = 0. Thus, −1 ≤ r ≤ 1. Have you seen other important mathematical variables that can take on this range of values?
There are two ways to calculate the value of r. Both use the idea that r captures the variation in the x direction, as well as the variation in the y direction. One can look at the standardized scores of each data point and think of r as the average product of these two standardized scores. Here is one version:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$

Can you guess the definitions for x̄, ȳ, sx and sy? If you graph the standardized data and fit a least squares line to them then the value of r will be exactly the slope of the line fitted to the transformed data. Discuss with your group.

One can also think of r as a measure of the variability in the response variable, y, that is explained by the variability of the independent variable, x. In this case:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

In words, r² is the ratio of the sum of squares of the variability that is explained by the model, where ŷ is the value predicted by the least squares line, to the sum of squares of the variability about the most basic model, y = ȳ. Which definition resonates more with you?

There are some important things to keep in mind about correlation. First, correlation is a measure of the strength of a LINEAR relationship. One can calculate r for many types of non-linear relationships, but the measure is meaningless if the relationship is clearly nonlinear from a visual examination of the scatterplot. Correlation is also a measure for quantitative variables only. Can you think of a time when believing something is linear when it is not can cause you to make mistakes in predicted values?
And second, correlation does not imply causation. Just because two quantitative variables are strongly linearly correlated, that does not mean that changes in one variable cause changes to occur in the other variable. Both variables may be responding to changes in a third variable that is not in your model. For example, in a sample of elementary school students, there is a strong positive correlation between shoe size and scores on a standardized test of arithmetic skills. Does this mean that studying arithmetic makes your feet bigger? No, shoe size and arithmetic skill are related to each other because both variables respond to a third variable, age. Can you think of an example where correlation does not mean causation?
To get a sense of the measure of correlation, try to guess the correlation r for the scatterplots on the following pages.

These graphics were generated at

http://www.istics.net/Correlations/
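
The two descriptions of r in this lab, the average product of standardized scores and the share of variability explained by the least squares line, can be compared on a small data set. The following is a minimal Python sketch using the three points (0,0), (2,3) and (4,3) from the Least Squares Regression Lab. It assumes that the average divides by n − 1 and that sx and sy are sample standard deviations. The function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

xs = [0, 2, 4]
ys = [0, 3, 3]


def r_from_z_scores(xs, ys):
    """r as the (n - 1)-averaged product of the standardized x and y scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


def r_squared_from_fit(xs, ys):
    """r^2 as explained variability divided by total variability about the mean of y."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    explained = sum((slope * x + intercept - y_bar) ** 2 for x in xs)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total


r = r_from_z_scores(xs, ys)
r_squared = r_squared_from_fit(xs, ys)
print(f"r from standardized scores: {r:.4f}")
print(f"r^2 from the fitted line:   {r_squared:.4f}")
print(f"square of the first value:  {r * r:.4f}")
```

The second description only gives r², so the sign of r has to come from the direction of the trend.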

17. Scrabble Word Lab

Before we begin:

This lab refers to the scores that each letter is assigned in a standard American English Scrabble game. A table of these values is given in the Scrabble Letter Lab. This lab is adapted from:

Workshop Statistics: Discovery with Data and Fathom. A. Rossman, B. Chance, R. Lock. Key Curriculum Press, Emeryville 2001, 1-930190-07-7.

In this Activity:

You will generate data from the names in the room, plot the data, and consider the trend, form and strength of data. It is always a good idea to look at your data in the form of a scatterplot before you compute a best fit line and its correlation coefficient.

Materials:

You need paper, pencils, a graphing device and the table of letter scores in the text.

Procedure:

Print your whole name in the top row of the table below, one letter per space.

Count the number of letters in your name, ignoring blanks and spaces.

Number of letters in my name: _____

Using the data given on the back page of the Scrabble Letter Lab, write the point value for each letter in your name in the space below it in the Name Score table above. Add the numbers to compute the scrabble value of your name.

Scrabble value of my name: ________

Repeat these steps with one or more names of your own choosing (“Double 0 Seven” for instance).

Share your score with the class, and fill in the table given on the last page of this lab with the data.
Enter your data into your calculator or graphing utility and make a scatter plot of word length vs point value for each name. Discuss with your group the strength, trend and form of your data. Comment on the association between the length of names and their Scrabble value. Describe the three features with a few words below:

Strength

Trend

Form

How well do you think you could predict the Scrabble value of a person’s name given the length of the name? Try to predict the scrabble value of these names: Dustin Pedroia and Vladimir Putin. Discuss this question as a group or class.
Can you find examples of pairs of names in which the longer name has a lower value? Can you find groups of three or four such names?

Name	Number of Letters	Scrabble Score
Pat Tecake	9	17
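
Scoring many names by hand is slow, so a short program can help fill in the class table. The following is a minimal Python sketch, assuming the standard English-language Scrabble letter values; if the table in the Scrabble Letter Lab differs, substitute its values. The names POINTS and name_score are illustrative.

```python
# Point value of each letter (assumed standard English-language Scrabble values).
POINTS = {
    **dict.fromkeys("AEIOULNSTR", 1),
    **dict.fromkeys("DG", 2),
    **dict.fromkeys("BCMP", 3),
    **dict.fromkeys("FHVWY", 4),
    "K": 5,
    **dict.fromkeys("JX", 8),
    **dict.fromkeys("QZ", 10),
}


def name_score(name):
    """Return (number of letters, Scrabble value), ignoring spaces and punctuation."""
    letters = [ch for ch in name.upper() if ch in POINTS]
    return len(letters), sum(POINTS[ch] for ch in letters)


for name in ["Pat Tecake", "Dustin Pedroia", "Vladimir Putin"]:
    length, value = name_score(name)
    print(f"{name}\t{length}\t{value}")
```

Each printed row has the same three columns as the class table: name, number of letters and Scrabble score.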

18. Heights Lab

Questions:

Sometimes the r value can be misleading. It is important to also use the residuals to analyze the fit of your line.

In this Activity:

You will consider both the residuals and the r-value for the data concerning heights in inches and age in years that is located at the beginning of this lab.

Materials:

Graph paper and a graphing calculator.

Procedure:

1. Enter the data from the table below into your calculator.

Use your calculator to make a scatter plot of height vs. age. Find the equation of the least squares line for predicted median height versus age and graph the line on the plot.
Find the value of r that goes with this line. What can you conclude about your regression line based on r? Discuss the data, the line and r as a group.
If L1 contains the ages and L2 contains the heights, then define L3 to be Y1(L1). Store the residuals of the fitted line in L4 by defining L4 = L2 − L3. Plot the residuals vs age on a new graph. Discuss the proper domain and range for these values with your neighbor.
Do the new data, the residuals, seem to be randomly scattered about the x-axis? What can be said about modeling this data with a line?

19. Was Leonardo Correct?

Questions:

Leonardo da Vinci wrote instructions to artists about how to proportion the human body in painting and sculpture. Three of Leonardo’s rules were:

• Height equals the span of the outstretched arms.

• Kneeling height is three-fourths of the standing height.

• The length of the hand is one-ninth of the height.

Discuss as a group. Do these proportions seem reasonable? Is Leonardo really suggesting that there is a linear relationship between these lengths?

In this Activity:

In this activity, you will gather data, compute the best fit line, compare it to Leonardo’s predicted linear models and use the r-value to support your findings.

Materials:

You will need meter sticks, pencils and classmates to complete this activity.

Procedure:

Decide as a class which units of measure you will use. Then, working with a partner, measure your height, kneeling height, arm span and hand length, and record them in the table below:

Height	Kneeling Height	Arm Span	Hand Length

Make a data table that includes the measurements from everyone in the class. You can use the table on the next page to record the class data if you wish.
Make three scatterplots of the data, arm span vs height, kneeling height vs standing height and hand length vs height. On each scatter plot, add the lines that Leonardo predicted would model the data.

For the plots that have a linear trend, use your calculator to find the least squares regression line and compute the r value or correlation coefficient.
Discuss the meaning of the regression lines as a group. In particular, discuss the slopes and y-intercepts in the context of this activity.
How well do the lines fit the data? Does the value of r support Leonardo’s rules?

Leonardo’s Lengths

Height	Kneeling Height	Arm Span	Hand Length

Letting Height = H, Kneeling Height = K, Arm Span = A, and Hand Length = L, Leonardo says:

A = H, K = (3/4)H, L = (1/9)H

20. Counting F’s

Before we begin:

The text on the following page should not be handed out until you have explained the activity to the students.

Questions:

Surveys are a ubiquitous part of life these days, but a well-written survey is very difficult to construct. Suppose you wanted to find out how many hours a night each participant slept during the past week. How would you phrase your question?

In this Activity:

Students are going to count F’s in a text and compare their results.

Materials:

Enough copies of the text on the next page.

Procedure:

Pass out the text on the next page face down. Tell the participants that they are going to have 1 minute to count all of the F’s in the text.
Start your watch and let them count. Stop your watch and record the number of F’s counted on the board. Did anyone count 34?
Let everyone know that no one found the correct number, and give them 3 more minutes to count the F’s.
Again compare answers.
Discuss with the group the implications of this activity.

THE NECESSITY OF TRAINING HANDS FOR FIRST-CLASS FARMS IN THE FATHERLY HANDLING OF FRIENDLY FARM LIVESTOCK IS FOREMOST IN THE MINDS OF FARM OWNERS. SINCE THE FOREFATHERS OF THE FARM OWNERS TRAINED THE FARM HANDS FOR THE FIRST-CLASS FARMS IN THE FATHERLY HANDLING OF FARM LIVESTOCK, THE OWNERS OF THE FARMS FEEL THEY SHOULD CARRY ON WITH THE FAMILY TRADITION OF TRAINING FARM HANDS IN THE FATHERLY HANDLING OF FARM LIVESTOCK BECAUSE THEY BELIEVE IT IS THE BASIS OF GOOD FUNDAMENTAL FARM EQUIPMENT.
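
For the follow-up discussion, the count can be confirmed by machine. The following is a minimal Python sketch that counts every F in the passage above and, as one possible talking point, also tallies how many of them sit in the word OF. The variable names are illustrative.

```python
TEXT = (
    "THE NECESSITY OF TRAINING HANDS FOR FIRST-CLASS FARMS IN THE FATHERLY "
    "HANDLING OF FRIENDLY FARM LIVESTOCK IS FOREMOST IN THE MINDS OF FARM "
    "OWNERS. SINCE THE FOREFATHERS OF THE FARM OWNERS TRAINED THE FARM HANDS "
    "FOR THE FIRST-CLASS FARMS IN THE FATHERLY HANDLING OF FARM LIVESTOCK, "
    "THE OWNERS OF THE FARMS FEEL THEY SHOULD CARRY ON WITH THE FAMILY "
    "TRADITION OF TRAINING FARM HANDS IN THE FATHERLY HANDLING OF FARM "
    "LIVESTOCK BECAUSE THEY BELIEVE IT IS THE BASIS OF GOOD FUNDAMENTAL FARM "
    "EQUIPMENT."
)

# Total number of times the letter F appears anywhere in the passage.
total_fs = TEXT.count("F")

# Number of F's contributed by the short word OF, which readers may skim past.
fs_in_of = sum(1 for word in TEXT.split() if word == "OF")

print("F's in the passage:", total_fs)
print("F's in the word OF:", fs_in_of)
```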

21. Jelly Blubbers Colony

Before we begin:

Make sure that you have plenty of copies of the Jelly Blubbers Colony. This lab was adapted from a Jellyblubber Activity invented by Rex Boggs, a teacher in Queensland, Australia.

Questions:

Sampling is another important activity undertaken by amateur and professional statisticians. If I were curious to find out what Americans did to celebrate Memorial Day, I would probably ask my friends and family what they were doing, but this would not give me a very good sample. How could I improve my sample?

In this Activity:

This lab encourages good sampling practices and techniques. Jellyblubbers are a recently discovered marine species. Scientists have discovered a colony of jellyblubbers and they are trying to determine the width of a typical jellyblubber.

Materials:

Just the Jelly Blubbers Colony page.

Procedure:

You have been handed a sheet of 100 jellyblubbers. Study the sheet for 10 seconds and then record the numbers of 5 jellyblubbers that you think form a representative sample of this jellyblubber population. Use the second sheet, which lists all the blubbers and their widths, to write down the widths of your chosen blubbers.

Number
Width

The sample you chose is called a judgment sample. Compute the mean width of your sample and record it below:

Mean Width

Share your data with the class by adding your result to the table of widths on the board. You can record the class data on the data sheet on the last page of this lab.
Make a dot plot of the data the class collected. Discuss with the class the shape, approximate center and range of the graph.
Next, use your table of random digits to select ten two-digit numbers from 00 through 99. The pair 00 represents 100, and single-digit numbers are written with two digits, so 7 is represented as 07. Find the mean width of the ten blubbers with these numbers. Contribute your mean to the class data. Again, decide how to compare and comment on the shape, center and range of this new data set. The sample you found using random numbers is called a simple random sample, abbreviated SRS.
The true mean, that is, the actual computed mean of all the widths, is 18.6 cm. Which of the methods gave a center closest to the true mean? Which method do you think is more accurate for finding the mean, a judgment sample or an SRS? Why?
If you shake a collection of blubbers, the larger ones tend to sink and the smaller ones rise. You have been given a sheet divided into five strata. Notice that there is little variability among blubbers within each stratum (singular of strata), but more variability between strata (plural of stratum). Using random numbers and the table of jellyblubber widths by strata, select two blubbers from each stratum and find the mean of these ten numbers. This method is called stratified sampling. Share your data with the class and compare the mean of the class data from the stratified samples with the means from the SRS.
There are more ways to pick samples! A cluster of blubbers is a group of blubbers near each other in the non-shaken collection. There is usually a lot of variability within each cluster, but not much variability among clusters. To select a cluster sample, pick a random number between 1 and 20. Call it r. Multiply this number by 5. Your cluster will consist of that number, 5r, and the four numbers preceding it. For example, if you pick 11 as your random number then the cluster will consist of blubbers numbered 55, 54, 53, 52, 51. Referring to the table of jellyblubber width values, write down the 5 widths from this cluster and compute their mean. Share your cluster mean data with the class. Decide on the best way to graph the class data generated from these cluster samples.

There is one more method of sampling using the original population, not the stratified population. Pick a random number between 1 and 20. This number is the first blubber. Add 20 to your random number. This is your second blubber. Continue until you have five blubbers. For example, if you pick 07 as your random number then your sample will consist of blubbers 7, 27, 47, 67, 87. Compute the mean jellyblubber width of your sample. This is called a systematic sample. Comment on the graph generated by these means.
Discuss with your group and the class as a whole the advantages and disadvantages of the different sampling methods.
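
The four random sampling schemes in this lab differ only in how the blubber numbers are chosen, which a short program makes explicit. The following is a minimal Python sketch that selects blubber numbers only; the widths still have to be looked up in the lab's tables. Two points are assumptions for illustration: the SRS is drawn without repeats, and the five strata are taken to be consecutive blocks of twenty numbers, which will not match the strata on the lab's sheet. All function names are illustrative.

```python
import random

POPULATION = list(range(1, 101))  # blubbers numbered 1 to 100

# Hypothetical strata: five consecutive blocks of twenty blubber numbers.
# Replace these with the strata shown on the lab's stratified sheet.
STRATA = [list(range(low, low + 20)) for low in range(1, 101, 20)]


def simple_random_sample(rng, size=10):
    """SRS: every blubber is equally likely to be chosen (no repeats)."""
    return rng.sample(POPULATION, size)


def stratified_sample(rng, strata, per_stratum=2):
    """Stratified: an SRS of the same size from each stratum."""
    return [b for stratum in strata for b in rng.sample(stratum, per_stratum)]


def cluster_from(r):
    """Cluster: blubber 5r together with the four numbers preceding it."""
    return list(range(5 * r - 4, 5 * r + 1))


def systematic_from(start):
    """Systematic: the starting blubber, then every 20th blubber after it."""
    return list(range(start, 101, 20))


rng = random.Random()  # pass a seed, e.g. random.Random(1), for repeatable picks
print("SRS:       ", sorted(simple_random_sample(rng)))
print("Stratified:", stratified_sample(rng, STRATA))
print("Cluster:   ", cluster_from(rng.randint(1, 20)))
print("Systematic:", systematic_from(rng.randint(1, 20)))
```

With r = 11 the cluster function returns blubbers 51 through 55, and with a start of 7 the systematic function returns 7, 27, 47, 67, 87, matching the examples in the procedure.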

Jellyblubber Sampling Means

	Sample Width
Sample Number	Judgment	SRS	Stratified	Cluster	Systematic











Mean of Samples

22. Discrimination or Not?

The scenario:

At Main Street Bank last year, 48 male bank supervisors were each given a personnel file and asked to judge whether Pat Tecake, the person represented in the file, should be recommended for promotion to a branch-manager position. The files given to the 48 supervisors were identical except that half of the files (24) were labelled “male” and half of the files were labelled “female”. Of the 48 files reviewed, 35 were recommended for promotion.

In this Activity:

You explore the notion of chance variation and see how to use simulations to determine whether an outcome can be explained by chance variation.

Materials:

You will need a deck of cards and pencils.

Procedure:

Suppose that the recommendations showed no evidence of discrimination on the basis of gender. How many male Pats would you expect to be recommended for promotion? How many females? Enter these values in the table below.

No Discrimination by Gender

	Promotion	No Promotion	Total
Male			24
Female			24
	35	13	48

Now suppose that the recommendations showed strong evidence of discrimination on the basis of gender. How many male Pats might you expect to be recommended for promotion? How many females? Complete the table below to show a possible example of this case.

Discrimination on the Basis of Gender

	Promotion	No Promotion	Total
Male			24
Female			24
	35	13	48

Suppose the evidence for discrimination was inconclusive, neither strongly in favor nor strongly against. Complete the following table to illustrate this situation.

Inconclusive

	Promotion	No Promotion	Total
Male			24
Female			24
	35	13	48

In the actual situation in the study, the results were that 21 of the 24 files labelled “male” were recommended for promotion, and 14 of the 24 files labelled “female” were recommended for promotion. Enter these data into the table below.

Actual

	Promotion	No Promotion	Total
Male			24
Female			24
	35	13	48

In the actual situation, what percentage of the recommended candidates were male? Female?

Do you think there is evidence of discrimination against the female candidates? How certain are you?
How likely do you think that mere chance was responsible for the smaller number of females recommended for promotion?
If you were the attorney retained by the female applicants how would you go about collecting evidence to decide whether the results occurred by chance or whether there really was discrimination?

Simulating the Case

If there really were no difference between the male and female candidates then you might as well roll a die or use playing cards to pick male or female completely at random. Here’s how you will do it.

Remove two red cards and two black cards from a full deck of cards. You now have 48 cards, 24 of each color. Let black represent male and red represent female.
Shuffle the cards thoroughly and deal out 35 cards to represent the 35 candidates who were recommended for promotion. (You could do this more efficiently by dealing out 13 cards to represent the candidates who were not recommended for promotion.)
Count the number of black cards to represent the number of men recommended for promotion. Record this number in the table below. And then repeat steps 10 and 11 nineteen more times for a total of 20 simulations.

Simulation Data

Trial #	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
# of Black Cards

When your table is complete, record your findings on the grid given by placing a dot above the number of black cards you counted. The dots will stack up as the numbers are repeated.

Estimate the chances that 21 or more black cards (males) would have been selected based on your simulation.
Based on your simulation do you think there is any evidence to support the claim that recommending 21 males out of 35 candidates was due to discrimination rather than to chance variation? In other words, how do your simulated results compare with those of the original study?
How do your simulated results compare with those of your classmates?
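
Twenty hand-dealt simulations give only a rough estimate, so it can be instructive to let a computer deal many more times. The following is a minimal Python sketch of the same card simulation, using the counts from this lab: 24 "male" cards, 24 "female" cards and 35 cards dealt. As a cross-check that is not part of the original lab, it also computes the exact chance of 21 or more male cards by counting hands directly (the hypergeometric probability). The function names are illustrative.

```python
import random
from math import comb


def simulate(trials, rng):
    """Deal 35 of the 48 cards at random; return the male count for each trial."""
    deck = ["M"] * 24 + ["F"] * 24
    counts = []
    for _ in range(trials):
        rng.shuffle(deck)
        counts.append(deck[:35].count("M"))
    return counts


def exact_tail(at_least=21):
    """Exact chance that at_least or more of the 35 dealt cards are male."""
    favorable = sum(comb(24, k) * comb(24, 35 - k) for k in range(at_least, 25))
    return favorable / comb(48, 35)


rng = random.Random(1)  # fixed seed so that the run is repeatable
counts = simulate(10_000, rng)
estimate = sum(c >= 21 for c in counts) / len(counts)

print(f"simulated chance of 21 or more males: {estimate:.4f}")
print(f"exact chance of 21 or more males:     {exact_tail():.4f}")
```

A small chance here means that a split as lopsided as the one in the study would rarely arise from shuffling alone.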

23. Sudoku Experiment Part II

Before we begin:

The class will need to have completed Sudoku Experiment Part I in order to complete this lab.

Questions:

Perhaps the differences in times to complete a puzzle were not actually due to which type of puzzle was solved, but occurred only by chance. How could we check to see if the difference in completion times that we recorded was due simply to chance? What does that question mean in a statistical context?

In this Activity:

Students will explore the difference of two means to test the strength of their conclusions. Students will analyze the data again, this time using groups of times that are randomly selected.

Materials:

Students need the data collected in Sudoku Experiment Part I and note cards.

Procedure:

Using one notecard per Sudoku puzzle, write only the time it took to complete the puzzle on the notecard. Once you have recorded each time on a separate notecard, shuffle the note cards. Split the deck into two piles divided roughly in half. Compute the mean time for each of the two piles and find their difference. Record your numbers in the table below:

Sudoku Random Grouping Means

Simulation	1	2	3	4	5	6	7	8	Experimental Data
Group 1
Group 2
Difference

Shuffle the deck again, and repeat the process of dividing the deck into two parts and computing the mean of each part. Record your data in your table and then share your data with the whole class.
Make a dot plot of all of the differences between group means.
Describe the distribution of the differences in sample means that were collected.
Did any of the simulations result in a difference value as large as the one initially found?
Use your calculator to compute the standard deviation of the differences. How many standard deviations away from the mean does the original difference fall? Does this suggest that there is a significant difference in the time it takes to do each kind of puzzle, or do you think that the difference was just a matter of chance?
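The shuffle-and-split procedure above can also be sketched in Python. The completion times below are hypothetical placeholders (replace them with your Part I data), and the function name and number of repeats are assumptions made for illustration.

```python
import random
import statistics

# Hypothetical completion times in minutes; replace with your Part I data.
group_a = [5.2, 6.1, 4.8, 7.0, 5.5, 6.3]
group_b = [6.8, 7.4, 6.0, 8.1, 7.2, 6.9]
observed = statistics.mean(group_b) - statistics.mean(group_a)

def shuffled_differences(times, first_size, repeats, seed=None):
    """Shuffle all the 'note cards', split them into two piles, and difference the means."""
    rng = random.Random(seed)
    cards = list(times)
    diffs = []
    for _ in range(repeats):
        rng.shuffle(cards)
        pile1, pile2 = cards[:first_size], cards[first_size:]
        diffs.append(statistics.mean(pile2) - statistics.mean(pile1))
    return diffs

diffs = shuffled_differences(group_a + group_b, len(group_a), repeats=1000, seed=2)
spread = statistics.stdev(diffs)
z = (observed - statistics.mean(diffs)) / spread
as_large = sum(abs(d) >= abs(observed) for d in diffs)
print(f"observed difference: {observed:.2f}")
print(f"standard deviations from the mean of the shuffled differences: {z:.2f}")
print(f"shuffles with a difference at least as large: {as_large} of {len(diffs)}")
```

If very few shuffles produce a difference as large as the observed one, chance alone is a less convincing explanation.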

24. Titanic

Before we begin:

The data in this lab come from:

www.encyclopedia-titanica.org/titanic-statistics.html

and some of the questions come from: Statistics in Action, A Watkins, R Scheaffer, G Cobb. Key Curriculum Press, Emeryville, 2008, ISBN 978-1-55953-909-8.

Questions:

You’ve heard expressions like “the chance of striking out a right-handed batter given that you are a left-handed pitcher” or “the chance of having a boy given that you already have 2 girls.” How can you answer these questions effectively?

In this Activity:

In this lab, we explore these expressions of conditional probability. We begin with the disastrous event of April 15th, 1912, when the unsinkable Royal Mail Ship Titanic struck an iceberg in the North Atlantic and sank. Only 710 of her 2204 passengers and crew survived. The following two-way table records the data on the fate of her passengers.

Materials:

Pencils and paper.

Procedure:

Calculate the following probabilities. Leave your answers in fraction form.

If one passenger is randomly selected, what is the probability that the passenger was in first class?
If one passenger is randomly selected, what is the probability that this passenger survived?
If one passenger is randomly selected, what is the probability that this passenger was in first class and survived?
If one passenger is randomly selected, what is the probability that this passenger was either in first class or survived or possibly both?
If one of the passengers is randomly selected from the first class passengers, what is the probability that this passenger survived?
If one of the passengers is randomly selected from the group of survivors, what is the probability that this passenger was in first class?

Why is the answer to part E above larger than the answer to part C above?
The questions asked in parts E and F are examples of conditional probability. In part E we already know that the passenger is in first class. This is the condition. We could rephrase the question as, “What is the probability that the passenger survived, given that he/she is in first class?” Rephrase the question to part F using this phrasing.
How are your answers to parts A, C and E above related?
A tree diagram is a great way to help answer the previous question. Below is a part of the tree which illustrates this story.

Each lowercase letter in the diagram should be replaced with the probability of the event at the end of the branch. You have already computed some of the probabilities; add all of the probabilities to the tree.

The conventional shorthand for writing the probability of event S is P(S), so P(first class) = 324/1317. Use your answers to question 5 to find a formula for P(S given T) in terms of P(S and T) and P(T).

One more example before we go. This problem is a classic example of a probability question with a surprising outcome. Suppose there is a rare but serious disease that affects 1% (0.01) of the population. A test for the disease returns a positive result 95% (0.95) of the time when the person tested is in fact infected. The test also returns a negative result 95% of the time when the person tested is in fact not infected.

Suppose that during a routine physical you take the test which turns out positive. The test says that you are infected.

What is the probability that the test reports a positive result if you are infected?
What is the probability that you are infected if the test reports a positive result?

It may help to assume that the population is one million people. You might also want to use a tree diagram or a two-way table like the one shown earlier in this lab. In the case of the table the variables would be test: positive or test: negative and infected or not infected.
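Once you have attempted the questions, the one-million-person two-way table can be checked with a short Python sketch. The variable names are illustrative assumptions; the rates are the ones stated above.

```python
population = 1_000_000
prevalence = 0.01     # 1% of the population is infected
sensitivity = 0.95    # P(test positive given infected)
specificity = 0.95    # P(test negative given not infected)

infected = round(population * prevalence)
not_infected = population - infected

true_pos = round(infected * sensitivity)             # infected and test positive
false_neg = infected - true_pos                      # infected and test negative
false_pos = round(not_infected * (1 - specificity))  # not infected and test positive
true_neg = not_infected - false_pos                  # not infected and test negative

p_positive_given_infected = true_pos / infected
p_infected_given_positive = true_pos / (true_pos + false_pos)

print("               infected  not infected")
print(f"test positive {true_pos:9d} {false_pos:13d}")
print(f"test negative {false_neg:9d} {true_neg:13d}")
print(f"P(positive given infected) = {p_positive_given_infected:.3f}")
print(f"P(infected given positive) = {p_infected_given_positive:.3f}")
```

Notice how different the two conditional probabilities are, even though both are read from the same table.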

25. Probability and Independent Events

Before we begin:

This is a short worksheet on statistical independence. It seems reasonable to say that the price of a dozen eggs in Exeter and today’s temperature in London are independent events. Knowing the value of one gives you no further information about the other. But what about the events “X was in first class” and “X did not survive”?

In this Activity:

You will explore the notion of statistical independence. The statistical definition of independence is that A and B are independent events if and only if P(A given B) = P(A). In other words, knowing that B has occurred does not change the probability of A.

Materials:

You will need graph paper, a deck of cards and a pencil.

Procedure:

Are the events “passenger survived” and “passenger was in first class” independent events? Support your answer with appropriate probability calculations.
Are the events “passenger survived” and “passenger was in third class” independent events? Support your answer with appropriate probability calculations.
Use appropriate probability calculations to support the statement: “Not all passengers on board the Titanic had the same probability of surviving.”
From a well-shuffled regular 52-card deck, pick a card at random. Note its color and replace it. Take a second random card from the same deck. Note its color and replace it. What is the probability that both cards are the same color?
From a well-shuffled regular 52-card deck, pick a card at random. Note its color and do not replace it. Take a second random card from the same deck. Note its color and replace it. What is the probability that both cards are the same color?
Draw a tree diagram to illustrate the probabilities in questions 4 and 5.
Would you agree with the following statement? A and B are independent if and only if P(A and B) = P(A) ∗ P(B).
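To check your answers to the two card questions, the exact probabilities can be computed with fractions. This is a minimal sketch; the variable names are assumptions made for illustration.

```python
from fractions import Fraction

deck_size, per_color = 52, 26
p_first_red = Fraction(per_color, deck_size)

# First card replaced: the second draw sees the full deck again.
p_second_red_given_first_red_repl = Fraction(per_color, deck_size)
# First card not replaced: one card of that color is missing from a 51-card deck.
p_second_red_given_first_red_norepl = Fraction(per_color - 1, deck_size - 1)

# "Same color" means red-red or black-black; by symmetry, double the red-red case.
same_with_replacement = 2 * p_first_red * p_second_red_given_first_red_repl
same_without_replacement = 2 * p_first_red * p_second_red_given_first_red_norepl

print("same color, first card replaced:    ", same_with_replacement)
print("same color, first card not replaced:", same_without_replacement)
```

Compare each conditional probability with the unconditional probability that the second card is red (one half) to see in which case the two draws are independent.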

26. Hand Eye Coordination

Questions:

Do you know if your friends are right-handed or left-handed? Can you use this information to find out if they are right-eyed or left-eyed?

In this Activity:

This lab explores the independence of two attributes and gives an example of a two-way table.

Materials:

People, pencils.

Procedure:

Determine whether each person in your group is right-eyed or left-eyed. The procedure for determining which is simple. Pick out an object that is 15 feet or so away from you and face the object. Hold your hands together, palms out, at arm’s length, making a small space that you can see through. Look at the chosen object. Now close your right eye, keeping your left eye open. Can you still see the object? If yes, congratulations, you are left-eyed. Repeat, but this time closing your left eye and keeping your right eye open. Can you still see the object? If yes, congratulations, you are right-eyed. Record your personal data below:

Your Hand Eye Information

Name	Eyedness L or R	Handedness L or R

Share your data with the whole group by filling in a large table with the name and hand and eye data. Discuss your findings. Do you see any trends?
A two-way table is a great way to summarize data such as the hand and eye data you have collected. There are four possible combinations for handedness and eyedness, either LL, LR, RL or RR. Count the number of each type of person and record it in the two-way table below:

Hand and Eye Two-Way Table

	Handedness
Eyedness	Right	Left
Right
Left

For a randomly selected member of the class, are handedness and eyedness independent? Discuss with your group and the class.
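One way to check independence from a two-way table is to compare a conditional probability with the matching unconditional one. The counts below are hypothetical, made up only to show the arithmetic; use your class's table instead.

```python
# Hypothetical class counts keyed by (eyedness, handedness); replace with your own data.
counts = {("R", "R"): 14, ("R", "L"): 2, ("L", "R"): 6, ("L", "L"): 2}

total = sum(counts.values())
right_eyed = sum(n for (eye, _), n in counts.items() if eye == "R")
right_handed = sum(n for (_, hand), n in counts.items() if hand == "R")

p_right_eyed = right_eyed / total
p_right_eyed_given_right_handed = counts[("R", "R")] / right_handed

print(f"P(right-eyed)                    = {p_right_eyed:.3f}")
print(f"P(right-eyed given right-handed) = {p_right_eyed_given_right_handed:.3f}")
```

If the two printed values are close, the data are consistent with independence; with a small class, some difference is expected from chance alone.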

27. Music and Sports

Questions:

How do you compare data that are not numeric, unlike height in inches or time in seconds? What about data such as your favorite ice cream flavor, the color of your bicycle, or whether you bike, walk, drive, bus or train to work?

In this Activity:

This lab explores categorical data, which is different from the numeric quantitative data that we have worked with in all of our previous labs.

Materials:

People, pencils, graph paper.

Procedure:

Ask your neighbor the following questions and record the answers below:

Do you play a sport? Yes □ No □

Do you play a musical instrument (voice included)? Yes □ No □

Now collect all of the data from the class on the board and record it in the table below.

Music and Sports

	Sports?
Music?	Yes	No	Total
Yes
No
Total

Discuss your observations as a group.

Is there an association between playing a sport and playing a musical instrument? If you play a sport, are you more or less likely to play a musical instrument?

Of those who play sports, what proportion also play a musical instrument? Of those who do not play sports, what proportion play a musical instrument? Does there seem to be an association?
One way to visualize this data is to create a graph, similar to a histogram with two bars of equal heights, one labeled Yes, and one labeled No, based on whether the person plays sports or not. Then each bar is colored with one color to indicate Does play an instrument and with another color to indicate Does not play an instrument. Now compare areas and see if you change your conclusions about the association between playing a sport and playing an instrument.
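The equal-height bars described above can be roughed out in text before you draw them on graph paper. The counts in this sketch are hypothetical, and the function names and bar width are assumptions made for illustration.

```python
# Hypothetical class counts; replace with the totals from the board.
table = {
    "plays sports":   {"music": 9, "no music": 6},
    "does not play":  {"music": 4, "no music": 6},
}

def music_share(row):
    """Proportion of a sports group that plays an instrument."""
    return row["music"] / (row["music"] + row["no music"])

def segmented_bar(share, width=20):
    """Equal-width text bar: '#' marks musicians, '.' marks non-musicians."""
    filled = round(width * share)
    return "#" * filled + "." * (width - filled)

for group, row in table.items():
    share = music_share(row)
    print(f"{group:>14} |{segmented_bar(share)}| {share:.0%} play an instrument")
```

Because both bars have the same length, the filled portions can be compared directly even when the two groups differ in size.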

28. Expected Value Lab

Questions:

The school basketball team is in a one-and-one situation. This means that each time the team goes up for a free throw, they will get a second free throw if they make the first. Shura is a 60% shooter. This means that in the long run, Shura makes a basket 60% of the time. On any given trip to the free throw line, how many points do you expect Shura to make: 0, if she misses the first shot; 1, if she makes the first shot and misses the second; or 2, if she makes both shots?

In this Activity:

Students will simulate Shura’s free throw attempts and learn about expected value.

Materials:

You will need a pencil or pen, a bobby pin and the spinner provided below.

Procedure:

Discuss with your classmates how many points you expect Shura to score on any trip to the free throw line.
In 100 trips to the line, about how many points will Shura make in total? Discuss again.
Over the long run, what do you think will be Shura’s average number of points per trip to the line? Discuss how to compute this number with your classmates.
Place a pen or pencil through the bobby pin you have been given at the center of the circle to make a spinner. Try flicking the bobby pin to see if you can simulate Shura’s hitting or missing the hoop.
Divide into pairs, and assign one person to record and one person to “throw”. Use the spinner to simulate 20 trips to the free throw line, and record your results in the spinner’s table. Switch jobs, and simulate 20 more trips, recording your results in the spinner’s table.

Shura’s Simulation

Points	Frequency	Total	Approximate Probability
0			/ 20
1			/ 20
2			/ 20
		20

Combine the data from the whole class and enter the data in the table below.

Shura’s Simulation, Class Data

Points	Total Frequency	Total Frequency/Total Number of Trials
0
1
2

Based on the class data, what is Shura’s most likely score, 0, 1 or 2?
Below is a tree diagram that helps analyze the theoretical probabilities of Shura’s outcomes. Discuss the graph with your partner. Can you explain the notation P(A, B)? What is the sum of the three probabilities of the three different outcomes, 0 points, 1 point or 2 points?

Theoretically, in 100 trips to the free throw line, how many 0’s, 1’s and 2’s can we expect Shura to score?
What is Shura’s average score per trip to the line? This number is called the expected value of Shura’s score.
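After you have worked through the spinner and the tree diagram, the same one-and-one situation can be sketched in Python. The function name, the seed and the 10,000 simulated trips are assumptions made for illustration; the 60% shooting rate is the one given above.

```python
import random

def one_and_one(p_make, rng):
    """Points scored on one trip: the second shot happens only if the first is made."""
    if rng.random() >= p_make:
        return 0                                # missed the first shot
    return 2 if rng.random() < p_make else 1    # made the first; second made or missed

rng = random.Random(3)
trips = [one_and_one(0.6, rng) for _ in range(10_000)]
simulated_mean = sum(trips) / len(trips)

# Theoretical probabilities read off the tree diagram.
p = 0.6
theory = {0: 1 - p, 1: p * (1 - p), 2: p * p}
expected_value = sum(points * prob for points, prob in theory.items())

for points in (0, 1, 2):
    print(f"{points} points: simulated {trips.count(points) / len(trips):.3f}, theory {theory[points]:.3f}")
print(f"average per trip: simulated {simulated_mean:.3f}, expected value {expected_value:.3f}")
```

The simulated average should settle near the expected value as the number of trips grows.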

29. M&M Concentration

In this Activity:

We will use a sample to estimate the percentage of green M&Ms in a population. We will also look at sample variation to explore how confident we are in using a single sample statistic to estimate a parameter.

Materials:

A large bin filled with standard M&Ms (at least 500), napkins and Dixie cups for each student. You will also want to draw a large table on the board with the headings, Sample Number and Percent Green M&M’s.

Procedure:

Scoop a cup full of M&Ms from the bowl. Pour them out onto a napkin and make sure you have a sample of exactly 30 M&Ms. You may have to add a few or take some away. Try to do this without sorting out or adding a particular color. You may want to close your eyes to select which you will add or take away.
Count the green M&Ms present in your sample. Calculate the percentage of your sample that is green. Round to the nearest percentage point.
Record your value on the board with the rest of the class data.
Repeat the sampling process until your class has 100 sample statistics. Make a distribution graph of the results.
Based on your class’s data, what do you think is the true percentage of green M&Ms? If we were to use the first sample statistic to estimate the parameter, the percentage of green M&Ms in the entire bowl, what would our estimate be? Is another sample a better estimate?
Calculate the range of percentage values in the sample statistics. How confident are you that you and your classmates have found the value of this parameter?
Using the class data, find a range of values that would include 90% to 95% of the sample statistics from the class. Are you convinced that the parameter falls into this range?
As a class, repeat the process of collecting sample statistics, but this time, use samples of size 10.
Make a distribution graph of the class results for the samples of size 10. Discuss any differences that you see as a class. Did a smaller sample size produce different sampling variation? Could you have predicted this outcome?
Suppose you are a quality control inspector at the M&M plant. On March 1 you take a sample of 30 M&Ms and determine that 10 are green. Should you worry that there is a problem with the mixing station? On April 1 you pull a sample of 30 and find that 13 of them are green. Should you report a problem now?
In 2008, Mars stated that their candy mix includes 16% green candies. How does this value compare to yours?
Topics for further discussion include: Were your samples biased? How many samples do you need to make a reasonable guess at the parameter’s value? What have other interested consumers found to be the value of the parameter?
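The effect of sample size on sampling variation can be previewed with a short simulation. This is a sketch under stated assumptions: the bin holds exactly 500 candies, 16% of them green (the 2008 figure quoted above), and the function name and seed are illustrative.

```python
import random
import statistics

def sample_percents(candies, size, samples, rng):
    """Percent green in each of `samples` scoops of `size` candies."""
    return [100 * sum(rng.sample(candies, size)) / size for _ in range(samples)]

rng = random.Random(4)
# Assumption: 500 candies, 80 green (1) and 420 of other colors (0).
candies = [1] * 80 + [0] * 420

pct30 = sample_percents(candies, 30, 100, rng)
pct10 = sample_percents(candies, 10, 100, rng)

for label, pcts in (("n = 30", pct30), ("n = 10", pct10)):
    print(f"{label}: mean {statistics.mean(pcts):5.1f}%, "
          f"range {min(pcts):.0f}% to {max(pcts):.0f}%, "
          f"spread {statistics.pstdev(pcts):4.1f}")
```

Compare the two printed ranges with the ones your class found for samples of 30 and of 10.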

30. Simulation Lab

Questions:

The Blood Bank of the Redwoods in Santa Rosa, California is running low on its pints of type A blood. Long-term statistics kept by the blood bank indicate that 40% of donors have type A blood. In a recent blood drive, the blood type of the donors was tested as they entered the Santa Rosa High School gymnasium to donate. What is the probability that it will take at most 4 donors to find one with type A blood?

In this Activity:

Students will use a table of random digits to simulate donor-testing and explore the empirical probability of finding a type A donor within 4 tests.

Materials:

You will need the table at the back of this book and pencil and paper.

Procedure:

Look at your table of random digits and notice that only the ten digits 0, 1, ..., 9 appear in the strings of digits. These digits will represent the population of all blood donors. Each digit represents one person. Assign 4 digits to represent those donors who have blood type A. Thus 40% of the digits, or people, in your population will have type A blood. The other 60%, represented by the remaining 6 digits, have another blood type.
Ask your teacher for a row assignment, and start reading the string of numbers on the left end of the row. (The gaps in the list are only for readability and have no other significance.) Read until you find a type A blood donor. When you find a type A blood donor, stop and count how many digits you have read, or donors you have tested since the last type A donor appeared. If your type A appears within the first 4 numbers, record that in the table below as a successful trial and if not, record that as a failed trial. Begin a new trial after each occurrence of a type A blood donor.

Trials

Successful	Failed	Total
		40

Continue this process until you have completed 40 trials. In how many trials did it take at most 4 donors to find one with type A blood?
What is your empirical probability of the blood bank finding a type A donor in at most 4 attempts?
Combine the class results and discuss your findings. Discuss the best way to visualize the results. Discuss the distribution of results.
The theoretical probability is 0.8704. How close is this number to the mean of the class’s results?
How would you change your process if 60% of donors were type A?
How would your process change if success meant finding, within 4 donors, a donor of type A, B+ or AB−, which long-term statistics indicate occur at rates of 40%, 9% and 1% respectively?
How would you change your process if 45% of donors were type A?
One way to determine the theoretical probability above is:

1 − P(x ∉ {1, 2, 3, 4}) = 1 − (0.6)⁴ = 0.8704.

Another way is:

P(x ≤ 4) = P(x = 1) + P(x = 2) + P(x = 3) + P(x = 4)

P(x ≤ 4) = (0.4) + (0.6)(0.4) + (0.6)²(0.4) + (0.6)³(0.4) = 0.8704.

Can you think of another?
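The random-digit procedure can also be sketched in Python, with a random number generator standing in for the printed table. The choice of digits 0 through 3 for type A, the function name and the 10,000 trials are assumptions made for illustration.

```python
import random

def donors_until_type_a(rng, type_a_digits=(0, 1, 2, 3)):
    """Read random digits until one stands for a type A donor; return how many were read."""
    tested = 0
    while True:
        tested += 1
        if rng.randrange(10) in type_a_digits:   # 4 of the 10 digits mean type A
            return tested

rng = random.Random(5)
trials = [donors_until_type_a(rng) for _ in range(10_000)]
empirical = sum(t <= 4 for t in trials) / len(trials)
theoretical = 1 - 0.6 ** 4

print(f"empirical P(at most 4 donors)   = {empirical:.4f}")
print(f"theoretical P(at most 4 donors) = {theoretical:.4f}")
```

Changing the tuple of type A digits is one way to explore the follow-up question about 60% of donors.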

31. Crop Sampling

Before we begin:

This lab was adapted from:

a lab designed by Carolyn Doetsch, Peter Flanagan-Hyde, Mary Harrison, Josh Tabor and Chuck Tiberio for the North Carolina School of Science and Mathematics Statistics Leadership Institute and found at:

https://courses.ncssm.edu/math/Stat_inst01/PDFS/river.pdf . See also:

https://docplayer.net/27375292-An-exercise-in-sampling-rolling-down-the-river.html

Questions:

At the beginning of the spring a farmer cleared a new field and planted a first crop of corn. The new field is a unique plot of land in that a river runs along one side. The corn looks good in some areas of the field but not in others. The farmer is not sure that harvesting the field is worth the expense. He has decided to harvest 10 plots and use this information to estimate the total yield. Based on this estimate, he will decide whether to harvest the remaining plots and also whether to use this field next year.

In this Activity:

You will explore four different sampling methods (convenience samples, simple random samples, vertical strata and horizontal strata) and discuss their merits and pitfalls.

Materials:

A calculator, pencils and paper.

Procedure:

Convenience Sample. The farmer began by choosing 10 plots that would be easy to harvest. They are filled in on the grid below:

Simple Random Sample. Feeling uneasy about his plot selection, the farmer talks to his daughter who is taking statistics at Wheatridge High School, and asks if she could suggest a better choice of 10 plots to harvest early and use to estimate his total yield at the end of the growing season. His daughter suggests three other methods. The first is a simple random sample. Use your calculator or a random number table to choose 10 random plots, and mark them on the grid below:

Stratified Sample, Vertical. For this method, consider the field as grouped in vertical columns (called strata). Using your calculator or a random number table, randomly choose one plot from each vertical stratum and mark these plots on the grid.

Stratified Sample, Horizontal. For this method, consider the field as grouped in horizontal rows (also called strata). Using your calculator or a random number table, randomly choose one plot from each horizontal stratum and mark these plots on the grid.

The actual harvest yields per plot are shown in the table below. (Of course, the farmer does not have access to this information ahead of time, or there would be no need to use sample plots.)

Compare the different sampling methods discussed by the farmer and his daughter, and record your findings in the table.

Method	Mean yield per plot	Estimate of total yield
Convenience sample
Simple random sample
Vertical strata
Horizontal strata

You have looked at four different methods of choosing plots. Is there a reason, other than convenience, to choose one method over the other?
How did your estimates vary according to the different sampling methods you used?
Compare your results to the rest of the class and discuss your findings.

Pool all the results from the class and make a boxplot of the data for the simple random sample, which we will call SRS. Do the same for the vertical strata and the horizontal strata. Discuss the results.
Which sampling method is best for the farmer?
What was the actual yield of the farmer’s field, and how did the boxplots relate to this value?
A year has gone by, and the farmer has installed an irrigation system to try to even out the yields in his field. He and his daughter decide to sample his plots using an SRS, and vertically and horizontally stratified samples. Repeat the process of generating a list of plots to sample for each method, and in each case, mark your chosen plots on the grids below:

The actual harvest yields per plot post irrigation are shown in the table below:

Again, compare the different sampling methods and record your findings in the table.

Method	Mean yield per plot	Estimate of total yield
Convenience sample
Simple random sample
Vertical strata
Horizontal strata

Compare the class box plots of the sample means obtained from the three sampling methods. Discuss with the class.
Based on the results of both the initial sampling and the post-irrigation sampling, under what conditions is it more useful to use stratified sampling? Random sampling?
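The comparison of the three random sampling methods can be sketched in Python. Everything about the field here is an assumption made for illustration: a 10-by-10 grid of plots, a river along the right-hand edge, and made-up yields that rise toward the river. Substitute the yields from your own table.

```python
import random
import statistics

SIZE = 10  # assumption: a 10-by-10 field of 100 plots

# Hypothetical yields that rise from left to right, as if the river ran along
# the right-hand edge; the small row-to-row effect is also made up.
yields = [[10 + 5 * col + (row % 3) for col in range(SIZE)] for row in range(SIZE)]
true_total = sum(sum(row) for row in yields)

def simple_random(rng):
    all_plots = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    return rng.sample(all_plots, SIZE)

def vertical_strata(rng):
    # one randomly chosen plot from each column
    return [(rng.randrange(SIZE), c) for c in range(SIZE)]

def horizontal_strata(rng):
    # one randomly chosen plot from each row
    return [(r, rng.randrange(SIZE)) for r in range(SIZE)]

def estimate_total(plots):
    mean_yield = statistics.mean(yields[r][c] for r, c in plots)
    return mean_yield * SIZE * SIZE

rng = random.Random(6)
methods = {"SRS": simple_random, "vertical": vertical_strata, "horizontal": horizontal_strata}
spreads = {}
for name, method in methods.items():
    estimates = [estimate_total(method(rng)) for _ in range(200)]
    spreads[name] = statistics.pstdev(estimates)
    print(f"{name:>10}: mean estimate {statistics.mean(estimates):7.0f}, spread {spreads[name]:6.0f}")
print("true total:", true_total)
```

In this made-up field the yield depends mostly on the column, so think about which kind of strata forces every sample to cover the whole range of yields.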

32. Anscombe’s Quartet

Questions:

Statistics is a relatively new, and very active and dynamic, area of research and application.

In this Activity:

Students will read a mathematical article from 1973, and consider an important data set called Anscombe’s Quartet.

Materials:

You will need the following table of data:

Procedure:

Read the Introduction to the article.
Use the data table to create your own scatter plots of the data and discuss with your classmates.
Read the rest of the article for homework and see how Anscombe’s ideas compare with your and your classmates’ ideas.

33. Cereal Box Problem

Before we begin:

This lab is based on the following article:

The Cereal Box Problem Revisited, Jesse L. M. Wilkins, Virginia Polytechnic Institute and State University. School Science and Mathematics (Volume 99(3), March 1999),

http://eric.ed.gov/?id=EJ590348

Questions:

Fastbreak Cereal Company wanted to promote their Star Oats cereal, so they made Harry Potter figurines (Harry, Ron, Hermione, Hagrid, Dumbledore and Draco) and placed one character in each box of Star Oats. You can imagine what a stir this caused. Assume equal chances of getting any of the six characters (i.e., Fastbreak did not withhold all of the boxes with Dumbledore from shipping to increase demand for its cereal beyond the increase certainly guaranteed by the insertion of prizes in its packages). How many boxes of Star Oats should any kid expect to have to buy, open and consume in order to get a full set of six characters?

In this Activity:

Students will be introduced to Monte Carlo simulation, the process of simulating an event that is difficult, expensive or otherwise impossible to carry out. They will explore the answer to the above question and discover an empirical answer. They will also explore ways to understand and generate a theoretical answer to the question of how many boxes of cereal one must open to collect all six prizes.

Materials:

You will need one die for each two students and pencils.

Procedure:

Break into teams of two, pick a die and assign one of the six numbers to each of the Harry Potter characters. This die will allow us to model buying, opening and consuming boxes of cereal without actually buying, opening and consuming boxes of Star Oats.

Characters’ Numbers

Character	Harry	Ron	Hermione	Hagrid	Dumbledore	Draco
Number

Have one person roll the die (buy, open and consume boxes of Star Oats) until they obtain all 6 numbers (characters). Using the table below, the other person can record the numbers as they are obtained and after they have all been obtained, count the number of rolls. Divide the work of generating 100 simulated trials amongst the pairs in your class.

Trial Outcomes

Trial	1	2	3	4	5	6	N
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

When your class is done collecting the data, find the average number of purchases required to obtain all six characters. This average is the expected number of purchases or the expected outcome. Discuss this number with your classmates.
If you have a TI-89 or TI-84, ask your instructor for the procedure which will simulate as many rolls of the die as you would like and find the average value of all of the simulations. This is a Monte Carlo method. Give it a try, and then discuss your results with your classmates.
The theoretical answer to the question of how many boxes of cereal a person should expect to open in order to find all 6 prizes is 14.7. How does this answer compare with your findings?
If you want, you can modify the procedure given to help you understand the smaller components that go into answering the complicated question posed in this lab. Once you have opened one box and found a prize, how many boxes can you expect to open before you find a second different prize?
There are many approaches to determining the theoretical answer to the question posed in this lab. Here is an outline of one way.

First, the first box must be opened, and a prize obtained; call it P. Once that box has been opened, the chance that a new prize is in the next box is ⅚, and the chance that you have to open 2 boxes to obtain a new prize is (⅙)(⅚), because you have ⅙ of a chance of getting P again, and then ⅚ of a chance of getting a prize other than P. This goes on for a possibly infinite number of times, but the chance of needing to open yet another box decreases significantly after a certain point. If we multiply the number of boxes opened after the first one by the probability of finding our second new prize in that one, and add all of these numbers together, we obtain the expected number of boxes. It looks like this:

1(5/6) + 2(1/6)(5/6) + 3(1/6)²(5/6) + 4(1/6)³(5/6) + ...

And if we perform the same calculation for the third new prize, the fourth new prize, the fifth new prize and the final prize, we find that the expected number of boxes is:

1 + [1(5/6) + 2(1/6)(5/6) + 3(1/6)²(5/6) + ...] + [1(4/6) + 2(2/6)(4/6) + 3(2/6)²(4/6) + ...] + ... + [1(1/6) + 2(5/6)(1/6) + 3(5/6)²(1/6) + ...]

We will use what we know about convergent geometric series and differentiation to compute this value. You can see that the general form of each of the above sums is:

S = A + 2rA + 3r²A + 4r³A + ...

where 0 < r < 1 and A = 1 − r. And understanding that these series are convergent, we have:

S = A(1 + 2r + 3r² + 4r³ + ...) = A · d/dr (r + r² + r³ + r⁴ + ...) = A · d/dr [r/(1 − r)] = A/(1 − r)²

Now remember that A = 1 − r to see that S = 1/(1 − r). Applying this fact 5 times above, we obtain:

1 + 6/5 + 6/4 + 6/3 + 6/2 + 6/1 = 14.7
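If you do not have a calculator program handy, the Monte Carlo approach for this lab can be sketched in Python. The function name, the seed and the 10,000 simulated collectors are assumptions made for illustration; the six equally likely prizes come from the problem statement.

```python
import random
from fractions import Fraction

def boxes_for_full_set(rng, prizes=6):
    """Open boxes (roll the die) until every prize has been seen; return the number opened."""
    seen, boxes = set(), 0
    while len(seen) < prizes:
        seen.add(rng.randrange(prizes))   # each prize is equally likely
        boxes += 1
    return boxes

rng = random.Random(7)
trials = [boxes_for_full_set(rng) for _ in range(10_000)]
average = sum(trials) / len(trials)

# Sum of the expected waits for each new prize: 6/6 + 6/5 + 6/4 + 6/3 + 6/2 + 6/1.
theoretical = sum(Fraction(6, k) for k in range(1, 7))

print(f"simulated average number of boxes: {average:.2f}")
print(f"theoretical expected number:       {float(theoretical):.2f}")
```

Try a smaller number of trials, such as the 100 your class rolled by hand, to see how much the average moves from run to run.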

34. Nine-Block

Before we begin:

The words of the authors of this activity:

In sum, the 9-block form is designed to scaffold student insight into the systematicity and rigor of combinatorial analysis by serving as a template that helps students initially see, create, and organize the combinatorial space.

This lab was adapted from:

There Once Was a 9-Block ... A Middle-School Design for Probability and Statistics. Abrahamson, D., Janusz, R. M., and Wilensky, U. Journal of Statistics Education [Online] (2006).

www.amstat.org/publications/jse/v14n1/abrahamson.html

Questions:

How many different ways are there of filling up a nine-block grid with blue and green blocks? How would you count the number of ways?

In this Activity:

In this lab students create 9-blocks and discuss ways to count and record all possibilities. The goal is to create a tower of 9-blocks, and then have that as a visual from which to draw conclusions and test hypotheses.

Materials:

9-block grid sheets, colored pencils, crayons or dot markers.

Procedure:

Fill in each of the squares below with blue or green. Compare your square with your neighbor’s.

On the grid below draw 5 different possible blue and green 9-blocks. Compare your squares with your neighbor’s squares. How many are the same? How many are different?

How many possible 9-blocks can be made? How many different 9-blocks has your class made so far?
Together devise a strategy for creating all 9-blocks.
Together devise a strategy for displaying all 9-blocks.
How many different 9-blocks have 4 blue squares?
How many different 9-blocks have 6 blue squares?
How many different 9-blocks have a blue square in the top left-hand corner?
How many different 9-blocks have a blue square in each of the corners?
HOMEWORK: Make and display all 4-block possibilities and write three probability questions you might use your display to answer.
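Once your class has devised its own counting strategy, the counts can be checked by listing every 9-block by brute force. This is a minimal sketch; storing each block as a tuple of nine letters read row by row (so positions 0, 2, 6 and 8 are the corners) is an assumption made for illustration.

```python
from itertools import product
from math import comb

# Every 9-block as a tuple of 9 squares, "B" for blue and "G" for green.
blocks = list(product("BG", repeat=9))

total = len(blocks)
four_blue = sum(block.count("B") == 4 for block in blocks)
six_blue = sum(block.count("B") == 6 for block in blocks)
top_left_blue = sum(block[0] == "B" for block in blocks)
all_corners_blue = sum(all(block[i] == "B" for i in (0, 2, 6, 8)) for block in blocks)

print("all 9-blocks:            ", total)
print("exactly 4 blue squares:  ", four_blue)
print("exactly 6 blue squares:  ", six_blue)
print("blue top left-hand corner:", top_left_blue)
print("blue in every corner:    ", all_corners_blue)
```

Changing `repeat=9` to `repeat=4` lists the 4-blocks needed for the homework.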

35. An Age Discrimination Case

Before we begin:

This lab was adapted from:

Statistics in Action, Understanding a World of Data, A Watkins, R Scheaffer, G Cobb, Key Curriculum Press, 2008. ISBN 978-1-55953-909-8.

http://www.apstatsmonkey.com/StatsMonkey/TPS2e_files/IntroWestvaco.pdf

Background:

True story: Robert Martin turned 55 in 1991. Earlier that same year, the Westvaco Corporation, which makes paper products, decided to downsize. Several members of their Engineering Department were laid off, including Mr. Martin. Later that year, Martin sued Westvaco, claiming he had been laid off because of his age.

A major piece of his case was based on statistical analysis of the ages of employees at Westvaco.

Questions:

Did Mr. Martin’s case have merit? Justify your position using statistical analysis.

In this Activity:

This activity introduces exploratory data analysis and statistical inference. Its aim is to keep you from getting so caught up in the details of statistics that you lose sight of what they mean.

Procedure:

The table, Display 1.1 on the next page, presents data provided by the Envelope Division of the Westvaco Corporation (a pulp and paper company originally based in West Virginia and which was purchased by the Mead Corporation in January of 2002) to the lawyers of Robert Martin. Each row corresponds to one worker, and each column corresponds to a characteristic of the worker: job title, hourly vs. salaried, date of birth, date of hire, and age as of January 1, 1991. The “RIF” column indicates how the worker fared in the 5 rounds of layoffs. “1” indicates they were laid off in Round 1, etc. “0” indicates they were not laid off. Define the following in the context of the Westvaco case: Cases, Variables, Variability, Distribution. Why is variability important to the study of Statistics?

If you were on the jury in the Martin v. Westvaco Case, how would you use the information in Display 1.1 to decide if Westvaco tended to lay off older workers?
Throughout this course we have relied on a variety of displays to explore distributions of data. Distribution provides information on the values a particular variable takes on, how often it takes on those values, how spread out they are and whether or not any unusual values are present. S.O.C.S. is an acronym to help us remember the significant parts of our displays, specifically: Shape, Outliers, Center, and Spread. Examine the following dot plots:

Dotplot of Ages of Hourly Workers

Dotplot of Ages of Hourly Workers, Laid-off vs. Retained

Does the dotplot in Display 1.3 show a “clear cut” case of age discrimination, a possible case of discrimination, or no discrimination? Why, or why not?

The dotplots in Display 1.4, below, show the ages of the hourly workers laid off and retained by round. Compare the round-by-round information in Display 1.4 with the summary for all rounds in Display 1.3.

Which display provides a stronger case for discrimination? Why?

The dotplot in Display 1.5 is similar to Display 1.3, except that it includes salaried workers. Compare the plots for the hourly and salaried workers. Which provides stronger evidence in support of a claim for age discrimination? Why?

Display 1.6 is similar to Display 1.4. It shows the distributions of ages and layoffs for salaried workers.

Compare the pattern of salaried workers with the pattern for hourly workers in Display 1.4.

The summary table shown here classifies salaried workers using two Yes/No questions: Under 40? Laid off? Does the pattern in this table support Martin’s claim of discrimination? Why, or why not?

Construct a similar table for salaried workers, using 50 to divide the ages. Does the evidence in this new table provide stronger or weaker support for Martin’s case? Explain. Compare both.

Under 50?	Laid Off: Yes	Laid Off: No	TOTAL	% Yes
Yes
No
TOTAL

Whenever you think you have a message from data, you should be careful not to jump to conclusions. The patterns in the Westvaco data might be “real” -- meaning they reflect age discrimination. On the other hand, the patterns might be the result of “chance” -- management wasn’t discriminating on the basis of age, but simply happened to lay off a larger percentage of older workers.

You may feel the analysis so far ignores important facts like worker qualifications. That is true! However, the first step is to decide if, based on the data in Display 1.1, older workers were more likely to be laid off. If not, Martin’s case fails. If so, however, it is up to Westvaco to justify its actions.

Construct a dotplot similar to Display 1.3, comparing the ages of hourly workers who lost their jobs at some point during the first 3 rounds to the ages of hourly workers who still had their jobs at the end of round 3. How do the ages differ?

In question 7, you compared tables for salaried workers. Construct similar tables for hourly workers, using 40 as a cutoff for one and 50 as a cutoff for the other. Which table provides stronger evidence for discrimination? How do the patterns compare to the salaried workers?

STATISTICAL INFERENCE

The exploratory data analysis of the Westvaco Case suggests that older workers were more likely to be laid off than younger ones. One of the main arguments in the case dealt with what those patterns mean.

We don’t have enough information yet to infer that Westvaco has some explaining to do. Could the patterns happen if there were no discrimination?

Consider the ten hourly workers involved in round 2 layoffs.

Condense the data into a single number or summary statistic, such as the mean.

Calculate the average age of the three who lost their jobs.

Your calculation suggests that older workers may have been more likely to be laid off. However, could we get an average age of 58 or more if we picked three of the ten workers at random? How likely do you think it is that this would happen? If the probability of getting an average age of 58 or more at random turns out to be small, does that favor Martin or Westvaco?

Let’s see what happens when we randomly lay off three of the ten workers from Round 2.

Get 10 index cards from the teacher. Write each of the ten ages of the workers in Round 2 on the cards and mix them thoroughly.
Draw 3 cards and record the ages.
Calculate the average of the three ages in your sample to one decimal place.
Repeat the process until you have 10 sample averages.
Display the distribution of average ages in a summary dotplot.
Estimate the probability: Count how many average ages were 58 or older and divide by the total number of simulated trials.
Interpret your result: is it likely that Westvaco could have randomly selected 3 workers with an average age of 58 or older?
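The card-drawing procedure above can also be sketched in Python. This is a minimal illustration, not part of the original activity. The list of ten ages is an assumption (the values commonly quoted for Round 2 of this case study); replace it with the ages from Display 1.1. The function names are made up for this sketch.

```python
import random
from itertools import combinations

# ASSUMPTION: ages of the ten hourly workers in Round 2, as commonly quoted
# for this case study. Replace with the values from Display 1.1.
ROUND2_AGES = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
OBSERVED_MEAN = 58  # average age of the three workers actually laid off


def simulate(trials, seed=None):
    """Estimate P(average age of 3 randomly drawn workers >= 58)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        drawn = rng.sample(ROUND2_AGES, 3)  # draw 3 cards without replacement
        if sum(drawn) / 3 >= OBSERVED_MEAN:
            hits += 1
    return hits / trials


def exact_count():
    """Enumerate every possible 3-card draw and count the qualifying ones."""
    draws = list(combinations(ROUND2_AGES, 3))
    hits = sum(1 for d in draws if sum(d) >= 3 * OBSERVED_MEAN)
    return hits, len(draws)


if __name__ == "__main__":
    print("simulated estimate:", simulate(1000))
    hits, total = exact_count()
    print("exact:", hits, "out of", total)
```

With only 10 hand-drawn trials the estimate will vary a lot from student to student; pooling the class results, or running many simulated trials, gives a steadier estimate.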

Trial	Age 1	Age 2	Age 3	Average Age
1
2
3
4
5
6
7
8
9
10

Consider the class results. Use data from your classmates to determine:

Total number of Sample Averages ≥ 58: __________
Total number of Trials: __________
Probability of Randomly Selecting 3 Workers with an Average Age ≥ 58: ______

Inference is a statistical procedure that involves deciding whether or not an event can reasonably be attributed to chance, or whether you should investigate some other explanation. Why did we estimate the probability of getting an average age greater than or equal to 58 rather than just an average of 58?
How unlikely is “too unlikely”? At what point would you be convinced that something other than chance behavior was responsible for the selection of workers?
Suppose a friend wants to bet with you on the outcome of a coin toss. To check the fairness of the coin, you flip it 20 times. It comes up heads 19 times. Why does this evidence make it hard to believe the coin is fair? How could you simulate this situation to estimate how unlikely this result is?
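One possible way to approach the coin question, sketched in Python: simulate many sets of 20 fair tosses and count how often 19 or more heads appear, then compare with the exact binomial count. The function names are hypothetical and chosen only for this sketch.

```python
import random
from math import comb


def exact_tail(n=20, k=19):
    """P(at least k heads in n tosses of a fair coin), by counting outcomes."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


def simulated_tail(trials, n=20, k=19, seed=None):
    """Fraction of simulated n-toss experiments with at least k heads."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if heads >= k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("exact probability:", exact_tail())
    print("simulated estimate:", simulated_tail(100_000))
```

Because the event is so rare for a fair coin, a simulation with a modest number of trials will usually report zero occurrences, which is itself informative.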

36. Titanic Revisited

Before We Begin:

This activity was adapted from:

Laying the Foundation

In this activity:

This lesson is about conditional probability: The conditional probability of an event occurring, given that another event occurred, can be calculated by dividing the probability that both events occur by the probability that the known event happened. In other words, the probability of A given B is determined by: P(A|B) = P(A ∩ B) / P(B).

We look at independence of events. We examine multivariate data.

Questions:

Was race or socioeconomic status a factor in who survived the Titanic?

[Figure: Independence]

Materials:

You will need a pencil and paper.

Procedures:

Consider drawing a card from a standard 52-card playing deck. Let event R represent drawing a red card from the deck, and event Q represent drawing a queen.

The probability of it being a red card is P(R) = ____.
The probability of it being a queen and a red card is P(Q ∩ R) = ___.
The probability of it being a queen given that the card is red is:

P(Q|R) = P(Q ∩ R) / P(R) = _____.

Independence: If P(A|B) = P(A) or P(B|A) = P(B), then the events A and B are independent. The logic is that, if the knowledge that event B happened does not change the probability that A happens, then the two events are independent.
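As an illustration of this definition, the sketch below enumerates a standard deck and checks whether "queen" and "red" are independent by comparing P(Q|R) with P(Q). The helper names (prob, cond_prob, is_red, is_queen) are invented for this example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely outcomes


def prob(event):
    """Probability of an event (a yes/no test on a card), all cards equally likely."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def is_queen(card):
    return card[0] == "Q"


def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(lambda c: a(c) and b(c)) / prob(b)


p_queen_given_red = cond_prob(is_queen, is_red)
independent = p_queen_given_red == prob(is_queen)

if __name__ == "__main__":
    print("P(Q|R) =", p_queen_given_red, " P(Q) =", prob(is_queen))
    print("independent?", independent)
```

Using exact fractions avoids rounding issues when two probabilities are compared for equality.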

Consider the event of rolling a 3 on a fair six-sided die, and the event of getting heads when flipping a coin.

P(Three) = ____.
P(Heads) = ____.
P(Three ∩ Heads) = ____.

Consider now the conditional probability of rolling a three given the coin is heads.

P(Three|Heads) = P(Three ∩ Heads) / P(Heads) = ______.

The fact that the coin lands heads up does not affect the probability that the roll is a three.

Since P(Three|Heads) = P(Three), the two events are independent.

Independence is less obvious when working from a two-way table (bivariate data) where the probabilities are based on data in the table. One hundred customers at Mr. Salty’s pretzel shop in the mall were surveyed and asked about their purchases.

[Two-way table: pretzel and soda purchases of the 100 surveyed customers]

Using the table we can determine whether the events “bought a pretzel” and “bought a soda” are independent.

The probability of buying a pretzel is: P(pretzel) = ____.
The probability of buying a soda is: P(soda) = ____.
The probability of buying a pretzel and a soda is: P(pretzel ∩ soda) = _____.
Using that information, the probability of buying a soda, given the customer bought a pretzel, can be determined by:

P(soda|pretzel) = P(pretzel ∩ soda) / P(pretzel) = _____.

Compare: P(soda) ? P(soda|pretzel). Equal or not equal?
Since P(soda) ____ P(soda|pretzel), the events “bought a pretzel” and “bought a soda” are ______.
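The same comparison can be sketched for any two-way table of counts. The counts below are hypothetical and are not the values from the pretzel shop survey; the function name is also invented for this sketch.

```python
from fractions import Fraction


def independence_check(n_total, n_a, n_b, n_both):
    """Return (P(B), P(B|A), independent?) computed from two-way table counts."""
    p_b = Fraction(n_b, n_total)
    p_b_given_a = Fraction(n_both, n_a)  # equals P(A and B) / P(A)
    return p_b, p_b_given_a, p_b == p_b_given_a


# HYPOTHETICAL counts, not the survey's actual table:
# 100 customers, 60 bought a pretzel, 50 bought a soda, 40 bought both.
p_soda, p_soda_given_pretzel, independent = independence_check(100, 60, 50, 40)

if __name__ == "__main__":
    print("P(soda) =", p_soda)
    print("P(soda|pretzel) =", p_soda_given_pretzel)
    print("independent?", independent)
```

With real survey data the two probabilities are rarely exactly equal, so in practice one asks whether they are close enough to be explained by chance.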

For each of the following scenarios,

write the probability equation using set notation,
show the substitution and
state the final answer. Write all answers as fractions in simplest form.

On April 10, 1912, the Titanic set sail from Southampton, England on its maiden voyage. At a length of 882 feet, it was the largest passenger ship built up to that time. About 400 miles off the coast of Newfoundland, on April 14 at 11:40 PM, the Titanic struck an iceberg; it sank early the next morning. After the disaster, the British Parliament launched a formal investigation and issued a report, along with recommendations for future ships. The report is commonly referred to as Lord Mersey’s Report. Data in the two-way table are consistent with the data in the report.

[Two-way table: men, women, and children on board the Titanic, by survived or perished]

If a name is selected at random from the list of people on board the Titanic, what is the probability that:

the person survived?
the person was a woman?
the person was a child?
the person was a child and survived?
the person was a woman and perished?
the person was a man, given that the person perished?
the person survived, given that the person was a woman?
Based on your answers a-g, are the events “survived” and “woman” independent? Justify your answer by showing the work that leads to your answer. Remember, if two events are independent then P(A) = P(A|B).

There was a great deal of speculation about whether the third class passengers were discriminated against when it came to getting on the lifeboats. This two-way table is also consistent with the report.

[Two-way table: Titanic passengers and crew by class, by man/woman/child, and by survival]

What is the probability that a name randomly selected from the passenger and crew list:

is a child in first class?
is a male from the crew?
is a woman, given the person survived?
is a survivor, given the person traveled in first class?
Which of the following is less likely? Justify your answer.

Randomly selecting a survivor, given the person traveled in third class.
Randomly selecting a survivor, given the person was a man.

Are the events “first class” and “survived” independent? Justify your answer. (Make calculations to show why your choice is the correct one. Look at previous problems in this set for help on what you must do to determine whether events are independent.)

Some historians have argued that a person’s social status was the greatest determining factor in whether a person survived the Titanic disaster. Using your answer to question 4a and the conditional probabilities from question 5e, compare the effect of gender on survival and the effect of class on survival. Does it appear that class had a greater or lesser effect on survival than gender? Justify your response.

37. Homework, Activities and Exercises

Confirm that the five points in the table all lie on a single line. Write an equation for the line. Use your calculator to make a scatter plot, and graph the line on the same system of axes.
Given the line whose equation is y = 2x + 3 and the points A = (0,0), B = (1,9), C = (2,8), D = (3,3), and E = (4,10), do the following:

Plot the line and the points on the same axes.
Let A′ be the point on the line that has the same x-coordinate as A. Subtract the y-coordinate of A′ from the y-coordinate of A. The result is called the residual of A.
Calculate the other four residuals.
What does a residual tell you about the relation between a point and the line?
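The residual calculation described in this problem can be sketched as follows. The function name residual is chosen for this illustration; the line and points are the ones given in the problem.

```python
def residual(point, slope, intercept):
    """y-coordinate of the point minus the y-coordinate of the line at the same x."""
    x, y = point
    return y - (slope * x + intercept)


# The line y = 2x + 3 and the points A through E from the problem.
points = {"A": (0, 0), "B": (1, 9), "C": (2, 8), "D": (3, 3), "E": (4, 10)}
residuals = {name: residual(p, 2, 3) for name, p in points.items()}

if __name__ == "__main__":
    for name, r in residuals.items():
        print(name, r)
```

A positive residual means the point lies above the line, and a negative residual means it lies below.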

The table at right shows data that Morgan collected during a 10-mile bike ride that took 50 minutes. The cumulative distance (measured in miles) is tabled at ten-minute intervals.

Make a scatter plot of this data. Why might you expect the data points to line up? Why do they not line up?
Morgan’s next bike ride lasted for 90 minutes. Estimate its length (in miles), and explain your method. What if the bike ride had lasted t minutes; what would its length be, in miles?

Let P = (1.35, 4.26), Q = (5.81, 5.76), R = (19.63, 9.71), and R′ = (19.63, y), where R′ is on the line through P and Q. Calculate the residual value 9.71 − y.
(Continuation)

Given that Q′ = (5.81, y) is on the line through P and R, find y. Calculate 5.76 − y.
Given that P ′ = (1.35, y) is on the line through Q and R, find y. Calculate 4.26 − y.
Which of the three lines best fits the given data? Why do you think so?

Verify that P = (−1.15, 0.97), Q = (3.22, 2.75), and R = (9.21, 10.68) are not collinear.

Let Q′ = (3.22,y) be the point on the line through P and R that has the same x-coordinate as Q has. Find y, then calculate the residual value 2.75 − y.
Because the segment PR seems to provide the most accurate slope, one might regard PR as the line that best fits the given data. The point Q has as yet played no part in this decision, however. Find an equation for the line that is parallel to PR and that makes the sum of the three residuals zero. In this sense, this is the line of best fit.
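One way to find such a line, sketched in Python: keep the slope of the line through the first and last points, then choose the intercept b so that the residuals sum to zero. Setting the sum of y − (mx + b) over the three points equal to zero gives b = (mean of y) − m(mean of x). The function names are invented for this sketch.

```python
def zero_residual_line(p1, p2, p3):
    """Line parallel to the line through p1 and p3 whose three residuals sum to zero."""
    slope = (p3[1] - p1[1]) / (p3[0] - p1[0])
    mean_x = (p1[0] + p2[0] + p3[0]) / 3
    mean_y = (p1[1] + p2[1] + p3[1]) / 3
    intercept = mean_y - slope * mean_x  # forces the residuals to sum to zero
    return slope, intercept


def residual_sum(points, slope, intercept):
    return sum(y - (slope * x + intercept) for x, y in points)


P, Q, R = (-1.15, 0.97), (3.22, 2.75), (9.21, 10.68)
m, b = zero_residual_line(P, Q, R)

if __name__ == "__main__":
    print("slope:", round(m, 4), "intercept:", round(b, 4))
    print("sum of residuals:", residual_sum([P, Q, R], m, b))
```

Because of floating-point rounding, the printed sum of residuals may be a very small nonzero number rather than exactly zero.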

Given T = (1.20, 7.48), U = (4.40, 6.12), and V = (8.80, 2.54), find an equation for the line that is parallel to the line TV and that makes the sum of the three residuals zero. This line is called the zero-residual line determined by T, U, and V .
Consider the points A = (−0.5, −8), B = (0.5, −5), and C = (3, 4.5). Calculate the residual for each of these points with respect to the line 4x − y = 7.
Show that the zero-residual line of the points P, Q, and R goes through their centroid.
(Continuation) The zero-residual line makes the sum of the residuals zero. What about the sum of the absolute values of the residuals? Is it possible for this sum to be zero? If not, does the zero-residual line make this sum as small as it can be?
Let P = (2,6), Q = (8,10), and R = (11,2). Find an equation for the zero-residual line, as well as the line of slope 2 through the centroid G of triangle P QR. Find the sum of the residuals of P, Q, and R with respect to the second line. Repeat the investigation using the line of slope −1 through G. Use your results to formulate a conjecture.
What is the sine of the angle whose tangent is 2? First find an answer without using your calculator (draw a picture), then use your calculator to check.
Consider the line y = 1.8x + 0.7.

Find a point whose residual with respect to this line is −1.
Describe the configuration of points whose residuals are −1 with respect to this line.

The median of a set of numbers is the middle number, once the numbers have been arranged in order. If there are two middle numbers, then the median is half their sum. Find the median of (a) 5, 8, 3, 9, 5, 6, 8; (b) 4, 10, 8, 7.
A median-median point for a set of points is the point whose x-value is the median of all the given x-values and whose y-value is the median of all the given y-values. Find the median-median point for the following set of points: (1,2), (2,1), (3,5), (6,4), and (10,7).
The table shows the population of New Hampshire at the start of each of the last six decades.

Write an equation for the line that contains the data points for 1960 and 2010.
Write an equation for the line that contains the data points for 2000 and 2010.
Make a scatter plot of the data. Graph both lines on it.
Use each of these equations to predict the population of New Hampshire at the beginning of 2020. For each prediction, explain why you could expect it to provide an accurate forecast.

The zero-residual line determined by (1,2), (4,k), and (7,8) is y = x − ⅔ . Sketch the line, plot the points, and find the value of k. Be prepared to explain your method.
Plot the following nine non-collinear points: (0.0, 1.0) (1.0, 2.0) (2.0, 2.7) (3.0, 4.0) (4.0, 3.0) (5.0, 4.6) (6.0, 6.2) (7.0, 8.0) (8.0, 8.5)

Use your ruler (clear plastic is best) to draw the line that seems to best fit this data.
Record the slope and the y-intercept of your line.

(Continuation) Extend the zero-residual-line technique to this data set as follows: First, working left to right, separate the data into three groups of equal size (three points in each group for this example). Next, select the summary point for each group by finding its median-median point. Finally, calculate the zero-residual line defined by these three summary points. This line is called the median-median line. Sketch this line, and compare it with your estimated line of best fit.
(Continuation) If the number of data points is not divisible by three, the three groups cannot have the same number of points. In such cases, it is customary to arrange the group sizes in a symmetric fashion. For instance:

Enlarge the data set to include a tenth point, (9.0,9.5), and then separate the ten points into groups, of sizes three, four, and three points, reading from left to right. Calculate the summary points for these three groups.
Enlarge the data set again to include an eleventh point, (10.0,10.5). Separate the eleven points into three groups and calculate the three summary points.
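The median-median line procedure described in the last few problems can be sketched in Python as below. The grouping rule for sizes not divisible by three follows the symmetric convention described above (3-4-3 for ten points, 4-3-4 for eleven); the function names are hypothetical.

```python
from statistics import median

NINE_POINTS = [(0.0, 1.0), (1.0, 2.0), (2.0, 2.7), (3.0, 4.0), (4.0, 3.0),
               (5.0, 4.6), (6.0, 6.2), (7.0, 8.0), (8.0, 8.5)]


def split_in_three(points):
    """Split points (sorted by x) into three groups with symmetric sizes."""
    pts = sorted(points)
    base, extra = divmod(len(pts), 3)
    sizes = [base, base, base]
    if extra == 1:
        sizes[1] += 1  # e.g. 10 points -> 3, 4, 3
    elif extra == 2:
        sizes[0] += 1  # e.g. 11 points -> 4, 3, 4
        sizes[2] += 1
    first, second = sizes[0], sizes[0] + sizes[1]
    return pts[:first], pts[first:second], pts[second:]


def summary_point(group):
    """Median-median point: (median of the x-values, median of the y-values)."""
    return median(x for x, _ in group), median(y for _, y in group)


def median_median_line(points):
    """Zero-residual line determined by the three summary points."""
    s1, s2, s3 = (summary_point(g) for g in split_in_three(points))
    slope = (s3[1] - s1[1]) / (s3[0] - s1[0])
    intercept = (s1[1] + s2[1] + s3[1]) / 3 - slope * (s1[0] + s2[0] + s3[0]) / 3
    return slope, intercept


if __name__ == "__main__":
    print(median_median_line(NINE_POINTS))
```

Many graphing calculators offer a built-in median-median fit; their grouping conventions may differ slightly from this sketch when ties in x occur.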

An avid gardener, Gerry Anium just bought 80 feet of decorative fencing, to create a border around a new rectangular garden that is still being designed.

If the width of the rectangle were 5 feet, what would the length be? How much area would the rectangle enclose? Write this data in the first row of the table.
Record data for the next five examples in the table.
Let x be the width of the garden. In terms of x, fill in the last row of the table.
Use your calculator to graph the rectangle’s area versus x, for 0 ≤ x ≤ 40. As a check, you can make a scatter plot using the table data. What is special about the values x = 0 and x = 40?
Comment on the symmetric appearance of the graph. Why was it predictable?
Find the point on the graph that corresponds to the largest rectangular area that Gerry can enclose using the 80 feet of available fencing. This point is called the vertex.
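The table and the search for the vertex can be sketched in Python. This is an illustration only; the names length, area, and best_width are chosen for this sketch, and the search checks whole-number widths only.

```python
PERIMETER = 80  # feet of fencing


def length(width):
    """Length of the rectangle once the width is chosen."""
    return PERIMETER / 2 - width


def area(width):
    return width * length(width)


# Rows of (width, length, area) for widths 5, 10, ..., 35.
table = [(w, length(w), area(w)) for w in range(5, 40, 5)]

# Search whole-number widths from 0 to 40 for the largest area.
best_width = max(range(0, 41), key=area)

if __name__ == "__main__":
    for row in table:
        print(row)
    print("largest area at width", best_width, "->", area(best_width))
```

The symmetry of the table around the best width mirrors the symmetry of the graph: swapping width and length leaves the area unchanged.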

Using only positive numbers, add the first two odd numbers, the first three odd numbers, and the first four odd numbers. Do your answers show a pattern? What is the sum of the first n odd numbers?
It is often convenient to use what is called sigma notation to describe a series. For example, the preceding sum could have been written $\sum_{k=1}^{n} (2k-1)$.

Use sigma notation to describe the sum of the first n even integers.
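A quick numerical check of such patterns is sketched below. Each function adds the terms one at a time, just as the sigma notation prescribes, so the results can be compared with a conjectured formula. The function names are invented for this illustration.

```python
def sum_first_odds(n):
    """1 + 3 + 5 + ... + (2n - 1), the sum over k = 1..n of (2k - 1)."""
    return sum(2 * k - 1 for k in range(1, n + 1))


def sum_first_evens(n):
    """2 + 4 + 6 + ... + 2n, the sum over k = 1..n of 2k."""
    return sum(2 * k for k in range(1, n + 1))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, sum_first_odds(n), sum_first_evens(n))
```

A table of values like this can suggest a formula, but it does not prove that the formula holds for every n.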

Jan had the same summer job for the years 1993 through 1996, earning $250 in 1993, $325 in 1994, $400 in 1995, and $475 in 1996.

Plot the four data points, using the horizontal axis for “year”. You should be able to draw a line through the four points.
What is the slope of this line? What does it represent?
Which points on this line are meaningful in this context?
Guess what Jan’s earnings were for 1992 and 1998, assuming the same summer job.
Write an inequality that states that Jan’s earnings in 1998 were within 10% of the amount you guessed.

The height h (in feet) above the ground of a baseball depends upon the time t (in seconds) it has been in flight. Cameron takes a mighty swing and hits a bloop single whose height is described approximately by the equation h = 80t − 16t². Without resorting to graphing on your calculator, answer the following questions:

How long is the ball in the air?
The ball reaches its maximum height after how many seconds of flight?
What is the maximum height?
It takes approximately 0.92 seconds for the ball to reach a height of 60 feet. On its way back down, the ball is again 60 feet above the ground; what is the value of t when this happens?
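After working these by hand, the answers can be checked with a short Python sketch. It uses the facts that the ball is on the ground when h = 0, that the peak of a parabola lies midway between its zeros, and the quadratic formula for h = 60. Variable names are chosen for this illustration.

```python
import math


def height(t):
    """Height in feet after t seconds: h = 80t - 16t^2."""
    return 80 * t - 16 * t ** 2


landing_time = 80 / 16          # nonzero solution of 80t - 16t^2 = 0
peak_time = landing_time / 2    # the vertex is midway between the two zeros
peak_height = height(peak_time)

# Times when h = 60: solve 16t^2 - 80t + 60 = 0 with the quadratic formula.
a, b, c = 16, -80, 60
root = math.sqrt(b * b - 4 * a * c)
t_up, t_down = (-b - root) / (2 * a), (-b + root) / (2 * a)

if __name__ == "__main__":
    print("time in air:", landing_time)
    print("peak at t =", peak_time, "height", peak_height)
    print("60 feet at t =", round(t_up, 2), "and t =", round(t_down, 2))
```

Notice that the two times at which the ball is 60 feet high are symmetric about the peak time.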

Consider the triangular arrangements of hearts shown below:

In your notebook, continue the pattern by drawing the next triangular array.
Let x equal the number of hearts along one edge of a triangle, and let y equal the corresponding number of hearts in the whole triangle. Make a table of values that illustrates the relationship between x and y for 1 ≤ x ≤ 6. What value of y should be associated with x = 0?
Is the relationship between x and y linear? Explain. Is the relationship quadratic? Explain.
Is y a function of x? Is x a function of y? Explain.
The numbers 1, 3, 6, 10, ... are called triangular numbers. Why? Find an equation for the triangular number relationship. Check it by replacing x with 6. Do you get the same number as there are hearts in the 6th triangle?
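One way to test whether a table of values is linear or quadratic is to look at successive differences, sketched here in Python. The closed form used in triangular is the standard formula for triangular numbers; the function names are invented for this sketch.

```python
def triangular(x):
    """Number of hearts in a triangle with x hearts along each edge."""
    return x * (x + 1) // 2


def differences(values):
    """Differences between consecutive entries of a list."""
    return [b - a for a, b in zip(values, values[1:])]


ys = [triangular(x) for x in range(0, 7)]   # x = 0, 1, ..., 6
first_diffs = differences(ys)               # not constant -> not linear
second_diffs = differences(first_diffs)     # constant -> quadratic

if __name__ == "__main__":
    print(ys)
    print(first_diffs)
    print(second_diffs)
```

Constant first differences indicate a linear relationship; constant second differences indicate a quadratic one.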

These next few problems show how to compute the slope and intercept of the line that best fits the data. Suppose we knew the equation of the line of best fit, y = ax + b. Then we could compute the residuals of our 26 data points with respect to this line:

First show how to arrive at the expanded expression:

Next, since we are working with a finite number of finite values, we can rewrite the big sum as a sum of sums:

Remember that all of the $x_k$ and $y_k$ come from our data points and are given. The unknowns in this equation are a and b! Rewrite the equation using the shorthand notation

Now rearrange the terms so that your equation has the form S = Aa² + Ba + C.

And now, use what you know about quadratic equations to find the a-value that minimizes this quadratic function.
Notice that your equation for a includes a b. Rewrite S as a quadratic function in b instead of a.

And find the value of b that minimizes S. (Is this the same minimum value that you obtained in a previous problem?) Now you have two equations in two unknowns, and you can solve for a and b.
Use your equations, the ones we got are:

to calculate a and b for our data set. Remember that you already have columns for x, x², and y in your calculator. You may want to add a column with xy. Compare with what the calculator gave you for the values of a and b.
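As a hedged sketch of where these steps lead, assume that S(a, b) denotes the sum of the squared residuals and that n is the number of data points (n = 26 here). The notation below is chosen for this sketch and may differ from the shorthand intended in the problems.

```latex
\begin{align*}
S(a,b) &= \sum_{k=1}^{n} \bigl(y_k - (a x_k + b)\bigr)^2 \\
       &= \sum_{k=1}^{n} \bigl(y_k^2 - 2a\,x_k y_k - 2b\,y_k
          + a^2 x_k^2 + 2ab\,x_k + b^2\bigr) \\
       &= \sum y_k^2 - 2a \sum x_k y_k - 2b \sum y_k
          + a^2 \sum x_k^2 + 2ab \sum x_k + n b^2 .
\end{align*}

% As a quadratic in a (with b held fixed), S = A a^2 + B a + C where
\begin{align*}
A = \sum x_k^2, \qquad B = 2b \sum x_k - 2 \sum x_k y_k
\quad\Longrightarrow\quad
a = -\frac{B}{2A} = \frac{\sum x_k y_k - b \sum x_k}{\sum x_k^2} .
\end{align*}

% As a quadratic in b (with a held fixed), the leading coefficient is n and
% the linear coefficient is 2a \sum x_k - 2 \sum y_k, so
\begin{align*}
b = \frac{\sum y_k - a \sum x_k}{n} .
\end{align*}

% Solving the two equations together:
\begin{align*}
a = \frac{n \sum x_k y_k - \sum x_k \sum y_k}{n \sum x_k^2 - \bigl(\sum x_k\bigr)^2},
\qquad
b = \frac{\sum y_k - a \sum x_k}{n} .
\end{align*}
```

The second equation says that the least-squares line passes through the point whose coordinates are the mean of the x-values and the mean of the y-values.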

Use your calculator to find the equation of the least-squares line (LinReg) for the five data points (2.0,3.2), (3.0,3.5), (5.0,5.0), (7.0,5.8), and (8.0,6.0). Let G be the centroid of these points — its x-coordinate is the average of the five given x-coordinates, and its y-coordinate is the average of the five given y-coordinates. Verify that G is on the least squares line.
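To check a calculator's LinReg output, the standard least-squares formulas for slope and intercept can be sketched in Python as below. The function name least_squares is chosen for this illustration and is not a calculator command.

```python
def least_squares(points):
    """Slope a and intercept b of the least-squares line y = ax + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b


data = [(2.0, 3.2), (3.0, 3.5), (5.0, 5.0), (7.0, 5.8), (8.0, 6.0)]
a, b = least_squares(data)

# Centroid G: average of the x-coordinates and average of the y-coordinates.
gx = sum(x for x, _ in data) / len(data)
gy = sum(y for _, y in data) / len(data)

if __name__ == "__main__":
    print("a =", round(a, 4), "b =", round(b, 4))
    print("line at centroid x:", round(a * gx + b, 4), "centroid y:", round(gy, 4))
```

Comparing the line's value at the centroid's x-coordinate with the centroid's y-coordinate shows whether G lies on the line.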
The table at right shows how many seconds are needed for a stone to fall to Earth from various heights (measured in meters). Make a scatter plot of this data. Explain how the data suggests that the underlying relationship is not linear.

Calculate the squares of the times and enter them in a third column. A scatter plot of the relation between the first and third columns does suggest a linear relationship. Use LinReg to find it, letting x stand for height and y stand for the square of the time.
It is now easy to write a nonlinear relation between h and t by expressing t² in terms of h. Use this equation to predict how long it will take for a stone to fall from a height of 300 meters.

(Continuation) This time, calculate the square roots of the heights and enter them in a new column. A scatter plot of the relation between the second column and the new column should reveal a linear relationship. Find it, then use it to extrapolate how much time is needed for a stone to fall 300 meters.
The following are data about how charged Sasha’s laptop is:

Percent Charge

Time	9:11 AM	9:27 AM	9:36 AM	9:48 AM	9:55 AM	10:08 AM	10:17 AM
Charge	41%	56%	64%	74%	79%	86%	91%

What are the variables in this story?
Make a scatterplot of these data.
Draw a line that best fits the data and find its equation.
Interpret the slope and y-intercept in this context.
Based on your findings, when can Sasha expect to have a fully charged battery?
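One possible analysis of the laptop data, sketched in Python: convert each clock time to minutes after 9:11 AM, fit a least-squares line, and extrapolate to 100%. Measuring time from the first reading is a choice made for this sketch, and the helper names are hypothetical.

```python
times = ["9:11", "9:27", "9:36", "9:48", "9:55", "10:08", "10:17"]
charge = [41, 56, 64, 74, 79, 86, 91]  # percent


def minutes_since_start(clock, start="9:11"):
    """Minutes elapsed between two same-morning clock times written as H:MM."""
    h, m = map(int, clock.split(":"))
    h0, m0 = map(int, start.split(":"))
    return (h - h0) * 60 + (m - m0)


def least_squares(xs, ys):
    """Slope and intercept of the least-squares line through (xs, ys)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return slope, (sy - slope * sx) / n


minutes = [minutes_since_start(t) for t in times]
slope, intercept = least_squares(minutes, charge)
minutes_to_full = (100 - intercept) / slope  # where the line reaches 100%

if __name__ == "__main__":
    print("percent per minute:", round(slope, 3))
    print("predicted charge at 9:11:", round(intercept, 1))
    print("minutes after 9:11 until 100%:", round(minutes_to_full, 1))
```

The charging rate in the data slows as the battery fills, so a straight-line extrapolation probably predicts a full charge earlier than Sasha will actually see it.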

38. APPENDIX A

39. Appendix B

40. GLOSSARY

Association Association describes the strength of the given relationship between two variables.

Back-To-Back Stem (and Leaf) Plot A graphic device used to compare two data sets that share common stems. The leaves of one set extend to the right of the stems, and the leaves of the other set extend to the left. See Stem Plot, Stem-And-Leaf Plot.

Bar Graph A device to display categorical data using (usually) vertical bars, the heights of which represent the frequency of each category. The bars are separated, unlike a histogram (q. v.).

Bias Bias is the difference between an estimated expected value and the true value being estimated.

Bivariate data Ordered pairs of linked numerical observations.

Boxplot or Box-and-Whisker Plot A graphic device to display 1-variable data using the minimum, maximum, median and the upper and lower quartiles of the data.

Center A one-number summary of a data set. Usually the mean or median, sometimes the mode.

Conditional Probability The probability that A will occur given the knowledge that B has occurred. The probability of A given B, denoted P(A|B), is P(A and B) / P(B).

Dot Plot A device to display one-variable data. Each data point is represented (usually) by a single dot. The number of dots corresponding to a value is the frequency of that value.

Empirical Probability An estimate of the probability of an event occurring based on a large number of trials.

Event A set of possible outcomes from a random situation.

Expected Value The weighted average of all possible outcomes of a random process, the weights being the probabilities of each outcome, denoted $E(X) = \sum_i x_i\,P(x_i)$.

Experiment A procedure to verify or refute a hypothesis. A random, controlled experiment allows valid inferences to be drawn. Compare Observational Study.

Five-Number Summary The minimum, maximum, median and quartiles of a set of observations; the skeleton of a boxplot.

Form The given data can look linear or non-linear, and its form is the basic function that describes the look of the data.

Frequency The number of occurrences of a particular value. Denoted by the height of the bars in a bar graph or histogram.

Histogram A device to display data by bars, the height of which represents the frequency of that observation or group of observations. The bars are drawn without gaps, the width of each bar being the same and representing either an individual value or a group of values of the same size.

Independent Events A and B are independent events if the probability of both A and B occurring is the product of the probability of A and the probability of B. Knowing that one of the events has occurred has no effect on the probability of the other event occurring. For example, when two cards are drawn from a shuffled deck, the events “the first card is red” and “the second card is red” are independent if the first card is replaced in the deck, and the deck reshuffled, before the second card is drawn. If the first card is not replaced, the two events are not independent.

Inference The drawing of conclusions, usually based on a sample. The process of extrapolating information from a sample to the parent population.

Influential Point A data point for which a small change in position will have a disproportionate effect on the least squares regression line.

Interquartile Range Q3 − Q1, the difference between the upper and lower quartiles. A measure of the variability of a set of data.

Least Squares Regression Line The Least Squares Regression Line is the line that makes the sum of squares of the residuals with respect to that line as small as possible.

Maximum The greatest value of a set.

Mean The arithmetic mean is the average of the values of all the observations. Geometrically the mean is the balance point of a data set.

Mean Absolute Deviation A measure of the variability of a set computed by taking the mean of the absolute values of the difference between each observation and the median or the mean of the set. For the set $X = \{x_1, x_2, \ldots, x_n\}$ the MAD is $\frac{1}{n}\sum_{i=1}^{n} |x_i - m(X)|$, where m(X) is the mean or median of the set X.

Median In an ordered list of observations the middle number if there is an odd number of observations and the mean of the two middle numbers if there is an even number of observations. Half of the observations lie above the median, half below. Geometrically the median divides the data into two equal areas.

Minimum The least value of a set.

Mode The value of the most commonly occurring observation. A useful measure of the center of the distribution of categorical data. The modal value.

Normal Distribution The normal distribution with mean equal to μ and standard deviation equal to σ is given by the density function $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$.

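Observational Study A method of collecting data by observing subjects without assigning treatments. It is generally not appropriate to draw cause-and-effect inferences from an observational study. Compare Experiment.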

Outcome The result of a random situation.

Outlier An observation that lies outside the overall pattern of the data. In a box plot an outlier is arbitrarily defined to be a value that lies either more than 1.5 interquartile ranges below the lower quartile or more than 1.5 interquartile ranges above the upper quartile. A value which invites further study.

Parameter A parameter is a numerical value that states something about an entire population and is often hard to obtain.

Population The entire set of people or objects about which information is sought.

Quartile For a data set with an even number of observations, the First or Lower quartile, denoted Q1 , is the median of the observations in the lower half of the set, the 25th percentile. The Upper or Third quartile, the 75th percentile, denoted Q3, is the median of the observations in the upper half of the set.

Residual The residual of a point with respect to a line is the signed vertical difference between the point and the line: the y-value of the point minus the y-value of the line, both taken at the x-value of the point of interest.

Sample A subset of a population. The goal of inference is to extrapolate information gleaned from a sample (about which everything is known) to the parent population.

Sample Space The set of all possible outcomes of a chance process. The sum of the probabilities of all the outcomes is 1 or 100 %.

Shape Some of the words used to describe the shape of a one-variable distribution are symmetrical, mound-shaped, skewed left or right, uniform, bimodal, fan-shaped.

Spread The variability of the observations in a data set. Some measures of spread are range, interquartile range, standard deviation, mean absolute deviation.

Standard Deviation A measure of the variability in a data set. It is used almost exclusively in conjunction with the mean to summarize approximately normally distributed data. Every calculating or computing device will compute the standard deviation of a data set. It is computed as the square root of the mean of the squares of the deviations (signed differences) of each data point from the mean. For the set $X = \{x_1, x_2, \ldots, x_n\}$ the standard deviation is $\sqrt{\frac{1}{n}\sum_{i=1}^{n} \bigl(x_i - m(X)\bigr)^2}$, where m(X) is the mean of the set X. Compare Mean Absolute Deviation.

Statistic A statistic is a numerical value that states something about a sample of a population and can be different for different samples.

Statistics The plural of statistic. Also the name of the discipline concerned with collecting data, analyzing data, and making inferences from data.

Stem Plot, Stem-And-Leaf Plot A graphing technique that displays all the individual values in a set and is quickly drawn. The principal values, which vary for each set, form the stems, which are arranged vertically. The leaves extend horizontally from the stems. When rotated 90 degrees counterclockwise, a stem plot looks much like a histogram.

Strength The strength of the pattern is an assessment of how tightly clustered the data points are around the underlying form.

Subjective Probability An estimate of the probability of an event occurring determined from an (educated) guess.

Theoretical Probability The mathematically determined probability of an event occurring.

Trend The pattern displayed by the data.

41. Reference Materials

Morse Code, Scrabble, and the Alphabet. Mary Richardson, John Gabrosek, Diann Reischman, Phyllis Curtiss, Grand Valley State University. Journal of Statistics Education Volume 12, Number 3 (2004). www.amstat.org/publications/jse/v12n3/richardson.html

The Journal of Statistics Education An International Journal on the Teaching and Learning of Statistics, Editor of JSE Michelle Everson, www.amstat.org/publications/jse/jse_users.html

An Exercise in Sampling: Rolling Down the River. Doetsch, Flanagan-Hyde, Harrison, Tabor, Tiberio, NCSSM Statistics Leadership Institute (July 2000). www.courses.ncssm.edu/math/Stat_inst01/PDFS/river.pdf

There Once Was a 9-Block ... A Middle-School Design for Probability and Statistics. Abrahamson, D., Janusz, R. M., and Wilensky, U. Journal of Statistics Education [Online] (2006). www.amstat.org/publications/jse/v14n1/abrahamson.html

What is the Probability of a Kiss? (It’s Not What You Think) Mary Richardson, Susan Haller, Journal of Statistics Education Volume 10, Number 3 (2002). www.amstat.org/publications/jse/v10n3/haller.html

The Cereal Box Problem Revisited. Jesse L. M. Wilkins, Virginia Polytechnic Institute and State University. School Science and Mathematics, Volume 99(3) (March 1999). http://eric.ed.gov/?id=EJ590348

Engaging Students in a Large Lecture: An Experiment using Sudoku Puzzles Brophy and Hahn (2014) Journal of Statistics Education Volume 22, Number 1, www.amstat.org/publications/jse/v22n1/brophy.pdf

Graphs in Statistical Analysis F. J. Anscombe The American Statistician, Vol. 27, No. 1 (Feb. 1973), pp.17-21.

www.Illustrativemathematics.org

This site lists all the CCSSM standards in their complete wording. It also gives many examples of exercises which illustrate the standards. A very valuable resource.

Middle Grades Mathematics Project Probability Addison Wesley, 1986, ISBN 0-201-21478-4

Navigating Through Data Analysis in Grades 6-8 NCTM, Reston VA, 2003, ISBN 978-0-87353-547-2.

This is an excellent resource, written long before the CCSSM appeared. It contains lots of good examples and a CD.

Navigating Through Data Analysis in Grades 9-12, NCTM.

Workshop Statistics, Discovery with Data and Fathom, A Rossman, B Chance, R Lock, Key Curriculum Press, 2001. ISBN 1-930190-07-7.

This can be used as a stand-alone statistics text. It comes in versions for Fathom, the graphing calculator, Excel, and Minitab.

Statistics in Action, Understanding a World of Data, A Watkins, R Scheaffer, G Cobb, Key Curriculum Press, 2008. ISBN 978-1-55953-909-8.

Statistics From Data To Decision A Watkins, R Scheaffer, G Cobb. Wiley 2011, ISBN 978-0470-45851-8.

Introduction to Statistics, R Peck, C Olsen, JL Devore, Thomson Brooks/Cole (2008) ISBN-13: 978-0495557838, ISBN-10: 0495557838

Activity 6 Notes

https://drive.google.com/open?id=1qfhm9430Z980wbF4xNlEELvFjJMLsfjR

https://drive.google.com/open?id=1Wsz42O1Cy6kJUksicy__NJr36CmtvnGE

https://drive.google.com/open?id=14IU9qOwDBoEURbvvXHHwIBHPZ_cs9RRX

https://drive.google.com/open?id=15HFOMcO4geLrZ9uCyWfge_rSortiOBb1

Table of Contents (Activities)

Everything I Ever Wanted to Know About Statistics I Learned from M&M’s

A first day activity

Before We Begin:

Questions:

In this activity:

You will make individual Dot Plots, and we will create one master Dot Plot. You will compare your dot plots with the master and note similarities and differences.

Materials:

1.69 oz. bag of M&Ms, one per student.

Master dot plot for each color on the board.

Procedure:

Individual Dotplot

Bag weight: _____________ Contents weight: _______________

2. Hershey Kisses Lab

Questions:

In this Activity:

Materials:

Procedure:

3. Almond Hershey Kisses Lab

Before we begin:

Questions:

In this Activity:

Materials:

Procedure:

4. Every Graph Tells a Story

In this Activity:

Materials:

Procedure:

5. Student Scores

Questions:

In this Activity:

Materials:

Procedure:

6. Matching Dotplots

Procedure:

7. Puppies Lab

Questions:

In this Activity:

Materials:

Procedure:

8. Meadowsweet Questions

In this Activity:

Materials:

Procedure:

Additional Questions:

9. Sudoku Experiment Part I

Before we begin:

Questions:

In this Activity:

Materials:

Procedure:

Sudoku Experiment Part I

Instructions:

Sudoku Experiment

Instructions:

10. The Standard Deviation

Before we begin:

In this Activity:

Procedure:

11. The Normal Distribution Lab

Before we begin:

Questions:

In this Activity:

Materials:

Procedure:

12. Minimum Wage Lab

Questions:

In this Activity:

Bag weight: _ Contents weight: ___