Instructions: We’re going to crowd-source our review today.
Each student should enter at least two tips and as many questions as you have in the corresponding spaces below. If possible, prioritize giving tips for targets that either (a) do not yet have any tips entered or (b) you feel especially confident with.
Suggestions for tips:
- What have you learned from the feedback received on your graded work?
- What suggestions for “ways you might improve” have I written on your online gradebook?
- What external resources (e.g. websites, videos) have you discovered? (Please enter a hyperlink and brief description.)
At 2pm (and throughout the day as I’m feeling up to it) I will log on and address any questions that have not yet been answered.
Please enter your initials after your tips and questions (JH).
Targets 1.1 through 1.5 (grounded in CCSS 6.SP):
1.1 I can calculate appropriate measures of center and variability for a given data set.
- When you consider variability, I'd suggest you immediately think of MAD or IQR. Those are the two measures CCSS mentions. The range, while easy to compute, tends to be unreliable since it is based on just two points. (JH)
- Using all values of measurement data can help make your case stronger. Knowing what the numbers represent and what conclusions can be made about the data can determine the strength of your argument. (KM)
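The measures named in this target can be checked with a few lines of Python. This is only a minimal sketch: the data values are made up for illustration, and the `iqr` helper assumes the "median of each half" convention for quartiles. Software such as Geogebra or Excel may use a different quartile rule and report a slightly different IQR.

```python
from statistics import mean, median

# Hypothetical data set, invented purely for illustration.
data = [2, 4, 4, 5, 7, 8, 12]

def mean_abs_dev(values):
    """Mean absolute deviation: average distance of each value from the mean."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)

def iqr(values):
    """IQR using the 'median of each half' convention.

    Assumes at least two values. When the count is odd, the middle
    value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(upper) - median(lower)

data_mean = mean(data)
data_median = median(data)
data_mad = mean_abs_dev(data)
data_iqr = iqr(data)
data_range = max(data) - min(data)  # depends on just two points

print(data_mean, data_median, round(data_mad, 2), data_iqr, data_range)
```

Notice that the range uses only the smallest and largest values, while the MAD and IQR draw on the whole data set.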
1.2 I can display numerical data in plots on a number line, including dot plots, histograms, and box plots.
- These are all for quantitative data, right? I always get histograms and bar graphs confused - histograms are quantitative and bar graphs are categorical, correct? (HM)
- ANS: Yes, that’s correct. Histograms are drawn on a number line (which we partition into bins), but bar graphs are always based on categories. (JH)
- When comparing two data sets using dot plots, histograms, and box plots, be sure to plot them on the same number line. If you do decide to do them on two different number lines, make sure your number lines use the same scale so you can properly compare them. (SS)
- When you are showing a histogram, remember to keep your bin size appropriate for the data. (LD)
1.3 I can relate my choice of measures of center and variability to the shape of the data distribution and the context in which the data were gathered.
- Is this where we look at either the IQR or MAD and see how that fits in with the shape of the data distribution? (SZ)
- ANS: Yes. The emphasis is on making a thoughtful choice: Mean vs. Median, IQR vs. MAD. (JH)
- If Mean is not a reasonable measure of center, St. Dev. and Mean Abs Dev will not be reasonable measures of variability either, since they are determined by the mean. Therefore, if the mean is not accurate, the St. Dev. or MAD won’t be either. (HM)
- If your data is skewed in any way (right or left) you do not want to use the mean as a measure of center because the skewness will affect the mean. You should use the median as your choice of measure of center. (SS)
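To see why skew matters for the choice of center, here is a small sketch with made-up, right-skewed data. The values are hypothetical. The point is only to watch how far the mean and the median each move when the one extreme value is dropped.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is very large.
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 30]

skewed_mean = mean(skewed)      # pulled toward the long right tail
skewed_median = median(skewed)  # stays with the bulk of the data

# Drop the extreme value and see how much each measure moves.
trimmed = skewed[:-1]
mean_shift = skewed_mean - mean(trimmed)
median_shift = skewed_median - median(trimmed)

print(round(skewed_mean, 2), skewed_median)
print(round(mean_shift, 2), median_shift)
```

In this example the median does not move at all, while the mean shifts by about three units. Since the MAD is built on the mean, it inherits the same sensitivity.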
1.4 I can relate numerical summaries of data sets to their context, such as by:
- Reporting numerical values with their units of measurement.
- Describing the nature of the attribute under investigation, including how it was measured.
- Describing the overall pattern and any striking deviations from the overall pattern with reference to the context in which the data were gathered.
- This target is all about linking the numerical summaries to the underlying context. The bullet points talk about ways you might do that, but be sure you attend to the “in context” allusions. PS. This wording is almost identical to what appears in CCSS 6.SP.B.5. (JH)
- When calculating the mean, median, IQR, MAD, etc., be sure to explain how you are obtaining your numbers. Being able to explain your thinking is important so you can get the full score for this standard. (SZ)
- Make sure to understand that normal and symmetric do not have the same meaning. Normal distributions must be symmetric, but symmetric distributions are not always normal. Normal distributions tend to have higher tails. This link shows the comparison of the two types: http://ocw.tufts.edu/Content/1/coursehome/194228/194237 (KZ)
1.5 I can use technology to perform statistical calculations and create graphs to summarize a data set.
- When making histograms, it is a good idea to make the bin sizes the same between two sets of data. This way you can more easily compare the two distributions when they are overlapping each other. (AF)
- Changing the colors on the graph to easily see overlap in a bivariate graph is visually helpful. It's hard to see the similarities and differences if everything is the same color. (KM)
- Make sure to label your axes and set the scale on your axis to match your histogram bins. (KZ)
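The advice about matching bins can be sketched in Python. This is an illustration under assumptions, not a recipe from any particular tool: both data sets are invented, and `bin_counts` is a hypothetical helper that counts values in equal-width bins. The key idea is that both groups are counted against the same start, width, and number of bins.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the binned range are ignored
            counts[index] += 1
    return counts

# Hypothetical data sets, invented for illustration.
group_a = [61, 63, 64, 66, 67, 68, 70, 72]
group_b = [65, 68, 69, 71, 72, 74, 75, 78]

# One shared set of bins: 60-65, 65-70, 70-75, 75-80.
start, width, n_bins = 60, 5, 4
counts_a = bin_counts(group_a, start, width, n_bins)
counts_b = bin_counts(group_b, start, width, n_bins)

print(counts_a)
print(counts_b)
```

Because the two lists of counts line up bin by bin, you can compare the distributions directly, and the axis scale can be set to the same bin edges.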
Targets 2.1 through 2.7 (grounded in CCSS 8.SP):
2.1 I can describe when it is and is not appropriate to represent bivariate measurement data on a scatter plot, and I can create scatter plots (either by hand or by using appropriate technology).
- Remember, the variables that have been measured must be linked by an underlying common object (same day, same state, same person, same store) for a scatter plot to be meaningful. (JH)
- Be careful saying “the variables must be correlated” to produce a scatter plot -- it is common to produce a scatter plot for variables that have no correlation; indeed, that’s one of the ways we might discover that the variables are not correlated.
- Similarly, a meaningful scatter plot can be generated even if neither variable causes or explains the other. Often, there is a third variable, not yet identified, that is the common causal factor. Ice cream sales and sunglasses sales are an example (underlying causal factor: warm, sunny day). (JH)
- Be sure to address when it is not appropriate to examine bivariate data with scatter plots. For example, when the data is non-numerical or when the data is not related. (AF)
- Remember that the independent variable goes on the x-axis and the dependent variable goes on the y-axis. Also choose your scaling wisely because it can affect what you see in your scatter plot. (MS)
2.2 I can use scatter plots to draw conclusions about patterns of association between two quantities, describing patterns such as: clustering, outliers, positive or negative association, and linear or nonlinear association.
- What happens if a point seems to be an outlier for one of the variables but not the other? I'm sure this happens sometimes. (HM)
- ANS: Yes, this happens. Actually, these are the points that stand out -- points that are outliers for both variables might actually fall right on the trendline. But points that are outliers in x but fairly typical in y are going to buck the trendline and appear abnormal. (JH)
- ANS: Often, these are your high leverage points. You should consider whether there might be a good reason to remove them from your analysis. Remember the large & small mammals data set? The scientist removed the two or three huge mammals (whales?) right away so she could get something useful for the other mammals. (JH)
- Two variables have a positive association if, when one variable increases, so does the other. Be careful, though: this also means that if one variable decreases and the other decreases as well, the association is still positive. I like to think of variables with a negative association as moving in opposite directions: if one increases while the other decreases, then the association is negative. (HM)
- If we have a scatterplot that has a cluster or many clusters, be sure to zoom in on those parts to look closer at the relationship. (SZ)
- If you do have clusters, you should separate the data and look at each cluster’s associations separately. (SS)
- Ask yourself if the outliers have high leverage and thus affect how your data is represented on the scatter plot, and will change the way that you would interpret the data. (PD)
- If there is an outlier that has high leverage you should exclude that point and look at the data again. (MS)
- Remember that a cluster is a heavy concentration of data points in one area of the graph, where you feel it would be best to separate the data into two graphs to be able to interpret it better. It's not just when there happen to be a few more points in an area. You will see a distinct and obvious cluster that doesn't allow you to interpret the graph appropriately. (SP)
2.3 I can use technology (e.g. Excel, Geogebra, Desmos, or a graphing calculator) to generate a least-squares linear regression equation for a scatter plot.
- How do you tell Geogebra to ignore an outlier that greatly affects the line of best fit? (what are these called again? high something outliers…) (HM).
- ANS: These are high leverage outliers if they have a big influence on the slope of the regression line. You can turn points on and off one-by-one using this button in Geogebra’s data analysis view (see below). If the slope of the regression line changes a lot, you have found a high leverage point. Another option is to just delete the point manually in the spreadsheet. (JH).
- When creating a least-squares linear regression line, make sure the spreadsheet view is on and the data is entered into the columns. Then highlight the data you want to analyze. Click on the second box near the top left corner and find the option that says, “Two Variable Regression Analysis”. When you hit analyze, it will create a scatterplot. Near the bottom left corner there are options for different regression models. Choose the linear option and you will have a least-squares linear regression line. (SZ)
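If you want to see what the technology is computing, and what turning a high leverage point on and off does to the slope, here is a rough sketch. The data are invented (five points on the line y = 2x plus one far-out point), and `least_squares` is a hypothetical helper written from the standard least-squares formulas. It is not Geogebra's own code.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: the last point is far out in x and off the trend.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 5]

slope_all, intercept_all = least_squares(xs, ys)
slope_trim, intercept_trim = least_squares(xs[:-1], ys[:-1])  # last point "turned off"

print(round(slope_all, 3), round(slope_trim, 3))
```

With the far-out point included, the slope drops from 2 to nearly 0. A change that large is the sign of a high leverage point.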
2.4 I can use technology (e.g. Geogebra or Desmos) to create a linear model with adjustable parameters for slope and y-intercept and use it to find an approximate best-fit line.
- To create an adjustable line of best fit, go to the bottom entry bar (not sure what to call this) and type y=mx+b. Then enter m=1 and b=1. This will create items you can turn on or off in the space on the left. Turn on m and b by clicking the circle (this turns on the sliders); you can then play with the sliders to adjust your line of best fit. (HM)
- When typing y=mx+b, make sure to put the * in, as in y=m*x+b. If you don’t, it might not work. (LD)
- Good tip from LD. I also usually define m=1 and b=1 first. If you do it the other way, I think Geogebra may give you an error because m and b are not yet defined. (JH)
2.5 I can describe how well a linear model fits a scatter plot both formally and informally:
- Informally, by discussing the closeness of the data points to the line.
- Formally, by reporting and discussing the sum of the magnitudes of the residuals.
- How do you interpret the sum of residuals? For instance, when you get the value of the sum of the residuals, how do you interpret that in the context of your data? (LH)
- Remember, the residuals are (the absolute value of) the differences between the data y-values and the linear model's y-values. They are measured vertically. (JH)
- Also remember that it is important to interpret the graphs both formally and informally, because the residuals can give you more information than just looking at the closeness of the data points. I have found that, depending on the data, the sum of the residuals may actually show one set of data to fit a linear model better than another even though the other seemed to have data points closer to the line. (SP)
- When describing the closeness of the data points to the line, look at where the linear model falls among the points: exactly in the middle, closer to the x-axis, or closer to the y-axis. Then you can check by calculating the sum of the residuals to see how good the model is. (CL)
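The formal measure in this target, the sum of the magnitudes of the residuals, can be sketched as follows. The data and the two candidate lines are made up, and `sum_abs_residuals` is a hypothetical helper. Think of m and b as the slider values from target 2.4.

```python
def sum_abs_residuals(xs, ys, m, b):
    """Sum of |data y-value minus model y-value| for the line y = m*x + b."""
    return sum(abs(y - (m * x + b)) for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 6, 9]

total_a = sum_abs_residuals(xs, ys, 2, 1)    # candidate line y = 2x + 1
total_b = sum_abs_residuals(xs, ys, 1.5, 2)  # candidate line y = 1.5x + 2

print(total_a, total_b)
```

A smaller total means the data sit closer, vertically, to that line. The total is in the units of the y-variable, which is what lets you interpret it in context.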
2.6 I can use the equation of a linear model to solve problems in the context of bivariate measurement data, interpreting the slope and y-intercept appropriately.
- For this target, if we were trying to show evidence would we come up with a value and put it into the equation and relate it back to the data? (LH)
- When interpreting the slope and y-intercept, make sure you keep your explanatory and response variables in the right order. It is easy to get them flipped. (HM)
- When interpreting your slope, ask yourself what the slope says about the relationship between the independent and dependent variables in context. (PD)
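One way to practice this target is to plug values into a model and read the slope and y-intercept off the results. Everything in this sketch is hypothetical: the slope 2.5 and intercept 10 are invented and do not come from any class data set.

```python
# Hypothetical linear model y = m*x + b; the numbers are invented for illustration.
m = 2.5   # slope: predicted change in y for each 1-unit increase in x
b = 10.0  # y-intercept: predicted y when x is 0

def predict(x):
    """Predicted response (y) for a given explanatory value (x)."""
    return m * x + b

prediction_at_4 = predict(4)
change_per_unit = predict(5) - predict(4)  # equals the slope
value_at_zero = predict(0)                 # equals the y-intercept

print(prediction_at_4, change_per_unit, value_at_zero)
```

When you write up the interpretation, attach the units of your explanatory and response variables to these numbers, and ask whether x = 0 even makes sense in your context before interpreting the y-intercept.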
2.7 I can explore and describe possible patterns of association between two categorical variables collected from the same subjects by displaying relative frequencies in a two-way table and interpreting the results in context. (See CCSS.8.SP.4)
- Be careful when calculating relative frequencies. Do you want to calculate relative frequencies by row or column? Generally, you will want to calculate along whichever dimension holds your explanatory variable. If your explanatory variable is found in the columns, find relative frequencies by column; if your explanatory variable is found in the rows, find relative frequencies by row. And always make sure your results make sense within the context of the problem. (HM)
- Relative frequencies are percents and raw data are the numbers collected. To find the expected cell count, you can use the formula: row total * column total / total population size. (CL)
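Here is a small sketch of the row versus column choice and of expected cell counts. The two-way table is invented for illustration. The expected count uses the standard formula, row total times column total divided by the grand total.

```python
# Hypothetical two-way table of counts (rows and columns are the two categorical variables).
table = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Relative frequencies by row: each cell divided by its row total.
row_rel = [[cell / rt for cell in row] for row, rt in zip(table, row_totals)]

# Relative frequencies by column: each cell divided by its column total.
col_rel = [[cell / ct for cell, ct in zip(row, col_totals)] for row in table]

# Expected count for each cell if the two variables were not associated.
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

print(row_rel)
print(col_rel)
print(expected)
```

Each row of `row_rel` sums to 1, and each column of `col_rel` sums to 1. Comparing the observed counts with `expected` is one way to describe how strong the association looks.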