1 of 59

Chapter 12 Linear Regression and Correlation

OPENSTAX STATISTICS

2 of 59

Objectives

By the end of this chapter, the student should be able to:

  • Discuss basic ideas of linear regression and correlation.
  • Create and interpret a line of best fit.
  • Calculate and interpret the correlation coefficient.
  • Calculate and interpret outliers.

3 of 59

Introduction

  • Professionals often want to know how two or more numeric variables are related.
  • For example, is there a relationship between the grade on the second math exam a student takes and the grade on the final exam? If there is a relationship, what is the relationship and how strong is it?
  • In this chapter we will begin with correlation, the investigation of relationships among variables that may or may not be founded on a cause and effect model. The variables simply move in the same, or opposite, direction. That is to say, they do not move randomly.
  • Correlation provides a measure of the degree to which this is true.

4 of 59

Section 12.1

LINEAR EQUATIONS

5 of 59

Linear Equations

  • Linear regression for two variables is based on a linear equation with one independent variable. The equation has the form:
  • y = a + bx

where a and b are constant numbers.

  • The variable x is the independent variable, and y is the dependent variable.
  • The graph of a linear equation of the form y = a + bx is a straight line.

6 of 59

Examples of Linear Equations

  •  

7 of 59

Example

  • Graph the equation y = –1 + 2x

8 of 59

Example - Answer

  • Graph the equation y = –1 + 2x
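The graph itself is not reproduced here, so the following is a minimal sketch of how points on the line could be tabulated before plotting. The function name `line` and the chosen x-values are illustrative assumptions, not part of the original example.

```python
def line(x, a=-1.0, b=2.0):
    """Return y = a + b*x; the defaults match y = -1 + 2x."""
    return a + b * x

# A few x-values picked only for illustration.
xs = [-1, 0, 1, 2, 3]
points = [(x, line(x)) for x in xs]

for x, y in points:
    print(f"x = {x:>2}, y = {y:>4}")
```

Plotting these points and joining them gives a straight line that crosses the y-axis at (0, –1) and rises 2 units for every 1-unit increase in x.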

9 of 59

Example

  • A local small business completes federal tax returns for customers. The rate for services is $32 per hour plus a $31.50 one-time charge. The total cost to a customer depends on the number of hours it takes to complete the job. Find the equation that expresses the total cost in terms of the number of hours required to complete the job.

10 of 59

Example - Answers

  • Let x = the number of hours it takes to get the job done.
  • Let y = the total cost to the customer.
  • The $31.50 is a fixed cost. If it takes x hours to complete the tax return, then (32)(x) is the cost of the tax return processing only. The total cost is: y = 31.50 + 32x
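The cost equation can be checked with a short sketch. The names `FIXED_CHARGE`, `HOURLY_RATE`, and `total_cost` are hypothetical, and the 2.5-hour job is an assumed input chosen only to illustrate the calculation.

```python
FIXED_CHARGE = 31.50  # one-time charge in dollars (from the example)
HOURLY_RATE = 32.00   # rate in dollars per hour (from the example)

def total_cost(hours):
    """Total cost y = 31.50 + 32x for a job that takes `hours` hours."""
    return FIXED_CHARGE + HOURLY_RATE * hours

# Assumed example: a job that takes 2.5 hours.
print(total_cost(2.5))
```

The fixed charge plays the role of a (the y-intercept) and the hourly rate plays the role of b (the slope).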

11 of 59

Slope and Y-Intercept of a Linear Equation

  • For the linear equation y = a + bx, b = slope and a = y-intercept.
  • From algebra recall that the slope is a number that describes the steepness of a line, and the y-intercept is the y coordinate of the point (0, a) where the line crosses the y-axis.

  • Three possible graphs of y = a + bx.
  • If b > 0, the line slopes upward to the right.
  • If b = 0, the line is horizontal.
  • If b < 0, the line slopes downward to the right.

12 of 59

Example

  • Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total amount of money Svetlana earns for each session she tutors is y = 25 + 15x.
  • What are the independent and dependent variables? What is the y-intercept and what is the slope? Interpret them using complete sentences.

13 of 59

Example - Answers

  • The independent variable (x) is the number of hours Svetlana tutors each session. The dependent variable (y) is the amount, in dollars, Svetlana earns for each session.
  • The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-time fee of $25 (this is when x = 0). The slope is 15 (b = 15). For each session, Svetlana earns $15 for each hour she tutors.

14 of 59

Section 12.2

SCATTER PLOTS

15 of 59

Scatter Plots

  • Before we take up the discussion of linear regression and correlation, we need to examine a way to display the relation between two variables x and y.
  • The most common and easiest way is a scatter plot.
  • A scatter plot shows the direction of a relationship between the variables.
  • A clear direction happens when there is either:
    • High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable.
    • High values of one variable occurring with low values of the other variable.

16 of 59

Scatter Plots, cont.

  • You can determine the strength of the relationship by looking at the scatter plot and seeing how close the points are to a line.
  • When you look at a scatterplot, you want to notice the overall pattern and any deviations from the pattern.
  • The following scatterplot examples illustrate these concepts.

17 of 59

Some Scatter Plots

Remember, all the correlation coefficient tells us is whether or not the data are linearly related. In panel (d) the variables obviously have some type of very specific relationship to each other, but the correlation coefficient is zero, indicating no linear relationship exists.

18 of 59

Some More Scatter Plots

19 of 59

Some More Scatter Plots

20 of 59

Some More Scatter Plots

21 of 59

Section 12.3

THE REGRESSION EQUATION

22 of 59

Regression Line

  • In this chapter, we are interested in scatter plots that show a linear pattern.
  • If we think that the points show a linear relationship, we would like to draw a line on the scatter plot. This line can be calculated through a process called linear regression.
  • If x is the independent variable and y the dependent variable, then we can use a regression line to predict y for a given value of x.
  • Typically, you have a set of data whose scatter plot appears to "fit" a straight line. The line that fits such data most closely is called a Line of Best Fit or Least-Squares Line.
  • Note: Computer spreadsheets, statistical software, and many calculators can quickly calculate the best-fit line and create the graphs. The calculations tend to be tedious if done by hand.

23 of 59

Reminder about The Regression Line

  • Remember, it is always important to plot a scatter diagram first. If the scatter plot indicates that there is a linear relationship between the variables, then it is reasonable to use a best fit line to make predictions for y given x within the domain of x-values in the sample data, but not necessarily for x-values outside that domain. You could use the line to predict the final exam score for a student who earned a grade of 73 on the third exam. You should NOT use the line to predict the final exam score for a student who earned a grade of 50 on the third exam, because 50 is not within the domain of the x-values in the sample data, which are between 65 and 75.

24 of 59

Example

  • A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score? See Excel spreadsheet.

x (third exam score)    y (final exam score)
65                      175
67                      133
71                      185
71                      163
66                      126
75                      198
67                      153
70                      163
71                      159
69                      151
69                      159

25 of 59

Example - Answers

26 of 59

Least Squares Method

  • The third exam score, x, is the independent variable and the final exam score, y, is the dependent variable. We will plot a regression line that best "fits" the data.
  • If each of you were to fit a line "by eye," you would draw different lines. We can use what is called a least-squares regression line to obtain the best fit line.
  • Consider the following diagram. Each point of data is of the form (x, y) and each point of the line of best fit using least-squares linear regression has the form (x, ŷ).

The ŷ is read "y hat" and is the estimated value of y. It is the value of y obtained using the regression line. It is not generally equal to y from data.
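As a rough illustration of what software does when it computes the least-squares line, the sketch below applies the standard textbook formulas b = Σ(x – x̄)(y – ȳ) / Σ(x – x̄)² and a = ȳ – b·x̄ to the eleven exam scores. These formulas are not written out in the slides, and the function name `least_squares` is an assumption made for this example.

```python
# Third exam (x) and final exam (y) scores from the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x that minimizes
    the sum of squared vertical distances to the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

a, b = least_squares(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Rounded to two decimals, the result should agree with the line of best fit reported for this data set, ŷ = –173.51 + 4.83x.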

27 of 59

The Correlation Coefficient r

  •  

28 of 59

The Correlation Coefficient r, cont.

  • What the VALUE of r tells us:
    • The value of r is always between –1 and +1: –1 ≤ r ≤ 1.
    • The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
    • If r = 0 there is absolutely no linear relationship between x and y (no linear correlation).
    • If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation.
  • In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.
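The properties above can be checked numerically. The sketch below assumes the usual computational form of Pearson's correlation coefficient, r = [nΣxy – (Σx)(Σy)] / √([nΣx² – (Σx)²][nΣy² – (Σy)²]), which is not shown in the slides, and applies it to the third exam / final exam data. The function name `pearson_r` is hypothetical.

```python
import math

# Third exam (x) and final exam (y) scores from the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def pearson_r(xs, ys):
    """Sample correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi ** 2 for xi in xs)
    sum_y2 = sum(yi ** 2 for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

For this data the value is positive and well inside the interval from –1 to +1, which matches a moderately strong upward trend in the scatter plot.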

29 of 59

The Correlation Coefficient r, cont.

  • What the SIGN of r tells us
    • A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to decrease (positive correlation).
    • A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to increase (negative correlation).
    • The sign of r is the same as the sign of the slope, b, of the best-fit line.
  • Strong correlation does not suggest that x causes y or y causes x. We say "correlation does not imply causation."

30 of 59

The Correlation Coefficient r, cont.

  1. A scatter plot showing data with a positive correlation. 0 < r < 1
  2. A scatter plot showing data with a negative correlation. –1 < r < 0
  3. A scatter plot showing data with zero correlation. r = 0

31 of 59

Interpreting the Intercept and Slope

  • For the linear equation y = a + bx, b = slope and a = y-intercept.
  • From algebra recall that the slope is a number that describes the steepness of a line, and the y-intercept is the y coordinate of the point (0, a) where the line crosses the y-axis.
    • From calculus, the slope is the first derivative of the function. For a linear function the slope is dy/dx = b, which we can read as "the change in y (dy) that results from a change in x (dx) is b · dx."
  • Again, here are some possible situations in linear regression:

32 of 59

The Coefficient of Determination

  • The variable r² is called the coefficient of determination and is the square of the correlation coefficient, but is usually stated as a percent, rather than in decimal form.
  • It has an interpretation in the context of the data:
    • r², when expressed as a percent, represents the percent of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
    • 1 – r², when expressed as a percentage, represents the percent of variation in y that is NOT explained by variation in x using the regression line.

33 of 59

Example

  •  

34 of 59

Example - Answers

  •  

35 of 59

Example - Answers

The line of best fit is: ŷ = –173.51 + 4.83x

    • The correlation coefficient is r = 0.6631
    • The coefficient of determination is r² = 0.6631² = 0.4397

Interpretation of r2 in the context of this example:

  • Approximately 44% of the variation (0.4397 is approximately 0.44) in the final-exam grades can be explained by the variation in the grades on the third exam, using the best-fit regression line.
  • Therefore, approximately 56% of the variation (1 – 0.44 = 0.56) in the final exam grades can NOT be explained by the variation in the grades on the third exam, using the best-fit regression line. (This is seen as the scattering of the points about the line.)
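The arithmetic behind this interpretation is small enough to sketch directly. The variable names below are illustrative; only r = 0.6631 comes from the example.

```python
r = 0.6631                      # sample correlation coefficient from the example

r_squared = r ** 2              # coefficient of determination
explained = round(100 * r_squared)   # percent of variation in y explained by x
unexplained = 100 - explained        # percent of variation NOT explained

print(f"r^2 = {r_squared:.4f}")
print(f"explained: about {explained}%, unexplained: about {unexplained}%")
```

Rounding to whole percents first, as the slide does, gives 44% explained and 56% unexplained.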

36 of 59

Example

  • A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?
  • a. What would you predict the final exam score to be for a student who scored a 66 on the third exam?
  • b. What would you predict the final exam score to be for a student who scored a 90 on the third exam?

37 of 59

Example - Answers

  •  
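One way to work through parts (a) and (b) is sketched below, assuming the best-fit line ŷ = –173.51 + 4.83x and the observed x-range of 65 to 75 from the sample data. The function name `predict_final` and the choice to return `None` for out-of-range inputs are assumptions made for illustration.

```python
A, B = -173.51, 4.83     # intercept and slope of the best-fit line
X_MIN, X_MAX = 65, 75    # range of third exam scores in the sample

def predict_final(third_exam):
    """Predict the final exam score, but only inside the observed x-range.

    Returns None when the input lies outside the sample's domain,
    because the line should not be used for prediction there.
    """
    if not (X_MIN <= third_exam <= X_MAX):
        return None
    return A + B * third_exam

print(predict_final(66))   # inside the domain: a prediction is reasonable
print(predict_final(90))   # outside the domain: no prediction is made
```

For a third exam score of 66 the line gives ŷ = –173.51 + 4.83(66) = 145.27. A score of 90 lies outside the observed range (and above the 80-point maximum for that exam), so the line should not be used to predict a final exam score for it.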

38 of 59

Section 12.4

TESTING THE SIGNIFICANCE OF THE CORRELATION COEFFICIENT

39 of 59

Testing the Significance of the Correlation Coefficient

  • We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.
  • The symbol for the population correlation coefficient is ρ, the Greek letter "rho."
    • ρ = population correlation coefficient (unknown)
    • r = sample correlation coefficient (known; calculated from sample data)
  • The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and the sample size n.

40 of 59

Performing the Hypothesis Test

  • Null Hypothesis: H0: ρ = 0
  • Alternate Hypothesis: Ha: ρ ≠ 0
  • WHAT THE HYPOTHESES MEAN IN WORDS:
    • Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between x and y in the population.
    • Alternate Hypothesis Ha: The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

Note:

    • If r is significant and the scatter plot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x values.
    • If r is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
    • If r is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed x values in the data.

41 of 59

Performing the Hypothesis Test, cont.

DRAWING A CONCLUSION:

  • There are two methods of making the decision. The two methods are equivalent and give the same result.
    • Method 1: Using the p-value
    • Method 2: Using a table of critical values
  • In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05
  • You can use Excel to calculate a p-value easily from your data
  • Note: Using the p-value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, α = 0.05. (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)

42 of 59

Test Statistic

  • There are two methods of making the decision concerning the hypothesis. The test statistic to test this hypothesis is:

t = r√(n – 2) / √(1 – r²), with n – 2 degrees of freedom.

This is a t-statistic and operates in the same way as other t tests. Calculate the t-value and compare it with the critical value from the t-table at the appropriate degrees of freedom and the chosen level of significance. If the calculated value is in the tail, we reject the null hypothesis that there is no linear relationship between the two variables. If the calculated t-value is NOT in the tail, we cannot reject the null hypothesis that there is no linear relationship between the two variables.
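A minimal sketch of this test for the third exam / final exam data follows. It assumes the standard form of the statistic, t = r√(n – 2) / √(1 – r²) with n – 2 degrees of freedom, and hard-codes the two-tailed 5% critical value for 9 degrees of freedom (2.262) from a standard t-table rather than computing it. The names `t_statistic` and `T_CRITICAL_DF9` are hypothetical.

```python
import math

def t_statistic(r, n):
    """t-statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed critical value for alpha = 0.05 and df = 9,
# taken from a standard t-table (assumed, not computed here).
T_CRITICAL_DF9 = 2.262

t = t_statistic(0.6631, 11)
reject_null = abs(t) > T_CRITICAL_DF9

print(f"t = {t:.2f}, reject H0: {reject_null}")
```

Here the calculated t is about 2.66, which falls in the tail beyond 2.262, so the null hypothesis of no linear relationship is rejected at the 5% level.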

43 of 59

Shorthand for Testing the Significance of r

  • A quick shorthand way to test correlations is the relationship between the sample size and the correlation.
  • If:

|r| ≥ 2 / √n

then the correlation between the two variables indicates that a linear relationship exists and is statistically significant at approximately the 0.05 level of significance. As the formula indicates, there is an inverse relationship between the sample size and the required correlation for significance of a linear relationship.
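The shorthand can be sketched as a one-line check, assuming the rule intended here is the commonly cited approximation |r| ≥ 2/√n. The function name `roughly_significant` is hypothetical, and the (r, n) pairs are taken from examples elsewhere in this chapter.

```python
import math

def roughly_significant(r, n):
    """Approximate 5%-level check: is |r| at least 2 / sqrt(n)?"""
    return abs(r) >= 2 / math.sqrt(n)

# (r, n) pairs that appear in this chapter's examples.
for r, n in [(0.6631, 11), (-0.624, 14), (0.134, 14), (0.776, 6)]:
    threshold = 2 / math.sqrt(n)
    print(f"r = {r:>7}, n = {n:>2}, threshold = {threshold:.3f}, "
          f"significant: {roughly_significant(r, n)}")
```

Because the threshold shrinks as n grows, a larger sample needs a smaller correlation to be judged significant. This is only an approximation; the critical value table or the t-statistic should be used for the actual decision.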

44 of 59

Misuse of Correlation Coefficients

  • Correlations may be helpful in visualizing the data, but are not appropriately used to "explain" a relationship between two variables.
  • Perhaps no single statistic is more misused than the correlation coefficient. Citing correlations between health conditions and everything from place of residence to eye color has the effect of implying a cause and effect relationship.
  • The correlation coefficient is, of course, innocent of this misinterpretation. It is the duty of the analyst to use a statistic that is designed to test for cause and effect relationships and report only those results if they are intending to make such a claim.
  • The problem is that passing this more rigorous test is difficult, so lazy and/or unscrupulous "researchers" fall back on correlations when they cannot make their case legitimately.

45 of 59

Example

  • Suppose you computed r = –0.624 with 14 data points. df = 14 – 2 = 12. The critical values are –0.532 and 0.532. Since –0.624 < –0.532, r is significant and the line can be used for prediction.
  • Suppose you computed r = 0.776 and n = 6. df = 6 – 2 = 4. The critical values are –0.811 and 0.811. Since –0.811 < 0.776 < 0.811, r is not significant, and the line should not be used for prediction.

46 of 59

Example with the Final Exam Example

Consider the third exam / final exam example from earlier. The line of best fit is: ŷ = –173.51 + 4.83x with r = 0.6631 and there are n = 11 data points. Can the regression line be used for prediction? Given a third-exam score (x value), can we use the line to predict the final exam score (predicted y value)?

    • H0: ρ = 0
    • Ha: ρ ≠ 0
    • α = 0.05
  • Use the "95% Critical Value" table for r with df = n – 2 = 11 – 2 = 9.
  • The critical values are –0.602 and +0.602
  • Since 0.6631 > 0.602, r is significant.
  • Decision: Reject the null hypothesis.
  • Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the third exam score (x) and the final exam score (y) because the correlation coefficient is significantly different from zero.

47 of 59

A Few More Examples

Suppose you computed the following correlation coefficients.

  • r = –0.567 and the sample size, n, is 19. The df = n – 2 = 17. The critical value is –0.456. –0.567 < –0.456 so r is significant.
  • r = 0.708 and the sample size, n, is nine. The df = n – 2 = 7. The critical value is 0.666. 0.708 > 0.666 so r is significant.
  • r = 0.134 and the sample size, n, is 14. The df = 14 – 2 = 12. The critical value is 0.532. 0.134 is between –0.532 and 0.532 so r is not significant.
  • r = 0 and the sample size, n, is five. No matter what the df is, r = 0 is between the two critical values so r is not significant.
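The critical value method used in these examples can be sketched as a table lookup. The dictionary below contains only the 95% critical values quoted in this chapter's examples (it is not the full table), and the names `CRITICAL_R_95` and `is_significant` are hypothetical.

```python
# 95% critical values of r, keyed by degrees of freedom (df = n - 2).
# Only the entries quoted in this chapter's examples are included.
CRITICAL_R_95 = {4: 0.811, 7: 0.666, 9: 0.602, 12: 0.532, 17: 0.456}

def is_significant(r, n):
    """True when r lies outside the interval (-critical, +critical)."""
    df = n - 2
    critical = CRITICAL_R_95[df]   # raises KeyError for a df not listed above
    return abs(r) > critical

for r, n in [(-0.567, 19), (0.708, 9), (0.134, 14), (0.776, 6)]:
    print(f"r = {r:>6}, n = {n:>2}: significant = {is_significant(r, n)}")
```

Comparing |r| with the positive critical value is equivalent to checking whether r falls below the negative critical value or above the positive one.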

48 of 59

Sections 12.5 and 12.6

PREDICTION AND OUTLIERS

49 of 59

Outliers

  • In some data sets, there are values (observed data points) called outliers.
  • Outliers are observed data points that are far from the least squares line.
  • It is possible that an outlier is a result of erroneous data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data.
  • The key is to examine carefully what causes a data point to be an outlier.

50 of 59

Identifying Outliers

  • We could guess at outliers by looking at a graph of the scatterplot and best-fit line. However, we would like some guideline as to how far away a point needs to be in order to be considered an outlier.
  • As a rough rule of thumb, we can flag any point that is located farther than two standard deviations above or below the best-fit line as an outlier.
  • The standard deviation used is the standard deviation of the residuals or errors.
  • We can do this visually in the scatter plot by drawing an extra pair of lines that are two standard deviations above and below the best-fit line.
  • Any data points that are outside this extra pair of lines are flagged as potential outliers.

51 of 59

How does the outlier affect the best fit line?

  • If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance is at least 2s, then we would consider the data point to be "too far" from the line of best fit.
  • We need to find and graph the lines that are two standard deviations below and above the regression line. Any point that is outside these two lines is a potential outlier.
  • Note: When outliers are deleted, the researcher should either record that data was deleted, and why, or the researcher should provide results both with and without the deleted data. If data is erroneous and the correct values are known (e.g., student one actually scored a 70 instead of a 65), then this correction can be made to the data.

52 of 59

Example

  • A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Prove that this data set has at least 1 outlier. Create a table that compares the output of the regression to the actual values and determine if any other outliers exist.

53 of 59

Example – Answers
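The two-standard-deviation rule can be sketched for this data set as follows. The sketch assumes the standard deviation of the residuals is s = √(SSE / (n – 2)) and flags points whose residual exceeds 2s in absolute value; the variable names are illustrative.

```python
import math

# Third exam (x) and final exam (y) scores from the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

# Least-squares slope and intercept.
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Residuals (observed minus predicted) and their standard deviation.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

# Compare predicted and actual values, flagging residuals beyond 2s.
for xi, yi, e in zip(x, y, residuals):
    flag = "  <-- potential outlier" if abs(e) > 2 * s else ""
    print(f"x = {xi}, y = {yi}, y-hat = {yi - e:6.1f}, residual = {e:6.1f}{flag}")

outliers = [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]
print(f"s = {s:.1f}, potential outliers: {outliers}")
```

With s of roughly 16.4, the only point whose residual exceeds 2s is (65, 175); every other point lies within the band around the best-fit line.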

54 of 59

Identifying Outliers �With Technology

55 of 59

Example

  • A random sample of 11 statistics students produced the following data, where x is the third exam score out of 80, and y is the final exam score out of 200. Remove any outliers, then compute the regression equation again.

56 of 59

Example - Answers

  • ŷ = –355.19 + 7.39x and r = 0.9121
  • The new line with r = 0.9121 is a stronger correlation than the original (r = 0.6631) because r = 0.9121 is closer to one. This means that the new line is a better fit to the ten remaining data values. The line can better predict the final exam score given the third exam score.
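The effect of removing the flagged point can be reproduced with a short sketch. It assumes the outlier being removed is (65, 175), the one point more than two residual standard deviations from the original line, and uses a hypothetical helper `fit` that returns the intercept, slope, and correlation coefficient.

```python
import math

def fit(points):
    """Return (a, b, r): intercept, slope, and correlation for (x, y) pairs."""
    n = len(points)
    x_bar = sum(p[0] for p in points) / n
    y_bar = sum(p[1] for p in points) / n
    sxx = sum((px - x_bar) ** 2 for px, _ in points)
    syy = sum((py - y_bar) ** 2 for _, py in points)
    sxy = sum((px - x_bar) * (py - y_bar) for px, py in points)
    b = sxy / sxx
    a = y_bar - b * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

# Drop the assumed outlier and refit on the ten remaining points.
kept = [p for p in data if p != (65, 175)]
a_new, b_new, r_new = fit(kept)

print(f"y-hat = {a_new:.2f} + {b_new:.2f}x, r = {r_new:.4f}")
```

The refit values should round to the line and correlation coefficient reported above for the ten remaining data values.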

57 of 59

Example

  • The Consumer Price Index (CPI) measures the average change over time in the prices paid by urban consumers for consumer goods and services. The CPI affects nearly all Americans because of the many ways it is used. One of its biggest uses is as a measure of inflation. By providing information about price changes in the Nation's economy to government, business, and labor, the CPI helps them to make economic decisions. The President, Congress, and the Federal Reserve Board use the CPI's trends to formulate monetary and fiscal policies. In the following table, x is the year and y is the CPI.

58 of 59

Example, cont.

  1. Draw a scatterplot of the data.
  2. Calculate the least squares line. Write the equation in the form ŷ = a + bx.
  3. Draw the line on the scatterplot.
  4. Find the correlation coefficient. Is it significant?
  5. What is the average CPI for the year 1990?

59 of 59

Example - Answers

  1. See graph to the right.
  2. ŷ = –3204 + 1.662x is the equation of the line of best fit.
  3. r = 0.8694
  4. The number of data points is n = 14. Use the 95% Critical Values of the Sample Correlation Coefficient table at the end of Chapter 12. n – 2 = 12. The corresponding critical value is 0.532. Since 0.8694 > 0.532, r is significant.
  5. ŷ = –3204 + 1.662(1990) = 103.4 CPI
  6. Using technology, we find that s = 25.4; graphing the lines Y2 = –3204 + 1.662X – 2(25.4) and Y3 = –3204 + 1.662X + 2(25.4) shows that no data values are outside those lines, identifying no outliers. (Note that the year 1999 was very close to the upper line, but still inside it.)
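The prediction for 1990 and the two outlier lines can be sketched from the reported coefficients alone. The CPI data table is not reproduced here, so the sketch does not check individual data points; the names `predict_cpi` and `outlier_band` are hypothetical.

```python
A, B = -3204.0, 1.662   # intercept and slope of the reported best-fit line
S = 25.4                # reported standard deviation of the residuals

def predict_cpi(year):
    """Predicted CPI from the line y-hat = -3204 + 1.662x."""
    return A + B * year

def outlier_band(year):
    """Return (lower, upper): the lines two standard deviations
    below and above the best-fit line at the given year."""
    y_hat = predict_cpi(year)
    return y_hat - 2 * S, y_hat + 2 * S

low, high = outlier_band(1990)
print(f"Predicted CPI for 1990: {predict_cpi(1990):.1f}")
print(f"Outlier band for 1990: {low:.1f} to {high:.1f}")
```

An observed CPI value for a given year would be flagged as a potential outlier only if it fell outside the band returned for that year.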
