Coefficient of Determination, R²
Objective 1
R² is the coefficient of determination. For least-squares linear regression it is literally the correlation coefficient, r, squared.
The coefficient of determination, R², measures the proportion of total variation in the response variable that is explained by the least-squares regression line.
The coefficient of determination is a number between 0 and 1, inclusive. That is, 0 ≤ R² ≤ 1.
If R² = 0, the line has no explanatory value.
If R² = 1, the line explains 100% of the variation in the response variable.
The data to the right are based on the study for drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, depth at which drilling begins is the predictor variable, x, and time (in minutes) to drill five feet is the response variable, y.
Sample Statistics
         Mean     St_Dev
Depth    126.2    52.2
Time     6.99     0.781
Correlation: 0.773
Regression Analysis
The regression equation is still
y = 5.53 + 0.0116 * x
Or, since x is depth and y is Time:
Time = 5.53 + 0.0116 Depth
Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”?
ANSWER:
With no information about depth, we would use the mean of all the available drilling times, that is, the mean time to drill an additional 5 feet: 6.99 minutes.
Now suppose that we are asked to predict the time to drill an additional 5 feet with our regression equation if the current depth of the drill is 160 feet.
ANSWER:
Time = 5.53 + 0.0116(160) = 7.39 minutes
Our “guess” increased from 6.99 minutes to 7.39 minutes based on the knowledge that drill depth is positively associated with drill time.
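The two "guesses" can be compared with a short Python sketch. The coefficients are the rounded values from the regression equation above; the helper name predict_time is illustrative and does not come from the slides.

```python
# Fitted regression line from the slides: Time = 5.53 + 0.0116 * Depth
INTERCEPT = 5.53   # minutes
SLOPE = 0.0116     # additional minutes per foot of starting depth

def predict_time(depth_ft):
    """Predicted minutes to drill 5 feet starting at depth_ft (hypothetical helper)."""
    return INTERCEPT + SLOPE * depth_ft

mean_time = 6.99                 # best guess when the depth is unknown
prediction = predict_time(160)   # guess when the depth is known to be 160 feet

print(f"Without depth: {mean_time:.2f} minutes")
print(f"At 160 feet:   {prediction:.2f} minutes")
```

Because the slope is positive and 160 feet is above the mean depth of 126.2 feet, the prediction lands above the mean time.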
The difference between the observed value of the response variable and the mean value of the response variable is called the total deviation and is equal to $y - \bar{y}$.
The difference between the predicted value of the response variable and the mean value of the response variable is called the explained deviation and is equal to $\hat{y} - \bar{y}$.
The difference between the observed value of the response variable and the predicted value of the response variable is called the unexplained deviation and is equal to $y - \hat{y}$.
Total Deviation = Unexplained Deviation + Explained Deviation
We want this measure not just for one point but for every point in the regression analysis. Summing the squared deviations over all observations gives the corresponding variations, therefore:
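$$\text{Total Variation} = \text{Unexplained Variation} + \text{Explained Variation}$$

$$1 = \frac{\text{Unexplained Variation}}{\text{Total Variation}} + \frac{\text{Explained Variation}}{\text{Total Variation}}$$

$$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$$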
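The decomposition can be checked numerically. The sketch below assumes that each "variation" is the sum of the squared deviations over all observations, and it uses a small made-up data set (not the drilling measurements, which are not reproduced here). All variable names are illustrative.

```python
# Hypothetical (x, y) data for illustration; NOT the drilling measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squared total, explained, and unexplained deviations.
total = sum((yi - y_bar) ** 2 for yi in y)
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = explained / total

print("total variation:        ", round(total, 4))
print("explained + unexplained:", round(explained + unexplained, 4))
print("R^2 = explained/total:  ", round(r_squared, 4))
print("R^2 = 1 - unexpl./total:", round(1 - unexplained / total, 4))
```

Both expressions for R² agree here because the line was fitted by least squares.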
To determine R² for the linear regression model, simply square the value of the linear correlation coefficient, r.
Squaring the linear correlation coefficient to obtain the coefficient of determination works only for the least-squares linear regression model.
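A small Python sketch can illustrate why the shortcut is tied to the least-squares line. It uses made-up data (not the drilling measurements) and a hypothetical helper, variation_ratios, to compare r² with the two variation ratios for the least-squares line and for an arbitrary other line.

```python
from math import sqrt

# Hypothetical (x, y) data for illustration; NOT the drilling measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # total variation
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)   # linear correlation coefficient
b1 = sxy / sxx              # least-squares slope
b0 = y_bar - b1 * x_bar     # least-squares intercept

def variation_ratios(intercept, slope):
    """Return (explained / total, 1 - unexplained / total) for a given line."""
    y_hat = [intercept + slope * xi for xi in x]
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return explained / syy, 1 - unexplained / syy

least_squares = variation_ratios(b0, b1)
other_line = variation_ratios(b0, b1 + 0.5)   # arbitrary line, not least-squares

print("r squared:         ", round(r ** 2, 4))
print("least-squares line:", [round(v, 4) for v in least_squares])
print("other line:        ", [round(v, 4) for v in other_line])
```

For the least-squares line both ratios equal r². For the other line the two ratios no longer agree with each other, so r² does not describe that line's fit.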
EXAMPLE Determining the Coefficient of Determination
Find and interpret the coefficient of determination for the drilling data.
Because the linear correlation coefficient, r, is 0.773, we have
R² = 0.773² = 0.5975 = 59.75%.
So, 59.75% of the variability in drilling time is explained by the least-squares regression line.
Draw a scatter diagram for each of these data sets. For each data set, the variance of y is 17.49.
[Scatter diagrams: Data Set A, Data Set B, Data Set C]
Data Set A: 99.99% of the variability in y is explained by the least-squares regression line.
Data Set B: 94.7% of the variability in y is explained by the least-squares regression line.
Data Set C: 9.4% of the variability in y is explained by the least-squares regression line.
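General regression equation formula: $\hat{y} = b_1 x + b_0$
Slope of the regression: $b_1 = r \cdot \dfrac{s_y}{s_x}$
y-intercept of the regression: $b_0 = \bar{y} - b_1 \bar{x}$
Coefficient of determination: $R^2 = r^2$
Correlation coefficient: $r = \dfrac{\sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)}{n - 1}$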
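As a consistency check, the slope, intercept, and coefficient of determination for the drilling data can be recovered from the sample statistics alone. The sketch below assumes the standard least-squares identities (slope = r times s_y / s_x, intercept = mean of y minus slope times mean of x); the variable names are illustrative.

```python
# Summary statistics reported in the slides for the drilling data.
x_bar, s_x = 126.2, 52.2    # depth (feet): mean, standard deviation
y_bar, s_y = 6.99, 0.781    # time (minutes): mean, standard deviation
r = 0.773                   # linear correlation coefficient

# Assumed standard least-squares identities.
b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # y-intercept
r_squared = r ** 2          # coefficient of determination

print(f"slope       b1  = {b1:.4f}")
print(f"intercept   b0  = {b0:.2f}")
print(f"R squared       = {r_squared:.4f}")
```

The rounded results match the regression equation Time = 5.53 + 0.0116 Depth and the value R² = 0.5975 found in the example.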