2.3.2
Least-Squares Regression
Vocabulary
Residual
The difference between the actual observed value of y for a given value of x and the predicted value of y for the same value of x.
(What actually happens - what was predicted)
Residual = Actual y - Predicted y
or
Residual = y - ŷ
Regression Line (or Least Squares Regression Line)
A line that models how a response variable y changes as the explanatory variable x changes.
Expressed in the form ŷ = a + bx
ŷ (pronounced “y-hat”) is the predicted value of the response variable for any given value of the explanatory variable (x)
Any time you see a “hat” on something, it means “predicted”
Extrapolation
The use of a regression line for prediction outside the interval of x-values used to obtain the line. [in math - going outside the plausible domain]
The further we extrapolate, the less reliable the predictions.
Residuals
Residual:
Actual y - Predicted y
6.8 - 5.4 = 1.4
With that point, the LSRM is underestimating the actual observed value, since the line is below the point (positive residual value).
If the LSRM/LSRL line is above a point, the LSRM is overestimating the observed value (negative residual value).
Actual Point (5, 6.8)
Predicted Point (5, 5.4)
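The residual arithmetic above can be sketched in Python, using the actual and predicted points from this example:

```python
# Residual = Actual y - Predicted y
actual_y = 6.8     # observed value at x = 5
predicted_y = 5.4  # value the LSRM predicts at x = 5

residual = actual_y - predicted_y
print(round(residual, 1))  # 1.4 -> positive, so the line underestimates here
```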
The goal is to minimize all the residual “gaps” and get the sum of all the residuals as close to 0 as possible.
The best LSRM accomplishes this goal.
Least Squares Regression Model
Also called:
Linear Regression Model / Line
Line of Best Fit (from IM1)
3 Ways to calculate the LSRM:
Method 1: Algebraically, using formulas
Method 2: Reading a statistical software printout
Method 3: From raw data put into your calculator (STAT → CALC menu: 4: LinReg(ax+b) or 8: LinReg(a+bx))
Method 1: Algebraically using Formulas
We use the formulas when we don’t have individual data, just means and St. Dev.
LSRM Formula: ŷ = a + bx
Slope Formula: b = r(s_y / s_x)
where r = correlation coefficient, s_x = St. Dev. of x, s_y = St. Dev. of y
Y-int Formula: a = ȳ - b·x̄
where x̄ = mean of all x-values, ȳ = mean of all y-values
Notice the LSRM Formula and Y-int Formula are incredibly similar. Rearranging the Y-int Formula gives ȳ = a + b·x̄, which means the LSRM line passes through the point (x̄, ȳ). There is usually a multiple choice question about this fact.
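That pass-through fact can be checked numerically. A sketch, where the summary statistics (x̄ = 10, ȳ = 50, s_x = 2, s_y = 8, r = 0.5) are made-up values:

```python
x_bar, y_bar = 10.0, 50.0  # hypothetical mean of x and mean of y
s_x, s_y = 2.0, 8.0        # hypothetical standard deviations
r = 0.5                    # hypothetical correlation coefficient

b = r * (s_y / s_x)    # slope formula
a = y_bar - b * x_bar  # y-intercept formula

# Plugging the mean of x into the LSRM returns the mean of y:
print(a + b * x_bar)  # 50.0
```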
Example:
A random sample of 15 high school students was selected from the U.S. Census At School database. The foot length (cm.) and height (cm.) of each student in the sample was recorded. The mean foot length was 24.76 cm. with a standard deviation of 2.71 cm. The mean height was 171.43 cm. with a standard deviation of 10.69 cm. The correlation between foot length and height is r = 0.697. Find the equation of the least squares regression line for predicting height from foot length.
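Plugging the summary statistics into the slope and y-intercept formulas can be sketched in Python (the numbers come from the example above):

```python
# Summary statistics from the example
x_bar, s_x = 24.76, 2.71    # foot length: mean, St. Dev.
y_bar, s_y = 171.43, 10.69  # height: mean, St. Dev.
r = 0.697                   # correlation coefficient

b = r * (s_y / s_x)    # slope, approximately 2.749
a = y_bar - b * x_bar  # y-intercept, approximately 103.35

print(f"predicted height = {a:.2f} + {b:.3f}(foot length)")
```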
Method 2: Reading a statistical software printout (maybe most common)
Example measuring Wind Velocity (mph) vs. Electricity Production (amperes)
In the printout, the row labeled “Constant” (it sometimes says “Intercept”) gives a = y-int, and the row labeled with the explanatory variable’s name (key word) gives b = slope.
Amperes = 0.137 + 0.240(mph)
Example File Size (kilobytes) vs. Printing Time (seconds)
Write the LSRM Equation
Predict the print time for a file size of 20 kb.
Time = 11.6559 + 3.47812(kb.)
81.2183 sec.
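The prediction above, sketched in Python with the coefficients read off the printout:

```python
a, b = 11.6559, 3.47812  # y-intercept and slope from the printout
file_size = 20           # kilobytes

predicted_time = a + b * file_size
print(round(predicted_time, 4))  # 81.2183 seconds
```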
Correlation Coefficient (r)
Computer outputs also give you the R² value.
Square-root it to find r. (The square root only gives |r|; r takes the same sign as the slope.)
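A sketch of recovering r from a printout’s R², where R² = 0.64 and the negative slope are made-up values:

```python
import math

r_squared = 0.64  # hypothetical R^2 from a computer output
slope = -1.8      # hypothetical slope from the same output

# Square-rooting gives |r|; r always has the same sign as the slope
r = math.copysign(math.sqrt(r_squared), slope)
print(r)  # -0.8
```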
Method 3: From raw data put into your calculator (also common)
Graphing Calculator
(STAT → EDIT)
y=a+bx tells you the format
a = y-intercept and b = slope (with option 8: LinReg(a+bx); option 4: LinReg(ax+b) swaps them, so a = slope and b = y-int)
r = correlation coefficient
r² = “r-squared”
Fruit Snacks Activity
Hand Size (cm) | Number of Candies
21.5 | 13 |
20 | 12 |
19.5 | 12 |
20 | 14 |
23.5 | 20 |
22.5 | 14 |
21 | 12 |
22 | 14 |
22.5 | 15 |
21 | 13 |
18 | 8 |
20 | 15 |
22 | 14 |
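Without a calculator, the same least-squares coefficients can be computed directly from the table with the standard formulas; a sketch for the fruit-snacks data above:

```python
hand = [21.5, 20, 19.5, 20, 23.5, 22.5, 21, 22, 22.5, 21, 18, 20, 22]
candy = [13, 12, 12, 14, 20, 14, 12, 14, 15, 13, 8, 15, 14]

n = len(hand)
x_bar = sum(hand) / n
y_bar = sum(candy) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hand, candy))
sxx = sum((x - x_bar) ** 2 for x in hand)
b = sxy / sxx          # slope, approximately 1.40
a = y_bar - b * x_bar  # y-intercept, approximately -16.0

print(f"candy-hat = {a:.2f} + {b:.2f}(hand size)")
```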
Important Note
Response = y-int + slope(Explanatory)
If you want to go the algebra route, it would look like:
ŷ = 1.10 + 1.64x
But: DEFINE YOUR VARIABLES
ŷ = predicted Response,
x = Explanatory
Failure to emphasize that the LSRM gives the PREDICTED Response variable will result in a P instead of an E (on the AP free-response scoring scale).
How well does the LSRM fit the Data?
The correlation coefficient (r) tells us mainly two things: the direction (positive or negative) and the strength (strong or weak) of the linear relationship.
But eventually we will have an LSRM line going through the data, and we will use its equation to make predictions.
If we take our correlation coefficient (r) and square it, we get R², the coefficient of determination.
R² tells us two things:
R²
The official interpretation of R²:
“R²% of the variability in the Response Variable can be explained by the approximate linear relationship with the Explanatory Variable.”
Knowing how to interpret R² is really all you need to do with R²
Interpretations
Unit 2 is all about interpreting numbers. Expect a lot of it on the Test.
These are the most common ones you will come across.
r - There is a Strong/Weak, Positive/Negative linear relationship between the Explanatory and Response Variables (Use Context).
Slope - “For every 1 unit increase in the explanatory variable, our model predicts an average increase of [slope] units in the response variable.”
y-Intercept - “At an explanatory variable value of 0 units, our model predicts a response variable value of [y-intercept] units.”
(This sentence might seem like nonsense, but that’s ok. Write it anyway.)
R²
“R²% of the variation in the Response variable can be explained by the approximate linear relationship with the Explanatory variable.”
Standard Deviation of Residuals (s) - “s is the typical distance between the actual Response variable and the predicted Response variable for a given Explanatory variable.”
Z-Scores and Standard Deviations (It sneaks into the test every now and then).
Remember, Z-Scores just count how many St. Dev. away from the mean a value is.
“For every 1 SD you move over on the x-axis, you move r SDs on the y-axis.”
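That fact follows from the slope formula: moving one s_x in x changes ŷ by b·s_x = r·s_y, i.e. r standard deviations of y. A quick check with made-up numbers (r = 0.7, s_x = 3, s_y = 12 are hypothetical):

```python
r = 0.7               # hypothetical correlation coefficient
s_x, s_y = 3.0, 12.0  # hypothetical standard deviations

b = r * (s_y / s_x)        # slope of the LSRM
change_in_yhat = b * s_x   # effect of a 1-SD move in x

# Expressed in SDs of y, that change equals r:
print(change_in_yhat / s_y)  # approximately 0.7 = r
```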
Practice
The Worldwide Cost of Living Survey City Rankings determine the cost of living in the 25 most expensive cities in the world. These rankings scale New York City as 100, and express the cost of living in other cities as a percentage of the New York cost. For example, the table indicates that in Tokyo the cost of living was 65% higher than New York in 2000, but dropped to only 34% higher in 2001.