1 of 23

2.3.2

Least-Squares Regression

2 of 23

Vocabulary

Residual

The difference between the actual observed value of y for a given value of x and the predicted value of y for the same value of x.

(What actually happens - what was predicted)

Residual = Actual y - Predicted y

or

Residual = y − ŷ

Regression Line (or Least Squares Regression Line)

A line that models how a response variable y changes as the explanatory variable x changes.

Expressed in the form ŷ = a + bx

ŷ (pronounced “y-hat”) is the predicted value of the response variable for any given value of the explanatory variable (x)

Any time you see a “hat” on something, it means “predicted”

3 of 23

Vocabulary

Extrapolation

The use of a regression line for prediction outside the interval of x-values used to obtain the line. [in math - going outside the plausible domain]

The further we extrapolate, the less reliable the predictions.
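For example, if a line relating foot length and height was built from feet between 20 cm and 30 cm, using it to predict the height of someone with a 40 cm foot is extrapolation, and that prediction shouldn't be trusted.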

4 of 23

Residuals

Residual:

Actual y - Predicted y

6.8 - 5.4 = 1.4

For that point, the LSRM is underestimating the actual observed value, since the line is below the point (positive residual).

If the LSRM/LSRL line is above a point, the LSRM is overestimating the observed value (negative residual).

[Scatterplot: the actual point (5, 6.8) lies above the predicted point (5, 5.4) on the line.]

5 of 23

Residuals

The goal is to make the residual “gaps” as small as possible overall. The LSRM does this by minimizing the sum of the squared residuals; that is where “least squares” comes from. No other line gives a smaller total (see the sketch below).

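Here is a minimal sketch of that idea in Python, with made-up data. The LSRL coefficients for this data (ŷ = 1.37 + 1.33x, computed from the slope/intercept formulas) give a smaller sum of squared residuals than any other line:

```python
# Minimal sketch (made-up data): the LSRL minimizes the sum of
# squared residuals, so any other line gives a larger total.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 6.8]

def sum_sq_resid(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

print(sum_sq_resid(1.37, 1.33))  # LSRL for this data: ~3.80 (the minimum)
print(sum_sq_resid(1.0, 1.5))    # some other line: ~4.19 (larger)
```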

6 of 23

Least Squares Regression Model

Also called:

Linear Regression Model / Line

Line of Best Fit (from IM1)

3 Ways to calculate the LSRM

  1. Algebraically, using the formulas on the AP formula sheet (“puzzle” problems)

  2. Reading a statistical software printout (quite common)

  3. From raw data put into your calculator (also common)

(On the TI-83/84, these are options 4: LinReg(ax+b) and 8: LinReg(a+bx) in the STAT → CALC menu.)

7 of 23

Least Squares Regression Model

Method 1: Algebraically using Formulas

We use the formulas when we don’t have individual data, just summary statistics (means, standard deviations, and the correlation r).

8 of 23

Least Squares Regression Model

Method 1: Algebraically using Formulas

LSRM Formula: ŷ = a + bx

Slope Formula: b = r(s_y / s_x)

Y-int Formula: a = ȳ − b·x̄

where:

b = slope, a = y-intercept

x̄ = mean of all x-values, ȳ = mean of all y-values

r = correlation coefficient

s_x = st. dev. of the x-values, s_y = st. dev. of the y-values

Notice the LSRM Formula and Y-int Formula are incredibly similar: the Y-int Formula rearranges to ȳ = a + bx̄. This means the LSRM line always passes through the point (x̄, ȳ). There is usually a multiple choice question about this fact.

9 of 23

Least Squares Regression Model

Method 1: Algebraically using Formulas

Example:

A random sample of 15 high school students was selected from the U.S. Census At School database. The foot length (cm) and height (cm) of each student in the sample were recorded. The mean foot length was 24.76 cm with a standard deviation of 2.71 cm. The mean height was 171.43 cm with a standard deviation of 10.69 cm. The correlation between foot length and height is r = 0.697. Find the equation of the least-squares regression line for predicting height from foot length.
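A quick sketch of the computation in Python, using the slope and y-intercept formulas from the previous slide and the summary statistics above:

```python
# Worked solution sketch for the foot-length example,
# using b = r * (s_y / s_x) and a = y-bar - b * x-bar.
x_bar, s_x = 24.76, 2.71    # foot length (cm): mean, st. dev.
y_bar, s_y = 171.43, 10.69  # height (cm): mean, st. dev.
r = 0.697

b = r * (s_y / s_x)         # slope: ~2.75
a = y_bar - b * x_bar       # y-intercept: ~103.35

print(f"predicted height = {a:.2f} + {b:.2f}(foot length)")
```

So the LSRL is: predicted height = 103.35 + 2.75(foot length).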

10 of 23

Least Squares Regression Model

Method 2: Reading a statistical software printout (maybe most common)

Example measuring Wind Velocity (mph) vs. Electricity Production (amperes)

[Printout callouts: the row labeled with the explanatory variable’s name (the key word, here “mph”) gives b = slope; the intercept row (sometimes labeled “Intercept”) gives a = y-int.]

Amperes = 0.137 + 0.240(mph)
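For example, at a wind velocity of 10 mph the model predicts 0.137 + 0.240(10) = 2.537 amperes.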

11 of 23

Least Squares Regression Model

Method 2: Reading a statistical software printout (maybe most common)

Example File Size (kilobytes) vs. Printing Time (seconds)

Write the LSRM Equation

Predict the print time for a file size of 20 kb.

Time = 11.6559 + 3.47812(kb.)

Time = 11.6559 + 3.47812(20) = 81.2183 sec.

12 of 23

Least Squares Regression Model

Method 2: Reading a statistical software printout (maybe most common)

Correlation Coefficient (r)

Computer outputs don’t show r directly; they give you the R² value.

Square-root it to find r, and give r the same sign as the slope.
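A tiny sketch in Python. The R² value here is made up for illustration; the slope is the one from the file-size printout on the previous slide:

```python
import math

# Sketch: recovering r from a printout's R^2 value.
# r has the same sign as the slope, so attach that sign.
r_squared = 0.893   # e.g., "R-Sq = 89.3%" (made-up value)
slope = 3.47812     # slope from the same printout
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 3))  # 0.945
```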

13 of 23

Least Squares Regression Model

Method 3: From raw data put into your calculator (also common)

Graphing Calculator

  1. Turn Diagnostics On (if you have a TI-83/84).
  2. Input your two lists of data (STAT → EDIT).
    • Make note of the name of each list.
  3. Calculate the Linear Regression line:
    1. STAT → CALC
    2. 4: LinReg(ax+b) or 8: LinReg(a+bx)
    3. Xlist = explanatory variable list
    4. Ylist = response variable list
    5. Piece everything together from the values it gives you.

y = a + bx tells you the format.

a = y-intercept (or the slope, if you used option 4 instead of 8)

b = slope (or the y-intercept)

r = correlation coefficient

r² = “r-squared,” the coefficient of determination

Fruit Snacks Activity

Hand Size (cm)   Number of Candies
21.5             13
20               12
19.5             12
20               14
23.5             20
22.5             14
21               12
22               14
22.5             15
21               13
18               8
20               15
22               14
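For reference, a quick sketch in Python of what LinReg(a+bx) computes, applied to the table above (statistics.correlation requires Python 3.10+; your own class activity data will give different coefficients):

```python
from statistics import correlation, mean, stdev  # correlation: Python 3.10+

# Sketch of what the calculator's LinReg(a+bx) computes,
# using the hand-size data from the table above.
hand = [21.5, 20, 19.5, 20, 23.5, 22.5, 21, 22, 22.5, 21, 18, 20, 22]
candy = [13, 12, 12, 14, 20, 14, 12, 14, 15, 13, 8, 15, 14]

r = correlation(hand, candy)        # correlation coefficient
b = r * stdev(candy) / stdev(hand)  # slope: b = r * (s_y / s_x)
a = mean(candy) - b * mean(hand)    # y-intercept: a = y-bar - b * x-bar

# For this table: y-hat ~ -16.01 + 1.40x, r ~ 0.79
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r*r:.3f}")
```

Option 8, LinReg(a+bx), reports these same values as a, b, r, and r².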

14 of 23

Least Squares Regression Model

Method 3: From raw data put into your calculator (also common)

Important Note

  • Your graphing calculator uses the generic letters a and b. In stats, we usually prefer context words over bare variables.
  • Even though the calculator uses y = a + bx, we need to write something a little different:

Response = y-int + slope(Explanatory)

If you want to go the algebra route, it would look like:

y = 1.10 + 1.64x

But: DEFINE YOUR VARIABLES

y = predicted Response,

x = Explanatory

Failure to make clear that the LSRM gives the PREDICTED response variable will earn a P (partially correct) instead of an E (essentially correct) on the AP rubric.

15 of 23

How well does the LSRM fit the Data?

The correlation coefficient (r) tells us mainly two things:

  1. How strong or weak the linear relationship is between the explanatory and response variables
  2. How reliable predictions based on that relationship will be

But eventually we will have an LSRM line going through the data, and we will use its equation to make those predictions.

How well does the LSRM fit the Data?

If we take our correlation coefficient (r) and square it, we get R².

16 of 23

How well does the LSRM fit the Data?

R2 tells us two things:

  1. Tells us how well our LSRM fits the data
  2. Tells us what percent of the variation in response variables is accounted for by the data.
    1. Variation being the typical distance each y-value is from the regression line, not the mean.
    2. If you were to draw a horizontal line across your scatterplot at the mean y-value, the “residuals”, or gaps between each point and this horizontal line would be, on average, larger than the gaps between the points and the LSRM line.
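A minimal sketch of that comparison in Python, reusing the made-up data and LSRL (ŷ = 1.37 + 1.33x) from the earlier residuals sketch:

```python
from statistics import mean

# Compare squared gaps around a horizontal line at y-bar (SST)
# with squared gaps around the LSRL (SSE). Made-up data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 6.8]
a, b = 1.37, 1.33  # LSRL for this data

y_bar = mean(ys)
sst = sum((y - y_bar) ** 2 for y in ys)                    # gaps around the mean line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # gaps around the LSRL
print(f"R^2 = {1 - sse / sst:.2f}")                        # ~0.82 of variation explained
```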

17 of 23

How well does the LSRM fit the Data?

R²

  • Often called the Coefficient of Determination
  • Always a value between 0% and 100%, and always written and described as a PERCENT
  • It gives the percentage of variability that is accounted for by the LSRM.
    • The remaining percent (100% − R²) is the variability left to the residuals. Since residuals are errors, we want this value to be low.
  • R² essentially gives us the overall strength of our regression line model.
  • The closer R² is to 100%, the more useful our LSRM is.

The official interpretation of R²:

“R² % of the variability in the Response Variable can be explained by the approximate linear relationship with the Explanatory Variable.”

Knowing how to interpret R² is really all you need to do with R².

18 of 23

Interpretations

Unit 2 is all about interpreting numbers. Expect a lot of it on the Test.

These are the most common ones you will come across.

r - There is a Strong/Weak, Positive/Negative relationship between the Explanatory and Response Variables (Use Context).

Slope - “For every 1-unit increase in the explanatory variable, our model predicts an average increase of [slope value] units in the response variable.”

y-Intercept - “At an explanatory variable value of 0 units, our model predicts a response variable value of [y-intercept value] units.”

(This sentence might seem like nonsense, but that’s ok. Write it anyway.)

19 of 23

Interpretations

Unit 2 is all about interpreting numbers. Expect a lot of it on the Test.

These are the most common ones you will come across.

R²

“R² % of the variation in the Response variable can be explained by the approximate linear relationship with the Explanatory variable.”

20 of 23

Interpretations

Unit 2 is all about interpreting numbers. Expect a lot of it on the Test.

These are the most common ones you will come across.

Standard Deviation of Residuals (s) - “s is the typical distance between the actual response values and the values predicted by the LSRM for a given explanatory value.”

Z-Scores and Standard Deviations (It sneaks into the test every now and then).

Remember, z-scores just count how many standard deviations a value is from the mean.

“For every 1 SD you move over on the x-axis, you move r SDs on the y-axis.”
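(This comes straight from the slope formula: b = r(s_y / s_x). If x increases by one s_x, the predicted y changes by b · s_x = r · s_y, which is exactly r standard deviations of y.)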

21 of 23

Practice

The worldwide Cost of Living Survey City Rankings determine the cost of living in the 25 most expensive cities in the world. These rankings scale New York City as 100 and express the cost of living in other cities as a percentage of the New York cost. For example, the table indicates that in Tokyo the cost of living was 65% higher than in New York in 2000 but dropped to only 34% higher in 2001.

22 of 23

Practice

23 of 23

Practice