1 of 29

2.7 Regression Example

A tale of regression

2 of 29

Linear Regression

Linear regression is a statistical technique for modeling the linear relationship between a quantitative dependent variable y and one or more quantitative explanatory (independent) variables x.

3 of 29

Part 1: Modeling Body Fat Percentage

Hydrostatic Weighing: uses Archimedes' principle, which states that a body submerged in water experiences a buoyant counterforce equal to the weight of the water it displaces.

Pros

  • Gold Standard
  • +/- 1.5% error
  • Repeat measures usually prove consistent

Cons

  • Requires a lot of equipment & space
  • Time consuming
  • Requires training
  • Being under water may be difficult and anxiety provoking

4 of 29

Other Methods

Measure parts of the body and use statistics!

JMP→Sample Data Library→Body Fat.jmp

Can we use someone’s abdomen circumference to predict body fat percentage?

5 of 29

Data Background

  • Measurements from 252 men, collected around 1985
  • The estimates are based on a combination of underwater weighing and body circumference measurements

Variables

Importance: Accurate measurement of body fat is inconvenient and costly. Can we create a linear model using data that is more easily gathered?

6 of 29

Descriptive Statistics: Body Fat %

Shape: Normally distributed

Center: 19%

Spread: range of 47.5 percentage points (0% to 47.5%)

Oddities: One 40-year-old had a body fat percentage of 0!

One 51-year-old had 47.5% body fat and was an outlier.

7 of 29

Descriptive Statistics: Abdomen Circumference

Shape: Normally distributed

Center: ~91 cm

Spread: range of 78.7 cm

Oddities: Two men had abdomen circumferences that were outliers in this data set, with measurements of 148.1 cm and 126.2 cm. However, their body fat percentages were not outliers.

8 of 29

Abdomen Circumference vs. Body Fat %

Shape: Linear

Strength: Moderately Strong

Direction: Positive

Oddities: The person with an abdomen circumference of about 150 cm and a measured body fat of 35% might be an influential point.*

*This point was removed and the model was run again. Removing it did not change the model parameters significantly.

Correlation Coefficient: r = 0.813 indicates that there is a strong positive linear relationship between body fat % and abdomen circumference.
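As a rough illustration of where a value like r comes from, the sketch below computes the Pearson correlation coefficient by hand. The six data pairs are hypothetical values chosen for the example, not rows from Body Fat.jmp, and the function name pearson_r is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements, NOT taken from the Body Fat data set.
abdomen_cm = [80.0, 85.0, 90.0, 95.0, 100.0, 105.0]
body_fat_pct = [12.0, 17.0, 16.0, 24.0, 22.0, 30.0]

r = pearson_r(abdomen_cm, body_fat_pct)
print(f"r = {r:.3f}")
```

A value close to +1 points to a strong positive linear association, but r alone says nothing about whether a straight line is the right shape for the data.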

9 of 29

Prediction Equation

On average, every 10-centimeter increase in abdomen circumference is associated with a 6.3 percentage point increase in predicted body fat percentage.
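The slope interpretation is just multiplication, as the short sketch below shows. The slope of 0.63 percentage points per centimeter is inferred from the "6.3 per 10 cm" statement on this slide. The intercept is not used because it cancels when two predictions are subtracted. The function name is hypothetical.

```python
# Slope implied by the slide: 6.3 percentage points per 10 cm (an inference,
# not a value read from the fitted model output).
SLOPE_PCT_POINTS_PER_CM = 0.63

def predicted_change(delta_cm, slope=SLOPE_PCT_POINTS_PER_CM):
    """Change in predicted body fat % for a change of delta_cm in abdomen size."""
    return slope * delta_cm

print(f"{predicted_change(10):.1f} percentage points per 10 cm")
print(f"{predicted_change(1):.2f} percentage points per 1 cm")
```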

10 of 29

Nondeterministic Language is a Must!

Examples of nondeterministic language include

  • Predicted body fat %
  • Expected body fat %
  • Estimated body fat %
  • Typical body fat %
  • Average body fat %
  • Body fat %, on average
  • Our model predicts
  • and so on

However, “about,” “approximately,” and “according to the model” do not satisfy grading requirements.

11 of 29

Residual Plots

  • Should look like the most boring scatterplot you have ever encountered.
  • There should be no shape, the strength should be weak (at best), and there should be no direction.
  • You want the plots to look like someone sneezed out the data points.
  • If they do not, either a linear model is not a good fit or there could be other predictor variables that need to be included in the analysis.
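The residuals behind such a plot can be sketched as follows: fit the least-squares line, then subtract each predicted value from the observed one. The data are hypothetical, and fit_line and residuals are names made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(xs, ys):
    """Observed minus predicted value for every observation."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data, NOT taken from the Body Fat data set.
xs = [70, 80, 90, 100, 110, 120]
ys = [10.5, 15.0, 23.0, 27.5, 36.0, 40.0]

res = residuals(xs, ys)
for x, e in zip(xs, res):
    print(f"x = {x:3d}  residual = {e:+.2f}")
```

Plotting these residuals against x (or against the predicted values) gives the residual plot. Least-squares residuals always sum to zero, so the thing to look for is pattern, not the average level.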

12 of 29

A PERFECT RESIDUAL PLOT.

13 of 29

Residual Scatterplot

  • Randomly scattered fairly equally above and below the x-axis
  • There is no shape, no curves, and in general no pattern
  • No interesting features (like direction or shape)
  • It stretches horizontally, with about the same amount of scatter throughout
  • There are no bends (it should also have no outliers…)

14 of 29

Shape and Spread of the Residuals

The distribution of the residuals is approximately normal, and 95% of the residuals fall within +/- 2 standard deviations of the mean. Therefore, we can continue with a linear model.
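One way to check the "within 2 standard deviations" statement is to count directly, as in the sketch below. The residual values are hypothetical, and the function name is our own. A share near 95% is consistent with a roughly normal shape but does not prove it, so the histogram should still be inspected.

```python
import statistics

def share_within_two_sd(residuals):
    """Fraction of residuals within 2 sample standard deviations of their mean."""
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    inside = sum(1 for e in residuals if abs(e - mean) <= 2 * sd)
    return inside / len(residuals)

# Hypothetical residuals (they sum to zero, as least-squares residuals do).
example_residuals = [-3, -2, -2, -2, -1, -1, -1, -1, -1, 0,
                     0, 0, 0, 0, 0, 1, 1, 1, 1, 10]

share = share_within_two_sd(example_residuals)
print(f"{share:.0%} of residuals fall within 2 SD of the mean")
```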


15 of 29

r² (sometimes called the “Coefficient of Determination”)

The r² value of 0.68 tells us that 68% of the variability in men's body fat percentage can be explained by the linear association with abdomen circumference.

Factors that might explain the remaining 32% of the variation include height, amount of exercise, genetics, or other unmeasured variables.
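The "percent of variability explained" reading comes from comparing the leftover variation around the line (SSE) with the total variation around the mean (SST). The sketch below computes r² as 1 - SSE/SST on hypothetical data. The function name r_squared is our own.

```python
def r_squared(xs, ys):
    """r^2 of the least-squares line, computed as 1 - SSE/SST."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: variation left over after fitting the line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # SST: total variation of y around its mean.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical measurements, NOT taken from the Body Fat data set.
abdomen_cm = [80.0, 85.0, 90.0, 95.0, 100.0, 105.0]
body_fat_pct = [12.0, 17.0, 16.0, 24.0, 22.0, 30.0]

r2 = r_squared(abdomen_cm, body_fat_pct)
print(f"r^2 = {r2:.2f}, so about {r2:.0%} of the variation is explained")
```

For a model with a single predictor, this value equals the square of the correlation coefficient r.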

16 of 29

Conclusions

Abdomen circumference is a fairly good indicator of body fat percentage. For every one-centimeter increase in abdomen circumference, predicted body fat percentage increases by a little over half a percentage point, on average.

Additionally, this method makes body fat % easy to estimate.

However, it is not 100% accurate, and an individual's actual body fat percentage can differ substantially from the model's prediction.

Users of this method should proceed with caution.

17 of 29

Future Studies

Recommendations

  • Test for other highly correlated variables to use in the model that are also easy to measure
  • Do a similar analysis for women
  • Redo this analysis today to compare with the results of the 1985 data
  • Include people of different ethnic backgrounds for comparison
  • Research other methods to calculate body fat % besides body measurements and hydrostatic measurements

18 of 29

Part 2: Violating conditions and assumptions

a.k.a. What not to do in your presentations

19 of 29

Something seems off...

Violates the Influential Point Condition

20 of 29

The numbers look good!

r = 0.754

Y = 0.7673985 + 0.0014444*X

r² = 0.57

Looks good, right?!

21 of 29

Violates the “Straight Enough” Condition
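To see how good-looking numbers can hide a bent relationship, the sketch below fits a line to a clearly curved toy data set (y = x squared). This is a hypothetical example, not the data behind the slide. The correlation is high, yet the residuals trace a U shape.

```python
import math

def fit_line_with_r(xs, ys):
    """Least-squares intercept, slope, and correlation r for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy / math.sqrt(sxx * syy)

# Toy curved data: y = x^2 (hypothetical, chosen to bend on purpose).
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

intercept, slope, r = fit_line_with_r(xs, ys)
curve_residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"r = {r:.3f}")
print("residuals:", [round(e, 2) for e in curve_residuals])
```

The residuals are positive at both ends and negative in the middle. That systematic curve is what a residual plot reveals and what r by itself cannot.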

22 of 29

23 of 29

Also Violates the Equal Variance Condition

24 of 29

Errors Increase (or Decrease) Systematically

Violates the “Does the Plot Thicken?” Condition
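A crude numeric companion to eyeballing the plot is to compare the spread of the residuals at low x with the spread at high x. The sketch below does that with hypothetical residuals. The function name and the half-and-half split are our own heuristic, not a formal test, and no particular cutoff for the ratio is implied.

```python
import statistics

def spread_ratio(xs, residuals):
    """Ratio of residual SD in the upper half of x to the lower half of x."""
    # Order the residuals by their x value, then split in the middle.
    ordered = [e for _, e in sorted(zip(xs, residuals))]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.stdev(upper) / statistics.stdev(lower)

# Hypothetical residuals that fan out as x grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
fan_residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]

ratio = spread_ratio(xs, fan_residuals)
print(f"upper-half SD is {ratio:.1f} times the lower-half SD")
```

A ratio far from 1 in either direction suggests the scatter thickens or thins across the plot, which is worth confirming visually.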

25 of 29

Influential Points

  • Proceed with caution in the presence of influential points.
  • Run your linear model with and without the point(s). If the parameters change significantly, you may have a case for removing the point(s).
  • It is NOT acceptable to drop an observation just because it is an influential point. Influential points can be legitimate observations and are sometimes the most interesting ones. Investigate the nature of the point before deciding.
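The "with and without" comparison described above can be sketched as follows. The data are hypothetical: five points on a straight line plus one point far to the right. The helper fit_line is a name made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: the last point sits far from the rest in x.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 10]

intercept_all, slope_all = fit_line(xs, ys)
intercept_wo, slope_wo = fit_line(xs[:-1], ys[:-1])

print(f"with the point:    slope = {slope_all:.3f}")
print(f"without the point: slope = {slope_wo:.3f}")
```

Here the slope changes dramatically, so the point is influential. Whether to keep it still depends on investigating what the observation is, not on the size of the change alone.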

26 of 29

Reality Check: Is the Regression Reasonable?

  • Don’t fit a straight line to a nonlinear relationship.
  • Beware of extraordinary points.
  • Don’t extrapolate beyond the data.
  • Don’t infer that x causes y just because there is a good linear model for their relationship.
  • Don’t choose a model based on r² alone.
  • Examine the residual plot. It should look boring.
  • If the results are surprising, then either you’ve learned something new about the world or your analysis is wrong.
  • Always be skeptical and ask yourself if the answer is reasonable.
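The "don't extrapolate" rule can be built into a prediction helper, as in the sketch below. The intercept, slope, and observed x range are hypothetical toy values, and the function name is our own.

```python
def predict_within_range(x, intercept, slope, x_min, x_max):
    """Predict y from the fitted line, refusing to extrapolate beyond the data."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]"
        )
    return intercept + slope * x

# Hypothetical model and data range, for illustration only.
INTERCEPT, SLOPE = 2.0, 0.5
X_MIN, X_MAX = 10.0, 50.0

print(predict_within_range(20.0, INTERCEPT, SLOPE, X_MIN, X_MAX))

try:
    predict_within_range(60.0, INTERCEPT, SLOPE, X_MIN, X_MAX)
except ValueError as err:
    print("refused:", err)
```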

27 of 29

Really Helpful Resources

28 of 29

Linear Regression Cheat Sheets

29 of 29