2.7 Regression Example
A tale of regression
Linear Regression
Linear regression is a statistical technique for modeling the linear relationship between a quantitative dependent variable y and one or more quantitative explanatory (independent) variables, denoted x.
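A minimal sketch of how a least-squares line can be fit for one explanatory variable. The function name `fit_line` and the data values are made up for illustration; they do not come from Body Fat.jmp.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} * x")
```

JMP reports the same two quantities (intercept and slope) when a line is fit in Fit Y by X.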
Part 1: Modeling Body Fat Percentage
Hydrostatic Weighing: uses Archimedes' principle, which states that when a body is submerged in water, it experiences a buoyant force equal to the weight of the water it displaces.
Pros | Cons
Other Methods
Measure parts of the body and use statistics!
JMP→Sample Data Library→Body Fat.jmp
Can we use someone’s abdomen circumference to predict body fat percentage?
Data Background
Variables
Importance: Accurate measurement of body fat is inconvenient and costly. Can we create a linear model using data that is more easily gathered?
Descriptive Statistics: Body Fat %
Shape: Approximately normally distributed
Center: 19%
Spread: Range of 47.5 percentage points (0% to 47.5%)
Oddities: There was one 40-year-old whose body fat percentage was 0!
There was a 51-year-old with 47.5% body fat who was an outlier.
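A rough sketch of how center, spread, and outliers might be checked outside JMP. The ten values below are made up to resemble the description above (they are not the actual Body Fat.jmp data), and the 1.5 × IQR fence is one common outlier rule, assumed here for illustration.

```python
import statistics

# Hypothetical body fat percentages, sorted
body_fat = [0.0, 12.0, 15.0, 18.0, 19.0, 20.0, 22.0, 25.0, 28.0, 47.5]

center = statistics.median(body_fat)
data_range = max(body_fat) - min(body_fat)

# Quartiles and the 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(body_fat, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [v for v in body_fat if v < low_fence or v > high_fence]
print("median:", center, "range:", data_range, "outliers:", outliers)
```

With these made-up values only 47.5 falls outside the fences; the 0 is still worth flagging as an oddity because a body fat percentage of 0 is not physically plausible.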
Descriptive Statistics: Abdomen Circumference
Shape: Approximately normally distributed
Center: ~91 cm
Spread: Range of 78.7 cm
Oddities: Two men had abdomen circumferences that were outliers in this data set, with measurements of 148.1 cm and 126.2 cm; however, their body fat percentages were not outliers.
Abdomen Circumference vs. Body Fat %
Shape: Linear
Strength: Moderately Strong
Direction: Positive
Oddities: The person who has an abdomen circumference of 150 cm with body fat measured at 35% might be an influential point.*
*This point was removed and the model was run again. Removing it did not change the model parameters significantly.
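The check described here (fit, remove the suspect point, fit again, compare) can be sketched as follows. The five points are made up, and `slope_intercept` is a hypothetical helper. In this toy data the suspect point does change the slope noticeably, which is what an influential point looks like.

```python
def slope_intercept(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (abdomen cm, body fat %) pairs; the last one is the suspect point
points = [(80, 14), (90, 20), (100, 26), (110, 32), (150, 35)]
suspect = (150, 35)

slope_all, intercept_all = slope_intercept(points)
slope_without, intercept_without = slope_intercept(
    [p for p in points if p != suspect]
)

print(f"slope with point:    {slope_all:.3f}")
print(f"slope without point: {slope_without:.3f}")
```

If the two slopes (and intercepts) are close, the point is unusual but not influential.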
Correlation Coefficient: r = 0.813 indicates that there is a strong positive linear relationship between body fat % and abdomen circumference.
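For reference, a small sketch of how the correlation coefficient r is computed from paired data. The function name `pearson_r` and the data are illustrative assumptions, not the Body Fat.jmp values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

r is always between -1 and 1; the sign gives the direction and the magnitude gives the strength of the linear association.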
Prediction Equation
On average, for every 10-centimeter increase in abdomen circumference, body fat percentage is predicted to increase by 6.3 percentage points.
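The arithmetic behind this interpretation, assuming a fitted slope of roughly 0.63 percentage points per centimeter (the value implied by the 6.3 figure; the symbol b_1 for the slope is notation introduced here):

```latex
\Delta\hat{y} = b_1 \,\Delta x \approx 0.63 \times 10 = 6.3 \text{ percentage points}
```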
Nondeterministic Language is a Must!
Examples of nondeterministic language include
However, “about,” “approximately,” and “according to the model” do not satisfy grading requirements.
Residual Plots
A PERFECT RESIDUAL PLOT.
Residual Scatterplot
Shape and Spread of the Residuals
The distribution of the residuals is approximately normal, and 95% of the residuals fall within ±2 standard deviations of the mean, so we can continue with a linear model.
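A quick sketch of the "within 2 standard deviations" check. The residuals are made up, and `fraction_within_2sd` is a hypothetical helper name.

```python
import statistics

def fraction_within_2sd(residuals):
    """Fraction of residuals within 2 sample standard deviations of their mean."""
    mean = statistics.mean(residuals)
    sd = statistics.stdev(residuals)
    inside = [e for e in residuals if abs(e - mean) <= 2 * sd]
    return len(inside) / len(residuals)

# Hypothetical residuals (observed minus predicted)
residuals = [-1.0, 0.5, 0.2, -0.3, 0.6, -0.4, 0.1, 0.3, -0.2, 0.2]

frac = fraction_within_2sd(residuals)
print(f"{frac:.0%} of residuals are within 2 SD")
```

With only ten made-up residuals the fraction is coarse; on a data set the size of Body Fat.jmp, a value near 95% is consistent with roughly normal residuals.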
r² (Sometimes called the “Coefficient of Determination”)
The r² value of 0.68 tells us that 68% of the variability in men's body fat percentage can be explained by the linear association with abdomen circumference.
Factors that might explain the remaining 32% of the variation include height, amount of exercise, genetics, or other unmeasured variables.
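To make "variability explained" concrete, here is a sketch that computes r² as 1 − SSE/SST for a least-squares line. The data and the function name `r_squared` are illustrative assumptions.

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: variation left over after the line; SST: total variation in y
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Hypothetical illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(x, y)
print(f"r-squared = {r2:.4f}")
```

The leftover share, 1 − r², is the part of the variation the line does not account for.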
Conclusions
Abdomen circumference is a fairly good indicator of body fat percentage. On average, for every one-centimeter increase in abdomen circumference, body fat percentage is predicted to increase by a little over half a percentage point.
Additionally, this method makes body fat % easy to estimate.
However, it is not 100% accurate, and real-life values may vary considerably from the model's predictions.
Users of this method should proceed with caution.
Future Studies
Recommendations
Part 2: Violating conditions and assumptions
a.k.a. What not to do in your presentations
Something seems off...
Violates the Influential Point Condition
The numbers look good!
r = 0.754
Y = 0.7673985 + 0.0014444*X
r² = 0.57
Looks good, right?!
Violates the “Straight Enough” Condition
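A sketch of why a high r alone does not establish straightness. The data are deliberately curved (y = x², made up for illustration), yet r is still above 0.95. The residuals give the curve away: positive at both ends and negative in the middle.

```python
import math

# Deliberately curved, made-up data: y = x squared
x = list(range(11))
y = [xi ** 2 for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)

# Residuals from the straight-line fit
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"r = {r:.3f}")
print("residuals:", [round(e, 1) for e in residuals])
```

A U-shaped (or arched) residual plot like this is the usual sign that a straight line is the wrong model.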
Also Violates the EQUAL VARIANCE Condition
Errors Increase (or Decrease) Systematically
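One rough, informal way to spot residuals that fan out is to compare their spread for small x against their spread for large x. This is a heuristic sketch rather than a formal test; the residuals and the helper name `spread_ratio` are made up.

```python
import statistics

def spread_ratio(x, residuals):
    """SD of residuals in the upper half of x divided by SD in the lower half."""
    ordered = [e for _, e in sorted(zip(x, residuals))]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.stdev(upper) / statistics.stdev(lower)

# Hypothetical residuals whose size grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
residuals = [0.1, -0.1, 0.2, -0.2, 0.3, -0.6, 0.8, -1.0, 1.2, -1.5]

ratio = spread_ratio(x, residuals)
print(f"upper-half SD / lower-half SD = {ratio:.1f}")
```

A ratio near 1 is consistent with equal variance; a ratio far from 1 in either direction suggests the errors increase or decrease systematically.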
Violates the “Does the Plot Thicken?” Condition
Influential Points
Reality Check: Is the Regression Reasonable?
Really Helpful Resources
Linear Regression Cheat Sheets