Section 8a: Simple Linear Regression & Correlation
One continuous predictor (X), one continuous outcome (Y)
Ex: Riddle, J. of Perinatology (2006) 26, 556–561
50th percentile for birth weight (BW) in g
as a function of gestational age
Birth Wt (g) = 42 exp(0.1155 × gest age)
or
logₑ(BW) = 3.74 + 0.1155 × gest age
In general: BW = A exp(B × gest age), where A and B change for different percentiles
log(BW) = C + B × gest age, where C = log(A)
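A minimal sketch in Python (assuming the Riddle coefficients quoted above) showing that the exponential and log forms are the same model, since logₑ(42) ≈ 3.74:

```python
import numpy as np

# 50th-percentile birth weight (g) as a function of gestational age (weeks),
# using the coefficients quoted above from Riddle (2006).
def bw_50th(gest_age_wk):
    return 42 * np.exp(0.1155 * gest_age_wk)

print(f"Predicted median BW at 38 weeks: {bw_50th(38):.0f} g")
# The two forms agree on the log scale because log_e(42) = 3.74 (to 2 decimals):
print(np.log(bw_50th(38)), 3.74 + 0.1155 * 38)
```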
Example: Nishio et al., Cardiovascular Revascularization Medicine 7 (2006) 54–60
Simple Linear Regression statistics
Statistics for the association between a continuous X and a continuous Y.
A linear relation is given by an equation
Y = a + bX + error (error = e = Y - Ŷ)
Ŷ = predicted Y = a + bX
a = intercept; b = slope = rate of change
r = correlation coefficient; R² = r²
R² = proportion of Y's variation due to X
SDe = residual SD = RMSE = √(mean square error)
Ex: X = age (yrs) vs Y = SBP (mm Hg)
SBP = 81.5 + 1.22 × age + error
SDe = 18.6 mm Hg, r = 0.718, R² = 0.515
“Residual” error
residual error = e = Y – Ŷ
The “residual error” is the difference between the actual observed value Y and the model (equation) predicted value, Ŷ.
The sum and mean of the eᵢ's will always be zero. The residual error standard deviation, SDe, is a measure of how close the observed Y values are to their equation-predicted values (Ŷ). When r = R² = 1, SDe = 0.
Age vs SBP in women: predicted SBP (mm Hg) = 81.5 + 1.22 × age, r = 0.72, R² = 0.515
| Patient | X | Y | Ŷ | e = Y - Ŷ | Patient | X | Y | Ŷ | e = Y - Ŷ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 22 | 131 | 108.41 | 22.59 | 17 | 49 | 133 | 141.41 | -8.41 |
| 2 | 23 | 128 | 109.63 | 18.37 | 18 | 49 | 128 | 141.41 | -13.41 |
| 3 | 24 | 116 | 110.85 | 5.15 | 19 | 50 | 183 | 142.64 | 40.36 |
| 4 | 27 | 106 | 114.52 | -8.52 | 20 | 51 | 130 | 143.86 | -13.86 |
| 5 | 28 | 114 | 115.74 | -1.74 | 21 | 51 | 133 | 143.86 | -10.86 |
| 6 | 29 | 123 | 116.97 | 6.03 | 22 | 51 | 144 | 143.86 | 0.14 |
| 7 | 30 | 117 | 118.19 | -1.19 | 23 | 52 | 128 | 145.08 | -17.08 |
| 8 | 32 | 122 | 120.63 | 1.37 | 24 | 54 | 105 | 147.53 | -42.53 |
| 9 | 33 | 99 | 121.86 | -22.86 | 25 | 56 | 145 | 149.97 | -4.97 |
| 10 | 35 | 121 | 124.30 | -3.30 | 26 | 57 | 141 | 151.19 | -10.19 |
| 11 | 40 | 147 | 130.41 | 16.59 | 27 | 58 | 153 | 152.42 | 0.58 |
| 12 | 41 | 139 | 131.64 | 7.36 | 28 | 59 | 157 | 153.64 | 3.36 |
| 13 | 41 | 171 | 131.64 | 39.36 | 29 | 63 | 155 | 158.53 | -3.53 |
| 14 | 46 | 137 | 137.75 | -0.75 | 30 | 67 | 176 | 163.42 | 12.58 |
| 15 | 47 | 111 | 138.97 | -27.97 | 31 | 71 | 172 | 168.31 | 3.69 |
| 16 | 48 | 115 | 140.19 | -25.19 | 32 | 77 | 178 | 175.64 | 2.36 |
| | | | | | 33 | 81 | 217 | 180.53 | 36.47 |
| | X | Y | Ŷ | e |
|---|---|---|---|---|
| Mean | 46.7 | 138.6 | 138.6 | 0.0 |
| SD | 15.5 | 26.4 | 18.9 | 18.6 |
Mean error is always zero
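As a sketch, the fit can be reproduced in Python from the table's data (numbers should match the slide's a = 81.5, b = 1.22, r = 0.72, R² = 0.515, SDe = 18.6 up to rounding):

```python
import numpy as np

# Age (X) and SBP (Y) for the 33 women in the table above.
age = np.array([22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40, 41, 41, 46, 47, 48,
                49, 49, 50, 51, 51, 51, 52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81])
sbp = np.array([131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147, 139, 171,
                137, 111, 115, 133, 128, 183, 130, 133, 144, 128, 105, 145, 141,
                153, 157, 155, 176, 172, 178, 217])

b, a = np.polyfit(age, sbp, 1)       # least-squares slope and intercept
e = sbp - (a + b * age)              # residual errors
r = np.corrcoef(age, sbp)[0, 1]
print(f"a = {a:.1f}, b = {b:.2f}, r = {r:.3f}, R2 = {r**2:.3f}")
# ddof=1 matches the table's SD column; the regression RMSE (ddof=2) is slightly larger.
print(f"mean error = {e.mean():.2f} (always 0), SDe = {e.std(ddof=1):.1f}")
```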
Confidence intervals (CI) & Prediction intervals (PI)
Model: predicted SBP = Ŷ = 81.5 + 1.22 × age
For age = 50, Ŷ = 81.5 + 1.22(50) = 142.6 mm Hg
95% CI: Ŷ ± 2 SEM; 95% PI: Ŷ ± 2 SDe
SEM = 3.3 mm Hg → 95% CI is (136.0, 149.2)
SDe = 18.6 mm Hg → 95% PI is (104.8, 180.4)
Ŷ = 142.6 is both the predicted mean SBP for everyone aged 50 and the predicted value for one individual aged 50; both interpretations are correct.
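A sketch of this arithmetic, using the slide's summary numbers (n = 33, mean age = 46.7, SD of age = 15.5) and the standard simple-regression formula for the SE of a predicted mean:

```python
import numpy as np

n, xbar, sdx, sde = 33, 46.7, 15.5, 18.6   # summary numbers from the example above
x0 = 50
yhat = 81.5 + 1.22 * x0
# SE of the predicted mean at x0 (standard simple-regression formula):
sem = sde * np.sqrt(1 / n + (x0 - xbar) ** 2 / ((n - 1) * sdx ** 2))
print(f"Yhat = {yhat:.1f}, SEM = {sem:.1f}")                      # ~142.5, ~3.3
print(f"95% CI: ({yhat - 2 * sem:.1f}, {yhat + 2 * sem:.1f})")
print(f"95% PI: ({yhat - 2 * sde:.1f}, {yhat + 2 * sde:.1f})")    # the slide's approximation
# (The exact PI uses sqrt(sde**2 + sem**2) in place of sde, a minor difference here.)
```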
Model fit - R square statistic
Y = Ŷ + e
The variation in Y has two sources:
Variation of Ŷ – this is accounted for by X
Variation of e – this is NOT accounted for by X
total Y variation = Ŷ variation + e variation
R² = Ŷ variation / total variation
100 × R² is the percent of the total variation "accounted for" by X (or by the Xs in multiple regression).
For SBP vs age, R² = 0.515 = 51.5%
R squared statistic: a measure of model accuracy
SDy² = "all" the variation in Y
SDe² = the variation NOT accounted for by X (more generally, not accounted for by the Xs)
SDy² - SDe² = the variation that IS accounted for by X (or all the Xs)
R² = (SDy² - SDe²)/SDy² = proportion of total variation "accounted for" by the Xs
SDe² = (1 - R²) SDy²
R² definition & interpretation
R² is the proportion of the total (squared) variation in Y that is "accounted for" by X.
R² = r² = (SDy² - SDe²)/SDy² = 1 - (SDe²/SDy²)
SDe = SDy √(1 - r²)
Under Gaussian theory, 95% of the errors are within ±2 SDe of their corresponding predicted Y value, Ŷ.
Y = Ŷ + e, so Var(Y) = Var(Ŷ + e) = Var(Ŷ) + Var(e) (Ŷ and e are uncorrelated in least squares)
Var(Ŷ) = Var(Y) - Var(e) = SDy² - SDe²
R² = Var(Ŷ)/Var(Y) = (SDy² - SDe²)/SDy²
How big should R² be?
SBP SD = SDy = 26.4 mm Hg, SDe = 18.6 mm Hg
95% PI: Ŷ ± 2(18.6), i.e. Ŷ ± 37.2 mm Hg
How big does R² have to be to make the 95% PI Ŷ ± 10 mm Hg? That requires SDe ≈ 5 mm Hg.
R² = 1 - (SDe/SDy)² = 1 - (5/26.4)² = 1 - 0.036 = 0.964, or 96.4%
(with age only, R² = 0.515)
Correlation coefficient (r)
Instead of regressing Y on X, create
Ys = (Y - mean Y)/SDy = "standardized Y"
Xs = (X - mean X)/SDx = "standardized X"
Ys and Xs are both in SD units and both have mean 0.
If we regress Ys on Xs: Ys = a + b Xs + error,
then a is always 0, and b is called "r", the correlation coefficient: the slope in SD units.
r ranges from -1 to 1. r stays the same if the roles of Ys and Xs are reversed. A value of r = 0 implies no (linear) association between X and Y.
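A small sketch (with made-up data resembling the SBP example) verifying that the regression of standardized Y on standardized X has intercept 0 and slope equal to Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(50, 15, 200)                    # hypothetical ages
y = 80 + 1.2 * x + rng.normal(0, 18, 200)      # hypothetical SBPs

xs = (x - x.mean()) / x.std(ddof=1)            # standardized X
ys = (y - y.mean()) / y.std(ddof=1)            # standardized Y
b, a = np.polyfit(xs, ys, 1)
print(f"intercept = {a:.3f} (always 0), slope = {b:.3f}")
print(f"Pearson r = {np.corrcoef(x, y)[0, 1]:.3f}")   # equals the slope above
```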
Correlation interpretation, |r| < 1
Slope is related to correlation (simple regression)
slope = correlation × (SDy/SDx)
b = r (SDy/SDx); e.g. 1.22 = 0.7178 × (26.4/15.5)
where SDy is the SD of the Y variable
and SDx is the SD of the X variable
r = b (SDx/SDy); e.g. 0.7178 = 1.22 × (15.5/26.4)
r = b SDx / √(b² SDx² + SDe²)
where SDe = residual error SD, SDx = SD of X
When b = 0, r = 0; when b > 0, r > 0; when b < 0, r < 0.
Computing slope & correlation
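This computation can be sketched with the textbook sums-of-squares formulas (a minimal Python version, consistent with the relations above):

```python
import numpy as np

def slope_and_r(x, y):
    """Slope, intercept, and Pearson r from the usual sums of squares."""
    sxx = np.sum((x - x.mean()) ** 2)              # sum of (x - xbar)^2
    syy = np.sum((y - y.mean()) ** 2)              # sum of (y - ybar)^2
    sxy = np.sum((x - x.mean()) * (y - y.mean()))  # cross-product sum
    b = sxy / sxx                   # slope
    a = y.mean() - b * x.mean()     # intercept (line passes through the means)
    r = sxy / np.sqrt(sxx * syy)    # correlation
    return a, b, r
```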
Pearson vs Spearman correlation (r)
Pearson r: assumes the relationship between Y and X is linear except for noise. "Parametric" (inspired by the bivariate normal model). Strongly affected by outliers.
Spearman rs: based on the ranks of Y and X. Assumes only that the relation between Y and X is monotone (non-decreasing or non-increasing). "Nonparametric". Less affected by outliers.
Pearson vs Spearman (rank)
Original data and their ranks:

| BMI | HbA1c | rank BMI | rank HbA1c |
|---|---|---|---|
| 32.4 | 2.6 | 3 | 2 |
| 33.9 | 3.1 | 6 | 5 |
| 29.9 | 2.5 | 1 | 1 |
| 41.3 | 5.0 | 8 | 8 |
| 32.7 | 3.6 | 4 | 6 |
| 32.1 | 4.5 | 2 | 7 |
| 38.7 | 5.2 | 7 | 9 |
| 33.2 | 2.8 | 5 | 4 |
| 48.1 | 2.8 | 9 | 4 |

Pearson r = 0.24 (original data); Spearman rs = 0.48 (computed on the ranks)
Pearson r vs Spearman rs
n = 9, r = 0.24, rs = 0.48
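As a sketch, both coefficients can be computed with scipy on the data above. Note that scipy assigns midranks of 3.5 to the two tied HbA1c values of 2.8, so its rs may differ slightly from the slide's 0.48, which ranked both ties as 4:

```python
import numpy as np
from scipy import stats

bmi   = np.array([32.4, 33.9, 29.9, 41.3, 32.7, 32.1, 38.7, 33.2, 48.1])
hba1c = np.array([2.6, 3.1, 2.5, 5.0, 3.6, 4.5, 5.2, 2.8, 2.8])

r, _ = stats.pearsonr(bmi, hba1c)      # linear association on the raw data
rs, _ = stats.spearmanr(bmi, hba1c)    # Pearson r computed on the (mid)ranks
print(f"Pearson r = {r:.2f}, Spearman rs = {rs:.2f}")
```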
Limitation of linear models
Mean increase in X = 19.9
Mean increase in Y = 39.9
Do X and Y have a positive correlation?
Example: X is the increase in minutes of exercise; Y is the increase in math test scores in the same group.
Mean changes in X and Y do not necessarily imply correlation
Mean changes do not imply correlation
Just because there is a positive mean X change and a positive mean Y change does not necessarily imply that X and Y are correlated.
mean X change = 19.9, mean Y change = 39.9, yet r = 0
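A tiny simulation sketch of this point (made-up numbers): both changes have clearly positive means, yet the correlation between them is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
dx = 19.9 + rng.normal(0, 5, 100)    # exercise changes: positive mean
dy = 39.9 + rng.normal(0, 10, 100)   # score changes: positive mean, independent of dx
print(f"mean dX = {dx.mean():.1f}, mean dY = {dy.mean():.1f}")
print(f"r = {np.corrcoef(dx, dy)[0, 1]:.2f}")   # near 0
```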
Limitations of Linear Statistics: example of a nonlinear relationship
Correlations are misleading if the relation is not linear, or at least monotone.
Limits to linear regression – "pathological" behavior
Ŷ = 3 + 0.5 X, r = 0.817, residual SS = 13.75, n = 11 (for all four datasets below)
Weisberg, Applied Linear Regression, p 108
Simpson’s paradox
Final grade (Y) vs hours studying (X): r = -0.7981. A negative relationship?
Simpson’s paradox (cont.)
Controlling for type of course, the relations are now positive.
“Ecologic” fallacy
Wrong unit of analysis: the city, not the person
Ecologic fallacy: one must look at the correct unit of analysis
(figures: full data vs data with X truncated; true r = 0.9, R² = 0.81)
Interpreting correlation in experiments
Since r=b(SDx/SDy), an artificially lowered SDx will also lower r.
R², b, and SDe when X is systematically changed

| Data | R² | b | SDe |
|---|---|---|---|
| Complete data ("truth") | 0.81 | 0.90 | 0.43 |
| Truncated (X < -1 SD deleted) | 0.47 | 1.03 | 0.43 |
| Center deleted (-1 SD < X < 1 SD deleted) | 0.91 | 0.90 | 0.45 |
| Extremes deleted (X < -1 SD and X > 1 SD deleted) | 0.58 | 0.92 | 0.42 |
This assumes the intrinsic relation between X and Y is linear except for error (see the simulation sketch below).
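A simulation sketch of the table's pattern, with made-up data generated so the "truth" matches the first row (b = 0.90, SDe = 0.43): restricting X changes R² substantially while b and SDe stay nearly constant:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 5000)
y = 0.9 * x + rng.normal(0, 0.43, 5000)   # true b = 0.90, SDe = 0.43

def fit(mask):
    b, a = np.polyfit(x[mask], y[mask], 1)
    r = np.corrcoef(x[mask], y[mask])[0, 1]
    sde = np.std(y[mask] - (a + b * x[mask]), ddof=2)
    return f"R2 = {r**2:.2f}, b = {b:.2f}, SDe = {sde:.2f}"

print("complete:        ", fit(np.ones_like(x, dtype=bool)))
print("truncated:       ", fit(x >= -1))          # X < -1 SD deleted
print("center deleted:  ", fit(np.abs(x) >= 1))   # -1 SD < X < 1 SD deleted
print("extremes deleted:", fit(np.abs(x) <= 1))   # beyond +/-1 SD deleted
```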
Attenuation of regression coefficients when there is random error in X (true slope β = 4.0)
Negligible errors in X: Ŷ = 1.149 + 3.959 X, SE(b) = 0.038
Noisy errors in X: Ŷ = -2.132 + 3.487 X, SE(b) = 0.276
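A sketch simulating this attenuation (made-up noise levels, not the slide's data): measurement error in X shrinks the fitted slope toward 0 by roughly the factor Var(X)/(Var(X) + Var(error in X)):

```python
import numpy as np

rng = np.random.default_rng(3)
x_true = rng.uniform(0, 10, 1000)
y = 1.0 + 4.0 * x_true + rng.normal(0, 2, 1000)    # true slope beta = 4.0

for sd_err in (0.0, 2.0):                          # negligible vs noisy error in X
    x_obs = x_true + rng.normal(0, sd_err, 1000)   # what we actually measure
    b, a = np.polyfit(x_obs, y, 1)
    lam = x_true.var() / (x_true.var() + sd_err**2)  # expected attenuation factor
    print(f"SD of X error = {sd_err}: fitted slope = {b:.2f} (expect ~{4.0 * lam:.2f})")
```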
Attenuation (random error in X) drives the statistic toward its null value.
Example:
X = smoking (y/n), Y = lung cancer (y/n)
Compute the odds ratio (OR)
Randomly misclassify some smokers/non-smokers
What happens to the OR?
The OR gets closer to 1.0 (conservative)
Attenuation gives conservative results.
Checking for linearity – smoothing & splines
Basic idea: in a plot of Y vs X, also plot Ŷ vs X, where
Ŷᵢ = ∑ⱼ Wᵢⱼ Yⱼ, with ∑ⱼ Wᵢⱼ = 1 and Wᵢⱼ ≥ 0.
The "weights" Wᵢⱼ are larger for observations with Xⱼ near Xᵢ and smaller for observations far from Xᵢ.
Smooth: define a moving "window" of a given width around the ith data point and fit a mean (weighted moving average) within this window.
Spline: break the X axis into non-overlapping bins and fit a polynomial within each bin such that the "ends" all "match".
The size of the window or bins controls the amount of smoothing.
We smooth until we obtain a smooth curve, but go no further.
Check for linearity - Smoothing
1. Make a bin (window); take a (possibly weighted) average of all Y values in the bin, or carry out a regression in the bin. The predicted value (mean) at the middle of the bin is the smoothed value (Ŷ).
2. Move the bin over (overlapping the previous one) and repeat.
3. Connect the predicted values (Ŷs) with lines across all bins. (A sketch of this procedure appears below.)
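A minimal moving-window smoother in Python (an unweighted sketch; kernel smoothers additionally down-weight points near the window's edges):

```python
import numpy as np

def window_smooth(x, y, width):
    """Smoothed value at each x_i = mean of all y whose x lies in a window around x_i."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    yhat = np.empty_like(ys, dtype=float)
    for i, xi in enumerate(xs):
        in_window = np.abs(xs - xi) <= width / 2   # points inside the moving window
        yhat[i] = ys[in_window].mean()             # (a weighted mean for kernel smoothing)
    return xs, yhat   # plot yhat vs xs on top of the raw scatter
```

A wider `width` gives more smoothing: too narrow undersmooths (a wiggly curve), too wide oversmooths (flattens real curvature), as in the IGFBP example later.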
Smoothing to check for linearity (kernel smoothing)
(figures: the window moves across X; a smoothed value is computed at each step)
Splines to check for linearity
Spline: break the X axis into equally spaced, non-overlapping "bins" (segments). Fit a polynomial (usually a quadratic or cubic) within each bin such that the Y values at the "ends" (the "knots") all "match" (are piecewise continuous) and their first derivatives (slopes) are also continuous.
Draftsman’s spline
Splines vs smoothing (JMP, Fitting linear models)
Spline example: constant spline (same as "polychotomizing" X)
Linear and (restricted) cubic spline
Generally use restricted cubic splines with 3-5 knots to check linearity
Restricted cubic spline (RCS)
Cubic equation: Ŷ = b₀ + b₁X + b₂X² + b₃X³
Fit an equation within each non-overlapping bin (segment).
Restrictions on b₀, b₁, b₂, b₃:
Ŷ must be the same in adjacent bins at the (X) knots, and
dŶ/dX = b₁ + 2b₂X + 3b₃X²
must also be the same in adjacent bins at the knots (this makes the connection "smooth").
Checking for linearity via spline
Fit a straight line to the data (linear model).
Fit a restricted cubic spline to the same data (nonlinear model). Compare the fits (R², SDe). F test for this comparison:
F = [(RSS_linear - RSS_spline)/k] / SDe_spline²
If the straight line fits about as well as the cubic spline (the p value from F is not significant), we can say the relationship between Y and X is linear. Here k = df_spline - df_linear, and RSS = residual error sum of squares; note RSS = (residual df) × SDe². (A sketch of this check appears below.)
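A sketch of this check in Python, using Harrell's restricted-cubic-spline basis (the knot placement and simulated data here are illustrative assumptions, not the course's example):

```python
import numpy as np
from scipy import stats

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell's parameterization): linear beyond the end knots."""
    t = np.asarray(knots, float)
    k = len(t)
    cube = lambda u: np.clip(u, 0, None) ** 3      # truncated cube (u)+^3
    cols = [x]
    for j in range(k - 2):
        cols.append(cube(x - t[j])
                    - cube(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                    + cube(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
    return np.column_stack(cols)

def rss_and_df(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return resid @ resid, X1.shape[1]

rng = np.random.default_rng(0)
x = rng.uniform(18, 45, 200)                       # e.g., a BMI-like predictor
y = 10 + 0.5 * x + rng.normal(0, 2, 200)           # a truly linear relationship

knots = np.percentile(x, [5, 27.5, 50, 72.5, 95])  # 5 knots at Harrell's default quantiles
rss_lin, df_lin = rss_and_df(x[:, None], y)
rss_spl, df_spl = rss_and_df(rcs_basis(x, knots), y)

k, n = df_spl - df_lin, len(y)
F = ((rss_lin - rss_spl) / k) / (rss_spl / (n - df_spl))
p = stats.f.sf(F, k, n - df_spl)
print(f"F = {F:.2f}, p = {p:.3f}")   # non-significant p: linearity is adequate
```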
Smoothing example: IGFBP by BMI
(figures: an appropriately smoothed fit, an insufficiently smoothed (wiggly) fit, and an over-smoothed fit of IGFBP vs BMI)
ANDRO by BMI – is underlying relationship linear?
Check linearity, ANDRO by BMI:
assume linear: R² = 0.153; assume spline: R² = 0.165; p value = 0.8130 (LR test, null is linear)
(figure: the linear and spline fits overlaid)