1 of 51

Section 8a
Simple Linear Regression & Correlation
one continuous predictor (X), one continuous outcome (Y)

2 of 51

Ex: Riddle, J. of Perinatology (2006) 26, 556–561

50th percentile for birth weight (BW) in g

as a function of gestational age

Birth Wt (g) = 42 exp(0.1155 × gest age)

or

loge(BW) = 3.74 + 0.1155 × gest age

In general: BW = A exp(B × gest age), where A and B change for different percentiles

loge(BW) = C + B × gest age, where C = loge(A)
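A quick numeric check that the two forms of the model agree (a small sketch, using an illustrative gestational age; loge(42) ≈ 3.74):

```python
import math

ga = 30                                   # gestational age in weeks (illustrative value)
bw_exp = 42 * math.exp(0.1155 * ga)       # BW = A * exp(B * gest age)
bw_log = math.exp(3.74 + 0.1155 * ga)     # same model written from loge(BW) = 3.74 + 0.1155 * GA
print(bw_exp, bw_log, math.log(42))       # the two forms agree up to rounding; loge(42) ~ 3.74
```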

3 of 51

Example: Nishio et al., Cardiovascular Revascularization Medicine 7 (2006) 54–60

4 of 51

Simple Linear Regression statistics

Statistics for the association between a continuous X and a continuous Y.

A linear relation is given by an equation

Y = a + b X + error, where error = e = Y − Ŷ

Ŷ = predicted Y = a + b X

a = intercept; b = slope = rate of change

r = correlation coefficient; R2 = r2

R2 = proportion of Y's variation due to X

SDe = residual SD = RMSE = √(mean squared error)

5 of 51

Ex: X=age (yrs) vs Y=SBP (mmHg)

SBP = 81.5 + 1.22 age + error

SDe = 18.6 mm Hg, r = 0.718, R2 = 0.515

6 of 51

“Residual” error

residual error = e = Y – Ŷ

The “residual error” is the difference between the actual observed value Y and the model (equation) predicted value, Ŷ.

The sum and mean of the ei’s will always be zero. The residual error standard deviation, SDe, is a measure of how close the observed Y values are to their equation predicted values (Ŷ). When r=R2=1, SDe=0.

7 of 51

age vs SBP in women - Predicted SBP (mmHg) = 81.5 + 1.22 age, r=0.72, R2=0.515

Patient   X (age)   Y (SBP)   Predicted Ŷ   error e
1         22        131       108.41         22.59
2         23        128       109.63         18.37
3         24        116       110.85          5.15
4         27        106       114.52         -8.52
5         28        114       115.74         -1.74
6         29        123       116.97          6.03
7         30        117       118.19         -1.19
8         32        122       120.63          1.37
9         33         99       121.86        -22.86
10        35        121       124.30         -3.30
11        40        147       130.41         16.59
12        41        139       131.64          7.36
13        41        171       131.64         39.36
14        46        137       137.75         -0.75
15        47        111       138.97        -27.97
16        48        115       140.19        -25.19
17        49        133       141.41         -8.41
18        49        128       141.41        -13.41
19        50        183       142.64         40.36
20        51        130       143.86        -13.86
21        51        133       143.86        -10.86
22        51        144       143.86          0.14
23        52        128       145.08        -17.08
24        54        105       147.53        -42.53
25        56        145       149.97         -4.97
26        57        141       151.19        -10.19
27        58        153       152.42          0.58
28        59        157       153.64          3.36
29        63        155       158.53         -3.53
30        67        176       163.42         12.58
31        71        172       168.31          3.69
32        77        178       175.64          2.36
33        81        217       180.53         36.47

Mean      46.7      138.6     138.6           0.0
SD        15.5      26.4      18.9           18.6

Mean error is always zero
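Not part of the original lecture, but as a check: a minimal Python sketch (numpy assumed) that refits the line to the 33 (age, SBP) pairs above and should reproduce the quoted a, b, r, R2, and SDe up to rounding:

```python
import numpy as np

# age (X) and SBP (Y) for the 33 women in the table above
age = np.array([22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40, 41, 41, 46, 47, 48,
                49, 49, 50, 51, 51, 51, 52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81])
sbp = np.array([131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147, 139, 171,
                137, 111, 115, 133, 128, 183, 130, 133, 144, 128, 105, 145, 141,
                153, 157, 155, 176, 172, 178, 217])

b, a = np.polyfit(age, sbp, 1)           # least-squares slope and intercept
yhat = a + b * age                        # predicted values (Yhat)
e = sbp - yhat                            # residuals; their mean is always ~0

n = len(age)
sde = np.sqrt(np.sum(e**2) / (n - 2))     # residual SD (RMSE, n-2 df)
r = np.corrcoef(age, sbp)[0, 1]           # Pearson correlation

print(f"a = {a:.1f}, b = {b:.2f}, r = {r:.3f}, R2 = {r**2:.3f}, SDe = {sde:.1f}")
print(f"mean residual = {e.mean():.2e}")
```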

8 of 51

Confidence intervals (CI) & Prediction intervals (PI)

Model: predicted SBP=Ŷ=81.5 + 1.22 age

For age = 50, Ŷ = 81.5 + 1.22(50) ≈ 142.6 mm Hg (using the unrounded coefficients)

95% CI: Ŷ ± 2 SEM; 95% PI: Ŷ ± 2 SDe (approximately)

SEM = 3.3 mm Hg → 95% CI is (136.0, 149.2)

SDe = 18.6 mm Hg → 95% PI is (104.8, 180.4); the exact PI multiplier uses √(SDe2 + SEM2) ≈ 18.9 mm Hg, so the PI is slightly wider than Ŷ ± 2 SDe.

Ŷ = 142.6 is both the predicted mean SBP for all women aged 50 and the predicted value for one individual aged 50; both interpretations are correct.
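A minimal sketch of this interval arithmetic in plain Python (the ±2 multipliers are the usual large-sample approximation):

```python
yhat = 142.6             # predicted SBP (mm Hg) at age 50 from the unrounded fit
sem, sde = 3.3, 18.6     # SE of the mean prediction; residual SD

ci = (yhat - 2 * sem, yhat + 2 * sem)          # 95% CI for the mean SBP of all 50-year-olds
pi_half = 2 * (sde**2 + sem**2) ** 0.5         # PI also carries the estimation error in Yhat
pi = (yhat - pi_half, yhat + pi_half)          # 95% PI for one individual aged 50
print(ci, pi)                                  # ~ (136.0, 149.2) and ~ (104.8, 180.4)
```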

9 of 51

Model fit - R square statistic

Y = Ŷ + e

The variation in Y has two sources:

Variation of Ŷ – this is accounted for by X

Variation of e – this is NOT accounted for by X

total Y variation = Ŷ variation + e variation

R2 = R square = Ŷ variation / total Y variation.

100 × R2 is the percent of the total variation "accounted for" by X (or by the Xs in multiple regression).

For SBP vs age, R2 = 0.515 = 51.5%.

10 of 51

R squared statistic: a measure of model accuracy

SDy2 = "all" the variation in Y

SDe2 = the variation NOT accounted for by X (more generally, not accounted for by the Xs)

SDy2 − SDe2 = the variation that IS accounted for by X (or by all the Xs)

R2 = (SDy2 − SDe2) / SDy2 = proportion of the total variation "accounted for" by the Xs.

SDe2 = (1 − R2) SDy2

11 of 51

R2 definition & interpretation

R2 is the proportion of the total (squared) variation in Y that is "accounted for" by X.

R2 = r2 = (SDy2 − SDe2) / SDy2 = 1 − (SDe2/SDy2)

SDe = SDy √(1 − r2)

Under Gaussian theory, 95% of the errors are within ±2 SDe of their corresponding predicted value, Ŷ.

Y = Ŷ + e, so Var(Y) = Var(Ŷ + e) = Var(Ŷ) + Var(e)

Var(Ŷ) = Var(Y) − Var(e) = SDy2 − SDe2

R2 = Var(Ŷ)/Var(Y) = (SDy2 − SDe2) / SDy2

12 of 51

How big should R2 be?

SBP SD = 26.4 mm Hg, SDe=18.6

95% PI: Ŷ± 2(18.6) or Ŷ± 37.2 mm Hg

How big does R2 have to be to make the 95% PI equal to Ŷ ± 10 mm Hg? → need SDe = 5 mm Hg

R2 = 1 − (SDe/SDy)2 = 1 − (5/26.4)2 = 1 − 0.036 = 0.964, or 96.4%

(with age only, R2 = 0.515)
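A one-line back-calculation of that target, as a sketch (using the Ŷ ± 2 SDe approximation for the PI):

```python
sdy = 26.4                      # SD of SBP (mm Hg)
half_width = 10.0               # desired 95% PI half-width: Yhat +/- 10 mm Hg
sde_needed = half_width / 2     # since PI ~ Yhat +/- 2*SDe
r2_needed = 1 - (sde_needed / sdy) ** 2
print(f"need SDe <= {sde_needed} mm Hg, i.e. R2 >= {r2_needed:.3f}")   # ~0.964
```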

13 of 51

Correlation coefficient (r)

Instead of regressing Y on X,

create Ys = (Y-mean Y)/SDy =“standardized Y”

Xs = (X-mean X)/SDx =“standardized X”

Ys & Xs are both in SD units & have mean=0

If we regress Ys on Xs: Ys = a + b Xs + error,

then a is always 0 and the fitted "b" is called "r", the correlation coefficient: the slope in SD units.

r ranges from -1 to 1. r stays the same if the roles of Ys and Xs are reversed. A value of r=0 implies no (linear) association between X and Y.
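A small sketch with synthetic data (numpy assumed) showing that when both variables are standardized, the fitted intercept is 0 and the slope equals Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 15, size=200)                   # age-like predictor
y = 81.5 + 1.22 * x + rng.normal(0, 18.6, 200)     # SBP-like outcome with noise

# standardize both variables to mean 0, SD 1
xs = (x - x.mean()) / x.std(ddof=1)
ys = (y - y.mean()) / y.std(ddof=1)

b, a = np.polyfit(xs, ys, 1)                        # slope and intercept of Ys on Xs
r = np.corrcoef(x, y)[0, 1]
print(f"intercept a = {a:.3f} (always ~0), slope b = {b:.3f}, Pearson r = {r:.3f}")
```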

14 of 51

Correlation-interpretation, |r| < 1

15 of 51

Slope is related to correlation (simple regression)

slope = correlation × (SDy/SDx)

b = r (SDy/SDx): 1.22 = 0.7178 × (26.4/15.5)

where SDy is the SD of the Y variable and SDx is the SD of the X variable.

r = b (SDx/SDy): 0.7178 = 1.22 × (15.5/26.4)

r = b SDx / √(b2 SDx2 + SDe2)

SDe = residual error SD, SDx = SD of X

When b = 0, r = 0; when b > 0, r > 0; when b < 0, r < 0.
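A quick numeric check of these identities with the example values (any small discrepancies are rounding only):

```python
sdx, sdy, sde = 15.5, 26.4, 18.6
b, r = 1.22, 0.7178

print(b, r * (sdy / sdx))                               # b = r * SDy/SDx  ~ 1.22
print(r, b * (sdx / sdy))                               # r = b * SDx/SDy  ~ 0.72
print(r, b * sdx / (b**2 * sdx**2 + sde**2) ** 0.5)     # r from b, SDx, SDe ~ 0.71
```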

16 of 51

Computing slope & correlation

b = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)2,   a = Ȳ − b X̄

r = ∑(Xi − X̄)(Yi − Ȳ) / √[∑(Xi − X̄)2 ∑(Yi − Ȳ)2]

where X̄ and Ȳ are the sample means of X and Y.
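A short sketch of these formulas applied to a handful of the (age, SBP) pairs from the earlier table (illustrative only; numpy assumed):

```python
import numpy as np

x = np.array([22, 35, 47, 51, 63, 77], dtype=float)        # a few ages from the table
y = np.array([131, 121, 111, 144, 155, 178], dtype=float)  # the matching SBP values

xd, yd = x - x.mean(), y - y.mean()
b = np.sum(xd * yd) / np.sum(xd**2)                          # slope
a = y.mean() - b * x.mean()                                  # intercept
r = np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2)) # Pearson r

print(f"a = {a:.2f}, b = {b:.2f}, r = {r:.3f}")
assert np.isclose(r, np.corrcoef(x, y)[0, 1])                # matches the built-in
```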

17 of 51

Pearson vs Spearman corr=r

Pearson r – Assumes the relationship between Y and X is linear except for noise.

"Parametric" (inspired by the bivariate normal model). Strongly affected by outliers.

Spearman rs – Based on the ranks of Y and X. Assumes the relation between Y and X is monotone (non-decreasing or non-increasing). "Nonparametric". Less affected by outliers.

18 of 51

Pearson vs Spearman (rank)

            original data           ranks
            BMI      HbA1c          BMI     HbA1c
            32.4     2.6            3       2
            33.9     3.1            6       5
            29.9     2.5            1       1
            41.3     5.0            8       8
            32.7     3.6            4       6
            32.1     4.5            2       7
            38.7     5.2            7       9
            33.2     2.8            5       4
            48.1     2.8            9       4

Pearson r = 0.24 (computed on the original data)
Spearman rs = 0.48 (the Pearson correlation of the ranks)
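A minimal check with scipy (scipy.stats.pearsonr and spearmanr; the Spearman value may differ slightly from the slide because scipy assigns tied HbA1c values their average rank):

```python
from scipy.stats import pearsonr, spearmanr

bmi   = [32.4, 33.9, 29.9, 41.3, 32.7, 32.1, 38.7, 33.2, 48.1]
hba1c = [ 2.6,  3.1,  2.5,  5.0,  3.6,  4.5,  5.2,  2.8,  2.8]

r_pearson, _  = pearsonr(bmi, hba1c)     # linear association on the raw values
r_spearman, _ = spearmanr(bmi, hba1c)    # Pearson correlation of the ranks
print(f"Pearson r = {r_pearson:.2f}, Spearman rs = {r_spearman:.2f}")
```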

19 of 51

Pearson r vs Spearman rs

n=9, r =0.25, rs = 0.48

20 of 51

Limitation of linear models

21 of 51

Mean increase in X=19.9

Mean increase in Y= 39.9

Do X and Y have a positive correlation?

Example: X is the increase in minutes of exercise. Y is the increase in Math test scores in the same group.

Mean changes in X and Y do not necessarily imply correlation

22 of 51

Mean changes do not imply correlation

Just because there is a positive mean X change and a positive mean Y change does not necessarily imply that X and Y are correlated.

mean X change =19.9, mean Y change=39.9

r=0

23 of 51

Limitations of Linear Statistics: example of a nonlinear relationship

Correlations are misleading if relation is not linear or at least monotone

24 of 51

Limits to linear regression – "Pathological" Behavior: Ŷ = 3 + 0.5 X, r = 0.817, residual sum of squares = 13.75, n = 11 (for all four datasets below)

Weisberg, Applied Linear Regression, p 108

25 of 51

Simpson’s paradox

Final grade (Y) vs hours studying (X), r = -0.7981 – negative relationship ?

26 of 51

Simpson’s paradox (cont.)

Controlling for type of course – relations are now positive

27 of 51

“Ecologic” fallacy

Wrong unit of analysis: city, not person

28 of 51

Ecologic Fallacy – must look at the correct unit of analysis

29 of 51

truncating X, true r=0.9, R2=0.81

Full data

30 of 51

Interpreting correlation in experiments

Since r=b(SDx/SDy), an artificially lowered SDx will also lower r.

R2, b, and SDe when X is systematically changed:

Data                                                 R2     b      SDe
Complete data ("truth")                              0.81   0.90   0.43
Truncated (X < -1 SD deleted)                        0.47   1.03   0.43
Center deleted (-1 SD < X < 1 SD deleted)            0.91   0.90   0.45
Extremes deleted (X < -1 SD and X > 1 SD deleted)    0.58   0.92   0.42

Assumes intrinsic relation between X and Y is linear except for error.
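A simulation sketch of the idea behind this table (assumed setup: standard-normal X, true slope 0.9, SDe 0.43; the printed numbers illustrate the pattern rather than reproduce the table exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 5000)
y = 0.9 * x + rng.normal(0, 0.43, 5000)      # intrinsic relation is linear

def fit_stats(keep):
    b, a = np.polyfit(x[keep], y[keep], 1)
    e = y[keep] - (a + b * x[keep])
    sde = np.sqrt(np.sum(e**2) / (keep.sum() - 2))
    r2 = np.corrcoef(x[keep], y[keep])[0, 1] ** 2
    return r2, b, sde

for label, keep in [("full range of X", np.full(x.shape, True)),
                    ("X restricted to a narrow range", np.abs(x) < 1),
                    ("middle of the X range deleted", np.abs(x) > 1)]:
    r2, b, sde = fit_stats(keep)
    print(f"{label:32s} R2={r2:.2f}  b={b:.2f}  SDe={sde:.2f}")
# restricting X lowers R2, spreading X out raises R2, while b and SDe stay roughly the same
```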

31 of 51

Attenuation of regression coefficients when there is random error in X (true slope β = 4.0)

Negligible errors in X:

Y=1.149 + 3.959 X

SE(b) = 0.038

Noisy errors in X:

Y=-2.132 + 3.487 X

SE(b) = 0.276
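A simulation sketch of attenuation (assumed setup: true line Y = 1 + 4X, with measurement error then added to X; the numbers are illustrative, not the slide's data):

```python
import numpy as np

rng = np.random.default_rng(2)
x_true = rng.uniform(0, 10, 500)
y = 1 + 4.0 * x_true + rng.normal(0, 1, 500)       # true slope is 4.0

for sd_err in (0.0, 2.0):                           # negligible vs noisy error in X
    x_obs = x_true + rng.normal(0, sd_err, 500)     # what we actually measure
    b, a = np.polyfit(x_obs, y, 1)
    print(f"SD of error in X = {sd_err}: fitted slope = {b:.2f}")  # slope shrinks toward 0
```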

32 of 51

Attenuation (random error in X) drives the statistic toward its null value.

Example:

X=smoking (y/n), Y=lung cancer (y/n)

Compute odds ratio = OR

Randomly misclassify smokers/non smokers

What happens to the OR?

OR gets closer to 1.0 (conservative)

Attenuation gives conservative results.

33 of 51

Checking for linearity – smoothing & splines

Basic idea: in a plot of Y vs X, also plot Ŷ vs X, where

Ŷi = ∑j Wij Yj with ∑j Wij = 1 and Wij > 0.

The "weights" Wij are larger for observations whose X value is near Xi and smaller for those far from Xi.

Smooth: define a moving “window” of a given width around the ith data point and fit a mean (weighted moving average) in this window.

Spline: break the X axis into non-overlapping bins and fit a polynomial within each bin such that the “ends” all “match”.

The size of the window or bins controls the amount of smoothing.

We smooth until we obtain a smooth curve but go no further.

34 of 51

Check for linearity - Smoothing

1. Make a bin (window), take a (possibly weighted) average of all Y values in the bin or carry out a regression in the bin. The predicted value (mean) in the middle of the bin is the smoothed value (Ŷ).

2. Move bin over (overlapping) and repeat.

3. Connect the predicted values (Ŷs) with lines across all bins.
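A minimal moving-window smoother along these lines (an unweighted "boxcar" average for simplicity; kernel smoothers additionally down-weight points far from the window's center):

```python
import numpy as np

def moving_window_smooth(x, y, width):
    """For each x[i], average the y values whose x lies within +/- width/2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    yhat = np.empty_like(y)
    for i, xi in enumerate(x):
        in_window = np.abs(x - xi) <= width / 2
        yhat[i] = y[in_window].mean()          # smoothed value at x[i]
    return yhat

# toy example: a noisy curve; widening the window gives a smoother fitted curve
x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.default_rng(3).normal(0, 0.3, x.size)
y_smooth = moving_window_smooth(x, y, width=1.5)
```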


35 of 51

Smoothing to check for linearity (kernel smoothing)


36 of 51

Move window over- compute smoothed value


37 of 51

Move window over- compute smoothed value


38 of 51

Splines to check for linearity

Spline: break the X axis into equally spaced, non-overlapping "bins" (segments). Fit a polynomial (usually a quadratic or cubic) within each bin such that Y at the "ends" (the "knots") matches across adjacent bins (piecewise continuous) and the first derivatives (slopes) are also continuous.


39 of 51

Draftsman’s spline


40 of 51

Splines vs smoothing (JMP – Fitting linear models)

  1. Difficult to carry out smoothing in high dimensions (more than one X). That is, smoothing is essentially bivariate, not multivariate.
  2. Smoothing is completely defined by Y. Spline functional form is independent of Y.
  3. Splines only add k-2 parameters for k knot points.
  4. Splines are the “natural” test for curvature vs linearity.


41 of 51

Spline example – constant spline (same as "polychotomizing" x)


42 of 51

Linear and (restricted) cubic spline


Generally use restricted cubic splines with 3-5 knots to check linearity

43 of 51

Restricted cubic spline-RCS

Cubic equation: Ŷ = b0 + b1X + b2X² + b3X³

Fit an equation within each non overlapping bin (segment).

Restrictions on b0, b1, b2, b3:

Ŷ must be same in adjacent bins at the (X) knots.

dŶ/dX = b1 + 2b2X + 3b3X²

must be the same in adjacent bins at the knots (makes “smooth” connection).
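For concreteness, a sketch of one standard way to construct such a basis (the truncated-power parameterization of a restricted cubic spline; an illustration, not the lecture's or JMP's implementation):

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis: a linear column plus k-2 nonlinear
    columns for k knots, constrained to be linear beyond the outer knots."""
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = len(t)
    pos3 = lambda u: np.clip(u, 0, None) ** 3             # (u)_+ cubed
    cols = [x]                                             # the linear term
    for j in range(k - 2):                                 # the k-2 nonlinear terms
        cols.append(pos3(x - t[j])
                    - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                    + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
    return np.column_stack(cols)

# least-squares fit of Y on [intercept, RCS basis of X]
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)
y = np.log1p(x) + rng.normal(0, 0.1, x.size)               # curved "truth" plus noise
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots=[1, 3, 5, 7, 9])])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta                                             # fitted spline values
```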

44 of 51

Checking for linearity via spline

Fit a straight line to the data (linear model)

Fit a restricted cubic spline to the same data (non linear model). Compare fit (R square, SDe). F test for this comparison:

F = ([RSSlinear − RSSspline] / k) / (SDespline)2

If the straight line fits about as well as the cubic spline (the p value from F is not significant), we can say the relationship between Y and X is linear. (k = df spline − df linear; RSS = residual error sum of squares.)

For either model, RSS = (n − p) SDe2, where p is the number of parameters in that model.
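A sketch of this comparison (hypothetical RSS values, purely for illustration; scipy.stats.f supplies the F distribution):

```python
from scipy.stats import f

def linearity_f_test(rss_linear, rss_spline, n, p_linear, p_spline):
    """Nested-model F test: does the spline fit better than the straight line?"""
    k = p_spline - p_linear                  # extra parameters used by the spline
    sde2_spline = rss_spline / (n - p_spline)
    F = ((rss_linear - rss_spline) / k) / sde2_spline
    p_value = f.sf(F, k, n - p_spline)       # upper-tail probability
    return F, p_value

# hypothetical numbers for illustration only
print(linearity_f_test(rss_linear=5200.0, rss_spline=5100.0, n=100, p_linear=2, p_spline=5))
```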

45 of 51

Smoothing example: IGFBP by BMI

Smoothing

Insufficient smoothing

Over smoothing

46 of 51

IGFBP by BMI

47 of 51

Smoothing example: IGFBP by BMI

Smoothing

48 of 51

Smoothing example: IGFBP by BMI

Insufficient smoothing

49 of 51

Smoothing example: IGFBP by BMI

Over smoothing

50 of 51

ANDRO by BMI – is underlying relationship linear?

51 of 51

Check linearity – ANDRO by BMI
assume linear: R2 = 0.153
assume spline: R2 = 0.165
p value = 0.8130 (LR test, null is linear)

Linear

Spline