1 of 25

Data Mining_Anoop Chaturvedi

1

Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 11

Principal Component and Least Angle Regression

By

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha

2 of 25

Example: LASSO

Data ⇒ mtcars dataset, fitted with the glmnet package in R

Input variables ⇒ mpg, wt, drat, qsec

Output variable ⇒ hp

Full Model:

Train set: 22 observations, Test set: 10 observations


 

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    473.779      105.213     4.503   0.000116
mpg              2.877        2.381     1.209   0.237319
wt              26.037       13.514     1.927   0.064600 .
drat             4.819       15.952     0.302   0.764910
qsec           -20.751        3.993    -5.197   1.79e-05
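The table above is R's lm() summary for the full model. As a rough illustration of how those columns (estimate, standard error, t value) are computed, here is a minimal sketch in Python with NumPy; the synthetic design and coefficient values are assumptions for illustration, not the mtcars data.

```python
# Sketch: computing a "full model" coefficient table by ordinary least
# squares. The slides use R's lm() on mtcars; the data below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, p = 22, 4                        # train size from the slide, 4 predictors
X = rng.normal(size=(n, p))         # stand-ins for mpg, wt, drat, qsec
beta_true = np.array([3.0, 25.0, 5.0, -20.0])   # illustrative values
y = 470 + X @ beta_true + rng.normal(scale=30, size=n)

Xd = np.column_stack([np.ones(n), X])           # add intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # "Estimate" column
resid = y - Xd @ beta_hat
sigma2 = resid @ resid / (n - p - 1)            # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))  # "Std. Error"
t_vals = beta_hat / se                          # "t value" column
```

Each t value is simply the estimate divided by its standard error, which is why qsec's large negative estimate yields a large negative t value.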


4 of 25

Final Model: [fitted LASSO equation shown on the original slide]


5 of 25

Predict hp for mpg = 25, wt = 3.0, drat = 4.5, qsec = 20.5.

  • Ridge regression shrinks all coefficients towards zero but keeps every predictor in the model.
  • Lasso regression can remove predictors from the model by shrinking their coefficients exactly to zero.
  • The glmnet() function in R can also fit the model with other penalty functions, such as the elastic net.
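The contrast drawn in the first two bullets can be sketched with scikit-learn standing in for R's glmnet; the synthetic data and penalty strengths below are illustrative assumptions.

```python
# Sketch: ridge shrinks all coefficients towards zero; lasso can set
# some exactly to zero. Synthetic data stands in for mtcars.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                  # 4 predictors
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)  # last two irrelevant

ridge = Ridge(alpha=10.0).fit(X, y)    # alpha values chosen for illustration
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge leaves all four coefficients nonzero (shrunken towards zero);
# lasso zeroes out the coefficients of the irrelevant predictors.
print(ridge.coef_)
print(lasso.coef_)
```

Scanning a grid of penalty values with cross-validation (as glmnet's cv.glmnet does) is how the final penalty is chosen in practice.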



13 of 25

Example: PC Regression for the mtcars dataset (32 observations)

Response variable ⇒ hp

Predictor variables ⇒ mpg, disp, drat, wt, qsec

R package used: pls

Cross-validated using 10 random segments


14 of 25

Test RMSE calculated by 10-fold cross-validation:

Intercept only ⇒ test RMSE 69.66; adding the first PC ⇒ 43.30; adding the second PC ⇒ 34.25; adding further PCs increases the test RMSE.

Optimal number of PCs in the final model = 2.
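The selection procedure above (fit with k principal components, score by 10-fold cross-validated RMSE, keep the k with the smallest RMSE) can be sketched with a scikit-learn pipeline standing in for R's pls package; the data here are synthetic stand-ins for mtcars.

```python
# Sketch: principal component regression with the number of components
# chosen by 10-fold cross-validated RMSE, mirroring the slide's procedure.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 5))          # 32 obs, 5 predictors as in the example
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=32)   # induce multicollinearity
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=32)

scores = {}
for k in range(1, 6):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k),
                        LinearRegression())
    # 10-fold CV RMSE, mirroring the "10 random segments" on the slide
    rmse = -cross_val_score(pcr, X, y, cv=10,
                            scoring="neg_root_mean_squared_error").mean()
    scores[k] = rmse

best_k = min(scores, key=scores.get)  # number of PCs with lowest CV RMSE
```

Standardizing before PCA matters here: the components are otherwise dominated by whichever predictor has the largest variance.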


15 of 25

Cross-validation plots:


16 of 25


Adding two PCs improves the model fit. Adding more PCs worsens the fit.

17 of 25

Prediction MSE = 56.86549

Coefficients of final model


18 of 25

Least Angle Regression (LARS)

Forward stepwise regression ⇒ At each step, identify the best variable to add to the active set, then update the least squares fit.

Least angle regression ⇒ Uses only as much of a predictor as it "deserves."

First it selects the variable most correlated with the response and moves its coefficient continuously toward its least-squares value.

As it does so, that variable's correlation with the evolving residual decreases in absolute value.


19 of 25

As soon as another variable catches up in correlation with the residual, the process pauses: the second variable joins the active set, and the two coefficients are then moved together in a way that keeps their correlations with the residual tied and decreasing.

The process continues until all variables are in the model, and it ends at the full least-squares fit.
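The path just described can be reproduced with scikit-learn's lars_path, used here as a stand-in for R's lars package; the synthetic data and variable indices are illustrative assumptions.

```python
# Sketch of the LARS path: variables enter the active set one at a time,
# and the coefficient path ends at the full least-squares fit.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=60)

alphas, active, coefs = lars_path(X, y, method="lar")

# `active` records the order in which variables joined the active set
# (the variable most correlated with the response enters first);
# the last column of `coefs` is the ordinary least-squares solution.
print(active)
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(coefs[:, -1], ols))   # True: the path ends at the LS fit
```

Between breakpoints the coefficients move linearly, which is why the whole path can be computed at roughly the cost of a single least-squares fit.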



23 of 25

Disadvantages of the LAR method:

  • With high-dimensional, multicollinear independent variables, the selected variables may not be the actual causal variables.
  • LARS is based on iterative refitting of the residuals, so it is sensitive to the effects of noise.
  • Since high-dimensional data often exhibit multicollinearity among some variables, the difficulty LARS has with correlated variables may limit its application to high-dimensional data.
