1 of 25

Data Mining_Anoop Chaturvedi

1

Swayam Prabha

Course Title

Multivariate Data Mining- Methods and Applications

Lecture 11

Principal Component and Least Angle Regression

By

Anoop Chaturvedi

Department of Statistics, University of Allahabad

Prayagraj (India)

Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha

2 of 25

Example: LASSO

Data ⇒ mtcars dataset, fitted with the glmnet package in R

Input variables ⇒ mpg, wt, drat, qsec

Output variable ⇒ hp

Full Model:

Train set: 22 observations, Test set: 10 observations


 

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    473.779      105.213     4.503   0.000116
mpg              2.877        2.381     1.209   0.237319
wt              26.037       13.514     1.927   0.064600 .
drat             4.819       15.952     0.302   0.764910
qsec           -20.751        3.993    -5.197   1.79e-05
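The table above is R's lm() summary for the full model. As a rough illustration of how those columns (estimate, standard error, t value) are computed, here is a minimal sketch in Python with NumPy; the synthetic design and coefficient values are assumptions for illustration, not the mtcars data.

```python
# Sketch: computing a "full model" coefficient table by ordinary least
# squares. The slides use R's lm() on mtcars; the data below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, p = 22, 4                        # train size from the slide, 4 predictors
X = rng.normal(size=(n, p))         # stand-ins for mpg, wt, drat, qsec
beta_true = np.array([3.0, 25.0, 5.0, -20.0])   # illustrative values
y = 470 + X @ beta_true + rng.normal(scale=30, size=n)

Xd = np.column_stack([np.ones(n), X])           # add intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # "Estimate" column
resid = y - Xd @ beta_hat
sigma2 = resid @ resid / (n - p - 1)            # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))  # "Std. Error"
t_vals = beta_hat / se                          # "t value" column
```

Each t value is simply the estimate divided by its standard error, which is why qsec's large negative estimate yields a large negative t value.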


4 of 25

Final Model: [fitted LASSO equation shown on the original slide]


5 of 25

Predict hp for mpg = 25, wt = 3.0, drat = 4.5, qsec = 20.5.

  • Ridge regression shrinks all coefficients towards zero but keeps every predictor in the model.
  • Lasso regression can remove predictors from the model by shrinking their coefficients exactly to zero.
  • The glmnet() function in R can also fit the model with other penalty functions, such as the elastic net.
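The contrast drawn in the first two bullets can be sketched with scikit-learn standing in for R's glmnet; the synthetic data and penalty strengths below are illustrative assumptions.

```python
# Sketch: ridge shrinks all coefficients towards zero; lasso can set
# some exactly to zero. Synthetic data stands in for mtcars.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                  # 4 predictors
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)  # last two irrelevant

ridge = Ridge(alpha=10.0).fit(X, y)    # alpha values chosen for illustration
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge leaves all four coefficients nonzero (shrunken towards zero);
# lasso zeroes out the coefficients of the irrelevant predictors.
print(ridge.coef_)
print(lasso.coef_)
```

Scanning a grid of penalty values with cross-validation (as glmnet's cv.glmnet does) is how the final penalty is chosen in practice.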



13 of 25

Example: PC Regression for the mtcars dataset (32 observations)

Response variable ⇒ hp

Predictor variables ⇒ mpg, disp, drat, wt, qsec

R package used: pls

Cross-validated using 10 random segments


14 of 25

Test RMSE calculated by 10-fold cross-validation:

Intercept only ⇒ test RMSE 69.66; adding the first PC ⇒ 43.30; adding the second PC ⇒ 34.25; adding further PCs increases the test RMSE.

Optimal number of PCs in the final model = 2.
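The selection procedure above (fit with k principal components, score by 10-fold cross-validated RMSE, keep the k with the smallest RMSE) can be sketched with a scikit-learn pipeline standing in for R's pls package; the data here are synthetic stand-ins for mtcars.

```python
# Sketch: principal component regression with the number of components
# chosen by 10-fold cross-validated RMSE, mirroring the slide's procedure.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 5))          # 32 obs, 5 predictors as in the example
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=32)   # induce multicollinearity
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=32)

scores = {}
for k in range(1, 6):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k),
                        LinearRegression())
    # 10-fold CV RMSE, mirroring the "10 random segments" on the slide
    rmse = -cross_val_score(pcr, X, y, cv=10,
                            scoring="neg_root_mean_squared_error").mean()
    scores[k] = rmse

best_k = min(scores, key=scores.get)  # number of PCs with lowest CV RMSE
```

Standardizing before PCA matters here: the components are otherwise dominated by whichever predictor has the largest variance.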


15 of 25

Cross-validation plots:


16 of 25


Adding two PCs improves the model fit. Adding more PCs worsens the fit.

17 of 25

Prediction MSE = 56.86549

Coefficients of final model


18 of 25

Least Angle Regression (LARS)

Forward stepwise regression ⇒ At each step, identify the best variable to add to the active set, then update the least squares fit.

Least angle regression ⇒ Uses only as much of a predictor as it "deserves."

First it selects the variable most correlated with the response and moves its coefficient continuously toward its least-squares value.

As it does so, that variable's correlation with the evolving residual decreases in absolute value.


19 of 25

As soon as another variable catches up in correlation with the residual, the process pauses: the second variable joins the active set, and the two coefficients are then moved together in a way that keeps their correlations with the residual tied and decreasing.

The process continues until all variables are in the model, and it ends at the full least-squares fit.
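The path just described can be reproduced with scikit-learn's lars_path, used here as a stand-in for R's lars package; the synthetic data and variable indices are illustrative assumptions.

```python
# Sketch of the LARS path: variables enter the active set one at a time,
# and the coefficient path ends at the full least-squares fit.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=60)

alphas, active, coefs = lars_path(X, y, method="lar")

# `active` records the order in which variables joined the active set
# (the variable most correlated with the response enters first);
# the last column of `coefs` is the ordinary least-squares solution.
print(active)
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(coefs[:, -1], ols))   # True: the path ends at the LS fit
```

Between breakpoints the coefficients move linearly, which is why the whole path can be computed at roughly the cost of a single least-squares fit.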



23 of 25

Disadvantages of the LAR method:

  • With high-dimensional, multicollinear independent variables, the selected variables may not be the actual causal variables.
  • LARS is based on iterative refitting of the residuals, so it is sensitive to the effects of noise.
  • Since high-dimensional data often exhibit multicollinearity among some variables, the difficulty LARS has with correlated variables may limit its application to high-dimensional data.
