Data Mining_Anoop Chaturvedi
1
Swayam Prabha
Course Title
Multivariate Data Mining- Methods and Applications
Lecture 11
Principal Component and Least Angle Regression
By
Anoop Chaturvedi
Department of Statistics, University of Allahabad
Prayagraj (India)
Slides can be downloaded from https://sites.google.com/view/anoopchaturvedi/swayam-prabha
Example: LASSO
Data ⇒ mtcars ⇒ glmnet package in R
Input variables ⇒ mpg, wt, drat, qsec
Output variable ⇒ hp
Full Model:
Train set: 22 observations, Test set: 10 observations
|           | Estimate | Std. Error | t value | Pr(>|t|)   |
| Intercept | 473.779  | 105.213    |  4.503  | 0.000116   |
| mpg       |   2.877  |   2.381    |  1.209  | 0.237319   |
| wt        |  26.037  |  13.514    |  1.927  | 0.064600 . |
| drat      |   4.819  |  15.952    |  0.302  | 0.764910   |
| qsec      | -20.751  |   3.993    | -5.197  | 1.79e-05   |
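The columns of such a summary table come from the ordinary least squares fit: estimates, their standard errors from the residual variance, t = estimate/SE, and a two-sided p-value. A minimal NumPy/SciPy sketch of that computation, on synthetic stand-in data (the generated values are illustrative, not the real mtcars numbers):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the slide's regression of hp on
# mpg, wt, drat, qsec; data and coefficients are illustrative only.
rng = np.random.default_rng(0)
n, p = 22, 4                                   # train set of 22, 4 predictors
X = rng.normal(size=(n, p))
y = 470 + X @ np.array([3.0, 25.0, 5.0, -20.0]) + rng.normal(scale=30, size=n)

Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta = np.linalg.lstsq(Xd, y, rcond=None)[0]   # OLS estimates
resid = y - Xd @ beta
df = n - Xd.shape[1]                           # residual degrees of freedom
s2 = resid @ resid / df                        # residual variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
t_val = beta / se                              # "t value" column
p_val = 2 * stats.t.sf(np.abs(t_val), df)      # two-sided "Pr(>|t|)" column
```

This is what `summary(lm(...))` reports in R; note that each t value carries the sign of its estimate.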
Final Model:
Predict hp for mpg = 25, wt = 3.0, drat = 4.5, qsec = 20.5.
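The slides fit the lasso with glmnet in R. A comparable scikit-learn sketch, with the penalty chosen by cross-validation (as `cv.glmnet` does) and a prediction at the slide's query point; the data here are synthetic stand-ins, so the fitted coefficients and the predicted value are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for mtcars (columns: mpg, wt, drat, qsec);
# generated data and fitted values are illustrative only.
rng = np.random.default_rng(1)
n = 32
X = rng.normal(size=(n, 4)) * [6.0, 1.0, 0.5, 1.8] + [20.0, 3.2, 3.6, 17.8]
y = 470 + X @ np.array([3.0, 25.0, 5.0, -20.0]) + rng.normal(scale=25, size=n)

# Train on 22 observations and hold out 10, mirroring the slide's split;
# LassoCV selects the penalty level by cross-validation.
model = LassoCV(cv=5, random_state=0).fit(X[:22], y[:22])
x_new = np.array([[25.0, 3.0, 4.5, 20.5]])     # mpg, wt, drat, qsec from the slide
hp_hat = model.predict(x_new)[0]
```

Coefficients shrunk exactly to zero by the penalty drop out of the final model, which is what distinguishes the lasso fit from the full OLS table above.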
Example: PC Regression for the mtcars dataset (32 observations)
Response variable ⇒ hp
Predictor variables ⇒ mpg, disp, drat, wt, qsec
R package used: pls
Cross-validated using 10 random segments
Test RMSE calculated by 10-fold cross-validation:
Intercept only ⇒ test RMSE 69.66; adding the first PC ⇒ 43.30; adding the second PC ⇒ 34.25; adding further PCs increases the test RMSE.
Optimal number of PCs in the final model = 2.
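The slides use `pcr()` from the pls package in R. The same component-selection loop can be sketched in scikit-learn as a pipeline of standardization, PCA, and linear regression, scoring each choice of k by 10-fold cross-validated RMSE; the data below are a synthetic stand-in, so the RMSE values and the selected k are illustrative only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the five mtcars predictors (mpg, disp,
# drat, wt, qsec); data and RMSE values are illustrative only.
rng = np.random.default_rng(2)
n, p = 32, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)     # correlated predictors
y = 100 + 40 * X[:, 0] - 30 * X[:, 3] + rng.normal(scale=10, size=n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
rmse = {}
for k in range(1, p + 1):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    scores = cross_val_score(pcr, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    rmse[k] = -scores.mean()                     # CV test RMSE with k PCs
best_k = min(rmse, key=rmse.get)                 # k minimizing the CV RMSE
```

Plotting `rmse` against k gives the cross-validation curve shown on the next slide; the chosen model keeps the k at which the curve bottoms out.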
Cross-validation plots:
Adding two PCs improves the model fit; adding more PCs worsens it.
Prediction MSE = 56.86549
Coefficients of final model:
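PCR estimates coefficients γ on the retained component scores; the coefficients reported for the original predictors are recovered as β = V_k γ, where V_k holds the first k loading vectors. A NumPy sketch on illustrative data:

```python
import numpy as np

# PCR fits y on the scores Z = Xc V_k; the implied coefficients on
# the original predictors are beta = V_k @ gamma.
rng = np.random.default_rng(3)
X = rng.normal(size=(32, 5))
Xc = X - X.mean(axis=0)                        # centred predictors
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                          # PCs retained, as on the slide
Z = Xc @ Vt[:k].T                              # scores on the first k PCs
y = Z @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=32)
gamma = np.linalg.lstsq(Z, y, rcond=None)[0]   # coefficients on the PCs
beta = Vt[:k].T @ gamma                        # implied coefficients on X
```

Since Xβ = Zγ by construction, the two parameterizations give identical fitted values; β is simply the final model expressed on the original predictor scale.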
Least Angle Regression (LARS)
Forward stepwise regression ⇒ At each step, identify the best variable to add to the active set, then update the least squares fit.
Least angle regression ⇒ Enters only as much of each predictor as it "deserves".
First, it selects the variable most correlated with the response and moves its coefficient continuously toward its least squares value.
As it does so, that variable's correlation with the evolving residual decreases in absolute value.
As soon as another variable catches up in correlation with the residual, the process pauses: the second variable joins the active set, and the coefficients of both are then moved together in a way that keeps their correlations tied and decreasing.
The process continues until all variables are in the model, ending at the full least squares fit.
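The steps above can be traced with scikit-learn's `lars_path`, which returns the order in which predictors enter the active set and the piecewise-linear coefficient path; the data here are illustrative:

```python
import numpy as np
from sklearn.linear_model import lars_path

# Illustrative data: four predictors, two carrying real signal.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = 5 * X[:, 2] - 3 * X[:, 0] + rng.normal(scale=0.5, size=50)

# method="lar" traces the least angle path; `active` lists predictors
# in the order they joined the active set, and each column of `coefs`
# is the coefficient vector at one breakpoint of the path.
alphas, active, coefs = lars_path(X, y, method="lar")
ols = np.linalg.lstsq(X, y, rcond=None)[0]     # full least squares fit
```

The last column of `coefs` coincides with the OLS solution, matching the statement that the LAR path ends at the full least squares fit.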
Disadvantages of the LAR method: