Session 1:
Penalized Regression
Patrick Hall
Teaching Faculty, Department of Decision Sciences
George Washington University
Session 1 Agenda
Supervised Learning
Linear Regression
Regression
Regression: Ordinary Least-Squares (OLS) Method
Elements of Statistical Learning:
$y = f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$
$\hat{y} = g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Regression: Ordinary Least-Squares (OLS) Method
[Figure: a fitted regression plane over training points $(x_1^{(i)}, x_2^{(i)})$; annotations mark the intercept $\beta_0$, the slope $\beta_1$, the prediction $\hat{y}^{(i)} = g(x_1^{(i)}, x_2^{(i)})$, and the residual $\epsilon^{(i)}$.]
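To make the notation concrete, here is a minimal numpy sketch with hypothetical coefficients and a single hypothetical training point: $g$ produces the prediction $\hat{y}^{(i)}$, and the residual $y^{(i)} - \hat{y}^{(i)}$ estimates the error term $\epsilon^{(i)}$.

```python
import numpy as np

# hypothetical fitted coefficients: beta_0 (intercept), beta_1, beta_2
beta = np.array([1.0, 0.5, -2.0])

def g(x1, x2):
    """Prediction: y_hat = beta_0 + beta_1 * x1 + beta_2 * x2."""
    return beta[0] + beta[1] * x1 + beta[2] * x2

x1_i, x2_i, y_i = 3.0, 1.5, 2.0  # one hypothetical training point
y_hat_i = g(x1_i, x2_i)          # prediction y_hat(i)
resid_i = y_i - y_hat_i          # estimate of the error term epsilon(i)
print(y_hat_i, resid_i)
```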
OLS: Training
Error (squared loss):
$L(\beta) = \frac{1}{2N} \sum_{i=0}^{N-1} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
Gradient:
$\nabla_{\beta} L = -\frac{1}{N} \sum_{i=0}^{N-1} \left( y^{(i)} - \hat{y}^{(i)} \right) \mathbf{x}^{(i)}$
[Figure: error surface over $\beta$; steps of $-\nabla$ (gradient descent) move toward the minimum. Image: Elements of Statistical Learning, Figure 3.2.]
Normal equation:
$\hat{\beta} = \left( X^{\top} X \right)^{-1} X^{\top} \mathbf{y}$
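Both routes to $\hat{\beta}$ are easy to check numerically. A minimal numpy sketch, where the toy data, step size, and iteration count are assumptions; gradient descent on the squared error should land on the normal-equation solution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, P))])  # intercept + 2 inputs
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(scale=0.5, size=N)

# Normal equation: beta = (X'X)^-1 X'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the squared error
beta_gd = np.zeros(P + 1)
lr = 0.1
for _ in range(2000):
    grad = -(X.T @ (y - X @ beta_gd)) / N  # gradient of the mean squared error
    beta_gd -= lr * grad                   # step in the -gradient direction

print(beta_ne, beta_gd)  # the two estimates agree closely
```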
OLS: Interpretation
provider_charge ~ medicare_payment + num_service
Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
| --- | --- | --- | --- | --- | --- |
| Model | 2 | 3.85E+11 | 1.92E+11 | 1148.9 | <.0001 |
| Error | 3334 | 5.58E+11 | 167376011 | | |
| Corrected Total | 3336 | 9.43E+11 | | | |

| Root MSE | 12937 | R-Square | 0.408 |
| --- | --- | --- | --- |
| Dependent Mean | 24721 | Adj R-Sq | 0.4076 |
| Coeff Var | 52.33355 | | |

Parameter Estimates

| Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Variance Inflation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercept | Intercept | 1 | -1219.43 | 598.38 | -2.04 | 0.0416 | 0 |
| AVE_ave_medicare_payment | Average Medicare Payment | 1 | 3.83 | 0.08 | 47.88 | <.0001 | 1.02 |
| AVE_num_service | Number of Services | 1 | -5.84 | 1.17 | -4.96 | <.0001 | 1.02 |
SST = SSM + SSE
F = MSM/MSE, the ratio of the model mean square to the error/residual mean square.
Interpreted here as “rejecting the null hypothesis that all regression parameters equal 0,” i.e. the regression model is valid.
R2 = SSM/SST, interpreted as "the proportion of variance in the response variable explained by the model."
Adjusted R2 penalizes R2 for the number of input variables, for fairer comparison between models with different numbers of inputs.
VIF > 10 is considered an indicator of possible multicollinearity problems.
Standard error of the coefficient – should be much smaller than the coefficient itself. (It is the standard deviation of the coefficient estimate.)
t-test for the coefficient, here interpreted as “rejecting the null hypothesis that this coefficient is equal to 0,” i.e. this variable is “significant.”
Estimated parameter for the input, here interpreted as “holding all other inputs constant, for a one unit increase in average Medicare payment, average provider charge will increase by 3.83 units on average.”
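Output like the above can be reproduced in Python with statsmodels; a sketch, where the CSV file name is hypothetical and the column names are taken from the model above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("provider_payments.csv")  # hypothetical file
X = sm.add_constant(df[["AVE_ave_medicare_payment", "AVE_num_service"]])
y = df["provider_charge"]

model = sm.OLS(y, X).fit()
print(model.summary())  # fit statistics, parameter estimates, t-tests

# variance inflation factor for each column of the design matrix
for j, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, j))
```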
OLS: Requirements
| Requirement | If broken… |
| --- | --- |
| Linear relationship between inputs and targets; normal y, normal errors | Inappropriate application/unreliable results; use a machine learning technique or a GLM |
| N > p | Underspecified/unreliable results; use LASSO or elastic net penalized regression |
| No strong multicollinearity | Ill-conditioned/unstable/unreliable results; use ridge (L2/Tikhonov) or elastic net penalized regression |
| No influential outliers | Biased predictions, parameters, and statistical tests; use robust methods, e.g., IRLS or Huber loss, or investigate/remove outliers |
| Constant variance/no heteroskedasticity | Lessened predictive accuracy and invalidated statistical tests; use a GLM in some cases |
| Limited correlation between input rows (no autocorrelation) | Invalidated statistical tests; use time-series methods or a machine learning technique |
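Several of these conditions can be checked directly on a fitted statsmodels model. A sketch, reusing the hypothetical fitted `model` from the previous example:

```python
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# heteroskedasticity: a small Breusch-Pagan p-value suggests nonconstant variance
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pval)

# autocorrelation: Durbin-Watson far from 2 suggests correlated residuals
print("Durbin-Watson:", durbin_watson(model.resid))

# multicollinearity: a large condition number signals an ill-conditioned X'X
print("Condition number:", np.linalg.cond(model.model.exog))
```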
Contemporary Approaches: Generalized Linear Models (GLM)
$g(\mathbb{E}(Y)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots$
The family/distribution defines the mean and variance of $Y$, and allows for nonconstant variance.
A nonlinear link function $g$ connects the linear component $\beta_0 + \beta_1 x_1 + \cdots$ to $\mathbb{E}(Y)$.
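A minimal GLM sketch in statsmodels on toy count data; the Poisson family uses a log link by default, so the fitted relationship is $\log(\mathbb{E}(Y)) = X\beta$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))      # linear component inputs
counts = rng.poisson(np.exp(X @ [0.3, 0.8, -0.5]))  # toy count target

# Poisson family: the log link relates the linear component to E(Y)
glm = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(glm.params)
```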
Contemporary Approaches: Iteratively Reweighted Least Squares
"Inner loop": solve a weighted least-squares problem for the current weights,
$\hat{\beta} = \arg\min_{\beta} \sum_{i=0}^{N-1} w^{(i)} \left( y^{(i)} - \sum_{j=0}^{P-1} \beta_j x_j^{(i)} \right)^2$
"Outer loop": update the weights $w^{(i)}$ from the current fit and repeat until the $\beta_j$ converge.
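A minimal IRLS sketch in numpy for logistic regression (a GLM with the logit link); the iteration cap and tolerance are arbitrary choices:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """IRLS for logistic regression: X is (N, P) with an intercept
    column already added, y is (N,) of 0/1 targets."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):                      # outer loop
        eta = X @ beta                           # linear component
        p = 1.0 / (1.0 + np.exp(-eta))           # current estimate of E(Y)
        w = np.clip(p * (1.0 - p), 1e-10, None)  # weights from the variance of Y
        z = eta + (y - p) / w                    # working response
        # inner loop: one weighted least-squares solve
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```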
Penalized Linear Models
Contemporary Approaches: Penalized Regression
Contemporary Approaches: Elastic Net
$\min_{\beta} \; \sum_{i=0}^{N-1} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda_1 \sum_{j=0}^{P-1} \left| \beta_j \right| + \lambda_2 \sum_{j=0}^{P-1} \beta_j^2$
Least squares minimization – finds the β’s for the linear relationship.
L2/ridge/Tikhonov penalty – helps address multicollinearity.
L1/LASSO penalty – performs variable selection.
[Figure: values of the coefficients $\beta_j$ as the optimization proceeds (regularization paths).]
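In scikit-learn, ElasticNet exposes both penalties through alpha (total penalty strength) and l1_ratio (the mix between L1 and L2), and enet_path traces the coefficients across penalty values, which is what the figure above plots. A sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, enet_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)  # 2 informative inputs

# one fit: the L1 part selects variables, the L2 part shrinks coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)  # uninformative inputs are driven toward or exactly to zero

# coefficient paths: one row per input, one column per penalty value
alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5)
print(alphas.shape, coefs.shape)
```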
Logistic Regression
Classification: Overview
[Image: ESL, Figure 4.1 (pg. 129).]
[Image adapted from An Introduction to Statistical Learning with Applications in R.]
Many possible classification techniques.
Issues with Linear Regression
[Image adapted from An Introduction to Statistical Learning with Applications in R.]
Logistic Regression
[Images adapted from An Introduction to Statistical Learning with Applications in R; labels: PAY_0 and the "log of odds" or logit.]
Logistic Regression: Training
Maximum likelihood estimation. Objective function (negative log likelihood):
$L(\beta) = -\sum_{i=0}^{N-1} \left[ y^{(i)} \log \hat{p}^{(i)} + \left( 1 - y^{(i)} \right) \log \left( 1 - \hat{p}^{(i)} \right) \right]$, where $\hat{p}^{(i)} = \frac{1}{1 + e^{-\left( \beta_0 + \beta_1 x_1^{(i)} + \cdots \right)}}$
Gradient:
$\nabla_{\beta} L = \sum_{i=0}^{N-1} \left( \hat{p}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}$
[Figure: error surface over $\beta$; steps of $-\nabla$ (gradient descent) move toward the minimum.]
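A minimal gradient descent sketch in numpy for this objective; the toy data, step size, and iteration count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
p_true = 1.0 / (1.0 + np.exp(-(X @ [0.5, 2.0, -1.0])))
y = (rng.random(N) < p_true).astype(float)  # toy binary target

beta = np.zeros(3)
lr = 0.1
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # predicted probabilities
    grad = X.T @ (p - y) / N               # gradient of the mean negative log likelihood
    beta -= lr * grad                      # step in the -gradient direction
print(beta)  # should approach [0.5, 2.0, -1.0]
```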
Logistic Regression: Interpretation
Probability to log odds: $\log \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{P-1} x_{P-1}$
For interval inputs:
Log odds = -0.54
Odds = $e^{-0.54}$ = 0.58
"Holding all other variables constant, for a one unit increase in age, the odds of the event occurring change by a factor of 0.58 on average."
For categorical inputs:
Log odds ratio against reference level = 1.2
Odds ratio against reference level = $e^{1.2}$ = 3.32
Probability/event rate in training data = 3.32/(1 + 3.32) ≈ 0.77
"Holding all other variables constant, a person being male changes the odds of the event occurring by a factor of 3.32 over the reference level on average."
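A short sketch of these conversions, using the coefficient values from the slide's examples (the variable names are hypothetical):

```python
import numpy as np

# hypothetical fitted log-odds coefficients from the examples above
coef = {"age": -0.54, "sex_male": 1.2}

for name, log_odds in coef.items():
    odds = np.exp(log_odds)     # factor change in the odds
    prob = odds / (1.0 + odds)  # odds converted back to a probability
    print(f"{name}: odds factor {odds:.2f}, implied probability {prob:.2f}")
```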
Reading