1 of 28

Session 1:

Penalized Regression

Patrick Hall

Teaching Faculty, Department of Decision Sciences

George Washington University

2 of 28

Session I Agenda

  • Introduction
  • Linear Regression
  • Generalized Linear Models (GLM)
  • Penalized Linear Models
  • Logistic Regression
  • Python Code Example: L2 Gradient Descent on Credit Data
  • Reading

3 of 28

Supervised Learning

  • Regression is a form of supervised learning where the data contains known past outcomes for a numeric target variable, e.g., credit limit amounts.

  • Classification is the other major type of supervised learning. In classification, the data contains known past outcomes for a categorical target variable, e.g., credit lending yes/no decisions.

4 of 28

Linear Regression

5 of 28

Regression

  • Regression is the task of learning a target function f that maps each attribute set X into a continuous-valued output y.
  • The goal of regression is to find a target function that can fit the input data with minimum error.
  • A standard approach is to apply the method of least squares, which finds the model parameters (regression coefficients) that minimize the sum of squared errors (the residual sum of squares).

6 of 28

Regression: Ordinary Least-Squares (OLS) Method

Elements of Statistical Learning:

$y = f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

$\hat{y} = g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

7 of 28

Regression: Ordinary Least-Squares (OLS) Method

Elements of Statistical Learning:

$y = f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

$\hat{y} = g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

[Figure: fitted regression plane over the training points, annotating a training point $\left(x_1^{(i)}, x_2^{(i)}\right)$, its prediction $\hat{y}^{(i)} = g\left(x_1^{(i)}, x_2^{(i)}\right)$, its residual $\epsilon^{(i)}$, and the intercept $\beta_0$.]

8 of 28

OLS: Training

Error:

$J(\beta) = \frac{1}{2} \sum_{i=0}^{N-1} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

Gradient:

$\nabla_{\beta} J(\beta) = \sum_{i=0}^{N-1} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}$

Gradient descent: step the coefficients in the $-\nabla$ direction until the error is minimized.

Image: Elements of Statistical Learning, Figure 3.2.

Normal Equation (closed-form solution):

$\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} y$

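To make the training loop concrete, here is a minimal NumPy sketch of gradient descent on the squared-error objective above, checked against the normal equation. The simulated data, learning rate, and iteration count are illustrative assumptions, not part of the slide.

```python
import numpy as np

# Minimal sketch: gradient descent on the squared-error objective,
# checked against the closed-form normal equation. Data are simulated.
rng = np.random.default_rng(0)
N, P = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])  # intercept + 2 inputs
beta_true = np.array([1.0, 2.0, -0.5])                          # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)               # simulated target

beta = np.zeros(P)
learning_rate = 0.1
for _ in range(2000):
    residuals = X @ beta - y          # (y_hat - y) for each observation i
    gradient = X.T @ residuals / N    # gradient of the (mean) squared error
    beta -= learning_rate * gradient  # step in the -gradient direction

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation: (X'X)^-1 X'y
print(beta, beta_normal)                         # the two solutions should agree closely
```
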
9 of 28

OLS: Interpretation

provider_charge ~ medicare_payment + num_service 

Analysis of Variance

  Source            DF     Sum of Squares   Mean Square   F Value   Pr > F
  Model             2      3.85E+11         1.92E+11      1148.9    <.0001
  Error             3334   5.58E+11         167376011
  Corrected Total   3336   9.43E+11

  Root MSE         12937      R-Square   0.408
  Dependent Mean   24721      Adj R-Sq   0.4076
  Coeff Var        52.33355

Parameter Estimates

  Variable                   Label                      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
  Intercept                  Intercept                  1    -1219.43             598.38           -2.04     0.0416     0
  AVE_ave_medicare_payment   Average Medicare Payment   1    3.83                 0.08             47.88     <.0001     1.02
  AVE_num_service            Number of Services         1    -5.84                1.17             -4.96     <.0001     1.02

10 of 28

OLS: Interpretation

provider_charge ~ medicare_payment + num_service 

(Analysis of Variance and Parameter Estimates output repeated from the previous slide, with interpretation notes below.)

  • SST = SSM + SSE.
  • F = MSM/MSE, the scaled ratio of the model variance to the error/residual variance. Interpreted here as “rejecting the null hypothesis that all regression parameters equal 0,” i.e., the regression model is valid.
  • R² = SSM/SST, interpreted as “the proportion of variance in the response variable explained by the model.”
  • Adjusted R²: R² adjusted for the number of input variables in the model.
  • VIF > 10 is considered an indicator of possible multicollinearity problems.
  • Standard error of the coefficient (the standard deviation of the coefficient estimate) should be much smaller than the coefficient itself.
  • t-test for the coefficient, interpreted here as “rejecting the null hypothesis that this coefficient equals 0,” i.e., this variable is “significant.”
  • Estimated parameter for the input, interpreted here as “holding all other inputs constant, for a one unit increase in average Medicare payment, average provider charge will increase by 3.83 units on average.”

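A similar readout can be produced in Python. The sketch below is a hedged example using statsmodels; the file name, input columns, and target column are assumptions standing in for the provider-charge data on this slide.

```python
import pandas as pd
import statsmodels.api as sm

# Hedged sketch: reproduce a similar ANOVA / parameter-estimate readout with statsmodels.
# The file name and column names below are assumptions, not the course data.
df = pd.read_csv("provider_charges.csv")
X = sm.add_constant(df[["AVE_ave_medicare_payment", "AVE_num_service"]])
y = df["AVE_provider_charge"]  # assumed target column name

ols = sm.OLS(y, X).fit()
print(ols.summary())  # coefficients, standard errors, t-tests, R-squared, F statistic
```
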
11 of 28

OLS: Requirements

Requirement → If broken …

  • Linear relationship between inputs and targets; normal y, normal errors → Inappropriate application/unreliable results; use a machine learning technique or a GLM.
  • N > p → Underspecified/unreliable results; use LASSO or elastic net penalized regression.
  • No strong multicollinearity → Ill-conditioned/unstable/unreliable results; use ridge (L2/Tikhonov) or elastic net penalized regression.
  • No influential outliers → Biased predictions, parameters, and statistical tests; use robust methods (e.g., IRLS, Huber loss) or investigate/remove outliers.
  • Constant variance/no heteroskedasticity → Lessened predictive accuracy and invalidated statistical tests; use a GLM in some cases.
  • Limited correlation between input rows (no autocorrelation) → Invalidated statistical tests; use time-series methods or a machine learning technique.

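As one way to check the multicollinearity requirement, the sketch below computes variance inflation factors with statsmodels; the simulated inputs are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Screen inputs for multicollinearity: VIF > 10 is a common warning sign.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 0.95 * X["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1

design = sm.add_constant(X)  # include the intercept, as in the regression itself
vif = pd.Series(
    [variance_inflation_factor(design.values, j) for j in range(design.shape[1])],
    index=design.columns,
)
print(vif)  # inflated values for x1/x3 flag possible multicollinearity
```
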
12 of 28

Contemporary Approaches: Generalized Linear Models (GLM)

 

13 of 28

Contemporary Approaches: Generalized Linear Models (GLM)

 

$g\left( E(Y) \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{P-1} x_{P-1}$

  • Family/distribution defines the mean and variance of Y.
  • A nonlinear link function $g$ connects the linear component to $E(Y)$.
  • Linear component: $\beta_0 + \beta_1 x_1 + \dots + \beta_{P-1} x_{P-1}$.
  • The family/distribution allows for nonconstant variance.

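A minimal GLM sketch with statsmodels is shown below; the Poisson family, log link, and simulated data are assumptions chosen for illustration, not the course's credit data.

```python
import numpy as np
import statsmodels.api as sm

# Minimal GLM sketch: Poisson family with its default log link, on simulated data.
rng = np.random.default_rng(0)
N = 500
X = sm.add_constant(rng.normal(size=(N, 2)))  # linear component inputs + intercept
mu = np.exp(X @ np.array([0.3, 0.7, -0.2]))   # inverse link applied to X @ beta
y = rng.poisson(mu)                           # family defines the mean and variance of Y

glm = sm.GLM(y, X, family=sm.families.Poisson())
print(glm.fit().summary())
```
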
14 of 28

Contemporary Approaches:

Iteratively Reweighted Least Squares

 

“Inner Loop”: solve a weighted least-squares problem for the coefficients $\beta_j$, $j = 0, \dots, P-1$, across the observations $i = 0, \dots, N-1$.

“Outer Loop”: recompute the observation weights and working response from the current fit, then repeat the inner loop until the coefficients converge.

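The sketch below illustrates the inner/outer loop idea for one common case, a logistic-regression GLM; the binomial family, simulated data, and convergence tolerance are assumptions made for illustration.

```python
import numpy as np

# Minimal IRLS sketch for a logistic-regression GLM (binomial family, logit link).
rng = np.random.default_rng(0)
N, P = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta = np.zeros(P)
for _ in range(25):                              # outer loop
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
    w = p * (1.0 - p)                            # observation weights
    z = X @ beta + (y - p) / w                   # working response
    # inner loop: weighted least-squares solve for the P coefficients
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:   # stop when coefficients converge
        beta = beta_new
        break
    beta = beta_new

print(beta)  # should be close to true_beta
```
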
15 of 28

Penalized Linear Models

16 of 28

Contemporary Approaches: Penalized Regression

17 of 28

Contemporary Approaches: Elastic Net

 

$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=0}^{N-1} \left( y^{(i)} - \beta_0 - \sum_{j=0}^{P-1} \beta_j x_j^{(i)} \right)^2 + \lambda_2 \sum_{j=0}^{P-1} \beta_j^2 + \lambda_1 \sum_{j=0}^{P-1} \left| \beta_j \right|$

18 of 28

Contemporary Approaches: Elastic Net

 

$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=0}^{N-1} \left( y^{(i)} - \beta_0 - \sum_{j=0}^{P-1} \beta_j x_j^{(i)} \right)^2 + \lambda_2 \sum_{j=0}^{P-1} \beta_j^2 + \lambda_1 \sum_{j=0}^{P-1} \left| \beta_j \right|$

Least squares minimization – finds β’s for linear relationship.

 

L2/Ridge/Tikhonov penalty – helps address multicollinearity.

L1/LASSO penalty – for variable selection.

 

[Figure: values of the coefficients as the optimization proceeds (coefficient paths).]

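As a usage sketch, scikit-learn's ElasticNet fits this kind of penalized model; note that its alpha/l1_ratio parameterization differs from the λ1/λ2 notation above, and the simulated data below is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Minimal elastic net sketch: alpha scales the total penalty, l1_ratio mixes L1 vs. L2.
rng = np.random.default_rng(0)
N, P = 200, 10
X = rng.normal(size=(N, P))
beta_true = np.zeros(P)
beta_true[:3] = [2.0, -1.5, 0.7]                 # only 3 informative inputs
y = X @ beta_true + rng.normal(scale=0.5, size=N)

X_std = StandardScaler().fit_transform(X)        # penalties assume comparable input scales
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_std, y)

print(enet.intercept_, enet.coef_)               # the L1 part zeroes out uninformative coefficients
```
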
19 of 28

Logistic Regression

20 of 28

Classification: Overview

ESL, Figure 4.1 (pg 129)

  • Consider the simulated Default dataset.
  • The default status of a credit card payment is modeled from annual income and monthly credit card balance.
  • The default rate is about 3% (orange = defaulted; blue = otherwise).
  • The task is to predict the Yes or No payment status.

Adapted from Introduction to Statistical Learning Methods with R

  • In linear regression models, the prediction is numeric
  • For classification, the prediction is categorical:
    • yes/no
    • low/medium/high

Many possible classification techniques

    • Logistic Regression
    • Trees – Random Forest and Gradient Boosting
    • Neural Networks

21 of 28

Issues with Linear Regression

  • Consider a model predicting a patient’s medical condition from the individual’s symptoms, where there are 3 possible diagnoses (outcomes), using a numeric coding:
      • 1 if stroke
      • 2 if drug overdose
      • 3 if epileptic seizure
  • This numeric coding in a linear regression model implies an ordering of the outcomes and assumes that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
  • Other orderings of the encoding would be just as reasonable, yet each would imply a different fitted model.
  • Linear regression often yields estimates outside the [0, 1] interval, making them impossible to interpret as probability estimates.
  • It also violates the assumptions on the distribution of the errors.

Adapted from Introduction to Statistical Learning Methods with R

22 of 28

Logistic Regression

  • Consider the Default data set, where the response falls into one of two categories, Yes or No
  • Rather than modeling this response directly, logistic regression models the probability that the response belongs to each category (level)

Adapted from Introduction to Statistical Learning Methods with R

23 of 28

Logistic Regression

  • We need a function for the probability that outputs values between 0 and 1 for all values of the input variables.
  • Many functions meet these criteria – in particular, the logistic function:
    • S-Shaped curve
    • Maximum likelihood method to fit the model

Adapted from Introduction to Statistical Learning Methods with R

[Figure: fitted logistic (S-shaped) curve for the input PAY_0.]

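For reference, the logistic function mentioned above can be written in its standard form for a single input X:

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$
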
24 of 28

“Log of Odds” or Logit

  • After some mathematical manipulation, we find that the odds are: $\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$

  • By taking the logarithm of both sides, we arrive at the log odds, or logit: $\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X$

Adapted from Introduction to Statistical Learning Methods with R

25 of 28

Logistic Regression: Training

Objective function (negative log likelihood, i.e., log loss):

$J(\beta) = -\sum_{i=0}^{N-1} \left[ y^{(i)} \log p^{(i)} + \left( 1 - y^{(i)} \right) \log\left( 1 - p^{(i)} \right) \right]$, where $p^{(i)} = \frac{1}{1 + e^{-x^{(i)} \beta}}$

Gradient:

$\nabla_{\beta} J(\beta) = \sum_{i=0}^{N-1} \left( p^{(i)} - y^{(i)} \right) x^{(i)}$

Gradient descent: step the coefficients in the $-\nabla$ direction until the error is minimized.

Maximum likelihood estimation: minimizing this objective is equivalent to choosing the $\beta$ under which the observed outcomes $y^{(i)}$ are most probable.

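In the spirit of the session's Python example (L2 gradient descent on credit data), here is a minimal sketch of L2-penalized logistic regression trained by gradient descent; the simulated inputs, learning rate, and penalty strength are assumptions standing in for the credit data.

```python
import numpy as np

# Minimal sketch: L2-penalized logistic regression trained by gradient descent.
rng = np.random.default_rng(0)
N, P = 1000, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])  # intercept + inputs
true_beta = np.array([-1.0, 0.8, -0.6, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))       # simulated 0/1 outcomes

beta = np.zeros(P)
lr, lam = 0.1, 0.01                        # learning rate and L2 penalty strength
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # predicted probabilities
    grad = X.T @ (p - y) / N               # gradient of the mean log loss
    grad[1:] += lam * beta[1:]             # L2 penalty gradient (intercept unpenalized)
    beta -= lr * grad

print(beta)  # should be close to true_beta
```
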
26 of 28

Logistic Regression: Interpretation

 

Probability to Log Odds:

For categorical inputs:

  • p = event rate for that level
  • odds = p / (1 − p)
  • odds ratio = odds(level) / odds(reference level)
  • log odds ratio = ln(odds ratio)

For interval inputs:

  • p = change in event rate for a one unit increase; this is not constant
  • odds ratio = odds(level + 1) / odds(level); this is constant
  • log odds = ln(odds)

27 of 28

Logistic Regression: Interpretation

 

Log odds = -0.54

Odds = e^{-0.54} = 0.58

“Holding all other variables constant, for a one unit increase in age, the odds of the event occurring change by a factor of 0.58 on average.”

Probability to Log Odds:

For categorical inputs:

  • p = event rate for that level
  • odds = p / (1 − p)
  • odds ratio = odds(level) / odds(reference level)
  • log odds ratio = ln(odds ratio)

Log odds ratio against reference level = 1.2

Odds ratio against reference level = e^{1.2} = 3.32

Probability/event rate in training data = 3.32 / (1 + 3.32) ≈ 0.77

“Holding all other variables constant, a person being male changes the odds of the event occurring by a factor of 3.32 over the reference level on average.”

For interval inputs:

  • p = change in event rate for a one unit increase; this is not constant
  • odds ratio = odds(level + 1) / odds(level); this is constant
  • log odds = ln(odds)

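The interpretation arithmetic above can be reproduced in a few lines of Python; the coefficient values are the ones quoted on this slide.

```python
import numpy as np

# Reproduce the interpretation arithmetic from fitted logistic coefficients.
coef_age = -0.54                      # log odds change per one-unit increase in age
print(np.exp(coef_age))               # ~0.58: odds change by this factor per unit of age

coef_male = 1.2                       # log odds ratio vs. the reference level
odds_ratio = np.exp(coef_male)        # ~3.32
print(odds_ratio / (1 + odds_ratio))  # ~0.77: implied probability/event rate
```
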
28 of 28

Reading