1 of 28

Session 1:

Penalized Regression

Patrick Hall

Teaching Faculty, Department of Decision Sciences

George Washington University

2 of 28

Session I Agenda

  • Introduction
  • Linear Regression
  • Generalized Linear Models (GLM)
  • Penalized Linear Models
  • Logistic Regression
  • Python Code Example: L2 Gradient Descent on Credit Data
  • Reading

3 of 28

Supervised Learning

  • Regression is a form of supervised learning where the data contains known past outcomes for a numeric target variable, e.g., credit limit amounts.

  • Classification is the other major type of supervised learning. In classification, the data contains known past outcomes for a categorical target variable, e.g., credit lending yes/no decisions.

4 of 28

Linear Regression

5 of 28

Regression

  • Regression is the task of learning a target function f that maps each attribute set X into a continuous-valued output y.
  • The goal of regression is to find a target function that can fit the input data with minimum error.
  • A standard approach is to apply the method of least squares, which finds the model parameters (regression coefficients) that minimize the sum of squared errors (the residual sum of squares).

6 of 28

Regression: Ordinary Least-Squares (OLS) Method

Elements of Statistical Learning:

$y = f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

$\hat{y} = g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

7 of 28

Regression: Ordinary Least-Squares (OLS) Method

Elements of Statistical Learning:

$y = f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

$\hat{y} = g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

[Figure: fitted regression plane over the training points, annotating a training point $\left(x_1^{(i)}, x_2^{(i)}\right)$, its prediction $\hat{y}^{(i)} = g\left(x_1^{(i)}, x_2^{(i)}\right)$, its residual $\epsilon^{(i)}$, and the intercept $\beta_0$.]

8 of 28

OLS: Training

Error:

$J(\beta) = \frac{1}{2} \sum_{i=0}^{N-1} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$

Gradient:

$\nabla_{\beta} J(\beta) = \sum_{i=0}^{N-1} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}$

Gradient descent: step the coefficients in the $-\nabla$ direction until the error is minimized.

Image: Elements of Statistical Learning, Figure 3.2.

Normal Equation (closed-form solution):

$\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} y$

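To make the training loop concrete, here is a minimal NumPy sketch of gradient descent on the squared-error objective above, checked against the normal equation. The simulated data, learning rate, and iteration count are illustrative assumptions, not part of the slide.

```python
import numpy as np

# Minimal sketch: gradient descent on the squared-error objective,
# checked against the closed-form normal equation. Data are simulated.
rng = np.random.default_rng(0)
N, P = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])  # intercept + 2 inputs
beta_true = np.array([1.0, 2.0, -0.5])                          # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)               # simulated target

beta = np.zeros(P)
learning_rate = 0.1
for _ in range(2000):
    residuals = X @ beta - y          # (y_hat - y) for each observation i
    gradient = X.T @ residuals / N    # gradient of the (mean) squared error
    beta -= learning_rate * gradient  # step in the -gradient direction

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equation: (X'X)^-1 X'y
print(beta, beta_normal)                         # the two solutions should agree closely
```
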
9 of 28

OLS: Interpretation

provider_charge ~ medicare_payment + num_service 

Analysis of Variance

  Source            DF     Sum of Squares   Mean Square   F Value   Pr > F
  Model             2      3.85E+11         1.92E+11      1148.9    <.0001
  Error             3334   5.58E+11         167376011
  Corrected Total   3336   9.43E+11

  Root MSE         12937      R-Square   0.408
  Dependent Mean   24721      Adj R-Sq   0.4076
  Coeff Var        52.33355

Parameter Estimates

  Variable                   Label                      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
  Intercept                  Intercept                  1    -1219.43             598.38           -2.04     0.0416     0
  AVE_ave_medicare_payment   Average Medicare Payment   1    3.83                 0.08             47.88     <.0001     1.02
  AVE_num_service            Number of Services         1    -5.84                1.17             -4.96     <.0001     1.02

10 of 28

OLS: Interpretation

provider_charge ~ medicare_payment + num_service 

(Analysis of Variance and Parameter Estimates output repeated from the previous slide, with interpretation notes below.)

  • SST = SSM + SSE.
  • F = MSM/MSE, the scaled ratio of the model variance to the error/residual variance. Interpreted here as “rejecting the null hypothesis that all regression parameters equal 0,” i.e., the regression model is valid.
  • R² = SSM/SST, interpreted as “the proportion of variance in the response variable explained by the model.”
  • Adjusted R²: R² adjusted for the number of input variables in the model.
  • VIF > 10 is considered an indicator of possible multicollinearity problems.
  • Standard error of the coefficient (the standard deviation of the coefficient estimate) should be much smaller than the coefficient itself.
  • t-test for the coefficient, interpreted here as “rejecting the null hypothesis that this coefficient equals 0,” i.e., this variable is “significant.”
  • Estimated parameter for the input, interpreted here as “holding all other inputs constant, for a one unit increase in average Medicare payment, average provider charge will increase by 3.83 units on average.”

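A similar readout can be produced in Python. The sketch below is a hedged example using statsmodels; the file name, input columns, and target column are assumptions standing in for the provider-charge data on this slide.

```python
import pandas as pd
import statsmodels.api as sm

# Hedged sketch: reproduce a similar ANOVA / parameter-estimate readout with statsmodels.
# The file name and column names below are assumptions, not the course data.
df = pd.read_csv("provider_charges.csv")
X = sm.add_constant(df[["AVE_ave_medicare_payment", "AVE_num_service"]])
y = df["AVE_provider_charge"]  # assumed target column name

ols = sm.OLS(y, X).fit()
print(ols.summary())  # coefficients, standard errors, t-tests, R-squared, F statistic
```
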
11 of 28

OLS: Requirements

Requirement → If broken …

  • Linear relationship between inputs and targets; normal y, normal errors → Inappropriate application/unreliable results; use a machine learning technique or a GLM.
  • N > p → Underspecified/unreliable results; use LASSO or elastic net penalized regression.
  • No strong multicollinearity → Ill-conditioned/unstable/unreliable results; use ridge (L2/Tikhonov) or elastic net penalized regression.
  • No influential outliers → Biased predictions, parameters, and statistical tests; use robust methods (e.g., IRLS, Huber loss) or investigate/remove outliers.
  • Constant variance/no heteroskedasticity → Lessened predictive accuracy and invalidated statistical tests; use a GLM in some cases.
  • Limited correlation between input rows (no autocorrelation) → Invalidated statistical tests; use time-series methods or a machine learning technique.

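As one way to check the multicollinearity requirement, the sketch below computes variance inflation factors with statsmodels; the simulated inputs are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Screen inputs for multicollinearity: VIF > 10 is a common warning sign.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 0.95 * X["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1

design = sm.add_constant(X)  # include the intercept, as in the regression itself
vif = pd.Series(
    [variance_inflation_factor(design.values, j) for j in range(design.shape[1])],
    index=design.columns,
)
print(vif)  # inflated values for x1/x3 flag possible multicollinearity
```
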
12 of 28

Contemporary Approaches: Generalized Linear Models (GLM)

 

13 of 28

Contemporary Approaches: Generalized Linear Models (GLM)

 

$g\left( E(Y) \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{P-1} x_{P-1}$

  • Family/distribution defines the mean and variance of Y.
  • A nonlinear link function $g$ connects the linear component to $E(Y)$.
  • Linear component: $\beta_0 + \beta_1 x_1 + \dots + \beta_{P-1} x_{P-1}$.
  • The family/distribution allows for nonconstant variance.

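A minimal GLM sketch with statsmodels is shown below; the Poisson family, log link, and simulated data are assumptions chosen for illustration, not the course's credit data.

```python
import numpy as np
import statsmodels.api as sm

# Minimal GLM sketch: Poisson family with its default log link, on simulated data.
rng = np.random.default_rng(0)
N = 500
X = sm.add_constant(rng.normal(size=(N, 2)))  # linear component inputs + intercept
mu = np.exp(X @ np.array([0.3, 0.7, -0.2]))   # inverse link applied to X @ beta
y = rng.poisson(mu)                           # family defines the mean and variance of Y

glm = sm.GLM(y, X, family=sm.families.Poisson())
print(glm.fit().summary())
```
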
14 of 28

Contemporary Approaches:

Iteratively Reweighted Least Squares

 

“Inner Loop”: solve a weighted least-squares problem for the coefficients $\beta_j$, $j = 0, \dots, P-1$, across the observations $i = 0, \dots, N-1$.

“Outer Loop”: recompute the observation weights and working response from the current fit, then repeat the inner loop until the coefficients converge.

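The sketch below illustrates the inner/outer loop idea for one common case, a logistic-regression GLM; the binomial family, simulated data, and convergence tolerance are assumptions made for illustration.

```python
import numpy as np

# Minimal IRLS sketch for a logistic-regression GLM (binomial family, logit link).
rng = np.random.default_rng(0)
N, P = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta = np.zeros(P)
for _ in range(25):                              # outer loop
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
    w = p * (1.0 - p)                            # observation weights
    z = X @ beta + (y - p) / w                   # working response
    # inner loop: weighted least-squares solve for the P coefficients
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:   # stop when coefficients converge
        beta = beta_new
        break
    beta = beta_new

print(beta)  # should be close to true_beta
```
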
15 of 28

Penalized Linear Models

16 of 28

Contemporary Approaches: Penalized Regression

17 of 28

Contemporary Approaches: Elastic Net

 

$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=0}^{N-1} \left( y^{(i)} - \beta_0 - \sum_{j=0}^{P-1} \beta_j x_j^{(i)} \right)^2 + \lambda_2 \sum_{j=0}^{P-1} \beta_j^2 + \lambda_1 \sum_{j=0}^{P-1} \left| \beta_j \right|$

18 of 28

Contemporary Approaches: Elastic Net

 

$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=0}^{N-1} \left( y^{(i)} - \beta_0 - \sum_{j=0}^{P-1} \beta_j x_j^{(i)} \right)^2 + \lambda_2 \sum_{j=0}^{P-1} \beta_j^2 + \lambda_1 \sum_{j=0}^{P-1} \left| \beta_j \right|$

Least squares minimization – finds β’s for linear relationship.

 

L2/Ridge/Tikhonov penalty – helps address multicollinearity.

L1/LASSO penalty – for variable selection.

 

[Figure: values of the coefficients as the optimization proceeds (coefficient paths).]

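As a usage sketch, scikit-learn's ElasticNet fits this kind of penalized model; note that its alpha/l1_ratio parameterization differs from the λ1/λ2 notation above, and the simulated data below is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Minimal elastic net sketch: alpha scales the total penalty, l1_ratio mixes L1 vs. L2.
rng = np.random.default_rng(0)
N, P = 200, 10
X = rng.normal(size=(N, P))
beta_true = np.zeros(P)
beta_true[:3] = [2.0, -1.5, 0.7]                 # only 3 informative inputs
y = X @ beta_true + rng.normal(scale=0.5, size=N)

X_std = StandardScaler().fit_transform(X)        # penalties assume comparable input scales
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_std, y)

print(enet.intercept_, enet.coef_)               # the L1 part zeroes out uninformative coefficients
```
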
19 of 28

Logistic Regression

20 of 28

Classification: Overview

ESL, Figure 4.1 (pg 129)

  • Consider the simulated Default dataset.
  • The default status of a credit card payment is modeled from annual income and monthly credit card balance.
  • The default rate is about 3% (orange = defaulted; blue = otherwise).
  • The task is to predict the Yes or No payment status.

Adapted from Introduction to Statistical Learning Methods with R

  • In linear regression models, the prediction is numeric
  • For classification, the prediction is categorical:
    • yes/no
    • low/medium/high

Many possible classification techniques

    • Logistic Regression
    • Trees – Random Forest and Gradient Boosting
    • Neural Networks

21 of 28

Issues with Linear Regression

  • Consider a model predicting a patient’s medical condition from the individual’s symptoms, where there are 3 possible diagnoses (outcomes), using a numeric coding:
      • 1 if stroke
      • 2 if drug overdose
      • 3 if epileptic seizure
  • This numeric coding in a linear regression model implies an ordering of the outcomes and assumes that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
  • Other orderings of the encoding would be just as reasonable, yet each would imply a different fitted model.
  • Linear regression often yields estimates outside the [0, 1] interval, making them impossible to interpret as probability estimates.
  • It also violates the assumptions on the distribution of the errors.

Adapted from Introduction to Statistical Learning Methods with R

22 of 28

Logistic Regression

  • Consider the Default data set, where the response falls into one of two categories, Yes or No
  • Rather than modeling this response directly, logistic regression models the probability that the response belongs to each category (level)

Adapted from Introduction to Statistical Learning Methods with R

23 of 28

Logistic Regression

  • We need a function for the probability that outputs values between 0 and 1 for all values of the input variables.
  • Many functions meet these criteria – in particular, the logistic function:
    • S-Shaped curve
    • Maximum likelihood method to fit the model

Adapted from Introduction to Statistical Learning Methods with R

[Figure: fitted logistic (S-shaped) curve for the input PAY_0.]

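For reference, the logistic function mentioned above can be written in its standard form for a single input X:

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$
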
24 of 28

“Log of Odds” or Logit

  • After some mathematical manipulation, we find that the odds are: $\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$

  • By taking the logarithm of both sides, we arrive at the log odds, or logit: $\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X$

Adapted from Introduction to Statistical Learning Methods with R

25 of 28

Logistic Regression: Training

Objective function (negative log likelihood, i.e., log loss):

$J(\beta) = -\sum_{i=0}^{N-1} \left[ y^{(i)} \log p^{(i)} + \left( 1 - y^{(i)} \right) \log\left( 1 - p^{(i)} \right) \right]$, where $p^{(i)} = \frac{1}{1 + e^{-x^{(i)} \beta}}$

Gradient:

$\nabla_{\beta} J(\beta) = \sum_{i=0}^{N-1} \left( p^{(i)} - y^{(i)} \right) x^{(i)}$

Gradient descent: step the coefficients in the $-\nabla$ direction until the error is minimized.

Maximum likelihood estimation: minimizing this objective is equivalent to choosing the $\beta$ under which the observed outcomes $y^{(i)}$ are most probable.

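In the spirit of the session's Python example (L2 gradient descent on credit data), here is a minimal sketch of L2-penalized logistic regression trained by gradient descent; the simulated inputs, learning rate, and penalty strength are assumptions standing in for the credit data.

```python
import numpy as np

# Minimal sketch: L2-penalized logistic regression trained by gradient descent.
rng = np.random.default_rng(0)
N, P = 1000, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])  # intercept + inputs
true_beta = np.array([-1.0, 0.8, -0.6, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))       # simulated 0/1 outcomes

beta = np.zeros(P)
lr, lam = 0.1, 0.01                        # learning rate and L2 penalty strength
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # predicted probabilities
    grad = X.T @ (p - y) / N               # gradient of the mean log loss
    grad[1:] += lam * beta[1:]             # L2 penalty gradient (intercept unpenalized)
    beta -= lr * grad

print(beta)  # should be close to true_beta
```
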
26 of 28

Logistic Regression: Interpretation

 

Probability to Log Odds:

For categorical inputs:

  • p = event rate for that level
  • odds = p / (1 − p)
  • odds ratio = odds(level) / odds(reference level)
  • log odds ratio = ln(odds ratio)

For interval inputs:

  • p = change in event rate for a one unit increase; this is not constant
  • odds ratio = odds(level + 1) / odds(level); this is constant
  • log odds = ln(odds)

27 of 28

Logistic Regression: Interpretation

 

Log odds = -0.54

Odds = e^{-0.54} = 0.58

“Holding all other variables constant, for a one unit increase in age, the odds of the event occurring change by a factor of 0.58 on average.”

Probability to Log Odds:

For categorical inputs:

  • p = event rate for that level
  • odds = p / (1 − p)
  • odds ratio = odds(level) / odds(reference level)
  • log odds ratio = ln(odds ratio)

Log odds ratio against reference level = 1.2

Odds ratio against reference level = e^{1.2} = 3.32

Probability/event rate in training data = 3.32 / (1 + 3.32) ≈ 0.77

“Holding all other variables constant, a person being male changes the odds of the event occurring by a factor of 3.32 over the reference level on average.”

For interval inputs:

  • p = change in event rate for a one unit increase; this is not constant
  • odds ratio = odds(level + 1) / odds(level); this is constant
  • log odds = ln(odds)

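The interpretation arithmetic above can be reproduced in a few lines of Python; the coefficient values are the ones quoted on this slide.

```python
import numpy as np

# Reproduce the interpretation arithmetic from fitted logistic coefficients.
coef_age = -0.54                      # log odds change per one-unit increase in age
print(np.exp(coef_age))               # ~0.58: odds change by this factor per unit of age

coef_male = 1.2                       # log odds ratio vs. the reference level
odds_ratio = np.exp(coef_male)        # ~3.32
print(odds_ratio / (1 + odds_ratio))  # ~0.77: implied probability/event rate
```
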
28 of 28

Reading