LO 4.2.2.G
Learning Objective: Describe the advantage of the lasso over ridge regression.
Review:
- Unlike ridge regression, the lasso can perform variable selection.
- Ridge regression will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞).
- In the case of the lasso, the loss function is

  $$\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \,=\, \text{RSS} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,$$

  and its ℓ1 component, the penalty λ Σⱼ |βⱼ|, will force some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large (see the code sketch below).
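A minimal code sketch, not part of the assigned reading, makes the contrast concrete. It assumes scikit-learn and synthetic data; the penalty strength alpha plays the role of λ, though scikit-learn scales the penalty slightly differently for Lasso and Ridge, so the two alphas are not on an identical scale.

```python
# Sketch (assumes scikit-learn; data and alpha are illustrative choices):
# fit lasso and ridge with the same penalty strength and count how many
# coefficients each sets exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 candidate predictors, only 3 of which truly drive the response
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha plays the role of λ
ridge = Ridge(alpha=1.0).fit(X, y)

# The lasso typically zeros out the uninformative predictors;
# ridge shrinks them but leaves every coefficient nonzero.
print("lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0.0)))
print("ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0.0)))
```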
EXTRA

With reference to the lasso loss function and the coefficient-path figure described below:
- When λ = 0, the lasso simply gives the least-squares fit.
- When λ becomes sufficiently large, the lasso gives the null model, in which all coefficient estimates equal zero.
- In between these two extremes, depending on the value of λ, the lasso can produce a model involving any number of variables.
- The lasso can therefore generate a model involving only a subset of the p predictors: if q coefficient estimates are forced to zero, the fitted model involves the remaining p − q predictors (see the code sketch below).

Source: Assigned reading
[Figure: lasso coefficient estimates plotted against λ. Curves: Income (black), Limit (red), Rating (blue), Student (yellow)]
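To trace these three regimes numerically, here is a minimal sketch, again assuming scikit-learn and synthetic data rather than the data behind the figure; lasso_path computes the whole sequence of fits over a decreasing grid of λ values.

```python
# Sketch (assumes scikit-learn; data are illustrative): the number of
# selected predictors grows from 0 (null model, large λ) toward the full
# least-squares fit as λ shrinks toward 0.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# alphas come back in decreasing order; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

for a, c in zip(alphas, coefs.T):
    print(f"λ = {a:10.3f} -> {int(np.sum(c != 0.0))} nonzero coefficients")
```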

With reference to the ridge loss function, $\text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^{2}$, and the corresponding coefficient-path figure:
- When λ = 0, ridge regression simply gives the least-squares fit.
- When λ is extremely large, all of the ridge coefficient estimates are essentially zero; this corresponds to the null model that contains no predictors.
- In between these two extremes, increasing the value of λ tends to reduce the magnitudes of the coefficients, but it never results in the exclusion of any variable.
- Ridge regression will therefore always generate a model involving all p predictors (see the code sketch below).

Source: Assigned reading
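A minimal sketch of this behavior, under the same scikit-learn and synthetic-data assumptions as above: as λ grows, every ridge coefficient shrinks toward zero, yet all p of them remain in the model.

```python
# Sketch (assumes scikit-learn; the λ grid and data are illustrative):
# ridge shrinks coefficient magnitudes as λ grows but never produces
# an exact zero, so every predictor stays in the model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for lam in [0.01, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=lam).fit(X, y).coef_
    print(f"λ = {lam:8g}: max |β| = {np.max(np.abs(coef)):9.3f}, "
          f"nonzero = {int(np.sum(coef != 0.0))} of {coef.size}")
```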