8 of 95

Optimization: Note

the same principle in a higher dimension

9 of 95

Revisit: Least-Squares Solution

10 of 95

1. Solve using Linear Algebra

known as least squares

11 of 95

1. Solve using Linear Algebra

known as least squares
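As a minimal sketch of the linear-algebra route, the snippet below fits a line by solving the normal equations and checks the result against NumPy's least-squares routine. The synthetic data and the names `A`, `theta_normal`, and `theta_lstsq` are assumptions for illustration, not taken from the slides.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Normal equations: (A^T A) theta = A^T y
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Same problem solved by NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print("intercept, slope:", theta_normal)
```

Both calls minimize the same sum of squared residuals; `np.linalg.lstsq` is usually the safer choice when `A` is poorly conditioned.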

12 of 95

2. Solve using Gradient Descent
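The same least-squares problem can also be solved iteratively. The sketch below runs plain gradient descent on the mean squared error; the data, the step size `lr`, and the iteration count are assumed values chosen so that the iteration converges on this small example.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])

def gradient_descent(A, y, lr=0.01, n_iters=5000):
    """Minimize (1/m) * ||A theta - y||^2 with a fixed step size."""
    theta = np.zeros(A.shape[1])
    m = len(y)
    for _ in range(n_iters):
        grad = (2.0 / m) * A.T @ (A @ theta - y)  # gradient of the loss
        theta -= lr * grad                        # step against the gradient
    return theta

theta_gd = gradient_descent(A, y)
print("intercept, slope:", theta_gd)
```

A step size that is too large makes the iteration diverge, and one that is too small makes it slow, so `lr` generally needs tuning for other data.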

13 of 95

3. Solve using CVXPY Optimization

14 of 95

3. Solve using CVXPY Optimization

16 of 95

Regression with Outliers

17 of 95

Regression with Outliers

19 of 95

Think About What Makes Them Different

It is important to understand what makes them different

Sensitivity to outliers in L2 norm
Robustness of L1 norm

Squaring the residuals magnifies the effect of large errors, meaning that outliers have a significant influence on the final regression line
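The effect of outliers on the two norms can be checked numerically. The sketch below compares an L2 fit with an approximate L1 fit computed by iteratively reweighted least squares (IRLS). IRLS is a substitute chosen here to keep the example NumPy-only; a convex solver would be the more direct tool. The data, the outlier size, and the function name `l1_fit_irls` are assumptions for illustration.

```python
import numpy as np

# Points exactly on y = 2x + 1 (assumed data), with two gross outliers
x = np.linspace(0, 10, 21)
y = 2.0 * x + 1.0
y[-2:] += 50.0
A = np.column_stack([np.ones_like(x), x])

# L2 fit: ordinary least squares
theta_l2, *_ = np.linalg.lstsq(A, y, rcond=None)

def l1_fit_irls(A, y, n_iters=100, eps=1e-8):
    """Approximate the minimizer of sum |A theta - y| by reweighted least squares."""
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    for _ in range(n_iters):
        r = A @ theta - y
        w = 1.0 / np.maximum(np.abs(r), eps)   # large residual -> small weight
        Aw = A * w[:, None]                    # rows of A scaled by the weights
        theta = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    return theta

theta_l1 = l1_fit_irls(A, y)
print("L2 slope:", theta_l2[1])
print("L1 slope:", theta_l1[1])
```

On this data the L2 slope is pulled well away from 2 by the two outliers, while the L1 slope stays close to 2.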

20 of 95

Scikit-Learn

Machine Learning in Python
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
https://scikit-learn.org/stable/index.html#

21 of 95

Scikit-Learn

22 of 95

Scikit-Learn: Regression

23 of 95

Scikit-Learn: Regression
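A minimal scikit-learn sketch of linear regression is given below, assuming `scikit-learn` is installed. The data are synthetic and assumed for illustration; `LinearRegression`, `fit`, `predict`, `coef_`, and `intercept_` are part of the library's public API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50).reshape(-1, 1)   # scikit-learn expects a 2-D feature array
y = 2.0 * X.ravel() + 1.0 + 0.1 * rng.standard_normal(50)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
y_pred = model.predict(np.array([[6.0]]))  # prediction at a new input

print("slope:", slope, "intercept:", intercept)
```

Note that the intercept is fitted automatically, so no column of ones is added to `X`.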

24 of 95

Multivariate Linear Regression

25 of 95

Multivariate Linear Regression

26 of 95

Multivariate Linear Regression

27 of 95

Some features may have little impact on the target variable.
If a feature's coefficient is small relative to the others (with features on comparable scales), it contributes little to the prediction and can be removed to simplify the model.
A simpler model often generalizes better and is cheaper to compute.

Before feature selection

After feature selection
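The idea of dropping features with small coefficients can be sketched as follows. The three synthetic features are drawn on the same scale so that coefficient magnitudes are comparable; the data, the 10% threshold, and the helper name `fit` are assumptions for illustration.

```python
import numpy as np

# Three features on the same scale; the third barely affects the target (assumed data)
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * X[:, 2] + 0.1 * rng.standard_normal(n)

def fit(X, y):
    """Least-squares fit with an intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

# Before feature selection: all three features
theta_full = fit(X, y)
coefs = theta_full[1:]

# Keep only features whose coefficient is at least 10% of the largest one
keep = np.abs(coefs) > 0.1 * np.abs(coefs).max()

# After feature selection: refit using the retained features only
theta_reduced = fit(X[:, keep], y)
print("kept features:", np.flatnonzero(keep))
```

If the features were on different scales, they would need to be standardized first for this comparison to be meaningful.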

28 of 95

Nonlinear Regression

Explores how to construct a regression model capable of approximating non-linearly distributed data

29 of 95

Nonlinear Regression with Polynomial Features

Polynomial (here, a quadratic is used as an example)

30 of 95

Nonlinear Regression with Polynomial Features

Polynomial (here, a quadratic is used as an example)

31 of 95

Polynomial Regression
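A minimal sketch of polynomial regression with quadratic features follows. The model is nonlinear in `x` but linear in the coefficients, so ordinary least squares still applies. The data and the name `Phi` are assumptions for illustration.

```python
import numpy as np

# Synthetic quadratic data (assumed): y = 0.5 x^2 - x + 2 plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**2 - x + 2.0 + 0.2 * rng.standard_normal(x.size)

# Quadratic feature map: each row is [1, x, x^2]
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Linear least squares in the expanded feature space
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

print("coefficients [1, x, x^2]:", theta)
```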

32 of 95

Nonlinear Regression (Actually Linear Regression)

33 of 95

Linear Basis Function Model

34 of 95

Recap: Nonlinear Regression

Polynomial (here, a quadratic is used as an example)

Different perspective:

- Approximate a target function as a linear combination of basis functions

35 of 95

Construct Explicit Feature Vectors

Consider linear combinations of fixed nonlinear functions

Polynomial
Radial Basis Function (RBF)
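For reference, one common way to write this model and the two bases is shown below. The symbols (weights θ_i, basis functions φ_i, RBF centers μ_i, width σ) are assumed notation; the slides' own formulas are not present in this text and may use a different parameterization.

```latex
\hat{y}(x) = \sum_{i=1}^{d} \theta_i \, \phi_i(x)
\qquad
\text{Polynomial: } \phi_i(x) = x^{\,i-1}
\qquad
\text{RBF: } \phi_i(x) = \exp\!\left( -\frac{(x - \mu_i)^2}{2\sigma^2} \right)
```

The basis functions are fixed in advance; only the weights θ_i are fitted, which is why the problem remains linear.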

36 of 95

Polynomial Basis

1) Polynomial functions

37 of 95

Nonlinear Function with Polynomial Basis

38 of 95

RBF Basis

39 of 95

Nonlinear Function with RBF Basis
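A minimal sketch of fitting a nonlinear function with Gaussian RBF features is given below. The target function, the number of centers, the width `sigma`, and the helper name `rbf_features` are assumptions for illustration.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Gaussian bumps, one column per center; shape (len(x), len(centers))."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2.0 * sigma**2))

# Noisy samples of a sine wave (assumed data)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 80)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# Ten evenly spaced centers with a shared width
centers = np.linspace(0, 2 * np.pi, 10)
sigma = 1.0
Phi = rbf_features(x, centers, sigma)

# Linear least squares on the RBF features
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

print("MSE against the noise-free sine:", np.mean((y_hat - np.sin(x))**2))
```

The centers and width act as design choices: more centers or a narrower width give a more flexible fit.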

40 of 95

Nonlinear Regression with Linear Basis Function Models

Polynomial functions

RBF functions

41 of 95

Regression 2

42 of 95

Linear Regression: Advanced

Overfitting
Regularization (Ridge and Lasso)

43 of 95

Overfitting: Start with Linear Regression

44 of 95

Recap: Nonlinear Regression

Polynomial (here, a quadratic is used as an example)

45 of 95

Nonlinear Regression

10 input points with degree 9 (or 10)

46 of 95

Polynomial Fitting with Different Degrees

Low error on input data points, but high error nearby

Underfit
Good fit
Overfit

Important to find the right balance between model complexity and generalization
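The trade-off can be explored with a small experiment. The sketch below fits polynomials of degree 1, 3, and 9 to 10 noisy points and records the training RSS together with the error on a dense grid between the training points. The target function, the noise level, and the chosen degrees are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

# 10 noisy training points and a dense noise-free grid between them (assumed setup)
x_train = np.linspace(0, 1, 10)
y_train = target(x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0.02, 0.98, 200)
y_test = target(x_test)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_rss = np.sum((np.polyval(coeffs, x_train) - y_train)**2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test)**2)
    results[degree] = (train_rss, test_mse)
    print(f"degree {degree}: train RSS = {train_rss:.3e}, test MSE = {test_mse:.3e}")
```

The degree-9 polynomial passes through all 10 training points, so its training RSS is essentially zero; its error between the points is typically much larger, which is the overfitting pattern described above.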

47 of 95

Loss

Loss: Residual Sum of Squares (RSS)

decreases as the polynomial degree increases
a lower error does not necessarily indicate a better model

Minimizing the loss on the training data is often not the best choice

Low error on input data points, but high error nearby

48 of 95

Issue with Rich Representation

Low error on input data points, but high error nearby
Low error on training data, but high error on testing data

49 of 95

Overfitting

One of the most common problems data science professionals face

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data.

The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.

50 of 95

Signs of Overfitting

Large gap between training and test loss:

Training loss continues to decrease, but test loss increases.

51 of 95

Causes of Overfitting

Model Complexity

A model with too many parameters (e.g., deep neural networks) can overfit the training data.

Insufficient Training Data

If the training dataset is small, the model may fit every point precisely, capturing noise rather than general patterns.

Noisy Data

If the data has a lot of noise or irrelevant features, the model may try to learn this noise.

Too Many Training Iterations

Training for too long can lead to overfitting, where the model starts fitting noise instead of actual patterns.

52 of 95

Generalization Error

The difference between a machine learning model's performance on the training data and its performance on unseen data (test or validation data)

53 of 95

Regularization to Reduce Overfitting

54 of 95

Generalization Error

55 of 95

Representational Difficulties

56 of 95

With Fewer Basis Functions: Fewer RBF Centers

57 of 95

With Fewer Basis Functions: Fewer RBF Centers

Least-squares fits for different numbers of RBFs

58 of 95

Representational Difficulties

59 of 95

Regularization (Shrinkage Methods)

Often, overfitting is associated with very large estimated parameters
We want to balance

how well the function fits the data
the magnitude of the coefficients

multi-objective optimization
𝜆 is a tuning parameter
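One common way to write this balance is the penalized objective below. The notation (feature matrix Φ, weights θ, targets y) and the squared L2 penalty, which corresponds to ridge regression, are assumptions; the slide's own formula is not present in this text.

```latex
\min_{\theta} \;
\underbrace{\lVert \Phi\theta - y \rVert_2^2}_{\text{fit to the data}}
\; + \; \lambda \,
\underbrace{\lVert \theta \rVert_2^2}_{\text{size of the coefficients}},
\qquad \lambda \ge 0
```

With λ = 0 this reduces to ordinary least squares, and increasing λ shrinks the coefficients toward zero.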

60 of 95

Regularization (Shrinkage Methods)

61 of 95

RBF: Start from Rich Representation

62 of 95

RBF with Regularization

63 of 95

RBF with Regularization
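A minimal sketch of ridge-regularized RBF regression is given below, using the closed-form solution of the penalized least-squares problem. The data, the one-center-per-point design, the width, and the values of `lam` are assumptions for illustration.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Gaussian bumps, one column per center."""
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2.0 * sigma**2))

def ridge_fit(Phi, y, lam):
    """Closed form for min ||Phi theta - y||^2 + lam * ||theta||^2."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Rich representation: one RBF center per data point (assumed setup)
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 30)
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)
Phi = rbf_features(x, centers=x.copy(), sigma=0.5)

# Coefficient norm for increasing regularization strength
lams = [1e-6, 1e-2, 1.0]
norms = [np.linalg.norm(ridge_fit(Phi, y, lam)) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda = {lam:g}: ||theta|| = {nrm:.3f}")
```

The coefficient norm decreases as `lam` grows, which is the shrinkage effect that limits overfitting.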

65 of 95

How L2 Regularization Works

Effect on Weights:

L2 regularization forces the model to keep weights small.

Interpretation:

L2 regularization distributes the penalty uniformly across all weights, shrinking their magnitude.

Overfitting Reduction:

By limiting the model's ability to learn large weights, the model becomes less likely to overfit the noise in the training data.

67 of 95

RBF with LASSO

Approximated function looks similar to that of ridge regression

68 of 95

LASSO

Ridge

69 of 95

'Shrink' some coefficients exactly to 0

knock out certain features
the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero

Non-zero coefficients indicate the 'selected' features

LASSO

70 of 95

Sparsity and Feature Selection

The L1 penalty in LASSO encourages sparsity in the coefficient estimates, meaning it drives some coefficients to be exactly zero.

Useful for feature selection, as it allows the model to identify and retain only the most relevant predictors, simplifying the model and potentially improving interpretability.
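The sparsity effect can be reproduced with a small NumPy-only sketch. It solves the LASSO problem with ISTA (proximal gradient descent with soft-thresholding), which is a method chosen here for illustration and not necessarily the solver used in the slides. The objective scaling, the data, and the value of `lam` are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries within [-t, t] become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iters=5000):
    """Minimize 0.5 * ||A theta - y||^2 + lam * ||theta||_1 by proximal gradient."""
    L = np.linalg.norm(A, 2)**2          # Lipschitz constant of the gradient
    theta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ theta - y)     # gradient of the smooth part
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Eight features, only the first two matter (assumed data)
rng = np.random.default_rng(0)
n, d = 100, 8
A = rng.standard_normal((n, d))
theta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = A @ theta_true + 0.1 * rng.standard_normal(n)

theta_lasso = lasso_ista(A, y, lam=10.0)
selected = np.flatnonzero(theta_lasso)   # indices of non-zero coefficients
print("selected features:", selected)
```

The soft-thresholding step is what produces exact zeros; a ridge penalty would shrink the irrelevant coefficients but leave them non-zero.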

71 of 95

Regression with Selected Features

72 of 95

LASSO vs. Ridge

Equivalent optimization formulations for both Ridge and LASSO

73 of 95

LASSO vs. Ridge

Equivalent optimization formulations for both Ridge and LASSO
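The equivalence is usually stated as follows: for each penalty weight λ ≥ 0 there is a budget s ≥ 0 such that the penalized and the constrained problem share a solution. The notation (Φ, θ, y, s) is assumed, since the slide's formulas are not present in this text.

```latex
\text{Ridge:}\quad
\min_{\theta} \lVert \Phi\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2
\;\Longleftrightarrow\;
\min_{\theta} \lVert \Phi\theta - y \rVert_2^2
\quad \text{s.t.} \quad \lVert \theta \rVert_2^2 \le s

\text{LASSO:}\quad
\min_{\theta} \lVert \Phi\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1
\;\Longleftrightarrow\;
\min_{\theta} \lVert \Phi\theta - y \rVert_2^2
\quad \text{s.t.} \quad \lVert \theta \rVert_1 \le s
```

In the constrained view, the L1 ball has corners on the coordinate axes, which is the usual geometric explanation for why LASSO solutions tend to have coefficients that are exactly zero.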

74 of 95

Evaluation

Adding more features never increases the training loss (and usually decreases it)
How do we determine when an algorithm achieves “good” performance?

A better criterion:

Training set (e.g., 70 %)
Testing set (e.g., 30 %)

Performance on the testing set is called generalization performance
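A minimal sketch of a 70/30 split is shown below: the model is fitted on the training set only and then scored on both sets. The data, the cubic model, and the helper name `mse` are assumptions for illustration.

```python
import numpy as np

# Noisy samples of a sine wave (assumed data)
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

# Shuffle the indices, then use 70% for training and the remaining 30% for testing
idx = rng.permutation(n)
n_train = int(round(0.7 * n))
train_idx, test_idx = idx[:n_train], idx[n_train:]

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y)**2)

# Fit on the training set only; the test set is never seen during fitting
coeffs = np.polyfit(x[train_idx], y[train_idx], 3)
train_mse = mse(coeffs, x[train_idx], y[train_idx])
test_mse = mse(coeffs, x[test_idx], y[test_idx])

print(f"train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The test MSE is the number to compare across models, since it estimates performance on data the model has not seen.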

75 of 95

Regression 3

76 of 95

Linear Regression Examples

De-noising
Total Variation

77 of 95

De-noising Signal

78 of 95

Transform it to an Optimization Problem

Source:

Boyd & Vandenberghe's book "Convex Optimization"
http://cvxr.com/cvx/examples/ (Figures 6.8-6.10: Quadratic smoothing)
Week 4 of "Linear and Integer Programming" (University of Colorado, Coursera)

79 of 95

Transform it to an Optimization Problem

80 of 95

Least-Squares Problems

81 of 95

Coded in Python
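As a minimal sketch, quadratic smoothing can be written as a regularized least-squares problem with a closed-form solution. The formulation below (fidelity term plus a penalty on first differences), the test signal, and the value of `lam` are assumptions; the slides' own code is not present in this text.

```python
import numpy as np

# Smooth signal corrupted by noise (assumed data)
rng = np.random.default_rng(0)
n = 200
t = np.linspace(0, 1, n)
clean = np.sin(2 * np.pi * t)
noisy = clean + 0.2 * rng.standard_normal(n)

# First-difference matrix D of shape (n-1, n): (D x)_i = x_{i+1} - x_i
D = np.diff(np.eye(n), axis=0)

# minimize ||x - noisy||^2 + lam * ||D x||^2
# Setting the gradient to zero gives (I + lam * D^T D) x = noisy
lam = 20.0
x_smooth = np.linalg.solve(np.eye(n) + lam * D.T @ D, noisy)

print("MSE noisy   :", np.mean((noisy - clean)**2))
print("MSE smoothed:", np.mean((x_smooth - clean)**2))
```

Larger `lam` gives a smoother result but eventually flattens the signal itself, so it has to be tuned.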

82 of 95

Notes

While this process highlights the potential of regression for noise reduction, it is important to note that such tasks are typically achieved more effectively and efficiently using low-pass filters in either the time or frequency domain.

Nevertheless, this example serves to highlight the broader applicability of regression methods, demonstrating their potential to address various practical problems.

84 of 95

CVXPY Implementation

89 of 95

Signal with Sharp Transition + Noise

Chapter 6.3 from Boyd & Vandenberghe's book "Convex Optimization"

90 of 95

Quadratic smoothing removes the noise but also smooths out the sharp transitions in the signal, which is not what we want.
With this approach, the signal's sharp transitions cannot be preserved.

Any ideas?

92 of 95

Total Variation (TV) smoothing preserves the sharp transitions in the signal, which is what we want

Note how the TV reconstruction does a better job of preserving the sharp transitions in the signal while removing the noise.

93 of 95

Total Variation Image

Idea comes from http://www2.compute.dtu.dk/~pcha/mxTV/

94 of 95

Total Variation Image

Idea comes from http://www2.compute.dtu.dk/~pcha/mxTV/

95 of 95

Summary

These examples demonstrate how regression principles extend beyond conventional predictive models and can be applied effectively to complex tasks such as image denoising and signal reconstruction.
