1 of 95

Regression 1

2 of 95

Assumption: Linear Model

4 of 95

Linear Regression

5 of 95

Linear Regression as Optimization

6 of 95

Re-cast Problem as Least Squares

7 of 95

Optimization

8 of 95

Optimization: Note

  • The same principle applies in higher dimensions

9 of 95

Revisit: Least-Square Solution

10 of 95

1. Solve using Linear Algebra

  • Known as least squares
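
The linear-algebra route can be sketched as follows. This is a minimal illustration, not the slide's original code: the synthetic data, the names `A`, `theta_normal`, and `theta_lstsq`, and the noise level are all assumptions. It solves the normal equations (A^T A) θ = A^T y and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (assumed for illustration): y is roughly 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

# Design matrix with a column of ones for the intercept.
A = np.column_stack([np.ones_like(x), x])

# Normal equations: (A^T A) theta = A^T y.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Alternative that avoids forming A^T A explicitly (better conditioned).
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print("normal equations:", theta_normal)
print("np.linalg.lstsq :", theta_lstsq)
```

Both calls should return nearly identical [intercept, slope] estimates; `lstsq` is usually the safer choice when the columns of A are close to linearly dependent.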

12 of 95

2. Solve using Gradient Descent
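
A minimal gradient-descent sketch for the same least-squares problem, assuming a mean-squared-error loss. The step size `alpha`, the iteration count, and the synthetic data are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)
A = np.column_stack([np.ones_like(x), x])

theta = np.zeros(2)   # start from the origin
alpha = 0.05          # step size (assumed; must be small enough to converge)
n_iter = 5000         # number of iterations (assumed)

for _ in range(n_iter):
    # Gradient of the mean squared error (1/m) * ||A theta - y||^2.
    grad = 2.0 * A.T @ (A @ theta - y) / len(y)
    theta = theta - alpha * grad

print("gradient descent:", theta)
```

With a sufficiently small step size the iterates approach the least-squares solution; a step size that is too large makes the iteration diverge.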

13 of 95

3. Solve using CVXPY Optimization

16 of 95

Regression with Outliers

19 of 95

Think About What Makes Them Different

  • It is important to understand why the L1 and L2 fits differ
    • The L2 norm is sensitive to outliers
    • The L1 norm is robust to outliers

  • Squaring the residuals magnifies the effect of large errors, so outliers have a significant influence on the final regression line. The L1 norm penalizes residuals only linearly, so a few large errors pull the fit far less.
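
The difference can be checked numerically. The sketch below is an illustration under stated assumptions: the data and the three artificial outliers are made up, and the L1 fit is solved as a linear program with `scipy.optimize.linprog`, which is a swapped-in technique rather than the CVXPY formulation the slides appear to use.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)
y[-3:] += 30.0   # three artificial outliers (assumed for illustration)

A = np.column_stack([np.ones_like(x), x])
m, n = A.shape

# L2 fit: ordinary least squares.
theta_l2, *_ = np.linalg.lstsq(A, y, rcond=None)

# L1 fit as a linear program:
#   minimize sum(t)  subject to  -t <= A theta - y <= t
# Decision vector z = [theta (n entries), t (m entries)].
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * n + [(0, None)] * m
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
theta_l1 = res.x[:n]

print("L2 slope:", theta_l2[1])
print("L1 slope:", theta_l1[1])
```

On this data the L1 slope should stay close to the true value of 2, while the L2 slope is pulled noticeably toward the outliers.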

20 of 95

Scikit-Learn

  • Machine Learning in Python
  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license
  • https://scikit-learn.org/stable/index.html#

22 of 95

Scikit-Learn: Regression
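
A minimal sketch of linear regression with scikit-learn, assuming the `scikit-learn` package is installed. The synthetic data is an assumption; `LinearRegression`, `fit`, `predict`, `coef_`, and `intercept_` are part of the library's public API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features).
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
y_pred = model.predict(np.array([[6.0]]))   # prediction at a new input

print("slope:", slope, "intercept:", intercept, "prediction at x=6:", y_pred[0])
```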

24 of 95

Multivariate Linear Regression

27 of 95

 

  • Some features may have little impact on the target variable.
  • If a feature’s coefficient is small relative to the others (with the features on comparable scales), it contributes little to the prediction and can be removed to simplify the model.
  • A simpler model is cheaper to compute and often generalizes better.

  • Before feature selection

  • After feature selection

28 of 95

Nonlinear Regression

  • Goal: construct a regression model that can approximate data with a nonlinear trend

29 of 95

Nonlinear Regression with Polynomial Features

  • Polynomial features (a quadratic is used here as an example)
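
A minimal sketch of the quadratic case, assuming synthetic data generated from a known quadratic. Each input x is mapped to the feature vector [1, x, x^2]; the model stays linear in the parameters, so ordinary least squares still applies.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-2.0, 2.0, 60)
# Assumed ground truth: y = 1 - 0.5 x + 1.5 x^2, plus small noise.
y = 1.0 - 0.5 * x + 1.5 * x**2 + 0.1 * rng.standard_normal(x.size)

# Feature map: each input x becomes the row [1, x, x^2].
A = np.column_stack([np.ones_like(x), x, x**2])

# Linear least squares on the nonlinear features.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ theta

print("estimated coefficients:", theta)
```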

31 of 95

Polynomial Regression

32 of 95

Nonlinear Regression (Actually Linear Regression)

33 of 95

Linear Basis Function Model

34 of 95

Recap: Nonlinear Regression

  • Polynomial features (a quadratic is used here as an example)

  • Different perspective: approximate the target function as a linear combination of basis functions

35 of 95

Construct Explicit Feature Vectors

  • Consider linear combinations of fixed nonlinear functions
    • Polynomial
    • Radial Basis Function (RBF)
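
A sketch of an explicit RBF feature matrix, assuming Gaussian basis functions exp(-(x - μ_j)^2 / (2σ^2)). The helper `rbf_features`, the number of centers, and the bandwidth `sigma` are hypothetical choices made for illustration.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Matrix whose (i, j) entry is exp(-(x_i - mu_j)^2 / (2 sigma^2))."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff**2 / (2.0 * sigma**2))

rng = np.random.default_rng(6)
x = np.linspace(0.0, 2.0 * np.pi, 80)
y = np.sin(x) + 0.05 * rng.standard_normal(x.size)   # assumed target: a noisy sine

centers = np.linspace(0.0, 2.0 * np.pi, 10)   # assumed: 10 evenly spaced centers
sigma = 1.0                                   # assumed bandwidth

# The model is still linear in theta, so least squares applies directly.
Phi = rbf_features(x, centers, sigma)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ theta

rmse = np.sqrt(np.mean((y_hat - np.sin(x)) ** 2))
print("RMSE against the noise-free sine:", rmse)
```

Swapping `rbf_features` for a polynomial feature map changes only how `Phi` is built; the fitting step is identical.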

36 of 95

Polynomial Basis

1) Polynomial functions

37 of 95

Nonlinear Function with Polynomial Basis

38 of 95

RBF Basis

39 of 95

Nonlinear Function with RBF Basis

40 of 95

Nonlinear Regression with Linear Basis Function Models

  • Polynomial functions
  • RBF functions

41 of 95

Regression 2

42 of 95

Linear Regression: Advanced

  • Overfitting
  • Regularization (Ridge and Lasso)

43 of 95

Overfitting: Start with Linear Regression

44 of 95

Recap: Nonlinear Regression

  • Polynomial features (a quadratic is used here as an example)

45 of 95

Nonlinear Regression

  • 10 input points fitted with a polynomial of degree 9 (or 10)

46 of 95

Polynomial Fitting with Different Degrees

  • Underfit
  • Good fit
  • Overfit: low error on the input data points, but high error nearby

  • It is important to find the right balance between model complexity and generalization

47 of 95

Loss

  • Loss: Residual Sum of Squares (RSS)
    • The training RSS never increases, and usually decreases, as the polynomial degree increases
    • A lower training error does not necessarily indicate a better model

  • Minimizing the loss on the training data is often not the best criterion: the fit can have low error on the input data points but high error nearby
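
This behaviour can be reproduced with a small experiment. The sketch below assumes a sine target, 10 noisy training points, and evaluation points placed between the training inputs; all of these are illustrative choices. It fits polynomials of degree 1, 3, and 9 and reports the RSS on both sets.

```python
import numpy as np

rng = np.random.default_rng(7)

def target(x):
    return np.sin(np.pi * x)   # assumed ground-truth function

x_train = np.linspace(-1.0, 1.0, 10)
y_train = target(x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)   # points between the training inputs
y_test = target(x_test)

train_rss = {}
test_rss = {}
for degree in (1, 3, 9):
    # Polynomial features [1, x, ..., x^degree].
    A_train = np.vander(x_train, degree + 1, increasing=True)
    A_test = np.vander(x_test, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_rss[degree] = float(np.sum((A_train @ theta - y_train) ** 2))
    test_rss[degree] = float(np.sum((A_test @ theta - y_test) ** 2))
    print(f"degree {degree}: train RSS = {train_rss[degree]:.4f}, "
          f"test RSS = {test_rss[degree]:.4f}")
```

The degree-9 polynomial interpolates all 10 training points, so its training RSS is essentially zero; whether its error between the points is large depends on the noise realization, which is why the held-out RSS is the number to watch.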

48 of 95

Issue with Rich Representation

  • Low error on input data points, but high error nearby
  • Low error on training data, but high error on testing data

49 of 95

Overfitting

  • One of the most common problems that data science professionals face

  • Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen or test data.

  • The model essentially "memorizes" the noise and details in the training set rather than learning the general patterns that can be applied to new data.

50 of 95

Signs of Overfitting

  • Large gap between training loss and test loss

  • Training loss continues to decrease while test loss increases

51 of 95

Causes of Overfitting

  • Model Complexity
    • A model with too many parameters (e.g., deep neural networks) can overfit the training data.

  • Insufficient Training Data
    • If the training dataset is small, the model may fit every point precisely, capturing noise rather than general patterns.

  • Noisy Data
    • If the data has a lot of noise or irrelevant features, the model may try to learn this noise.

  • Too Many Training Iterations
    • Training for too long can lead to overfitting, where the model starts fitting noise instead of actual patterns.

52 of 95

Generalization Error

  • The difference between a machine learning model's performance on the training data and its performance on unseen data (test or validation data); this difference is often called the generalization gap

53 of 95

Regularization to Reduce Overfitting

54 of 95

Generalization Error

55 of 95

Representational Difficulties

56 of 95

With Fewer Basis Functions: Fewer RBF Centers

  • Least-squares fits for different numbers of RBFs

58 of 95

Representational Difficulties

59 of 95

Regularization (Shrinkage Methods)

  • Overfitting is often associated with very large estimated parameters
  • We want to balance
    • how well the function fits the data
    • the magnitude of the coefficients

  • This is a multi-objective optimization problem
  • 𝜆 is a tuning parameter that sets the trade-off between the two objectives

60 of 95

Regularization (Shrinkage Methods)

61 of 95

RBF: Start from Rich Representation

62 of 95

RBF with Regularization

65 of 95

How L2 Regularization Works

  • Effect on Weights:
    • L2 regularization penalizes the sum of squared weights, which pushes the model to keep weights small.
  • Interpretation:
    • The penalty acts on every weight and shrinks all of them toward zero; large weights are penalized most, and weights rarely become exactly zero.
  • Overfitting Reduction:
    • By limiting the model's ability to learn large weights, the model becomes less likely to fit the noise in the training data.
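
The shrinking effect can be seen with the closed-form ridge solution θ = (A^T A + λI)^{-1} A^T y. The sketch below assumes degree-9 polynomial features on synthetic data; the helper `ridge` and the λ values are illustrative, and for simplicity the intercept is penalized along with the other weights.

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)
A = np.vander(x, 10, increasing=True)   # degree-9 polynomial features

def ridge(A, y, lam):
    """Minimize ||A theta - y||^2 + lam * ||theta||^2 in closed form."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

lambdas = [1e-3, 1e-1, 1e1]
norms = [float(np.linalg.norm(ridge(A, y, lam))) for lam in lambdas]

for lam, nrm in zip(lambdas, norms):
    print(f"lambda = {lam:g}: ||theta||_2 = {nrm:.3f}")
```

The norm of the weight vector decreases as λ grows, which is the shrinkage described above.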

67 of 95

RBF with LASSO

  • Approximated function looks similar to that of ridge regression

68 of 95

 

  • Figure panels: LASSO, Ridge

69 of 95

 

  • LASSO 'shrinks' some coefficients exactly to 0
    • this knocks out certain features
    • the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero

  • Non-zero coefficients indicate the 'selected' features

70 of 95

Sparsity and Feature Selection

  • The L1 penalty in LASSO encourages sparsity in the coefficient estimates, meaning it drives some coefficients to be exactly zero.

  • Useful for feature selection, as it allows the model to identify and retain only the most relevant predictors, simplifying the model and potentially improving interpretability.
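
The sparsity effect can be illustrated with a small solver. The sketch below uses proximal gradient descent (ISTA) in plain NumPy, a swapped-in technique chosen to keep the example dependency-free; it is not the solver used in the slides. The data, the true coefficients, and the value of `lam` are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t; entries within [-t, t] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=2000):
    """Minimize ||A theta - y||^2 + lam * ||theta||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ theta - y)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(9)
A = rng.standard_normal((100, 8))
# Assumed ground truth: only features 0 and 3 matter.
theta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = A @ theta_true + 0.1 * rng.standard_normal(100)

theta_lasso = lasso_ista(A, y, lam=20.0)
selected = np.flatnonzero(theta_lasso)   # indices of non-zero coefficients

print("LASSO coefficients:", np.round(theta_lasso, 3))
print("selected features :", selected)
```

The soft-thresholding step is what produces exact zeros; the surviving coefficients are slightly shrunk relative to the true values, which is the usual price of the L1 penalty.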

71 of 95

Regression with Selected Features

72 of 95

LASSO vs. Ridge

  • Equivalent optimization formulations for both Ridge and LASSO
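
One standard way to write the two pairs of formulations is shown below. The notation is assumed, since the slide's own equations are not available here: A is the feature matrix, y the targets, θ the coefficients, λ ≥ 0 the penalty weight, and s ≥ 0 the constraint budget. For each λ there is a corresponding s that gives the same solution.

```latex
% Penalized form
\text{Ridge:}\quad \min_{\theta}\; \lVert A\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2
\qquad
\text{LASSO:}\quad \min_{\theta}\; \lVert A\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_1

% Equivalent constrained form
\text{Ridge:}\quad \min_{\theta}\; \lVert A\theta - y \rVert_2^2 \;\;\text{s.t.}\;\; \lVert \theta \rVert_2^2 \le s
\qquad
\text{LASSO:}\quad \min_{\theta}\; \lVert A\theta - y \rVert_2^2 \;\;\text{s.t.}\;\; \lVert \theta \rVert_1 \le s
```

In the constrained view, the L1 feasible set has corners on the coordinate axes, which is the usual geometric explanation for why LASSO solutions tend to have exact zeros while ridge solutions do not.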

74 of 95

Evaluation

  • Adding more features never increases the training loss, and usually decreases it
  • How do we determine when an algorithm achieves “good” performance?

  • A better criterion: split the data and evaluate on data not used for fitting
    • Training set (e.g., 70 %)
    • Testing set (e.g., 30 %)

  • Performance on the testing set is called generalization performance
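
A minimal sketch of a 70/30 split, assuming synthetic data and polynomial models of a few degrees. The helper `mse` and the chosen degrees are hypothetical; the point is that the model is fitted on the training indices only and scored on both sets.

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1.0, 1.0, 100)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)   # assumed data

# Shuffle the indices, then hold out 30 % of the data for testing.
idx = rng.permutation(x.size)
n_train = (7 * x.size) // 10
train_idx, test_idx = idx[:n_train], idx[n_train:]

def mse(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    A_train = np.vander(x[train_idx], degree + 1, increasing=True)
    A_test = np.vander(x[test_idx], degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    train = np.mean((A_train @ theta - y[train_idx]) ** 2)
    test = np.mean((A_test @ theta - y[test_idx]) ** 2)
    return float(train), float(test)

for degree in (1, 3, 12):
    train_mse, test_mse = mse(degree)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Model selection should be based on the test (or validation) column, since the training column keeps improving as features are added.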

75 of 95

Regression 3

76 of 95

Linear Regression Examples

  • De-noising
  • Total Variation

77 of 95

De-noising Signal

78 of 95

Transform it to an Optimization Problem

  • Source:
    • Boyd & Vandenberghe, "Convex Optimization"
    • http://cvxr.com/cvx/examples/ (Figures 6.8-6.10: Quadratic smoothing)
    • Week 4 of "Linear and Integer Programming" (Coursera, University of Colorado)

80 of 95

Least-Square Problems

81 of 95

Coded in Python
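
The original code for this slide is not available here, so the following is only a sketch of quadratic smoothing under stated assumptions: a sine signal with Gaussian noise, a first-difference operator `D`, and a smoothing weight `mu` picked by hand. Minimizing ||x - x_noisy||^2 + μ||Dx||^2 is a least-squares problem whose solution satisfies (I + μ D^T D) x = x_noisy.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
t = np.linspace(0.0, 1.0, n)
x_true = np.sin(2.0 * np.pi * t)                  # assumed clean signal
x_noisy = x_true + 0.2 * rng.standard_normal(n)   # assumed noise level

# First-difference operator of shape (n-1, n): (D x)_i = x_{i+1} - x_i.
D = np.diff(np.eye(n), axis=0)

mu = 50.0   # smoothing weight (assumed; larger values give a smoother result)

# Setting the gradient of the objective to zero gives a linear system.
x_smooth = np.linalg.solve(np.eye(n) + mu * D.T @ D, x_noisy)

mse_noisy = float(np.mean((x_noisy - x_true) ** 2))
mse_smooth = float(np.mean((x_smooth - x_true) ** 2))
print("MSE before smoothing:", mse_noisy)
print("MSE after smoothing :", mse_smooth)
```

The weight μ trades fidelity to the measurements against smoothness; very large values flatten the signal itself, not just the noise.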

82 of 95

Notes

  • While this process highlights the potential of regression for noise reduction, it is important to note that such tasks are typically achieved more effectively and efficiently using low-pass filters in either the time or frequency domain.

  • Nevertheless, this example serves to highlight the broader applicability of regression methods, demonstrating their potential to address various practical problems.

84 of 95

CVXPY Implementation

89 of 95

Signal with Sharp Transition + Noise

  • Source: Chapter 6.3 of Boyd & Vandenberghe, "Convex Optimization"

90 of 95

 

  • Quadratic smoothing smooths out both the noise and the sharp transitions in the signal, which is not what we want
  • It cannot preserve the signal’s sharp transitions.

  • Any ideas?

92 of 95

 

  • Total Variation (TV) smoothing preserves the sharp transitions in the signal

  • Note how the TV reconstruction does a better job of preserving the sharp transitions in the signal while removing the noise.

93 of 95

Total Variation Image

  • Idea comes from http://www2.compute.dtu.dk/~pcha/mxTV/

95 of 95

Summary

  • These examples demonstrate how regression principles extend beyond conventional predictive models and can be applied effectively to complex tasks such as image denoising and signal reconstruction.