
Linear Regression (2)

Lecture 7

Geometric interpretation of least squares and probabilistic view of linear regression

EECS 189/289, Fall 2025 @ UC Berkeley

Joseph E. Gonzalez and Narges Norouzi


Join at slido.com #6101346


Roadmap

  • Error Minimization
  • Geometric Interpretation
  • Evaluation
  • Regularized Least Squares
  • When Normal Equation Gets Tricky


Error Minimization


Optimization

[Pipeline: Learning Problem → Model Design → Optimization]

Supervised learning of scalar target values

 

 


Error Function Minimization

We fit the weights by minimizing the sum-of-squares error:

E(w) = ||t − Xw||^2 = (t − Xw)^T (t − Xw)

Separating the terms:

E(w) = t^T t − 2 w^T X^T t + w^T X^T X w

Finding the optimum solution — set the gradient with respect to w to zero:

∇_w E(w) = −2 X^T t + 2 X^T X w = 0

Reordering:

X^T X w = X^T t

Takeaway: these are the normal equations for the least squares problem. When X^T X is invertible, the solution is

ŵ = (X^T X)^{−1} X^T t
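As a quick check of the takeaway above, here is a minimal NumPy sketch (illustrative, not course code) that solves the normal equations on made-up data and compares the result with a library least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                 # made-up design matrix
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)         # solve X^T X w = X^T t
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)   # library least-squares solver
print(np.allclose(w_hat, w_lstsq))                # True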

Geometric Interpretation


[Linear Algebra] Span

  • The span of a set of vectors a_1, …, a_k is the set of all their linear combinations:

span{a_1, …, a_k} = { c_1 a_1 + ⋯ + c_k a_k : c_1, …, c_k ∈ ℝ }


[Linear Algebra] Matrix-Vector Multiplication

  • A matrix-vector product is a linear combination of the columns of the matrix:

Xw = w_1 x^(1) + w_2 x^(2) + ⋯ + w_D x^(D)

where x^(j) denotes the j-th column of X and w_j is the j-th entry of w.


Prediction Is a Linear Combination of Columns

The vector of predictions ŷ = Xw is a linear combination of the columns of X, so no matter which w we choose, ŷ always lies in the span of the columns of X (the column space of X).

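A tiny NumPy check (illustrative, not from the slides) that Xw really is a weighted sum of the columns of X:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                  # 5 samples, 3 columns
w = np.array([2.0, -1.0, 0.5])

by_columns = sum(w[j] * X[:, j] for j in range(X.shape[1]))
print(np.allclose(X @ w, by_columns))            # True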

What’s the geometry word for ‘closest point in a subspace’?


[Figure: geometry of least squares — the target vector t, the column space of X, and the prediction Xw; the error is the length of the residual vector t − Xw.]


Geometry of Least Squares in Plotly


[Linear Algebra] Orthogonality

  • Two vectors a and b are orthogonal when their dot product is zero:

a^T b = 0

A vector r is orthogonal to a subspace (such as the column space of X) when it is orthogonal to every vector in that subspace — equivalently, when X^T r = 0.

We will use this shortly.


Going Back to Our Error Function

The minimizer of ||t − Xw||^2 makes Xŵ the closest point to t in the column space of X, so the residual must be orthogonal to every column of X.

Adding the definition of residual: r = t − Xŵ, and orthogonality to the columns gives X^T (t − Xŵ) = 0.

Moving terms: X^T t − X^T X ŵ = 0, so X^T X ŵ = X^T t.

Normal Equation: ŵ = (X^T X)^{−1} X^T t — the same solution we obtained by minimizing the error directly.

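A short numerical check (illustrative) that the least-squares residual is orthogonal to every column of X, which is exactly the normal equation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
t = rng.standard_normal(50)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)   # normal equations
residual = t - X @ w_hat
print(X.T @ residual)                       # all entries ~ 0 (up to floating-point error)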

Evaluation


Predict and Evaluate

[Pipeline: Learning Problem → Model Design → Optimization → Predict & Evaluate]

Supervised learning of scalar target values

 

 

 


Evaluation - Visualization

 


When you see a fan shape in the residual plot, what comes to mind?


Evaluation - Metrics

Mean Squared Error (MSE): MSE = (1/N) Σ_n (t_n − ŷ_n)^2, where ŷ_n is the model's prediction for the n-th data point.

Root Mean Squared Error (RMSE): RMSE = √MSE — moves the metric back to the original unit of the data compared to MSE.

R-Squared (R²) Score: measures the quality of the fit relative to the intercept-only (mean) model; see below.

Visualizing the Sum of Squared Error of Regression Model


Goal of regression: Make the total area of the boxes as small as possible.

 


Visualizing the Sum of Squared Error of Intercept Model

The intercept-only model predicts the mean target t̄ for every input; its boxes are the squared deviations of the targets from the mean, Σ_n (t_n − t̄)^2.


R²: Quality of the Fit Relative to the Intercept Model

R² = 1 − SSE_model / SSE_intercept = 1 − Σ_n (t_n − ŷ_n)^2 / Σ_n (t_n − t̄)^2

R² is unitless and only compares performance relative to the mean baseline: R² = 1 is a perfect fit, and R² = 0 means the model does no better than predicting the mean.


Evaluation - Metrics

Mean Absolute Error (MAE): MAE = (1/N) Σ_n |t_n − ŷ_n| — in the same unit as the data; similar in spirit to MSE but penalizes errors linearly rather than quadratically, so it is less sensitive to outliers.

Mean Absolute Percentage Error (MAPE): MAPE = (100/N) Σ_n |t_n − ŷ_n| / |t_n|
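The metrics above are straightforward to compute with NumPy; a small illustrative sketch (the function name and test values are made up):

import numpy as np

def regression_metrics(t, t_hat):
    err = t - t_hat
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1 - np.sum(err ** 2) / np.sum((t - t.mean()) ** 2),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100 * np.mean(np.abs(err) / np.abs(t)),  # assumes no zero targets
    }

t = np.array([3.0, 5.0, 7.5, 9.0])
t_hat = np.array([2.8, 5.4, 7.0, 9.3])
print(regression_metrics(t, t_hat))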

Regularized Least Squares


Complexity and Overfitting

[Figure: models of increasing complexity fit to samples of raw data — underfitting, the "sweet spot", and overfitting.]

More complex isn’t always better.


Regularization

Regularization is the process of adding constraints or penalties to the learning process to improve generalization.

Many models and learning algorithms have methods to tune the regularization during the training process.

[Figure: error as the amount of regularization increases — overfitting with too little regularization, underfitting with too much, and a "sweet spot" in between.]


Regularization – Lagrangian Duality

  • One way to regularize is to constrain the weights: minimize E(w) subject to R(w) ≤ c.

Idea of the Lagrangian: introduce a penalty for breaking the constraint,

L(w, λ) = E(w) + λ (R(w) − c)

where R(w) − c is the violation magnitude and λ ≥ 0 is the imposed penalty. For a fixed λ, minimizing over w is the same as minimizing the penalized objective E(w) + λ R(w), since the constant λc does not affect the minimizer.


 


Regularization

We typically use iterative optimization algorithms to train (fit) the model by solving the following problem:

ŵ = argmin_w  E(w) + λ R(w)

where E(w) is the data-dependent error, R(w) is the regularization function, and λ ≥ 0 is the regularizer hyperparameter that controls how strongly the penalty is applied.

Linear Regression with L2 Regularization (Ridge)

E_ridge(w) = ||t − Xw||_2^2 + λ ||w||_2^2

The first term is the squared length of the residual vector; the second term penalizes large weights, shrinking them toward zero.

Setting the gradient to zero: −2 X^T (t − Xw) + 2 λ w = 0

Ridge solution: ŵ_ridge = (X^T X + λI)^{−1} X^T t — and X^T X + λI is invertible for any λ > 0.
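A minimal NumPy sketch of the ridge closed form above (illustrative; the data and λ value are made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
t = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(100)

lam = 0.1
D = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)  # (X^T X + λI) w = X^T t
print(w_ridge)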

L2 Regularization Demo


L1 Regularization (Lasso)

E_lasso(w) = ||t − Xw||_2^2 + λ ||w||_1

The first term is again the squared length of the residual vector; the L1 penalty is the sum of absolute weights, λ Σ_j |w_j|. Unlike ridge, there is no closed-form solution, but the L1 penalty tends to drive some weights exactly to zero.

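Lasso has no closed-form solution, so in practice it is fit iteratively (e.g., by coordinate descent). A small illustrative sketch using scikit-learn's Lasso — note that its alpha plays the role of λ up to a scaling convention:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
t = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1)   # alpha ~ regularization strength
model.fit(X, t)
print(model.coef_)         # several coefficients come out exactly zero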

L1 Regularization Demo


Impact of Regularization - Ridge


Impact of Regularization - Lasso


Lasso = Least Absolute Shrinkage and Selection Operator: the L1 penalty both shrinks the weights and selects features by driving some of them exactly to zero.


When Normal Equation Gets Tricky


When Normal Equation Gets Tricky

  • Solving the normal equations requires inverting X^T X. This fails when X^T X is singular (e.g., duplicated or perfectly collinear features, or more features than data points) and becomes numerically unstable when X^T X is ill-conditioned (nearly collinear features).



If we plug in the SVD of X, what is the simplified expression for the solution to the normal equation?


Fix #1: Ridge Trick

  • Add a small ridge penalty and solve (X^T X + λI) w = X^T t instead. For any λ > 0 the matrix X^T X + λI is positive definite, hence invertible and much better conditioned.


Ill-Conditioning Example

import numpy as np

N, D = 500, 2                                 # 500 samples, 2 features (plus a bias column)
X = np.ones((N, D + 1))                       # first column = bias
x1 = np.random.randn(N)
x2 = x1 + 1e-4 * np.random.randn(N)           # almost identical feature
X[:, 1] = x1
X[:, 2] = x2
t = 3 + 2*x1 - 1*x2 + 0.1*np.random.randn(N)  # targets

XtX = X.T @ X
Xty = X.T @ t
w_neq = np.linalg.solve(XtX, Xty)             # plain normal equations

lam = 1e-2
w_ridge = np.linalg.solve(XtX + lam*np.eye(D + 1), Xty)  # ridge trick

print("w_neq:", w_neq)
print("w_ridge:", w_ridge)

Because the two features are nearly identical, X^T X has a very small eigenvalue and a very large condition number, so the normal-equation weights become unstable. Try changing the noise and you will see substantial swings in the normal-equation weights, while the ridge solution stays stable.

 

w_neq: [ 3.00273423 12.87110771 -11.87562795]

w_ridge : [3.00275573 0.50070697 0.49481053]
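To quantify the instability (an addition, not on the slide), you can inspect the condition number of X^T X before and after the ridge trick:

print(np.linalg.cond(XtX))                         # huge, since x1 and x2 are nearly identical
print(np.linalg.cond(XtX + lam * np.eye(D + 1)))   # orders of magnitude smaller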


Fix #2: Moore–Penrose Pseudo-Inverse

  • For any matrix X with SVD X = UΣV^T, the Moore–Penrose pseudo-inverse is X⁺ = VΣ⁺U^T, where Σ⁺ inverts the nonzero singular values and leaves the zero ones at zero.

How can the Moore–Penrose pseudo-inverse help us solve the normal equation? Take ŵ = X⁺ t: when X^T X is invertible this equals (X^T X)^{−1} X^T t, and when it is not, it returns the minimum-norm least-squares solution.

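A small NumPy sketch (illustrative) showing that np.linalg.pinv and np.linalg.lstsq both handle a rank-deficient design matrix, where forming and inverting X^T X would fail:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
X = np.column_stack([np.ones(50), x1, 2 * x1])    # third column = 2 × second: rank deficient
t = 3 + x1 + 0.1 * rng.standard_normal(50)

w_pinv = np.linalg.pinv(X) @ t                    # minimum-norm least-squares solution
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)   # same solution via the SVD
print(np.allclose(w_pinv, w_lstsq))               # True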

Fix #2: Why Does It Help?

Issue: Rank deficiency (perfect multicollinearity)
How the pseudo-inverse fixes it: there are infinitely many least-squares solutions, and X⁺ t returns the unique minimum-norm one.

Issue: Ill-conditioning
How the pseudo-inverse fixes it: a truncation threshold discards numerically meaningless directions, implicitly adding ridge-like regularization.

Issue: Implementation convenience
How the pseudo-inverse fixes it: a single call (e.g., np.linalg.pinv or np.linalg.lstsq) handles square, tall, and rank-deficient design matrices without special-casing.

Fix #3: Sequential Learning

  • Instead of forming and inverting X^T X, update the weights one data point at a time with stochastic gradient descent:

w ← w + η (t_n − w^T x_n) x_n

where η is the learning rate. Known as Least Mean Squares or LMS. More on gradient-based optimization in Lectures 11 & 12.
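A minimal LMS sketch (illustrative; the learning rate and data are made up) showing the sequential update in action:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])   # bias + one feature
t = X @ np.array([3.0, 2.0]) + 0.1 * rng.standard_normal(200)

w = np.zeros(2)
eta = 0.01                                   # learning rate
for epoch in range(20):
    for x_n, t_n in zip(X, t):
        w += eta * (t_n - w @ x_n) * x_n     # LMS update for one data point
print(w)                                     # approaches the true weights [3, 2]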


Linear Regression (2)

Lecture 7

Credit: Joseph E. Gonzalez and Narges Norouzi

Reference Book Chapters: Chapter 1.2, Chapter 4.[1.4-1.6]