
Linear Regression (2)

Lecture 7

Geometric interpretation of least squares and probabilistic view of linear regression

EECS 189/289, Fall 2025 @ UC Berkeley

Joseph E. Gonzalez and Narges Norouzi


Join at slido.com #6101346


Roadmap

  • Error Minimization
  • Geometric Interpretation
  • Evaluation
  • Regularized Least Squares
  • When Normal Equation Gets Tricky


Error Minimization


Optimization

[Pipeline: Learning Problem → Model Design → Optimization]

Supervised learning of scalar target values

 

 


Error Function Minimization

We fit the weights by minimizing the sum-of-squares error:

E(w) = ||t − Xw||^2 = (t − Xw)^T (t − Xw)

Separating the terms:

E(w) = t^T t − 2 w^T X^T t + w^T X^T X w

Finding the optimum solution — set the gradient with respect to w to zero:

∇_w E(w) = −2 X^T t + 2 X^T X w = 0

Reordering:

X^T X w = X^T t

Takeaway: these are the normal equations for the least squares problem. When X^T X is invertible, the solution is

ŵ = (X^T X)^{−1} X^T t
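As a quick check of the takeaway above, here is a minimal NumPy sketch (illustrative, not course code) that solves the normal equations on made-up data and compares the result with a library least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                 # made-up design matrix
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)         # solve X^T X w = X^T t
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)   # library least-squares solver
print(np.allclose(w_hat, w_lstsq))                # True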

Geometric Interpretation


[Linear Algebra] Span

  • The span of a set of vectors a_1, …, a_k is the set of all their linear combinations:

span{a_1, …, a_k} = { c_1 a_1 + ⋯ + c_k a_k : c_1, …, c_k ∈ ℝ }


[Linear Algebra] Matrix-Vector Multiplication

  • A matrix-vector product is a linear combination of the columns of the matrix:

Xw = w_1 x^(1) + w_2 x^(2) + ⋯ + w_D x^(D)

where x^(j) denotes the j-th column of X and w_j is the j-th entry of w.


Prediction Is a Linear Combination of Columns

The vector of predictions ŷ = Xw is a linear combination of the columns of X, so no matter which w we choose, ŷ always lies in the span of the columns of X (the column space of X).

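A tiny NumPy check (illustrative, not from the slides) that Xw really is a weighted sum of the columns of X:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                  # 5 samples, 3 columns
w = np.array([2.0, -1.0, 0.5])

by_columns = sum(w[j] * X[:, j] for j in range(X.shape[1]))
print(np.allclose(X @ w, by_columns))            # True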

What’s the geometry word for ‘closest point in a subspace’?


[Figure: geometry of least squares — the target vector t, the column space of X, and the prediction Xw; the error is the length of the residual vector t − Xw.]


Geometry of Least Squares in Plotly


[Linear Algebra] Orthogonality

  • Two vectors a and b are orthogonal when their dot product is zero:

a^T b = 0

A vector r is orthogonal to a subspace (such as the column space of X) when it is orthogonal to every vector in that subspace — equivalently, when X^T r = 0.

We will use this shortly.


Going Back to Our Error Function

The minimizer of ||t − Xw||^2 makes Xŵ the closest point to t in the column space of X, so the residual must be orthogonal to every column of X.

Adding the definition of residual: r = t − Xŵ, and orthogonality to the columns gives X^T (t − Xŵ) = 0.

Moving terms: X^T t − X^T X ŵ = 0, so X^T X ŵ = X^T t.

Normal Equation: ŵ = (X^T X)^{−1} X^T t — the same solution we obtained by minimizing the error directly.

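A short numerical check (illustrative) that the least-squares residual is orthogonal to every column of X, which is exactly the normal equation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
t = rng.standard_normal(50)

w_hat = np.linalg.solve(X.T @ X, X.T @ t)   # normal equations
residual = t - X @ w_hat
print(X.T @ residual)                       # all entries ~ 0 (up to floating-point error)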

Evaluation


Predict and Evaluate

[Pipeline: Learning Problem → Model Design → Optimization → Predict & Evaluate]

Supervised learning of scalar target values

 

 

 


Evaluation - Visualization

 


When you see a fan shape in the residual plot, what comes to mind?


Evaluation - Metrics

Mean Squared Error (MSE): MSE = (1/N) Σ_n (t_n − ŷ_n)^2, where ŷ_n is the model's prediction for the n-th data point.

Root Mean Squared Error (RMSE): RMSE = √MSE — moves the metric back to the original unit of the data compared to MSE.

R-Squared (R²) Score: measures the quality of the fit relative to the intercept-only (mean) model; see below.

Visualizing the Sum of Squared Error of Regression Model


Goal of regression: Make the total area of the boxes as small as possible.

 


Visualizing the Sum of Squared Error of Intercept Model

The intercept-only model predicts the mean target t̄ for every input; its boxes are the squared deviations of the targets from the mean, Σ_n (t_n − t̄)^2.


R²: Quality of the Fit Relative to the Intercept Model

R² = 1 − SSE_model / SSE_intercept = 1 − Σ_n (t_n − ŷ_n)^2 / Σ_n (t_n − t̄)^2

R² is unitless and only compares performance relative to the mean baseline: R² = 1 is a perfect fit, and R² = 0 means the model does no better than predicting the mean.


Evaluation - Metrics

Mean Absolute Error (MAE): MAE = (1/N) Σ_n |t_n − ŷ_n| — in the same unit as the data; similar in spirit to MSE but penalizes errors linearly rather than quadratically, so it is less sensitive to outliers.

Mean Absolute Percentage Error (MAPE): MAPE = (100/N) Σ_n |t_n − ŷ_n| / |t_n|
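The metrics above are straightforward to compute with NumPy; a small illustrative sketch (the function name and test values are made up):

import numpy as np

def regression_metrics(t, t_hat):
    err = t - t_hat
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1 - np.sum(err ** 2) / np.sum((t - t.mean()) ** 2),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100 * np.mean(np.abs(err) / np.abs(t)),  # assumes no zero targets
    }

t = np.array([3.0, 5.0, 7.5, 9.0])
t_hat = np.array([2.8, 5.4, 7.0, 9.3])
print(regression_metrics(t, t_hat))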

Regularized Least Squares


Complexity and Overfitting

[Figure: models of increasing complexity fit to samples of raw data — underfitting, the "sweet spot", and overfitting.]

More complex isn’t always better.


Regularization

Regularization is the process of adding constraints or penalties to the learning process to improve generalization.

Many models and learning algorithms have methods to tune the regularization during the training process.

[Figure: error as the amount of regularization increases — overfitting with too little regularization, underfitting with too much, and a "sweet spot" in between.]


Regularization – Lagrangian Duality

  • One way to regularize is to constrain the weights: minimize E(w) subject to R(w) ≤ c.

Idea of the Lagrangian: introduce a penalty for breaking the constraint,

L(w, λ) = E(w) + λ (R(w) − c)

where R(w) − c is the violation magnitude and λ ≥ 0 is the imposed penalty. For a fixed λ, minimizing over w is the same as minimizing the penalized objective E(w) + λ R(w), since the constant λc does not affect the minimizer.


 


Regularization

We typically use iterative optimization algorithms to train (fit) the model by solving the following problem:

ŵ = argmin_w  E(w) + λ R(w)

where E(w) is the data-dependent error, R(w) is the regularization function, and λ ≥ 0 is the regularizer hyperparameter that controls how strongly the penalty is applied.

Linear Regression with L2 Regularization (Ridge)

E_ridge(w) = ||t − Xw||_2^2 + λ ||w||_2^2

The first term is the squared length of the residual vector; the second term penalizes large weights, shrinking them toward zero.

Setting the gradient to zero: −2 X^T (t − Xw) + 2 λ w = 0

Ridge solution: ŵ_ridge = (X^T X + λI)^{−1} X^T t — and X^T X + λI is invertible for any λ > 0.
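A minimal NumPy sketch of the ridge closed form above (illustrative; the data and λ value are made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
t = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(100)

lam = 0.1
D = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ t)  # (X^T X + λI) w = X^T t
print(w_ridge)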

L2 Regularization Demo


L1 Regularization (Lasso)

E_lasso(w) = ||t − Xw||_2^2 + λ ||w||_1

The first term is again the squared length of the residual vector; the L1 penalty is the sum of absolute weights, λ Σ_j |w_j|. Unlike ridge, there is no closed-form solution, but the L1 penalty tends to drive some weights exactly to zero.

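Lasso has no closed-form solution, so in practice it is fit iteratively (e.g., by coordinate descent). A small illustrative sketch using scikit-learn's Lasso — note that its alpha plays the role of λ up to a scaling convention:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
t = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1)   # alpha ~ regularization strength
model.fit(X, t)
print(model.coef_)         # several coefficients come out exactly zero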

L1 Regularization Demo


Impact of Regularization - Ridge


Impact of Regularization - Lasso


Lasso = Least Absolute Shrinkage and Selection Operator: the L1 penalty both shrinks the weights and selects features by driving some of them exactly to zero.


When Normal Equation Gets Tricky


When Normal Equation Gets Tricky

  • Solving the normal equations requires inverting X^T X. This fails when X^T X is singular (e.g., duplicated or perfectly collinear features, or more features than data points) and becomes numerically unstable when X^T X is ill-conditioned (nearly collinear features).



If we plug in the SVD of X, what is the simplified expression for the solution to the normal equation?


Fix #1: Ridge Trick

  • Add a small ridge penalty and solve (X^T X + λI) w = X^T t instead. For any λ > 0 the matrix X^T X + λI is positive definite, hence invertible and much better conditioned.


Ill-Conditioning Example

import numpy as np

N, D = 500, 2                                 # 500 samples, 2 features (plus a bias column)
X = np.ones((N, D + 1))                       # first column = bias
x1 = np.random.randn(N)
x2 = x1 + 1e-4 * np.random.randn(N)           # almost identical feature
X[:, 1] = x1
X[:, 2] = x2
t = 3 + 2*x1 - 1*x2 + 0.1*np.random.randn(N)  # targets

XtX = X.T @ X
Xty = X.T @ t
w_neq = np.linalg.solve(XtX, Xty)             # plain normal equations

lam = 1e-2
w_ridge = np.linalg.solve(XtX + lam*np.eye(D + 1), Xty)  # ridge trick

print("w_neq:", w_neq)
print("w_ridge:", w_ridge)

Because the two features are nearly identical, X^T X has a very small eigenvalue and a very large condition number, so the normal-equation weights become unstable. Try changing the noise and you will see substantial swings in the normal-equation weights, while the ridge solution stays stable.

 

w_neq: [ 3.00273423 12.87110771 -11.87562795]

w_ridge : [3.00275573 0.50070697 0.49481053]
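To quantify the instability (an addition, not on the slide), you can inspect the condition number of X^T X before and after the ridge trick:

print(np.linalg.cond(XtX))                         # huge, since x1 and x2 are nearly identical
print(np.linalg.cond(XtX + lam * np.eye(D + 1)))   # orders of magnitude smaller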


Fix #2: Moore–Penrose Pseudo-Inverse

  • For any matrix X with SVD X = UΣV^T, the Moore–Penrose pseudo-inverse is X⁺ = VΣ⁺U^T, where Σ⁺ inverts the nonzero singular values and leaves the zero ones at zero.

How can the Moore–Penrose pseudo-inverse help us solve the normal equation? Take ŵ = X⁺ t: when X^T X is invertible this equals (X^T X)^{−1} X^T t, and when it is not, it returns the minimum-norm least-squares solution.

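A small NumPy sketch (illustrative) showing that np.linalg.pinv and np.linalg.lstsq both handle a rank-deficient design matrix, where forming and inverting X^T X would fail:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(50)
X = np.column_stack([np.ones(50), x1, 2 * x1])    # third column = 2 × second: rank deficient
t = 3 + x1 + 0.1 * rng.standard_normal(50)

w_pinv = np.linalg.pinv(X) @ t                    # minimum-norm least-squares solution
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)   # same solution via the SVD
print(np.allclose(w_pinv, w_lstsq))               # True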

Fix #2: Why Does It Help?

Issue: Rank deficiency (perfect multicollinearity)
How the pseudo-inverse fixes it: there are infinitely many least-squares solutions, and X⁺ t returns the unique minimum-norm one.

Issue: Ill-conditioning
How the pseudo-inverse fixes it: a truncation threshold discards numerically meaningless directions, implicitly adding ridge-like regularization.

Issue: Implementation convenience
How the pseudo-inverse fixes it: a single call (e.g., np.linalg.pinv or np.linalg.lstsq) handles square, tall, and rank-deficient design matrices without special-casing.

Fix #3: Sequential Learning

  • Instead of forming and inverting X^T X, update the weights one data point at a time with stochastic gradient descent:

w ← w + η (t_n − w^T x_n) x_n

where η is the learning rate. Known as Least Mean Squares or LMS. More on gradient-based optimization in Lectures 11 & 12.
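A minimal LMS sketch (illustrative; the learning rate and data are made up) showing the sequential update in action:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal(200)])   # bias + one feature
t = X @ np.array([3.0, 2.0]) + 0.1 * rng.standard_normal(200)

w = np.zeros(2)
eta = 0.01                                   # learning rate
for epoch in range(20):
    for x_n, t_n in zip(X, t):
        w += eta * (t_n - w @ x_n) * x_n     # LMS update for one data point
print(w)                                     # approaches the true weights [3, 2]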


Linear Regression (2)

Lecture 7

Credit: Joseph E. Gonzalez and Narges Norouzi

Reference Book Chapters: Chapter 1.2, Chapter 4.[1.4-1.6]