Linear Regression (2)
Lecture 7
Geometric interpretation of least squares and probabilistic view of linear regression
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
Join at slido.com #6101346
Roadmap
Error Minimization
Optimization
[Pipeline diagram: Learning problem (L) → Model design (M) → Optimization (O)]
Supervised learning of scalar target values
Error Function Minimization
Finding the optimum solution
Error Function Minimization
Separating the terms
Error Function Minimization
Reordering
Error Function Minimization
Takeaway
Normal equations for the least squares problem
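The algebra on these build slides did not survive extraction; a sketch of the standard steps, writing t for the target vector, X for the design matrix, and w for the weights:
$$E(\mathbf{w}) = \lVert \mathbf{t} - X\mathbf{w}\rVert_2^2 = \mathbf{t}^\top\mathbf{t} - 2\,\mathbf{w}^\top X^\top\mathbf{t} + \mathbf{w}^\top X^\top X\,\mathbf{w}$$
$$\nabla_{\mathbf{w}} E(\mathbf{w}) = -2X^\top\mathbf{t} + 2X^\top X\,\mathbf{w} = \mathbf{0} \;\Longrightarrow\; X^\top X\,\hat{\mathbf{w}} = X^\top\mathbf{t} \;\Longrightarrow\; \hat{\mathbf{w}} = (X^\top X)^{-1}X^\top\mathbf{t},$$
assuming $X^\top X$ is invertible.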
Geometric Interpretation
[Linear Algebra] Span
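For reference, the standard definition used here:
$$\operatorname{span}\{\mathbf{a}_1,\dots,\mathbf{a}_k\} = \Big\{ \textstyle\sum_{i=1}^{k} c_i\,\mathbf{a}_i \;:\; c_1,\dots,c_k \in \mathbb{R} \Big\},$$
i.e., the set of all linear combinations of the vectors.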
[Linear Algebra] Matrix-Vector Multiplication
$$X\mathbf{w} = \begin{bmatrix} \vert & & \vert \\ \mathbf{x}_{:,1} & \cdots & \mathbf{x}_{:,D} \\ \vert & & \vert \end{bmatrix}\begin{bmatrix} w_1 \\ \vdots \\ w_D \end{bmatrix} = w_1\,\mathbf{x}_{:,1} + w_2\,\mathbf{x}_{:,2} + \cdots + w_D\,\mathbf{x}_{:,D}$$
A matrix-vector product is a weighted sum (linear combination) of the columns of the matrix.
Prediction Is a Linear Combination of Columns
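A quick numerical check of this fact (a minimal sketch; the array values are made up):

import numpy as np

X = np.array([[1.0,  2.0],
              [1.0,  0.5],
              [1.0, -1.0]])                    # 3 rows, 2 columns (bias + one feature)
w = np.array([3.0, 2.0])

pred = X @ w                                   # matrix-vector product
by_columns = w[0] * X[:, 0] + w[1] * X[:, 1]   # weighted sum of the columns
print(np.allclose(pred, by_columns))           # True: the prediction lives in the span of X's columns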
What’s the geometry word for ‘closest point in a subspace’?
Length of the residual vector
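In symbols, with r the residual vector:
$$\mathbf{r} = \mathbf{t} - X\mathbf{w}, \qquad \lVert\mathbf{r}\rVert_2 = \sqrt{\textstyle\sum_{n=1}^{N}\big(t_n - \hat{t}_n\big)^2}.$$
Least squares chooses w to minimize this length, i.e., it picks the point Xw in the span of the columns of X that is closest to t: the orthogonal projection of t onto that subspace.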
Geometry of Least Squares in Plotly
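The actual demo lives in the course notebook; a minimal sketch of the same picture (column space of X as a plane in R^3, target t, projection Xw, residual), assuming plotly is installed, might look like this:

import numpy as np
import plotly.graph_objects as go

# Two columns of X span a plane in R^3; the target t sits off the plane.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
t = np.array([1.0, 2.0, 1.5])

w, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares weights
t_hat = X @ w                               # orthogonal projection of t onto span of X's columns

# Grid of points covering the plane spanned by the columns of X
a, b = np.meshgrid(np.linspace(-1, 3, 10), np.linspace(-1, 3, 10))
plane = a[..., None] * X[:, 0] + b[..., None] * X[:, 1]

fig = go.Figure([
    go.Surface(x=plane[..., 0], y=plane[..., 1], z=plane[..., 2],
               opacity=0.4, showscale=False),
    go.Scatter3d(x=[t[0]], y=[t[1]], z=[t[2]], mode="markers", name="t"),
    go.Scatter3d(x=[t_hat[0]], y=[t_hat[1]], z=[t_hat[2]], mode="markers", name="Xw"),
    go.Scatter3d(x=[t[0], t_hat[0]], y=[t[1], t_hat[1]], z=[t[2], t_hat[2]],
                 mode="lines", name="residual"),
])
fig.show()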
[Linear Algebra] Orthogonality
$$\mathbf{a} \perp \mathbf{b} \;\iff\; \mathbf{a}^\top\mathbf{b} = \sum_{i} a_i b_i = 0$$
Two vectors are orthogonal exactly when their dot product is zero.
We will use this shortly
Going Back to Our Error Function
Adding the definition of residual
Moving terms
Normal Equation
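Filling in the steps named on the slide (substitute the residual r = t - Xw, require it to be orthogonal to the columns of X, and move terms):
$$X^\top\mathbf{r} = X^\top(\mathbf{t} - X\mathbf{w}) = \mathbf{0} \;\Longrightarrow\; X^\top X\,\mathbf{w} = X^\top\mathbf{t},$$
which is again the normal equation.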
Evaluation
Predict and Evaluate
[Pipeline diagram: Learning problem (L) → Model design (M) → Optimization (O) → Predict & evaluate (P)]
Supervised learning of scalar target values
Evaluation - Visualization
When you see a fan shape in the residual plot, what comes to mind?
Evaluation - Metrics
Mean Squared Error (MSE) | $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(t_n - \hat{t}_n)^2$
Evaluation - Metrics
Mean Squared Error (MSE) | $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(t_n - \hat{t}_n)^2$
Root Mean Squared Error (RMSE) | $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
RMSE moves the metric back to the original units of the data (unlike MSE).
Evaluation - Metrics
Mean Squared Error (MSE) | $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(t_n - \hat{t}_n)^2$
Root Mean Squared Error (RMSE) | $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
R-Squared (R2) Score | $R^2 = 1 - \frac{\sum_{n}(t_n - \hat{t}_n)^2}{\sum_{n}(t_n - \bar{t})^2}$
Visualizing the Sum of Squared Error of Regression Model
Goal of regression: Make the total area of the boxes as small as possible.
Visualizing the Sum of Squared Error of Intercept Model
R2: Quality of the Fit Relative to Intercept Model
R² is unitless and only measures performance relative to the mean (intercept-only) baseline.
Evaluation - Metrics
Mean Squared Error (MSE) | $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(t_n - \hat{t}_n)^2$
Root Mean Squared Error (RMSE) | $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
R-Squared (R2) Score | $R^2 = 1 - \frac{\sum_{n}(t_n - \hat{t}_n)^2}{\sum_{n}(t_n - \bar{t})^2}$
Mean Absolute Error (MAE) | $\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}|t_n - \hat{t}_n|$
MAE is in the same units as the data; it is similar to MSE but penalizes errors linearly rather than quadratically.
Evaluation - Metrics
Mean Squared Error (MSE) | $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}(t_n - \hat{t}_n)^2$
Root Mean Squared Error (RMSE) | $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
R-Squared (R2) Score | $R^2 = 1 - \frac{\sum_{n}(t_n - \hat{t}_n)^2}{\sum_{n}(t_n - \bar{t})^2}$
Mean Absolute Error (MAE) | $\mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}|t_n - \hat{t}_n|$
Mean Absolute Percentage Error (MAPE) | $\mathrm{MAPE} = \frac{100\%}{N}\sum_{n=1}^{N}\frac{|t_n - \hat{t}_n|}{|t_n|}$
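A minimal numpy sketch of these metrics (t and t_hat are assumed to be 1-D arrays of targets and predictions):

import numpy as np

def regression_metrics(t, t_hat):
    """Compute the metrics from the table above."""
    err = t - t_hat
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((t - t.mean()) ** 2)
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err) / np.abs(t))   # assumes no zero targets
    return dict(MSE=mse, RMSE=rmse, R2=r2, MAE=mae, MAPE=mape)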
Regularized Least Squares
Complexity and Overfitting
[Figure: models of increasing complexity fit to samples of the raw data; underfitting at low complexity, a “sweet spot”, then overfitting at high complexity]
More complex isn’t always better.
Regularization
Regularization is the process of adding constraints or penalties to the learning process to improve generalization.
Many models and learning algorithms have methods to tune the regularization during the training process.
[Figure: error vs. amount of regularization; too little regularization overfits, too much underfits, with a “sweet spot” in between]
Regularization – Lagrangian Duality
Idea of the Lagrangian: introduce a penalty for breaking the constraint, where the imposed penalty grows with the violation magnitude (sketched below).
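A sketch in symbols (the constraint level c and multiplier λ are illustrative names; this is the standard Lagrangian penalty form):
$$\min_{\mathbf{w}} \;\lVert\mathbf{t} - X\mathbf{w}\rVert_2^2 \;\;\text{s.t.}\;\; \lVert\mathbf{w}\rVert_2^2 \le c
\qquad\longleftrightarrow\qquad
\min_{\mathbf{w}}\;\max_{\lambda \ge 0}\;\; \lVert\mathbf{t} - X\mathbf{w}\rVert_2^2 + \lambda\big(\lVert\mathbf{w}\rVert_2^2 - c\big).$$
The term $\lambda(\lVert\mathbf{w}\rVert_2^2 - c)$ is the imposed penalty, which scales with the violation magnitude.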
Regularization
We typically use iterative optimization algorithms to train (fit) the model by solving the following problem:
The objective combines a data-dependent error term with a regularization function, weighted by a regularizer hyperparameter (see the sketch below).
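In symbols, with $E_D$ the data-dependent error, $R$ the regularization function, and $\lambda$ the regularizer hyperparameter:
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}}\; E_D(\mathbf{w}) + \lambda\,R(\mathbf{w}).$$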
Linear Regression with L2 Regularization (Ridge)
L2 Regularization (Ridge)
The error term is the squared length of the residual vector.
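For squared error with an L2 penalty, the objective and its closed-form solution are the standard ridge expressions:
$$\hat{\mathbf{w}}_{\text{ridge}} = \arg\min_{\mathbf{w}}\;\lVert\mathbf{t} - X\mathbf{w}\rVert_2^2 + \lambda\lVert\mathbf{w}\rVert_2^2 = \big(X^\top X + \lambda I\big)^{-1}X^\top\mathbf{t}.$$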
L2 Regularization Demo
L1 Regularization (Lasso)
The error term is again the squared length of the residual vector.
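The lasso swaps the L2 penalty for an L1 penalty; there is no closed-form solution, so it is solved iteratively:
$$\hat{\mathbf{w}}_{\text{lasso}} = \arg\min_{\mathbf{w}}\;\lVert\mathbf{t} - X\mathbf{w}\rVert_2^2 + \lambda\lVert\mathbf{w}\rVert_1.$$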
L2 Regularization Demo
Impact of Regularization - Ridge
Impact of Regularization - Lasso
Lasso: Least Absolute Shrinkage and Selection Operator
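The plots for these slides come from the course demo; a minimal sketch of the kind of comparison shown, using scikit-learn's Ridge and Lasso (the synthetic data below is made up for illustration):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.array([3.0, -2.0, 1.0] + [0.0] * 7)   # only 3 informative features
t = X @ w_true + 0.1 * rng.normal(size=200)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha).fit(X, t)
    lasso = Lasso(alpha=alpha).fit(X, t)
    # Ridge shrinks all weights smoothly toward zero;
    # lasso drives uninformative weights exactly to zero (feature selection).
    print(alpha, np.round(ridge.coef_, 2), np.round(lasso.coef_, 2))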
When Normal Equation Gets Tricky
If we plug in the SVD of X, what is the simplified expression for the solution to the normal equation?
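For reference (assuming X has full column rank and the thin SVD $X = U\Sigma V^\top$):
$$\hat{\mathbf{w}} = (X^\top X)^{-1}X^\top\mathbf{t} = \big(V\Sigma^2 V^\top\big)^{-1} V\Sigma U^\top\mathbf{t} = V\Sigma^{-1}U^\top\mathbf{t}.$$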
Fix #1: Ridge Trick
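The point of the trick: for any $\lambda > 0$ the matrix $X^\top X + \lambda I$ is symmetric positive definite (every eigenvalue is shifted up by $\lambda$), so it is always invertible:
$$\hat{\mathbf{w}}_{\text{ridge}} = \big(X^\top X + \lambda I\big)^{-1}X^\top\mathbf{t}.$$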
Ill-conditioning Example

import numpy as np

N, D = 500, 2                                    # 500 samples, 2 features (a bias column is added as column 0)
X = np.ones((N, D + 1))                          # first column = bias
x1 = np.random.randn(N)
x2 = x1 + 1e-4 * np.random.randn(N)              # nearly identical feature -> near-collinear columns
X[:, 1] = x1
X[:, 2] = x2
t = 3 + 2*x1 - 1*x2 + 0.1*np.random.randn(N)     # targets

XtX = X.T @ X
Xtt = X.T @ t
w_ne = np.linalg.solve(XtX, Xtt)                 # plain normal-equation solution
lam = 1e-2
w_ridge = np.linalg.solve(XtX + lam*np.eye(D + 1), Xtt)   # ridge-regularized solution

Because the two features are nearly identical, X^T X has a very small eigenvalue and a large condition number, so the normal-equation weights are unstable. Rerun with different noise and the unregularized weights swing substantially, while the ridge solution stays stable.

w_ne:    [ 3.00273423 12.87110771 -11.87562795]
w_ridge: [3.00275573 0.50070697 0.49481053]
Fix #2: Moore–Penrose Pseudo-Inverse
How can the Moore–Penrose pseudo-inverse help us solve the normal equation?
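Sketch of the answer: with the SVD $X = U\Sigma V^\top$, the pseudo-inverse is $X^{+} = V\Sigma^{+}U^\top$, where $\Sigma^{+}$ inverts only the nonzero (or not-too-small) singular values, so
$$\hat{\mathbf{w}} = X^{+}\mathbf{t} = V\Sigma^{+}U^\top\mathbf{t}$$
is well defined even when $X^\top X$ is singular, and it coincides with $(X^\top X)^{-1}X^\top\mathbf{t}$ when $X$ has full column rank.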
Fix #2: Why It Helps
Issue | How the pseudo-inverse fixes it
Rank deficiency (perfect multicollinearity) | Still returns a solution: the unique minimum-norm least-squares solution, even when X^T X is singular.
Ill-conditioning | The truncation threshold discards numerically meaningless directions, implicitly adding ridge-like regularization.
Implementation convenience | A single call (e.g., np.linalg.pinv or np.linalg.lstsq) handles full-rank, rank-deficient, and ill-conditioned cases (see the sketch below).
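A minimal usage sketch on data like the ill-conditioning example above (the rcond threshold is illustrative):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 1e-4 * rng.normal(size=500)            # nearly duplicate feature, as before
X = np.c_[np.ones(500), x1, x2]
t = 3 + 2*x1 - x2 + 0.1 * rng.normal(size=500)

# Pseudo-inverse: singular values below rcond * (largest singular value) are treated as zero
w_pinv = np.linalg.pinv(X, rcond=1e-8) @ t

# lstsq applies the same idea in one call and also reports the rank and singular values
w_lstsq, _, rank, sing_vals = np.linalg.lstsq(X, t, rcond=None)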
Fix #3: Sequential Learning
Update the weights one training example at a time with a small learning rate; this scheme is known as Least Mean Squares (LMS). Iterative optimization is covered in Lectures 11 & 12.
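The update referenced here (standard LMS / stochastic gradient step on one example $(\mathbf{x}_n, t_n)$ with learning rate $\eta$):
$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,\big(t_n - \mathbf{w}^\top\mathbf{x}_n\big)\,\mathbf{x}_n.$$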
Linear Regression (2)
Lecture 7
Credit: Joseph E. Gonzalez and Narges Norouzi
Reference Book Chapters: Chapter 1.2, Chapter 4.1.4–4.1.6