Join at slido.com #1846379
Optimization and Gradient Descent
Lecture 12
The core algorithm of modern machine learning
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
Key Factors in the Generative AI Revolution
Data
Compute
Automatic Diff.
Loss Functions
Stochastic Gradient Descent
Robbins–Monro 1951
Neural
Architectures
The General Optimization Problem
The General Optimization Problem
Most problems in ML can be reduced to the following general optimization problem
Solution (learned parameters) = the argmin of the objective function over the constraint set.
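As a toy numerical sketch of this setup (objective, constraint set, learned solution), here is projected gradient descent on an illustrative one-dimensional problem; the objective (w - 3)^2 and the constraint w >= 0 are made-up examples, not from the lecture:

```python
# Minimal sketch of the general problem: minimize an objective f(w)
# over a constraint set (here {w >= 0}) by taking gradient steps and
# projecting back onto the constraints. All choices are illustrative.

def f(w):
    return (w - 3.0) ** 2          # objective function

def grad_f(w):
    return 2.0 * (w - 3.0)         # its gradient

def project(w):
    return max(w, 0.0)             # projection onto the constraint set {w >= 0}

w = -5.0                            # initial (infeasible) guess
for _ in range(200):
    w = project(w - 0.1 * grad_f(w))

print(round(w, 4))                  # solution (learned parameter), approx. 3.0
```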
Optimization Constraints
The constraints define the valid set of parameters.
The Objective Function
The shape of the objective function plays a big role in the complexity of solving the optimization problem.
Example Optimization Problems
How do we solve these optimization problems?
What properties of the objective function make it “easier” to solve?
Discrete objectives
Objectives with multiple minima (global vs. local minima)
Plots in demo notebook.
Convex Functions
Non-Convex:
Convex:
In general, there are efficient algorithms to reliably minimize convex functions.
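One way to probe convexity numerically is the midpoint form of Jensen's inequality, f((x + y)/2) <= (f(x) + f(y))/2. This sketch samples random pairs; the two test functions are illustrative, and passing the check is evidence of convexity rather than a proof:

```python
import math
import random

# Midpoint convexity check: f((x + y) / 2) <= (f(x) + f(y)) / 2 for all x, y.
# Sampling random pairs can only find counterexamples, not certify convexity.

def midpoint_convex(f, trials=1000, lo=-5.0, hi=5.0):
    rng = random.Random(0)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if f((x + y) / 2) > (f(x) + f(y)) / 2 + 1e-12:
            return False   # found a violating pair: not convex
    return True            # no violation among the samples

print(midpoint_convex(lambda x: x ** 2))   # convex parabola
print(midpoint_convex(math.sin))           # non-convex
```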
Convex (Constraint) Sets
Convex Set
Non-convex Set
In general, there are efficient algorithms to reliably minimize convex functions over convex sets.
Optimization Problems
Optimization of a convex objective function over a convex constraint set.
Many classic ML problems are convex.
Most of modern ML (deep learning) is not convex.
Convex Function
Convex Constraints
Minimizing the Error & Optimizing in Weight-Space
Points on the Error Surface
Weight Space
Data
Plots in demo notebook.
Demo
Understanding the Error Surface
Slido Matching
Match the model with the loss surface (A, B, or C):
Model 1: Logistic Regression
Model 2: Least Squares Regression
Model 3: Logistic Regression
Calculus and Optimization
The Gradient
Gradient Exercise
Plots in demo notebook.
The gradient points in the direction of greatest ascent.
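A small numerical check of this claim: for an illustrative quadratic E (not from the demo), the directional derivative is maximized along the normalized gradient, and its maximum value equals the gradient's norm:

```python
import math

# For E(w1, w2) = w1**2 + 3 * w2**2 (illustrative), compare the directional
# derivative <grad E(w), u> along the normalized gradient with many other
# unit directions. The gradient direction gives the largest rate of increase.

def grad_E(w):
    return [2 * w[0], 6 * w[1]]

w = [1.0, 1.0]
g = grad_E(w)
norm = math.hypot(*g)
g_hat = [g[0] / norm, g[1] / norm]     # unit vector along the gradient

def directional_derivative(u):
    return g[0] * u[0] + g[1] * u[1]   # <grad E(w), u>

best = directional_derivative(g_hat)
for k in range(360):                   # sweep unit directions around the circle
    theta = 2 * math.pi * k / 360
    u = [math.cos(theta), math.sin(theta)]
    assert directional_derivative(u) <= best + 1e-9

print(round(best, 4))                  # equals ||grad E(w)|| = sqrt(4 + 36)
```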
Demo
The Gradient of the Error Function
If w is a d-dimensional vector and E(w) is a scalar-valued function, what is the dimensionality of the gradient ∇E(w) with respect to w?
Common Confusion
The gradient is in the domain of the function.
Not the gradient!
Stationary Points and Zero Gradient
Any point where the gradient is zero is a stationary point.
Minimum
Maximum
Saddle Point
Plots in demo notebook.
Analytical Optimization using Gradients
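For least squares, setting the gradient to zero can be solved in closed form via the normal equations; a sketch on synthetic data (the data and dimensions are illustrative, not from the demo notebook):

```python
import numpy as np

# Analytical optimization sketch: for E(w) = ||Xw - y||^2, the gradient is
# 2 X^T (Xw - y); setting it to zero gives the normal equations
# X^T X w = X^T y. Synthetic, noiseless data, so w_true is recoverable.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_star = np.linalg.solve(X.T @ X, X.T @ y)          # stationary point of E
print(np.allclose(w_star, w_true))                  # recovers the true weights
print(np.allclose(2 * X.T @ (X @ w_star - y), 0))   # gradient vanishes there
```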
Gradient Descent
Iterative optimization
Iterative Optimization Algorithms
new estimate = old estimate + update (computed from the gradient)
Intuition
Try the loss game (it's free)!
Error
Goal: Minimize the loss by turning the knobs.
What if we knew which way to turn the knob, and roughly how far?
This is the Gradient!
This is gradient descent!
Gradient Descent Intuition
Error
Goal:
Initial estimate: often a random guess.
Gradient Descent Intuition
[Animation: successive loss evaluations follow the gradient down the error curve.]
The (Batch) Gradient Descent Alg.
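A minimal sketch of the algorithm on a synthetic least-squares problem; the data, learning rate, and iteration count are illustrative choices, not the lecture's demo:

```python
import numpy as np

# Batch gradient descent on E(w) = (1/n) * ||Xw - y||^2 with synthetic,
# noiseless data. Each step uses the gradient over the FULL dataset.

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true

def grad_E(w):
    return (2.0 / n) * X.T @ (X @ w - y)   # full-batch gradient

w = np.zeros(d)                            # initial estimate
eta = 0.1                                  # learning rate
for _ in range(500):
    w = w - eta * grad_E(w)                # new estimate = old - eta * gradient

print(np.round(w, 4))                      # approaches w_true = [2, -1]
```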
Demo
Batch Gradient Descent
Loss (Error) Curves
The loss or error curve plots the error as a function of the number of iterations (steps) of gradient descent.
Training and Validation Error
Training Error
Validation Error
Overfitting
Best Model
Validation curves are typically above the training curves.
You can overoptimize the objective (overfit).
Choose parameters with lowest validation error.
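The model-selection rule above can be sketched as follows, tracking validation error per step and keeping the best snapshot; the error curves here are synthetic illustrative numbers, not real training runs:

```python
# "Choose parameters with lowest validation error": track validation error
# per iteration and keep the step where it bottoms out. Synthetic curves:
# training error keeps decreasing, validation error turns up (overfitting).

train_err = [1.0 / (t + 1) for t in range(50)]              # keeps decreasing
val_err = [1.0 / (t + 1) + 0.0005 * t for t in range(50)]   # decreases, then rises

best_step = min(range(50), key=lambda t: val_err[t])        # best model snapshot
print(best_step, round(val_err[best_step], 4))
```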
Convergence Assuming Quadratic Approximation
For a simple quadratic objective (e.g., x^2), is gradient descent guaranteed to converge?
Quadratic Approximations
Taylor Expansion of the Error Surface
Matrix Equivalent of the Second Derivative
Hessians (the 2nd derivative)
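The Hessian collects all second partial derivatives. A finite-difference sketch for an illustrative quadratic E(w) = 0.5 w^T A w, whose Hessian is exactly the matrix A:

```python
import numpy as np

# Finite-difference Hessian of E(w) = 0.5 * w^T A w for a fixed symmetric A
# (illustrative choice); the computation should recover A itself.

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

def E(w):
    return 0.5 * w @ A @ w

def hessian_fd(f, w, h=1e-4):
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.eye(d)[i] * h
            e_j = np.eye(d)[j] * h
            # central second difference for d^2 f / (dw_i dw_j)
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

H = hessian_fd(E, np.zeros(2))
print(np.round(H, 4))                  # matches A
```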
Stopped Here
Will return to later material in the next lecture.
Hessian Exercise
Plots in demo notebook.
Hessian
Taylor Expansion of the Error Surface
Use a change of variables to better understand this part.
Eigen Decomposition of the Hessian
Show it!
Eigen Decomposition of the Hessian
Show it!
Derivation:
Eigenvector Defn.
Orthonormality
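Both facts in the derivation (the eigenvector definition and orthonormality) can be checked numerically; the symmetric matrix below is an illustrative stand-in for a Hessian, and `numpy.linalg.eigh` returns orthonormal eigenvectors for symmetric input:

```python
import numpy as np

# Eigendecomposition H = U diag(lambda) U^T of a symmetric Hessian
# (illustrative matrix), verifying the two facts used in the derivation.

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])

lam, U = np.linalg.eigh(H)      # eigenvalues ascending; columns = eigenvectors

print(np.allclose(U @ np.diag(lam) @ U.T, H))   # reconstruction of H
print(np.allclose(U.T @ U, np.eye(2)))          # orthonormality: U^T U = I
for i in range(2):
    # eigenvector definition: H u_i = lambda_i u_i
    print(np.allclose(H @ U[:, i], lam[i] * U[:, i]))
```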
Taylor Expansion at the Stationary Pts.
Stationary Points and the Hessian
Eigenvalues of the Hessian determine the curvature at the stationary points.
Minimum
Maximum
Saddle Point
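This classification rule can be sketched directly from the eigenvalue signs; the three example Hessians are illustrative:

```python
import numpy as np

# Classify a stationary point by the signs of the Hessian's eigenvalues:
# all positive -> minimum, all negative -> maximum, mixed -> saddle point.

def classify(H):
    lam = np.linalg.eigvalsh(H)
    if np.all(lam > 0):
        return "minimum"
    if np.all(lam < 0):
        return "maximum"
    return "saddle point"

print(classify(np.array([[2.0, 0.0], [0.0, 1.0]])))    # minimum
print(classify(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle point
```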
Eigenvector of the Hessian
Elliptical contours of constant error align with the eigenvectors of the Hessian matrix.
Plots in demo notebook.
Direction with Smallest Eigenvalue
Direction with Largest Eigenvalue
Demo
Taylor Expansions and the Hessian
Recap: Quadratic Approx. Error
Grad. Descent on the Quadratic Approx.
Quadratic Approx. Error
Grad. Descent on the Quadratic Approx.
Learning Rate Impact on Convergence
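On a one-dimensional quadratic the learning-rate effect is explicit: each step multiplies the error by (1 - eta * lambda), so GD converges iff 0 < eta < 2/lambda. A sketch with illustrative constants:

```python
# Gradient descent on E(w) = 0.5 * lam * w**2 (illustrative 1D quadratic).
# The update w <- w - eta * lam * w scales w by (1 - eta * lam) each step,
# so GD converges iff |1 - eta * lam| < 1, i.e. 0 < eta < 2 / lam.

def run_gd(eta, lam=2.0, w0=1.0, steps=100):
    w = w0
    for _ in range(steps):
        w = w - eta * lam * w
    return w

print(abs(run_gd(eta=0.5)) < 1e-6)    # eta < 2/lam = 1.0: converges
print(abs(run_gd(eta=1.1)) > 1e6)     # eta > 2/lam: oscillates and diverges
```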
Condition Number and Loss Surface
Direction with Smallest Eigenvalue
Direction with Largest Eigenvalue
Issues with Gradient Descent (GD)
GD converges slowly when the error surface is relatively flat.
GD oscillates when the error surface is poorly conditioned.
Demo
Batch Gradient Descent
Momentum
How Momentum Helps: Flat Directions
Geometric Series
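The geometric-series argument can be checked directly: with a constant gradient g along a flat direction, the momentum velocity sums a geometric series, giving an effective step of eta * g / (1 - beta). The update convention (v <- beta*v + g) and constants below are illustrative:

```python
# Momentum in a flat direction: with v_{t+1} = beta * v_t + g and a constant
# gradient g, the velocity sums g * (1 + beta + beta^2 + ...), so the effective
# step approaches eta * g / (1 - beta) -- much larger than plain GD's eta * g.

eta, beta, g = 0.1, 0.9, 1.0
v = 0.0
for _ in range(200):
    v = beta * v + g                 # momentum accumulates the constant gradient

effective_step = eta * v
print(round(effective_step, 4))      # approaches eta * g / (1 - beta) = 1.0
```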
How Momentum Helps: Oscillations
Alternating Geometric Series
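The companion calculation for oscillations: a gradient that flips sign every step sums an alternating geometric series, shrinking the effective step toward eta * g / (1 + beta). Same illustrative momentum convention and constants as above:

```python
# Momentum across a steep valley: if the gradient alternates g_t = (-1)^t * g,
# the velocity v_{t+1} = beta * v_t + g_t sums an alternating geometric series,
# so |v| settles near g / (1 + beta) -- the oscillation is damped.

eta, beta, g = 0.1, 0.9, 1.0
v = 0.0
for t in range(1000):
    v = beta * v + ((-1) ** t) * g

# Effective step magnitude vs. the series limit eta * g / (1 + beta)
print(round(abs(eta * v), 4), round(eta * g / (1 + beta), 4))
```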
Demo
Mini-batch Gradient Descent
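A minimal sketch of mini-batch gradient descent on synthetic least squares; the batch size, learning rate, step count, and data are all illustrative choices, not the demo notebook's:

```python
import numpy as np

# Mini-batch gradient descent: each step estimates the gradient from a
# random subset of examples instead of the full dataset. Noiseless synthetic
# data, so the iterates converge to the true weights.

rng = np.random.default_rng(0)
n, d, batch_size = 200, 2, 20
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -1.0])
y = X @ w_true

w = np.zeros(d)
eta = 0.05
for _ in range(2000):
    idx = rng.choice(n, size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)      # mini-batch gradient
    w = w - eta * grad

print(np.round(w, 2))                                     # near w_true = [1, -1]
```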
Optimization and Gradient Descent
Lecture 12
Credit: Joseph E. Gonzalez and Narges Norouzi
Reference book chapter: Chapter 7