Join at slido.com #1846379
Optimization and Gradient Descent
Lecture 12
The core algorithm of modern machine learning
EECS 189/289, Fall 2025 @ UC Berkeley
Joseph E. Gonzalez and Narges Norouzi
Key Factors in the Generative AI Revolution
Data
Compute
Automatic Diff.
Loss Functions
Stochastic Gradient Descent
Robbins–Monro 1951
Neural
Architectures
The General Optimization Problem
The General Optimization Problem
Most problems in ML can be reduced to the following general optimization problem
Solution (learned parameters) = the argmin of the objective function over the constraint set.
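As a toy numerical sketch of this setup (objective, constraint set, learned solution), here is projected gradient descent on an illustrative one-dimensional problem; the objective (w - 3)^2 and the constraint w >= 0 are made-up examples, not from the lecture:

```python
# Minimal sketch of the general problem: minimize an objective f(w)
# over a constraint set (here {w >= 0}) by taking gradient steps and
# projecting back onto the constraints. All choices are illustrative.

def f(w):
    return (w - 3.0) ** 2          # objective function

def grad_f(w):
    return 2.0 * (w - 3.0)         # its gradient

def project(w):
    return max(w, 0.0)             # projection onto the constraint set {w >= 0}

w = -5.0                            # initial (infeasible) guess
for _ in range(200):
    w = project(w - 0.1 * grad_f(w))

print(round(w, 4))                  # solution (learned parameter), approx. 3.0
```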
Optimization Constraints
The constraints define the valid set of parameters.
The Objective Function
The shape of the objective function plays a big role in the complexity of solving the optimization problem.
Example Optimization Problems
How do we solve these optimization problems?
What properties of the objective function make it “easier” to solve?
Discrete objectives
Objectives with multiple minima (global vs. local minima)
Plots in demo notebook.
Convex Functions
Non-Convex:
Convex:
In general, there are efficient algorithms to reliably minimize convex functions.
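One way to probe convexity numerically is the midpoint form of Jensen's inequality, f((x + y)/2) <= (f(x) + f(y))/2. This sketch samples random pairs; the two test functions are illustrative, and passing the check is evidence of convexity rather than a proof:

```python
import math
import random

# Midpoint convexity check: f((x + y) / 2) <= (f(x) + f(y)) / 2 for all x, y.
# Sampling random pairs can only find counterexamples, not certify convexity.

def midpoint_convex(f, trials=1000, lo=-5.0, hi=5.0):
    rng = random.Random(0)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if f((x + y) / 2) > (f(x) + f(y)) / 2 + 1e-12:
            return False   # found a violating pair: not convex
    return True            # no violation among the samples

print(midpoint_convex(lambda x: x ** 2))   # convex parabola
print(midpoint_convex(math.sin))           # non-convex
```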
Convex (Constraint) Sets
Convex Set
Non-convex Set
In general, there are efficient algorithms to reliably minimize convex functions over convex sets.
Optimization Problems
Optimization of a convex objective function over a convex constraint set.
Many classic ML problems are convex.
Most of modern ML (deep learning) is not convex.
Convex Function
Convex Constraints
Minimizing the Error & Optimizing in Weight-Space
Points on the Error Surface
Weight Space
Data
Plots in demo notebook.
Demo
Understanding the Error Surface
Slido Matching
Match the model with the loss surface (A, B, or C):
Model 1: Logistic Regression
Model 2: Least Squares Regression
Model 3: Logistic Regression
Calculus and Optimization
The Gradient
Gradient Exercise
Plots in demo notebook.
The gradient points in the direction of greatest ascent.
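A small numerical check of this claim: for an illustrative quadratic E (not from the demo), the directional derivative is maximized along the normalized gradient, and its maximum value equals the gradient's norm:

```python
import math

# For E(w1, w2) = w1**2 + 3 * w2**2 (illustrative), compare the directional
# derivative <grad E(w), u> along the normalized gradient with many other
# unit directions. The gradient direction gives the largest rate of increase.

def grad_E(w):
    return [2 * w[0], 6 * w[1]]

w = [1.0, 1.0]
g = grad_E(w)
norm = math.hypot(*g)
g_hat = [g[0] / norm, g[1] / norm]     # unit vector along the gradient

def directional_derivative(u):
    return g[0] * u[0] + g[1] * u[1]   # <grad E(w), u>

best = directional_derivative(g_hat)
for k in range(360):                   # sweep unit directions around the circle
    theta = 2 * math.pi * k / 360
    u = [math.cos(theta), math.sin(theta)]
    assert directional_derivative(u) <= best + 1e-9

print(round(best, 4))                  # equals ||grad E(w)|| = sqrt(4 + 36)
```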
Demo
The Gradient of the Error Function
If w is a d-dimensional vector and E(w) is a scalar-valued function, what is the dimensionality of the gradient ∇E(w) with respect to w?
Common Confusion
The gradient is in the domain of the function.
Not the gradient!
Stationary Points and Zero Gradient
Any point where the gradient is zero is a stationary point.
Minimum
Maximum
Saddle Point
Plots in demo notebook.
Analytical Optimization using Gradients
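For least squares, setting the gradient to zero can be solved in closed form via the normal equations; a sketch on synthetic data (the data and dimensions are illustrative, not from the demo notebook):

```python
import numpy as np

# Analytical optimization sketch: for E(w) = ||Xw - y||^2, the gradient is
# 2 X^T (Xw - y); setting it to zero gives the normal equations
# X^T X w = X^T y. Synthetic, noiseless data, so w_true is recoverable.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_star = np.linalg.solve(X.T @ X, X.T @ y)          # stationary point of E
print(np.allclose(w_star, w_true))                  # recovers the true weights
print(np.allclose(2 * X.T @ (X @ w_star - y), 0))   # gradient vanishes there
```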
Gradient Descent
Iterative optimization
Iterative Optimization Algorithms
new estimate = old estimate + update (computed from the gradient)
Intuition
Try the loss game (it's free)!
Error
Goal: Minimize the loss by turning the knobs.
What if we knew which way to turn the knob, and roughly how far?
This is the Gradient!
This is gradient descent!
Gradient Descent Intuition
Error
Goal:
Initial estimate: often a random guess.
Gradient Descent Intuition
[Animation: successive loss evaluations follow the gradient down the error curve.]
The (Batch) Gradient Descent Alg.
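A minimal sketch of the algorithm on a synthetic least-squares problem; the data, learning rate, and iteration count are illustrative choices, not the lecture's demo:

```python
import numpy as np

# Batch gradient descent on E(w) = (1/n) * ||Xw - y||^2 with synthetic,
# noiseless data. Each step uses the gradient over the FULL dataset.

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true

def grad_E(w):
    return (2.0 / n) * X.T @ (X @ w - y)   # full-batch gradient

w = np.zeros(d)                            # initial estimate
eta = 0.1                                  # learning rate
for _ in range(500):
    w = w - eta * grad_E(w)                # new estimate = old - eta * gradient

print(np.round(w, 4))                      # approaches w_true = [2, -1]
```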
Demo
Batch Gradient Descent
Loss (Error) Curves
The loss or error curve plots the error as a function of the number of iterations (steps) of gradient descent.
Training and Validation Error
Training Error
Validation Error
Overfitting
Best Model
Validation curves are typically above the training curves.
You can overoptimize the objective (overfit).
Choose parameters with lowest validation error.
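The model-selection rule above can be sketched as follows, tracking validation error per step and keeping the best snapshot; the error curves here are synthetic illustrative numbers, not real training runs:

```python
# "Choose parameters with lowest validation error": track validation error
# per iteration and keep the step where it bottoms out. Synthetic curves:
# training error keeps decreasing, validation error turns up (overfitting).

train_err = [1.0 / (t + 1) for t in range(50)]              # keeps decreasing
val_err = [1.0 / (t + 1) + 0.0005 * t for t in range(50)]   # decreases, then rises

best_step = min(range(50), key=lambda t: val_err[t])        # best model snapshot
print(best_step, round(val_err[best_step], 4))
```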
Convergence Assuming Quadratic Approximation
For a simple quadratic objective (e.g., x^2), is gradient descent guaranteed to converge?
Quadratic Approximations
Taylor Expansion of the Error Surface
Matrix Equivalent of the Second Derivative
Hessians (the 2nd derivative)
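The Hessian collects all second partial derivatives. A finite-difference sketch for an illustrative quadratic E(w) = 0.5 w^T A w, whose Hessian is exactly the matrix A:

```python
import numpy as np

# Finite-difference Hessian of E(w) = 0.5 * w^T A w for a fixed symmetric A
# (illustrative choice); the computation should recover A itself.

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

def E(w):
    return 0.5 * w @ A @ w

def hessian_fd(f, w, h=1e-4):
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.eye(d)[i] * h
            e_j = np.eye(d)[j] * h
            # central second difference for d^2 f / (dw_i dw_j)
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

H = hessian_fd(E, np.zeros(2))
print(np.round(H, 4))                  # matches A
```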
Stopped Here
Will return to later material in the next lecture.
Hessian Exercise
Plots in demo notebook.
Hessian
Taylor Expansion of the Error Surface
Use a change of variables to better understand this part.
Eigen Decomposition of the Hessian
Show it!
Eigen Decomposition of the Hessian
Show it!
Derivation:
Eigenvector Defn.
Orthonormality
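Both facts in the derivation (the eigenvector definition and orthonormality) can be checked numerically; the symmetric matrix below is an illustrative stand-in for a Hessian, and `numpy.linalg.eigh` returns orthonormal eigenvectors for symmetric input:

```python
import numpy as np

# Eigendecomposition H = U diag(lambda) U^T of a symmetric Hessian
# (illustrative matrix), verifying the two facts used in the derivation.

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])

lam, U = np.linalg.eigh(H)      # eigenvalues ascending; columns = eigenvectors

print(np.allclose(U @ np.diag(lam) @ U.T, H))   # reconstruction of H
print(np.allclose(U.T @ U, np.eye(2)))          # orthonormality: U^T U = I
for i in range(2):
    # eigenvector definition: H u_i = lambda_i u_i
    print(np.allclose(H @ U[:, i], lam[i] * U[:, i]))
```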
Taylor Expansion at the Stationary Pts.
Stationary Points and the Hessian
Eigenvalues of the Hessian determine the curvature at the stationary points.
Minimum
Maximum
Saddle Point
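This classification rule can be sketched directly from the eigenvalue signs; the three example Hessians are illustrative:

```python
import numpy as np

# Classify a stationary point by the signs of the Hessian's eigenvalues:
# all positive -> minimum, all negative -> maximum, mixed -> saddle point.

def classify(H):
    lam = np.linalg.eigvalsh(H)
    if np.all(lam > 0):
        return "minimum"
    if np.all(lam < 0):
        return "maximum"
    return "saddle point"

print(classify(np.array([[2.0, 0.0], [0.0, 1.0]])))    # minimum
print(classify(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle point
```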
Eigenvector of the Hessian
Elliptical contours of constant error align with the eigenvectors of the Hessian matrix.
Plots in demo notebook.
Direction with Smallest Eigenvalue
Direction with Largest Eigenvalue
Demo
Taylor Expansions and the Hessian
Recap: Quadratic Approx. Error
Grad. Descent on the Quadratic Approx.
Quadratic Approx. Error
Grad. Descent on the Quadratic Approx.
Learning Rate Impact on Convergence
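On a one-dimensional quadratic the learning-rate effect is explicit: each step multiplies the error by (1 - eta * lambda), so GD converges iff 0 < eta < 2/lambda. A sketch with illustrative constants:

```python
# Gradient descent on E(w) = 0.5 * lam * w**2 (illustrative 1D quadratic).
# The update w <- w - eta * lam * w scales w by (1 - eta * lam) each step,
# so GD converges iff |1 - eta * lam| < 1, i.e. 0 < eta < 2 / lam.

def run_gd(eta, lam=2.0, w0=1.0, steps=100):
    w = w0
    for _ in range(steps):
        w = w - eta * lam * w
    return w

print(abs(run_gd(eta=0.5)) < 1e-6)    # eta < 2/lam = 1.0: converges
print(abs(run_gd(eta=1.1)) > 1e6)     # eta > 2/lam: oscillates and diverges
```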
Condition Number and Loss Surface
Direction with Smallest Eigenvalue
Direction with Largest Eigenvalue
Issues with Gradient Descent (GD)
GD converges slowly when the error surface is relatively flat.
GD oscillates when the error surface is poorly conditioned.
Demo
Batch Gradient Descent
Momentum
How Momentum Helps: Flat Directions
Geometric Series
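The geometric-series argument can be checked directly: with a constant gradient g along a flat direction, the momentum velocity sums a geometric series, giving an effective step of eta * g / (1 - beta). The update convention (v <- beta*v + g) and constants below are illustrative:

```python
# Momentum in a flat direction: with v_{t+1} = beta * v_t + g and a constant
# gradient g, the velocity sums g * (1 + beta + beta^2 + ...), so the effective
# step approaches eta * g / (1 - beta) -- much larger than plain GD's eta * g.

eta, beta, g = 0.1, 0.9, 1.0
v = 0.0
for _ in range(200):
    v = beta * v + g                 # momentum accumulates the constant gradient

effective_step = eta * v
print(round(effective_step, 4))      # approaches eta * g / (1 - beta) = 1.0
```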
How Momentum Helps: Oscillations
Alternating Geometric Series
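The companion calculation for oscillations: a gradient that flips sign every step sums an alternating geometric series, shrinking the effective step toward eta * g / (1 + beta). Same illustrative momentum convention and constants as above:

```python
# Momentum across a steep valley: if the gradient alternates g_t = (-1)^t * g,
# the velocity v_{t+1} = beta * v_t + g_t sums an alternating geometric series,
# so |v| settles near g / (1 + beta) -- the oscillation is damped.

eta, beta, g = 0.1, 0.9, 1.0
v = 0.0
for t in range(1000):
    v = beta * v + ((-1) ** t) * g

# Effective step magnitude vs. the series limit eta * g / (1 + beta)
print(round(abs(eta * v), 4), round(eta * g / (1 + beta), 4))
```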
Demo
Mini-batch Gradient Descent
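A minimal sketch of mini-batch gradient descent on synthetic least squares; the batch size, learning rate, step count, and data are all illustrative choices, not the demo notebook's:

```python
import numpy as np

# Mini-batch gradient descent: each step estimates the gradient from a
# random subset of examples instead of the full dataset. Noiseless synthetic
# data, so the iterates converge to the true weights.

rng = np.random.default_rng(0)
n, d, batch_size = 200, 2, 20
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -1.0])
y = X @ w_true

w = np.zeros(d)
eta = 0.05
for _ in range(2000):
    idx = rng.choice(n, size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)      # mini-batch gradient
    w = w - eta * grad

print(np.round(w, 2))                                     # near w_true = [1, -1]
```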
Optimization and Gradient Descent
Lecture 12
Credit: Joseph E. Gonzalez and Narges Norouzi
Reference book chapter: Chapter 7