1 of 18

Gradient Descent and Logistic Regression

CSE 447 / 517

January 16th, 2025 (Week 2)

2 of 18

Logistics

Submit the Academic Integrity Form on Canvas
Assignment 1 (A1) is due on Wednesday, 1/22
No lecture next Monday, 1/20

3 of 18

Agenda

Gradient Descent
Stochastic Gradient Descent
Logistic Regression

4 of 18

Feature Vectors

The features fully determine what a learned model “sees” about an example.
We often stack the features into a feature vector:

$\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_d(x)]^\top$

which “embeds” the input x in d-dimensional space.

Example features: word frequencies, inverse document frequency (idf), and so on. You can stack them into a feature vector!
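A minimal sketch of stacking word-frequency features into a vector. The vocabulary and the helper name `feature_vector` are made up for illustration; a real feature set would be chosen for the task.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each word defines one feature dimension.
VOCAB = ["good", "bad", "movie", "plot"]

def feature_vector(text):
    """Map a string to a d-dimensional list of relative word frequencies."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in VOCAB]

phi = feature_vector("good movie good plot")
print(phi)  # one number per vocabulary word, so d = len(VOCAB)
```

Here d is 4, and every input string lands at some point in that 4-dimensional space.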

5 of 18

Gradient Descent

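Goal: Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, find the weights $\theta^*$ by maximum likelihood estimation:

$\theta^* = \arg\max_\theta \prod_{i=1}^{n} p_{LR}(y_i \mid x_i; \theta) = \arg\min_\theta \, -\sum_{i=1}^{n} \log p_{LR}(y_i \mid x_i; \theta)$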

6 of 18

Gradient Descent

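Goal: Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, find the weights $\theta^*$ by maximum likelihood estimation.

Apply the definition of $p_{LR}$ to write the objective in terms of $\theta$ and $\phi(x)$.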
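One common way to write this out over n training pairs $(x_i, y_i)$. This is a sketch under two assumptions not stated on the slide: labels $y_i \in \{0, 1\}$, and $p_{LR}(Y = 1 \mid x; \theta) = \sigma(\theta^\top \phi(x))$ with $\sigma(z) = 1 / (1 + e^{-z})$. A different label convention gives a different-looking but equivalent expression.

```latex
\begin{aligned}
\theta^* &= \arg\max_\theta \prod_{i=1}^{n} p_{LR}(y_i \mid x_i; \theta) \\
&= \arg\min_\theta \; -\sum_{i=1}^{n} \log p_{LR}(y_i \mid x_i; \theta) \\
&= \arg\min_\theta \; -\sum_{i=1}^{n} \Big[ y_i \log \sigma\big(\theta^\top \phi(x_i)\big)
   + (1 - y_i) \log\big(1 - \sigma(\theta^\top \phi(x_i))\big) \Big]
\end{aligned}
```

The quantity being minimized in the last line is the negative log-likelihood, which serves as the loss.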

7 of 18

Gradient Descent

Big idea: minimize the loss by “optimization along the (negative) gradient”.

See Eisenstein pg. 37

[Figure: loss plotted against θ]

The gradient (in one dimension, the first derivative) is the slope of the loss curve.
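To make the "slope" picture concrete, here is a small sketch on a made-up one-dimensional loss. The function `loss`, the step size 0.1, and the finite-difference helper `slope` are all illustrative assumptions, not part of the lecture.

```python
def loss(theta):
    """Toy 1-D convex loss (hypothetical), minimized at theta = 3."""
    return (theta - 3.0) ** 2

def slope(f, theta, eps=1e-6):
    """Central finite-difference estimate of the derivative of f at theta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta = 0.0
g = slope(loss, theta)        # negative here: the loss falls as theta grows
theta_new = theta - 0.1 * g   # step against the slope
print(g, loss(theta), loss(theta_new))
```

Because the slope at theta = 0 is negative, stepping along the negative gradient increases theta and lowers the loss.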

8 of 18

Gradient Descent

Step 1: finding the gradient.

Start from the loss function:

Differentiate with respect to the parameters:

9 of 18

Gradient Descent

Step 1: finding the gradient.

Simplify the gradient:
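A worked version of Step 1, as a sketch. It assumes labels $y_i \in \{0, 1\}$ and $p_{LR}(Y = 1 \mid x; \theta) = \sigma(\theta^\top \phi(x))$; the name $L(\theta)$ for the loss and the shorthand $z_i$ are introduced here for convenience. It uses the facts $\frac{d}{dz} \log \sigma(z) = 1 - \sigma(z)$ and $\frac{d}{dz} \log(1 - \sigma(z)) = -\sigma(z)$.

```latex
\begin{aligned}
L(\theta) &= -\sum_{i=1}^{n} \Big[ y_i \log \sigma(z_i) + (1 - y_i) \log\big(1 - \sigma(z_i)\big) \Big],
  \qquad z_i = \theta^\top \phi(x_i) \\
\nabla_\theta L(\theta) &= -\sum_{i=1}^{n} \Big[ y_i \big(1 - \sigma(z_i)\big) - (1 - y_i)\, \sigma(z_i) \Big] \phi(x_i) \\
&= \sum_{i=1}^{n} \big( \sigma(z_i) - y_i \big)\, \phi(x_i)
\end{aligned}
```

Each example contributes its feature vector scaled by the prediction error $\sigma(z_i) - y_i$.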

10 of 18

Gradient Descent

Step 2: take a step.

Update the parameters:

$\theta \leftarrow \theta - \alpha \, \nabla_\theta \, \mathrm{loss}(\theta)$

where α is the learning rate.

Step 3: repeat Steps 1-2 until convergence (i.e., the loss essentially stops decreasing).
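Steps 1 to 3 can be sketched in plain Python as follows. This is an illustration under assumptions: labels in {0, 1}, a made-up toy dataset, and a stopping rule (`tol`) chosen arbitrarily; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(theta, data):
    """Negative log-likelihood and its gradient, assuming labels in {0, 1}."""
    loss, grad = 0.0, [0.0] * len(theta)
    for phi, y in data:
        p = sigmoid(sum(t * f for t, f in zip(theta, phi)))
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
        for j, f in enumerate(phi):
            grad[j] += (p - y) * f
    return loss, grad

def gradient_descent(data, alpha=0.1, tol=1e-6, max_steps=10_000):
    theta = [0.0] * len(data[0][0])
    prev_loss = float("inf")
    for _ in range(max_steps):
        loss, grad = loss_and_grad(theta, data)               # Step 1
        if prev_loss - loss < tol:                            # Step 3: converged
            break
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # Step 2
        prev_loss = loss
    return theta, loss

# Toy data (made up): feature vectors [bias, x] paired with a 0/1 label.
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 0.5], 1),
        ([1.0, 1.0], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]
theta, final_loss = gradient_descent(data)
print(theta, final_loss)
```

The toy data is deliberately not linearly separable, so the weights settle at finite values instead of growing without bound.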

11 of 18

Gradient Descent

Things to consider: how do we choose the learning rate? It is another hyperparameter!

From https://www.jeremyjordan.me/nn-learning-rate/
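A quick way to see the effect of the learning rate, using a made-up one-dimensional loss θ² (gradient 2θ). The three values of `alpha` are arbitrary choices for illustration.

```python
def run(alpha, steps=20, theta=1.0):
    """Gradient descent on the toy loss theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

for alpha in (0.001, 0.1, 1.1):
    print(alpha, run(alpha))
```

With the smallest rate, θ barely moves from its start; with the middle rate it approaches the minimum at 0; with the largest rate each step overshoots and θ grows in magnitude.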

12 of 18

*Stochastic* Gradient Descent
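Stochastic gradient descent replaces the gradient over the whole dataset with the gradient on a single example per update. A minimal sketch, assuming labels in {0, 1}, a made-up toy dataset, and one common variant in which the examples are reshuffled every epoch; the hyperparameter values are arbitrary.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd(data, alpha=0.05, epochs=200, seed=0):
    """SGD for logistic regression, assuming labels in {0, 1}."""
    rng = random.Random(seed)
    theta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            phi, y = data[i]
            p = sigmoid(sum(t * f for t, f in zip(theta, phi)))
            # Gradient of the loss on this single example only.
            theta = [t - alpha * (p - y) * f for t, f in zip(theta, phi)]
    return theta

# Toy data (made up): feature vectors [bias, x] paired with a 0/1 label.
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 0.5], 1),
        ([1.0, 1.0], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]
theta = sgd(data)
print(theta)
```

Each update is cheap but noisy, so the weights jitter around the minimum rather than settling exactly on it.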

13 of 18

Stochastic Gradient Descent

We can prove that SGD will eventually get very close to a global minimum of a convex objective function. What do you think will happen if we apply SGD to a function that is not convex?

Image source: Wikipedia

14 of 18

Stochastic Gradient Descent

On a non-convex function, SGD tends to converge to a local minimum; it makes no guarantee of reaching a global minimum.

Image source: Wikipedia
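A small illustration of getting stuck in a local minimum. To keep the run deterministic, this sketch uses plain (non-stochastic) gradient descent on a made-up one-dimensional function with two minima; the same sensitivity to the starting point applies to SGD.

```python
def f(theta):
    """Non-convex toy function (hypothetical) with two minima."""
    return theta ** 4 - 3 * theta ** 2 + theta

def f_prime(theta):
    return 4 * theta ** 3 - 6 * theta + 1

def descend(theta, alpha=0.01, steps=1000):
    for _ in range(steps):
        theta = theta - alpha * f_prime(theta)
    return theta

right = descend(2.0)    # starts on the right, settles in the nearer minimum
left = descend(-2.0)    # starts on the left, settles in the other minimum
print(right, f(right))
print(left, f(left))
```

The two runs end in different minima, and only the run started from the left reaches the lower one.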

15 of 18

Logistic Regression

A logistic regression model usually has:

- A collection of feature functions, denoted $\phi_1, \ldots, \phi_d$, each mapping an input x to a real number
- A coefficient or “weight” for every feature, denoted $\theta_1, \ldots, \theta_d$, each a real number
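The two ingredients can be sketched directly in Python. The particular feature functions and weight values below are invented for illustration, and the weighted sum is assumed to be the model's score.

```python
# Hypothetical feature functions: each maps an input string to a real number.
feature_functions = [
    lambda x: 1.0,                              # bias feature, always on
    lambda x: float(x.lower().count("great")),  # occurrences of "great"
    lambda x: float(x.lower().count("awful")),  # occurrences of "awful"
]

# One weight per feature (values made up for illustration).
theta = [-0.5, 2.0, -3.0]

def score(x):
    """Weighted sum of the features: theta . phi(x)."""
    return sum(t * f(x) for t, f in zip(theta, feature_functions))

print(score("A great, great film"))  # -0.5 + 2.0 * 2
print(score("Awful pacing"))         # -0.5 - 3.0 * 1
```

A positive weight pushes the score up when its feature fires, and a negative weight pushes it down.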

16 of 18

Binary Logistic Regression

The label values are arbitrary and can be changed, as long as the classify() function is modified accordingly!
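A sketch of what "modified accordingly" can look like. The signature of `classify` below, the `labels` parameter, and the 0.5 decision threshold are assumptions for illustration, not the lecture's definition.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, labels=(0, 1)):
    """Return labels[1] if p(positive) >= 0.5, else labels[0].

    `labels` is a (negative, positive) pair; swapping in (-1, +1) or
    ("neg", "pos") changes only what is returned, not the decision rule.
    """
    negative, positive = labels
    return positive if sigmoid(score) >= 0.5 else negative

print(classify(2.0), classify(-2.0))
print(classify(-2.0, labels=(-1, +1)))
print(classify(2.0, labels=("neg", "pos")))
```

The decision itself depends only on the score; the label set is just a naming choice.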

17 of 18

Binary Logistic Regression

from Lecture Slide 40

apply the definition of the score function

Symbol	Definition	Scalar / Vector
x	Input	Vector
y	Output	Scalar
𝞱	Parameters	Vector
𝜙(x)	Feature vector (Lecture Slide 31)	Vector

apply the definition of the standard logistic function
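The two substitutions named on this slide can be written out as follows. This is a sketch assuming the positive label is 1 and the score function is linear, $\mathrm{score}(x; \theta) = \theta^\top \phi(x)$; the lecture's exact notation may differ.

```latex
\begin{aligned}
p_{LR}(Y = 1 \mid x; \theta) &= \sigma\big(\mathrm{score}(x; \theta)\big) \\
&= \sigma\big(\theta^\top \phi(x)\big) \\
&= \frac{1}{1 + \exp\big(-\theta^\top \phi(x)\big)}
\end{aligned}
```

The first step applies the definition of the score function and the second applies the definition of the standard logistic function, so the result is always a probability strictly between 0 and 1.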

18 of 18

CSE447: Project 0, Python and PyTorch Tutorial + Review

https://colab.research.google.com/drive/1PAUlmIZMcxsKME0UlBCLf8HtQU2rcs5Q