1 of 18

Gradient Descent and Logistic Regression

CSE 447 / 517

January 16th, 2025 (Week 2)

2 of 18

Logistics

  • Submit the Academic Integrity Form on Canvas
  • Assignment 1 (A1) is due on Wednesday, 1/22
  • No Lecture next Monday 1/20

3 of 18

Agenda

  • Gradient Descent
  • Stochastic Gradient Descent
  • Logistic Regression

4 of 18

Feature Vectors

  • The features fully determine what a learned model “sees” about an example.
  • We often stack the features into a feature vector:

      𝜙(x) = [𝜙_1(x), 𝜙_2(x), …, 𝜙_d(x)]

    which “embeds” the input x in d-dimensional space.

  • Example features: word frequencies, idf, … These can be stacked into a single feature vector!
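
As a minimal sketch of stacking features into a vector, the snippet below uses relative word frequencies over a small fixed vocabulary. The vocabulary `VOCAB` and the function name `phi` are hypothetical choices made for illustration, not definitions from the lecture.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each word defines one feature dimension.
VOCAB = ["good", "bad", "movie", "not"]

def phi(text):
    """Map a text to a d-dimensional vector of relative word frequencies."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in VOCAB]

vec = phi("good movie not bad movie")
print(vec)  # one entry per vocabulary word, so d = len(VOCAB)
```

Here d equals the vocabulary size; other features (such as idf-weighted counts) could be appended to the same list.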

5 of 18

Gradient Descent

Goal: Given a dataset of n labeled examples (x_i, y_i), find the weights θ* by maximum likelihood estimation.

apply the definition of pLR
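
The objective on this slide did not survive extraction. A standard way to write maximum likelihood estimation for this setting is sketched below; the index i, the count n, and the notation p_LR(y | x; θ) are assumptions about the lecture's notation.

```latex
\theta^* = \arg\max_{\theta} \prod_{i=1}^{n} p_{\mathrm{LR}}(y_i \mid x_i; \theta)
         = \arg\min_{\theta} \; -\sum_{i=1}^{n} \log p_{\mathrm{LR}}(y_i \mid x_i; \theta)
```

Taking the log turns the product into a sum, and negating it turns maximization into minimizing a loss.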

7 of 18

Gradient Descent

Big idea: minimize the loss by “optimization along the (negative) gradient”.

See Eisenstein pg. 37

[Figure: the loss plotted as a function of θ]

The “gradient” (in one dimension, the first derivative) is the slope of the loss at the current θ.

8 of 18

Gradient Descent

Step 1: finding the gradient.

Start from the loss function:

Differentiate with respect to the parameters:

9 of 18

Gradient Descent

Step 1: finding the gradient.

Simplify the gradient:
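
The equations for Step 1 are missing from the extracted slides. The derivation below is one standard version, assuming labels y_i in {0, 1}, the negative log-likelihood as the loss (written here as L), and the logistic function σ; the lecture may use a different label convention.

```latex
% Loss: negative log-likelihood, with z_i = \theta \cdot \phi(x_i)
\mathcal{L}(\theta) = -\sum_{i=1}^{n} \Big[ y_i \log \sigma(z_i) + (1 - y_i) \log\big(1 - \sigma(z_i)\big) \Big]

% Differentiate, using \sigma'(z) = \sigma(z)\,(1 - \sigma(z))
\nabla_\theta \mathcal{L}(\theta) = -\sum_{i=1}^{n} \Big[ y_i \big(1 - \sigma(z_i)\big) - (1 - y_i)\,\sigma(z_i) \Big]\, \phi(x_i)

% Simplify
\nabla_\theta \mathcal{L}(\theta) = \sum_{i=1}^{n} \big( \sigma(z_i) - y_i \big)\, \phi(x_i)
```

Each example contributes its feature vector scaled by the prediction error σ(z_i) − y_i.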

10 of 18

Gradient Descent

Step 2: take a step.

Update the parameters:

    θ ← θ − α · ∇θ loss(θ)

where α is the learning rate.

Step 3: repeat Steps 1-2 until convergence (i.e., the loss essentially stops decreasing).
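
The three steps can be sketched in plain Python as follows. This is an illustrative sketch, not the course's reference implementation: it assumes labels in {0, 1}, the negative log-likelihood loss, and a made-up toy dataset whose first feature is a constant bias of 1. The names `gradient_descent`, `tol`, and `max_steps` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(theta, x):
    return sum(t * xj for t, xj in zip(theta, x))

def loss(theta, X, Y):
    """Negative log-likelihood for binary labels in {0, 1} (an assumption)."""
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(dot(theta, x))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(theta, X, Y):
    """Sum over examples of (sigmoid(theta . x) - y) * x."""
    grad = [0.0] * len(theta)
    for x, y in zip(X, Y):
        error = sigmoid(dot(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad

def gradient_descent(X, Y, alpha=0.1, tol=1e-6, max_steps=10000):
    theta = [0.0] * len(X[0])
    previous = loss(theta, X, Y)
    for _ in range(max_steps):
        grad = gradient(theta, X, Y)                            # Step 1
        theta = [t - alpha * g for t, g in zip(theta, grad)]    # Step 2
        current = loss(theta, X, Y)
        if previous - current < tol:                            # Step 3
            break
        previous = current
    return theta

# Made-up toy data: feature vectors are [bias, value]; not linearly separable.
X = [[1, 0.5], [1, 1.5], [1, 1.0], [1, 2.0], [1, 2.5], [1, 3.0]]
Y = [0, 0, 1, 0, 1, 1]
theta_hat = gradient_descent(X, Y)
print(theta_hat, loss(theta_hat, X, Y))
```

The stopping test compares consecutive loss values, which is one simple reading of "the loss essentially stops decreasing".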

11 of 18

Gradient Descent

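Things to consider: how do we choose the learning rate? It is another hyperparameter! Too small a rate converges slowly; too large a rate can overshoot the minimum and diverge.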

From https://www.jeremyjordan.me/nn-learning-rate/
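
To make the effect of the learning rate concrete, the sketch below runs gradient descent on the toy function f(θ) = θ², whose gradient is 2θ. The function and the three rates are chosen purely for illustration and do not come from the lecture.

```python
def run_gradient_descent(alpha, theta=1.0, steps=20):
    """Minimize f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

for alpha in (0.01, 0.4, 1.1):
    final = run_gradient_descent(alpha)
    print(f"alpha={alpha}: theta after 20 steps = {final:.4g}")
```

With the small rate θ barely moves toward the minimum at 0, with the moderate rate it gets there quickly, and with the large rate each step overshoots and |θ| grows.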

12 of 18

*Stochastic* Gradient Descent
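
The content of this slide did not survive extraction. As a hedged sketch of one common formulation, stochastic gradient descent updates the parameters using the gradient of a single randomly chosen example rather than the whole dataset. The code assumes binary labels in {0, 1} and a made-up toy dataset; the function name `sgd` and its parameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd(X, Y, alpha=0.05, epochs=200, seed=0):
    """Logistic regression trained with one-example-at-a-time updates."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a fresh random order
        for i in indices:
            z = sum(t * xj for t, xj in zip(theta, X[i]))
            error = sigmoid(z) - Y[i]  # gradient factor for this one example
            theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta

# Made-up toy data: feature vectors are [bias, value].
X = [[1, 0.5], [1, 1.5], [1, 1.0], [1, 2.0], [1, 2.5], [1, 3.0]]
Y = [0, 0, 1, 0, 1, 1]
theta_sgd = sgd(X, Y)
print(theta_sgd)
```

Each update is cheap but noisy, so the parameters jitter around the minimum instead of settling exactly on it.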

13 of 18

Stochastic Gradient Descent

We can prove that SGD will eventually get very close to a global minimum of a convex objective function. What do you think will happen if we apply SGD to a function that is not convex?

On a non-convex function, SGD tends to converge to a local minimum; it gives no guarantee of reaching a global minimum.

Image source: Wikipedia
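
The dependence on the starting point can be seen on a small non-convex function. The sketch below uses f(θ) = θ⁴ − 3θ² + θ, chosen only for illustration, and plain (non-stochastic) gradient descent so that the run is deterministic; SGD adds noise to each step but comes with the same lack of a global guarantee.

```python
def f(theta):
    return theta**4 - 3 * theta**2 + theta

def f_prime(theta):
    return 4 * theta**3 - 6 * theta + 1

def descend(theta, alpha=0.01, steps=1000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        theta -= alpha * f_prime(theta)
    return theta

left = descend(-2.0)   # starts left of both minima
right = descend(2.0)   # starts right of both minima
print(f"start -2.0 -> theta={left:.3f}, f={f(left):.3f}")
print(f"start  2.0 -> theta={right:.3f}, f={f(right):.3f}")
```

The two runs end in different minima, and only the one started at −2.0 finds the lower (global) one.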

15 of 18

Logistic Regression

A logistic regression model usually has:

- A collection of feature functions, denoted 𝜙_1, …, 𝜙_d, each mapping an input x to a real number

- A coefficient or “weight” for every feature, denoted θ_1, …, θ_d, each a real number

16 of 18

Binary Logistic Regression

The labels are arbitrary and can be changed, as long as the classify() function is modified accordingly!
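
The slide's definition of classify() is not preserved, so the following is only a sketch of how such a function might look. It assumes the score is the dot product θ · 𝜙(x), a decision threshold of 0.5, and a `labels` parameter (hypothetical) that makes the choice of label names explicit.

```python
import math

def score(theta, phi_x):
    """Linear score: dot product of weights and feature vector."""
    return sum(t * f for t, f in zip(theta, phi_x))

def p_positive(theta, phi_x):
    """Probability of the positive class under the logistic function."""
    return 1.0 / (1.0 + math.exp(-score(theta, phi_x)))

def classify(theta, phi_x, labels=(1, 0)):
    """Return labels[0] if the positive class is at least as likely, else labels[1]."""
    positive, negative = labels
    return positive if p_positive(theta, phi_x) >= 0.5 else negative

theta = [1.0, -1.0]  # made-up weights for illustration
print(classify(theta, [2.0, 0.5]))                       # labels 1 / 0
print(classify(theta, [0.0, 3.0], labels=("+1", "-1")))  # same model, renamed labels
```

Swapping the label names changes only what classify() returns, not the underlying probabilities.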

17 of 18

Binary Logistic Regression

from Lecture Slide 40

apply the definition of the score function

Symbol    Definition                           Scalar / Vector
x         Input                                Vector
y         Output                               Scalar
θ         Parameters                           Vector
𝜙(x)      Feature vector (Lecture Slide 31)    Vector

apply the definition of the standard logistic function
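
The derivation these annotations belong to is missing from the extracted text. One plausible reconstruction, assuming the positive class is labeled 1 and the score function is θ · 𝜙(x), is:

```latex
p_{\mathrm{LR}}(Y = 1 \mid x; \theta)
  = \sigma\big(\mathrm{score}(x; \theta)\big)                  % starting definition
  = \sigma\big(\theta \cdot \phi(x)\big)                       % definition of the score function
  = \frac{1}{1 + \exp\big(-\theta \cdot \phi(x)\big)}          % definition of the standard logistic function
```

The probability of the other label is one minus this quantity.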

18 of 18

CSE447: Project 0, Python and PyTorch Tutorial + Review

https://colab.research.google.com/drive/1PAUlmIZMcxsKME0UlBCLf8HtQU2rcs5Q