1 of 67

Logistic Regression I

Moving from regression to classification.

Data 100, Summer 2023 @ UC Berkeley

Bella Crouch and Dominic Liu

Content credit: Acknowledgments

1

LECTURE 18

2 of 67


3 of 67

3

4 of 67

Goals for this Lecture

Moving away from linear regression – it’s time for a new type of model

  • Introduce a new task: classification
  • Derive a new model to handle this task

4

Lecture 18, Data 100 Summer 2023

5 of 67

Agenda

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

5

6 of 67

Regression vs. Classification

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

6

7 of 67

Beyond Regression

Up until this point, we have been working exclusively with linear regression.

In the machine learning “landscape,” there are many other types of models.

7

8 of 67

Beyond Regression

Up until this point, we have been working exclusively with linear regression.

In the machine learning “landscape,” there are many other types of models.

8

[Roadmap labels on the machine learning “landscape” figure: Today + tomorrow; Next week; Week after; Take CS 188 or Data 102]

9 of 67

So Far: Regression

In regression, we use unbounded numeric features to predict an unbounded numeric output

9

Input: numeric features

Model: linear combination

Output: numeric prediction (anywhere from −∞ to ∞)

Examples:

  • Predict bill depth from flipper length
  • Predict tip from total bill
  • Predict mpg from hp

10 of 67

Now: Classification

In classification, we use unbounded numeric features to predict a categorical class

10

An aside: we will use logistic “regression” to perform a classification task. Here, “regression” refers to the type of model, not the task being performed.

Input: numeric features

Model: linear combination transformed by non-linear sigmoid

Decision rule

Examples:

  • Predict species of penguin from flipper length
  • Predict day of week from total bill
  • Predict model of car from hp

Output: class

IsAdelie?

If p > 0.5: predict “Adelie”

Otherwise: predict “not Adelie”

11 of 67

Kinds of Classification

We are interested in predicting some categorical variable, or response, y.

Binary classification [Data 100]

  • Two classes
  • Responses y are either 0 or 1

Multiclass classification

  • Many classes
  • Examples: Image labeling (cat, dog, car), next word in a sentence, etc.

Structured prediction tasks

  • Multiple related classification predictions
  • Examples: Translation, voice recognition, etc.

11

win or lose

disease or no disease

spam or ham

Our new goal: predict a binary output (y_hat = 0 or y_hat = 1) from input numeric features.

12 of 67

The Modeling Process

12

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): ??
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (tomorrow)

13 of 67

The Logistic Regression Model

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

13

14 of 67

The games Dataset

New modeling task, new dataset.

The games dataset describes the win/loss results of basketball teams.

14

The feature, GOAL_DIFF, is the difference in field goal success rate between teams.

If a team won their game, we say they are in “Class 1”; otherwise, “Class 0”.

15 of 67

Why Not Least Squares Linear Regression?

15

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

– Abraham Maslow, The Psychology of Science

Problems:

  • The output can be outside the label range {0, 1}
  • Some outputs can’t be interpreted: what does a class of “-2.3” mean?

Demo

16 of 67

Back to Basics: the Graph of Averages

Clearly, least squares regression won’t work here. We need to start fresh.

In Data 8, you built up to the idea of linear regression by considering the graph of averages.

16

[Figure from the Data 8 textbook: bucket the x-axis into bins; the average y within each bin is approximately linear in x, motivating a parametric linear model.]

For an input x, compute the average value of y for all nearby x, and predict that.

17 of 67

Graph of Averages for Our Classification Task

17

For an input x, compute the average value of y for all nearby x, and predict that.

[Data 8 textbook]

Demo

18 of 67

Graph of Averages for Our Classification Task

Looking a lot better!

Some observations:

  • All predictions are between 0 and 1
  • Our predictions are non-linear
    • Specifically, we see an “S” shape.

This will be important soon.

18

19 of 67

Graph of Averages for Our Classification Task

Thinking more deeply about what we’ve just done

To compute the average of a bin, we:

  • Counted the number of wins in the bin
  • Divided by the total number of datapoints in the bin

19

This is the probability that Y is 1 in that bin!
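A minimal sketch of this bin-and-average computation in pandas. The DataFrame, the column names ("GOAL_DIFF", "WON"), and the simulated data below are placeholders standing in for the demo's games data:

import numpy as np
import pandas as pd

# Placeholder data standing in for the demo's games DataFrame.
rng = np.random.default_rng(42)
goal_diff = rng.uniform(-0.3, 0.3, size=500)
won = (rng.uniform(size=500) < 1 / (1 + np.exp(-10 * goal_diff))).astype(int)
games = pd.DataFrame({"GOAL_DIFF": goal_diff, "WON": won})

# Bucket the x-axis into bins, then average the 0/1 responses within each bin.
# Each bin average = (# wins in bin) / (# points in bin) = P(Y = 1) for that bin.
bins = pd.cut(games["GOAL_DIFF"], bins=10)
graph_of_averages = games.groupby(bins, observed=True)["WON"].mean()
print(graph_of_averages)  # roughly S-shaped, every value between 0 and 1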

20 of 67

Predicting Probabilities

Our curve is modeling the probability that Y = 1 for a particular value of x.

  • This matches our observation that all predictions are between 0 and 1 – the predictions are representing probabilities

20

New modeling goal: model the probability of a datapoint belonging to Class 1

21 of 67

Handling the Non-Linear Output

We still seem to have a problem: our probability curve is non-linear, but we only know linear modeling techniques.

Good news: we’ve seen this before! To capture non-linear relationships, we:

  1. Applied transformations to linearize the relationship
  2. Fit a linear model to the transformed variables
  3. Transformed the variables back to find their underlying relationship

We’ll use the exact same process here.

21

New modeling goal: model the probability of a datapoint belonging to Class 1

22 of 67

Step 1: Linearize the relationship

Our S-shaped probability curve doesn’t match any relationship we’ve seen before. The bulge diagram won’t help here.

To understand what transformation to apply, think about our eventual goal: assign each datapoint to its most likely class.

22

“Odds” is defined as the ratio of the probability of Y being Class 1 to the probability of Y being Class 0: odds = P(Y = 1 | x) / P(Y = 0 | x) = p / (1 − p).

How do we decide which class is more likely? One way: check which class has the higher predicted probability.

23 of 67

Step 1: Linearize the relationship

The odds curve looks roughly exponential. To linearize it, we’ll take the logarithm*.

23

*In Data 100: assume that “log” means base e natural log unless told otherwise
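A small sketch of the two transformations applied to a handful of illustrative bin-level probabilities (the numbers are made up, not the dataset's):

import numpy as np

p = np.array([0.05, 0.15, 0.35, 0.60, 0.82, 0.95])  # made-up bin averages

odds = p / (1 - p)        # P(Y = 1) / P(Y = 0); defined only for 0 < p < 1
log_odds = np.log(odds)   # natural log, per the Data 100 convention

print(np.round(odds, 3))      # grows roughly exponentially across the bins
print(np.round(log_odds, 3))  # roughly linear across the bins -- this is what we fit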

24 of 67

Step 2: Find a linear combination of the features

We now have a linear relationship between our transformed variables. We can represent this relationship as a linear combination of the features.

24

Our linear combination: z = xᵀθ, which models the log-odds.

Remember that our goal is to predict the probability of Class 1. This linear combination isn’t our final prediction! We use z instead of y_hat to remind ourselves.

25 of 67

Step 3: Transform back to original variables

25

Solve for p: starting from log(p / (1 − p)) = z, exponentiate both sides to get p / (1 − p) = e^z, and therefore

p = e^z / (1 + e^z) = 1 / (1 + e^(−z))

This is called the logistic function, 𝜎(z) = 1 / (1 + e^(−z)).

[Figures: log(odds) vs. x is linear; odds vs. x is exponential; p vs. x is S-shaped.]

26 of 67

Arriving at the Logistic Regression Model

We have just derived the logistic regression model for the probability of a datapoint belonging to Class 1.

26

To predict a probability:

  • Compute a linear combination of the features, z = xᵀθ
  • Apply the sigmoid function: p̂ = 𝜎(z) = 𝜎(xᵀθ)
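A minimal sketch of these two steps in NumPy; the feature matrix and θ values below are illustrative, not fitted to any real data:

import numpy as np

def sigmoid(t):
    # Logistic (sigmoid) function: maps any real number into (0, 1).
    return 1 / (1 + np.exp(-t))

def predict_proba(X, theta):
    # P(Y = 1 | x) for each row of X: linear combination, then sigmoid.
    z = X @ theta       # step 1: z = x^T theta
    return sigmoid(z)   # step 2: squash z into a probability

X = np.array([[1.0, 0.2], [1.0, -0.5]])  # first column acts as an intercept
theta = np.array([0.1, 3.0])
print(predict_proba(X, theta))           # two probabilities in (0, 1)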

27 of 67

The Sigmoid Function

The S-shaped curve we derived is formally known as the sigmoid function.

  • You worked with it in Homework 1B; you just didn’t know it at the time

27

Key properties of 𝜎(t) = 1 / (1 + e^(−t)):

  • Domain: all real numbers, (−∞, ∞)
  • Range: (0, 1)
  • Reflection/Symmetry: 1 − 𝜎(t) = 𝜎(−t)
  • Derivative: 𝜎′(t) = 𝜎(t)(1 − 𝜎(t)) = 𝜎(t)𝜎(−t)
  • Inverse: 𝜎⁻¹(p) = log(p / (1 − p)), the log-odds
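A quick numerical sanity check of these properties (a sketch; nothing here is specific to the lecture's demo):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

t = np.linspace(-5, 5, 201)
s = sigmoid(t)

# Range: outputs stay strictly between 0 and 1.
assert np.all((s > 0) & (s < 1))

# Reflection/symmetry: 1 - sigma(t) = sigma(-t).
assert np.allclose(1 - s, sigmoid(-t))

# Derivative: sigma'(t) = sigma(t)(1 - sigma(t)), checked by finite differences.
h = 1e-6
numeric_deriv = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
assert np.allclose(numeric_deriv, s * (1 - s), atol=1e-6)

# Inverse: sigma^{-1}(p) = log(p / (1 - p)) recovers t, the log-odds.
assert np.allclose(np.log(s / (1 - s)), t)
print("all sigmoid properties check out")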

28 of 67

The Sigmoid Converts Numerical Features to Probabilities

28

Input: numeric features

Model: linear combination transformed by activation function

Decision rule

Output: class

IsAdelie?

If y_hat > 0.5: predict “Adelie”

Otherwise: predict “not Adelie”

In logistic regression, the sigmoid transforms a linear combination of numerical features into a probability

Tomorrow’s lecture

29 of 67

Formalizing the Logistic Regression Model

Our main takeaways of this section:

  • Fit the “S” curve as well as possible.
  • The curve models a probability: P(Y = 1 | x).
  • Assume the log-odds is a linear combination of x and θ.

Putting it all together:

29

The logistic regression model is most commonly written as follows:

P̂_θ(Y = 1 | x) = 𝜎(xᵀθ) = 1 / (1 + e^(−xᵀθ))

  • The left-hand side is the estimated probability that, given the features x, the response is 1.
  • The right-hand side is the logistic function 𝜎( ) evaluated at xᵀθ: it looks like linear regression, now wrapped in 𝜎( )!

🎉

30 of 67

Example Calculation

Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).

I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters:

Now, you want to predict the probability that a new team wins their game.

30

🤔

31 of 67

Suppose x^T contains the GOAL_DIFF and number of free throws for a new team. What is the predicted probability that this new team will win?


32 of 67

Example Calculation

Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).

I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters:

Now, you want to predict the probability that a new team wins their game.

32

🤔

Because the response is more likely to be 1 than 0, a reasonable prediction is y_hat = 1

(more on this next lecture)
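Since the fitted parameter values do not survive in the text above, here is the mechanics of the calculation with made-up numbers; both θ̂ and the new team's features are hypothetical:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Hypothetical fitted parameters for a no-intercept model on
# [GOAL_DIFF, number of free throws] -- placeholders, not the lecture's values.
theta_hat = np.array([5.0, 0.1])
x_new = np.array([0.15, 4.0])        # hypothetical new team's features

p_hat = sigmoid(x_new @ theta_hat)   # P(Y = 1 | x) = sigma(x^T theta)
print(p_hat)                         # ~0.76 for these made-up numbers

# Decision-rule preview: predict class 1 ("win") when p_hat exceeds 0.5.
print(int(p_hat > 0.5))              # 1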

33 of 67

Properties of the Logistic Model

Consider a logistic regression model with one feature and an intercept term (Desmos): P̂(Y = 1 | x) = 𝜎(θ₀ + θ₁x)

33

Properties:

  • θ₀ controls the position of the curve along the horizontal axis
  • The magnitude of θ₁ controls the “steepness” of the sigmoid
  • The sign of θ₁ controls the orientation of the curve
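A small numeric sketch of these effects; the parameter values are arbitrary:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def logistic_curve(x, theta0, theta1):
    # One-feature logistic model with an intercept: sigma(theta0 + theta1 * x).
    return sigmoid(theta0 + theta1 * x)

x = np.linspace(-5, 5, 5)
# Larger |theta1| -> steeper transition around the curve's midpoint.
print(np.round(logistic_curve(x, 0, 1), 3))
print(np.round(logistic_curve(x, 0, 5), 3))
# Negative theta1 flips the orientation (decreasing instead of increasing).
print(np.round(logistic_curve(x, 0, -1), 3))
# theta0 shifts the curve horizontally: the midpoint sits at x = -theta0 / theta1.
print(np.round(logistic_curve(x, 2, 1), 3))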

34 of 67

Interlude

Classification is hard.

34

35 of 67

Cross-Entropy Loss

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

35

36 of 67

The Modeling Process

36

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

37 of 67

The Modeling Process

37

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

Can squared loss still work?

38 of 67

Toy Dataset: L2 Loss

38

Logistic Regression model: ŷ = 𝜎(xθ). Assume no intercept, so x and θ are both scalars.

Mean Squared Error: R(θ) = (1/n) Σᵢ (yᵢ − 𝜎(xᵢθ))²

⚠️ The MSE loss surface for logistic regression has many issues!

Demo

39 of 67

What problems might arise from using MSE loss with logistic regression?


40 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

40

The secant line dips below the function, so R(θ) is not convex: R″(θ) ≥ 0 does not hold for all θ.

Demo

41 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

41

Gradient Descent: different initial guesses can yield different (locally optimal) estimates.
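The demo defines mse_loss_toy_nobias in its notebook; below is a guess at its shape, assuming a tiny toy dataset of scalar (x, y) pairs and a no-intercept model. The data here are placeholders, so the demo's printed values further down will not be reproduced exactly:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Placeholder toy data; the demo's actual points may differ.
x_toy = np.array([-4.0, -2.0, -0.5, 1.0, 3.0, 5.0])
y_toy = np.array([0, 0, 1, 0, 1, 1])

def mse_loss_toy_nobias(theta):
    # Mean squared error of the no-intercept logistic model sigma(x * theta).
    return np.mean((y_toy - sigmoid(x_toy * theta)) ** 2)

print(mse_loss_toy_nobias(np.array([1.0])))  # loss at theta = 1 for the toy data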

from scipy.optimize import minimize

minimize(mse_loss_toy_nobias, x0 = 0)["x"][0]
# demo output: 0.5446601825581691

minimize(mse_loss_toy_nobias, x0 = -5)["x"][0]
# demo output: -10.343653061026611

The secant line dips below the function, so R(θ) is not convex: R″(θ) ≥ 0 does not hold for all θ.

Demo

42 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

42

Gradient Descent: different initial guesses can yield different (locally optimal) estimates.

The secant line dips below the function, so R(θ) is not convex: R″(θ) ≥ 0 does not hold for all θ.

Demo

43 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

2. Bounded. Not a good measure of model error.

  • We’d like loss functions to penalize “off” predictions.
  • MSE never gets very large, because both response and predicted probability are bounded by 1.

43

If the true y = 1 but the predicted probability is 0, the squared loss is only (1 − 0)² = 1.

Demo

44 of 67

Choosing a Different Loss Function

44

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: Cross-Entropy Loss
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

45 of 67

Loss in Classification

Let y be a binary label {0, 1}, and p be the model’s predicted probability of the label being 1.

In a classification task, how do we want our loss function to behave?

  • When the true y is 1, we should incur low loss when the model predicts large p
  • When the true y is 0, we should incur high loss when the model predicts large p

In other words, the behavior we need from our loss function depends on the value of the true class, y

45

46 of 67

Cross-Entropy Loss

Let y be a binary label {0, 1}, and p be the probability of the label being 1.

The cross-entropy loss is defined piecewise:

  • If y = 1: loss = −log(p)
  • If y = 0: loss = −log(1 − p)

46

For y = 1,

  • p → 0: infinite loss
  • p → 1: zero loss

For y = 0,

  • p → 0: zero loss
  • p → 1: infinite loss

47 of 67

Cross-Entropy Loss: Two Loss Functions In One!

The piecewise loss function we just introduced is difficult to optimize – we don’t want to check “which” loss to use at each step of optimizing θ.

Cross-entropy loss can be equivalently expressed as:

loss = −(y log(p) + (1 − y) log(1 − p))

47

  • For y = 1, only the y log(p) term stays
  • For y = 0, only the (1 − y) log(1 − p) term stays
  • The leading negative sign makes the loss positive
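A minimal sketch of this single-datapoint loss, showing the behavior described above:

import numpy as np

def cross_entropy(y, p):
    # Cross-entropy loss for a single true label y and predicted probability p.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# y = 1: small loss when p is near 1; the loss blows up as p -> 0.
print(cross_entropy(1, 0.99))  # ~0.01
print(cross_entropy(1, 0.01))  # ~4.6

# y = 0: the behavior flips.
print(cross_entropy(0, 0.01))  # ~0.01
print(cross_entropy(0, 0.99))  # ~4.6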

48 of 67

Empirical Risk: Average Cross-Entropy Loss

For a single datapoint, the cross-entropy curve is convex. It has a global minimum.

What about average cross-entropy loss, i.e., empirical risk?

For logistic regression, the empirical risk over a sample of size n is:

R(θ) = −(1/n) Σᵢ [yᵢ log(𝜎(Xᵢᵀθ)) + (1 − yᵢ) log(1 − 𝜎(Xᵢᵀθ))]

The optimization problem is therefore to find the estimate θ̂ that minimizes R(θ):

θ̂ = argmin_θ R(θ)

48

[Recall our model is p̂ = P̂(Y = 1 | x) = 𝜎(xᵀθ)]
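A sketch of minimizing this empirical risk numerically with scipy, using simulated data and an explicitly written loss (sklearn's LogisticRegression performs an equivalent fit, with regularization added by default):

import numpy as np
from scipy.optimize import minimize

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def mean_cross_entropy(theta, X, y):
    # Empirical risk R(theta): average cross-entropy over the sample.
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Simulated data: an intercept column plus one feature; true theta is about [0, 4].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = (rng.uniform(size=200) < sigmoid(4 * x)).astype(float)
X = np.column_stack([np.ones_like(x), x])

theta_hat = minimize(mean_cross_entropy, x0=np.zeros(2), args=(X, y))["x"]
print(theta_hat)  # roughly [0, 4], up to sampling noise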

49 of 67

Convexity Proof By Picture

49

Squared Loss Surface: a straight line (secant) crosses the curve, so the surface is non-convex.

Cross-Entropy Loss Surface: convex!

50 of 67

Bonus Material:

Maximum Likelihood Estimation

Lecture 18, Data 100 Summer 2023

It may have seemed like we just pulled cross-entropy loss out of thin air.

CE loss is justified by a probability analysis technique called maximum likelihood estimation. Read on if you would like to learn more.

Recorded walkthrough: link.

50

51 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

51

Training data has only responses (no features)

52 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

52

Training data has only responses (no features)

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

🤔

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

2. For the next flip, would you predict 1 or 0?

53 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

53

Training data has only responses (no features)

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

54 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

54

Training data has only responses (no features)

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

0.4 is the most “intuitive” θ for two reasons:

  1. It is the frequency of heads in our data.
  2. It maximizes the likelihood of our data.

55 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

55

Training data has only responses (no features)

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

2. For the next flip, would you predict 1 or 0?

The most frequent outcome in the sample, which is tails (0).

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

56 of 67

Likelihood of Data; Definition of Probability

A Bernoulli random variable Y with parameter p has distribution: P(Y = 1) = p, P(Y = 0) = 1 − p.

Given that all flips are IID from the same coin (probability of heads = p), the likelihood of our data is proportional to the probability of observing the datapoints.

56

Training data: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

Data likelihood: (1 − p)·(1 − p)·p·p·p·p·(1 − p)·(1 − p)·(1 − p)·(1 − p) = p⁴(1 − p)⁶

57 of 67

Likelihood of Data

A Bernoulli random variable Y with parameter p has distribution: P(Y = 1) = p, P(Y = 0) = 1 − p.

Given that all flips are IID from the same coin (probability of heads = p), the likelihood of our data is proportional to the probability of observing the datapoints.

57

Training data: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

Data likelihood: p⁴(1 − p)⁶

An example of a bad estimate is the parameter p̂ = 0 (or p̂ = 1), since the likelihood of observing this training data, which contains both heads and tails, would then be 0.

58 of 67

Generalization of the Coin Demo

For training data: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

0.4 is the most “intuitive” θ for two reasons:

  1. It is the frequency of heads in our data.
  2. It maximizes the likelihood of our data: θ⁴(1 − θ)⁶ is largest at θ = 0.4.

How can we generalize this notion of likelihood to any random binary sample?

58

Parameter θ: Probability that an IID flip == 1 (Heads)

Prediction: 1 or 0

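A small sketch confirming the second point numerically: over a grid of candidate θ values, the likelihood θ⁴(1 − θ)⁶ of these ten flips peaks at the sample frequency of heads:

import numpy as np

flips = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
heads, tails = flips.sum(), (1 - flips).sum()

# Likelihood of the observed flips under an IID Bernoulli(theta) model:
# theta^(# heads) * (1 - theta)^(# tails).
theta_grid = np.linspace(0.001, 0.999, 999)
likelihood = theta_grid**heads * (1 - theta_grid)**tails

print(theta_grid[np.argmax(likelihood)])  # ~0.4, the sample frequency of heads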

59 of 67

A Compact Representation of the Bernoulli Probability Distribution

How can we generalize this notion of likelihood to any random binary sample?

59

Let Y be Bernoulli(p). The probability distribution, in its long, non-compact form: P(Y = 1) = p, P(Y = 0) = 1 − p.

It can be written compactly as:

P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y factor stays
  • For P(Y = 0), only the (1 − p)^(1 − y) factor stays


60 of 67

Generalized Likelihood of Binary Data

How can we generalize this notion of likelihood to any random binary sample?

60

Let Y be Bernoulli(p). The probability distribution can be written compactly: P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y factor stays
  • For P(Y = 0), only the (1 − p)^(1 − y) factor stays

If binary data y₁, …, yₙ are IID with the same probability p, then the likelihood of the data is:

∏ᵢ p^(yᵢ) (1 − p)^(1 − yᵢ)

Ex: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0} → p⁴(1 − p)⁶


61 of 67

Generalized Likelihood of Binary Data

How can we generalize this notion of likelihood to any random binary sample?

61

Let Y be Bernoulli(p). The probability distribution can be written compactly: P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y factor stays
  • For P(Y = 0), only the (1 − p)^(1 − y) factor stays

If binary data are IID with the same probability p, then the likelihood of the data is:

∏ᵢ p^(yᵢ) (1 − p)^(1 − yᵢ)

If the data are independent, each with its own probability pᵢ, then the likelihood of the data is:

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

(spoiler: for logistic regression, pᵢ = 𝜎(Xᵢᵀθ))

62 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find p̂₁, …, p̂ₙ that maximize

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

62

63 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find p̂₁, …, p̂ₙ that maximize

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

Equivalent, simpler optimization problems (useful because we eventually need to take a derivative):

63

maximize Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(log is an increasing function. If a > b, then log(a) > log(b).)

64 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find p̂₁, …, p̂ₙ that maximize

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

Equivalent, simpler optimization problems (useful because we eventually need to take a derivative):

64

maximize Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(log is an increasing function. If a > b, then log(a) > log(b).)

minimize −(1/n) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(negating turns maximization into minimization; dividing by n does not change the minimizer.)

65 of 67

Maximizing Likelihood == Minimizing Average Cross-Entropy

65

maximize ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

minimize −(1/n) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(log is increasing; max/min properties)

For logistic regression, let pᵢ = 𝜎(Xᵢᵀθ):

minimize −(1/n) Σᵢ [yᵢ log(𝜎(Xᵢᵀθ)) + (1 − yᵢ) log(1 − 𝜎(Xᵢᵀθ))]

This is average Cross-Entropy Loss!!

Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.

  • We are choosing the model parameters that are “most likely”, given this data.

Assumption: all data drawn independently from the same logistic regression model with parameter θ

  • It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE (OLS, Ridge Regression, etc.)
  • You will study MLE further in probability and ML classes. But now you know it exists.
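A quick numeric sketch of the equivalence: for any predicted probabilities pᵢ, the negative mean log-likelihood equals the average cross-entropy loss (the data and θ below are arbitrary, since the identity holds for any values):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Arbitrary data and parameters: the identity holds regardless of their values.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.integers(0, 2, size=50)
theta = np.array([0.3, -1.2])

p = sigmoid(X @ theta)

# Log-likelihood of the data under independent Bernoulli(p_i) responses.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Average cross-entropy loss for the same predictions.
avg_cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

assert np.isclose(-log_likelihood / len(y), avg_cross_entropy)
print("negative mean log-likelihood equals average cross-entropy loss")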

66 of 67

We Did it!

66

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: Average Cross-Entropy Loss
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

67 of 67

Logistic Regression I

Content credit: Acknowledgments

67

LECTURE 18