2 of 67

Logistic Regression I

Moving from regression to classification.

Data 100/Data 200, Fall 2023 @ UC Berkeley

Narges Norouzi and Fernando Pérez

Content credit: Acknowledgments


LECTURE 22


3 of 67

Goals for this Lecture

Moving away from linear regression – it’s time for a new type of model

  • Introduce a new task: classification
  • Derive a new model to handle this task


4 of 67

Agenda

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss


5 of 67

Regression vs Classification

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss



7 of 67

Beyond Regression

Up until this point, we have been working exclusively with linear regression.

In the machine learning “landscape,” there are many other types of models.

[Figure: the machine learning "landscape," annotated with when each family of models is covered (this week, next week, the week after, or in CS 188 / Data 102).]


8 of 67

So Far: Regression

In regression, we use unbounded numeric features to predict an unbounded numeric output.


Input: numeric features

Model: linear combination

Output: numeric prediction (anywhere from −∞ to +∞)

Examples:

  • Predict goal difference from turnover %
  • Predict tip from total bill
  • Predict mpg from hp


9 of 67

Now: Classification

In classification, we use unbounded numeric features to predict a categorical class.


An aside: we will use logistic “regression” to perform a classification task. Here, “regression” refers to the type of model, not the task being performed.

Input: numeric features

Model: linear combination transformed by a non-linear sigmoid

Decision rule: if p > 0.5, predict a win; otherwise, predict a loss

Output: class (win or loss)

Examples:

  • Predict which team won from turnover %
  • Predict day of week from total bill
  • Predict model of car from hp


10 of 67

Kinds of Classification

We are interested in predicting some categorical variable, or response, y.

Binary classification [Data 100]

  • Two classes
  • Responses y are either 0 or 1


win or lose

disease or no disease

spam or ham

Our new goal: predict a binary output (y_hat = 0 or y_hat = 1) given input numeric features.

Multiclass classification

  • Many classes
  • Examples: Image labeling (Pishi, Thor, Hera), next word in a sentence, etc.

Structured prediction tasks

  • Multiple related classification predictions
  • Examples: Translation, voice recognition, etc.


11 of 67

The Modeling Process

  1. Choose a model
     • Regression: Linear Regression
     • Classification: ??
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next lecture)


12 of 67

The Logistic Regression Model

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss


13 of 67

The games Dataset

New modeling task, new dataset.

The games dataset describes the win/loss results of basketball teams.


Difference in field goal success rate between teams

If a team won their game, we say they are in “Class 1”


14 of 67

Why Not Least Squares Linear Regression?


I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

– Abraham Maslow, The Psychology of Science

Problems:

  • The output can be outside the label range {0, 1}.
  • Some outputs can’t be interpreted: what does a class of “-2.3” mean?

Demo
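As a rough illustration (not the lecture's demo notebook, and with made-up data), fitting ordinary least squares to 0/1 labels happily produces predictions outside the label range:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: x = some numeric feature, y = 1 if the team won
x = np.array([-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 1, 1])

ols = LinearRegression().fit(x, y)

# Predictions for extreme inputs fall outside the label range {0, 1}
print(ols.predict(np.array([[-1.0], [1.0]])))  # one value below 0, one above 1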


15 of 67

Back to Basics: the Graph of Averages

Clearly, least squares regression won’t work here. We need to start fresh.

In Data 8, you built up to the idea of linear regression by considering the graph of averages.

From the Data 8 textbook:

  • Bucket the x-axis into bins.
  • For an input x, compute the average value of y for all nearby x, and predict that.
  • The average of y within each x bin is approximately linear in x, which motivates a parametric linear model.


16 of 67

Graph of Averages for Our Classification Task

For an input x, compute the average value of y for all nearby x, and predict that.

[Data 8 textbook]

Demo
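A minimal pandas sketch of the binning-and-averaging idea; the column names and values below are hypothetical stand-ins, not necessarily those of the actual games dataset:

import pandas as pd

# Hypothetical columns: "GOAL_DIFF" (numeric feature) and "WON" (0/1 label)
games = pd.DataFrame({
    "GOAL_DIFF": [-0.3, -0.25, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.3],
    "WON":       [0,     0,     0,    1,    0,   1,    1,   1,   1],
})

# Bucket the x-axis into bins, then take the average of y within each bin.
# Because y is 0/1, each bin average is the fraction of wins in that bin,
# i.e., an estimate of P(Y = 1) for that range of x.
bins = pd.cut(games["GOAL_DIFF"], bins=4)
graph_of_averages = games.groupby(bins, observed=True)["WON"].mean()
print(graph_of_averages)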


17 of 67

Graph of Averages for Our Classification Task

Looking a lot better!

Some observations:

  • All predictions are between 0 and 1.
  • Our predictions are non-linear.
    • Specifically, we see an “S” shape.

This will be important soon.


18 of 67

Graph of Averages for Our Classification Task

Thinking more deeply about what we’ve just done.

To compute the average of a bin, we:

  • Counted the number of wins in the bin.
  • Divided by the total number of datapoints in the bin.


This is the probability that Y is 1 in that bin!


19 of 67

Predicting Probabilities

Our curve is modeling the probability that Y = 1 for a particular value of x.

  • This matches our observation that all predictions are between 0 and 1 – the predictions are representing probabilities.


New modeling goal: model the probability of a datapoint belonging to Class 1.


20 of 67

Handling the Non-Linear Output

We still seem to have a problem: our probability curve is non-linear, but we only know linear modeling techniques.

Good news: we’ve seen this before! To capture non-linear relationships, we:

  1. Applied transformations to linearize the relationship.
  2. Fit a linear model to the transformed variables.
  3. Transformed the variables back to find their underlying relationship.

We’ll use the exact same process here.


New modeling goal: model the probability of a datapoint belonging to Class 1.


21 of 67

Step 1: Linearize the Relationship

Our S-shaped probability curve doesn’t match any relationship we’ve seen before. The bulge diagram won’t help here.

To understand what transformation to apply, think about our eventual goal: assign each datapoint to its most likely class.


How do we decide which class is more likely? One way: check which class has the higher predicted probability.

“Odds” is defined as the ratio of the probability of Y being Class 1 to the probability of Y being Class 0.
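In symbols, writing p = P(Y = 1 | x):

\[ \text{odds}(p) \;=\; \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} \;=\; \frac{p}{1 - p} \]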


22 of 67

Step 1: Linearize the Relationship

The odds curve looks roughly exponential. To linearize it, we’ll take the logarithm*.


*In Data 100: assume that “log” means the natural logarithm (base e) unless told otherwise.


23 of 67

Step 2: Find a Linear Combination of the Features

We now have a linear relationship between our transformed variables. We can represent this relationship as a linear combination of the features.


Our linear combination

Remember that our goal is to predict the probability of Class 1. This linear combination isn’t our final prediction! We use z instead of y_hat to remind ourselves.
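Concretely, with any intercept absorbed into the feature vector, the transformed (log-odds) quantity is modeled as:

\[ z \;=\; x^{\mathsf T}\theta, \qquad \log\!\left(\frac{p}{1 - p}\right) \;=\; z \]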


24 of 67

Step 3: Transform Back to Original Variables


Solve for p:
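A sketch of the algebra, starting from the linear log-odds model:

\[ \log\!\left(\frac{p}{1-p}\right) = z \;\Longrightarrow\; \frac{p}{1-p} = e^{z} \;\Longrightarrow\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \]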

This is called the logistic function, 𝜎( ).

[Plots: odds vs. x, log(odds) vs. x, and p vs. x]



26 of 67

Arriving at the Logistic Regression Model

We have just derived the logistic regression model for the probability of a datapoint belonging to Class 1.

To predict a probability:

  • compute a linear combination of the features, and
  • apply the sigmoid function (see the formula below).
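In symbols, the two steps are:

\[ z = x^{\mathsf T}\theta, \qquad \hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}} \]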


27 of 67

The Sigmoid Function

The S-shaped curve we derived is formally known as the sigmoid function.

  • Domain: all real numbers, t ∈ (−∞, +∞)
  • Range: (0, 1)
  • Reflection/symmetry: 1 − σ(t) = σ(−t)
  • Derivative: σ′(t) = σ(t)(1 − σ(t))
  • Inverse: σ⁻¹(p) = log(p / (1 − p)), the log-odds


28 of 67

The Sigmoid Converts Numerical Features to Probabilities

In logistic regression, the sigmoid transforms a linear combination of numerical features into a probability.

Input: numeric features

Model: linear combination transformed by an activation function (the sigmoid)

Decision rule (next lecture): if p > 0.5, predict a win; otherwise, predict a loss

Output: class (win or loss)


29 of 67

Formalizing the Logistic Regression Model

Our main takeaways of this section:

  • Fit the “S” curve as best as possible.
  • The curve models probability: P(Y = 1 | x).
  • Assume log-odds is a linear combination of x and θ.

Putting it all together:

Estimated probability that, given the features x, the response is 1: the logistic function σ( ) evaluated at xᵀθ. Looks like linear regression, now wrapped with σ( )!

The logistic regression model is most commonly written as follows:
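One common way to write it, consistent with the pieces above:

\[ \hat{P}_{\theta}(Y = 1 \mid x) \;=\; \sigma\!\left(x^{\mathsf T}\theta\right) \;=\; \frac{1}{1 + e^{-x^{\mathsf T}\theta}} \]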

🎉


30 of 67

Example Calculation

Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).

I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters:

Now, you want to predict the probability that a new team wins their game.

🤔
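A minimal numpy sketch of this calculation; the parameter estimates and feature values below are hypothetical placeholders (the slide's actual numbers are in its figure), so the resulting probability is only illustrative:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Hypothetical fitted parameters for [GOAL_DIFF, number of free throws] (no intercept)
theta_hat = np.array([3.0, 0.1])

# Hypothetical feature vector x for the new team
x_new = np.array([0.15, 20.0])

# Predicted probability of winning: sigma(x^T theta)
p_win = sigmoid(x_new @ theta_hat)
print(p_win)  # a number strictly between 0 and 1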


31 of 67

Suppose x^T contains the GOAL_DIFF and number of free throws for a new team. What is the predicted probability that this new team will win?


32 of 67

Example Calculation

Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).

I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters:

Now, you want to predict the probability that a new team wins their game.

🤔

Because the response is more likely to be 1 than 0, a reasonable prediction is y_hat = 1 (more on this next lecture).


33 of 67

Properties of the Logistic Model

Consider a logistic regression model with one feature and an intercept term, p̂ = σ(θ₀ + θ₁x) (Desmos):

Properties:

  • θ₀ controls the position of the curve along the horizontal axis.
  • The magnitude of θ₁ controls the “steepness” of the sigmoid.
  • The sign of θ₁ controls the orientation of the curve.


34 of 67

Interlude

Classification is hard.


35 of 67

Cross-Entropy Loss

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss



37 of 67

The Modeling Process

  1. Choose a model
     • Regression: Linear Regression
     • Classification: Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

Can squared loss still work?


38 of 67

Toy Dataset: L2 Loss

Logistic Regression model (assume no intercept, so x and θ are both scalars): ŷ = σ(xθ)

Mean Squared Error: R(θ) = (1/n) Σᵢ (yᵢ − σ(xᵢθ))²

⚠️ The MSE loss surface for logistic regression has many issues!

Demo
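The lecture's actual demo notebook is not reproduced here; below is a minimal sketch, under made-up toy data, of how such an MSE loss function could be defined. The helper name deliberately mirrors the mse_loss_toy_nobias used on a later slide, but the data (and therefore any minima) differ from the lecture's:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Hypothetical toy data: one scalar feature per point, binary labels
x_toy = np.array([-4.0, -1.0, 0.5, 1.0, 3.0])
y_toy = np.array([0, 0, 1, 0, 1])

def mse_loss_toy_nobias(theta):
    # Mean squared error of the no-intercept model y_hat = sigmoid(x * theta)
    y_hat = sigmoid(x_toy * theta)
    return np.mean((y_toy - y_hat) ** 2)

# Evaluating the loss on a grid of theta values lets you inspect the loss surface,
# which for MSE + sigmoid is generally not convex.
thetas = np.linspace(-10, 10, 201)
losses = [mse_loss_toy_nobias(t) for t in thetas]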


39 of 67

What problems might arise from using MSE loss with logistic regression?


40 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

A secant line crosses the function, so R″(θ) is not ≥ 0 for all θ: the loss surface is not convex.

Demo


41 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.


Gradient Descent: Different initial guesses will yield different optimal estimates.

from scipy.optimize import minimize

minimize(mse_loss_toy_nobias, x0 = 0)["x"][0]
# 0.5446601825581691

minimize(mse_loss_toy_nobias, x0 = -5)["x"][0]
# -10.343653061026611

A secant line crosses the function, so R″(θ) is not ≥ 0 for all θ: the loss surface is not convex.

Demo



43 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

2. Bounded. Not a good measure of model error.

  • We’d like loss functions to penalize “off” predictions.
  • MSE never gets very large, because both response and predicted probability are bounded by 1.

If the true y = 1 but the predicted probability is 0, the squared loss is only (1 − 0)² = 1.

Demo


44 of 67

Choosing a Different Loss Function

  1. Choose a model
     • Regression: Linear Regression
     • Classification: Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: Cross-Entropy Loss
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)


45 of 67

Loss in Classification

Let y be a binary label {0, 1}, and p be the model’s predicted probability of the label being 1.

In a classification task, how do we want our loss function to behave?

  • When the true y is 1, we should incur low loss when the model predicts large p.
  • When the true y is 0, we should incur high loss when the model predicts large p.

In other words, the behavior we need from our loss function depends on the value of the true class, y.


46 of 67

Cross-Entropy Loss

Let y be a binary label {0, 1}, and p be the probability of the label being 1.

The cross-entropy loss is defined as:
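Written out for a single datapoint, the piecewise form is:

\[
\ell(y, p) =
\begin{cases}
-\log(p) & \text{if } y = 1,\\
-\log(1 - p) & \text{if } y = 0.
\end{cases}
\]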

For y = 1:

  • p → 0: loss → ∞
  • p → 1: zero loss

For y = 0:

  • p → 0: zero loss
  • p → 1: loss → ∞


47 of 67

Cross-Entropy Loss: Two Loss Functions In One!

The piecewise loss function introduced above is difficult to optimize: we don’t want to check which branch of the loss to use at each step of optimizing θ.

Cross-entropy loss can be equivalently expressed as:
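In a single expression:

\[ \ell(y, p) \;=\; -\big(\, y \log(p) + (1 - y)\log(1 - p) \,\big) \]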

  • For y = 1, only the y log(p) term stays.
  • For y = 0, only the (1 − y) log(1 − p) term stays.
  • The leading negative sign makes the loss positive.


48 of 67

Empirical Risk: Average Cross-Entropy Loss

For a single datapoint, the cross-entropy curve is convex. It has a global minimum.

What about average cross-entropy loss, i.e., empirical risk?

For logistic regression, the empirical risk over a sample of size n is:

The optimization problem is therefore to find the estimate that minimizes R(θ):
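Written out, using σ(xᵢᵀθ) as the predicted probability for the i-th datapoint:

\[
R(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \Big( y_i \log\!\big(\sigma(x_i^{\mathsf T}\theta)\big) + (1 - y_i)\log\!\big(1 - \sigma(x_i^{\mathsf T}\theta)\big) \Big),
\qquad
\hat{\theta} = \underset{\theta}{\arg\min}\; R(\theta)
\]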

[Recall that our model is p̂ = P(Y = 1 | x) = σ(xᵀθ).]


49 of 67

Convexity Proof By Picture

  • Squared Loss Surface: a straight (secant) line crosses the curve, so it is non-convex.
  • Cross-Entropy Loss Surface: convex!


50 of 67

Logistic Regression I

Data 100/Data 200, Fall 2023 @ UC Berkeley

Narges Norouzi and Fernando Pérez

Content credit: Acknowledgments


LECTURE 22


51 of 67

Bonus Material:

Maximum Likelihood Estimation

It may have seemed like we just pulled cross-entropy loss out of thin air.

CE loss is justified by a probabilistic technique called maximum likelihood estimation. Read on if you would like to learn more.

Recorded walkthrough: link.



53 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

Training data has only responses y (no features x).

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

🤔

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

2. For the next flip, would you predict 1 or 0?



55 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

Training data has only responses y (no features x).

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

0.4 is the most “intuitive” estimate for two reasons:

  1. It is the frequency of heads in our data.
  2. It maximizes the likelihood of our data.


56 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

Training data has only responses y (no features x).

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

2. For the next flip, would you predict 1 or 0?

Predict 0: the most frequent outcome in the sample, which is tails.

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0



58 of 67

Likelihood of Data

A Bernoulli random variable Y with parameter p has distribution P(Y = 1) = p and P(Y = 0) = 1 − p.

Given that all flips are IID from the same coin (probability of heads = p), the likelihood of our data is proportional to the probability of observing the datapoints.

Training data: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

Data likelihood: p^4 (1 − p)^6.

An example of a bad estimate is a parameter at either extreme, p = 0 or p = 1: since the training data contains both heads and tails, the likelihood of observing it would be 0.


59 of 67

Generalization of the Coin Demo

For training data: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

0.4 is the most “intuitive” θ for two reasons:

  1. It is the frequency of heads in our data.
  2. It maximizes the likelihood of our data: θ^4 (1 − θ)^6 is largest at θ = 0.4.

How can we generalize this notion of likelihood to any random binary sample?

Parameter θ: Probability that an IID flip == 1 (Heads)

Prediction: 1 or 0


60 of 67

A Compact Representation of the Bernoulli Probability Distribution

How can we generalize this notion of likelihood to any random binary sample?

Let Y be Bernoulli(p). In the long, non-compact form, the distribution is P(Y = 1) = p and P(Y = 0) = 1 − p.

The probability distribution can be written compactly as P(Y = y) = p^y (1 − p)^(1 − y) for y in {0, 1}:

  • For P(Y = 1), only the p^y term stays.
  • For P(Y = 0), only the (1 − p)^(1 − y) term stays.


61 of 67

Generalized Likelihood of Binary Data

How can we generalize this notion of likelihood to any random binary sample?

Recall the compact Bernoulli form P(Y = y) = p^y (1 − p)^(1 − y), in which only one factor survives for each observed y.

If binary data are IID with the same probability p, then the likelihood of the data is the product of these factors over all datapoints.

Ex: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0} → p^4 (1 − p)^6
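In symbols, this is the standard IID Bernoulli likelihood:

\[ L(p) \;=\; \prod_{i=1}^{n} p^{y_i}(1 - p)^{1 - y_i} \;=\; p^{\#\{i:\, y_i = 1\}}\,(1 - p)^{\#\{i:\, y_i = 0\}} \]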


62 of 67

Generalized Likelihood of Binary Data

How can we generalize this notion of likelihood to any random binary sample?

(Spoiler: for logistic regression, pᵢ = σ(xᵢᵀθ).)

If the data are independent but each Yᵢ has its own probability pᵢ, then the likelihood of the data is the product of the individual Bernoulli probabilities.
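In symbols:

\[ L(p_1, \dots, p_n) \;=\; \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i} \]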




65 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find the estimates p̂₁, …, p̂ₙ that maximize the likelihood of the observed data.

Equivalent, simpler optimization problems (useful since we need to take a first derivative):

  • Maximize the log-likelihood instead. (log is an increasing function: if a > b, then log(a) > log(b).)
  • Equivalently, minimize the negative of the average log-likelihood.
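Spelled out (the 1/n averaging matches the empirical-risk convention used earlier and does not change the minimizer):

\[
\max_{p_1, \dots, p_n} \; \prod_{i=1}^{n} p_i^{\,y_i}(1 - p_i)^{1 - y_i}
\;\Longleftrightarrow\;
\max_{p_1, \dots, p_n} \; \sum_{i=1}^{n} \Big( y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big)
\]
\[
\;\Longleftrightarrow\;
\min_{p_1, \dots, p_n} \; -\frac{1}{n} \sum_{i=1}^{n} \Big( y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big)
\]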


66 of 67

Maximizing Likelihood == Minimizing Average Cross-Entropy

For logistic regression, let pᵢ = σ(xᵢᵀθ). Because log is increasing, and maximizing a quantity is the same as minimizing its negative, maximizing the likelihood is the same as minimizing the negative average log-likelihood, which is exactly the average cross-entropy loss!!
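Under that substitution, the chain becomes:

\[
\max_{\theta} \; \prod_{i=1}^{n} \sigma(x_i^{\mathsf T}\theta)^{y_i} \big(1 - \sigma(x_i^{\mathsf T}\theta)\big)^{1 - y_i}
\;\Longleftrightarrow\;
\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \Big( y_i \log \sigma(x_i^{\mathsf T}\theta) + (1 - y_i)\log\big(1 - \sigma(x_i^{\mathsf T}\theta)\big) \Big)
\]

which is the average cross-entropy loss R(θ) defined earlier.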

Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.

  • We are choosing the model parameters that are “most likely”, given this data.

Assumption: all data drawn independently from the same logistic regression model with parameter θ

  • It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE (OLS, Ridge Regression, etc.)
  • You will study MLE further in probability and ML classes. But now you know it exists.


67 of 67

We Did it!

  1. Choose a model
     • Regression: Linear Regression
     • Classification: Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: Average Cross-Entropy Loss
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)
