1 of 67

Logistic Regression I

Moving from regression to classification.

Data 100, Summer 2023 @ UC Berkeley

Bella Crouch and Dominic Liu

Content credit: Acknowledgments

1

LECTURE 18

2 of 67


3 of 67

3

4 of 67

Goals for this Lecture

Moving away from linear regression – it’s time for a new type of model

  • Introduce a new task: classification
  • Derive a new model to handle this task

4

Lecture 18, Data 100 Summer 2023

5 of 67

Agenda

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

5

6 of 67

Regression vs. Classification

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

6

7 of 67

Beyond Regression

Up until this point, we have been working exclusively with linear regression.

In the machine learning “landscape,” there are many other types of models.

7

8 of 67

Beyond Regression

Up until this point, we have been working exclusively with linear regression.

In the machine learning “landscape,” there are many other types of models.

8

[Roadmap labels on the machine learning “landscape” figure: Today + tomorrow; Next week; Week after; Take CS 188 or Data 102]

9 of 67

So Far: Regression

In regression, we use unbounded numeric features to predict an unbounded numeric output

9

Input: numeric features

Model: linear combination

Output: numeric prediction (anywhere from −∞ to ∞)

Examples:

  • Predict bill depth from flipper length
  • Predict tip from total bill
  • Predict mpg from hp

10 of 67

Now: Classification

In classification, we use unbounded numeric features to predict a categorical class

10

An aside: we will use logistic “regression” to perform a classification task. Here, “regression” refers to the type of model, not the task being performed.

Input: numeric features

Model: linear combination transformed by non-linear sigmoid

Decision rule

Examples:

  • Predict species of penguin from flipper length
  • Predict day of week from total bill
  • Predict model of car from hp

Output: class

IsAdelie?

If p > 0.5: predict “Adelie”

Otherwise: predict “not Adelie”

11 of 67

Kinds of Classification

We are interested in predicting some categorical variable, or response, y.

Binary classification [Data 100]

  • Two classes
  • Responses y are either 0 or 1

Multiclass classification

  • Many classes
  • Examples: Image labeling (cat, dog, car), next word in a sentence, etc.

Structured prediction tasks

  • Multiple related classification predictions
  • Examples: Translation, voice recognition, etc.

11

win or lose

disease or no disease

spam or ham

Our new goal: predict a binary output (y_hat = 0 or y_hat = 1) from input numeric features.

12 of 67

The Modeling Process

12

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): ??
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (tomorrow)

13 of 67

The Logistic Regression Model

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

13

14 of 67

The games Dataset

New modeling task, new dataset.

The games dataset describes the win/loss results of basketball teams.

14

The feature, GOAL_DIFF, is the difference in field goal success rate between teams.

If a team won their game, we say they are in “Class 1”; otherwise, “Class 0”.

15 of 67

Why Not Least Squares Linear Regression?

15

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

– Abraham Maslow, The Psychology of Science

Problems:

  • The output can be outside the label range {0, 1}
  • Some outputs can’t be interpreted: what does a class of “-2.3” mean?

Demo

16 of 67

Back to Basics: the Graph of Averages

Clearly, least squares regression won’t work here. We need to start fresh.

In Data 8, you built up to the idea of linear regression by considering the graph of averages.

16

[Figure from the Data 8 textbook: bucket the x-axis into bins; the average y within each bin is approximately linear in x, motivating a parametric linear model.]

For an input x, compute the average value of y for all nearby x, and predict that.

17 of 67

Graph of Averages for Our Classification Task

17

For an input x, compute the average value of y for all nearby x, and predict that.

[Data 8 textbook]

Demo

18 of 67

Graph of Averages for Our Classification Task

Looking a lot better!

Some observations:

  • All predictions are between 0 and 1
  • Our predictions are non-linear
    • Specifically, we see an “S” shape.

This will be important soon.

18

19 of 67

Graph of Averages for Our Classification Task

Thinking more deeply about what we’ve just done

To compute the average of a bin, we:

  • Counted the number of wins in the bin
  • Divided by the total number of datapoints in the bin

19

This is the probability that Y is 1 in that bin!
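A minimal sketch of this bin-and-average computation in pandas. The DataFrame, the column names ("GOAL_DIFF", "WON"), and the simulated data below are placeholders standing in for the demo's games data:

import numpy as np
import pandas as pd

# Placeholder data standing in for the demo's games DataFrame.
rng = np.random.default_rng(42)
goal_diff = rng.uniform(-0.3, 0.3, size=500)
won = (rng.uniform(size=500) < 1 / (1 + np.exp(-10 * goal_diff))).astype(int)
games = pd.DataFrame({"GOAL_DIFF": goal_diff, "WON": won})

# Bucket the x-axis into bins, then average the 0/1 responses within each bin.
# Each bin average = (# wins in bin) / (# points in bin) = P(Y = 1) for that bin.
bins = pd.cut(games["GOAL_DIFF"], bins=10)
graph_of_averages = games.groupby(bins, observed=True)["WON"].mean()
print(graph_of_averages)  # roughly S-shaped, every value between 0 and 1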

20 of 67

Predicting Probabilities

Our curve is modeling the probability that Y = 1 for a particular value of x.

  • This matches our observation that all predictions are between 0 and 1 – the predictions are representing probabilities

20

New modeling goal: model the probability of a datapoint belonging to Class 1

21 of 67

Handling the Non-Linear Output

We still seem to have a problem: our probability curve is non-linear, but we only know linear modeling techniques.

Good news: we’ve seen this before! To capture non-linear relationships, we:

  1. Applied transformations to linearize the relationship
  2. Fit a linear model to the transformed variables
  3. Transformed the variables back to find their underlying relationship

We’ll use the exact same process here.

21

New modeling goal: model the probability of a datapoint belonging to Class 1

22 of 67

Step 1: Linearize the relationship

Our S-shaped probability curve doesn’t match any relationship we’ve seen before. The bulge diagram won’t help here.

To understand what transformation to apply, think about our eventual goal: assign each datapoint to its most likely class.

22

“Odds” is defined as the ratio of the probability of Y being Class 1 to the probability of Y being Class 0: odds = P(Y = 1 | x) / P(Y = 0 | x) = p / (1 − p).

How do we decide which class is more likely? One way: check which class has the higher predicted probability.

23 of 67

Step 1: Linearize the relationship

The odds curve looks roughly exponential. To linearize it, we’ll take the logarithm*.

23

*In Data 100: assume that “log” means base e natural log unless told otherwise
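A small sketch of the two transformations applied to a handful of illustrative bin-level probabilities (the numbers are made up, not the dataset's):

import numpy as np

p = np.array([0.05, 0.15, 0.35, 0.60, 0.82, 0.95])  # made-up bin averages

odds = p / (1 - p)        # P(Y = 1) / P(Y = 0); defined only for 0 < p < 1
log_odds = np.log(odds)   # natural log, per the Data 100 convention

print(np.round(odds, 3))      # grows roughly exponentially across the bins
print(np.round(log_odds, 3))  # roughly linear across the bins -- this is what we fit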

24 of 67

Step 2: Find a linear combination of the features

We now have a linear relationship between our transformed variables. We can represent this relationship as a linear combination of the features.

24

Our linear combination: z = xᵀθ, which models the log-odds.

Remember that our goal is to predict the probability of Class 1. This linear combination isn’t our final prediction! We use z instead of y_hat to remind ourselves.

25 of 67

Step 3: Transform back to original variables

25

Solve for p: starting from log(p / (1 − p)) = z, exponentiate both sides to get p / (1 − p) = e^z, and therefore

p = e^z / (1 + e^z) = 1 / (1 + e^(−z))

This is called the logistic function, 𝜎(z) = 1 / (1 + e^(−z)).

[Figures: log(odds) vs. x is linear; odds vs. x is exponential; p vs. x is S-shaped.]

26 of 67

Arriving at the Logistic Regression Model

We have just derived the logistic regression model for the probability of a datapoint belonging to Class 1.

26

To predict a probability:

  • Compute a linear combination of the features, z = xᵀθ
  • Apply the sigmoid function: p̂ = 𝜎(z) = 𝜎(xᵀθ)
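A minimal sketch of these two steps in NumPy; the feature matrix and θ values below are illustrative, not fitted to any real data:

import numpy as np

def sigmoid(t):
    # Logistic (sigmoid) function: maps any real number into (0, 1).
    return 1 / (1 + np.exp(-t))

def predict_proba(X, theta):
    # P(Y = 1 | x) for each row of X: linear combination, then sigmoid.
    z = X @ theta       # step 1: z = x^T theta
    return sigmoid(z)   # step 2: squash z into a probability

X = np.array([[1.0, 0.2], [1.0, -0.5]])  # first column acts as an intercept
theta = np.array([0.1, 3.0])
print(predict_proba(X, theta))           # two probabilities in (0, 1)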

27 of 67

The Sigmoid Function

The S-shaped curve we derived is formally known as the sigmoid function.

  • You worked with it in Homework 1B; you just didn’t know it at the time

27

Key properties of 𝜎(t) = 1 / (1 + e^(−t)):

  • Domain: all real numbers, (−∞, ∞)
  • Range: (0, 1)
  • Reflection/Symmetry: 1 − 𝜎(t) = 𝜎(−t)
  • Derivative: 𝜎′(t) = 𝜎(t)(1 − 𝜎(t)) = 𝜎(t)𝜎(−t)
  • Inverse: 𝜎⁻¹(p) = log(p / (1 − p)), the log-odds
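A quick numerical sanity check of these properties (a sketch; nothing here is specific to the lecture's demo):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

t = np.linspace(-5, 5, 201)
s = sigmoid(t)

# Range: outputs stay strictly between 0 and 1.
assert np.all((s > 0) & (s < 1))

# Reflection/symmetry: 1 - sigma(t) = sigma(-t).
assert np.allclose(1 - s, sigmoid(-t))

# Derivative: sigma'(t) = sigma(t)(1 - sigma(t)), checked by finite differences.
h = 1e-6
numeric_deriv = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
assert np.allclose(numeric_deriv, s * (1 - s), atol=1e-6)

# Inverse: sigma^{-1}(p) = log(p / (1 - p)) recovers t, the log-odds.
assert np.allclose(np.log(s / (1 - s)), t)
print("all sigmoid properties check out")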

28 of 67

The Sigmoid Converts Numerical Features to Probabilities

28

Input: numeric features

Model: linear combination transformed by activation function

Decision rule

Output: class

IsAdelie?

If y_hat > 0.5: predict “Adelie”

Otherwise: predict “not Adelie”

In logistic regression, the sigmoid transforms a linear combination of numerical features into a probability

Tomorrow’s lecture

29 of 67

Formalizing the Logistic Regression Model

Our main takeaways of this section:

  • Fit the “S” curve as well as possible.
  • The curve models a probability: P(Y = 1 | x).
  • Assume the log-odds is a linear combination of x and θ.

Putting it all together:

29

The logistic regression model is most commonly written as follows:

P̂_θ(Y = 1 | x) = 𝜎(xᵀθ) = 1 / (1 + e^(−xᵀθ))

  • The left-hand side is the estimated probability that, given the features x, the response is 1.
  • The right-hand side is the logistic function 𝜎( ) evaluated at xᵀθ: it looks like linear regression, now wrapped in 𝜎( )!

🎉

30 of 67

Example Calculation

Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).

I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters:

Now, you want to predict the probability that a new team wins their game.

30

🤔

31 of 67

Suppose x^T contains the GOAL_DIFF and number of free throws for a new team. What is the predicted probability that this new team will win?


32 of 67

Example Calculation

Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).

I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters:

Now, you want to predict the probability that a new team wins their game.

32

🤔

Because the response is more likely to be 1 than 0, a reasonable prediction is y_hat = 1

(more on this next lecture)
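Since the fitted parameter values do not survive in the text above, here is the mechanics of the calculation with made-up numbers; both θ̂ and the new team's features are hypothetical:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Hypothetical fitted parameters for a no-intercept model on
# [GOAL_DIFF, number of free throws] -- placeholders, not the lecture's values.
theta_hat = np.array([5.0, 0.1])
x_new = np.array([0.15, 4.0])        # hypothetical new team's features

p_hat = sigmoid(x_new @ theta_hat)   # P(Y = 1 | x) = sigma(x^T theta)
print(p_hat)                         # ~0.76 for these made-up numbers

# Decision-rule preview: predict class 1 ("win") when p_hat exceeds 0.5.
print(int(p_hat > 0.5))              # 1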

33 of 67

Properties of the Logistic Model

Consider a logistic regression model with one feature and an intercept term (Desmos): P̂(Y = 1 | x) = 𝜎(θ₀ + θ₁x)

33

Properties:

  • θ₀ controls the position of the curve along the horizontal axis
  • The magnitude of θ₁ controls the “steepness” of the sigmoid
  • The sign of θ₁ controls the orientation of the curve
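A small numeric sketch of these effects; the parameter values are arbitrary:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def logistic_curve(x, theta0, theta1):
    # One-feature logistic model with an intercept: sigma(theta0 + theta1 * x).
    return sigmoid(theta0 + theta1 * x)

x = np.linspace(-5, 5, 5)
# Larger |theta1| -> steeper transition around the curve's midpoint.
print(np.round(logistic_curve(x, 0, 1), 3))
print(np.round(logistic_curve(x, 0, 5), 3))
# Negative theta1 flips the orientation (decreasing instead of increasing).
print(np.round(logistic_curve(x, 0, -1), 3))
# theta0 shifts the curve horizontally: the midpoint sits at x = -theta0 / theta1.
print(np.round(logistic_curve(x, 2, 1), 3))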

34 of 67

Interlude

Classification is hard.

34

35 of 67

Cross-Entropy Loss

Lecture 18, Data 100 Summer 2023

  • Regression vs. Classification
  • The Logistic Regression Model
  • Cross-Entropy Loss

35

36 of 67

The Modeling Process

36

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

37 of 67

The Modeling Process

37

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: ??
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

Can squared loss still work?

38 of 67

Toy Dataset: L2 Loss

38

Logistic Regression model: ŷ = 𝜎(xθ). Assume no intercept, so x and θ are both scalars.

Mean Squared Error: R(θ) = (1/n) Σᵢ (yᵢ − 𝜎(xᵢθ))²

⚠️ The MSE loss surface for logistic regression has many issues!

Demo

39 of 67

What problems might arise from using MSE loss with logistic regression?


40 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

40

The secant line dips below the function, so R(θ) is not convex: R″(θ) ≥ 0 does not hold for all θ.

Demo

41 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

41

Gradient Descent: different initial guesses can yield different (locally optimal) estimates.
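The demo defines mse_loss_toy_nobias in its notebook; below is a guess at its shape, assuming a tiny toy dataset of scalar (x, y) pairs and a no-intercept model. The data here are placeholders, so the demo's printed values further down will not be reproduced exactly:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Placeholder toy data; the demo's actual points may differ.
x_toy = np.array([-4.0, -2.0, -0.5, 1.0, 3.0, 5.0])
y_toy = np.array([0, 0, 1, 0, 1, 1])

def mse_loss_toy_nobias(theta):
    # Mean squared error of the no-intercept logistic model sigma(x * theta).
    return np.mean((y_toy - sigmoid(x_toy * theta)) ** 2)

print(mse_loss_toy_nobias(np.array([1.0])))  # loss at theta = 1 for the toy data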

from scipy.optimize import minimize

minimize(mse_loss_toy_nobias, x0 = 0)["x"][0]
# demo output: 0.5446601825581691

minimize(mse_loss_toy_nobias, x0 = -5)["x"][0]
# demo output: -10.343653061026611

The secant line dips below the function, so R(θ) is not convex: R″(θ) ≥ 0 does not hold for all θ.

Demo

42 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

42

Gradient Descent: different initial guesses can yield different (locally optimal) estimates.

The secant line dips below the function, so R(θ) is not convex: R″(θ) ≥ 0 does not hold for all θ.

Demo

43 of 67

Pitfalls of Squared Loss

1. Non-convex. Gets stuck in local minima.

2. Bounded. Not a good measure of model error.

  • We’d like loss functions to penalize “off” predictions.
  • MSE never gets very large, because both response and predicted probability are bounded by 1.

43

If the true y = 1 but the predicted probability is 0, the squared loss is only (1 − 0)² = 1.

Demo

44 of 67

Choosing a Different Loss Function

44

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: Cross-Entropy Loss
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

45 of 67

Loss in Classification

Let y be a binary label {0, 1}, and p be the model’s predicted probability of the label being 1.

In a classification task, how do we want our loss function to behave?

  • When the true y is 1, we should incur low loss when the model predicts large p
  • When the true y is 0, we should incur high loss when the model predicts large p

In other words, the behavior we need from our loss function depends on the value of the true class, y

45

46 of 67

Cross-Entropy Loss

Let y be a binary label {0, 1}, and p be the probability of the label being 1.

The cross-entropy loss is defined piecewise:

  • If y = 1: loss = −log(p)
  • If y = 0: loss = −log(1 − p)

46

For y = 1,

  • p → 0: infinite loss
  • p → 1: zero loss

For y = 0,

  • p → 0: zero loss
  • p → 1: infinite loss

47 of 67

Cross-Entropy Loss: Two Loss Functions In One!

The piecewise loss function we just introduced is difficult to optimize – we don’t want to check “which” loss to use at each step of optimizing θ.

Cross-entropy loss can be equivalently expressed as:

loss = −(y log(p) + (1 − y) log(1 − p))

47

  • For y = 1, only the y log(p) term stays
  • For y = 0, only the (1 − y) log(1 − p) term stays
  • The leading negative sign makes the loss positive
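A minimal sketch of this single-datapoint loss, showing the behavior described above:

import numpy as np

def cross_entropy(y, p):
    # Cross-entropy loss for a single true label y and predicted probability p.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# y = 1: small loss when p is near 1; the loss blows up as p -> 0.
print(cross_entropy(1, 0.99))  # ~0.01
print(cross_entropy(1, 0.01))  # ~4.6

# y = 0: the behavior flips.
print(cross_entropy(0, 0.01))  # ~0.01
print(cross_entropy(0, 0.99))  # ~4.6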

48 of 67

Empirical Risk: Average Cross-Entropy Loss

For a single datapoint, the cross-entropy curve is convex. It has a global minimum.

What about average cross-entropy loss, i.e., empirical risk?

For logistic regression, the empirical risk over a sample of size n is:

R(θ) = −(1/n) Σᵢ [yᵢ log(𝜎(Xᵢᵀθ)) + (1 − yᵢ) log(1 − 𝜎(Xᵢᵀθ))]

The optimization problem is therefore to find the estimate θ̂ that minimizes R(θ):

θ̂ = argmin_θ R(θ)

48

[Recall our model is p̂ = P̂(Y = 1 | x) = 𝜎(xᵀθ)]
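A sketch of minimizing this empirical risk numerically with scipy, using simulated data and an explicitly written loss (sklearn's LogisticRegression performs an equivalent fit, with regularization added by default):

import numpy as np
from scipy.optimize import minimize

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def mean_cross_entropy(theta, X, y):
    # Empirical risk R(theta): average cross-entropy over the sample.
    p = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Simulated data: an intercept column plus one feature; true theta is about [0, 4].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = (rng.uniform(size=200) < sigmoid(4 * x)).astype(float)
X = np.column_stack([np.ones_like(x), x])

theta_hat = minimize(mean_cross_entropy, x0=np.zeros(2), args=(X, y))["x"]
print(theta_hat)  # roughly [0, 4], up to sampling noise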

49 of 67

Convexity Proof By Picture

49

Squared Loss Surface: a straight line (secant) crosses the curve, so the surface is non-convex.

Cross-Entropy Loss Surface: convex!

50 of 67

Bonus Material:

Maximum Likelihood Estimation

Lecture 18, Data 100 Summer 2023

It may have seemed like we just pulled cross-entropy loss out of thin air.

CE loss is justified by a probability analysis technique called maximum likelihood estimation. Read on if you would like to learn more.

Recorded walkthrough: link.

50

51 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

51

Training data has only responses (no features)

52 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

52

Training data has only responses (no features)

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

🤔

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

2. For the next flip, would you predict 1 or 0?

53 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

53

Training data has only responses (no features)

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

54 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

54

Training data has only responses (no features)

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

0.4 is the most “intuitive” θ for two reasons:

  1. It is the frequency of heads in our data.
  2. It maximizes the likelihood of our data.

55 of 67

The No-Input Binary Classifier

Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):

{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

For the next flip, do you predict heads or tails?

A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).

55

Training data has only responses (no features)

1. Of the below, which is the best θ? Why?

A. 0.8 B. 0.5 C. 0.4

D. 0.2 E. Something else

2. For the next flip, would you predict 1 or 0?

The most frequent outcome in the sample, which is tails (0).

Parameter θ: Probability that flip == 1 (Heads)

Prediction: 1 or 0

56 of 67

Likelihood of Data; Definition of Probability

A Bernoulli random variable Y with parameter p has distribution: P(Y = 1) = p, P(Y = 0) = 1 − p.

Given that all flips are IID from the same coin (probability of heads = p), the likelihood of our data is proportional to the probability of observing the datapoints.

56

Training data: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

Data likelihood: (1 − p)·(1 − p)·p·p·p·p·(1 − p)·(1 − p)·(1 − p)·(1 − p) = p⁴(1 − p)⁶

57 of 67

Likelihood of Data

A Bernoulli random variable Y with parameter p has distribution: P(Y = 1) = p, P(Y = 0) = 1 − p.

Given that all flips are IID from the same coin (probability of heads = p), the likelihood of our data is proportional to the probability of observing the datapoints.

57

Training data: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]

Data likelihood: p⁴(1 − p)⁶

An example of a bad estimate is the parameter p̂ = 0 (or p̂ = 1), since the likelihood of observing this training data, which contains both heads and tails, would then be 0.

58 of 67

Generalization of the Coin Demo

For training data: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

0.4 is the most “intuitive” θ for two reasons:

  1. It is the frequency of heads in our data.
  2. It maximizes the likelihood of our data: θ⁴(1 − θ)⁶ is largest at θ = 0.4.

How can we generalize this notion of likelihood to any random binary sample?

58

Parameter θ: Probability that an IID flip == 1 (Heads)

Prediction: 1 or 0

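A small sketch confirming the second point numerically: over a grid of candidate θ values, the likelihood θ⁴(1 − θ)⁶ of these ten flips peaks at the sample frequency of heads:

import numpy as np

flips = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
heads, tails = flips.sum(), (1 - flips).sum()

# Likelihood of the observed flips under an IID Bernoulli(theta) model:
# theta^(# heads) * (1 - theta)^(# tails).
theta_grid = np.linspace(0.001, 0.999, 999)
likelihood = theta_grid**heads * (1 - theta_grid)**tails

print(theta_grid[np.argmax(likelihood)])  # ~0.4, the sample frequency of heads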

59 of 67

A Compact Representation of the Bernoulli Probability Distribution

How can we generalize this notion of likelihood to any random binary sample?

59

Let Y be Bernoulli(p). The probability distribution, in its long, non-compact form: P(Y = 1) = p, P(Y = 0) = 1 − p.

It can be written compactly as:

P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y factor stays
  • For P(Y = 0), only the (1 − p)^(1 − y) factor stays


60 of 67

Generalized Likelihood of Binary Data

How can we generalize this notion of likelihood to any random binary sample?

60

Let Y be Bernoulli(p). The probability distribution can be written compactly: P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y factor stays
  • For P(Y = 0), only the (1 − p)^(1 − y) factor stays

If binary data y₁, …, yₙ are IID with the same probability p, then the likelihood of the data is:

∏ᵢ p^(yᵢ) (1 − p)^(1 − yᵢ)

Ex: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0} → p⁴(1 − p)⁶


61 of 67

Generalized Likelihood of Binary Data

How can we generalize this notion of likelihood to any random binary sample?

61

Let Y be Bernoulli(p). The probability distribution can be written compactly: P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y factor stays
  • For P(Y = 0), only the (1 − p)^(1 − y) factor stays

If binary data are IID with the same probability p, then the likelihood of the data is:

∏ᵢ p^(yᵢ) (1 − p)^(1 − yᵢ)

If the data are independent, each with its own probability pᵢ, then the likelihood of the data is:

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

(spoiler: for logistic regression, pᵢ = 𝜎(Xᵢᵀθ))

62 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find p̂₁, …, p̂ₙ that maximize

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

62

63 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find p̂₁, …, p̂ₙ that maximize

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

Equivalent, simpler optimization problems (useful because we eventually need to take a derivative):

63

maximize Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(log is an increasing function. If a > b, then log(a) > log(b).)

64 of 67

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We’d like to estimate p₁, …, pₙ.

Find p̂₁, …, p̂ₙ that maximize

∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

Equivalent, simpler optimization problems (useful because we eventually need to take a derivative):

64

maximize Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(log is an increasing function. If a > b, then log(a) > log(b).)

minimize −(1/n) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(negating turns maximization into minimization; dividing by n does not change the minimizer.)

65 of 67

Maximizing Likelihood == Minimizing Average Cross-Entropy

65

maximize ∏ᵢ pᵢ^(yᵢ) (1 − pᵢ)^(1 − yᵢ)

minimize −(1/n) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]

(log is increasing; max/min properties)

For logistic regression, let pᵢ = 𝜎(Xᵢᵀθ):

minimize −(1/n) Σᵢ [yᵢ log(𝜎(Xᵢᵀθ)) + (1 − yᵢ) log(1 − 𝜎(Xᵢᵀθ))]

This is average Cross-Entropy Loss!!

Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.

  • We are choosing the model parameters that are “most likely”, given this data.

Assumption: all data drawn independently from the same logistic regression model with parameter θ

  • It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE (OLS, Ridge Regression, etc.)
  • You will study MLE further in probability and ML classes. But now you know it exists.
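A quick numeric sketch of the equivalence: for any predicted probabilities pᵢ, the negative mean log-likelihood equals the average cross-entropy loss (the data and θ below are arbitrary, since the identity holds for any values):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Arbitrary data and parameters: the identity holds regardless of their values.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.integers(0, 2, size=50)
theta = np.array([0.3, -1.2])

p = sigmoid(X @ theta)

# Log-likelihood of the data under independent Bernoulli(p_i) responses.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Average cross-entropy loss for the same predictions.
avg_cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

assert np.isclose(-log_likelihood / len(y), avg_cross_entropy)
print("negative mean log-likelihood equals average cross-entropy loss")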

66 of 67

We Did it!

66

  1. Choose a model
     • Regression (predict a number): Linear Regression
     • Classification (predict a class): Logistic Regression
  2. Choose a loss function
     • Regression: Squared Loss or Absolute Loss
     • Classification: Average Cross-Entropy Loss
  3. Fit the model
     • Regression: Regularization; sklearn / gradient descent
     • Classification: Regularization; sklearn / gradient descent
  4. Evaluate model performance
     • Regression: R², Residuals, etc.
     • Classification: ?? (next time)

67 of 67

Logistic Regression I

Content credit: Acknowledgments

67

LECTURE 18