Logistic Regression I
Moving from regression to classification.
LECTURE 18
Goals for this Lecture
Moving away from linear regression – it’s time for a new type of model
Agenda
Regression vs Classification
Beyond Regression
Up until this point, we have been working exclusively with linear regression.
In the machine learning “landscape,” there are many other types of models.
[Roadmap figure: logistic regression (today + tomorrow); other model types (next week and the week after); for the rest of the ML landscape, take CS 188 or Data 102]
So Far: Regression
In regression, we use unbounded numeric features to predict an unbounded numeric output.
Input: numeric features
Model: linear combination
Output: numeric prediction (any value from −∞ to ∞)
Now: Classification
In classification, we use unbounded numeric features to predict a categorical class
An aside: we will use logistic “regression” to perform a classification task. Here, “regression” refers to the type of model, not the task being performed.
Input: numeric features
Model: linear combination transformed by non-linear sigmoid
Decision rule: if p > 0.5, predict “Adelie”; otherwise, predict “not Adelie”
Output: class (e.g., IsAdelie?)
Kinds of Classification
We are interested in predicting some categorical variable, or response, y.
Binary classification [Data 100]: e.g., win or lose, disease or no disease, spam or ham
Multiclass classification
Structured prediction tasks
Our new goal: predict a binary output ($\hat{y} = 0$ or $\hat{y} = 1$) given numeric input features.
The Modeling Process
1. Choose a model: Regression → Linear Regression; Classification → ??
2. Choose a loss function: Regression → Squared Loss or Absolute Loss; Classification → ??
3. Fit the model: Regression → Regularization, Sklearn/Gradient descent; Classification → Regularization, Sklearn/Gradient descent
4. Evaluate model performance: Regression → R², Residuals, etc.; Classification → ?? (tomorrow)
The Logistic Regression Model
The games Dataset
New modeling task, new dataset.
The games dataset describes the win/loss results of basketball teams.
The feature GOAL_DIFF: the difference in field goal success rate between teams
If a team won their game, we say they are in “Class 1”
Why Not Least Squares Linear Regression?
I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.
– Abraham Maslow, The Psychology of Science
Problems: least squares predictions are unbounded, so they can fall below 0 or above 1 and cannot be interpreted as probabilities, and a straight line fits the binary responses poorly.
Demo
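A minimal sketch of the demo's idea (fitting ordinary least squares to the binary win/loss outcome), assuming a file games.csv with columns GOAL_DIFF and WON; these names are assumptions, not the demo's actual code.

import pandas as pd
from sklearn.linear_model import LinearRegression

games = pd.read_csv("games.csv")      # hypothetical file containing the games data
X = games[["GOAL_DIFF"]]              # single numeric feature
y = games["WON"]                      # 1 if the team won ("Class 1"), 0 otherwise

ols = LinearRegression().fit(X, y)
preds = ols.predict(X)

# The least squares "predictions" are unbounded: they can dip below 0 or exceed 1,
# so they cannot be interpreted as probabilities of winning.
print(preds.min(), preds.max())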
Back to Basics: the Graph of Averages
Clearly, least squares regression won’t work here. We need to start fresh.
In Data 8, you built up to the idea of linear regression by considering the graph of averages.
Graph of averages: bucket the x-axis into bins; for an input x, compute the average value of y for all nearby x, and predict that. The plot of average y vs. x bins is approximately linear, which motivates the parametric linear model. [Data 8 textbook]
Graph of Averages for Our Classification Task
For an input x, compute the average value of y for all nearby x, and predict that.
[Data 8 textbook]
Demo
Graph of Averages for Our Classification Task
Looking a lot better!
Some observations: every bin average lies between 0 and 1, and the curve of averages has an S shape. This will be important soon.
Graph of Averages for Our Classification Task
Thinking more deeply about what we’ve just done
To compute the average of a bin, we add up the 0/1 responses and divide by the number of points in the bin. In other words, we compute the proportion of points in the bin with Y = 1.
This is the probability that Y is 1 in that bin!
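A rough sketch of this computation in pandas, again assuming columns named GOAL_DIFF and WON (placeholder names, not necessarily the demo's):

import pandas as pd

games = pd.read_csv("games.csv")                       # hypothetical file containing the games data
bins = pd.cut(games["GOAL_DIFF"], bins=20)             # bucket the x-axis into bins
graph_of_averages = games.groupby(bins, observed=True)["WON"].mean()

# Each bin's average of 0/1 responses is the proportion of wins in that bin,
# i.e., an estimate of P(Y = 1) for that range of GOAL_DIFF.
print(graph_of_averages)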
Predicting Probabilities
Our curve is modeling the probability that Y = 1 for a particular value of x.
New modeling goal: model the probability of a datapoint belonging to Class 1
Handling the Non-Linear Output
We still seem to have a problem: our probability curve is non-linear, but we only know linear modeling techniques.
Good news: we’ve seen this before! To capture non-linear relationships, we: (1) apply a transformation to linearize the relationship, (2) fit a linear model to the transformed variables, and (3) transform back to the original variables.
We’ll use the exact same process here.
New modeling goal: model the probability of a datapoint belonging to Class 1
Step 1: Linearize the relationship
Our S-shaped probability curve doesn’t match any relationship we’ve seen before. The bulge diagram won’t help here.
To understand what transformation to apply, think about our eventual goal: assign each datapoint to its most likely class.
How do we decide which class is more likely? One way: check which class has the higher predicted probability.
“Odds” is defined as the ratio of the probability of Y being Class 1 to the probability of Y being Class 0.
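In symbols, writing p for the probability of Class 1:

$\text{odds} = \dfrac{P(Y = 1)}{P(Y = 0)} = \dfrac{p}{1 - p}$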
Step 1: Linearize the relationship
The odds curve looks roughly exponential. To linearize it, we’ll take the logarithm*.
*In Data 100: assume that “log” means base e natural log unless told otherwise
Step 2: Find a linear combination of the features
We now have a linear relationship between our transformed variables. We can represent this relationship as a linear combination of the features.
Our linear combination: $z = x^T \theta$, so that $\log(\text{odds}) = x^T \theta$.
Remember that our goal is to predict the probability of Class 1. This linear combination isn’t our final prediction! We use z instead of $\hat{y}$ to remind ourselves of this.
Step 3: Transform back to original variables
Solve for p:
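A reconstruction of the solve-for-p algebra, writing z for the log-odds:

$\log\left(\dfrac{p}{1 - p}\right) = z$
$\dfrac{p}{1 - p} = e^{z}$
$p = e^{z}(1 - p)$
$p\,(1 + e^{z}) = e^{z}$
$p = \dfrac{e^{z}}{1 + e^{z}} = \dfrac{1}{1 + e^{-z}}$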
This is called the logistic function, 𝜎( ).
[Plots: p vs. x, odds vs. x, and log(odds) vs. x]
Arriving at the Logistic Regression Model
We have just derived the logistic regression model for the probability of a datapoint belonging to Class 1.
To predict a probability: $\hat{p} = \sigma(x^T \theta) = \dfrac{1}{1 + e^{-x^T \theta}}$
The Sigmoid Function
The S-shaped curve we derived is formally known as the sigmoid function.
Range: $(0, 1)$
Reflection/Symmetry: $1 - \sigma(t) = \sigma(-t)$
Domain: $(-\infty, \infty)$
Derivative: $\dfrac{d}{dt}\sigma(t) = \sigma(t)\,(1 - \sigma(t)) = \sigma(t)\,\sigma(-t)$
Inverse: $\sigma^{-1}(t) = \log\left(\dfrac{t}{1 - t}\right)$, the log-odds (logit) function
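A small numerical sanity check of these properties (not part of the original slides):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

t = np.linspace(-5, 5, 11)
p = sigmoid(t)

assert np.all((p > 0) & (p < 1))                 # range: strictly between 0 and 1
assert np.allclose(1 - sigmoid(t), sigmoid(-t))  # reflection/symmetry
assert np.allclose(np.log(p / (1 - p)), t)       # inverse: the log-odds (logit) undoes the sigmoid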
The Sigmoid Converts Numerical Features to Probabilities
Input: numeric features
Model: linear combination transformed by activation function
Decision rule: if $\hat{y} > 0.5$, predict “Adelie”; otherwise, predict “not Adelie”
Output: class (e.g., IsAdelie?)
In logistic regression, the sigmoid transforms a linear combination of numerical features into a probability
(More on the decision rule step in tomorrow’s lecture.)
Formalizing the Logistic Regression Model
Our main takeaway of this section: to model the probability of Class 1, take a linear combination of the features and pass it through the sigmoid.
Putting it all together, the logistic regression model is most commonly written as follows:
$\hat{P}_{\theta}(Y = 1 \mid x) = \sigma(x^T \theta) = \dfrac{1}{1 + e^{-x^T \theta}}$
The left-hand side is the estimated probability that, given the features x, the response is 1. The right-hand side is the logistic function $\sigma(\cdot)$ evaluated at $x^T \theta$, a linear combination that looks just like linear regression, now wrapped with $\sigma(\cdot)$! 🎉
Example Calculation
Suppose I want to predict the probability that a team wins a game, given GOAL_DIFF (first feature) and number of free throws (second feature).
I fit a logistic regression model (with no intercept) using my training data, and estimate the optimal parameters.
Now, you want to predict the probability that a new team wins their game.
🤔
Suppose x^T contains the GOAL_DIFF and number of free throws for a new team. What is the predicted probability that this new team will win?
Because the response is more likely to be 1 than 0, a reasonable prediction is $\hat{y} = 1$ (more on this next lecture).
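A worked sketch of the calculation with made-up numbers; the lecture's actual parameter estimates and feature values are not reproduced here, so every value below is purely illustrative.

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

theta_hat = np.array([3.0, 0.1])   # hypothetical estimates: [GOAL_DIFF, free throws]
x = np.array([0.15, 20.0])         # hypothetical new team: GOAL_DIFF = 0.15, 20 free throws

p_hat = sigmoid(x @ theta_hat)     # predicted P(win) = sigma(x^T theta)
print(p_hat)                       # if p_hat > 0.5, a reasonable class prediction is y_hat = 1 (win)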
Properties of the Logistic Model
Consider a logistic regression model with one feature and an intercept term (Desmos):
Properties: the intercept term shifts the S-shaped curve horizontally, while the coefficient on the feature controls how steep the curve is and whether it increases or decreases.
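A quick way to explore this, standing in for the Desmos demo (the parameter values below are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

x = np.linspace(-10, 10, 200)
for theta0, theta1 in [(0, 1), (2, 1), (0, 3), (0, -1)]:
    plt.plot(x, sigmoid(theta0 + theta1 * x), label=f"theta0={theta0}, theta1={theta1}")

plt.xlabel("x")
plt.ylabel("P(Y = 1 | x)")
plt.legend()
plt.show()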
Interlude
Classification is hard.
Cross-Entropy Loss
The Modeling Process
1. Choose a model: Regression → Linear Regression; Classification → Logistic Regression ✅
2. Choose a loss function: Regression → Squared Loss or Absolute Loss; Classification → ??
3. Fit the model: Regression → Regularization, Sklearn/Gradient descent; Classification → Regularization, Sklearn/Gradient descent
4. Evaluate model performance: Regression → R², Residuals, etc.; Classification → ?? (next time)
Can squared loss still work?
Toy Dataset: L2 Loss
Mean Squared Error: $R(\theta) = \dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \sigma(x_i^T \theta)\right)^2$
The MSE loss surface for logistic regression has many issues!
Logistic Regression model: $\hat{y} = \sigma(x\theta)$
Assume no intercept, so x and θ are both scalars.
⚠️
Demo
What problems might arise from using MSE loss with logistic regression?
Pitfalls of Squared Loss
1. Non-convex. Gets stuck in local minima.
A secant line crosses the function, so the loss surface is not convex: $R''(\theta)$ is not $\geq 0$ for all θ.
Demo
Gradient Descent: Different initial guesses will yield different optimal estimates.
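The loss function used below, mse_loss_toy_nobias, comes from the demo notebook and isn't defined here. A minimal sketch of what it might look like, with placeholder toy data (so the demo's exact numbers won't reproduce):

import numpy as np

toy_x = np.array([-4.0, -2.0, -0.5, 1.0, 3.0, 5.0])   # placeholder toy features
toy_y = np.array([0, 0, 1, 1, 1, 0])                   # placeholder binary responses

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def mse_loss_toy_nobias(theta):
    # MSE of the no-intercept logistic model: mean of (y - sigma(theta * x))^2
    return np.mean((toy_y - sigmoid(theta * toy_x)) ** 2)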
from scipy.optimize import minimize

minimize(mse_loss_toy_nobias, x0 = 0)["x"][0]     # demo result: 0.5446601825581691
minimize(mse_loss_toy_nobias, x0 = -5)["x"][0]    # demo result: -10.343653061026611
2. Bounded. Not a good measure of model error.
If the true y = 1 but the predicted probability is 0, the squared loss is only $(1 - 0)^2 = 1$: a small, bounded penalty for a maximally wrong prediction.
Demo
Choosing a Different Loss Function
1. Choose a model: Regression → Linear Regression; Classification → Logistic Regression ✅
2. Choose a loss function: Regression → Squared Loss or Absolute Loss; Classification → Cross-Entropy Loss
3. Fit the model: Regression → Regularization, Sklearn/Gradient descent; Classification → Regularization, Sklearn/Gradient descent
4. Evaluate model performance: Regression → R², Residuals, etc.; Classification → ?? (next time)
Loss in Classification
Let y be a binary label {0, 1}, and p be the model’s predicted probability of the label being 1.
In a classification task, how do we want our loss function to behave? If the true class is y = 1, the loss should be low when p is close to 1 and high when p is close to 0; if y = 0, the opposite.
In other words, the behavior we need from our loss function depends on the value of the true class, y.
Cross-Entropy Loss
Let y be a binary label {0, 1}, and p be the probability of the label being 1.
The cross-entropy loss is defined piecewise:
For y = 1, the loss is $-\log(p)$.
For y = 0, the loss is $-\log(1 - p)$.
Cross-Entropy Loss: Two Loss Functions In One!
The piecewise loss function introduced above is difficult to optimize: we don’t want to check which case applies at each step of optimizing θ.
Cross-entropy loss can be equivalently expressed as:
$\ell(y, p) = -\left( y \log(p) + (1 - y)\log(1 - p) \right)$
For y = 1, only the first term stays; for y = 0, only the second term stays. The leading negative sign makes the loss positive.
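A quick numeric check (not from the slides) that the single formula reproduces both cases:

import numpy as np

def cross_entropy(y, p):
    # -(y log p + (1 - y) log(1 - p))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 0.8
print(cross_entropy(1, p), -np.log(p))        # y = 1: both equal -log(p)
print(cross_entropy(0, p), -np.log(1 - p))    # y = 0: both equal -log(1 - p)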
Empirical Risk: Average Cross-Entropy Loss
For a single datapoint, the cross-entropy curve is convex. It has a global minimum.
What about average cross-entropy loss, i.e., empirical risk?
For logistic regression, the empirical risk over a sample of size n is:
$R(\theta) = -\dfrac{1}{n}\sum_{i=1}^{n}\left[ y_i \log\left(\sigma(x_i^T \theta)\right) + (1 - y_i)\log\left(1 - \sigma(x_i^T \theta)\right) \right]$
[Recall our model is $\hat{p} = \sigma(x^T \theta)$.]
The optimization problem is therefore to find the estimate $\hat{\theta}$ that minimizes $R(\theta)$: $\hat{\theta} = \underset{\theta}{\operatorname{argmin}}\, R(\theta)$
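A small sketch of this empirical risk in code; the design matrix, labels, and parameter values below are made-up placeholders, just to show the shapes involved.

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def mean_cross_entropy(theta, X, y):
    # R(theta) = -(1/n) * sum( y log p + (1 - y) log(1 - p) ), where p = sigma(X theta)
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, -0.2], [1.0, 0.3], [1.0, 1.5]])   # n x d design matrix (placeholder)
y = np.array([0, 1, 1])                               # placeholder labels
print(mean_cross_entropy(np.array([0.1, 2.0]), X, y))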
Convexity Proof By Picture
[Figure: the squared loss surface, where a straight (secant) line crosses the curve, so it is non-convex, vs. the cross-entropy loss surface, which is convex!]
Bonus Material:
Maximum Likelihood Estimation
It may have seemed like we just pulled cross-entropy loss out of thin air.
CE loss is justified by a probability analysis technique called maximum likelihood estimation. Read on if you would like to learn more.
Recorded walkthrough: link.
The No-Input Binary Classifier
Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):
{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}
For the next flip, do you predict heads or tails?
Training data has only responses $y_i$ (no features $x_i$)
A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).
Parameter θ: Probability that flip == 1 (Heads)
Prediction: 1 or 0
🤔
1. Of the below, which is the best θ? Why?
A. 0.8 B. 0.5 C. 0.4
D. 0.2 E. Something else
2. For the next flip, would you predict 1 or 0?
The No-Input Binary Classifier
Suppose you observed some outcomes of a coin (1 = Heads, 0 = Tails):
{0, 0, 1, 1, 1, 1, 0, 0, 0, 0}
For the next flip, do you predict heads or tails?
A reasonable model is to assume all flips are IID (i.e., same coin; same prob. of heads θ).
53
Training data has only responses (no features )
1. Of the below, which is the best theta θ? Why?
A. 0.8 B. 0.5 C. 0.4
D. 0.2 E. Something else
Parameter θ: Probability that�flip == 1 (Heads)
Prediction:�1 or 0
0.4 is the most “intuitive” θ for two reasons: it matches the frequency of heads in the sample (4 of the 10 flips), and, as we’ll see, it maximizes the likelihood of the observed data.
Predict 0: the most frequent outcome in the sample, which is tails.
Likelihood of Data; Definition of Probability
A Bernoulli random variable Y with parameter p has distribution $P(Y = 1) = p$ and $P(Y = 0) = 1 - p$.
Given that all flips are IID from the same coin (probability of heads = p), the likelihood of our data is proportional to the probability of observing the datapoints.
Training data: [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
Data likelihood: $p^4 (1 - p)^6$ (four heads and six tails).
An example of a bad estimate is a parameter p far from the observed proportion of heads, since the likelihood of observing the training data, $p^4(1 - p)^6$, is then very small.
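A quick numerical check of this (a simple grid search over p, not anything shown in the lecture):

import numpy as np

flips = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
p_grid = np.linspace(0.001, 0.999, 999)

# Likelihood of the observed flips: p^(# heads) * (1 - p)^(# tails)
likelihood = p_grid ** flips.sum() * (1 - p_grid) ** (len(flips) - flips.sum())

print(p_grid[np.argmax(likelihood)])   # approximately 0.4, the proportion of heads in the sample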
Generalization of the Coin Demo
For training data: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0}
0.4 is the most “intuitive” θ for the two reasons above.
How can we generalize this notion of likelihood to any random binary sample?
Parameter θ: Probability that an IID flip == 1 (Heads)
Prediction: 1 or 0
[Diagram: data (1’s and 0’s) → likelihood]
A Compact Representation of the Bernoulli Probability Distribution
Let Y be Bernoulli(p). In long, non-compact form, $P(Y = 1) = p$ and $P(Y = 0) = 1 - p$.
The probability distribution can be written compactly as $P(Y = y) = p^{y}(1 - p)^{1 - y}$.
For $P(Y = 1)$ (y = 1), only the $p^{y}$ factor stays; for $P(Y = 0)$ (y = 0), only the $(1 - p)^{1 - y}$ factor stays.
Generalized Likelihood of Binary Data
If binary data $y_1, \dots, y_n$ are IID with the same probability p, then the likelihood of the data is $\prod_{i=1}^{n} p^{y_i}(1 - p)^{1 - y_i}$.
Ex: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0} → $p^4 (1 - p)^6$
If the data are independent but each point has its own probability $p_i$, then the likelihood of the data is $\prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$.
(Spoiler: for logistic regression, $p_i = \sigma(x_i^T \theta)$.)
Maximum Likelihood Estimation (MLE)
Our maximum likelihood estimation problem:
Find the $\hat{p}_i$ that maximize the likelihood of the data, $\prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$.
Equivalent, simplifying optimization problems (since we need to take the first derivative):
Maximize the log-likelihood: $\sum_{i=1}^{n}\left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]$
(log is an increasing function: if a > b, then log(a) > log(b), so maximizing the log-likelihood also maximizes the likelihood.)
Equivalently, minimize the negative log-likelihood: $-\sum_{i=1}^{n}\left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]$
Maximizing Likelihood == Minimizing Average Cross-Entropy
Maximize $\prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$
⇔ maximize $\sum_{i=1}^{n}\left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]$ (log is increasing; max/min properties)
⇔ minimize $-\dfrac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]$, the average Cross-Entropy Loss!!
For logistic regression, let $p_i = \sigma(x_i^T \theta)$.
Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.
Assumption: all data drawn independently from the same logistic regression model with parameter θ
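To connect this back to the earlier demo, a hedged sketch of fitting the no-intercept toy model by minimizing average cross-entropy; the toy data below are the same placeholders as before, not the demo's actual data.

import numpy as np
from scipy.optimize import minimize

toy_x = np.array([-4.0, -2.0, -0.5, 1.0, 3.0, 5.0])   # placeholder toy features
toy_y = np.array([0, 0, 1, 1, 1, 0])                   # placeholder binary responses

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def mean_ce_loss(theta):
    p = sigmoid(theta * toy_x)
    return -np.mean(toy_y * np.log(p) + (1 - toy_y) * np.log(1 - p))

# Unlike the MSE surface, this objective is convex in theta, so different
# initial guesses should converge to (approximately) the same minimizer.
print(minimize(mean_ce_loss, x0=0.0)["x"][0])
print(minimize(mean_ce_loss, x0=-5.0)["x"][0])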
We Did it!
1. Choose a model: Regression → Linear Regression; Classification → Logistic Regression ✅
2. Choose a loss function: Regression → Squared Loss or Absolute Loss; Classification → Average Cross-Entropy Loss ✅
3. Fit the model: Regression → Regularization, Sklearn/Gradient descent; Classification → Regularization, Sklearn/Gradient descent
4. Evaluate model performance: Regression → R², Residuals, etc.; Classification → ?? (next time)
Logistic Regression I
Content credit: Acknowledgments
LECTURE 18