Logistic Regression II
Model Performance.
Data 100/Data 200, Spring 2022 @ UC Berkeley
Josh Hug and Lisa Yan
1
LECTURE 22
More Logistic Regression
2
Question & Problem
Formulation
Data
Acquisition
Exploratory Data Analysis
Prediction and
Inference
Reports, Decisions, and Solutions
?
Logistic Regression II: Linear Separability; Accuracy, Precision, Recall
Classification Thresholds
Logistic Regression I: The Model
Cross-Entropy Loss
The Probabilistic View
(today)
Today’s Roadmap
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
3
Logistic Regression Model, continued
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
4
Logistic Regression with sklearn
5
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept=False)
model.fit(X, Y)
Task/Model: Binary Classification. Fit to objective function: Average Cross-Entropy Loss + regularization.
For logistic regression, sklearn applies regularization by default.
We’ll see why soon.
Demo
Sklearn: Predict Probabilities
6
model.predict_proba(X) # probs for all classes
model.classes_ # array([0, 1])
Demo
Sklearn: Classification
Equivalent “otherwise” condition:
Interpret: Given the input feature x: if Y is more likely to be 1 than 0, then predict 1; else predict 0.
7
model.predict_proba(X) # probs for all classes
model.classes_ # array([0, 1])
model.predict(X) # predict 1 or 0
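As a quick sanity check (not from the slides; it assumes the model and X fitted above), model.predict is equivalent to thresholding the class-1 probability at 0.5:

import numpy as np
p1 = model.predict_proba(X)[:, 1]          # P(Y = 1 | x) for each row of X
manual_preds = (p1 > 0.5).astype(int)      # threshold at 0.5
np.mean(manual_preds == model.predict(X))  # expect 1.0 (ties at exactly 0.5 aside)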
Demo
[High-Level] Maximum Likelihood Estimation
8
This material will not be tested; I’ve recorded a detailed video instead (link).
argmax of the likelihood ⟺ argmin of the average cross-entropy loss.
For logistic regression, let pᵢ = σ(xᵢᵀθ), so that P(Y = yᵢ | xᵢ) = pᵢ^yᵢ (1 − pᵢ)^(1−yᵢ) is the probability that the i-th response is yᵢ.
Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.
Assumption: all data are independent Bernoulli random variables.
Main takeaway: the optimal θ that minimizes mean cross-entropy loss “pushes” all probabilities in the direction of the true class: want pᵢ close to 1 when yᵢ = 1, and pᵢ close to 0 when yᵢ = 0.
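A hedged reconstruction of the equation this slide summarizes (the standard logistic regression MLE setup, with pᵢ = σ(xᵢᵀθ)):

\hat{\theta}
  = \arg\max_{\theta} \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}
  = \arg\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big],
  \qquad p_i = \sigma(x_i^{\top} \theta)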
Linear separability and Regularization
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
10
Logistic Regression with sklearn
11
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(fit_intercept=False)
model.fit(X, Y)
Task/Model: Binary Classification. Fit to objective function: Average Cross-Entropy Loss + regularization.
Why does sklearn always apply regularization?
Demo
Linear Separability
A classification dataset is said to be linearly separable if there exists a hyperplane among input features x that separates the two classes y.
12
If there is one feature, the input feature is 1-D.
separable
not separable
Linear Separability
A classification dataset is said to be linearly separable if there exists a hyperplane among input features x that separates the two classes y.
13
If there is one feature, the input feature is 1-D.
separable
not separable
If there are two features, the input features are 2-D. Use a scatter plot to see separability.
separable
not separable
Linear Separability Creates Diverging Weights
Consider the simplified logistic regression model fit to the toy data:
14
C.
D.
What will be the optimal weight theta? Why?
A.
B.
[Hint] The optimal theta should “push” probabilities in the direction of the true class:
🤔
Linear Separability Creates Diverging Weights
Consider the simplified logistic regression model fit to the toy data:
15
(the minimum loss happens as θ → −∞)
C.
D.
What will be the optimal weight theta? Why?
A.
B.
[Hint] The optimal theta should “push” probabilities in the direction of the true class:
Linear Separability Creates Diverging Weights
Consider the simplified logistic regression model fit to the toy data:
16
(the minimum loss happens as θ → −∞)
C.
D.
What will be the optimal weight theta? Why?
A.
B.
[Hint] The optimal theta should “push” probabilities in the direction of the true class:
(Impossible to see, but) plateau is slightly tilted downwards.
Loss approaches 0 as theta decreases.
direction of gradient
Linear Separability Creates Diverging Weights
Consider the simplified logistic regression model fit to the toy data:
17
Linear Separability Creates Diverging Weights
Consider the simplified logistic regression model fit to the toy data:
18
Diverging weights
Divergent weights (i.e., |θ| → ∞) occur with linearly separable data.
“Overconfidence” is a particularly dangerous version of overfitting.
This model is overconfident.
Loss is infinite.
Typo fixed 5/3
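A minimal sketch (not the lecture notebook) of this divergence on a hypothetical separable toy dataset: as regularization is weakened (larger C), the fitted coefficient keeps growing in magnitude.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical linearly separable toy data: y = 1 exactly when x < 0.
x_toy = np.array([[-1.0], [-0.5], [0.5], [1.0]])
y_toy = np.array([1, 1, 0, 0])

for C in [1, 100, 10_000]:                       # larger C means weaker regularization
    model = LogisticRegression(fit_intercept=False, C=C)
    model.fit(x_toy, y_toy)
    print(C, model.coef_)                        # |theta| keeps increasing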
Regularized Logistic Regression
To avoid large weights (particularly on linearly separable data), use regularization.
19
⚠️ Without regularization: the argmin is infinite (the weights diverge).
✅ With regularization: the argmin is finite.
# sklearn defaults (other arguments keep their default values)
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, Y)
The regularization hyperparameter C is the inverse of λ: C = 1/λ.
Set C large for minimal regularization, e.g., C=300.0.
Performance Metrics
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
20
Next Time
21
| Step | Regression ( ) | Classification ( ) |
| 1. Choose a model | Linear Regression | Logistic Regression |
| 2. Choose a loss function | Squared Loss or Absolute Loss | Average Cross-Entropy Loss ✅ |
| 3. Fit the model | Regularization, Sklearn/Gradient descent | Regularization, Sklearn/Gradient descent ✅ |
| 4. Evaluate model performance | R², Residuals, etc. | Let’s do it! |
Classifier Accuracy
Now that we actually have our classifier, let’s try to quantify how well it performs.
The most basic evaluation metric for a classifier is accuracy.
22
model.score(X, Y) # 0.8691
(sklearn documentation)
While widely used, the accuracy metric is not very meaningful when dealing with class imbalance in a dataset.
import numpy as np

def accuracy(X, Y):
    return np.mean(model.predict(X) == Y)

accuracy(X, Y)  # 0.8691
Pitfalls of Accuracy: A Case Study
Suppose we’re trying to build a classifier to filter spam emails.
Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.
23
Your friend (“Friend 1”):
Classify every email as ham (0).
🤔
Pitfalls of Accuracy: A Case Study
Suppose we’re trying to build a classifier to filter spam emails.
Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.
24
Your friend (“Friend 1”):
Classify every email as ham (0).
High accuracy… but we detected none ⚠️ of the spam!!!
Pitfalls of Accuracy: A Case Study
Suppose we’re trying to build a classifier to filter spam emails.
Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.
25
Your friend (“Friend 1”):
Classify every email as ham (0).
Your other friend (“Friend 2”):
Classify every email as spam (1).
High accuracy… but we detected none ⚠️ of the spam!!!
Low ⚠️ accuracy… but we detected all of the spam!!!
Pitfalls of Accuracy: Class Imbalance
Suppose we’re trying to build a classifier to filter spam emails.
Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.
26
Your friend (“Friend 1”):
Classify every email as ham (0).
Your other friend:
Classify every email as spam (1).
Accuracy is not always a good metric for classification, particularly when your data have class imbalance (e.g., very few 1’s compared to 0’s).
High accuracy… but we detected none ⚠️ of the spam!!!
Low ⚠️ accuracy… but we detected all of the spam!!!
Types of Classification Successes/Errors: The Confusion Matrix
27
| Actual \ Prediction | 0 | 1 |
| 0 | True negative (TN) | False positive (FP) |
| 1 | False negative (FN) | True positive (TP) |
“positive” means a prediction of 1. “negative” means a prediction of 0.
Types of Classification Successes/Errors: The Confusion Matrix
28
| Actual \ Prediction | 0 | 1 |
| 0 | True negative (TN) | False positive (FP) |
| 1 | False negative (FN) | True positive (TP) |
A confusion matrix plots these four quantities for a particular classifier and dataset.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_true, Y_pred)
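A small usage sketch (assuming Y_true and Y_pred as above): sklearn's confusion_matrix uses the same layout, rows = actual and columns = predicted, so the four counts can be unpacked directly.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_true, Y_pred)  # [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()            # flatten row-wise into the four counts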
Accuracy, Precision, and Recall
29
| Actual \ Prediction | 0 | 1 |
| 0 | TN | FP |
| 1 | FN | TP |
What proportion of points did our classifier classify correctly?
Accuracy, Precision, and Recall
Precision and recall are two commonly used metrics that measure performance even in the presence of class imbalance.
30
| Actual \ Prediction | 0 | 1 |
| 0 | TN | FP |
| 1 | FN | TP |
Of all observations that were predicted to be 1, what proportion were actually 1?
What proportion of points did our classifier classify correctly?
Accuracy, Precision, and Recall
Precision and recall are two commonly used metrics that measure performance even in the presence of class imbalance.
31
| Actual \ Prediction | 0 | 1 |
| 0 | TN | FP |
| 1 | FN | TP |
What proportion of points did our classifier classify correctly?
Of all observations that were predicted to be 1, what proportion were actually 1?
Of all observations that were actually 1, what proportion did we predict to be 1? (Also known as sensitivity.)
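In formulas (standard definitions, restated here because the slide's equations did not survive extraction):

\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}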
One of the Most Valuable Graphics on Wikipedia
32
[Figure adapted from Wikipedia: “relevant” items are those whose true class is 1; “retrieved” (positive) items are those whose predicted class is 1.]
Back to the Spam
Suppose we’re trying to build a classifier to filter spam emails.
Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.
33
Your friend:
Classify every email as ham (0).
| Actual \ Prediction | 0 | 1 |
| 0 | TN: 95 | FP: 0 |
| 1 | FN: 5 | TP: 0 |
Back to the Spam
Suppose we’re trying to build a classifier to filter spam emails.
Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.
34
Your friend:
Classify every email as ham (0).
Your other friend (“Friend 2”):
Classify every email as spam (1).
Never positive!
Many false positives!
No false negatives!
| Actual \ Prediction | 0 | 1 |
| 0 | TN: 0 | FP: 95 |
| 1 | FN: 0 | TP: 5 |
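A small sketch (made-up variable names, not from the demo notebook) that reproduces both friends' metrics on the 100-email case study:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([1] * 5 + [0] * 95)     # 5 spam, 95 ham
friend1 = np.zeros(100, dtype=int)        # classify everything as ham (0)
friend2 = np.ones(100, dtype=int)         # classify everything as spam (1)

for name, y_pred in [("Friend 1", friend1), ("Friend 2", friend2)]:
    print(name,
          accuracy_score(y_true, y_pred),                    # 0.95 vs 0.05
          precision_score(y_true, y_pred, zero_division=0),  # 0.00 vs 0.05
          recall_score(y_true, y_pred))                      # 0.00 vs 1.00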
Precision vs. Recall
Precision penalizes false positives, and Recall penalizes false negatives.
35
We can achieve 100% recall by making our classifier output “1”, regardless of the input.
(see extra slides re: the precision-recall curve)
This suggests that there is a tradeoff between precision and recall;�they are often inversely related.
Which Performance Metric?
In many settings, there might be a much higher cost to missing positive cases. For our tumor classifier:
36
How do we engineer classifiers to meet the performance goals of our problem?
Interlude
Break (2 min)
37
Adjusting the Classification Threshold
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
38
Engineering, Part 1: Deciding a Model
39
[Diagram: observation vector x → model with parameter P(Y = 1 | x), the probability that the response = 1 → classify → prediction ≈ categorical response]
Feature Engineering:
What are the features x that generate great probabilities for prediction?
Engineering, Part 2: Deciding a Classification Threshold
Classification:
What is the classification threshold T that best fits our problem context?
40
[Diagram: observation vector x → model with parameter P(Y = 1 | x), the probability that the response = 1 → classify → prediction ≈ categorical response]
sklearn’s model.predict() uses fixed 0.5
Classification Threshold
The default threshold in sklearn is T = 0.5.
41
Classification Threshold
As we increase the threshold T, we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).
[Figure: thresholds T = 0.25, 0.50, 0.75 on the predicted-probability curve; the inputs x whose probability exceeds T all predict 1, so raising T yields fewer positives.]
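A minimal sketch of thresholding by hand (the helper name is made up; model and X are assumed from the demo):

import numpy as np

def predict_at_threshold(model, X, T=0.5):
    # Predict 1 when P(Y = 1 | x) is at least T, else 0.
    return (model.predict_proba(X)[:, 1] >= T).astype(int)

# Raising T raises the bar for predicting 1, so there are fewer positives.
print(predict_at_threshold(model, X, T=0.25).sum(),
      predict_at_threshold(model, X, T=0.75).sum())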
Choosing an Accuracy Threshold
See notebook for code snippets.
The choice of threshold T impacts our classification performance.
Do we get max accuracy when T ≈ 0.5? Not always the case…
43
Best T ≈ 0.57, likely due to class imbalance: there are fewer malignant tumors, so we want to be more confident before classifying a tumor as malignant.
Train Accuracy vs. Threshold
T ≈ 0.57
Demo
Tune Thresholds with Cross Validation
See notebook for code snippets.
The threshold should typically be tuned using cross validation.
44
For a threshold T: cross_val_acc = (1/k) × (accuracy on val set 1 + … + accuracy on val set k), where the model for fold i is fit to train set i.
T ≈ 0.56
Demo
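A sketch of the idea (not the notebook's exact code; it assumes X and Y are NumPy arrays): for each candidate T, average the validation accuracy over k folds, then keep the best T.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_val_accuracy(X, Y, T, k=5):
    # Mean validation accuracy of thresholding P(Y = 1 | x) at T across k folds.
    accs = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(X):
        model = LogisticRegression().fit(X[train_idx], Y[train_idx])
        preds = (model.predict_proba(X[val_idx])[:, 1] >= T).astype(int)
        accs.append(np.mean(preds == Y[val_idx]))
    return np.mean(accs)

thresholds = np.linspace(0.01, 0.99, 99)
best_T = max(thresholds, key=lambda T: cross_val_accuracy(X, Y, T))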
Choosing a Threshold According to Other Metrics?
The choice of threshold T impacts our classification performance.
Could we choose a threshold T based on metrics that measure false positives/false negatives?
Yes! Two options: the ROC curve and the precision-recall curve (see the extra slides).
Each of these visualizations has an associated performance metric: AUC (Area Under Curve).
45
Demo
Two More Metrics
46
| Actual \ Prediction | 0 | 1 |
| 0 | TN | FP |
| 1 | FN | TP |
True Positive Rate (TPR) = TP / (TP + FN): “What proportion of spam did I mark correctly?”
Same thing as recall; in statistics, sensitivity.
The ROC curve plots TPR vs FPR for different classification thresholds.
False Positive Rate (FPR) = FP / (FP + TN): “What proportion of regular email did I mark as spam?”
In statistics, this is 1 − specificity (specificity is the true negative rate).
Demo
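A minimal plotting sketch (assuming the fitted model, X, Y from the demo): sklearn's roc_curve sweeps the threshold and returns the FPR/TPR pairs.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(Y, model.predict_proba(X)[:, 1])
plt.plot(fpr, tpr)                      # one point per threshold
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")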
One of the Most Valuable Graphics on Wikipedia, Now With FPR
47
[Figure: the same Wikipedia graphic with FPR added. FPR answers “how many irrelevant items are retrieved?” Want precision and TPR (recall) close to 1.0; want FPR close to 0.0.]
The ROC curve plots TPR vs FPR for different classification thresholds.
[adapted from Wikipedia]
Not the ROC Curve, but Useful to Start with
The choice of threshold T impacts our classification performance.
As we increase T, both TPR and FPR decrease.
48
Demo
The ROC Curve
The ROC Curve plots this tradeoff.
49
🤔
Demo
The ROC Curve
The ROC Curve plots this tradeoff.
50
T = 0.1
T = 0.6
T = 0.9
Demo
The Perfect Classifier
The “perfect” classifier is the one that has a TPR of 1 and an FPR of 0.
51
Perfect Predictor
Demo
Performance Metric: Area Under Curve (AUC)
The “perfect” classifier is the one that has a TPR of 1 and an FPR of 0.
52
Perfect Predictor
We can compute the area under curve (AUC) of our model.
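A one-line sketch (assuming the same fitted model, X, Y): sklearn computes this area directly from the class-1 probabilities.

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(Y, model.predict_proba(X)[:, 1])  # 1.0 is perfect; 0.5 matches the random predictor below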
[Extra] What is the “worst” AUC and why is it 0.5?
A random predictor randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1.
Best possible AUC = 1. Terrible AUC = 0.5.
53
This slide was added post-lecture to clarify closing comments.
| Actual \ Prediction | 0 | 1 |
| 0 | TN = 0.5 n0 | FP = 0.5 n0 |
| 1 | FN = 0.5 n1 | TP = 0.5 n1 |
If T = 0.5:
FPR = 0.5 n0/((0.5 + 0.5)n0) = 0.5
TPR = 0.5 n1/((0.5 + 0.5)n1) = 0.5
Point on ROC curve is (0.5, 0.5).
| Actual \ Prediction | 0 | 1 |
| 0 | TN = 0.8 n0 | FP = 0.2 n0 |
| 1 | FN = 0.8 n1 | TP = 0.2 n1 |
If T = 0.8:
FPR = 0.2 n0/((0.2 + 0.8)n0) = 0.2
TPR = 0.2 n1/((0.2 + 0.8)n1) = 0.2
Point on ROC curve is (0.2, 0.2).
| Actual \ Prediction | 0 | 1 |
| 0 | TN = 0.3 n0 | FP = 0.7 n0 |
| 1 | FN = 0.3 n1 | TP = 0.7 n1 |
If T = 0.3:
FPR = 0.7 n0/((0.7 + 0.3)n0) = 0.7
TPR = 0.7 n1/((0.7 + 0.3)n1) = 0.7
Point on ROC curve is (0.7, 0.7).
On average, if your dataset is size n1+n0 (with n1 true class 1’s and n0 true class 0’s):
[Extra] What is the “worst” AUC and why is it 0.5?
Best possible AUC = 1. Terrible AUC = 0.5.
A random predictor randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1.
54
Perfect Predictor. AUC = 1.0
Area Under Curve (AUC) of random predictor is the area of this triangle.
Random Predictor. AUC = 0.5.
This slide was added post-lecture to clarify closing comments.
Common techniques for evaluating classifiers
Numerical assessments: accuracy, precision, recall, TPR/FPR, AUC.
Visualizations: confusion matrices, ROC curves, precision-recall curves.
55
Extra Slides
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
56
[Extra] Detailed Maximum Likelihood Estimation
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
57
Video: link
Out of scope, but useful for understanding where cross-entropy loss comes from.
The Modeling Process
58
| Step | Regression ( ) | Classification ( ) |
| 1. Choose a model | Linear Regression | Logistic Regression |
| 2. Choose a loss function | Squared Loss or Absolute Loss | Average Cross-Entropy Loss (“Wherefore use cross-entropy?”, Shakespeare) |
| 3. Fit the model | Regularization, Sklearn/Gradient descent | Regularization, Sklearn/Gradient descent ✅ |
| 4. Evaluate model performance | R², Residuals, etc. | ?? (next time) |
Why Use Cross-Entropy Loss?
This section will not be directly tested, but you will understand why we minimize cross-entropy loss for logistic regression.
Two common explanations:
59
Recall the Coin Demo (No-Input Classification)
For training data: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0}
0.4 is the most “intuitive” θ for two reasons: it is the proportion of 1’s (Heads) in the data, and it maximizes the likelihood of the data.
60
Parameter θ: Probability that IID flip == 1 (Heads)
Prediction: 1 or 0
likelihood
data (1’s and 0’s)
How can we generalize this notion of likelihood to any random binary sample?
(proportional to the probability of our data)
A Compact Representation of the Bernoulli Probability Distribution
How can we generalize this notion of likelihood to any random binary sample?
61
Long, non-compact form: P(Y = 1) = p and P(Y = 0) = 1 − p.
Let Y be Bernoulli(p). The probability distribution can be written compactly as P(Y = y) = p^y (1 − p)^(1−y): for P(Y = 1), only the p term stays; for P(Y = 0), only the (1 − p) term stays.
likelihood
data (1’s and 0’s)
Generalized Likelihood of Binary Data
How can we generalize this notion of likelihood to any random binary sample?
62
Let Y be Bernoulli(p), with the compact form P(Y = y) = p^y (1 − p)^(1−y): for P(Y = 1), only the p term stays; for P(Y = 0), only the (1 − p) term stays.
If binary data are IID with the same probability p, then the likelihood of the data is the product of these terms over all observations.
likelihood
data (1’s and 0’s)
Ex: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0} → likelihood = p^4 (1 − p)^6
likelihood�vs. probability
Generalized Likelihood of Binary Data
How can we generalize this notion of likelihood to any random binary sample?
63
(spoiler: for logistic regression, pᵢ = σ(xᵢᵀθ))
likelihood
data (1’s and 0’s)
Let Y be Bernoulli(p), with the compact form P(Y = y) = p^y (1 − p)^(1−y): for P(Y = 1), only the p term stays; for P(Y = 0), only the (1 − p) term stays.
If binary data are IID with the same probability p, then the likelihood of the data is the product of these terms.
If binary data are independent with different probability pi, then the likelihood of the data is:
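Reconstructed formulas for the three cases above (standard Bernoulli likelihoods, restated because the slide's equations did not survive extraction):

P(Y = y) = p^{y} (1 - p)^{1 - y}, \qquad
L(p) = \prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}, \qquad
L(p_1, \dots, p_n) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}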
Maximum Likelihood Estimation (MLE)
Our maximum likelihood estimation problem:
Find the θ that maximizes the likelihood of the data.
64
Maximum Likelihood Estimation (MLE)
Our maximum likelihood estimation problem:
Find the θ that maximizes the likelihood of the data.
Equivalent, simplifying optimization problems:
65
Maximize the log of the likelihood instead (log is an increasing function: if a > b, then log(a) > log(b)).
Maximum Likelihood Estimation (MLE)
Our maximum likelihood estimation problem:
Find the θ that maximizes the likelihood of the data.
Equivalent, simplifying optimization problems:
66
Maximize the log of the likelihood (log is an increasing function: if a > b, then log(a) > log(b)).
Minimize the negative of the log-likelihood. (Argmax property: the x that maximizes f(x) will minimize −f(x).)
Maximizing Likelihood == Minimizing Average Cross-Entropy
67
argmax of the likelihood = argmax of the log-likelihood = argmin of the negative mean log-likelihood (log is increasing; max/min properties), which is the Average Cross-Entropy Loss!!
For logistic regression, let pᵢ = σ(xᵢᵀθ): this becomes the Average Cross-Entropy Loss for Logistic Regression!!
🎉
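A hedged reconstruction of the chain of equivalences on this slide, with pᵢ = σ(xᵢᵀθ) for logistic regression:

\arg\max_{\theta} \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}
  = \arg\max_{\theta} \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big]
  = \arg\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big]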
[High-Level] Maximum Likelihood Estimation
69
argmax of the likelihood ⟺ argmin of the average cross-entropy loss.
For logistic regression, let pᵢ = σ(xᵢᵀθ), so that P(Y = yᵢ | xᵢ) = pᵢ^yᵢ (1 − pᵢ)^(1−yᵢ) is the probability that the i-th response is yᵢ.
Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.
Assumption: all data are independent Bernoulli random variables.
Main takeaway: the optimal θ that minimizes mean cross-entropy loss “pushes” all probabilities in the direction of the true class: want pᵢ close to 1 when yᵢ = 1, and pᵢ close to 0 when yᵢ = 0.
[High-Level] Maximum Likelihood Estimation
70
argmax of the likelihood ⟺ argmin of the average cross-entropy loss.
For logistic regression, let pᵢ = σ(xᵢᵀθ), so that P(Y = yᵢ | xᵢ) = pᵢ^yᵢ (1 − pᵢ)^(1−yᵢ) is the probability that the i-th response is yᵢ.
Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.
Assumption: all data are independent Bernoulli random variables.
Want pᵢ close to 1 when yᵢ = 1; want pᵢ close to 0 when yᵢ = 0.
It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE.
We Did it!
71
| Step | Regression ( ) | Classification ( ) |
| 1. Choose a model | Linear Regression | Logistic Regression |
| 2. Choose a loss function | Squared Loss or Absolute Loss | Average Cross-Entropy Loss ✅ |
| 3. Fit the model | Regularization, Sklearn/Gradient descent | Regularization, Sklearn/Gradient descent ✅ |
| 4. Evaluate model performance | R², Residuals, etc. | ?? (next time) |
Shakespeare
That which we call a rose would by any other name smell as sweet.
[Extra] Gradient Descent for Logistic Regression
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
72
Reference slides. Out of scope.
[Extra] Gradient Descent for Logistic Regression
73
| Step | Regression ( ) | Classification ( ) |
| 1. Choose a model | Linear Regression | Logistic Regression |
| 2. Choose a loss function | Squared Loss or Absolute Loss | Average Cross-Entropy Loss ✅ |
| 3. Fit the model | Regularization, Sklearn/Gradient descent | Regularization, Sklearn/Gradient descent ✅ |
| 4. Evaluate model performance | R², Residuals, etc. | Accuracy, Precision, Recall, ROC Curves ✅ |
Simplifying Average Cross-Entropy Loss
74
Slides from Spring 2020.
Gradient of Average Cross-Entropy Loss
75
Slides from Spring 2020.
Gradient Descent Algorithms
76
Slides from Spring 2020.
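The Spring 2020 slides are not reproduced here. As a minimal sketch (assuming a design matrix X, labels y, and the standard result that the gradient of average cross-entropy loss is (1/n) Xᵀ(σ(Xθ) − y)):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient(theta, X, y):
    # Gradient of average cross-entropy loss: (1/n) X^T (sigmoid(X theta) - y)
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta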
[Extra] Precision-Recall Curves
Lecture 22, Data 100 Spring 2022
Logistic Regression Model, continued
Linear separability and Regularization
Performance Metrics
Adjusting the Classification Threshold
[Extra] Detailed MLE, Gradient Descent, PR curves
77
Reference slides. Out of scope.
Precision vs. threshold
As we increase our threshold, we have fewer and fewer false positives.
78
It is possible for precision to decrease slightly with an increased threshold. Why?
Recall vs. threshold
As we increase our threshold, we have more and more false negatives.
79
Recall never increases as we increase our threshold; it can only decrease or stay the same. Why?
Precision and Recall vs. Threshold
80
Precision-recall curves
We can also plot precision vs. recall, for all possible thresholds.
81
🤔
Precision-recall curves
We can also plot precision vs. recall, for all possible thresholds.
Answer:
82
T = 0.9
T = 0.6
T = 0.1
threshold
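A minimal plotting sketch (assuming the fitted model, X, Y from the demo): sklearn's precision_recall_curve sweeps all thresholds for us.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(Y, model.predict_proba(X)[:, 1])
plt.plot(recall, precision)             # one point per threshold
plt.xlabel("Recall")
plt.ylabel("Precision")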
Precision-recall curves
The “perfect classifier” is one with precision of 1 and recall of 1.
83
Perfect Predictor
Logistic Regression II
Content credit: Lisa Yan, Suraj Rampure, Ani Adhikari, Josh Hug, Joseph Gonzalez
84
LECTURE 22