1 of 84

Logistic Regression II

Model Performance.

Data 100/Data 200, Spring 2022 @ UC Berkeley

Josh Hug and Lisa Yan

1

LECTURE 22

2 of 84

More Logistic Regression

2

(Data science lifecycle diagram: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions.)

Logistic Regression I: The Model, Cross-Entropy Loss, The Probabilistic View

Logistic Regression II (today): Linear Separability; Accuracy, Precision, Recall; Classification Thresholds

3 of 84

Today’s Roadmap

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

3

4 of 84

Logistic Regression Model, continued

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

4

5 of 84

Logistic Regression with sklearn

5

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=False)

model.fit(X, Y)

Task/Model: Binary Classification (y ∈ {0, 1})

Fit to objective function: Average Cross-Entropy Loss + regularization

For logistic regression, sklearn applies regularization by default.

We’ll see why soon.

Demo

6 of 84

Sklearn: Predict Probabilities

6

model.predict_proba(X) # probs for all classes

model.classes_ # array([0, 1])

Demo

7 of 84

Sklearn: Classification

Equivalent “otherwise” condition:

Interpret: Given the input feature x, if Y is more likely to be 1 than 0, then predict 1. Else predict 0.

7

model.predict_proba(X) # probs for all classes

model.classes_ # array([0, 1])

model.predict(X) # predict 1 or 0

Demo
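As a minimal sketch (reusing the fitted model and X from the demo above; not the notebook's exact code), the default predict can be reproduced by thresholding the class-1 column of predict_proba at 0.5:

import numpy as np

# Class-1 probabilities from the fitted model above.
p1 = model.predict_proba(X)[:, 1]

# Predict 1 when class 1 is more likely than class 0 (p1 > 0.5), else 0.
manual_preds = (p1 > 0.5).astype(int)

# Should match model.predict(X) (up to exact ties at 0.5).
print(np.array_equal(manual_preds, model.predict(X)))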

8 of 84

[High-Level] Maximum Likelihood Estimation

8

This material will not be tested; I've recorded a detailed video instead (link).

Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data:

θ̂ = argmin_θ −(1/n) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ] = argmax_θ ∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}

For logistic regression, let pᵢ = σ(xᵢᵀθ), so that pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ} is the probability that the i-th response is yᵢ.

Assumption: all data are independent Bernoulli random variables.

Main takeaway: The optimal theta that minimizes mean cross-entropy loss "pushes" all probabilities in the direction of the true class: if yᵢ = 1, we want pᵢ close to 1; if yᵢ = 0, we want pᵢ close to 0.
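A quick numerical sketch of this equivalence (the toy arrays below are made up for illustration, not lecture data): the average cross-entropy loss is exactly the negative mean log-likelihood of independent Bernoulli responses.

import numpy as np

y = np.array([1, 0, 1, 1, 0])             # toy true labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.4])   # toy predicted P(Y = 1 | x)

# Average cross-entropy loss.
avg_ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Likelihood of the data under independent Bernoulli(p_i) responses.
likelihood = np.prod(p**y * (1 - p)**(1 - y))

# The two agree: avg_ce == -log(likelihood) / n.
print(avg_ce, -np.log(likelihood) / len(y))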


10 of 84

Linear separability and Regularization

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

10

11 of 84

Logistic Regression with sklearn

11

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=False)

model.fit(X, Y)

Task/Model: Binary Classification (y ∈ {0, 1})

Fit to objective function: Average Cross-Entropy Loss + regularization

Why does sklearn always apply regularization?

Demo

12 of 84

Linear Separability

A classification dataset is said to be linearly separable if there exists a hyperplane among input features x that separates the two classes y.

12

If there is one feature, the input feature is 1-D.

  • The class label is not a feature; it is the output.
  • Use a rug plot to see separability.

(Example 1-D rug plots: one separable, one not separable.)

13 of 84

Linear Separability

A classification dataset is said to be linearly separable if there exists a hyperplane among input features x that separates the two classes y.

13

If there is one feature, the input feature is 1-D.

  • The class label is not a feature; it is the output.
  • Use a rug plot to see separability.

If there are two features, the input feature is 2-D. Use a scatter plot to see separability.

(Example plots: separable and not-separable cases in 1-D and 2-D.)

14 of 84

Linear Separability Creates Diverging Weights

Consider the simplified logistic regression model fit to the toy data:

14

What will be the optimal weight theta? Why? (Answer choices A–D shown on the slide.)

[Hint] The optimal theta should "push" probabilities in the direction of the true class.

🤔

15 of 84

Linear Separability Creates Diverging Weights

Consider the simplified logistic regression model fit to the toy data:

15

What will be the optimal weight theta? Why? (Answer choices A–D shown on the slide; as the next slides show, the loss keeps decreasing as theta decreases, so the optimal theta diverges.)

[Hint] The optimal theta should "push" probabilities in the direction of the true class.

16 of 84

Linear Separability Creates Diverging Weights

Consider the simplified logistic regression model fit to the toy data:

16

What will be the optimal weight theta? Why?

[Hint] The optimal theta should "push" probabilities in the direction of the true class.

(Loss plot: the plateau is slightly tilted downwards, though it's nearly impossible to see. Loss approaches 0 as theta decreases, so the direction of the gradient points toward ever more negative theta.)

17 of 84

Linear Separability Creates Diverging Weights

Consider the simplified logistic regression model fit to the toy data:

17


18 of 84

Linear Separability Creates Diverging Weights

Consider the simplified logistic regression model fit to the toy data:

18

Diverging weights: divergent weights (i.e., |θ| → ∞) occur with linearly separable data.

"Overconfidence" is a particularly dangerous version of overfitting. This model is overconfident:

  • Consider a new point (x, y) = (0.5, 1).
  • The model says P(Y = 1 | x) = 0, so it incorrectly predicts 0 even though the true class is y = 1.
  • The cross-entropy loss on this point is infinite.

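A minimal sketch of this diverging-weights behavior on made-up separable data (the toy arrays and C values below are illustrative, not the lecture's demo): as L2 regularization is weakened (larger C), the magnitude of the fitted coefficient keeps growing.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Linearly separable toy data: all negatives to the left of all positives.
x = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

for C in [1.0, 100.0, 10000.0]:
    m = LogisticRegression(C=C, fit_intercept=False, max_iter=10_000).fit(x, y)
    # With weaker regularization, |theta| keeps growing toward infinity.
    print(C, m.coef_.ravel())

This is exactly why sklearn regularizes by default, as the next slide discusses.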

19 of 84

Regularized Logistic Regression

To avoid large weights (particularly on linearly separable data), use regularization.

  • As with linear regression, standardize features first.

19

⚠️ Without regularization, the argmin is infinite (the weights diverge); with the regularized objective, the argmin is finite.

# sklearn defaults (other arguments left at their default values)
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, Y)

Regularization hyperparameter C is the inverse of λ. C = 1 / λ.

Set C big for minimal regularization, e.g., C=300.0.
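Since the slide recommends standardizing features before regularizing, here is a minimal sketch using a scikit-learn Pipeline (X, Y and the C value are placeholders, not the lecture's demo):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit L2-regularized logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', C=1.0))
clf.fit(X, Y)
print(clf.predict_proba(X)[:5])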

20 of 84

Performance Metrics

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

20

21 of 84

Next Time

21

1. Choose a model: Linear Regression (regression) vs. Logistic Regression (classification)
2. Choose a loss function: Squared Loss or Absolute Loss vs. Average Cross-Entropy Loss
3. Fit the model: Regularization; sklearn / gradient descent (both models)
4. Evaluate model performance: R², residuals, etc. vs. ? (today's topic)

Let’s do it!

22 of 84

Classifier Accuracy

Now that we actually have our classifier, let’s try and quantify how well it performs.

The most basic evaluation metric for a classifier is accuracy.

22

model.score(X, Y) # 0.8691

(sklearn documentation)

While widely used, the accuracy metric is not so meaningful when dealing with class imbalance in a dataset.

def accuracy(X, Y):
    return np.mean(model.predict(X) == Y)

accuracy(X, Y) # 0.8691

23 of 84

Pitfalls of Accuracy: A Case Study

Suppose we’re trying to build a classifier to filter spam emails.

  • Each email is spam (1) or ham (0).

Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.

23

  1. What is the accuracy of your friend’s classifier?
  2. Is accuracy a good metric of this classifier’s performance?

Your friend (“Friend 1”):

Classify every email as ham (0).

🤔

24 of 84

Pitfalls of Accuracy: A Case Study

Suppose we’re trying to build a classifier to filter spam emails.

  • Each email is spam (1) or ham (0).

Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.

24

Your friend (“Friend 1”):

Classify every email as ham (0).

High accuracy (95/100 = 95%)… but we detected none of the spam! ⚠️

25 of 84

Pitfalls of Accuracy: A Case Study

Suppose we’re trying to build a classifier to filter spam emails.

  • Each email is spam (1) or ham (0).

Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.

25

Your friend (“Friend 1”):

Classify every email as ham (0).

Your other friend (“Friend 2”):

Classify every email as spam (1).

High accuracy (95/100 = 95%)… but we detected none of the spam! ⚠️

Low accuracy (5/100 = 5%) ⚠️ … but we detected all of the spam!

26 of 84

Pitfalls of Accuracy: Class Imbalance

Suppose we’re trying to build a classifier to filter spam emails.

  • Each email is spam (1) or ham (0).

Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.

26

Your friend (“Friend 1”):

Classify every email as ham (0).

Your other friend:

Classify every email as spam (1).

Accuracy is not always a good metric for classification, particularly when your data have class imbalance (e.g., very few 1's compared to 0's).

High accuracy (95/100 = 95%)… but we detected none of the spam! ⚠️

Low accuracy (5/100 = 5%) ⚠️ … but we detected all of the spam!

27 of 84

Types of Classification Successes/Errors: The Confusion Matrix

  • True positives and true negatives are when we correctly classify an observation as being positive or negative, respectively.

27

              Prediction = 0         Prediction = 1
Actual = 0    True negative (TN)     False positive (FP)
Actual = 1    False negative (FN)    True positive (TP)

"Positive" means a prediction of 1; "negative" means a prediction of 0.

  • False positives are "false alarms": we predicted 1, but the true class was 0.
  • False negatives are "failed detections": we predicted 0, but the true class was 1.

28 of 84

Types of Classification Successes/Errors: The Confusion Matrix

  • True positives and true negatives are when we correctly classify an observation as being positive or negative, respectively.

28

              Prediction = 0         Prediction = 1
Actual = 0    True negative (TN)     False positive (FP)
Actual = 1    False negative (FN)    True positive (TP)

A confusion matrix plots these four quantities for a particular classifier and dataset.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_true, Y_pred)

  • False positives are "false alarms": we predicted 1, but the true class was 0.
  • False negatives are "failed detections": we predicted 0, but the true class was 1.
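As a minimal sketch (Y_true and Y_pred as above), the four counts can be unpacked from the confusion matrix and turned into metrics by hand; for binary labels [0, 1], sklearn's confusion_matrix orders the raveled entries as TN, FP, FN, TP. (Precision and recall are defined on the following slides.)

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were truly positive
recall    = tp / (tp + fn)   # of true positives, how many did we catch
print(accuracy, precision, recall)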

29 of 84

Accuracy, Precision, and Recall

29

              Prediction = 0   Prediction = 1
Actual = 0    TN               FP
Actual = 1    FN               TP

Accuracy: What proportion of points did our classifier classify correctly?
accuracy = (TP + TN) / (TP + FP + FN + TN)

30 of 84

Accuracy, Precision, and Recall

Precision and recall are two commonly used metrics that measure performance even in the presence of class imbalance.

30

              Prediction = 0   Prediction = 1
Actual = 0    TN               FP
Actual = 1    FN               TP

Precision: Of all observations that were predicted to be 1, what proportion were actually 1?
precision = TP / (TP + FP)

  • How accurate is our classifier when it predicts positive?
  • Penalizes false positives.

Accuracy: What proportion of points did our classifier classify correctly?

31 of 84

Accuracy, Precision, and Recall

Precision and recall are two commonly used metrics that measure performance even in the presence of class imbalance.

31

              Prediction = 0   Prediction = 1
Actual = 0    TN               FP
Actual = 1    FN               TP

Accuracy: What proportion of points did our classifier classify correctly?

Precision: Of all observations that were predicted to be 1, what proportion were actually 1?
precision = TP / (TP + FP)

  • How accurate is our classifier when it predicts positive?
  • Penalizes false positives.

Recall: Of all observations that were actually 1, what proportion did we predict to be 1? (Also known as sensitivity.)
recall = TP / (TP + FN)

  • How sensitive is our classifier to positives?
  • Penalizes false negatives.

32 of 84

One of the Most Valuable Graphics on Wikipedia

32

(Precision/recall diagram, adapted from Wikipedia: "relevant" elements are those whose true class is 1; "selected" elements are those predicted to be 1, i.e., positive.)

33 of 84

Back to the Spam

Suppose we’re trying to build a classifier to filter spam emails.

  • Each email is spam (1) or ham (0).

Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.

33

Your friend:

Classify every email as ham (0).

              Prediction = 0   Prediction = 1
Actual = 0    TN: 95           FP: 0
Actual = 1    FN: 5            TP: 0

34 of 84

Back to the Spam

Suppose we’re trying to build a classifier to filter spam emails.

  • Each email is spam (1) or ham (0).

Let’s say we have 100 emails, of which only 5 are truly spam, and the remaining 95 are ham.

34

Your friend:

Classify every email as ham (0).

Your other friend (“Friend 2”):

Classify every email as spam (1).

Friend 1 never predicts positive. Friend 2 has no false negatives, but many false positives!

Friend 2's confusion matrix:

              Prediction = 0   Prediction = 1
Actual = 0    TN: 0            FP: 95
Actual = 1    FN: 0            TP: 5
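A quick worked check of these two confusion matrices against the definitions from the previous slides (precision is undefined for Friend 1, who never predicts positive):

# Friend 1 (always predicts ham): TN=95, FP=0, FN=5, TP=0
acc_1    = (0 + 95) / 100   # 0.95 -- high accuracy
recall_1 = 0 / (0 + 5)      # 0.0  -- catches none of the spam
# precision_1 = 0 / (0 + 0) is undefined: Friend 1 makes no positive predictions.

# Friend 2 (always predicts spam): TN=0, FP=95, FN=0, TP=5
acc_2       = (5 + 0) / 100   # 0.05 -- low accuracy
recall_2    = 5 / (5 + 0)     # 1.0  -- catches all of the spam
precision_2 = 5 / (5 + 95)    # 0.05 -- almost every flag is a false alarm
print(acc_1, recall_1, acc_2, recall_2, precision_2)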

35 of 84

Precision vs. Recall

Precision penalizes false positives, and Recall penalizes false negatives.

35

We can achieve 100% recall by making our classifier output “1”, regardless of the input.

  • Friend 2’s “always predict spam” classifier.
  • We would have no false negatives, but many false positives, and so our precision would be low.

(see extra slides re: the precision-recall curve)

This suggests that there is a tradeoff between precision and recall; they are often inversely related.

  • Ideally, both would be near 100%, but that’s unlikely to happen.

36 of 84

Which Performance Metric?

In many settings, there might be a much higher cost to missing positive cases. For our tumor classifier:

  • We really don't want to miss any malignant tumors (avoid false negatives).
  • We might be fine with classifying benign tumors as malignant (OK to have false positives), since pathologists could do further studies to verify all malignant tumors.
  • This context would prioritize recall.

36

How do we engineer classifiers to meet the performance goals of our problem?

37 of 84

Interlude

Break (2 min)

37

PCA is commonly used in biomedical contexts, which have many named variables!

1. To cluster data (Paper 1, Paper 2)

2. To identify correlated variables (interpret rows of Vᵀ as linear coefficients) (Paper 3). Uses biplots.

38 of 84

Adjusting the Classification Threshold

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

38

39 of 84

Engineering, Part 1: Deciding a Model

39

(Pipeline diagram: observation vector x → model with parameter θ → prediction of the probability that the response = 1 → classify → categorical response.)

Feature Engineering: What are the features x that generate good probabilities for prediction?

40 of 84

Engineering, Part 2: Deciding a Classification Threshold

Classification: What classification threshold T best fits our problem context?

40

(Same pipeline diagram: observation vector x → model with parameter θ → probability that the response = 1 → classify with threshold T → categorical response.)

sklearn's model.predict() uses a fixed threshold of 0.5.

41 of 84

Classification Threshold

The default threshold in sklearn is T = 0.5.

41

42 of 84

Classification Threshold

As we increase the threshold T, we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).

(Plot: the logistic curve with thresholds T = 0.25, T = 0.50, and T = 0.75 marked. All x whose predicted probability exceeds the threshold are predicted 1; raising T leaves fewer positives.)
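A minimal sketch of classifying with a custom threshold T instead of sklearn's fixed 0.5 (model and X assumed from the demo; the helper name is illustrative):

def predict_at_threshold(model, X, T):
    """Predict 1 when P(Y=1 | x) >= T, else 0."""
    return (model.predict_proba(X)[:, 1] >= T).astype(int)

# Raising T "raises the standard": fewer points get predicted positive.
for T in [0.25, 0.50, 0.75]:
    print(T, predict_at_threshold(model, X, T).sum(), "predicted positive")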

43 of 84

Choosing an Accuracy Threshold

See notebook for code snippets.

The choice of threshold T impacts our classification performance.

  • High T: Most predictions are 0. Lots of false negatives.
  • Low T: Most predictions are 1. Lots of false positives.

Do we get max accuracy when T ≈ 0.5? Not always the case…

43

Best T ≈ 0.57, likely due to class imbalance: there are fewer malignant tumors, so we want to be more confident before classifying a tumor as malignant.

(Plot: train accuracy vs. threshold, with the maximum near T ≈ 0.57.)

Demo
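The notebook's sweep can be sketched roughly like this (using the predict_at_threshold helper defined above, with X, Y, model assumed from the demo):

import numpy as np

thresholds = np.linspace(0.01, 0.99, 99)
accs = [np.mean(predict_at_threshold(model, X, T) == Y) for T in thresholds]

best_T = thresholds[np.argmax(accs)]
print(best_T, max(accs))   # for the lecture's tumor data, the best T came out near 0.57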

44 of 84

Tune Thresholds with Cross Validation

See notebook for code snippets.

documentation

The threshold should typically be tuned using cross validation.

44

For a threshold T:

cross_val_acc(T) = (1/k) × [ (accuracy on validation set 1 of a model fit to train set 1) + … + (accuracy on validation set k of a model fit to train set k) ]

The cross-validated accuracy here is maximized near T ≈ 0.56.

Demo
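A minimal sketch of the cross-validated version using KFold (not the notebook's exact code; it assumes X and Y are NumPy arrays and reuses the thresholding idea from above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cross_val_acc(T, X, Y, k=5):
    accs = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(X):
        m = LogisticRegression().fit(X[train_idx], Y[train_idx])
        preds = (m.predict_proba(X[val_idx])[:, 1] >= T).astype(int)
        accs.append(np.mean(preds == Y[val_idx]))
    return np.mean(accs)   # (1/k) * sum of validation accuracies

thresholds = np.linspace(0.01, 0.99, 99)
best_T = max(thresholds, key=lambda T: cross_val_acc(T, X, Y))
print(best_T)   # the lecture's demo landed near T ≈ 0.56

In practice you would fit once per fold and reuse the stored validation probabilities for every threshold, since the fitted model does not depend on T; the sketch above refits for simplicity.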

45 of 84

Choosing a Threshold According to Other Metrics?

The choice of threshold T impacts our classification performance.

  • High T: Most predictions are 0. Lots of false negatives.
  • Low T: Most predictions are 1. Lots of false positives.

Could we choose a threshold T based on metrics that measure false positives/false negatives?

Yes! Two options:

  • Precision-Recall Curve (PR Curve). Covered in extra slides.
  • “Receiver Operating Characteristic” Curve (ROC Curve).

Each of these visualizations has an associated performance metric: AUC (Area Under Curve).

45

Demo

46 of 84

Two More Metrics

46

              Prediction = 0   Prediction = 1
Actual = 0    TN               FP
Actual = 1    FN               TP

True Positive Rate (TPR): "What proportion of spam did I mark correctly?"
TPR = TP / (TP + FN). Same thing as recall; in statistics, sensitivity.

False Positive Rate (FPR): "What proportion of regular email did I mark as spam?"
FPR = FP / (FP + TN). In statistics, 1 − FPR is known as specificity.

The ROC curve plots TPR vs. FPR for different classification thresholds.

Demo

47 of 84

One of the Most Valuable Graphics on Wikipedia, Now With FPR

47

(The precision/recall diagram from before, adapted from Wikipedia, now annotated with FPR: how many irrelevant items are retrieved? We want precision and TPR close to 1.0, and FPR close to 0.0.)

The ROC curve plots TPR vs. FPR for different classification thresholds.

48 of 84

Not the ROC Curve, but Useful to Start with

The choice of threshold T impacts our classification performance.

  • High T: Most predictions are 0. Lots of false negatives.
  • Low T: Most predictions are 1. Lots of false positives.

As we increase T, both TPR and FPR decrease.

  • A decreased TPR is bad (detecting fewer positives).
  • A decreased FPR is good (fewer false positives).

48

Demo

49 of 84

The ROC Curve

The ROC Curve plots this tradeoff.

  • ROC stands for “Receiver Operating Characteristic.” [Wikipedia]
  • We want high TPR, low FPR.

49

🤔

  • Which part of this curve corresponds to T = 0.9?
  • Which part of this curve corresponds to T = 0.1?

Demo

50 of 84

The ROC Curve

The ROC Curve plots this tradeoff.

  • ROC stands for “Receiver Operating Characteristic.” [Wikipedia]
  • We want high TPR, low FPR.

50

(ROC curve with thresholds marked: T = 0.1 near the top right, T = 0.6 in the middle, and T = 0.9 near the bottom left.)

Demo

51 of 84

The Perfect Classifier

The "perfect" classifier is the one that has a TPR of 1 and an FPR of 0.

  • We want our logistic regression model to match that as well as possible.
  • We want our ROC curve to be as close to the “top left” of this graph as possible.

51

Perfect Predictor

Demo

52 of 84

Performance Metric: Area Under Curve (AUC)

The "perfect" classifier is the one that has a TPR of 1 and an FPR of 0.

  • We want our model to match that as well as possible.
  • We want our ROC curve to be as close to the "top left" of this graph as possible.

52

Perfect Predictor

We can compute the area under curve (AUC) of our model.

  • Different AUCs for both ROC curves and PR curves, but ROC is more common.
  • Best possible AUC = 1. Terrible AUC = 0.5.
    • Random predictors have an AUC of around 0.5. Why?
  • Your model’s AUC: somewhere between 0.5 and 1.
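A minimal sketch of computing the ROC curve and its AUC with scikit-learn (model, X, Y assumed from the demo):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

p1 = model.predict_proba(X)[:, 1]          # scores for the positive class
fpr, tpr, thresholds = roc_curve(Y, p1)    # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(Y, p1))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()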

53 of 84

[Extra] What is the “worst” AUC and why is it 0.5?

A random predictor randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1.

Best possible AUC = 1. Terrible AUC = 0.5.

  • Random predictors have an AUC of around 0.5. Why?

53

This slide was added post-lecture to clarify closing comments.

If T = 0.5, on average:

              Prediction = 0   Prediction = 1
Actual = 0    TN = 0.5 n0      FP = 0.5 n0
Actual = 1    FN = 0.5 n1      TP = 0.5 n1

FPR = 0.5 n0 / ((0.5 + 0.5) n0) = 0.5
TPR = 0.5 n1 / ((0.5 + 0.5) n1) = 0.5
Point on ROC curve is (0.5, 0.5).

If T = 0.8, on average:

              Prediction = 0   Prediction = 1
Actual = 0    TN = 0.8 n0      FP = 0.2 n0
Actual = 1    FN = 0.8 n1      TP = 0.2 n1

FPR = 0.2 n0 / ((0.2 + 0.8) n0) = 0.2
TPR = 0.2 n1 / ((0.2 + 0.8) n1) = 0.2
Point on ROC curve is (0.2, 0.2).

If T = 0.3, on average:

              Prediction = 0   Prediction = 1
Actual = 0    TN = 0.3 n0      FP = 0.7 n0
Actual = 1    FN = 0.3 n1      TP = 0.7 n1

FPR = 0.7 n0 / ((0.7 + 0.3) n0) = 0.7
TPR = 0.7 n1 / ((0.7 + 0.3) n1) = 0.7
Point on ROC curve is (0.7, 0.7).

In general, if your dataset has size n1 + n0 (with n1 true class 1's and n0 true class 0's), a uniform random predictor with threshold T lands, on average, at the ROC point (1 − T, 1 − T) — it traces out the diagonal.
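A quick simulation sketch of this argument (the class sizes are made up, not from the lecture): uniformly random scores give an ROC curve hugging the diagonal, so the AUC comes out near 0.5.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n0, n1 = 900, 100                                  # arbitrary class sizes
y = np.concatenate([np.zeros(n0), np.ones(n1)])
random_scores = rng.uniform(0, 1, size=n0 + n1)    # "random predictor" P(Y=1|x)

print(roc_auc_score(y, random_scores))             # ≈ 0.5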

54 of 84

[Extra] What is the “worst” AUC and why is it 0.5?

Best possible AUC = 1. Terrible AUC = 0.5.

  • Random predictors have an AUC of around 0.5. Why?

A random predictor randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1.

54

(Plot: the perfect predictor's ROC curve has AUC = 1.0; the random predictor's diagonal ROC curve has AUC = 0.5, the area of the triangle under the diagonal.)

This slide was added post-lecture to clarify closing comments.

55 of 84

Common techniques for evaluating classifiers

Numerical assessments:

  • Accuracy, precision, recall/TPR, FPR.
  • Area under curve (AUC), for ROC curves.

Visualizations:

  • Confusion matrices.
  • Precision/recall curves.
  • ROC curves.

55

56 of 84

Extra Slides

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

56

57 of 84

[Extra] Detailed Maximum Likelihood Estimation

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

57

Video: link

Out of scope, but useful for understanding where cross-entropy loss comes from.

58 of 84

The Modeling Process

58

1. Choose a model: Linear Regression (regression) vs. Logistic Regression (classification)
2. Choose a loss function: Squared Loss or Absolute Loss vs. Average Cross-Entropy Loss
3. Fit the model: Regularization; sklearn / gradient descent (both models)
4. Evaluate model performance: R², residuals, etc. vs. ?? (next time)

Wherefore use cross-entropy? (Shakespeare, [Wikipedia])

59 of 84

Why Use Cross-Entropy Loss?

This section will not be directly tested, but it explains why we minimize cross-entropy loss for logistic regression.

Two common explanations:

  • [Information Theory] KL Divergence (textbook)
  • [Probability] Maximum Likelihood Estimation (this lecture)

59

60 of 84

Recall the Coin Demo (No-Input Classification)

For training data: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0}

0.4 is the most "intuitive" θ for two reasons:

  1. It is the frequency of heads in our data (4 out of 10).
  2. It maximizes the likelihood of our data, θ^4 (1 − θ)^6.

60

Parameter θ: probability that an IID flip == 1 (Heads). Prediction: 1 or 0.

(Plot: the likelihood of the data — the observed 1's and 0's — as a function of θ.)

How can we generalize this notion of likelihood to any random binary sample?

(proportional to the probability of our data)
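A quick numerical check of this claim (a sketch, not lecture code): evaluate the likelihood θ^4 (1 − θ)^6 on a grid and confirm it peaks at θ = 0.4.

import numpy as np

data = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
thetas = np.linspace(0.001, 0.999, 999)

# Likelihood of the observed flips: theta^(#heads) * (1 - theta)^(#tails).
likelihood = thetas ** data.sum() * (1 - thetas) ** (len(data) - data.sum())

print(thetas[np.argmax(likelihood)])   # ≈ 0.4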

61 of 84

A Compact Representation of the Bernoulli Probability Distribution

How can we generalize this notion of�likelihood to any random binary sample?

61

Let Y be Bernoulli(p). In the long, non-compact form, its probability distribution is P(Y = 1) = p and P(Y = 0) = 1 − p.

It can be written compactly as:

P(Y = y) = p^y (1 − p)^(1 − y)

  • For P(Y = 1), only the p^y term stays (it equals p).
  • For P(Y = 0), only the (1 − p)^(1 − y) term stays (it equals 1 − p).

62 of 84

Generalized Likelihood of Binary Data

How can we generalize this notion of�likelihood to any random binary sample?

62

Recall: let Y be Bernoulli(p), so that P(Y = y) = p^y (1 − p)^(1 − y) — for P(Y = 1) only the p^y term stays, and for P(Y = 0) only the (1 − p)^(1 − y) term stays.

If binary data are IID with the same probability p, then the likelihood of the data y₁, …, yₙ is:

∏ᵢ p^{yᵢ} (1 − p)^{1 − yᵢ}

Ex: {0, 0, 1, 1, 1, 1, 0, 0, 0, 0} → p^4 (1 − p)^6

(Strictly, this is a likelihood of the parameter p rather than a probability of the data, but it is proportional to that probability.)

63 of 84

Generalized Likelihood of Binary Data

How can we generalize this notion of�likelihood to any random binary sample?

63

(Spoiler: for logistic regression, pᵢ = σ(xᵢᵀθ).)

Recall: if Y is Bernoulli(p), then P(Y = y) = p^y (1 − p)^(1 − y) — for P(Y = 1) only the p^y term stays, and for P(Y = 0) only the (1 − p)^(1 − y) term stays. If binary data are IID with the same probability p, the likelihood of the data is ∏ᵢ p^{yᵢ} (1 − p)^{1 − yᵢ}.

If binary data are independent, but each with its own probability pᵢ, then the likelihood of the data is:

∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}

64 of 84

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We'd like to estimate p₁, …, pₙ.

Find p₁, …, pₙ that maximize ∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}

64

65 of 84

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We'd like to estimate p₁, …, pₙ.

Find p₁, …, pₙ that maximize ∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}

Equivalent, simplifying optimization problems:

65

maximize Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]

(log is an increasing function: if a > b, then log(a) > log(b).)

66 of 84

Maximum Likelihood Estimation (MLE)

Our maximum likelihood estimation problem:

  • For i = 1, 2, …, n, let Yᵢ be independent Bernoulli(pᵢ). Observe data y₁, …, yₙ.
  • We'd like to estimate p₁, …, pₙ.

Find p₁, …, pₙ that maximize ∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}

Equivalent, simplifying optimization problems:

66

maximize Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]

(log is an increasing function: if a > b, then log(a) > log(b).)

minimize −(1/n) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]

Argmax property: the x that maximizes f(x) minimizes −f(x), and dividing by the positive constant n does not change the minimizer.

67 of 84

Maximizing Likelihood == Minimizing Average Cross-Entropy

67

argmax over p of ∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}
= argmin over p of −(1/n) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]   ← Average Cross-Entropy Loss!!

(Log is increasing; max/min properties.)

For logistic regression, let pᵢ = σ(xᵢᵀθ):

θ̂ = argmin_θ −(1/n) Σᵢ [ yᵢ log σ(xᵢᵀθ) + (1 − yᵢ) log(1 − σ(xᵢᵀθ)) ]

Average Cross-Entropy Loss for Logistic Regression!!

🎉

68 of 84

[High-Level] Maximum Likelihood Estimation

68

Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data:

θ̂ = argmin_θ −(1/n) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ] = argmax_θ ∏ᵢ pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}

For logistic regression, let pᵢ = σ(xᵢᵀθ): the probability that the i-th response is yᵢ is pᵢ^{yᵢ} (1 − pᵢ)^{1 − yᵢ}.

Assumption: all data are independent Bernoulli random variables.


70 of 84

[High-Level] Maximum Likelihood Estimation

70

Recall: minimizing average cross-entropy loss is equivalent to maximizing the likelihood of the training data, under the assumption that all data are independent Bernoulli random variables with pᵢ = σ(xᵢᵀθ). The optimal theta pushes each pᵢ toward the true class: pᵢ → 1 when yᵢ = 1, and pᵢ → 0 when yᵢ = 0.

It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE.

  • OLS, Ridge Regression, etc.
  • You will study MLE further in probability and ML classes. But now you know it exists.

71 of 84

We Did it!

71

1. Choose a model: Linear Regression (regression) vs. Logistic Regression (classification)
2. Choose a loss function: Squared Loss or Absolute Loss vs. Average Cross-Entropy Loss
3. Fit the model: Regularization; sklearn / gradient descent (both models)
4. Evaluate model performance: R², residuals, etc. vs. ?? (next time)

Shakespeare [Wikipedia]:

That which we call a rose would by any other name smell as sweet.

72 of 84

[Extra] Gradient Descent for Logistic Regression

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

72

Reference slides. Out of scope.

73 of 84

[Extra] Gradient Descent for Logistic Regression

73

1. Choose a model: Linear Regression (regression) vs. Logistic Regression (classification)
2. Choose a loss function: Squared Loss or Absolute Loss vs. Average Cross-Entropy Loss
3. Fit the model: Regularization; sklearn / gradient descent (both models)
4. Evaluate model performance: R², residuals, etc. vs. Accuracy, Precision, Recall, ROC Curves

74 of 84

Simplifying Average Cross-Entropy Loss

74

Slides from Spring 2020.

75 of 84

Gradient of Average Cross-Entropy Loss

75

Slides from Spring 2020.

76 of 84

Gradient Descent Algorithms

76

Slides from Spring 2020.
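The Spring 2020 slides are not reproduced here, so the following is only a rough sketch of batch gradient descent on the average cross-entropy loss, using the standard gradient (1/n) · Xᵀ(σ(Xθ) − y); the step size, iteration count, and names are illustrative choices, not the original slides' values.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_gradient(theta, X, y):
    # Gradient of the average cross-entropy loss: (1/n) X^T (sigmoid(X @ theta) - y).
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def gradient_descent(X, y, alpha=0.5, num_iters=5000):
    # Plain batch gradient descent starting from theta = 0.
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - alpha * cross_entropy_gradient(theta, X, y)
    return theta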

77 of 84

[Extra] Precision-Recall Curves

Lecture 22, Data 100 Spring 2022

Logistic Regression Model, continued

  • sklearn demo
  • Maximum Likelihood Estimation: high-level (live), detailed (recorded)

Linear separability and Regularization

Performance Metrics

  • Accuracy
  • Imbalanced Data, Precision, Recall

Adjusting the Classification Threshold

  • A case study
  • ROC curves, and AUC

[Extra] Detailed MLE, Gradient Descent, PR curves

77

Reference slides. Out of scope.

78 of 84

Precision vs. threshold

As we increase our threshold, we have fewer and fewer false positives.

  • Thus, precision tends to increase.

78

It is possible for precision to decrease slightly with an increased threshold. Why?

79 of 84

Recall vs. threshold

As we increase our threshold, we have more and more false negatives.

  • Thus, recall tends to decrease.

79

Recall strictly decreases as we increase our threshold. Why?

80 of 84

Precision and Recall vs. Threshold

80

81 of 84

Precision-recall curves

We can also plot precision vs. recall, for all possible thresholds.

81

🤔

  • Which part of this curve corresponds to T = 0.9?
  • Which part of this curve corresponds to T = 0.1?

82 of 84

Precision-recall curves

We can also plot precision vs. recall, for all possible thresholds.

Answer:

  • Threshold decreases from the top left to the bottom right.
  • In the notebook, there’s an interactive version of this plot.

82

(PR curve with thresholds marked: T = 0.9 near the top left, T = 0.6 in the middle, and T = 0.1 near the bottom right; the threshold decreases along the curve from top left to bottom right.)

83 of 84

Precision-recall curves

The “perfect classifier” is one with precision of 1 and recall of 1.

  • We want our PR curve to be as close to the “top right” of this graph as possible.
  • One way to compare our model is to compute its area under curve (AUC).
    • The area under the “optimal PR curve” is 1.
    • More commonly, we look at the area under ROC curve.

83

Perfect Predictor
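A minimal sketch of computing the PR curve and its AUC with scikit-learn (model, X, Y assumed from the demo):

from sklearn.metrics import precision_recall_curve, auc

p1 = model.predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(Y, p1)

# Area under the PR curve (recall on the x-axis, precision on the y-axis).
print(auc(recall, precision))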

84 of 84

Logistic Regression II

Content credit: Lisa Yan, Suraj Rampure, Ani Adhikari, Josh Hug, Joseph Gonzalez

84

LECTURE 22