1 of 37

Logistic Regression

classification

2 of 37

Logistic regression – classification – ML

  • Classification ≠ Regression
  • Logistic Regression
    • the logit function
  • Confusion Matrix
  • AUC – F1 score
  • logistic regression with scikit-learn
  • categorical predictors
    • dummy encoding
    • binary encoding
  • Imbalanced Datasets
  • Multiclass Classification
    • one vs rest
    • one vs one

3 of 37

The output of linear regression is continuous

If we use linear regression to classify animals (cat = 0, rabbit = 1, dog = 2), we need arbitrary thresholds that do not reflect reality.

Binary classification: the target variable is categorical

Why would there be an arbitrary order between animals?

Does

    • cats < rabbits < dogs

make more sense than

    • dogs < cats < rabbits?

Furthermore, the output range of linear regression is not constrained or capped: ŷ can take any value in ℝ.

Can't use linear regression for classification

4 of 37

The main idea behind binary classification with logistic regression

Goal: Binary Classification: 0 / 1

  1. Use linear regression
  2. Constrain the estimated values ŷ to the interval [0, 1]
  3. Interpret the result as a probability P of belonging to one of the categories (0 or 1)
  4. Define a threshold τ = 0.5
  5. Classify with the rule:
    • if P < τ, predict category 0
    • otherwise (P ≥ τ), predict category 1

Logistic regression

5 of 37

Logistic function

The logistic (sigmoid) function σ(x) = 1 / (1 + e^(−x)) maps ℝ to [0, 1].
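A minimal sketch of this mapping and of the thresholding rule from the previous slide, using numpy (the input values are illustrative):

import numpy as np

def sigmoid(z):
    # maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)                           # probabilities, roughly [0.007, 0.269, 0.5, 0.731, 0.993]
predicted_class = (p > 0.5).astype(int)  # threshold tau = 0.5 -> [0, 0, 0, 1, 1]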

6 of 37

Logistic regression vs linear regression

7 of 37

ŷ = a_1 x_1 + a_0 (2D)

ŷ = a_1 x_1 + a_2 x_2 + a_0 (3D)

Find the best hyperplane that separates the data

8 of 37

import statsmodels.formula.api as smf

import pandas as pd

# Load the dataset

df = pd.read_csv('credit_default_sampled.csv')

# Instantiate the model

model = smf.logit('default ~ income + balance', data = df)

# Fit the model to the data

results = model.fit()

# Results

results.summary()

Predictors:

    • continuous: balance, income
    • binary: student

Credit default dataset

Target Variable

  • default indicates that a person defaulted on his/her payments
  • default is a binary variable that takes values 0 or 1
    • No (0): 500 samples
    • Yes (1): 333 samples

9 of 37

default ~ income + balance

In the summary output, the R-squared, F-statistic, and t-statistics of linear regression are replaced by their logistic counterparts (pseudo R-squared, LLR p-value, z-statistics).

Logistic regression result

10 of 37

The histogram of estimated values provides a good indication of the model's separation power.

Poor model: the predicted probabilities overlap heavily (very confused)

Excellent model: clear separation between the two classes

y_proba = results.predict(df[['income', 'balance']])

Probabilities histogram
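A possible way to draw this histogram with matplotlib, split by true class (a sketch assuming the results and df objects from the previous slides):

import matplotlib.pyplot as plt

y_proba = results.predict(df[['income', 'balance']])

# one histogram per true class: well-separated peaks indicate good separation power
plt.hist(y_proba[df['default'] == 0], bins=30, alpha=0.5, label='non-default (0)')
plt.hist(y_proba[df['default'] == 1], bins=30, alpha=0.5, label='default (1)')
plt.xlabel('predicted probability of default')
plt.legend()
plt.show()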

11 of 37

# output of the model predictions as probabilities, yhat in [0,1]

y_proba = results.predict(df[['income', 'balance']])

# transform the probabilities into a class

predicted_class = (y_proba > 0.5).astype(int)

print(predicted_class)

> [1,1,1,1...,0,0,0]

Predicted classes as a function of the predicted probabilities

Note that the choice of the classification threshold (0.5) remains arbitrary.

12 of 37

4 possible cases

2 correct ones:

  • 1 classified as 1
  • 0 classified as 0

2 incorrect ones:

  • 1 classified as 0
  • 0 classified as 1

 

                     Predicted 1         Predicted 0
Actual 1             True Positives      False Negatives
Actual 0             False Positives     True Negatives

results.pred_table()

confusion matrix
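For comparison, the same matrix can be obtained with scikit-learn (a sketch assuming y_proba and df from the previous slides; note that scikit-learn puts actual 0 in the first row):

from sklearn.metrics import confusion_matrix

predicted_class = (y_proba > 0.5).astype(int)

# rows = actual class (0, then 1); columns = predicted class (0, then 1)
cm = confusion_matrix(df['default'], predicted_class)
# [[TN, FP],
#  [FN, TP]]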

13 of 37

 

                     Predicted Default   Predicted Non-Default
Actual Default       286                 47
Actual Non-Default   40                  460

Out of 333 default samples:

  • 286 were correctly predicted as default by our model (True positives)
  • 47 were wrongly predicted as non-default by our model (False Negatives)

And out of 500 non-default samples,

  • 460 were correctly predicted as non-default by our model (True Negatives)
  • 40 were wrongly predicted as default by our model (False Positives)

Confusion matrix: default ~ income + balance

14 of 37

 

                     Predicted Default   Predicted Non-Default
Actual Default       286                 47
Actual Non-Default   40                  460

  • number of samples correctly classified: 460 + 286 = 746
  • total number of samples: 833
  • so our accuracy is 746 / 833 = 89.56%

  • TPR = 286 / 333 = 85.89%
  • FPR = 40 / 500 = 8.00%

classification metrics

We can define multiple metrics using the confusion matrix

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • True Positive Rate (TPR) = TP / (TP + FN)
  • False Positive Rate (FPR) = FP / (FP + TN)

 

                     Predicted 1         Predicted 0
Actual 1             True Positives      False Negatives
Actual 0             False Positives     True Negatives
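A small sketch computing these metrics from the confusion matrix counts of the credit default example above:

# counts at threshold 0.5 (credit default example)
TP, FN = 286, 47    # actual default
FP, TN = 40, 460    # actual non-default

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 746 / 833 ≈ 0.8956
tpr = TP / (TP + FN)                         # 286 / 333 ≈ 0.8589
fpr = FP / (FP + TN)                         # 40 / 500 = 0.08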

15 of 37

Confusion matrix (the name is fitting)

from Wikipedia Confusion Matrix page

16 of 37

Your turn – default ~ income + balance + student

Build the model: default ~ income + balance + student

  1. Instantiate the smf.logit model
  2. Fit the model
  3. Interpret the results
  4. Calculate the confusion matrix
  5. Calculate accuracy, TPR, FPR
  6. Plot the histogram of probabilities

17 of 37

What happens if we use a different classification threshold?

The confusion matrix and associated metrics also change.

t = 0.75

                     Predicted Default   Predicted Non-Default
Actual Default       243                 90
Actual Non-Default   19                  481

What about the threshold?

t = 0.5

                     Predicted Default   Predicted Non-Default
Actual Default       286                 47
Actual Non-Default   40                  460

Acc (0.75) = (243 + 481) / 833 = 86.91%

TPR = 243 / 333 = 72.97%

FPR = 19 / 500 = 3.80%

Acc (0.5) = 746 / 833 = 89.56%

TPR = 286 / 333 = 85.89%

FPR = 40 / 500 = 8.00%
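A sketch of how these numbers can be recomputed for any threshold (assuming y_proba and df from the earlier slides):

import numpy as np

for tau in (0.5, 0.75):
    pred = (y_proba > tau).astype(int)
    actual = df['default']
    tp = np.sum((pred == 1) & (actual == 1))
    fn = np.sum((pred == 0) & (actual == 1))
    fp = np.sum((pred == 1) & (actual == 0))
    tn = np.sum((pred == 0) & (actual == 0))
    # accuracy, TPR, FPR for this threshold
    print(tau, (tp + tn) / len(actual), tp / (tp + fn), fp / (fp + tn))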

18 of 37

Plotting the TPR vs. FPR curve while varying the threshold from 0 to 1 gives the ROC curve.

The area under the curve (AUC) is a more robust metric than accuracy.

yhat = results.predict(df)

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(df['default'], yhat)

from sklearn.metrics import roc_auc_score

score = roc_auc_score(df['default'], yhat)
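The curve itself can then be plotted from the fpr and tpr arrays above (a sketch with matplotlib):

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label='AUC = %.3f' % score)
plt.plot([0, 1], [0, 1], linestyle='--', label='random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()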

ROC-AUC


19 of 37

Your turn – default ~ income + balance + student

For the model default ~ income + balance + student:

- Build the model

- Interpret the results

- Compute the confusion matrix

- Calculate accuracy, TPR, and FPR

- Compute AUC

- Plot the ROC curve

Compare with the model default ~ income.

20 of 37

Imbalanced datasets

21 of 37

The accuracy paradox

The original credit default dataset contains:

  • 333 default cases
  • 9667 non-default cases

A (stupid, useless) model that always predicts non-default has an accuracy of 96.67%.

The target class is a small minority: the dataset is imbalanced.

Four strategies to handle class imbalance:

  • Undersampling the majority class
  • Oversampling the minority class
  • Combination of over- and undersampling
  • SMOTE

22 of 37

Balance the classes

Strategies to address the imbalance of the minority class:

  • undersample the majority class
  • oversample the minority class

We can also

  • combine the two approaches: over- and undersampling

The goal is not to balance the classes perfectly at 50/50.
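A minimal undersampling sketch with pandas, complementing the oversampling snippet on the next slide (the sample size 1000 is illustrative):

import pandas as pd

df = pd.read_csv('credit_default.csv')

# keep only a subset of the majority class (non-default)
majority = df[df.default == 0].sample(n = 1000)
minority = df[df.default == 1]

data = pd.concat([majority, minority]).sample(frac = 1)   # concatenate and shuffle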

23 of 37

Your turn – imbalanced dataset

# Oversample the minority class (default == 1)

df = pd.read_csv('credit_default.csv')

minority = df[df.default == 1].sample(n = 2000, replace = True)

majority = df[df.default == 0]

data = pd.concat([minority, majority])

# shuffle

data = data.sample(frac = 1)

On the full credit default dataset:

  • Build the model default ~ income + balance + student.
  • Compute accuracy and AUC, and plot the ROC curve.

Then:

  • Oversample the minority class.
  • Undersample the majority class.
  • Vary the ratios.

24 of 37

SMOTE

SMOTE generates synthetic minority-class samples by linear interpolation between existing minority-class samples and their nearest neighbors.

The SMOTE algorithm creates an artificial sample in three steps (see the sketch below):

  1. Select a random sample from the minority class.
  2. Select one of its K nearest minority-class neighbors.
  3. Create the new sample by random linear interpolation between the two.
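A tiny numpy sketch of the interpolation step only (illustrative: the neighbor is assumed to be already found by a k-NN search among minority samples):

import numpy as np

rng = np.random.default_rng(0)

x = np.array([2.0, 50000.0])          # a minority-class sample (two features)
neighbor = np.array([2.5, 52000.0])   # one of its K nearest minority-class neighbors

# new synthetic sample drawn on the segment between the two points
lam = rng.uniform(0, 1)
x_new = x + lam * (neighbor - x)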

25 of 37

SMOTE does not work!

26 of 37

Imbalanced learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning http://imbalanced-learn.org
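A minimal usage sketch of SMOTE with imbalanced-learn, assuming X and y hold the predictors and the target of the credit default data:

import pandas as pd
from imblearn.over_sampling import SMOTE

X = df[['income', 'balance']]
y = df['default']

# generate synthetic minority samples until the classes are balanced
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(pd.Series(y_res).value_counts())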

27 of 37

Encoding Categorical variables

28 of 37

Encoding Categorical variables

One-hot encoding – Dummy encoding

How to convert categorical variables into numerical variables?

Binary:

  • Yes/No
  • 1/0
  • Male/Female
  • Spam/Legit
  • Action: Buy, Save, Login

Multinomial, non-ordinal:

  • List of cities, countries, destinations, age groups
  • Education level
  • Car brands

Examples

Car brand: Audi, Renault, Ford, Fiat

If an arbitrary number is assigned to each brand, a hierarchy is created:

  • Audi → 1, Renault → 2, Ford → 3, Fiat → 4

Similarly:

  • Dog, cat, mouse, chicken → {1,2,3,4}
  • Why would a chicken be worth four times a dog?

Sometimes, assigning a number to each category makes sense—ordered categories:

  • Child, young, adult, elderly → {1,2,3,4}
  • Negative, neutral, positive → {-1, 0, 1}

29 of 37

origin_variables = pd.get_dummies(df.origin)

df = df.merge(origin_variables, left_index=True, right_index=True)

results = smf.ols('mpg ~ Japanese + European', data = df).fit()

Categorical predictors – dummy encoding

# Try this

results = smf.ols('mpg ~ Japanese + European + American', data = df).fit()

Load auto-mpg

Origin (3) and name (many) are unordered (non-ordinal) categories. How to include origin as a predictor in a linear model?

With Pandas, create one variable per category:

  • American: 0 or 1
  • European: 0 or 1
  • Japanese: 0 or 1

N-1 new variables are needed for N categories.

pd.get_dummies() creates one variable per category; keep only N-1 of them (or pass drop_first=True). Then, define the model: mpg ~ Japanese + European

30 of 37

Statsmodels encodes categorical variables directly with

mpg ~ C(origin)
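A minimal sketch, assuming the auto-mpg data is already loaded in df with an origin column holding the category labels:

import statsmodels.formula.api as smf

# C() dummy-encodes origin and drops one reference category automatically
results = smf.ols('mpg ~ C(origin)', data = df).fit()
results.summary()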

Categorical predictors – statsmodels

31 of 37

df[['mpg','origin']].groupby(by = 'origin').mean().reset_index()

The mean per category

  • Intercept = mpg_American
  • origin[T.European] = mpg_European − mpg_American
  • origin[T.Japanese] = mpg_Japanese − mpg_American

mpg ~ C(origin)

Interpreting the coefficients – categories

32 of 37

Binary encoding!

import category_encoders as ce

# define the encoder

encoder = ce.BinaryEncoder(cols=['brand'])

df = encoder.fit_transform(df)

Instead of

  • mpg ~ brand

We use the model

  • mpg ~ brand_0 + brand_1 + … + brand_5

Categorical predictors

How to include car brand in auto-mpg?

There are 36 categories.

Some categories have few samples.

33 of 37

Binary encoding

Number of binary variables = floor( log2(number of categories) ) + 1
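A quick check of this formula for the 36 car brands of the previous slide:

import math

n_categories = 36
n_binary_variables = math.floor(math.log2(n_categories)) + 1   # 5 + 1 = 6 (brand_0 … brand_5)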

34 of 37

Category encoders - the library

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.

35 of 37

Multiclass Classification - Multinomial

36 of 37

Multinomial classification

One vs One

For N categories, one binary model is built for each pair of classes: N(N − 1)/2 models.

The final prediction is obtained by voting (or by averaging the scores).

37 of 37

Multinomial classification

One vs Rest

N models are needed: one per class, each class against all the others.

Beware of error propagation from the individual binary models to the final decision.
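Both strategies are available as wrappers in scikit-learn (a sketch, assuming X and y hold a multiclass dataset):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# one binary logistic regression per class (N models)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# one binary logistic regression per pair of classes (N(N-1)/2 models), combined by voting
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)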