1 of 37

Logistic Regression

classification

2 of 37

Logistic regression – classification – ML

  • Classification ≠ Regression
  • Logistic Regression
    • the logit function
  • Confusion Matrix
  • AUC – F1 score
  • logistic regression with scikit-learn
  • categorical predictors
    • dummy encoding
    • binary encoding
  • Imbalanced Datasets
  • Multiclass Classification
    • one vs rest
    • one vs one

3 of 37

The output of linear regression is continuous

If we use linear regression to classify animals (cat = 0, rabbit = 1, dog = 2), we need arbitrary thresholds that do not reflect reality.

Binary classification: the target variable is categorical

Why would there be an arbitrary order between animals?

Does

    • cats < rabbits < dogs

make more sense than

    • dogs < cats < rabbits?

Furthermore, the output range of linear regression is not constrained or capped: ŷ can take any value in ℝ.

Can't use linear regression for classification

4 of 37

The main idea behind binary classification with logistic regression

Goal: Binary Classification: 0 / 1

  1. Use linear regression
  2. Constrain the estimated values ŷ to the interval [0, 1]
  3. Interpret the result as a probability P of belonging to one of the categories (0 or 1)
  4. Define a threshold τ = 0.5
  5. Classify with the rule:
    • if P < τ, predict category 0
    • otherwise (P ≥ τ), predict category 1

Logistic regression

5 of 37

Logistic function

The logistic (sigmoid) function σ(x) = 1 / (1 + e^(−x)) maps ℝ to [0, 1].
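A minimal sketch of this mapping and of the thresholding rule from the previous slide, using numpy (the input values are illustrative):

import numpy as np

def sigmoid(z):
    # maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)                           # probabilities, roughly [0.007, 0.269, 0.5, 0.731, 0.993]
predicted_class = (p > 0.5).astype(int)  # threshold tau = 0.5 -> [0, 0, 0, 1, 1]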

6 of 37

Logistic regression vs linear regression

7 of 37

ŷ = a_1 x_1 + a_0 (2D)

ŷ = a_1 x_1 + a_2 x_2 + a_0 (3D)

Find the best hyperplane that separates the data

8 of 37

import statsmodels.formula.api as smf

import pandas as pd

# Load the dataset

df = pd.read_csv('credit_default_sampled.csv')

# Instantiate the model

model = smf.logit('default ~ income + balance', data = df)

# Fit the model to the data

results = model.fit()

# Results

results.summary()

Predictors:

    • continuous: balance, income
    • binary: student

Credit default dataset

Target Variable

  • default indicates that a person defaulted on his/her payments
  • default is a binary variable that takes values 0 or 1
    • No (0): 500 samples
    • Yes (1): 333 samples

9 of 37

default ~ income + balance

In the summary output, the R-squared, F-statistic, and t-statistics of linear regression are replaced by their logistic counterparts (pseudo R-squared, LLR p-value, z-statistics).

Logistic regression result

10 of 37

The histogram of estimated values provides a good indication of the model's separation power.

Poor model: the predicted probabilities overlap heavily (very confused)

Excellent model: clear separation between the two classes

y_proba = results.predict(df[['income', 'balance']])

Probabilities histogram
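A possible way to draw this histogram with matplotlib, split by true class (a sketch assuming the results and df objects from the previous slides):

import matplotlib.pyplot as plt

y_proba = results.predict(df[['income', 'balance']])

# one histogram per true class: well-separated peaks indicate good separation power
plt.hist(y_proba[df['default'] == 0], bins=30, alpha=0.5, label='non-default (0)')
plt.hist(y_proba[df['default'] == 1], bins=30, alpha=0.5, label='default (1)')
plt.xlabel('predicted probability of default')
plt.legend()
plt.show()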

11 of 37

# output of the model predictions as probabilities, yhat in [0,1]

y_proba = results.predict(df[['income', 'balance']])

# transform the probabilities into a class

predicted_class = (y_proba > 0.5).astype(int)

print(predicted_class)

> [1,1,1,1...,0,0,0]

Predicted classes as a function of the predicted probabilities

Note that the choice of the classification threshold (0.5) remains arbitrary.

12 of 37

4 possible cases

2 correct ones:

  • 1 classified as 1
  • 0 classified as 0

2 incorrect ones:

  • 1 classified as 0
  • 0 classified as 1

 

                     Predicted 1         Predicted 0
Actual 1             True Positives      False Negatives
Actual 0             False Positives     True Negatives

results.pred_table()

confusion matrix
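For comparison, the same matrix can be obtained with scikit-learn (a sketch assuming y_proba and df from the previous slides; note that scikit-learn puts actual 0 in the first row):

from sklearn.metrics import confusion_matrix

predicted_class = (y_proba > 0.5).astype(int)

# rows = actual class (0, then 1); columns = predicted class (0, then 1)
cm = confusion_matrix(df['default'], predicted_class)
# [[TN, FP],
#  [FN, TP]]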

13 of 37

 

                     Predicted Default   Predicted Non-Default
Actual Default       286                 47
Actual Non-Default   40                  460

Out of 333 default samples:

  • 286 were correctly predicted as default by our model (True positives)
  • 47 were wrongly predicted as non-default by our model (False Negatives)

And out of 500 non-default samples,

  • 460 were correctly predicted as non-default by our model (True Negatives)
  • 40 were wrongly predicted as default by our model (False Positives)

Confusion matrix: default ~ income + balance

14 of 37

 

                     Predicted Default   Predicted Non-Default
Actual Default       286                 47
Actual Non-Default   40                  460

  • number of samples correctly classified: 460 + 286 = 746
  • total number of samples: 833
  • so our accuracy is 746 / 833 = 89.56%

  • TPR = 286 / 333 = 85.89%
  • FPR = 40 / 500 = 8.00%

classification metrics

We can define multiple metrics using the confusion matrix

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • True Positive Rate (TPR) = TP / (TP + FN)
  • False Positive Rate (FPR) = FP / (FP + TN)

 

                     Predicted 1         Predicted 0
Actual 1             True Positives      False Negatives
Actual 0             False Positives     True Negatives
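A small sketch computing these metrics from the confusion matrix counts of the credit default example above:

# counts at threshold 0.5 (credit default example)
TP, FN = 286, 47    # actual default
FP, TN = 40, 460    # actual non-default

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 746 / 833 ≈ 0.8956
tpr = TP / (TP + FN)                         # 286 / 333 ≈ 0.8589
fpr = FP / (FP + TN)                         # 40 / 500 = 0.08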

15 of 37

Confusion matrix (the name is fitting)

from Wikipedia Confusion Matrix page

16 of 37

Your turn – default ~ income + balance + student

Build the model: default ~ income + balance + student

  1. Instantiate the smf.logit model
  2. Fit the model
  3. Interpret the results
  4. Calculate the confusion matrix
  5. Calculate accuracy, TPR, FPR
  6. Plot the histogram of probabilities

17 of 37

What happens if we use a different classification threshold?

The confusion matrix and associated metrics also change.

t = 0.75

                     Predicted Default   Predicted Non-Default
Actual Default       243                 90
Actual Non-Default   19                  481

What about the threshold?

t = 0.5

                     Predicted Default   Predicted Non-Default
Actual Default       286                 47
Actual Non-Default   40                  460

Acc (0.75) = (243 + 481) / 833 = 86.91%

TPR = 243 / 333 = 72.97%

FPR = 19 / 500 = 3.80%

Acc (0.5) = 746 / 833 = 89.56%

TPR = 286 / 333 = 85.89%

FPR = 40 / 500 = 8.00%
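A sketch of how these numbers can be recomputed for any threshold (assuming y_proba and df from the earlier slides):

import numpy as np

for tau in (0.5, 0.75):
    pred = (y_proba > tau).astype(int)
    actual = df['default']
    tp = np.sum((pred == 1) & (actual == 1))
    fn = np.sum((pred == 0) & (actual == 1))
    fp = np.sum((pred == 1) & (actual == 0))
    tn = np.sum((pred == 0) & (actual == 0))
    # accuracy, TPR, FPR for this threshold
    print(tau, (tp + tn) / len(actual), tp / (tp + fn), fp / (fp + tn))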

18 of 37

Plotting the TPR vs. FPR curve while varying the threshold from 0 to 1 gives the ROC curve.

The area under the curve (AUC) is a more robust metric than accuracy.

yhat = results.predict(df)

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(df['default'], yhat)

from sklearn.metrics import roc_auc_score

score = roc_auc_score(df['default'], yhat)
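The curve itself can then be plotted from the fpr and tpr arrays above (a sketch with matplotlib):

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label='AUC = %.3f' % score)
plt.plot([0, 1], [0, 1], linestyle='--', label='random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()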

ROC-AUC


19 of 37

Your turn – default ~ income + balance + student

For the model default ~ income + balance + student:

- Build the model

- Interpret the results

- Compute the confusion matrix

- Calculate accuracy, TPR, and FPR

- Compute AUC

- Plot the ROC curve

Compare with the model default ~ income.

20 of 37

Imbalanced datasets

21 of 37

The accuracy paradox

The original credit default dataset contains:

  • 333 default cases
  • 9667 non-default cases

A (stupid, useless) model that always predicts non-default has an accuracy of 96.67%.

The target class is a small minority: the dataset is imbalanced.

Four strategies to handle class imbalance:

  • Undersampling the majority class
  • Oversampling the minority class
  • Combination of over- and undersampling
  • SMOTE

22 of 37

Balance the classes

Strategies to address the imbalance of the minority class:

  • undersample the majority class
  • oversample the minority class

We can also

  • combine the two approaches: over- and undersampling

The goal is not to balance the classes perfectly at 50/50.
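A minimal undersampling sketch with pandas, complementing the oversampling snippet on the next slide (the sample size 1000 is illustrative):

import pandas as pd

df = pd.read_csv('credit_default.csv')

# keep only a subset of the majority class (non-default)
majority = df[df.default == 0].sample(n = 1000)
minority = df[df.default == 1]

data = pd.concat([majority, minority]).sample(frac = 1)   # concatenate and shuffle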

23 of 37

Your turn – imbalanced dataset

# Oversample the minority class (default == 1)

df = pd.read_csv('credit_default.csv')

minority = df[df.default == 1].sample(n = 2000, replace = True)

majority = df[df.default == 0]

data = pd.concat([minority, majority])

# shuffle

data = data.sample(frac = 1)

On the full credit default dataset:

  • Build the model default ~ income + balance + student.
  • Compute accuracy and AUC, and plot the ROC curve.

Then:

  • Oversample the minority class.
  • Undersample the majority class.
  • Vary the ratios.

24 of 37

SMOTE

SMOTE generates synthetic minority-class samples by linear interpolation between existing minority-class samples and their nearest neighbors.

The SMOTE algorithm creates an artificial sample in three steps (see the sketch below):

  1. Select a random sample from the minority class.
  2. Select one of its K nearest minority-class neighbors.
  3. Create the new sample by random linear interpolation between the two.
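A tiny numpy sketch of the interpolation step only (illustrative: the neighbor is assumed to be already found by a k-NN search among minority samples):

import numpy as np

rng = np.random.default_rng(0)

x = np.array([2.0, 50000.0])          # a minority-class sample (two features)
neighbor = np.array([2.5, 52000.0])   # one of its K nearest minority-class neighbors

# new synthetic sample drawn on the segment between the two points
lam = rng.uniform(0, 1)
x_new = x + lam * (neighbor - x)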

25 of 37

SMOTE does not work!

26 of 37

Imbalanced learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning http://imbalanced-learn.org
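A minimal usage sketch of SMOTE with imbalanced-learn, assuming X and y hold the predictors and the target of the credit default data:

import pandas as pd
from imblearn.over_sampling import SMOTE

X = df[['income', 'balance']]
y = df['default']

# generate synthetic minority samples until the classes are balanced
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(pd.Series(y_res).value_counts())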

27 of 37

Encoding Categorical variables

28 of 37

Encoding Categorical variables

One-hot encoding – Dummy encoding

How to convert categorical variables into numerical variables?

Binary:

  • Yes/No
  • 1/0
  • Male/Female
  • Spam/Legit
  • Action: Buy, Save, Login

Multinomial, non-ordinal:

  • List of cities, countries, destinations, age groups
  • Education level
  • Car brands

Examples

Car brand: Audi, Renault, Ford, Fiat

If an arbitrary number is assigned to each brand, a hierarchy is created:

  • Audi → 1, Renault → 2, Ford → 3, Fiat → 4

Similarly:

  • Dog, cat, mouse, chicken → {1,2,3,4}
  • Why would a chicken be worth four times a dog?

Sometimes, assigning a number to each category makes sense—ordered categories:

  • Child, young, adult, elderly → {1,2,3,4}
  • Negative, neutral, positive → {-1, 0, 1}

29 of 37

origin_variables = pd.get_dummies(df.origin)

df = df.merge(origin_variables, left_index=True, right_index=True)

results = smf.ols('mpg ~ Japanese + European', data = df).fit()

Categorical predictors – dummy encoding

# Try this

results = smf.ols('mpg ~ Japanese + European + American', data = df).fit()

Load auto-mpg

Origin (3) and name (many) are unordered (non-ordinal) categories. How to include origin as a predictor in a linear model?

With Pandas, create one variable per category:

  • American: 0 or 1
  • European: 0 or 1
  • Japanese: 0 or 1

N-1 new variables are needed for N categories.

pd.get_dummies() creates one variable per category; keep only N-1 of them (or pass drop_first=True). Then, define the model: mpg ~ Japanese + European

30 of 37

Statsmodels encodes categorical variables directly with

mpg ~ C(origin)
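A minimal sketch, assuming the auto-mpg data is already loaded in df with an origin column holding the category labels:

import statsmodels.formula.api as smf

# C() dummy-encodes origin and drops one reference category automatically
results = smf.ols('mpg ~ C(origin)', data = df).fit()
results.summary()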

Categorical predictors – statsmodels

31 of 37

df[['mpg','origin']].groupby(by = 'origin').mean().reset_index()

The mean per category

  • Intercept = mpg_American
  • origin[T.European] = mpg_European − mpg_American
  • origin[T.Japanese] = mpg_Japanese − mpg_American

mpg ~ C(origin)

Interpreting the coefficients – categories

32 of 37

Binary encoding!

import category_encoders as ce

# define the encoder

encoder = ce.BinaryEncoder(cols=['brand'])

df = encoder.fit_transform(df)

Instead of

  • mpg ~ brand

We use the model

  • mpg ~ brand_0 + brand_1 + … + brand_5

Categorical predictors

How to include car brand in auto-mpg?

There are 36 categories.

Some categories have few samples.

33 of 37

Binary encoding

Number of binary variables = floor( log2(number of categories) ) + 1
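A quick check of this formula for the 36 car brands of the previous slide:

import math

n_categories = 36
n_binary_variables = math.floor(math.log2(n_categories)) + 1   # 5 + 1 = 6 (brand_0 … brand_5)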

34 of 37

Category encoders - the library

A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.

35 of 37

Multiclass Classification - Multinomial

36 of 37

Multinomial classification

One vs One

For N categories, one binary model is built for each pair of classes: N(N − 1)/2 models.

The final prediction is obtained by voting (or by averaging the scores).

37 of 37

Multinomial classification

One vs Rest

N models are needed: one per class, each class against all the others.

Beware of error propagation from the individual binary models to the final decision.
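Both strategies are available as wrappers in scikit-learn (a sketch, assuming X and y hold a multiclass dataset):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# one binary logistic regression per class (N models)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# one binary logistic regression per pair of classes (N(N-1)/2 models), combined by voting
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)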