Logistic Regression
classification
Logistic regression – classification – ML
The output of linear regression is continuous
If we use linear regression to classify animals (cat = 0, rabbit = 1, dog = 2) we need arbitrary thresholds that do not reflect reality
Binary classification : target variable is categorical
Why would we have an arbitrary order between animals ?
does
make more sense than
Furthermore, the output ŷ of linear regression is not constrained or capped.
Can't use linear regression for classification
The main idea behind binary classification with logistic regression
Goal: Binary Classification: 0 / 1
Logistic regression
Logistic function
from ℝ to [0, 1]
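A minimal sketch of the logistic function, to make the mapping from any real number to a value between 0 and 1 concrete. The function name `sigmoid` is chosen here for illustration and does not come from the slides.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this algebraically equivalent form avoids overflow in exp(-z)
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
```

The output is exactly 0.5 at z = 0, which is why 0.5 is the natural default threshold later on.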
Logistic regression vs. linear regression
ŷ = a_1 x_1 + a_0
ŷ = a_1 x_1 + a_2 x_2 + a_0
2D
3D
Find the best hyperplane that separates the data
import statsmodels.formula.api as smf
import pandas as pd
# Load the dataset
df = pd.read_csv('credit_default_sampled.csv')
# Instantiate the model
model = smf.logit('default ~ income + balance', data = df)
# Fit the model to the data
results = model.fit()
# Results
results.summary()
Predictors:
Credit default dataset
Target Variable:
default ~ income + balance
R-squared
F-statistic
t-statistic
Logistic regression result
The histogram of estimated values provides a good indication of the model's separation power.
Poor model
very confused
Excellent model
clear separation
y_proba = results.predict(df[['income', 'balance']])
Probabilities histogram
# output of the model predictions as probabilities, yhat in [0,1]
y_proba = results.predict(df[['income', 'balance']])
# transform the probabilities into a class
predicted_class = (y_proba > 0.5).astype(int)
print(predicted_class)
> [1,1,1,1...,0,0,0]
Predicted classes as a function of the predicted probabilities
Note that the choice of the classification threshold (0.5) remains arbitrary.
4 possible cases
2 correct ones:
2 false ones:
         | Predicted 1     | Predicted 0     |
Actual 1 | True Positives  | False Negatives |
Actual 0 | False Positives | True Negatives  |
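The four cases in the table can be counted directly from the true and predicted labels. This is a small sketch with hypothetical labels (not taken from the credit default dataset); the helper name `confusion_counts` is made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actual 1, predicted 1
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actual 1, predicted 0
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # actual 0, predicted 1
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # actual 0, predicted 0
    return tp, fn, fp, tn

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)
```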
results.pred_table()
confusion matrix
                   | Predicted Default | Predicted Non-Default |
Actual Default     | 286               | 47                    |
Actual Non-Default | 40                | 460                   |
Out of 333 default samples:
And out of 500 non-default samples,
Confusion matrix: default ~ income + balance
classification metrics
We can define multiple metrics using the confusion matrix
         | Predicted 1     | Predicted 0     |
Actual 1 | True Positives  | False Negatives |
Actual 0 | False Positives | True Negatives  |
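As a sketch of how the usual metrics follow from the four cells, the snippet below uses the counts of the default ~ income + balance confusion matrix at threshold 0.5 (286, 47, 40, 460). The function name `classification_metrics` is an illustrative choice.

```python
def classification_metrics(tp, fn, fp, tn):
    """Common metrics derived from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # share of correct predictions
        "tpr": tp / (tp + fn),           # true positive rate (recall, sensitivity)
        "fpr": fp / (fp + tn),           # false positive rate
        "precision": tp / (tp + fp),     # share of predicted positives that are real
    }

m = classification_metrics(tp=286, fn=47, fp=40, tn=460)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

Note that TPR is normalised by the actual positives (333) while FPR is normalised by the actual negatives (500).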
Confusion matrix - (the name is fitting)
from Wikipedia Confusion Matrix page
Your turn – default ~ income + balance + student
Build the model: default ~ income + balance + student
What happens if we use a different classification threshold ?
The confusion matrix and associated metrics also change.
t = 0.75 | Predicted Default | Predicted Non-Default |
Actual Default | 243 | 90 |
Actual Non-Default | 19 | 481 |
What about the threshold ?
t = 0.5 | Predicted Default | Predicted Non-Default |
Actual Default | 286 | 47 |
Actual Non-Default | 40 | 460 |
Acc (0.75) = (243 + 481) / 833 = 86.91%
TPR = 243 / 333 = 72.97%
FPR = 19 / 500 = 3.80%
Acc (0.5) = (286 + 460) / 833 = 89.56%
TPR = 286 / 333 = 85.89%
FPR = 40 / 500 = 8.00%
Plotting the TPR vs. FPR curve while varying the threshold from 0 to 1 gives the ROC curve.
The area under the curve (AUC) is a more robust metric than accuracy because it does not depend on a single threshold.
from sklearn.metrics import roc_curve, roc_auc_score

# predicted probabilities of default, in [0, 1]
yhat = results.predict(df)
# FPR and TPR for each candidate threshold
fpr, tpr, thresholds = roc_curve(df['default'], yhat)
# area under the ROC curve
score = roc_auc_score(df['default'], yhat)
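To show what a threshold sweep computes, here is a hedged pure-Python sketch of the ROC points and the trapezoidal area under them. It is not the scikit-learn implementation; the names `roc_points` and `auc` and the four toy samples are assumptions made for illustration.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold across the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted 1
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.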
ROC-AUC
Your turn – default ~ income + balance + student
For the model default ~ income + balance + student:
- Build the model
- Interpret the results
- Compute the confusion matrix
- Calculate accuracy, TPR, and FPR
- Compute AUC
- Plot the ROC curve
Compare with the model default ~ income.
Imbalanced datasets
The accuracy paradox
The original credit default dataset contains:
A (stupid, useless) model that always predicts non-default has an accuracy of 96.67%.
The target class is a strong minority. The dataset is imbalanced.
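The accuracy paradox is easy to reproduce. The sketch below assumes 10,000 samples with 333 defaults; these sizes are an assumption chosen only to match the 96.67% figure, not values read from the dataset.

```python
# Assumed sizes, chosen only to reproduce the 96.67% figure
n_total, n_default = 10_000, 333

y_true = [1] * n_default + [0] * (n_total - n_default)
y_pred = [0] * n_total  # a model that always predicts non-default

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
# True positive rate: share of actual defaults that the model catches
tpr = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_default

print(f"accuracy = {accuracy:.2%}, TPR = {tpr:.2%}")
```

The accuracy looks excellent while the model detects none of the defaults, which is why TPR, FPR and AUC matter on imbalanced data.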
Four strategies to handle class imbalance:
Balance the classes
Strategies to address the minority-class imbalance problem
We can also
The goal is not to balance the classes perfectly at 50 / 50
Your turn – imbalanced dataset
# oversample the minority class (assumes default == 1 marks a default)
df = pd.read_csv('credit_default.csv')
minority = df[df.default == 1].sample(n = 2000, replace = True)
majority = df[df.default == 0]
data = pd.concat([minority, majority])
# shuffle
data = data.sample(frac = 1)
On the full credit default dataset:
Then:
SMOTE
SMOTE randomly picks a sample from the minority class, selects one of its nearest minority-class neighbors, and creates a synthetic sample by linear interpolation between the two.
The SMOTE algorithm generates artificial samples in three steps,
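One common way to describe the interpolation is: pick a minority sample, pick one of its nearest minority neighbors, then draw a point on the segment between them. The sketch below follows that description; it is a simplified illustration, not the imbalanced-learn implementation, and the function name `smote_sample` and the toy points are hypothetical.

```python
import random

def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point from a list of minority-class points (tuples)."""
    # Step 1: pick a minority sample at random
    x = rng.choice(minority)
    # Step 2: find its k nearest minority neighbors (squared Euclidean distance)
    others = [p for p in minority if p is not x]
    neighbors = sorted(
        others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
    )[:k]
    nb = rng.choice(neighbors)
    # Step 3: interpolate linearly between the sample and the chosen neighbor
    lam = rng.random()  # weight in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(x, nb))

# Hypothetical 2D minority points, for illustration only
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, so no value falls outside the range already observed for that class.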
SMOTE does not work!
Imbalanced learn
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning http://imbalanced-learn.org
Encoding Categorical variables
One-hot encoding – Dummy encoding
How to convert categorical variables into numerical variables?
Binary:
Multinomial, non-ordinal:
Examples
Car brand: Audi, Renault, Ford, Fiat
If an arbitrary number is assigned to each brand, a hierarchy is created:
Similarly:
Sometimes, assigning a number to each category makes sense—ordered categories:
# dtype=int gives 0/1 columns instead of booleans
origin_variables = pd.get_dummies(df.origin, dtype=int)
df = df.merge(origin_variables, left_index=True, right_index= True )
results = smf.ols('mpg ~ Japanese + European', data = df).fit()
Categorical predictors – dummy encoding
# Try
results = smf.ols('mpg ~ Japanese + European + American', data = df).fit()
Load auto-mpg
Origin (3) and name (many) are unordered (non-ordinal) categories.
How to include origin as a predictor in a linear model?
With Pandas, create one variable per category:
N-1 new variables are needed for N categories.
pd.get_dummies() creates one variable per category (N in total); keep only N-1 of them in the model.
Then, define the model:
mpg ~ American + European
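A minimal pure-Python sketch of dummy encoding, showing why N categories yield N-1 columns once a reference category is dropped. The helper `dummy_encode` is hypothetical; pandas offers the same option through `pd.get_dummies(..., drop_first=True)`.

```python
def dummy_encode(values, drop_first=True):
    """Return one dict of 0/1 indicators per value."""
    categories = sorted(set(values))
    # Dropping the first category makes it the reference level
    kept = categories[1:] if drop_first else categories
    return [{c: int(v == c) for c in kept} for v in values]

origins = ["American", "European", "Japanese", "American"]
rows = dummy_encode(origins)
for origin, row in zip(origins, rows):
    print(origin, row)
```

The reference category (here American) is encoded as all zeros, so its effect is absorbed by the intercept of the model.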
Statsmodels encodes categorical variables directly with
mpg ~ C(origin)
Categorical predictors – statsmodels
df[['mpg','origin']].groupby(by = 'origin').mean().reset_index()
Mean per category
mpg ~ C(origin)
Interpreting the coefficients – categories
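With a single categorical predictor and dummy encoding, the fitted intercept equals the mean of the reference category, and each coefficient equals the difference between that category's mean and the reference mean. The sketch below illustrates this with hypothetical mpg values (not taken from auto-mpg) and American as an assumed reference level.

```python
from statistics import mean

# Hypothetical mpg values per origin, for illustration only
mpg_by_origin = {
    "American": [18.0, 20.0, 22.0],
    "European": [26.0, 28.0],
    "Japanese": [30.0, 32.0, 31.0],
}

means = {origin: mean(values) for origin, values in mpg_by_origin.items()}

reference = "American"          # assumed reference category
intercept = means[reference]    # intercept = mean of the reference category
coefs = {o: m - intercept for o, m in means.items() if o != reference}

print("intercept:", intercept)
print("coefficients:", coefs)
```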
Binary encoding!
import category_encoders as ce
# define the encoder
encoder = ce.BinaryEncoder(cols=['brand'])
df = encoder.fit_transform(df)
Instead of
We use the model
Categorical predictors
How to include car brand in auto-mpg?
There are 36 categories.
Some categories have few samples.
Binary encoding
Number of binary variables ≈ ceil(log2(number of categories)), e.g. 6 for the 36 car brands
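A hedged sketch of the idea behind binary encoding: number the categories, then write each number in base 2, one column per bit. The helper `binary_encode` is hypothetical and uses the minimum number of bits; the category_encoders `BinaryEncoder` may lay out its columns differently.

```python
def binary_encode(categories):
    """Map each category to a tuple of bits, using the minimum number of columns."""
    cats = sorted(set(categories))
    # Bits needed to write the largest index, len(cats) - 1, in base 2
    n_bits = max(1, (len(cats) - 1).bit_length())
    return {
        c: tuple(int(b) for b in format(i, f"0{n_bits}b"))
        for i, c in enumerate(cats)
    }

codes = binary_encode(["Audi", "Renault", "Ford", "Fiat"])
for brand, bits in codes.items():
    print(brand, bits)

# 36 categories need 6 binary columns, since 5 bits only cover 32 values
print(len(binary_encode(range(36))[0]))
```

Compared with one-hot encoding, this keeps the number of columns small when a variable has many categories, at the cost of columns that are harder to interpret.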
Category encoders - the library
A set of scikit-learn-style transformers for encoding categorical variables into numeric with different techniques.
Multiclass Classification - Multinomial
Multinomial classification
One vs One
For N categories, one model is built per pair of categories: N(N-1)/2 models
The final prediction is obtained by voting or averaging
Multinomial classification
One vs Rest
N models are needed, one per category (N-1 if the models are chained, each separating one category from those remaining)
Error propagation
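To compare the two schemes, the sketch below lists the binary models each one requires and shows a simple one-vs-rest decision rule. The class names and probabilities are hypothetical, and `ovr_predict` is an illustrative helper rather than a library function.

```python
from itertools import combinations

# Hypothetical classes, for illustration only
classes = ["cat", "rabbit", "dog", "bird"]

# One vs One: one binary model per pair of classes, N(N-1)/2 in total
ovo_models = list(combinations(classes, 2))

# One vs Rest (standard scheme): one binary model per class, N in total
ovr_models = [(c, "rest") for c in classes]

def ovr_predict(scores):
    """Pick the class whose 'class vs rest' model gives the highest probability."""
    return max(scores, key=scores.get)

print(len(ovo_models), "one-vs-one models:", ovo_models)
print(len(ovr_models), "one-vs-rest models:", ovr_models)
print(ovr_predict({"cat": 0.2, "rabbit": 0.7, "dog": 0.4, "bird": 0.1}))
```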