Get the best from your scikit-learn classifier
EuroSciPy 2023
Guillaume Lemaitre - August 16, 2023
About me
Research Engineer
@glemaitre
@glemaitre@fosstodon.org
2017 · 2019 · 2023
Problem statement
Imbalanced classification
The number of cancer voxels is much smaller than the number of healthy voxels.
20:1
Problem statement
Imbalanced classification
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
model = make_pipeline(SMOTE(), LogisticRegression())
cv_results = cross_validate(model, X, y)
Problem statement
Imbalanced classification
~6 years…
Learning from imbalanced data:
I was wrong but I was not the only one
EuroSciPy 2023
Guillaume Lemaitre - August 16, 2023
Strong claim
There is no problem learning from imbalanced data
[1] Fissler, Tobias, Christian Lorentzen, and Michael Mayer. "Model comparison and calibration assessment: user guide for consistent scoring functions in machine learning and actuarial practice." arXiv preprint arXiv:2202.12780 (2022).
“No resampling technique will magically generate more information out of the few cases with the rare class” [1]
A “typical” use-case
Adult census dataset
<class 'pandas.core.frame.DataFrame'>
Index: 38393 entries, 0 to 6008
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             38393 non-null  int64
 1   workclass       35821 non-null  category
 2   education       38393 non-null  category
 3   marital-status  38393 non-null  category
 4   occupation      35811 non-null  category
 5   relationship    38393 non-null  category
 6   race            38393 non-null  category
 7   sex             38393 non-null  category
 8   capital-gain    38393 non-null  int64
 9   capital-loss    38393 non-null  int64
 10  hours-per-week  38393 non-null  int64
 11  native-country  37731 non-null  category
dtypes: category(8), int64(4)
memory usage: 1.8 MB
from sklearn.datasets import fetch_openml
data, target = fetch_openml(
    "Adult", as_frame=True, return_X_y=True
)
target.value_counts()
<=50K 37155
>50K 1238
Name: count, dtype: int64
A “typical” use-case
Experimental setup
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data, target, scoring="balanced_accuracy")
Vanilla Random Forest
Vanilla Logistic Regression
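A minimal sketch of how the two vanilla baselines can be built (the exact preprocessing is not shown on the slide, so the encoders below are assumptions): one-hot encoding plus scaling for the logistic regression, ordinal encoding for the random forest, both evaluated with the cross-validation call above. It assumes a recent scikit-learn for the missing-value handling of OrdinalEncoder.
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

categorical_columns = make_column_selector(dtype_include="category")

# Linear model: impute the few missing categories, one-hot encode them,
# and standardize the numeric columns.
preprocessor_linear = make_column_transformer(
    (
        make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore"),
        ),
        categorical_columns,
    ),
    remainder=StandardScaler(),
)

# Tree-based model: ordinal-encode the categories; missing and unknown
# categories get their own integer codes.
preprocessor_forest = make_column_transformer(
    (
        OrdinalEncoder(
            handle_unknown="use_encoded_value",
            unknown_value=-1,
            encoded_missing_value=-2,
        ),
        categorical_columns,
    ),
    remainder="passthrough",
)

logistic_regression = make_pipeline(
    preprocessor_linear, LogisticRegression(max_iter=1_000)
)
random_forest = make_pipeline(
    preprocessor_forest, RandomForestClassifier(n_jobs=-1)
)

for name, clf in [("LogReg", logistic_regression), ("RF", random_forest)]:
    cv_results = cross_validate(clf, data, target, scoring="balanced_accuracy")
    print(name, cv_results["test_score"].mean())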
A “typical” use-case
Results of vanilla strategy
A “typical” use-case
“Over”-fighting the imbalance
SMOTE Logistic Regression
Balanced Random Forest
A “typical” use-case
“Over”-fighting the imbalance
Synthetic Minority Oversampling TEchnique
(SMOTE)
Balanced Random Forest
Each tree in the forest is given a dataset derived from the original one by bootstrapping and randomly under-sampling the majority class, so that both classes end up balanced.
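A sketch of the two "over-fighting" variants with imbalanced-learn, reusing the hypothetical preprocessors defined in the previous sketch:
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as make_imblearn_pipeline
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# SMOTE generates synthetic minority samples in the encoded feature space
# before fitting the linear model.
smote_logistic_regression = make_imblearn_pipeline(
    clone(preprocessor_linear), SMOTE(), LogisticRegression(max_iter=1_000)
)

# Each tree of the balanced forest is trained on a bootstrap sample where the
# majority class is under-sampled to the size of the minority class.
balanced_random_forest = make_pipeline(
    clone(preprocessor_forest), BalancedRandomForestClassifier(n_jobs=-1)
)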
A “typical” use-case
“Over”-fighting the imbalance
Resampling for fighting class imbalance
The potential caveats
Resampling breaks calibration!
Resampling is used to compensate for the inflexibility of the decision threshold (0.5 by default), but it makes the values returned by model.predict_proba meaningless as probability estimates!
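One way to see this in practice: compare the reliability curves of the vanilla and SMOTE pipelines on a held-out set. This sketch assumes the pipelines defined in the earlier sketches and encodes the minority class as the positive label.
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import train_test_split

y = (target == ">50K").astype(int)  # positive class = minority class
data_train, data_test, y_train, y_test = train_test_split(
    data, y, stratify=y, random_state=0
)

fig, ax = plt.subplots()
for name, clf in [
    ("vanilla LogReg", logistic_regression),
    ("SMOTE + LogReg", smote_logistic_regression),
]:
    clf.fit(data_train, y_train)
    CalibrationDisplay.from_estimator(
        clf, data_test, y_test, n_bins=10, name=name, ax=ax
    )
ax.set_title("Resampling shifts the predicted probabilities")
plt.show()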
A new scikit-learn estimator [1]
Post-tuning the estimator decision threshold given a metric
from sklearn.model_selection import TunedThresholdClassifier
tuned_model = TunedThresholdClassifier(
    estimator=model, objective_metric="balanced_accuracy"
)
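A usage sketch for the estimator shown above; the class and parameter names follow the proposed API on the slide, and `model` is assumed to be one of the pipelines defined earlier (for instance the vanilla logistic regression).
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, stratify=target, random_state=0
)

model.fit(data_train, target_train)
tuned_model.fit(data_train, target_train)

# The vanilla model thresholds predict_proba at 0.5; the tuned model uses the
# threshold that maximized balanced accuracy on its internal validation splits.
print("0.5 threshold  :", balanced_accuracy_score(target_test, model.predict(data_test)))
print("tuned threshold:", balanced_accuracy_score(target_test, tuned_model.predict(data_test)))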
Post-tuned threshold
Logistic Regression
A new scikit-learn estimator [1]
Post-tuning the estimator decision threshold given a metric
Post-tuned threshold
Logistic Regression
Post-tuned threshold
Random Forest
A new scikit-learn estimator [1]
Post-tuning the estimator decision threshold given a metric
Tune your hyperparameters: which metric?
Unthresholded, probabilistic metric computed on model.predict_proba(X_test)
vs.
Thresholded metric computed on model.predict(X_test)
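To make the distinction concrete, a sketch computing one metric of each family, assuming the fitted model and train/test split from the previous sketch:
from sklearn.metrics import balanced_accuracy_score, brier_score_loss, log_loss

proba_test = model.predict_proba(data_test)

# Proper scoring rules consume the probabilities directly...
print("log loss   :", log_loss(target_test, proba_test))
print("Brier score:", brier_score_loss(target_test, proba_test[:, 1], pos_label=">50K"))
# ...while thresholded metrics only see the hard decisions made at 0.5.
print("balanced accuracy:", balanced_accuracy_score(target_test, model.predict(data_test)))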
Tuning hyperparameters on the probabilistic metric
Effect of hyper-parameter tuning on final metric
Before tuning
Effect of hyper-parameter tuning on final metric
After tuning
Optimum decision threshold
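A sketch of that order with assumed names: search the hyperparameters against a proper scoring rule (log loss), then tune the decision threshold of the resulting model for the thresholded metric, using the estimator introduced on the previous slides.
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# Step 1: hyperparameter search scored with a probabilistic metric.
search = RandomizedSearchCV(
    logistic_regression,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e3)},
    scoring="neg_log_loss",
    n_iter=20,
    random_state=0,
)

# Step 2: post-tune the decision threshold for the thresholded metric.
tuned_search = TunedThresholdClassifier(
    estimator=search, objective_metric="balanced_accuracy"
)
tuned_search.fit(data_train, target_train)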
Cost-sensitive learning [1]
|          | Fraudulent   | Legitimate    |
| Refused  | 50€ + amount | -5€           |
| Accepted | -amount      | 0.02 × amount |
Credit card frauds example
credit_card = fetch_openml(data_id=1597, as_frame=True)
columns_to_drop = ["Class", "Amount"]
data = credit_card.frame.drop(columns=columns_to_drop)
target = credit_card.frame["Class"].astype(int)
amount = credit_card.frame["Amount"].to_numpy()
Cost-sensitive learning
Metadata routing (SLEP006 [1])
def business_metric(y_true, y_pred, amount):
    # Entries of the confusion matrix: frauds are the positive class (1).
    mask_tp = (y_true == 1) & (y_pred == 1)  # fraud refused
    mask_tn = (y_true == 0) & (y_pred == 0)  # legitimate accepted
    mask_fp = (y_true == 0) & (y_pred == 1)  # legitimate refused
    mask_fn = (y_true == 1) & (y_pred == 0)  # fraud accepted
    # Gains and costs from the table above, in euros.
    fraudulent_refuse = (mask_tp.sum() * 50) + amount[mask_tp].sum()
    fraudulent_accept = -amount[mask_fn].sum()
    legitimate_refuse = mask_fp.sum() * -5
    legitimate_accept = (amount[mask_tn] * 0.02).sum()
    return fraudulent_refuse + fraudulent_accept + legitimate_refuse + legitimate_accept
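A quick sanity check of the metric on two trivial policies, assuming `target` and `amount` from the loading step above:
import numpy as np

always_accept = np.zeros_like(target)  # predict "legitimate" everywhere
always_refuse = np.ones_like(target)   # predict "fraudulent" everywhere

print("accept everything:", business_metric(target, always_accept, amount))
print("refuse everything:", business_metric(target, always_refuse, amount))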
import sklearn
from sklearn.metrics import make_scorer

sklearn.set_config(enable_metadata_routing=True)
business_scorer = make_scorer(business_metric).set_score_request(amount=True)
business_scorer(model, data, target, amount=amount)
tuned_model = TunedThresholdClassifier(
    estimator=model, objective_metric=business_scorer
)
tuned_model.fit(data_train, target_train, amount=amount_train)
business_score = business_scorer(
    tuned_model, data_test, target_test, amount=amount_test
)
Cost-sensitive learning
Metadata routing (SLEP006 [1])
cv_results_tuned_model = cross_validate(
    tuned_model, data, target, params={"amount": amount}, scoring=business_scorer
)
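To read the result (a sketch over the names defined above): the scorer returns a gain in euros, so higher is better.
import numpy as np

test_scores = cv_results_tuned_model["test_score"]
print(
    f"cross-validated business gain: {np.mean(test_scores):,.0f} EUR "
    f"+/- {np.std(test_scores):,.0f} EUR"
)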
Cost-sensitive learning
Metadata routing (SLEP006 [1])
Conclusion
Take-away