Learning from imbalanced data

Ido Zehori

Big Data Analytics #4

Hello!

I am Ido Zehori

Data scientist @

You can find me at:

@IZehori

idozehori.com

@IdoZehori

Agenda

  • What is imbalanced data
  • The accuracy paradox
  • Alternative performance metrics
  • Ratio parameter
  • Resampling methods
  • Ensemble methods

1.

What is Imbalanced Data?

What is Imbalanced Data?

  • The class distribution is not uniform: one class heavily outnumbers the other(s).
  • The cost of misclassifying the minority class is higher than that of the majority.

Where can we find Imbalanced Data?

The stack

2.

The accuracy paradox

Let's generate some data…
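The deck doesn't show the generation code, but a minimal sketch with scikit-learn's `make_classification` (the `weights` argument controls the imbalance; the exact parameters here are assumptions) could look like:

```python
from sklearn.datasets import make_classification

# Two informative features so the data can be plotted; roughly 95% of
# samples fall into class 0 and 5% into the minority class 1.
X, y = make_classification(
    n_samples=2000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],
    random_state=42,
)
print(f"minority fraction: {y.mean():.3f}")
```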

Decision boundary

A decision boundary, or decision surface, is a hypersurface that partitions the underlying vector space into two sets, one for each class.

Accuracy: 94%

What percent of our predictions were correct?

We can do even better…

Accuracy: 95%

The accuracy paradox

We are predicting if a user is going to click on an ad banner

  • Which is a better classifier, A or B?
  • Which has a higher accuracy rate?
  • Accuracy is a bad metric in cases like this.
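A tiny numeric illustration of the paradox (the 5% click rate is a hypothetical figure): a classifier that always predicts "no click" scores high accuracy while being useless.

```python
import numpy as np

# 1,000 users; only 5% actually click the banner.
y_true = np.array([1] * 50 + [0] * 950)

# Classifier B: always predicts "no click".
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # looks impressive
recall = y_pred[y_true == 1].mean()    # fraction of clickers found

print(accuracy)  # 0.95
print(recall)    # 0.0 -- it never finds a single clicker
```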

3.

Alternative performance metrics

The confusion matrix

Our model: (confusion matrix figure)

ROC curve

AUC

(Area under the curve)

The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

What is our naive model's AUC score?

AUC score: 0.562

Geometric mean score: 0.356
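The deck doesn't show the computation behind these scores. With scikit-learn (and hypothetical toy labels and scores, just to demonstrate the mechanics), AUC and the geometric mean of per-class recalls — what imbalanced-learn's `geometric_mean_score` computes — can be obtained like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and predicted scores.
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.8, 0.3, 0.9, 0.6])
y_pred  = (y_score >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_score)

# Geometric mean of sensitivity and specificity.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
specificity = tn / (tn + fp)   # recall on the negative (majority) class
gmean = np.sqrt(sensitivity * specificity)

print(f"AUC: {auc:.3f}, G-mean: {gmean:.3f}")
```

Unlike accuracy, the G-mean collapses to zero if either class is completely missed, which is exactly what we want on imbalanced data.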

4.

Ratio parameter

AUC score: 0.814
Geometric mean score: 0.812
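The ratio slide doesn't include code. One common way to express the class ratio without resampling at all is scikit-learn's `class_weight` parameter (the dataset below is an assumed stand-in for the one in the talk):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' weights errors inversely to class frequency,
# so mistakes on the rare class cost as much as mistakes on the common one.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

imbalanced-learn's samplers expose a similar knob (the `ratio` argument in older versions, `sampling_strategy` in newer ones).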

5.

Resampling

Resampling

Over:

AUC score: 0.819
Geometric mean score: 0.810

Under:

AUC score: 0.816
Geometric mean score: 0.815
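imbalanced-learn provides `RandomOverSampler` and `RandomUnderSampler` (both with a `fit_resample` method); as a hedge against that dependency, here is a minimal numpy-only sketch of what the two strategies do:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate random minority samples until both classes are equal in size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y):
    """Keep a random majority subset the same size as the minority class."""
    minority = np.flatnonzero(y == 1)
    keep = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
    idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

# Toy data: 16 majority vs 4 minority samples.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_over, y_over = random_oversample(X, y)     # 16 vs 16
X_under, y_under = random_undersample(X, y)  # 4 vs 4
```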

Cluster Centroids

Replaces clusters of majority samples with the cluster centroids of a K-Means fit.

Advantage:

  • Avoids the information loss of dropping majority-class samples at random.
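imbalanced-learn implements this as `ClusterCentroids`; a minimal sketch of the idea with scikit-learn's `KMeans` (toy Gaussian data is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_centroids(X, y, random_state=0):
    """Under-sample the majority class by replacing it with the centroids
    of a K-Means fit, with K equal to the minority class size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    km = KMeans(n_clusters=len(X_min), n_init=10, random_state=random_state)
    km.fit(X_maj)
    X_new = np.vstack([km.cluster_centers_, X_min])
    y_new = np.concatenate([np.zeros(len(X_min)), np.ones(len(X_min))])
    return X_new, y_new

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 50 + [1] * 5)
X_bal, y_bal = cluster_centroids(X, y)   # 5 centroids vs 5 minority samples
```

Each centroid summarizes a whole cluster of majority points, which is why this loses less structure than discarding random majority samples.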

AUC score: 0.806
Geometric mean score: 0.8

One-Sided Selection

The majority samples are roughly divided into 4 groups:

  • Noise examples: lie close to minority examples.
  • Borderline examples: lie close to the decision boundary and are unreliable.
  • Redundant examples: provide no additional useful information.
  • Safe examples: worth keeping for training.

AUC score: 0.613
Geometric mean score: 0.482

SMOTE: Synthetic Minority Over-sampling Technique

Generates new minority-class samples along the line segments between each minority sample S and its nearest minority neighbors.

  • Avoids the overfitting that occurs when random over-sampling adds exact replicas of minority instances.
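imbalanced-learn ships this as `SMOTE` (with `fit_resample`); the core interpolation step can be sketched in a few lines of numpy and scikit-learn (parameters here are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic points on random line segments between a
    minority sample S and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the sample itself
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))             # random position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(2)
X_min = rng.normal(0.0, 1.0, (20, 2))   # toy minority class
X_syn = smote(X_min, n_new=30)
```

Because every synthetic point lies between two real minority points, the new samples fill out the minority region instead of stacking duplicates on top of it.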

AUC score: 0.814
Geometric mean score: 0.814

Tomek links

  • Can be used as an under-sampling method or as a data cleaning method.

  • When combined with SMOTE, Tomek links are applied to the over-sampled training set as a data-cleaning step.
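A Tomek link is a pair of opposite-class samples that are each other's nearest neighbor. A minimal detector (the toy points below are assumptions; imbalanced-learn's `TomekLinks` does this for real datasets):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority(X, y):
    """Indices of majority samples in a Tomek link: the sample and a
    minority sample are each other's nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]                       # idx[:, 0] is the point itself
    links = [
        i
        for i in np.flatnonzero(y == 0)
        if y[nearest[i]] == 1 and nearest[nearest[i]] == i
    ]
    return np.array(links, dtype=int)

# The opposite-class pair near (5, 5) are mutual nearest neighbors,
# so they form a Tomek link; the cluster near the origin does not.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
              [0.25, 0.0], [9.0, 9.0]])
y = np.array([0, 0, 0, 1, 1, 0])
print(tomek_link_majority(X, y))   # -> [2]
```

For under-sampling you drop the majority member of each link; for cleaning after SMOTE you can drop both.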

SMOTE + Tomek

  • Avoids overfitting.
  • New, similar synthetic instances are created.
  • The resulting dataset is used to train the classification models.

AUC score: 0.82
Geometric mean score: 0.82

6.

Ensemble methods

Easy Ensemble

Creates an ensemble of training sets by iteratively applying random under-sampling.

  • Each iteration selects a random majority subset; the models trained on the different sets are combined into an ensemble.
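imbalanced-learn offers this as `EasyEnsembleClassifier`; the idea itself fits in a short scikit-learn sketch (logistic regression as the base learner is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def easy_ensemble(X, y, n_estimators=10, random_state=0):
    """Fit one classifier per random balanced under-sample of the majority."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        subset = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([subset, minority])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    """Average the members' predicted minority-class probabilities."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
proba = ensemble_proba(easy_ensemble(X, y), X)
```

Each member sees a balanced view of the data, yet collectively the ensemble has looked at most of the majority class, so little information is thrown away.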

Some more techniques

Resampling:

  • Borderline-SMOTE: generates the synthetic samples along the borderline between the minority and majority classes.
  • DBSMOTE: Density-Based Synthetic Minority Over-sampling Technique, built on the DBSCAN clustering algorithm.
  • ANS: dynamically adapts the number of neighbors needed for over-sampling around different minority regions; parameter-free.

Ensemble

  • BalanceCascade: ensures that misclassified samples can be selected again for the next subset; the classifier plays the role of a “smart” replacement method.

Summary

  • Ratio parameter
  • Over/Under Sampling
  • Ensemble

Thanks!

Any questions?

zehori.ido@gmail.com

idozehori.com
