Learning from imbalanced data

Ido Zehori

Big Data Analytics #4

Hello!

I am Ido Zehori

Data scientist @

You can find me at:

@IZehori

idozehori.com

@IdoZehori

Agenda

  • What is imbalanced data
  • The accuracy paradox
  • Alternative performance metrics
  • Ratio parameter
  • Resampling methods
  • Ensemble methods

1.

What is Imbalanced Data?

What is Imbalanced Data?

  • The class distribution is not uniform: one class heavily outnumbers the other(s).
  • The cost of misclassifying the minority class is higher than that of the majority.

Where can we find Imbalanced Data?

The stack

2.

The accuracy paradox

Let's generate some data…
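The deck doesn't show the generation code, but a minimal sketch with scikit-learn's `make_classification` (the `weights` argument controls the imbalance; the exact parameters here are assumptions) could look like:

```python
from sklearn.datasets import make_classification

# Two informative features so the data can be plotted; roughly 95% of
# samples fall into class 0 and 5% into the minority class 1.
X, y = make_classification(
    n_samples=2000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],
    random_state=42,
)
print(f"minority fraction: {y.mean():.3f}")
```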

Decision boundary

A decision boundary, or decision surface, is a hypersurface that partitions the underlying vector space into two sets, one for each class.

Accuracy: 94%

What percent of our predictions were correct?

We can do even better…

Accuracy: 95%

The accuracy paradox

We are predicting if a user is going to click on an ad banner

  • Which is a better classifier, A or B?
  • Which has a higher accuracy rate?
  • Accuracy is a bad metric in cases like this.
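A tiny numeric illustration of the paradox (the 5% click rate is a hypothetical figure): a classifier that always predicts "no click" scores high accuracy while being useless.

```python
import numpy as np

# 1,000 users; only 5% actually click the banner.
y_true = np.array([1] * 50 + [0] * 950)

# Classifier B: always predicts "no click".
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # looks impressive
recall = y_pred[y_true == 1].mean()    # fraction of clickers found

print(accuracy)  # 0.95
print(recall)    # 0.0 -- it never finds a single clicker
```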

3.

Alternative performance metrics

The confusion matrix

Our model: (confusion matrix figure)

ROC curve

AUC

(Area under the curve)

The probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

What is our naive model's AUC score?

AUC score: 0.562

Geometric mean score: 0.356
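The deck doesn't show the computation behind these scores. With scikit-learn (and hypothetical toy labels and scores, just to demonstrate the mechanics), AUC and the geometric mean of per-class recalls — what imbalanced-learn's `geometric_mean_score` computes — can be obtained like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and predicted scores.
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.8, 0.3, 0.9, 0.6])
y_pred  = (y_score >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_score)

# Geometric mean of sensitivity and specificity.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
specificity = tn / (tn + fp)   # recall on the negative (majority) class
gmean = np.sqrt(sensitivity * specificity)

print(f"AUC: {auc:.3f}, G-mean: {gmean:.3f}")
```

Unlike accuracy, the G-mean collapses to zero if either class is completely missed, which is exactly what we want on imbalanced data.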

4.

Ratio parameter

AUC score: 0.814
Geometric mean score: 0.812
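The ratio slide doesn't include code. One common way to express the class ratio without resampling at all is scikit-learn's `class_weight` parameter (the dataset below is an assumed stand-in for the one in the talk):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' weights errors inversely to class frequency,
# so mistakes on the rare class cost as much as mistakes on the common one.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

imbalanced-learn's samplers expose a similar knob (the `ratio` argument in older versions, `sampling_strategy` in newer ones).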

5.

Resampling

Resampling

Over:

AUC score: 0.819
Geometric mean score: 0.810

Under:

AUC score: 0.816
Geometric mean score: 0.815
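imbalanced-learn provides `RandomOverSampler` and `RandomUnderSampler` (both with a `fit_resample` method); as a hedge against that dependency, here is a minimal numpy-only sketch of what the two strategies do:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate random minority samples until both classes are equal in size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y):
    """Keep a random majority subset the same size as the minority class."""
    minority = np.flatnonzero(y == 1)
    keep = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
    idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

# Toy data: 16 majority vs 4 minority samples.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_over, y_over = random_oversample(X, y)     # 16 vs 16
X_under, y_under = random_undersample(X, y)  # 4 vs 4
```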

Cluster Centroids

Replaces clusters of majority samples with the cluster centroids of a K-Means fit.

Advantage:

  • Avoids the information loss of dropping majority-class samples at random.
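imbalanced-learn implements this as `ClusterCentroids`; a minimal sketch of the idea with scikit-learn's `KMeans` (toy Gaussian data is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_centroids(X, y, random_state=0):
    """Under-sample the majority class by replacing it with the centroids
    of a K-Means fit, with K equal to the minority class size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    km = KMeans(n_clusters=len(X_min), n_init=10, random_state=random_state)
    km.fit(X_maj)
    X_new = np.vstack([km.cluster_centers_, X_min])
    y_new = np.concatenate([np.zeros(len(X_min)), np.ones(len(X_min))])
    return X_new, y_new

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 50 + [1] * 5)
X_bal, y_bal = cluster_centroids(X, y)   # 5 centroids vs 5 minority samples
```

Each centroid summarizes a whole cluster of majority points, which is why this loses less structure than discarding random majority samples.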

AUC score: 0.806
Geometric mean score: 0.8

One-Sided Selection

The majority samples are roughly divided into 4 groups:

  • Noise examples: lie close to minority examples.
  • Borderline examples: lie close to the decision boundary and are unreliable.
  • Redundant examples: provide no additional useful information.
  • Safe examples: worth keeping for training.

AUC score: 0.613
Geometric mean score: 0.482

SMOTE: Synthetic Minority Over-sampling Technique

Generates new minority-class samples along the line segments between each minority sample S and its nearest minority neighbors.

  • Avoids the overfitting that occurs when random over-sampling adds exact replicas of minority instances.
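imbalanced-learn ships this as `SMOTE` (with `fit_resample`); the core interpolation step can be sketched in a few lines of numpy and scikit-learn (parameters here are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic points on random line segments between a
    minority sample S and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the sample itself
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))             # random position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(2)
X_min = rng.normal(0.0, 1.0, (20, 2))   # toy minority class
X_syn = smote(X_min, n_new=30)
```

Because every synthetic point lies between two real minority points, the new samples fill out the minority region instead of stacking duplicates on top of it.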

AUC score: 0.814
Geometric mean score: 0.814

Tomek links

  • Can be used as an under-sampling method or as a data cleaning method.

  • When combined with SMOTE, Tomek links are applied to the over-sampled training set as a data-cleaning step.
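A Tomek link is a pair of opposite-class samples that are each other's nearest neighbor. A minimal detector (the toy points below are assumptions; imbalanced-learn's `TomekLinks` does this for real datasets):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority(X, y):
    """Indices of majority samples in a Tomek link: the sample and a
    minority sample are each other's nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]                       # idx[:, 0] is the point itself
    links = [
        i
        for i in np.flatnonzero(y == 0)
        if y[nearest[i]] == 1 and nearest[nearest[i]] == i
    ]
    return np.array(links, dtype=int)

# The opposite-class pair near (5, 5) are mutual nearest neighbors,
# so they form a Tomek link; the cluster near the origin does not.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
              [0.25, 0.0], [9.0, 9.0]])
y = np.array([0, 0, 0, 1, 1, 0])
print(tomek_link_majority(X, y))   # -> [2]
```

For under-sampling you drop the majority member of each link; for cleaning after SMOTE you can drop both.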

SMOTE + Tomek

  • Avoids overfitting.
  • New, similar synthetic instances are created.
  • The resulting dataset is used to train the classification models.

AUC score: 0.82
Geometric mean score: 0.82

6.

Ensemble methods

Easy Ensemble

Creates an ensemble of training sets by iteratively applying random under-sampling.

  • Each iteration selects a random majority subset; the models trained on the different sets are combined into an ensemble.
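imbalanced-learn offers this as `EasyEnsembleClassifier`; the idea itself fits in a short scikit-learn sketch (logistic regression as the base learner is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def easy_ensemble(X, y, n_estimators=10, random_state=0):
    """Fit one classifier per random balanced under-sample of the majority."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        subset = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([subset, minority])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def ensemble_proba(models, X):
    """Average the members' predicted minority-class probabilities."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
proba = ensemble_proba(easy_ensemble(X, y), X)
```

Each member sees a balanced view of the data, yet collectively the ensemble has looked at most of the majority class, so little information is thrown away.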

Some more techniques

Resampling:

  • Borderline-SMOTE: generates the synthetic samples along the borderline between the minority and majority classes.
  • DBSMOTE: Density-Based Synthetic Minority Over-sampling Technique, built on the DBSCAN clustering algorithm.
  • ANS: dynamically adapts the number of neighbors needed for over-sampling around different minority regions; parameter-free.

Ensemble

  • BalanceCascade: ensures that misclassified samples can be selected again for the next subset; the classifier plays the role of a “smart” replacement method.

Summary

  • Ratio parameter
  • Over/Under Sampling
  • Ensemble

Thanks!

Any questions?

zehori.ido@gmail.com

idozehori.com
