Learning from imbalanced data
Ido Zehori
Big Data Analytics #4
Hello!
I am Ido Zehori
Data scientist @
You can find me at:
@IZehori
idozehori.com
@IdoZehori
Agenda
1.
What is Imbalanced Data?
Where can we find Imbalanced Data?
The stack
2.
The accuracy paradox
Let's generate some data...
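The slides do not include the generating code; a minimal sketch with scikit-learn's `make_classification` (the 95/5 class ratio and sample count here are illustrative assumptions, not the talk's exact setup):

```python
import numpy as np
from sklearn.datasets import make_classification

# ~95% majority class, ~5% minority class (illustrative ratio)
X, y = make_classification(
    n_samples=2000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    weights=[0.95, 0.05],
    random_state=42,
)

counts = np.bincount(y)
print(counts)  # heavily skewed toward class 0
```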
Decision boundary
A decision boundary (or decision surface) is a hypersurface that partitions the underlying vector space into two sets, one for each class.
Accuracy: 94%
What percent of our predictions were correct?
We can do even better...
Accuracy: 95%
The accuracy paradox
We are predicting if a user is going to click on an ad banner
A or B?
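The paradox can be demonstrated in a few lines: a "model" that never predicts a click scores near-perfect accuracy. The 1% click rate and `DummyClassifier` baseline are illustrative assumptions, not the talk's actual data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Illustrative click data: ~1% of users click the banner
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = rng.normal(size=(10_000, 3))  # features are irrelevant to this baseline

# A "model" that always predicts the majority class (no click)
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, clf.predict(X))
print(f"Accuracy: {acc:.0%}")  # ~99%, yet it never predicts a single click
```

High accuracy here tells us nothing about the model's ability to find clicks, which is the whole point of the task.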
3.
Alternative performance metrics
The confusion matrix
Our model:
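A minimal sketch of reading the four cells (TN, FP, FN, TP) out of scikit-learn's `confusion_matrix`; the labels and predictions below are made-up toy values:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = negative, 1 = positive (hypothetical predictions)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2
```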
ROC curve
AUC
(Area under the curve)
The probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one.
What is our naive model's AUC score?
AUC score: 0.562
Geometric mean score: 0.356
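A sketch of how both metrics are computed, with made-up scores; the geometric mean is derived directly from the confusion matrix as sqrt(sensitivity x specificity) rather than via a library helper:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels and model scores
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.6, 0.4, 0.7, 0.8, 0.35, 0.2])

# AUC: rank-based, uses the raw scores
auc = roc_auc_score(y_true, y_score)

# Geometric mean of sensitivity and specificity at a 0.5 threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_score > 0.5).ravel()
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(auc, gmean)
```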
4.
Ratio parameter
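The deck doesn't show which classifier or parameter was used; one common form of such a ratio/weight parameter is scikit-learn's `class_weight`, sketched here on illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative 95/5 imbalanced dataset
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# class_weight='balanced' weights errors inversely to class frequency,
# so minority-class mistakes cost roughly 19x more here
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
plain = LogisticRegression(max_iter=1000).fit(X, y)

# The weighted model predicts the minority class far more often
print(weighted.predict(X).sum(), plain.predict(X).sum())
```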
AUC score: 0.814
Geometric mean score: 0.812
5.
Resampling
Resampling
Over
Under
Over-sampling: AUC score 0.819, geometric mean score 0.810
Under-sampling: AUC score 0.816, geometric mean score 0.815
Cluster Centroids
Replace a cluster of majority-class samples with its cluster centroid from a K-Means algorithm.
Advantage:
AUC score: 0.806
Geometric mean score: 0.8
One-Sided Selection
The majority samples are roughly divided into 4 groups:
- Noisy samples that overlap the minority class
- Borderline samples close to the decision boundary
- Redundant samples whose role other samples can take over
- Safe samples that are worth keeping
AUC score: 0.613
Geometric mean score: 0.482
SMOTE: Synthetic Minority Over-sampling Technique
Generates new minority-class samples along the line segments joining a minority sample S and each of its nearest minority-class neighbors.
AUC score: 0.814
Geometric mean score: 0.814
Tomek links
A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; removing them cleans up the class boundary.
SMOTE + Tomek links
AUC score: 0.82
Geometric mean score: 0.82
6.
Ensemble methods
Easy Ensemble
Create an ensemble of balanced training sets by iteratively applying random under-sampling to the majority class, training one learner per set.
Some more techniques
Resampling:
Ensemble
Summary
Thanks!
Any questions?
zehori.ido@gmail.com
idozehori.com