Machine learning: supervised classification
Hugo Jair Escalante,
Manuel Montes and Luis Villaseñor
Machine Learning
Name any criterion for grouping these animals
Examples: biomedical domain
Electrocardiograms
Name any criterion for grouping these data
Manual classification
Hand-coded rule-based systems
Experts
Labeled documents
Knowledge engineers
Rule 1, if … then … else
Rule N, if … then …
Classifier
New data
Data’s category
Example: filtering spam email
Classifier 1
Classifier 2
Taken from Hastie et al., The Elements of Statistical Learning, 2009, Springer.
Supervised classification
Recipe
Assumption: a large enough training set of labeled data is available
Later we will study methods that allow us to relax this assumption (semi-supervised and unsupervised learning)
Evaluation of supervised classification
(Figure: the data as a matrix — M documents m1, m2, m3, … as rows, |Attr| attributes as columns.)
Evaluation – general ideas
(Ignore overfitting for now!)
PISIS research group, UANL, Monterrey, Mexico, 27 November 2009
(Figures: example datasets plotted on features x1 and x2.)
n-fold cross validation
(Figure: 5-fold CV — the training data is split into 5 folds; each fold is held out once as the test set while the remaining folds are used for training, yielding Error fold 1, …, Error fold 5; their average is the CV estimate.)
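The procedure above can be sketched in plain Python. This is a minimal illustration; the fold-assignment scheme, the `train_and_test` callback, and the majority-class baseline are all toy assumptions, not part of the slides:

```python
import random

def cross_validation_error(X, y, train_and_test, k=5, seed=0):
    """Estimate the error of a learning procedure with k-fold CV."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds
    errors = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(train_and_test(
            [X[j] for j in train], [y[j] for j in train],
            [X[j] for j in test], [y[j] for j in test]))
    return sum(errors) / k  # the CV estimate

# Toy learner: always predict the majority class of the training labels.
def majority_baseline(Xtr, ytr, Xte, yte):
    pred = max(set(ytr), key=ytr.count)
    return sum(pred != t for t in yte) / len(yte)
```

Any learning algorithm can be plugged in through the `train_and_test` callback, which must return the error on the held-out fold.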
Performance metrics
What happens if there are more than two classes?
|                | Label YES | Label NO |
| Classifier YES | a         | b        |
| Classifier NO  | c         | d        |
Micro and macro averages
Does the choice of averaging strategy matter?
What happens if we do very badly on the minority class?
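The difference between the two averaging strategies can be seen in a short sketch. The per-class counts below are invented for illustration (a classifier that always predicts the majority class):

```python
def f1(tp, fp, fn):
    """F1 from one-vs-rest counts; 0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: {class: (tp, fp, fn)}, one-vs-rest per class."""
    # Macro: average the per-class F1 scores (every class weighs the same)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts, then compute a single F1 (frequent classes dominate)
    TP = sum(c[0] for c in counts.values())
    FP = sum(c[1] for c in counts.values())
    FN = sum(c[2] for c in counts.values())
    return f1(TP, FP, FN), macro

# 90 majority examples all predicted correctly, 10 minority examples all missed:
micro, macro = micro_macro_f1({"major": (90, 10, 0), "minor": (0, 0, 10)})
```

Here micro-F1 is 0.90 while macro-F1 is only about 0.47: the micro average hides that the minority class is never recognized.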
Comparison of different classifiers
ROC Curve
(Figure: ROC curves, both axes from 0 to 100% — y-axis: positive class success rate (hit rate, sensitivity); x-axis: 1 − negative class success rate (false alarm rate, 1 − specificity). Shown: the actual ROC, the random ROC, and the ideal ROC curve.)
For a given threshold on f(x), you get a point on the ROC curve.
ROC Curve
(Figure: the same ROC plot annotated with areas under the curve — the ideal ROC curve has AUC = 1, the random ROC has AUC = 0.5, and in general 0 ≤ AUC ≤ 1.)
For a given threshold on f(x), you get a point on the ROC curve.
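Both ideas fit in a few lines of Python. This is a minimal sketch (the score lists are made up): one function computes the ROC point for a given threshold, the other computes AUC via its rank interpretation, the probability that a random positive scores above a random negative:

```python
def roc_point(scores_pos, scores_neg, threshold):
    """One ROC point for a threshold on the classifier score f(x)."""
    tpr = sum(s >= threshold for s in scores_pos) / len(scores_pos)  # hit rate
    fpr = sum(s >= threshold for s in scores_neg) / len(scores_neg)  # false alarms
    return fpr, tpr

def auc(scores_pos, scores_neg):
    """AUC = P(random positive scores higher than random negative),
    ties counted as 1/2 (the Wilcoxon-Mann-Whitney statistic)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))
```

For perfectly separated scores `auc([2, 3], [0, 1])` returns 1.0; for scores that are all tied it returns 0.5, matching the random ROC.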
Classification algorithms
Machine learning approach (1)
Ronen Feldman and James Sanger, The Text Mining Handbook
Machine learning approach (2)
Do they have to be experts?
Labeled documents (training set)
Rules, trees, probabilities, prototypes, etc.
Classifier
New document
Document’s category
Inductive process
Experts
How large does it have to be?
Which algorithm?
How to represent documents?
Machine learning
Trained machine
Query
Learning machine
Training data
Answers
Isabelle Guyon. A Practical Guide to Model Selection. In Jeremie Marie, editor, Machine Learning Summer School 2008, Springer Texts in Statistics, 2011. (slide from I. Guyon)
Learning = representation + evaluation + optimization
Machine learning approach to TC
(Figure: documents X1 and X2 are mapped to categories C1, C2, C3, C4 by learned classification functions.)
How to learn these functions?
Machine learning: classification
Linear model
Machine learning: classification
K=1
Conventions
X = {xij}: an n × m data matrix whose rows xi are the examples; y = {yj}: the corresponding labels.
Slide taken from I. Guyon. Feature and Model Selection. Machine Learning Summer School, Ile de Re, France, 2008.
What is a learning algorithm?
Classification algorithms
Naïve Bayes
Sec.13.2
A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes. Multinomial Naive Bayes for Text Categorization Revisited. Australian Conference on Artificial Intelligence 2004: 488-499
Naïve Bayes
Sec.13.2
(Figure: the Naïve Bayes graphical model — class node C with term nodes t1, t2, …, t|V| as conditionally independent children.)
Bayes’ Rule for text classification
Sec.13.2
Bayes’ Rule for classification
Sec.13.2
Prior probability of class cj
Probability of occurrence of attr ti in class cj
Smoothing to avoid zero-values
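The pieces above — class priors, per-class term probabilities, and add-one (Laplace) smoothing — can be put together in a short multinomial Naïve Bayes sketch. The toy spam/ham documents are invented for illustration:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists. Returns log-priors and smoothed log-likelihoods."""
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    # Prior probability of class cj: fraction of training documents in cj
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    cond = {}
    for c in classes:
        counts = Counter(t for d, l in zip(docs, labels) if l == c for t in d)
        total = sum(counts.values())
        # Laplace (add-one) smoothing avoids zero probabilities for unseen terms
        cond[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                   for t in vocab}
    return prior, cond, vocab

def classify_nb(doc, prior, cond, vocab):
    """Pick the class maximizing log P(c) + sum_i log P(ti | c)."""
    return max(prior, key=lambda c: prior[c] +
               sum(cond[c][t] for t in doc if t in vocab))
```

Working in log-space turns the product of probabilities into a sum and avoids numerical underflow on long documents.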
Naïve Bayes classifier
Comments on NB classifier
Naïve Bayes revisited
Sec.13.2
Prior probability of class cj
Probability of occurrence of word ti in class cj
What is the assumed probability distribution?
KNN: K-nearest neighbors classifier
(Figure sequence: positive and negative examples plotted in feature space; the query point is assigned the majority class among its K nearest neighbors as K grows.)
KNN – the algorithm
Common similarity measures
wki indicates the weight of word k in document i
Selection of K
Should K be even or odd?
Decision surface
(Figures: KNN decision surfaces for K = 1, 2, 5, and 10.)
Selection of K
How to select a good value for K?
The weighted-sum voting scheme
Other alternatives for computing the weights?
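A minimal KNN with weighted-sum voting, using cosine similarity as the weight. The vectors and labels in the test are toy assumptions; other weighting schemes (e.g. inverse distance) would slot into the same loop:

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(X, y, query, k=3):
    """Weighted-sum voting: each of the k nearest neighbors
    votes for its label with a weight equal to its similarity."""
    neighbors = sorted(zip(X, y), key=lambda p: cosine(query, p[0]),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        votes[label] += cosine(query, x)  # similarity as the weight
    return max(votes, key=votes.get)
```

Note there is no training phase: all the work happens at query time, which is why KNN is called a lazy learner.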
KNN - comments
Centroid-based classification
How to compute the prototypes?
H. Han, G. Karypis. Centroid-based Document Classification: Analysis and Experimental Results. Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 424—431, 2000.
Centroid-based classification
T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Springer, 2009.
Calculating the centroids
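A simple version computes one prototype per class as the mean of its training vectors and assigns a query to the most similar centroid. This is a sketch under the assumption that cosine similarity is the comparison measure (the toy vectors are invented):

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def make_centroid_classifier(X, y):
    """One prototype per class: the mean (centroid) of its vectors."""
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    centroids = {c: [sum(col) / len(vs) for col in zip(*vs)]
                 for c, vs in groups.items()}
    # Assign a query to the class whose centroid is most similar
    return lambda q: max(centroids, key=lambda c: cosine(q, centroids[c]))
```

Unlike KNN, only one comparison per class is needed at query time, which is why the method scales well with training-set size.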
Comments on Centroid-Based Classification
Support vector machines (SVM)
Linear models
Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork, In Smola et al Eds. Advances in Large Margin Classifiers. Pages 147--169, MIT Press, 2000.
Linear models
(Figure: Cancer vs. No Cancer examples plotted on features x1 and x2; to which class should the unseen points marked "?" be assigned?)
Linear models
Linear support vector machine
Linear models
Non-linear support vector machine
Support vector machines (SVM)
Support vectors
Maximize the margin
Support vector machines (SVM)
Subject to:
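The optimization sketched above — maximize the margin subject to classifying every training example correctly — is, in its standard hard-margin primal form (a reconstruction of the usual textbook formulation, not the slide's own equation):

```latex
\min_{\mathbf{w},\,b} \; \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
```

Since the margin equals $2/\|\mathbf{w}\|$, maximizing it is equivalent to minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2$, which turns the problem into a convex quadratic program.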
Non-linear SVMs (on the inputs)
(Figure: a one-dimensional dataset on axis x that is not linearly separable becomes separable after mapping each input to x².)
SVM – discussion
Decision trees
Decision trees
At each level, select the feature that most reduces the entropy
(Figure: starting from all of the data, features f1 and f2 are chosen at successive splits.)
Random Forest, L. Breiman, Machine Learning (45):1, 5—32, 2001
Decision trees
Outlook | Temperature | Humidity | Windy | Play (positive) / Don't Play (negative) |
sunny | 85 | 85 | false | Don't Play |
sunny | 80 | 90 | true | Don't Play |
overcast | 83 | 78 | false | Play |
rain | 70 | 96 | false | Play |
rain | 68 | 80 | false | Play |
rain | 65 | 70 | true | Don't Play |
overcast | 64 | 65 | true | Play |
sunny | 72 | 95 | false | Don't Play |
sunny | 69 | 70 | false | Play |
rain | 75 | 80 | false | Play |
sunny | 75 | 70 | true | Play |
overcast | 72 | 90 | true | Play |
overcast | 81 | 75 | false | Play |
rain | 71 | 80 | true | Don't Play |
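The split criterion can be checked on this table directly. A minimal sketch of entropy and information gain, applied to the Outlook column above (only that attribute is encoded, for brevity):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on a categorical attribute."""
    n = len(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr_index], []).append(lab)
    remainder = sum(len(ls) / n * entropy(ls) for ls in split.values())
    return entropy(labels) - remainder

# The 14 examples from the table (Outlook attribute and Play label only)
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
rows = [[o] for o in outlook]
```

With 9 Play and 5 Don't Play examples the base entropy is about 0.94 bits, and splitting on Outlook gains about 0.25 bits — the overcast branch becomes pure (all Play).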
Decision trees
Ensembles
Voted classification (ensembles)
k experts may be better than one if their individual judgments are appropriately combined
Voted classification (ensembles)
Robi Polikar. Ensemble Learning. Scholarpedia, 4(1):2776.
When do you think an ensemble can outperform any of the individual models?
Voted classification (ensembles)
Bagging
Boosting
AdaBoost algorithm
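A compact sketch of the AdaBoost idea with decision stumps as the weak learners. The stump search and the toy data are illustrative assumptions; the re-weighting and the weak-learner weight alpha follow the standard algorithm:

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak learner: threshold a single feature, output +1 or -1."""
    return polarity if x[feature] >= threshold else -polarity

def adaboost(X, y, rounds=10):
    """y in {-1, +1}. Returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # uniform example weights to start
    ensemble = []
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error
        best = None
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (+1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_predict(xi, f, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, (f, thr, pol))
        err, stump = best
        if err >= 0.5:           # no better than random: stop
            break
        err = max(err, 1e-10)    # avoid division by zero for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Re-weight: misclassified examples get more weight next round
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, *stump))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def ensemble_predict(ensemble, x):
    """Weighted vote of all weak learners."""
    total = sum(a * stump_predict(x, *s) for a, s in ensemble)
    return 1 if total >= 0 else -1
```

The final classifier is a weighted vote of the weak learners, where each weight alpha grows as the weak learner's weighted error shrinks.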
Decision surface: decision tree
C4.5
Decision surface: random forest
Random forest
Decision surface: Logit boost
Logitboost-trees
Classification algorithms
Neural networks
Machine Learning course (Google Developers)
Convolutional neural networks
Example Colab notebook
Want to learn more?
Suggested readings on text classification