1 of 85

Machine learning: supervised classification

Hugo Jair Escalante,

Manuel Montes and Luis Villaseñor

2 of 85

Machine Learning

  • Intuitively, the task is to distinguish different categories of objects.

  • Classify objects into categories
    • Determine the elements that describe them
    • Determine the elements that set them apart

3 of 85

Name any criterion for grouping these animals

4 of 85

Examples: biomedical domain

Electrocardiograms

Name any criterion for grouping these data

5 of 85

Manual classification

  • Very accurate when the job is done by experts
    • Classify news into general categories
    • Identify a disease

  • But difficult and expensive to scale
    • Classifying thousands of items is very different from classifying millions
    • Every possible exception must be considered for each case

6 of 85

Hand-coded rule-based systems

  • Advantage 🡪 easy-to-understand explanatory model

  • Disadvantage 🡪 knowledge acquisition bottleneck
    • too time consuming, too difficult, inconsistency issues

Experts

Labeled documents

Knowledge engineers

Rule 1, if … then … else

Rule N, if … then …

Classifier

New data

Data’s category

7 of 85

Example: filtering spam email

  • Rule-based classifier

[Figure: two example rule-based spam classifiers]

Taken from Hastie et al. The Elements of Statistical Learning, 2007, Springer.

8 of 85

Supervised classification

  • Machine learning approach:

Recipe

    • Gather labeled data
    • Construction of a classifier
      1. Instance representation
      2. Preprocessing
      3. Dimensionality reduction
      4. Classification methods
    • Evaluation of a TC (text classification) method

Assumption: a large enough training set of labeled data is available.

Later we will study methods that allow us to relax this assumption (semi-supervised and unsupervised learning).

9 of 85

Evaluation of supervised classification

  • What to evaluate?
  • How to carry out this evaluation?
    • Which elements (information) are required?
  • How to know which is the best classifier for a given task?
    • Which things are important to perform a fair comparison?

10 of 85

Evaluation of supervised classification

  • The available data is divided into three subsets:
    • Training (m1)
      • used for the construction (learning) of the classifier
    • Validation (m2)
      • Optimization of parameters of the TC method
    • Test (m3)
      • Used for the evaluation of the classifier

Attributes (|Attr|)

Documents (M)

m1

m2

m3

11 of 85

Evaluation – general ideas

  • Performance of classifiers is evaluated experimentally

  • Requires a set of objects labeled with categories.
    • Divided into two parts: training and test sets
    • Usually, the test set is the smaller of the two

  • A method to smooth out variations in the corpus is n-fold cross-validation.
    • The whole document collection is divided into n equal parts, and the training-and-testing process is run n times, each time using a different part of the collection as the test set. The results of the n folds are then averaged.
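The procedure just described can be sketched in Python (a minimal illustration; `train_and_test` is a hypothetical callback that trains on one split and returns the test error):

```python
from statistics import mean

def n_fold_cross_validation(data, labels, train_and_test, n=5):
    """Split (data, labels) into n equal parts; each part serves once as the
    test set while the rest form the training set; return the mean error."""
    folds = [list(range(i, len(data), n)) for i in range(n)]
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        train = ([data[i] for i in train_idx], [labels[i] for i in train_idx])
        test = ([data[i] for i in test_idx], [labels[i] for i in test_idx])
        errors.append(train_and_test(train, test))
    return mean(errors)
```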

12 of 85

(Do not mind overfitting!)

  • Tradeoff between robustness and fit to data

PISIS research group, UANL, Monterrey, Mexico, 27 November 2009

[Figure: two classifiers over features x1 and x2 — one simple and robust, one that fits the training data too closely]

13 of 85

n-fold cross validation

[Diagram: 5-fold CV — the training data is split into five parts; each part serves once as the test set while the other four are used for training; the five fold errors are averaged into the CV estimate]

14 of 85

Performance metrics

  • Considering a binary problem

  • Recall for a category is the percentage of correctly classified documents among all documents that belong to that category; precision is the percentage of correctly classified documents among all documents the classifier assigned to that category.

What happens if there are more than two classes?

                    Label YES    Label NO
    Classifier YES      a            b
    Classifier NO       c            d
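From the a, b, c, d contingency counts above (a = correctly assigned, b = wrongly assigned, c = wrongly rejected), precision and recall can be computed as:

```python
def precision_recall(a, b, c, d):
    """a: classifier YES & label YES; b: classifier YES & label NO;
    c: classifier NO & label YES; d: classifier NO & label NO."""
    precision = a / (a + b) if (a + b) else 0.0  # correct among assigned
    recall = a / (a + c) if (a + c) else 0.0     # correct among belonging
    return precision, recall
```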

15 of 85

Micro and macro averages

  • Macroaveraging: Compute performance for each category, then average.
    • Gives equal weights to all categories

  • Microaveraging: Compute totals of a, b, c and d for all categories, and then compute performance measures.
    • Gives equal weights to all documents

Does the choice of averaging strategy matter?

What happens if we classify the minority class very poorly?
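A small sketch of the two strategies, illustrating the questions above (the input is a hypothetical list of per-category (true positive, false positive) pairs):

```python
def macro_micro_precision(per_class_counts):
    """per_class_counts: list of (a, b) pairs per category, where
    a = true positives and b = false positives for that category."""
    # Macroaverage: mean of per-category precisions (equal weight per class)
    macro = sum(a / (a + b) for a, b in per_class_counts) / len(per_class_counts)
    # Microaverage: pool the counts, then one precision (equal weight per doc)
    A = sum(a for a, _ in per_class_counts)
    B = sum(b for _, b in per_class_counts)
    micro = A / (A + B)
    return macro, micro
```

With a large class classified well and a minority class classified badly, the macroaverage drops sharply while the microaverage barely moves.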

16 of 85

Comparison of different classifiers

  • Direct comparison
    • Compared by testing them on the same collection of documents and with the same background conditions.
    • This is the more reliable method

  • Indirect comparison
    • Two classifiers may be compared when they have been tested on different collections and with possibly different background conditions if both were compared with a common baseline.

17 of 85

ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

[Plot: the actual ROC curve lies between the random ROC (the diagonal) and the ideal ROC curve; y-axis: positive class success rate (hit rate, sensitivity); x-axis: 1 − negative class success rate (false alarm rate, 1 − specificity)]

18 of 85

ROC Curve

0 ≤ AUC ≤ 1

For a given threshold on f(x), you get a point on the ROC curve.

[Plot: the ideal ROC curve has AUC = 1, the random ROC has AUC = 0.5, and the actual ROC lies in between; y-axis: positive class success rate (hit rate, sensitivity); x-axis: 1 − negative class success rate (false alarm rate, 1 − specificity)]
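The AUC can be computed directly from classifier scores; a minimal sketch using its probabilistic interpretation (AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, ties counting one half):

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from raw scores f(x); labels in {0, 1}."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive-negative pairs ranked correctly (ties count 0.5)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```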

19 of 85

Classification algorithms

20 of 85

Machine learning approach (1)

  • A general inductive process builds a classifier by learning from a set of preclassified examples.
    • Determines the characteristics associated with each one of the topics.

Ronen Feldman and James Sanger, The Text Mining Handbook

21 of 85

Machine learning approach (2)

Do they have to be experts?

Labeled documents (training set)

Rules, trees, probabilities, prototypes, etc.

Classifier

New document

Document’s category

Inductive process

Experts

How large does it have to be?

Which algorithm?

How to represent documents?

22 of 85

Machine learning

Trained machine

Query

Learning machine

Training

data

Answers

Isabelle Guyon. A Practical Guide to Model Selection. In Jeremie Marie, editor, Machine Learning Summer School 2008, Springer Texts in Statistics, 2011. (slide from I. Guyon)

Learning = representation + evaluation + optimization

23 of 85

Machine learning approach to TC


[Figure: objects described by features X1 and X2 are mapped to categories C1, C2, C3, C4]

How to learn these functions?

24 of 85

Machine learning: classification

Linear model

25 of 85

Machine learning: classification

K=1

26 of 85

Conventions

X = {xij} is the m × n data matrix: m examples (rows) by n features (columns); xi denotes the i-th example and y = {yj} the vector of labels.

Slide taken from I. Guyon. Feature and Model Selection. Machine Learning Summer School, Ile de Re, France, 2008.

27 of 85

What is a learning algorithm?

  • A function that maps a training set to a classifier.

  • Given a training set {(xi, yi)}, i = 1…m, it returns a function f: X → Y that predicts the label of new objects.

28 of 85

Classification algorithms

  • Popular classification algorithms are:
    • Naïve Bayes
      • Probabilistic approach
    • K-Nearest Neighbors
      • Example-based approach
    • Centroid-based classification
      • Prototype-based approach
    • Support Vector Machines
      • Kernel-based approach
    • Decision trees
      • Rule-based approach
    • Boosting, bagging and ensembles in general
      • Voting systems
    • Neural networks
      • Perceptron -> deep learning

29 of 85

Naïve Bayes

30 of 85

Naïve Bayes

  • It is the simplest probabilistic classifier
    • Based on the application of the Bayes theorem

  • Builds a generative model that approximates how data is produced
    • Uses prior probability of each category given no information about an item
    • Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Sec.13.2

A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes. Multinomial Naive Bayes for Text Categorization Revisited. Australian Conference on Artificial Intelligence 2004: 488-499

31 of 85

Naïve Bayes

  • Bayes theorem:

    P(cj | d) = P(d | cj) P(cj) / P(d)

  • Why?
    • We know that:

      P(cj, d) = P(cj | d) P(d)   and   P(cj, d) = P(d | cj) P(cj)

    • Then

      P(cj | d) P(d) = P(d | cj) P(cj)

    • Then

      P(cj | d) = P(d | cj) P(cj) / P(d)

32 of 85

Naïve Bayes

  • For an object d described by terms (t1, …, t|V|) and a class cj:

    P(cj | d) ∝ P(cj) Πi P(ti | cj)

Sec.13.2

  • Assuming terms are independent of each other given the class (naïve assumption)
  • Assuming each object is equally probable

[Graphical model: class node C with arrows to term nodes t1, t2, …, t|V| — each term depends only on the class]

33 of 85

Bayes’ Rule for text classification

  • For a document d and a class cj

Sec.13.2

34 of 85

Bayes’ Rule for classification

  • For an object d and a class cj:

    P(cj | d) ∝ P(cj) Πi P(ti | cj)

  • Estimation of probabilities:

    P(cj) = Nj / N    (prior probability of class cj)

    P(ti | cj) = (1 + Nij) / (|V| + Σk Nkj)    (probability of occurrence of attribute ti in class cj, with add-one smoothing to avoid zero values)

Sec.13.2
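These estimates can be sketched as follows (a minimal multinomial-style illustration; documents are assumed to be given as token lists):

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: list of class names.
    Returns priors P(c) and add-one-smoothed term probabilities P(t|c)."""
    classes = Counter(labels)
    priors = {c: n / len(docs) for c, n in classes.items()}
    counts = defaultdict(Counter)  # per-class term counts
    vocab = set()
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
        vocab.update(doc)
    cond = {}
    for c in classes:
        total = sum(counts[c].values())
        # Laplace smoothing: every term gets a pseudo-count of one
        cond[c] = {t: (counts[c][t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, cond
```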

35 of 85

Naïve Bayes classifier

  • Assignment of the class:

  • Assignment using underflow prevention:
    • Multiplying lots of probabilities can result in floating-point underflow
    • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
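The log-space assignment described above, sketched (`priors` maps each class to P(c) and `cond` maps each class to per-term P(t|c) — a hypothetical representation for the sketch):

```python
import math

def classify_nb(doc, priors, cond):
    """Pick argmax_c of log P(c) + sum_t log P(t|c); summing logs avoids
    the floating-point underflow caused by multiplying many probabilities."""
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c]) + sum(math.log(cond[c][t])
                                          for t in doc if t in cond[c])
        if score > best_score:
            best, best_score = c, score
    return best
```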

36 of 85

Comments on NB classifier

  • Very simple classifier which works very well on numerical data

  • Very easy to implement and computationally cheap when compared to other classification algorithms.

  • One of its major limitations is that it performs very poorly when features are highly correlated.

37 of 85

Naïve Bayes revisited

  • For a document d and a class cj

  • Estimation of probabilities

Sec.13.2

Prior probability of class cj

Probability of occurrence of word ti in class cj

What is the assumed probability distribution?

38 of 85

KNN: K-nearest neighbors classifier

39 of 85

KNN: K-nearest neighbors classifier

  • They do not build explicit declarative representations of categories.
    • Methods of this kind are called lazy learners.

  • “Training” for such classifiers consists of simply storing the representations of the training objects together with their category labels.

  • To decide whether an object d belongs to the category c, kNN checks whether the k training objects most similar to d belong to c.
    • Key element: a definition of “similarity” between objects

40 of 85

KNN: K-nearest neighbors classifier

[Figure, animated over several slides: positive and negative examples around a query point; the k nearest neighbors determine its class]


44 of 85

KNN – the algorithm

  • Given a new object d:
    1. Find the k most similar objects in the training set.
      • Common similarity measures are the Euclidean distance, the Manhattan distance, and the cosine similarity.
    2. Assign a class to d by considering the classes of its k nearest neighbors.
      • Majority voting scheme
      • Weighted-sum voting scheme
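The two steps above can be sketched with Euclidean distance and majority voting:

```python
import math
from collections import Counter

def knn_classify(x, train_points, train_labels, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: math.dist(pl[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```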

45 of 85

Common similarity measures

  • Dice coefficient

    Dice(di, dj) = 2 Σk wki wkj / (Σk wki² + Σk wkj²)

  • Cosine measure

    cos(di, dj) = Σk wki wkj / sqrt(Σk wki² · Σk wkj²)

where wki indicates the weight of word k in document i
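Both measures, sketched for dense weight vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between weight vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def dice(u, v):
    """Dice coefficient: 2 * (u . v) / (|u|^2 + |v|^2)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 2 * dot / (sum(a * a for a in u) + sum(b * b for b in v))
```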

46 of 85

Selection of K

Should k be even or odd?

47 of 85

Decision surface

K=1

48 of 85

Decision surface

K=2

49 of 85

Decision surface

K=5

50 of 85

Decision surface

K=10

51 of 85

Selection of K

How to select a good value for K?

52 of 85

The weighted-sum voting scheme

Each of the k nearest neighbors votes for its own class, with a weight proportional to its similarity to the new object d.

Other alternatives for computing the weights?

53 of 85

KNN - comments

  • It is robust in the sense of not requiring the categories to be linearly separable.

  • The major drawback is the computational effort during classification.

  • Another limitation is that its performance is primarily determined by the choice of k and of the distance metric.

54 of 85

Centroid-based classification

55 of 85

Centroid-based classification

  • This method has two main phases:
    • Training phase: construction of one single representative instance, called a prototype, for each class.

    • Test phase: each unlabeled object is compared against all prototypes and is assigned to the class having the greatest similarity score.

  • This differs from k-NN, which represents every object in the training set individually.

How to compute the prototypes?

H. Han, G. Karypis. Centroid-based Document Classification: Analysis and Experimental Results. Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 424—431, 2000.

56 of 85

Centroid-based classification

T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Springer, 2009.

57 of 85

Calculating the centroids

  • Centroid as average:  c = (1/|Dc|) Σ_{d∈Dc} d

  • Centroid as sum:  c = Σ_{d∈Dc} d

  • Centroid as normalized sum:  c = Σ_{d∈Dc} d / ‖Σ_{d∈Dc} d‖

  • Centroid computation using the Rocchio formula: a weighted difference of the average of the class's documents and the average of the documents outside the class
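A minimal sketch of the average (and normalized) centroid plus the test-phase assignment; similarity is taken here as smallest Euclidean distance, one of several possible choices:

```python
import math

def centroid(vectors, normalized=False):
    """Average of the class's vectors; optionally normalized to unit length
    (equivalent, up to a constant factor, to the normalized-sum variant)."""
    dim = len(vectors[0])
    c = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    if normalized:
        norm = math.sqrt(sum(x * x for x in c))
        c = [x / norm for x in c]
    return c

def centroid_classify(x, centroids):
    """Assign x to the class whose prototype is closest."""
    return min(centroids, key=lambda c: math.dist(centroids[c], x))
```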

58 of 85

Comments on Centroid-Based Classification

  • Computationally simple and fast model
    • Short training and testing time

  • Amenable to changes in the training set

  • Can handle imbalanced document sets

  • Disadvantages:
    • Inadequate for non-linear classification problems
    • Problem of inductive bias or model misfit
      • Classifiers are tuned to the contingent characteristics of the training data rather than the constitutive characteristics of the categories

59 of 85

Support vector machines (SVM)

60 of 85

Linear models

  • Idea: learn a linear function (in the parameters) that allows us to separate the data
    • f(x) = w·x + b = Σ_{j=1..n} wj xj + b   (linear discriminant)
    • f(x) = w·Φ(x) + b = Σj wj φj(x) + b   (the perceptron)
    • f(x) = Σ_{i=1..m} αi k(xi, x) + b   (kernel-based methods)

Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork, In Smola et al Eds. Advances in Large Margin Classifiers. Pages 147--169, MIT Press, 2000.

61 of 85

Linear models

  • Classification of DNA micro-arrays

[Figure: samples plotted over expression features x1 and x2, labeled Cancer / No Cancer; a new point "?" must be classified]

62 of 85

Linear models

Linear support vector machine

63 of 85

Linear models

Non-linear support vector machine

64 of 85

Support vector machines (SVM)

  • A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent the positive instances from the negative ones.
    • SVMs select the hyperplane that maximizes the margin around it.
    • Hyperplanes are fully determined by a small subset of the training instances, called the support vectors.

[Figure: the support vectors lie on the margin, which the SVM maximizes]

65 of 85

Support vector machines (SVM)

  • When data are linearly separable we solve:

    min_{w,b} (1/2) ‖w‖²

Subject to:

    yi (w·xi + b) ≥ 1,  i = 1, …, m

66 of 85

Non-linear SVMs (on the inputs)

  • What about classes whose training instances are not linearly separable?
    • The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
      • A kernel function is some function that corresponds to an inner product in some expanded feature space.

[Figure: 1-D data on the x axis, not linearly separable; the mapping x ↦ (x, x²) makes it separable]
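The classic 1-D illustration: points that no single threshold on the line can separate become separable after the mapping x ↦ (x, x²). The data and threshold below are assumptions chosen for the sketch:

```python
def phi(x):
    """Map a 1-D input to the 2-D feature space (x, x^2)."""
    return (x, x * x)

def separated(xs_pos, xs_neg, threshold=0.6):
    """After phi, a horizontal threshold on the second coordinate
    separates the two (interleaved) groups of 1-D points."""
    return (all(phi(x)[1] > threshold for x in xs_pos) and
            all(phi(x)[1] < threshold for x in xs_neg))
```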

67 of 85

SVM – discussion

  • The support vector machine (SVM) algorithm is very fast and effective for classification problems.

    • Flexibility in choosing a similarity function
      • By means of a kernel function

    • Sparseness of solution when dealing with large data sets
      • Only support vectors are used to specify the separating hyperplane

    • Ability to handle large feature spaces
      • Complexity does not depend on the dimensionality of the feature space

68 of 85

Decision trees

69 of 85

Decision trees

Select at each level the feature that most reduces the entropy

All of the data

[Figure: candidate splits on features f1 and f2; the data is first split on the chosen feature, and each branch then chooses its own best feature]

Random Forest, L. Breiman, Machine Learning (45):1, 5—32, 2001

70 of 85

Decision trees


Outlook     Temperature  Humidity  Windy  Play (positive) / Don't Play (negative)
sunny       85           85        false  Don't Play
sunny       80           90        true   Don't Play
overcast    83           78        false  Play
rain        70           96        false  Play
rain        68           80        false  Play
rain        65           70        true   Don't Play
overcast    64           65        true   Play
sunny       72           95        false  Don't Play
sunny       69           70        false  Play
rain        75           80        false  Play
sunny       75           70        true   Play
overcast    72           90        true   Play
overcast    81           75        false  Play
rain        71           80        true   Don't Play

71 of 85

Decision trees

  • Rule 1 suggests that if "outlook = sunny" and "humidity > 75" then "Don't Play".
  • Rule 2 suggests that if "outlook = overcast" then "Play".
  • Rule 3 suggests that if "outlook = rainy" and "windy = true" then "Don't Play".
  • Rule 4 suggests that if "outlook = rainy" and "windy = false" then "Play".
  • Otherwise, "Play" is the default class.
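The four rules can be written directly as a function (thresholds as stated above; `play_tennis` is a name chosen for the sketch):

```python
def play_tennis(outlook, humidity, windy):
    """The rules read off the decision tree for the weather data."""
    if outlook == "sunny":
        return "Don't Play" if humidity > 75 else "Play"   # Rule 1
    if outlook == "overcast":
        return "Play"                                      # Rule 2
    if outlook == "rain":
        return "Don't Play" if windy else "Play"           # Rules 3 and 4
    return "Play"                                          # default class
```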


72 of 85

Ensembles

73 of 85

Voted classification (ensembles)

k experts may be better than one if their individual judgments are appropriately combined

  • Two main issues in ensemble construction:
    • Choice of the k classifiers
    • Choice of a combination function

  • Two main approaches:
    • Bagging 🡪 parallel approach
    • Boosting 🡪 sequential approach

74 of 85

Voted classification (ensembles)

Robi Polikar. Ensemble Learning. Scholarpedia, 4(1):2776.

When do you think an ensemble can outperform any of the individual models?

75 of 85

Voted classification (ensembles)

  • Idea: combining the outputs of different classification models:
    • Trained in different subsets of data
    • Using different algorithms
    • Using different features

76 of 85

Bagging

  • Individual classifiers are trained in parallel.

  • To work properly, classifiers must differ significantly from each other:
    • Different document representations
    • Different subsets of features
    • Different learning methods

  • Combining results by:
    • Majority vote
    • Weighted linear combination
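Majority-vote combination, the simpler of the two options above, sketched:

```python
from collections import Counter

def bagging_predict(classifiers, x):
    """Combine independently trained classifiers by majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```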

77 of 85

Boosting

  • Classifiers are trained sequentially using different subsets of the training set
    • Subsets are randomly selected
    • The probability of selecting an instance is not the same for all; it depends on how often that instance was misclassified by the previous k-1 classifiers

  • The idea is to produce new classifiers that are better able to correctly classify the examples on which the previous classifiers performed poorly.
    • The final decision is a weighted linear combination of the individual predictions.
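The sequential reweighting idea can be sketched in an AdaBoost-style loop (an illustration under assumptions, not the exact algorithm of the next slide: `learners` is a hypothetical factory that fits a weak classifier to the current example weights, and labels are assumed to be in {-1, +1}):

```python
import math

def boost(train, labels, learners, rounds):
    """Each round fits a weak learner on the current example weights,
    then increases the weights of the misclassified examples."""
    n = len(train)
    w = [1.0 / n] * n
    ensemble = []  # list of (alpha, classifier) pairs
    for _ in range(rounds):
        clf = learners(w)
        err = sum(wi for wi, x, y in zip(w, train, labels) if clf(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)       # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # learner's vote weight
        ensemble.append((alpha, clf))
        # Up-weight mistakes, down-weight correct examples, renormalize
        w = [wi * math.exp(-alpha * y * clf(x))
             for wi, x, y in zip(w, train, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    def predict(x):
        # Weighted linear combination of the individual predictions
        return 1 if sum(a * c(x) for a, c in ensemble) >= 0 else -1
    return predict
```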

78 of 85

AdaBoost algorithm

79 of 85

Decision surface: decision tree

C4.5

80 of 85

Decision surface: random forest

Random forest

81 of 85

Decision surface: Logit boost

Logitboost-trees

82 of 85

Classification algorithms

  • Popular classification algorithms are:
    • Naïve Bayes
      • Probabilistic approach
    • K-Nearest Neighbors
      • Example-based approach
    • Centroid-based classification
      • Prototype-based approach
    • Support Vector Machines
      • Kernel-based approach
    • Decision trees
      • Rule-based approach
    • Boosting, bagging and ensembles in general
      • Voting systems
    • Neural networks
      • Perceptron -> deep learning

83 of 85

Neural networks

Machine Learning course (Google Developers)

Convolutional neural networks

Example Colab

84 of 85

Want to learn more?

  • C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

  • T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, Springer, 2009.

  • R. O. Duda, P. Hart, D. Stork. Pattern Classification. Wiley, 2001.

  • I. Guyon, et al. Feature Extraction: Foundations and Applications, Springer 2006.

  • T. Mitchell. Machine Learning. McGraw-Hill, 1997.

 

85 of 85

Suggested readings on text classification

  • X. Ning, G. Karypis. The Set Classification Problem and Solution Methods. Proc. of International Conference on Data Mining Workshops, IEEE, 2008

  • S. Baccianella, A. Esuli, F. Sebastiani. Using Micro-Documents for Feature Selection: The Case of Ordinal Text Classification. Proceedings of the 2nd Italian Information Retrieval Workshop, 2011

  • J. Wang, J. D. Zucker. Solving the Multiple-Instance Problem: A Lazy Learning Approach. Proc. of ICML 200.

  • A. Sun, E.P. Lim, Y. Liu. On Strategies for imbalanced text classification using SVM: a comparative study. Decision Support Systems, Vol. 48, pp. 191—201, 2009.