Classification

Asrul Abdullah

Adapted from Hands on Machine Learning with Scikit-learn, Keras and Tensorflow – Aurélien Géron

Universitas Muhammadiyah Pontianak

www.asrulabdullah.my.id

inovasi, kolaborasi & integritas (innovation, collaboration & integrity)


MNIST

  • A set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau
  • Each image is labeled with the digit it represents
  • This dataset is often called the “Hello World” of Machine Learning: whenever someone comes up with a new classification algorithm, they try it on MNIST


Code MNIST

>>> from sklearn.datasets import fetch_openml
>>> mnist = fetch_openml('mnist_784', version=1)
>>> mnist.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details',
'categories', 'url'])


Code

>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
(70000, 784)
>>> y.shape
(70000,)

There are 70,000 images, and each image has 784 features. This is because each image is 28×28 pixels (28 × 28 = 784), and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black).


Code

import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()


Display MNIST


Code

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)


Binary Classifier

# Note: fetch_openml loads the MNIST labels as strings, so cast them to
# integers first (y = y.astype(np.uint8)) before comparing to 5.
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)


SGD

A good place to start is the Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class.

This classifier has the advantage of being capable of handling very large datasets efficiently.

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)


Predict

>>> sgd_clf.predict([some_digit])
array([ True])


Performance Measures

Evaluating a classifier is often significantly trickier than evaluating a regressor.

Measuring Accuracy Using Cross-Validation

Cross-validation is a robust technique for evaluating model performance by partitioning the data into subsets (folds), training on some and testing on the others, iteratively. Compared to a single train-test split, it gives a much more reliable estimate of how well the model generalizes to unseen data.


Implementing Cross Validation

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# shuffle=True is required by recent Scikit-Learn versions when random_state is set
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9502, 0.96565 and 0.96495


Cross Validation

>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.96355, 0.93795, 0.95615])


BaseEstimator

import numpy as np
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)


Testing Accuracy

>>> never_5_clf = Never5Classifier()
>>> cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.91125, 0.90855, 0.90915])

This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).


Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at the confusion matrix.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)
array([[53057, 1522],
       [ 1325, 4096]])


Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data.

  • true positives (TP): These are cases in which we predicted positive (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted negative, and they don't have the disease.
  • false positives (FP): We predicted positive, but they don't actually have the disease. (Also known as a "Type I error.")
  • false negatives (FN): We predicted negative, but they actually do have the disease. (Also known as a "Type II error.")
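These four counts can be computed directly by comparing predicted and true labels. A minimal NumPy sketch on made-up labels (1 = has the disease, 0 = does not; the arrays are illustrative, not real data):

```python
import numpy as np

# Illustrative labels: 1 = positive (has the disease), 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, truly positive
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, truly negative
fp = np.sum((y_pred == 1) & (y_true == 0))  # Type I error
fn = np.sum((y_pred == 0) & (y_true == 1))  # Type II error

print(tp, tn, fp, fn)  # 2 3 1 2
```

The four counts always sum to the number of instances, which is a quick sanity check.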


Confusion Matrix

                      Actual: Positive    Actual: Negative
Predicted: Positive          tp                  fp
Predicted: Negative          fn                  tn


Precision and Recall in Text Retrieval

  • Precision
    • The ability to retrieve top-ranked documents that are mostly relevant.
    • Precision P = tp/(tp + fp)
  • Recall
    • The ability of the search to find all of the relevant items in the corpus.
    • Recall R = tp/(tp + fn)

                  Relevant    Nonrelevant
Retrieved            tp           fp
Not Retrieved        fn           tn
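The two formulas above can be cross-checked against Scikit-Learn's precision_score and recall_score; a small sketch on made-up labels (illustrative, not a real corpus):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = relevant/positive, 0 = nonrelevant/negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)  # P = tp / (tp + fp)
recall = tp / (tp + fn)     # R = tp / (tp + fn)

print(precision, recall)  # 0.75 0.6
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
```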


Accuracy

Overall, how often is the classifier correct?

    • Number of correct predictions / Total number of predictions
    • Accuracy = (tp + tn)/(tp + fp + fn + tn)


    • Accuracy = (1 + 90)/(1 + 1 + 8 + 90) = 0.91
    • 91 correct predictions out of 100 total examples
    • Precision = 1/2 and Recall = 1/9
    • Accuracy alone doesn't tell the full story when you're working with a class-imbalanced dataset

                      Positive    Negative
Predicted Positive        1           1
Predicted Negative        8          90
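The arithmetic behind these bullets, taken straight from the table's counts (tp = 1, fp = 1, fn = 8, tn = 90), can be checked in a few lines:

```python
tp, fp, fn, tn = 1, 1, 8, 90  # counts from the table above

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 91 correct out of 100
precision = tp / (tp + fp)                  # 1/2
recall = tp / (tp + fn)                     # 1/9

print(accuracy, precision, round(recall, 3))  # 0.91 0.5 0.111
```

High accuracy, yet the classifier finds only one of the nine positives: exactly the imbalance problem the slide describes.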


F Measure (F1/Harmonic Mean)

  • One measure of performance that takes into account both recall and precision.
  • Harmonic mean of recall and precision: F1 = 2PR / (P + R)

    • Why the harmonic mean?
    • The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large.
    • Retrieval data are extremely skewed: over 99% of documents are non-relevant. This is why accuracy is not an appropriate measure.
    • Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high.
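A tiny sketch contrasting the harmonic mean with the arithmetic mean (the precision/recall pairs are illustrative values, not measured ones):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall: F1 = 2PR / (P + R)
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.9), 4))     # 0.9    -> both high, F1 high
print(round(f1(1.0, 0.1), 4))     # 0.1818 -> dragged down by the small value
print(round((1.0 + 0.1) / 2, 4))  # 0.55   -> arithmetic mean hides the imbalance
```

The (1.0, 0.1) pair scores 0.55 on the arithmetic mean but only ~0.18 on F1: the harmonic mean only rewards classifiers that do well on both measures.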


ROC Curve

  • A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
  • The diagnostic performance of a test, i.e. its accuracy in discriminating diseased cases from normal cases, is evaluated using ROC curve analysis.
  • A ROC curve is a way to compare diagnostic tests: it is a plot of the true positive rate against the false positive rate.
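With Scikit-Learn, the (FPR, TPR) points that make up the curve come from roc_curve; a minimal sketch on made-up labels and scores (illustrative values, not MNIST output):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and classifier scores (higher = more likely positive)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# The area under the curve summarizes the whole plot in one number
print(roc_auc_score(y_true, y_scores))  # 0.875
```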


ROC Curve

This is the ideal situation: the model has an ideal measure of separability and is perfectly able to distinguish between the positive class and the negative class (AUC = 1).

This is the worst situation: when AUC is approximately 0.5, the model has no capacity to discriminate between the positive class and the negative class; its predictions are no better than random.
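These two extremes can be checked numerically; a sketch with synthetic labels and scores (all values made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0] * 50 + [1] * 50)

# Ideal case: scores separate the classes perfectly -> AUC = 1.0
perfect_scores = y_true.astype(float)
print(roc_auc_score(y_true, perfect_scores))  # 1.0

# Worst case: scores unrelated to the labels -> AUC near 0.5
rng = np.random.RandomState(42)
random_scores = rng.rand(100)
print(round(roc_auc_score(y_true, random_scores), 2))
```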


Multiple ROC Curves

Comparison of multiple classifiers is usually straightforward, especially when no curves cross each other. Curves close to the perfect ROC curve have a better performance level than the ones closest to the baseline.


Precision/Recall Tradeoff

The precision/recall tradeoff is a dilemma in machine learning: increasing the accuracy of positive predictions (precision) tends to decrease the model's ability to find all positive cases (recall), and vice versa.

This tradeoff arises when setting the classification threshold:

Increased Precision (Decreased Recall): As the threshold is raised, the model becomes more stringent. The model only predicts positive if it is very confident (reducing false positives), but it risks missing less obvious positives (increasing false negatives).
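A sketch of raising the threshold on made-up decision scores (the kind returned by, e.g., a decision_function; both labels and scores here are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and decision scores (higher = more confident positive)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([-3.0, -1.5, 0.5, -0.5, 1.0, 0.8, 2.0, -2.0, 3.0, 1.5])

for threshold in (0.0, 1.2):  # raise the threshold from 0.0 to 1.2
    y_pred = scores > threshold
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(threshold, round(p, 3), round(r, 3))
# threshold 0.0 -> precision 0.667, recall 0.8
# threshold 1.2 -> precision 1.0,   recall 0.6
```

Raising the threshold drops both false positives (precision rises to 1.0) and true positives (recall falls from 0.8 to 0.6), which is exactly the tradeoff described above.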


Precision/Recall Tradeoff

Increased Recall (Decreased Precision): As the threshold is lowered, the model becomes more lenient. The model detects almost all positives (decreasing false negatives), but it also flags more negatives as positives (increasing false positives).


Case Example

Spam Detection (Precision Priority):

We want all emails that land in the spam folder to be spam. We don't want important emails to end up there.

Tradeoff: Some spam emails might make it through to the inbox (low recall).

Cancer Detection (Recall Priority):

We want all cancer patients detected. We don't want to miss any sick patients.

Tradeoff: Some healthy patients might be misdiagnosed (low precision), but they will be re-examined by a doctor.


How to Handle It?

Since it is difficult to maximize both, a metric is used to find a balance, namely the F1-Score, which is the harmonic mean of precision and recall.
