1 of 59

How to do Classification

-Fardina Fathmiul Alam

CMSC 320 - Introduction to Data Science

Fall 2025

2 of 59

Our DS Process

  1. Acquire data
  2. Explore data
  3. Clean data
  4. Feature engineer
  5. Train
  6. Test
  7. Iterate
  8. Present

3 of 59

We're Going to Cover Different Classification Methods:

  • Naive Bayes
  • Support Vector Machines
  • Random Forest

4 of 59

  1. Naive Bayes Classifiers

5 of 59

Naive Bayes Algorithm

A probabilistic classification algorithm based on Bayes' Theorem. It assumes that the features are independent given the class label. This assumption of independence is what makes it "naive."

6 of 59

Naive Bayes Algorithm

Bayes' Theorem:

P(y | X) = P(X | y) · P(y) / P(X)

By substituting X = (x1, x2, …, xn) and expanding using the chain rule (together with the independence assumption), we get:

P(y | x1, …, xn) = P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y) / P(x1, …, xn)

Where:

  • Variable y represents the class label.
  • Variable X = (x1, …, xn) represents the features.

We need to find the class y with maximum probability. Since the denominator is constant across all data points, we can ignore it, simplifying the calculation to a proportionality.

Final Decision Rule:

ŷ = argmax over y of P(y) · P(x1 | y) · … · P(xn | y)
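As a rough illustration, here is a minimal sketch of this decision rule in Python, assuming the priors and per-feature likelihoods have already been estimated and stored in plain dictionaries (the names priors and likelihoods are made up for this example):

import math

# priors[y] = P(y); likelihoods[y][x] = P(x | y) for each feature value x
def predict(features, priors, likelihoods):
    best_label, best_score = None, float("-inf")
    for y in priors:
        # work in log space to avoid underflow when multiplying many small probabilities
        score = math.log(priors[y]) + sum(math.log(likelihoods[y][x]) for x in features)
        if score > best_score:
            best_label, best_score = y, score
    return best_label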

7 of 59

Assumptions of Naive Bayes Algorithm

  • Independence Assumption: All features X1, X2, …, Xn are independent given the class y.

  • Features are equally important: All features X1, X2, …, Xn are treated as if they contribute equally to the prediction of the class label.

  • Continuous features are assumed to be normally distributed within each class (Gaussian Naive Bayes).
  • No missing data.

8 of 59

Example: Assumptions of Naive Bayes Algorithm

  • Task: Classify if a day is suitable for playing golf based on features.
  • Features: Columns represent features; rows represent individual entries.

Assumptions:

  • Independence: Hot temperature does not imply high humidity.
  • Equal Impact: Windy conditions do not outweigh other factors in the decision to play golf.

Consider the problem of playing golf.

9 of 59

Uses of Naive Bayes Algorithm

Classification problems; commonly used in text classification within Natural Language Processing.

  • Why? Text data has high dimensionality, with each word representing one feature (e.g., spam filtering, sentiment analysis).

Process:

  • Compute Probabilities: Calculate probabilities for each label/tag based on features.
  • Select Best Label: Choose the label with the highest calculated probability.
  • Apply Bayes' Theorem: Utilize Bayes' Theorem, incorporating prior knowledge of the features.

10 of 59

Example: Naive Bayes

TASK: Predict the label ("Sports" or "Non-Sports") for the new text data "A very close game". Is it categorized as "Sports" or "Non-Sports"?

  • P (Sports | A very close game)

  • P (Not sports | A very close game)

Assume you have the following documents (text) and their associated tags.

  • Each word (token) in the document is treated as a feature (Feature Representation → LATER TOPIC).

Training Data:

  Text                             Tag
  "A great game"                   Sports
  "The election was over"          Not sports
  "Very clean match"               Sports
  "A clean but forgettable game"   Sports
  "It was a close election"        Not sports

Step 1: Calculate the prior probabilities:

P(Sports) = 3/5 ; P(Not Sports) = 2/5

11 of 59

Example: Naive Bayes

P (Sports | A very close game)

P (Not sports | A very close game)

Step 2: Likelihood. We want to count how many times the sentence "A very close game" appears under the Sports tag and under the Not Sports tag.

(Training data as above.)

We compare:

P(A very close game | Sports) · P(Sports)

and

P(A very close game | Not Sports) · P(Not Sports)

12 of 59

Example: Naive Bayes

P (Sports | A very close game)

P (Not sports | A very close game)

Step 2: Likelihood. We want to count how many times the sentence "A very close game" appears under the Sports tag and under the Not Sports tag.

(Training data as above.)

Did you see any issue?

13 of 59

Being Naive: Here comes the Naive Part!

Naive part: Naive Bayes assumes that every word (feature) in a sentence is conditionally independent of the others given the class. This means that we're no longer looking at entire sentences, but rather at individual words.

Each word is treated as a separate feature that contributes to the overall classification decision.

14 of 59

Back to Example: Naive Bayes

P (Sports | A very close game)

P (Not sports | A very close game)

Step 2: Likelihood. We want to count how many times the sentence "A very close game" appears under the Sports tag and under the Not Sports tag.

Calculate the probability of each individual word:

(Training data as above.)

Example:

P(game | Sports) = 2/11

(the word "game" appears 2 times across the Sports documents, which contain 11 words in total)

Did you see any issue?

15 of 59

Back to Example: Naive Bayes

P (Sports | A very close game)

P (Not sports | A very close game)

Step 2: Likelihood. We want to count how many times the sentence "A very close game" appears under the Sports tag and under the Not Sports tag.

Calculate the probability of each individual word:

(Training data as above.)

Issue: the word "close" never appears in any Sports document, so P(close | Sports) = 0, which makes the whole product zero.

Solution: Laplace smoothing: we add 1 to every count so it is never zero.

16 of 59

Laplace / Additive smoothing

A technique in Naive Bayes that addresses zero probabilities when a word is absent from training data.

  • Adds a small constant alpha (usually 1) to each count in the numerator and alpha times the number of unique words (the vocabulary size V) to the denominator, ensuring that every word has a non-zero probability.

(Training data as above.)

Vocabulary (all unique words in the training data):

['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match']

V = number of unique words in the vocabulary = 14
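As a quick sanity check (assuming the counts from the running example: "game" appears 2 times across the Sports documents, which contain 11 words in total), the smoothed estimate in Python:

count_game_sports = 2      # occurrences of "game" in Sports documents
total_sports_words = 11    # total words across Sports documents
V = 14                     # vocabulary size
alpha = 1                  # Laplace smoothing constant

p_game_given_sports = (count_game_sports + alpha) / (total_sports_words + alpha * V)
print(p_game_given_sports)  # 3/25 = 0.12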

17 of 59

Full Solution using Laplace smoothing

(Training data as above.)
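A minimal from-scratch sketch of the full Laplace-smoothed computation in Python, assuming the five training documents shown earlier (lowercasing and splitting on spaces is a simplification of real text preprocessing):

from collections import Counter

train = [
    ("a great game", "Sports"),
    ("the election was over", "Not sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not sports"),
]

vocab = {w for text, _ in train for w in text.split()}   # 14 unique words
doc_counts = Counter(tag for _, tag in train)            # priors: 3 Sports, 2 Not sports
word_counts = {tag: Counter() for tag in doc_counts}
for text, tag in train:
    word_counts[tag].update(text.split())

def score(sentence, tag, alpha=1):
    # prior times the Laplace-smoothed likelihood of each word
    prior = doc_counts[tag] / len(train)
    total = sum(word_counts[tag].values())
    s = prior
    for w in sentence.split():
        s *= (word_counts[tag][w] + alpha) / (total + alpha * len(vocab))
    return s

for tag in doc_counts:
    print(tag, score("a very close game", tag))
# Sports scores roughly 2.8e-05 and Not sports roughly 5.7e-06, so the prediction is Sports.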

18 of 59

Python Implementation for Naive Bayes using Scikit Learn
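A minimal sketch of such an implementation with scikit-learn, assuming the same five training documents (CountVectorizer builds the word-count features; MultinomialNB applies Laplace smoothing through its alpha parameter):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["A great game", "The election was over", "Very clean match",
         "A clean but forgettable game", "It was a close election"]
tags = ["Sports", "Not sports", "Sports", "Sports", "Not sports"]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")   # keep single-letter words like "a"
X = vectorizer.fit_transform(texts)                          # word-count matrix
clf = MultinomialNB(alpha=1.0)                               # alpha=1.0 is Laplace smoothing
clf.fit(X, tags)

print(clf.predict(vectorizer.transform(["A very close game"])))  # expected: ['Sports']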

19 of 59

(2) Support Vector Machine (SVMs)

20 of 59

Support Vector Machines

A supervised ML algorithm used for both regression and classification tasks. It aims to find the best possible decision boundary (a hyperplane) that separates data points into different classes.

21 of 59

Support Vector Machines

Objective: find an optimal hyperplane in an N-dimensional space (N = the number of features) that separates the different classes in the feature space.

SVM divides the data based on the maximum-margin hyperplane.

What are support vectors? The data points closest to the hyperplane, which influence its position and orientation. These points are critical for the model's decision boundary.

22 of 59

What is a Support Vector?

The data points closest to the hyperplane, which play a critical role in deciding the hyperplane and the margin (the distance between a support vector and the hyperplane). They are the only instances that determine, or support, the hyperplane, and they influence its position and orientation.

23 of 59

What is Hyperplane?

Hyperplanes are decision boundaries that help classify the data points. A hyperplane is a flat surface that is one dimension lower than the input feature space.

The dimension of the hyperplane depends on the number of features.

In a two-dimensional feature space, a hyperplane is a line like y=mx+b.

In a three-dimensional feature space, a hyperplane is a plane like x+3y-2z=5.

Intuitively, an SVM hyperplane separates the feature space into two regions. (Ref: zybook)

24 of 59

Back to: Support Vector Machines

Consider a 2-dimensional feature space: many possible hyperplanes could be chosen.

Objective: find an optimal hyperplane that has the maximum margin, i.e., the maximum distance between the data points of both classes.

25 of 59

What is Margin in Support Vector Machines

Margin: The distance between the hyperplane and the nearest support vectors.

SVM aims to maximize this margin for better generalization.

Multiple hyperplanes can separate the data from the two classes.

Q: Which one to choose?

One reasonable choice as the best hyperplane is the one that represents the largest separation or margin between the two classes.

Why maximize the margin?

Maximizing the margin creates a wider gap between data points of different classes, making the model better at classifying new data by allowing a larger buffer zone between categories.

26 of 59

27 of 59

Remember

Support vectors are the critical elements of the training set that can (or may) change the position of the dividing hyperplane if removed.

28 of 59

SVM: Loss function

In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss:

L(y, f(x)) = max(0, 1 − y · f(x))

Where:

  • y = true class label (say, +1 or −1)
  • f(x) = predicted output for the given input x

The loss measures how far a particular data point is from being correctly classified with the required margin.
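A minimal sketch of the hinge loss in Python (the example values are made up):

def hinge_loss(y, fx):
    # zero when the point is on the correct side with margin >= 1,
    # growing linearly as the point moves toward or past the boundary
    return max(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.5))   # 0.0  (correct and outside the margin)
print(hinge_loss(+1, 0.3))   # 0.7  (correct side, but inside the margin)
print(hinge_loss(-1, 0.3))   # 1.3  (misclassified)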

29 of 59

However

Not all data are linearly separable (meaning there is no single hyperplane that can perfectly separate the classes).

When data isn't separable by a straight line, SVMs use a clever math trick → “Kernel Trick”

30 of 59

Kernel Trick:

Mathematical method used to transform the original data into a higher-dimensional space where the data becomes linearly separable.

  • This mapping uses a kernel function, which computes the inner products between data points in the higher-dimensional space directly, instead of explicitly calculating their coordinates.
  • Popular kernel functions include: linear, polynomial, radial basis function (RBF), and sigmoid.

Example: mapping 1-D data to 2-D with a kernel so that the two classes become separable.

Here, a new variable y_i is created as a function of the distance from the origin.

https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
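A small sketch of that idea with made-up 1-D data (class 1 lies far from the origin, class 0 near it, so no single threshold on x separates them):

import numpy as np
from sklearn.svm import SVC

X = np.array([[-4.0], [-3.0], [-0.5], [0.0], [0.5], [3.0], [4.0]])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Explicit mapping to 2-D, (x, x^2): in that space a horizontal line separates the classes.
X_mapped = np.hstack([X, X ** 2])

# The kernel trick lets the SVM behave as if it worked in such a higher-dimensional
# space without ever computing the mapping explicitly.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.2], [3.5]]))   # expected: [0 1]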

31 of 59

Python Implementation for SVM using Scikit Learn
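A minimal sketch of an SVM classifier in scikit-learn, assuming the Iris dataset used elsewhere in this deck and an arbitrary train/test split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = SVC(kernel="rbf", C=1.0)    # kernel and C are hyperparameters worth tuning
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set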

32 of 59

Next: Random Forest Classifier

But before that, let's talk about "overfitting" and "underfitting" a bit.

Recap: We split the dataset into training and testing sets before training and evaluating a machine learning model. But why do we test?

We don't want to use a model that doesn't work! There are many reasons a model might break, but the main ones are:

  • Overfitting
  • Underfitting

33 of 59

Overfitting and underfitting are two crucial concepts in machine learning and are among the most prevalent causes of poor performance in a machine learning model.

34 of 59

Combatting Training Failures

35 of 59

Overfitting

Do we think decision trees are more prone to over or under-fitting?

What about KNN where K=1? Or K=N?

36 of 59

Overfitting - in KNN

In KNN, k = 1 (or another small value of k) is prone to overfitting.

k=1: Overfitting (too specific): With a small k, the model becomes highly sensitive to noise and outliers in the training data.

  • It reacts too strongly to individual points, including noise.
  • It memorizes training data points, capturing noise instead of learning general patterns.
  • This results in poor performance on new, unseen data as it fails to filter out irrelevant information.

37 of 59

But in KNN

K=N: Underfitting (too general)!

Why? Considering all data points as neighbors produces overly generalized predictions that underfit the data by not capturing local patterns (the sketch after this list illustrates the effect of k).

  • With k = n, the model uses all the training points for every prediction. This often leads to underfitting because it averages over all points, missing local patterns.
  • Results in a simple decision boundary that may not capture the underlying patterns in the data effectively.
  • Simplifies the model too much.
  • Often performs poorly because it is too simple to understand the data.
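A minimal sketch (scikit-learn, with an arbitrary split of the Iris data) comparing training and test accuracy for small and large k, which makes the overfitting/underfitting pattern visible:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

for k in [1, 5, 15, len(X_train)]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # k = 1 tends to score near-perfectly on the training data (overfitting),
    # while k = n predicts the overall majority class (underfitting)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))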

38 of 59

https://towardsdatascience.com/k-nearest-neighbors-94395f445221

As K increases, KNN fits a smoother curve to the data. Why? A higher value of K reduces the edginess of the decision boundary by taking more data into account, thus reducing the overall complexity and flexibility of the model.

But observe the top figure: increasing the value of K improves the score up to a certain point, after which it starts dropping again (indicating underfitting).

39 of 59

Overfitting - How about decision tree?

Decision trees have a really bad overfitting problem!

A decision tree will use every scrap of information you give it and will memorize the entire training set.

Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.

40 of 59

How to deal with overfitting? Play with Hyperparameters

Hyperparameter: an argument we choose before training that determines how the model works (it is not learned from the data).

  • K for KNN

Oftentimes, we play around with a bunch of hyperparameters and see what works best on the test set.

41 of 59

How to deal with overfitting in a decision tree

  • Apply a depth limit (fixed depth / early stopping); see the sketch after this list
  • Apply pruning, where we simply chop off everything after, say, four levels
  • We can try different settings and see how they affect our accuracy on the test set
  • Or, use ensembles of different trees (random forests)
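A minimal sketch (arbitrary split of the Iris data) comparing an unrestricted tree with depth-limited ones; the unrestricted tree typically scores perfectly on the training data but generalizes no better:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

for depth in [None, 4, 2]:            # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))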

42 of 59

Ensemble methods

Ensemble methods, such as bagging and boosting, can also mitigate overfitting.

These techniques combine multiple models to make predictions, reducing the impact of individual models' biases and errors.

By leveraging the diversity of these models, ensemble methods enhance the generalization ability and minimize overfitting.

43 of 59

Random Forest

Random Forest is an ensemble learning method primarily used for classification and regression tasks.

  • Builds a multitude of decision trees during training and merges their predictions to improve accuracy and control overfitting. Here's how it works:

44 of 59

Key Steps in Random Forest

Bootstrap Sampling (Bagging): creates N subsets of the original training data by randomly sampling with replacement. Each subset will be used to train a different decision tree.

Building Decision Trees:

For each bootstrap sample, a decision tree is constructed. Random Forest randomly selects a subset of features for each split to increase tree diversity.

45 of 59

Key Steps in Random Forest

Predictions:

  • Classification: Use majority voting from all trees.
  • Regression: Calculate the average prediction from all trees.

  • If a decision tree is overfit or has locked on to a weird correlation, it gets outvoted
  • Not great for super high-dimensional data

'N' is a hyperparameter that determines the number of trees in the forest.

Usually, the higher the number of trees, the better the forest learns the data. However, adding a lot of trees can slow down the training process considerably, so we do a parameter search to find the sweet spot.
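A minimal sketch of that parameter search with GridSearchCV over the number of trees (the candidate values are made up):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
param_grid = {"n_estimators": [10, 50, 100, 200]}    # candidate values for N
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)              # 5-fold cross-validation
search.fit(iris.data, iris.target)
print(search.best_params_, search.best_score_)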

46 of 59

Random Forest Implementation using Scikit Learn
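A minimal sketch of a random forest classifier in scikit-learn, again assuming the Iris data and an arbitrary train/test split:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 trees
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # majority-vote accuracy on the test set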

47 of 59

How do Code?

(SELF STUDY - NOT FOR EXAM)

48 of 59

Scikit Learn

So far, we have our data in a dataframe. Scikit-learn provides a common classifier interface that all its classifiers follow. It has the following methods:

fit(X, y): A function that trains the model, where X is something like a dataframe of training data, and y is the labels

predict(X): Takes in unlabeled data and produces an array of labels

49 of 59

SKLearn Example

from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

clf = DecisionTreeClassifier()

clf.fit(iris.data, iris.target)

clf.predict(unlabeled_data)  # unlabeled_data: new samples with the same feature columns

  • Load Iris dataset: Imports the Iris dataset.
  • Initialize a Decision Tree Classifier: Creates a decision tree classifier.
  • Train the Classifier: Uses the classifier to fit (train) on the Iris dataset's features (iris.data) and their labels (iris.target).
  • Make Predictions: Once trained, uses the trained model to predict labels for new, unlabeled data (unlabeled_data).

50 of 59

SKLearn Example

from sklearn.datasets import load_iris

from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

clf = DecisionTreeClassifier(max_depth=4)

clf.fit(iris.data, iris.target)

clf.predict(unlabeled_data)

Controlling the depth can help prevent the model from overfitting and can lead to a more straightforward or less complex decision tree.

51 of 59

Troubleshooting

52 of 59

Debugging

ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:

  • A dataframe of random data, completely uncorrelated from my target
  • A dataframe that has exactly the same data as my regular data, except an extra column that is my target
  • A dataframe where I overwrite the target to be a really simple function of my existing data.

53 of 59

Debugging

ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:

  • A dataframe of random data, completely uncorrelated from my target

54 of 59

Debugging

ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:

  • A dataframe of random data, completely uncorrelated from my target

If I've done everything correctly, I should get accuracy no better than random guessing

55 of 59

Random

ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:

  • A dataframe that has exactly the same data as my regular data, except an extra column that is my target

If I've done everything correctly, this should get 100% accuracy.

56 of 59

Obvious

ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:

  • A dataframe that has exactly the same data as my regular data, except an extra column that is my target

If I've done everything correctly, this should get 100% accuracy.

57 of 59

Easy

ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:

  • A dataframe where I overwrite the target to be a really simple function of my existing data.

If I've done everything correctly, this should get close to 100% accuracy.

58 of 59

How to Use Them

When I go to calibrate, train and test everything, I do all my operations on all three of these, in addition to the original dataset. If something goes wrong, I'll know right away what's happening.
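A minimal sketch of these sanity checks in Python (all data, column names, and the simple "easy" target here are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-ins for the real data: X is the feature dataframe, y is the target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=500), "f2": rng.normal(size=500)})
y = pd.Series(rng.integers(0, 2, size=500), name="target")

checks = {
    "random": (pd.DataFrame(rng.normal(size=X.shape), columns=X.columns), y),  # expect ~chance accuracy
    "leaked": (X.assign(answer=y), y),                                          # expect ~100%
    "easy":   (X, (X["f1"] > 0).astype(int)),                                   # expect close to 100%
}

for name, (Xc, yc) in checks.items():
    X_tr, X_te, y_tr, y_te = train_test_split(Xc, yc, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))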

59 of 59

Other Tips: Class imbalance

Class imbalance can be really challenging to deal with!

  • Class imbalance happens when you have a huge majority of one class and few examples of another
  • Can be very hard to get your classifier to focus on the right thing

Options:

  • Use the class_weight parameter in Sklearn (see the sketch after this list)
  • Create multiple copies of your minority-class examples with slight differences (oversampling)
  • Equalize your training set
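A minimal sketch of the class_weight option (the imbalanced data here is made up):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up imbalanced data: roughly 95% class 0, 5% class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.05).astype(int)

# 'balanced' re-weights classes inversely to their frequency,
# so mistakes on the rare class cost more during training.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)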