How to do Classification
-Fardina Fathmiul Alam
CMSC 320 - Introduction to Data Science
Fall 2025
Our DS Process
We're Going to Cover Different Classification Methods:
Naive Bayes Algorithm
A probabilistic classification algorithm based on Bayes' Theorem. It assumes that the features are independent given the class label. This assumption of independence is what makes it "naive."
Naive Bayes Algorithm
Bayes' Theorem: P(y | X) = P(X | y) P(y) / P(X)
Variable y represents the class label; variable X = (x1, x2, ..., xn) represents the features.
By substituting for X and expanding using the chain rule (together with the independence assumption) we get:
P(y | x1, ..., xn) = P(x1 | y) P(x2 | y) ... P(xn | y) P(y) / [ P(x1) P(x2) ... P(xn) ]
We need to find the class y with the maximum probability.
Since the denominator is constant across all data points, we can ignore it, simplifying the calculation to a proportionality:
P(y | x1, ..., xn) ∝ P(y) × P(x1 | y) × P(x2 | y) × ... × P(xn | y)
Final Decision Rule: predict the class y that maximizes P(y) × P(x1 | y) × ... × P(xn | y).
Assumptions of Naive Bayes Algorithm
Assumptions:
Feature independence: the features are conditionally independent of one another given the class label.
Equal contribution: each feature contributes on its own to the prediction of the class.
Example: Consider the problem of predicting whether to play golf from weather features such as Outlook, Temperature, Humidity, and Windy; Naive Bayes treats these features as independent given the "Play" label.
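A minimal sketch of what the assumption buys us, assuming the standard golf dataset with features Outlook, Temperature, Humidity, and Windy (the slide's exact table is an assumption here): to score a day like (Sunny, Hot, High humidity, not Windy) we only need per-feature likelihoods,
P(Play = Yes | Sunny, Hot, High, not Windy) ∝ P(Yes) × P(Sunny | Yes) × P(Hot | Yes) × P(High | Yes) × P(not Windy | Yes)
and similarly for Play = No; we then predict whichever class gives the larger value.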
Uses of Naive Bayes Algorithm
Classification problems; commonly used in text classification within Natural Language Processing.
Process: compute the prior probability of each class, compute the likelihood of the observed features under each class, multiply them together, and predict the class with the highest (proportional) posterior.
Example: Naive Bayes
TASK: Predict the label ("Sports" or "Non-Sports") for the new text data "A very close game". Is it categorized as "Sports" or "Non-Sports"?
Assume you have the following documents (text) and their associated tags.
Training Data:
"A great game" -> Sports
"The election was over" -> Not Sports
"Very clean match" -> Sports
"A clean but forgettable game" -> Sports
"It was a close election" -> Not Sports
Steps: Calculate Prior Probabilities:
P(Sports) = 3/5 ; P(Not Sports) = 2/5
Example: Naive Bayes
Steps: Calculate Likelihoods: we want to compare
P(Sports | a very close game) and P(Not Sports | a very close game),
which, by Bayes' rule, means comparing
P(a very close game | Sports) × P(Sports) and P(a very close game | Not Sports) × P(Not Sports).
A first attempt: count how many times the whole sentence "A very close game" appears under the Sports tag and under the Not Sports tag.
Did you see any issue? The sentence "A very close game" does not appear anywhere in the training data, so both counts (and therefore both likelihoods) would be zero.
Being Naive: Here comes the Naive Part!
Naive part: Naive Bayes assumes that every word (feature) in a sentence is conditionally independent of the others given the class. This means that we are no longer looking at entire sentences, but rather at individual words.
Each word is treated as a separate feature that contributes to the overall classification decision.
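Under this assumption, the likelihood of the full sentence factors into per-word likelihoods:
P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)
P(a very close game | Not Sports) = P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports)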
Back to Example: Naive Bayes
Steps: Calculate Likelihoods: instead of counting the full sentence "A very close game", we now calculate the probability of each individual word given the tag.
Example: P(game | Sports) = 2/11
(the word "game" appears 2 times in the Sports documents, which contain 11 words in total).
Did you see any issue? The word "close" never appears in a Sports document, so P(close | Sports) = 0, which makes the whole product zero.
Solution: Laplace smoothing: we add 1 to every word count so it is never zero.
Laplace / Additive smoothing
A technique in Naive Bayes that addresses zero probabilities when a word is absent from the training data: add 1 to every word count and add V (the number of possible words) to the denominator.
Vocabulary of the training data:
['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match']
V = number of possible words = 14
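With smoothing, each word likelihood becomes (count of the word in the class + 1) / (total words in the class + V). For example, with the counts from the training data above:
P(game | Sports) = (2 + 1) / (11 + 14) = 3/25
P(close | Sports) = (0 + 1) / (11 + 14) = 1/25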
Full Solution using Laplace smoothing
Apply the smoothed probability of every word in "A very close game" under both tags, multiply by the class priors, and compare.
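Using the word counts from the training data above (Sports documents contain 11 words in total, Not Sports documents contain 9, and V = 14):
P(a | Sports) = 3/25, P(very | Sports) = 2/25, P(close | Sports) = 1/25, P(game | Sports) = 3/25
P(a | Not Sports) = 2/23, P(very | Not Sports) = 1/23, P(close | Not Sports) = 2/23, P(game | Not Sports) = 1/23
P(Sports) × P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) = 3/5 × 3/25 × 2/25 × 1/25 × 3/25 ≈ 2.76 × 10^-5
P(Not Sports) × P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports) = 2/5 × 2/23 × 1/23 × 2/23 × 1/23 ≈ 5.72 × 10^-6
The Sports score is larger, so "A very close game" is classified as Sports.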
Python Implementation for Naive Bayes using Scikit Learn
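A minimal sketch of this example in scikit-learn, assuming CountVectorizer plus MultinomialNB (the lecture's actual notebook may differ):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training documents and their tags (same data as the worked example)
texts = ["A great game", "The election was over", "Very clean match",
         "A clean but forgettable game", "It was a close election"]
tags = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

# Turn each document into a vector of word counts;
# the token_pattern keeps single-letter words like "a"
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(texts)

# MultinomialNB applies Laplace smoothing by default (alpha=1.0)
clf = MultinomialNB(alpha=1.0)
clf.fit(X, tags)

# Predict the tag of the new sentence
print(clf.predict(vectorizer.transform(["A very close game"])))   # expected: ['Sports']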
(2) Support Vector Machines (SVMs)
Support Vector Machines
A supervised ML algorithm used for both regression and classification tasks. It aims to find the best possible decision boundary (a hyperplane) that separates data points into different classes.
Support Vector Machines
Objective: find an optimal hyperplane in an N-dimensional space (N = the number of features) that separates the different classes in the feature space.
Divides the data based on the maximum-margin hyperplane.
What are Support Vectors?
The data points closest to the hyperplane. They play a critical role in determining the hyperplane and the margin (the distance between a support vector and the hyperplane). They are the only instances that determine, or support, the hyperplane, and they influence its position and orientation.
What is a Hyperplane?
Hyperplanes are decision boundaries that help classify the data points. A hyperplane is a flat surface that is one dimension lower than the input feature space.
The dimension of the hyperplane depends upon the number of features.
In a two-dimensional feature space, a hyperplane is a line like y=mx+b.
In a three-dimensional feature space, a hyperplane is a plane like x+3y-2z=5.
Intuitively, an SVM hyperplane separates the feature space into two regions. (Ref: zybook)
Back to: Support Vector Machines
Consider a 2-dimensional, two-class dataset: many possible separating hyperplanes could be chosen.
Objective: find an optimal hyperplane that has the maximum margin, i.e., the maximum distance between data points of both classes.
What is the Margin in Support Vector Machines?
Margin: The distance between the hyperplane and the nearest support vectors.
SVM aims to maximize this margin for better generalization.
Multiple hyperplanes can separate the data from the two classes.
Q: Which one to choose?
One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes.
Why maximize the margin? A larger margin creates a wider gap between data points of different classes, making the model better at classifying new data by allowing a larger buffer zone between categories.
Remember
Support vectors are the critical elements of the training set: removing one of them can change the position of the dividing hyperplane.
SVM: Loss Function
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss:
L(y, f(x)) = max(0, 1 - y · f(x))
Where:
y = true class label (let's say, +1 or -1)
f(x) = predicted output (score) for the given input x
The quantity 1 - y · f(x) measures how far a particular data point is from being correctly classified with the required margin.
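A minimal sketch of the hinge loss as a Python function (illustrative, not the lecture's code):

import numpy as np

def hinge_loss(y_true, f_x):
    """Hinge loss for labels in {-1, +1} and raw model scores f(x)."""
    # Zero loss when the point is correctly classified with margin >= 1,
    # and a linearly growing penalty otherwise.
    return np.maximum(0, 1 - y_true * f_x)

# Example: a point correctly classified past the margin vs. a misclassified one
print(hinge_loss(np.array([+1, -1]), np.array([2.0, 0.5])))   # [0.  1.5]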
However
Not all data are linearly separable (meaning there is no single hyperplane that can perfectly separate the classes).
When data isn't separable by a straight line, SVMs use a clever math trick → “Kernel Trick”
Kernel Trick:
A mathematical method used to transform the original data into a higher-dimensional space where the data becomes linearly separable.
Example: mapping 1D data to 2D so that the two classes become separable using a kernel.
Here, a new variable yi is introduced as a function of each point's distance from the origin o.
https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
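As a small worked version of that idea (assuming the common squared-distance mapping; the exact mapping on the slide may differ): map each 1D point x to the 2D point (x, y) with y = x^2. If one class sits near the origin at {-1, 0, 1} and the other sits farther out at {-3, -2, 2, 3}, no single threshold on the 1D line separates them; after the mapping, the first class has y <= 1 and the second has y >= 4, so the horizontal line y = 2 separates the two classes perfectly.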
Python Implementation for SVM using Scikit Learn
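A minimal sketch of an SVM classifier in scikit-learn (illustrative; the lecture's actual code may differ), shown on the iris dataset used later in these slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# 'kernel' and 'C' are hyperparameters; 'rbf' applies the kernel trick
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # accuracy on the held-out test set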
Next: Random Forest Classifier
But before that, let's talk about "Overfitting" and "Underfitting" a bit.
Recap: We split the dataset into training and testing sets before training and evaluating a machine learning model. But why do we test?
We don't want to use a model that doesn't work! There are many reasons a model might break, but the main ones are:
Overfitting and underfitting are two crucial concepts in machine learning and are among the most common causes of poor model performance.
Combatting Training Failures
Overfitting
Do we think decision trees are more prone to over or under-fitting?
What about KNN where K=1? Or K=N?
Overfitting - in KNN
In KNN, k=1 (or a small value of k): prone to overfitting.
k=1: Overfitting (too specific). With a small k, the model becomes highly sensitive to noise and outliers in the training data.
But in KNN
K=N: Underfitting (too general)!
Why? With K=N the model considers all data points as neighbors, producing overly generalized predictions that underfit the data by not capturing local patterns.
https://towardsdatascience.com/k-nearest-neighbors-94395f445221#:~:text=At%20K%3D1%2C%20the%20KNN,for%20different%20values%20of%20K.&text=As%20K%20increases%2C%20the%20KNN,smoother%20curve%20to%20the%20data.
As K increases, KNN fits a smoother curve to the data. Why? A higher value of K reduces the edginess by taking more data into account, thus reducing the overall complexity and flexibility of the model.
However, increasing the value of K improves the score only up to a certain point, after which it starts dropping again (indicating underfitting).
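A minimal sketch of sweeping K to see this effect (illustrative; the dataset and split here are assumptions, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Small K tends to overfit (noisy boundary); very large K tends to underfit.
for k in (1, 5, 25, len(X_train)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))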
Overfitting: how about decision trees?
Decision trees have a really bad overfitting problem!
A decision tree will use every scrap of information you give it and will memorize the entire training set.
Decision-tree learners can create over-complex trees that do not generalize well to unseen data. This is called overfitting.
How to deal with overfitting? Play with Hyperparameters
Hyperparameter: An argument that determines how the model works.
Oftentimes, we play around with a bunch of hyperparameters and see what works best on the test set.
How to deal with overfitting in a decision tree? For example, limit the tree's maximum depth, require a minimum number of samples per leaf, or prune the tree.
Ensemble methods
Ensemble methods, such as bagging and boosting, can also mitigate overfitting.
These techniques combine multiple models to make predictions, reducing the impact of individual models' biases and errors.
By leveraging the diversity of these models, ensemble methods enhance the generalization ability and minimize overfitting.
Random Forest
Random Forest is an ensemble learning method primarily used for classification and regression tasks.
Key Steps in Random Forest
Bootstrap Sampling (Bagging): creates N subsets of the original training data by randomly sampling with replacement. Each subset will be used to train a different decision tree.
Building Decision Trees:
For each bootstrap sample, a decision tree is constructed. Random Forest randomly selects a subset of features at each split to increase tree diversity.
Key Steps in Random Forest
Predictions: each tree makes its own prediction; the forest's final output is the majority vote across trees (for classification) or the average (for regression).
'N' is a hyperparameter that determines the number of trees in the forest.
Usually, the higher the number of trees, the better the forest learns the data. However, adding many trees can slow down the training process considerably, so we do a parameter search to find the sweet spot.
Random Forest Implementation using Scikit Learn
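A minimal sketch in scikit-learn (illustrative; the lecture's actual notebook may differ), where n_estimators is the number of trees N:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# n_estimators = number of trees; each tree is trained on a bootstrap sample
# and considers a random subset of features at each split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # accuracy of the majority-vote predictions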
How to Code?
(SELF STUDY - NOT FOR EXAM)
Scikit Learn
So far, we have our data in a dataframe. Scikit-learn has a base classifier interface that all classifiers inherit. Every classifier will have the following methods:
fit(X, y): A function that trains the model, where X is something like a dataframe of training data, and y is the labels
predict(X): Takes in unlabeled data and produces an array of labels
SKLearn Example
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                    # load the example dataset
clf = DecisionTreeClassifier()        # create the classifier object
clf.fit(iris.data, iris.target)       # train on the labeled data
clf.predict(unlabeled_data)           # unlabeled_data: new, unlabeled samples
SKLearn Example
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=4)   # max_depth is a hyperparameter
clf.fit(iris.data, iris.target)
clf.predict(unlabeled_data)
Controlling the depth can help prevent the model from overfitting and can lead to a more straightforward or less complex decision tree.
Troubleshooting
Debugging
ML is hard to debug, and it's scary not being able to track what's going on directly. So, when I do projects I create three other dataframes:
Random (e.g., the labels are assigned at random): if I've done everything correctly, I should get accuracy no better than random guessing.
Obvious (the label is trivially determined by the features): if I've done everything correctly, this should get 100% accuracy.
Easy (the label follows a simple, clear pattern): if I've done everything correctly, this should get close to 100% accuracy.
How to Use Them
When I go to calibrate, train, and test everything, I run all my operations on all three of these dataframes, in addition to the original dataset. If something goes wrong, I'll know right away what's happening.
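A minimal sketch of how these debug dataframes could be built with pandas and numpy (the column names and the "obvious"/"easy" rules here are hypothetical, just to illustrate the idea):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical original training dataframe with a 'label' column
df = pd.DataFrame({"feat1": rng.normal(size=100),
                   "feat2": rng.normal(size=100),
                   "label": rng.integers(0, 2, size=100)})

# Random: shuffle the labels -> accuracy should be no better than chance
df_random = df.copy()
df_random["label"] = rng.permutation(df["label"].values)

# Obvious: the label is determined directly by a feature -> should reach 100%
df_obvious = df.copy()
df_obvious["label"] = (df_obvious["feat1"] > 0).astype(int)

# Easy: label mostly follows a simple rule, with a little noise -> close to 100%
df_easy = df_obvious.copy()
noise = rng.random(len(df_easy)) < 0.05
df_easy.loc[noise, "label"] = 1 - df_easy.loc[noise, "label"]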
Other Tips: Class imbalance
Class imbalance can be really challenging to deal with!
Options include, for example: resampling the training data (oversampling the minority class or undersampling the majority class), giving the model class weights so errors on the rare class count more, generating synthetic minority examples (e.g., SMOTE), and evaluating with metrics such as precision, recall, and F1 instead of plain accuracy.
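For instance, many scikit-learn classifiers accept class_weight='balanced'; a minimal sketch on synthetic imbalanced data (the data and model choice here are assumptions, not from the slides):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic, imbalanced data: roughly 90% class 0 and 10% class 1
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)

# class_weight='balanced' reweights classes inversely to their frequency,
# so the minority class is not ignored during training.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, y)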