Unsupervised Learning
“Clustering”
-Fardina Fathmiul Alam
CMSC 320 - Introduction to Data Science
2026
When Dealing with Real-World Problems
Most of the time, data does not come with predefined labels!!!
What can we do?
We can develop machine learning models that group unlabeled data by autonomously identifying commonalities in the features. These commonalities are then used to predict the group (class) of new data.
Unsupervised Learning
Definition: Unsupervised Learning
Unsupervised learning, also known as unsupervised machine learning, uses machine learning (ML) algorithms to analyze and cluster unlabeled data sets.
The main goal of these algorithms is to study the intrinsic, hidden structure of the data in order to gain meaningful insights, segment the data sets into groups of similar items, or simplify them.
Some Applications
What can we Cluster in Practice?
Clustering
Cluster analysis is a powerful unsupervised learning technique used to identify natural groupings or clusters within datasets.
Objectives of “Clustering”
Finding groups of objects such that the objects within the same group are similar or related to one another, while being different from or unrelated to the objects in other groups.
Most Common Clustering Algorithms
KMEANS CLUSTERING
K-Means
In Clustering, Quick Recap
K-Means Clustering
Algorithm: steps
How K-Means Works
This is my Training Data
How K-Means Works
Step 1: Let the number of clusters be K = 3.
Step 2: Initialize 3 centroids randomly.
Step 3: Assign Data Points to the Nearest Cluster (2 STEPS)
Step 3.1 Calculate the distance between each data point X and every centroid.
Step 3.2 Each point joins the nearest cluster (the one whose centroid is at the minimum distance).
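The assignment step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and 2-D points; the function name `assign_to_nearest` and the toy data are hypothetical, not taken from the slides.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Step 3.1: Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        # Step 3.2: the point joins the cluster with the smallest distance
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical toy data: six 2-D points and K = 3 centroids
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 9.0), (2.0, 8.5)]
centroids = [(1.0, 2.0), (8.0, 9.0), (2.0, 9.0)]
labels = assign_to_nearest(points, centroids)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

If two centroids are equally close, this sketch picks the one with the lower index; other tie-breaking rules are equally valid.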
Step 4: Update each centroid by calculating the average (mean) of all data points in that cluster.
Nᵢ is the number of data points in the i-th cluster Cᵢ.
In the figure, for the red cluster:
C_red: (x', y') = ((x1 + x2 + ... + x5)/5, (y1 + y2 + ... + y5)/5)
where Nᵢ = 5.
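The centroid update is just a per-coordinate mean. A minimal sketch follows, assuming points are tuples and labels are cluster indices; `update_centroids` and the sample points are hypothetical, chosen so that the first cluster has five members like the red cluster in the figure.

```python
def update_centroids(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            # Assumption: an empty cluster has no mean, so this sketch returns None for it
            new_centroids.append(None)
            continue
        n_i = len(members)  # N_i, the number of points in cluster i
        # zip(*members) groups all x values together, then all y values
        new_centroids.append(tuple(sum(coord) / n_i for coord in zip(*members)))
    return new_centroids

# Cluster 0 has five members, cluster 1 has two
points = [(1, 1), (2, 1), (3, 2), (1, 3), (3, 3), (8, 8), (10, 10)]
labels = [0, 0, 0, 0, 0, 1, 1]
centroids = update_centroids(points, labels, k=2)
print(centroids)  # [(2.0, 2.0), (9.0, 9.0)]
```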
How does K-means work?
Repeat steps 3 and 4 until convergence
Repeat Steps 3 and 4 until the centroids and the assignment of data points to clusters no longer change.
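Putting the steps together, the whole loop might look like the sketch below. It assumes Euclidean distance, initial centroids drawn from the data points, and "no assignment changed" as the stopping rule; the `kmeans` function, its `max_iters` and `seed` parameters, and the toy data are illustrative choices rather than the course's reference implementation.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 3: assign every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        # Convergence check: stop once no assignment changes
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The `max_iters` cap is a safety net: K-Means always converges in a finite number of steps, but a cap keeps the run time bounded on large data sets.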
Figure: Repeat Step 3 and 4 until Convergence
Convergence/ Stopping Criteria for K-Means Clustering
Important Question to Consider
How to determine K?
How to choose K?
To select an ideal K, ensure that the identified clusters are distinct from each other.
The Objective of K-Means Clustering
Optimization function for K-Means clustering
Minimize the total intra-cluster variance, also called the Within-Cluster Sum of Squares (WCSS) or inertia.
WCSS is the sum, over all clusters, of the squared distances between each data point and the centroid of its cluster:
WCSS = Σᵢ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖², where μᵢ is the centroid of cluster Cᵢ.
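To make the objective concrete, here is a small sketch that computes WCSS for a given clustering. The function name `wcss` and the four sample points are hypothetical; each point was placed exactly one unit from its centroid so the result is easy to check by hand.

```python
def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        c = centroids[lab]
        # squared distance = sum of squared coordinate differences
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

points = [(1.0, 1.0), (3.0, 1.0), (8.0, 8.0), (8.0, 10.0)]
labels = [0, 0, 1, 1]
centroids = [(2.0, 1.0), (8.0, 9.0)]
print(wcss(points, labels, centroids))  # 4.0
```

A lower WCSS means tighter clusters, which is why K-Means tries to minimize it.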
Assumptions:
Finding the optimal number of clusters K - Elbow Method
Observe:
Another Method: The silhouette score method to determine the appropriate number of clusters
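As a rough illustration of the silhouette idea: for each point, compare a (its mean distance to the other points in its own cluster) with b (its mean distance to the points of the nearest other cluster), and score it as (b − a) / max(a, b). The sketch below assumes Euclidean distance, at least two clusters, and a score of 0 for single-point clusters; the function name and data are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # groups mixed across the gap
print(round(good, 3), round(bad, 3))
```

Scores close to +1 indicate compact, well-separated clusters, and negative scores suggest points sit in the wrong cluster. To choose K, one would compute the score for several values of K and prefer the highest.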
Finding the optimal number of clusters K - Elbow Method
** Visualization Courtesy: Gavin Hung, Former CMSC320 Student
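The elbow method can be sketched by running K-Means for several values of K and recording the WCSS each time. The code below is a self-contained illustration under stated assumptions: Euclidean distance, random initialization from the data points, and the best of a few restarts for each K. The helper names `kmeans_wcss` and `elbow_curve` and the toy data are hypothetical.

```python
import math
import random

def kmeans_wcss(points, k, rng, max_iters=100):
    """Run one K-Means fit and return its WCSS (inertia)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def elbow_curve(points, k_values, restarts=10, seed=0):
    rng = random.Random(seed)
    # for each K, keep the lowest WCSS found over several random restarts
    return {
        k: min(kmeans_wcss(points, k, rng) for _ in range(restarts))
        for k in k_values
    }

# Hypothetical toy data: three visually distinct groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
inertias = elbow_curve(points, range(1, 6))
for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting WCSS against K, the "elbow" is the point after which adding more clusters gives only small reductions in WCSS.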
K-Means is an unsupervised clustering algorithm that assigns each point to its nearest centroid.
K-Means Properties
SOME SOLUTIONS COULD BE
Run K-Means multiple times with different initializations
Use K-Means++ initialization
Try alternative clustering algorithms
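For the K-Means++ option, the core idea is to spread the initial centroids out: the first is chosen uniformly at random, and each subsequent one is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The following is a minimal sketch of that seeding step only, assuming distinct data points; `kmeans_pp_init` is a hypothetical name.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    # first centroid: uniformly at random from the data
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance from each point to its nearest chosen centroid
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # far-away points are more likely to be picked; already chosen
        # points have weight 0 and cannot be picked again
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
init = kmeans_pp_init(points, k=3)
print(init)
```

The returned centroids would replace the purely random initialization before the usual assign-and-update loop. Running the whole algorithm several times with different seeds and keeping the result with the lowest WCSS covers the first suggestion on the list.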
Hierarchical Clustering
Builds a hierarchy of clusters represented as a tree (dendrogram).
Key Concepts:
Hierarchical Clustering
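One common way to build such a hierarchy is bottom-up (agglomerative): start with every point as its own cluster and repeatedly merge the two closest clusters. The sketch below assumes single linkage (cluster distance = distance between the closest pair of points), which is only one of several possible linkage rules; the function name and data are hypothetical. The recorded merges are the information a dendrogram displays.

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering; returns final clusters and the merge history."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # one dendrogram join
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, n_clusters=2)
print(clusters)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

Cutting the tree at a different height (a different `n_clusters`) gives a coarser or finer grouping without rerunning from scratch in a real implementation.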
DBSCAN (Density-Based Spatial Clustering)
Density-based clustering algorithm that groups points closely packed together.
Can find clusters of arbitrary shape and identify noise/outliers.
Key Parameters:
DBSCAN: Core, Border & Noise Points
Core point: Forms & expands clusters; has ≥ minPts neighbors within ε.
Border point: Defines cluster edges (within ε of a core point, but has fewer than minPts neighbors).
Noise: Neither a core nor a border point. Example: an isolated point, not in any group.
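These three definitions translate almost directly into code. The sketch below only labels points as core, border, or noise; it does not build the clusters themselves. It assumes Euclidean distance and that a point counts as its own neighbor; `classify_points` and the sample data are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # indices of all points within eps of each point (the point itself included)
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # near a core point, but not dense itself
        else:
            kinds.append("noise")
    return kinds

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
kinds = classify_points(points, eps=1.5, min_pts=4)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here the four corner points are dense enough to be core points, (2.2, 1) is within ε of the core point (1, 1) but has too few neighbors of its own, and (8, 8) is isolated.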
Why Both Core & Border?
Hard vs Soft Clustering Algorithms
Hard Clustering
Each data point belongs to exactly one cluster (or is marked as noise in DBSCAN).
Soft Clustering
Each data point can belong to multiple clusters, with a probability (degree of membership) for each.
Even though DBSCAN and Hierarchical clustering are more flexible in shape and structure than K-Means, they still assign each point to exactly one cluster (or to noise), so they fall under hard clustering.
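The difference can be seen on a single point. In the sketch below, the hard assignment returns one cluster index, while the soft assignment returns one weight per cluster that sums to 1. The soft weights use a softmax over negative squared distances, which is an illustrative assumption and not a specific algorithm from these slides; both function names and the `temperature` parameter are hypothetical.

```python
import math

def hard_assignment(point, centroids):
    """Hard clustering: the point belongs to exactly one cluster."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def soft_assignment(point, centroids, temperature=1.0):
    """Soft clustering: one membership probability per cluster."""
    # closer centroids get larger (less negative) scores
    logits = [-math.dist(point, c) ** 2 / temperature for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

centroids = [(0.0, 0.0), (2.0, 0.0)]
point = (0.8, 0.0)
hard = hard_assignment(point, centroids)
probs = soft_assignment(point, centroids)
print(hard, [round(p, 3) for p in probs])
```

The point is slightly closer to the first centroid, so the hard label is 0, while the soft result keeps a smaller but nonzero membership in the second cluster.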
Conclusion