1 of 34

Unsupervised Learning

“Clustering”

-Fardina Fathmiul Alam

CMSC 320 - Introduction to Data Science

2026

2 of 34

3 of 34

When Dealing with Real-World Problems

Most of the time, data does not come with predefined labels!!!

What can we do?

We can develop machine learning models that can correctly classify unlabeled data by autonomously identifying commonalities in the features. These commonalities are then used to predict classes for new data.

Unsupervised Learning

4 of 34

Definition: Unsupervised Learning

Unsupervised learning, also known as unsupervised machine learning, uses machine learning (ML) algorithms to analyze and cluster unlabeled data sets.

  • These algorithms discover hidden patterns or data groupings without the need for human intervention. E.g. Clustering.

The main goal of these types of algorithms is to study the intrinsic and hidden structure of the data in order to get meaningful insights, segment the datasets in similar groups or to simplify them

5 of 34

Some Applications

  • Cluster Analysis: Segmenting datasets by shared attributes to identify natural groupings or clusters within the data.

  • Dimensionality Reduction: Simplifying datasets by aggregating variables with similar attributes, thereby reducing the number of dimensions while preserving the essential structure of the data.

  • Anomaly Detection: Identifying anomalies or outliers in the data that do not conform to the patterns exhibited by the majority of the dataset.

6 of 34

What can we Cluster in Practice?

  • News articles or web pages by topic
  • Protein sequences by function, or genes according to expression profile
  • Users of social networks by interest
  • Customers according to purchase history

7 of 34

Clustering

Cluster analysis is a powerful unsupervised learning technique used to identify natural groupings or clusters within datasets.

  • Group similar data points or objects into clusters or categories.
  • By grouping data points with similar characteristics, cluster analysis helps reveal hidden patterns, trends, and relationships.

8 of 34

Objectives of “Clustering”

Finding groups of objects such that the objects within the same group are similar or related to one another, while being different from or unrelated to the objects in other groups.

  • Intra-Cluster Distance are Minimized: Group objects (observations) that are very similar or homogeneous together (in same cluster).
  • Inter-Cluster Distance are Maximized: Observations belonging to different clusters are different or heterogeneous
  • Facilitates interpretation.

8

9 of 34

Most Common Clustering Algorithms

  • K-Means
  • Hierarchical Clustering
  • Density Based Scan Clustering (DBSCAN)
  • Gaussian Clustering Model

10 of 34

KMEANS CLUSTERING

11 of 34

K-Means

  • As unsupervised learning algorithm
  • An algorithm designed to assign clusters to each of your datapoints
  • You begin with k, how many clusters you expect
  • The algorithm randomly starts, then converges

12 of 34

In Clustering, Quick Recap

13 of 34

K-Means Clustering

Algorithm: steps

14 of 34

How K-Means Works

This is my Training Data

15 of 34

How K-Means Works

Let’s Number of K=3

Initialize 3 Centroids randomly

2

1

Step 3.1 Calculate the distance between each data point X and centroids

Assign Data Points to the Nearest Cluster (2 STEPS)

3

Step 3.2 Each point joins the closest/nearest cluster (based on its minimum distance to the centroid).

16 of 34

Calculate the distance between each data point X and centroids and Each point joins the closest/nearest cluster (based on its minimum distance to the centroid).

Assign Data Points to the Nearest Cluster (2 STEPS)

3

Re-initialize the centroids by calculating the average of all data points of that cluster.

4

‘Nᵢ’ represents the number of data points Xᵢ in ith cluster Cᵢ.

In figure, for Red cluster,

Cᵣₑ: (x’,y’)= ( (x1+x2+...+x5)/5 , (y1+y2+...+y5)/5)

where Nᵢ =5

“PREVIOUS STEP FROM LAST SLIDE”

17 of 34

How does K-means work?

Repeat steps 3 and 4 until convergence

Repeat Steps 3 and 4 until optimal centroids and the assignments of data points to correct clusters are not changing anymore.

17

Figure: Repeat Step 3 and 4 until Convergence

5

18 of 34

2

19 of 34

Convergence/ Stopping Criteria for K-Means Clustering

  • Centroids of newly formed clusters do not change → not learning any new pattern.
  • Points remain in the same cluster.
  • Maximum number of iterations are reached → how many times you want to run.

19

20 of 34

Important Question to Consider

How to determine K ?

20

21 of 34

How to choose K?

  • If K is very large → a cluster for every data point.
    • May defeats the purpose of clustering ( assign cluster for every data point)
  • If K = 1
    • One big cluster for the entire data set.

21

To select an ideal K, ensure that the identified clusters are distinct from each other.

    • The distance from each point to its cluster's center is much smaller than the distance between the centers of different clusters.

22 of 34

The Objective of K-Means Clustering

22

Optimization function for k-mean clustering

“Minimize total intra-cluster variance, or Within-Cluster Sum of Square (WCSS)or Inertia

Where WCSS= sum of squares of the distances of each data point in all clusters to their respective centroids (WCSS).

Assumptions:

  • Each observation belongs to at least one of the K clusters
  • The clusters do not overlap
  • Variation within each cluster is minimized.

23 of 34

Finding the optimal number of clusters K - Elbow Method

  • X axis: vary the number of clusters K
  • Y axis -> for each value of K, calculate WCSS ( Within-Cluster Sum of Square ).

23

Observe:

  • As the value of K increases, the sum of square of distances decreases for every step.
  • But this decreases is very fast for the initial values of K, and then it slows down. This defines the right value of K.

Another Method: The silhouette score method to determine the appropriate number of clusters

24 of 34

Finding the optimal number of clusters K - Elbow Method

** Visualization Courtesy: Gavin Hung, Former CMSC320 Student

This is an unsupervised clustering algorithm that assigns points to a centroid

25 of 34

K-Means Properties

26 of 34

27 of 34

28 of 34

SOME SOLUTIONS COULD BE

Run K-Means multiple times with different initializations

  • Multiple runs (e.g., 10-50 times) of K-Means with different random centroid seeds.
  • Choose the clustering result with the lowest total within-cluster variance (inertia).

Use K-Means++ initialization

  • An improved method to initialize centroids more strategically rather than randomly.
  • Spreads out the initial centroids to better cover the space and reduce chances of poor local minima.

Try alternative clustering algorithms

  • consider Hierarchical Clustering or DBSCAN that do not rely on random centroid initialization

29 of 34

Hierarchical Clustering

Builds a hierarchy of clusters represented as a tree (dendrogram).

  • Does not require pre-specifying the number of clusters (k).
  • Two types:
    • Agglomerative (bottom-up): Start with each point as its own cluster and merge closest clusters step-by-step.
    • Divisive (top-down): Start with one cluster and recursively split.

Key Concepts:

  • Linkage methods: Determine how to measure distance between clusters�
    • Single linkage (min distance)
    • Complete linkage (max distance)
    • Average linkage (average distance)

30 of 34

Hierarchical Clustering

31 of 34

DBSCAN (Density-Based Spatial Clustering)

Density-based clustering algorithm that groups points closely packed together.�

Can find clusters of arbitrary shape and identify noise/outliers.

Key Parameters:

  • ε (epsilon): Radius to search for neighboring points.�
  • minPts: Minimum number of points required in an ε-neighborhood to form a dense region.

32 of 34

DBSCAN: Core, Border & Noise Points

Core point: Forms & expands clusters; has ≥ minPts neighbors within ε.�

Border point: Defines cluster edges (Within ε of a core point but fewer than minPts neighbors.)�

Noise: Neither core nor border point. Example: Isolated point, not in any group

Why Both Core & Border?

  • Core: Ensures dense clusters grow
  • Border: Smooths edges, avoids abrupt cuts�→ Together, they find natural-shaped clusters!

33 of 34

Hard vs Soft Clustering Algorithms

  • K-Means
  • Hierarchical Clustering
  • Density Based Scan Clustering (DBSCAN)
  • Gaussian Clustering Model: Gaussian Mixture Models (GMM)

Hard Clustering

Each data point belongs to exactly one cluster (or is marked as noise in DBSCAN).

Soft Clustering

Each data point can belong to multiple clusters with probabilities

Even though DBSCAN and Hierarchical are more flexible in shape and structure than K-Means, they still assign points deterministically, so they fall under hard clustering.

34 of 34

Conclusion

  • Clustering is a powerful unsupervised learning technique for discovering natural groupings in data.
  • While K-Means is simple and fast, algorithms like Hierarchical Clustering and DBSCAN provide more flexibility to handle complex data shapes and noise.
  • Choosing the right method depends on the data characteristics and the analysis goals.