
Unsupervised Learning

Dr. Dinesh Kumar Vishwakarma

Professor,

Department of Information Technology, Delhi Technological University, Delhi-110042


Outline

  • Basic concepts
  • K-means algorithm
  • Representation of clusters
  • Hierarchical clustering
  • Which clustering algorithm to use?
  • Cluster evaluation
  • Summary


Supervised vs. unsupervised learning

  • Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
    • These patterns are then utilized to predict the values of the target attribute in future data instances.
  • Unsupervised learning: The data have no target attribute.
    • We want to explore the data to find some intrinsic structures in them.


Clustering

  • Clustering is a technique for finding similarity groups in data, called clusters, i.e.,
    • it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
  • Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
  • In fact, association rule mining is also unsupervised.


What is a Cluster?

  • The organization of unlabeled data into similarity groups called ‘clusters’.
  • A cluster is a collection of data items that are “similar” to one another and “dissimilar” to data items in other clusters.
  • The data set has three natural groups of data points, i.e., 3 natural clusters.


What is clustering for?

  • Example 1: group people of similar sizes together to make “small”, “medium”, and “large” T-shirts.
    • Tailor-made for each person: too expensive
    • One-size-fits-all: does not fit all.
  • Example 2: In marketing, segment customers according to their similarities
    • To do targeted marketing.
    • e.g., segmenting customers by residential area.


What is clustering for? (cont…)

  • Example 3: Given a collection of text documents, we want to organize them according to their content similarities,
    • To produce a topic hierarchy
  • In fact, clustering is one of the most widely used data mining techniques.
    • It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
    • In recent years, due to the rapid growth of online documents, text clustering has become important.


What is clustering for? (cont…)

  • Example 4: Computer vision, e.g., grouping similar pixels or regions for image segmentation.


What do we need for Clustering?

  • A proximity measure: a notion of distance (d) or similarity (s) between data points or groups of points.


(Figure: the grouping with large d and small s is labelled ‘Good’; the grouping with small d and large s is labelled ‘Bad’.)


Distance Measurement



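A brief sketch of proximity measures commonly used in clustering (the specific measures and function names here are my own choice for illustration, not taken from the slide):

```python
# Common proximity measures used in clustering (illustrative sketch only).
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # cosine of the angle between the vectors (a similarity, not a distance)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([185.0, 72.0]), np.array([170.0, 56.0])
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```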


Clustering Example

  • Example: nine data points X1, …, X9 are grouped into three clusters:
    • C1 = {X7, X8}
    • C2 = {X1, X3, X5, X6}
    • C3 = {X2, X4, X9}


Aspects of Clustering

  • A clustering algorithm
    • Partitional clustering
    • Hierarchical clustering
  • A distance (similarity, or dissimilarity) function
  • Clustering quality
    • Inter-cluster distance ⇒ maximized
    • Intra-cluster distance ⇒ minimized
  • The quality of a clustering result depends on the algorithm, the distance function, and the application.


Hierarchical vs Partitional Clustering

  • A distinction among different types of clustering is whether the set of clusters is ‘nested’ or ‘unnested’.
  • A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  • A hierarchical clustering is a set of nested clusters that are organized as a tree.


K-means clustering

  • K-means is a partitional clustering algorithm.
  • Let the set of data points (or instances) D be {x1, x2, …, xn},
    • where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ ℝ^r, and r is the number of attributes (dimensions) in the data.
  • The k-means algorithm partitions the given data into k clusters.
    • Each cluster has a cluster center, called centroid.
    • k is specified by the user.


K-means algorithm

  • Given k, the k-means algorithm works as follows (a code sketch follows the steps):
    1. Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
    2. Assign each data point to the closest centroid.
    3. Re-compute the centroids using the current cluster memberships.
    4. If a convergence criterion is not met, go to step 2.
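A minimal NumPy sketch of these four steps (the function name, parameter defaults, and the use of Euclidean distance are assumptions made for illustration, not taken from the slides):

```python
# Minimal k-means sketch following the four steps above (illustrative only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points (seeds) as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each data point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer change (a convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```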



Stopping/convergence criterion

  1. No (or minimum) re-assignments of data points to different clusters,
  2. No (or minimum) change of centroids, or
  3. Minimum decrease in the sum of squared error (SSE),

    • Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.


$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2 \qquad (1)$$


An example


(Figure: one run of k-means. (A) Random selection of k centres; Iteration 1: (B) cluster assignment, (C) re-compute centroids; Iteration 2: (D) cluster assignment, (E) re-compute centroids; Iteration 3: (F) cluster assignment.)


Example

  • Step 1: Randomly initialize the cluster centres.


  • Step 2: Determine the cluster membership of each input point.


Example…

  • Step 3: Re-estimate the cluster centres.


  • Result of first iteration


Example…

  • Second iteration


  • Result of Second iteration


Example of K-Means Clustering

  • Consider the following data table:


Height (X)    Weight (Y)
185           72
170           56
168           60
179           68
182           72
188           77
180           71
180           70
183           84
180           88
180           67
177           76

  • Assume two clusters (K1, K2) with initial centroids K1 = {185, 72} and K2 = {170, 56}.
  • Euclidean distance for data point 3, i.e., (168, 60):
    • d(3, K1) = √((168 − 185)² + (60 − 72)²) = √(289 + 144) = √433 ≈ 20.81
    • d(3, K2) = √((168 − 170)² + (60 − 56)²) = √(4 + 16) = √20 ≈ 4.47
  • Hence, data point 3 belongs to K2.
  • New centroid for K2, i.e., the mean of the points assigned to it so far (points 2 and 3):
    • ((170 + 168)/2, (56 + 60)/2) = (169, 58)
  • A code sketch of this example follows.
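The worked example can also be run end to end in code. A sketch assuming scikit-learn is available (note that standard k-means recomputes centroids only after all points are assigned, so intermediate centroids may differ from the point-by-point hand computation above):

```python
# Running k-means on the height/weight table with the slide's initial centroids.
import numpy as np
from sklearn.cluster import KMeans

data = np.array([
    [185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
    [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76],
])
init_centroids = np.array([[185, 72], [170, 56]])  # K1 and K2 from the slide

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(data)
print(km.labels_)           # cluster index assigned to each data point
print(km.cluster_centers_)  # final centroids after convergence
```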


Strengths of k-means

  • Strengths:
    • Simple: easy to understand and to implement
    • Efficient: Time complexity: O(tkn),

where n is the number of data points,

k is the number of clusters, and

t is the number of iterations.

    • Since both k and t are usually small, k-means is considered a linear-time algorithm.
  • K-means is the most popular clustering algorithm.
  • Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to the complexity of the problem.


Weaknesses of k-means

  • The algorithm is only applicable if the mean is defined.
    • For categorical data, use k-modes, where the centroid is represented by the most frequent values.
  • The user needs to specify k.
  • The algorithm is sensitive to outliers
    • Outliers are data points that are very far away from other data points.
    • Outliers could be errors in the data recording or some special data points with very different values.


Weaknesses of k-means…

  • Problems with outliers


(Figure: (A) an undesirable clustering caused by an outlier; (B) the ideal clustering.)


Weaknesses of k-means...

  • To deal with outliers
  • One method is to remove some data points in the clustering process that are much further away from the centroids than other data points.
    • To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
  • Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
    • Assign the rest of the data points to the clusters by distance or similarity comparison, or classification


Weaknesses of k-means (cont …)

  • The algorithm is sensitive to initial seeds.
  • If we use different (better) seeds, we may get good results; the outcome depends heavily on the initialization.


Weaknesses of k-means (cont …)

  • Special cluster shapes: the k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).


(Figure: two natural, non-spherical clusters vs. the clusters found by k-means.)


K-means summary

  • Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency, and because
    • other clustering algorithms have their own lists of weaknesses.
  • No clear evidence that any other clustering algorithm performs better in general
    • although they may be more suitable for some specific types of data or applications.
  • Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!



Common ways to represent clusters

  • Use the centroid of each cluster to represent the cluster.
    • Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
    • The centroid representation alone works well if the clusters are of hyper-spherical shape.
    • If clusters are elongated or of other shapes, centroids are not sufficient.


Using a classification model

  • All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID.
    • Run a supervised learning algorithm on the data to find a classification model; a sketch follows.
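A sketch of this idea, assuming scikit-learn (the data, parameters, and the choice of a decision tree are illustrative only): cluster the data, fit a decision tree that predicts the cluster ID, and read off its rules as a description of each cluster.

```python
# Representing clusters with a classification model (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(200, 2)                               # some unlabeled data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # cluster IDs used as class labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree))    # human-readable rules describing each cluster
```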


Use frequent values to represent cluster

  • This method is mainly for clustering of categorical data (e.g., k-modes clustering).
  • It is also the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.


Clusters of arbitrary shapes

  • Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads.
  • Irregular shape clusters are hard to represent. They may not be useful in some applications.
    • Using centroids is not suitable in general (upper figure).
    • K-means clusters may be more useful (lower figure), e.g., for making 2 size T-shirts.


Hierarchical Clustering

  • Previously, a ‘flat’ clustering was considered.

  • For some data, hierarchical clustering is more appropriate than ‘flat’ clustering.


Why Hierarchical Clustering?



Types of hierarchical clustering

  • Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and
    • merges the most similar (or nearest) pair of clusters
    • stops when all the data points are merged into a single cluster (i.e., the root cluster).
  • Divisive (top down) clustering: It starts with all data points in one cluster, the root.
    • Splits the root into a set of child clusters. Each child cluster is recursively divided further
    • stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point


Agglomerative approach


Bottom-up.

Initialization: each object is a cluster.

Iteration: merge the two clusters that are most similar to each other, until all objects are merged into a single cluster.

(Figure: objects a, b, c, d, e are merged step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}.)

Divisive Approaches

Top-down.

Initialization: all objects stay in one cluster.

Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object.

(Figure: the cluster {a, b, c, d, e} is split step by step into {a, b} and {c, d, e}, then {c} and {d, e}, down to singleton clusters.)

Dendrogram

  • A binary tree that shows how clusters are merged/split hierarchically
  • Each node on the tree is a cluster; each leaf node is a singleton cluster

(Figure: a dendrogram; cutting it at different heights yields 9, 6, 4, 3, or 2 clusters.)


How to Merge Clusters?

  • How to measure the distance between clusters?


  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

Hint: the distance between clusters is usually defined on the basis of the distance between individual objects.


How to Define Inter-Cluster Distance

  • Single-link: the distance between two clusters is the distance between the closest pair of data objects belonging to different clusters.
  • Complete-link: the distance between two clusters is the distance between the farthest pair of data objects belonging to different clusters.
  • Average-link: the distance between two clusters is the average distance over all pairs of data objects belonging to different clusters.
  • Centroid distance: the distance between two clusters is the distance between the means (centroids) m_i and m_j of clusters C_i and C_j.

A code sketch of these four distances follows.
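An illustrative NumPy sketch of the four inter-cluster distances just defined (the function names are mine; Euclidean distance between individual objects is assumed):

```python
# Inter-cluster distances between two clusters A and B (one point per row).
import numpy as np

def pairwise(A, B):
    # matrix of Euclidean distances between every point of A and every point of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):        # distance of the closest pair
    return pairwise(A, B).min()

def complete_link(A, B):      # distance of the farthest pair
    return pairwise(A, B).max()

def average_link(A, B):       # average distance over all pairs
    return pairwise(A, B).mean()

def centroid_distance(A, B):  # distance between the cluster means
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```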


Example: Agglomerative Hierarchical Clustering

  • For the following data set, we will get different clustering results with the single-link and complete-link algorithms.

(Figure: six data points, labelled 1–6; the single-link and complete-link algorithms group them into different clusters.)

Hierarchical Clustering: Comparison

(Figure: the clusterings of the same six points produced by the single-link, complete-link, average-link, and centroid-distance criteria.)

Comparison of Dendrograms

(Figure: the dendrograms produced by the single-link, complete-link, average-link, and centroid-distance criteria for the same six points; the merge orders differ.)

Example: Clustering using Single Link

(Figure: a step-by-step single-link agglomerative clustering of a small data set, shown over several merge steps.)

Agglomerative clustering

It is more popular than divisive methods.

  • At the beginning, each data point forms a cluster (also called a node).
  • Merge the nodes/clusters that have the least distance between them.
  • Continue merging.
  • Eventually, all nodes belong to one cluster.


Agglomerative clustering algorithm
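A sketch of what such an algorithm looks like in practice, using SciPy's hierarchy utilities (SciPy and Matplotlib are assumed to be available; this is not the slide's own listing):

```python
# Agglomerative clustering and dendrogram with SciPy (illustrative sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)                         # 20 random 2-D points
Z = linkage(X, method='single')                   # try 'complete', 'average', 'centroid'

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                     # visualize the merge hierarchy
plt.show()
```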


The complexity

  • All the algorithms are at least O(n²), where n is the number of data points.
  • Single link can be done in O(n²).
  • Complete and average links can be done in O(n² log n).
  • Due to this complexity, hierarchical methods are hard to use for large data sets; remedies include:
    • Sampling
    • Scale-up methods (e.g., BIRCH).



How to choose a clustering algorithm

  • Clustering research has a long history. A vast collection of algorithms are available.
    • We only introduced a few key algorithms.
  • Choosing the “best” algorithm is a challenge.
    • Every algorithm has limitations and works well with certain data distributions.
    • It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any “ideal” structure or distribution required by the algorithms.
    • One also needs to decide how to standardize the data, to choose a suitable distance function and to select other parameter values.


Choose a clustering algorithm (cont …)

  • Due to these complexities, the common practice is to
    • run several algorithms using different distance functions and parameter settings, and
    • then carefully analyze and compare the results.
  • The interpretation of the results must be based on insight into the meaning of the original data together with knowledge of the algorithms used.
  • Clustering is highly application dependent and, to a certain extent, subjective (personal preferences).



Cluster Evaluation: hard problem

  • The quality of a clustering is very hard to evaluate because
    • We do not know the correct clusters
  • Some methods are used:
    • User inspection
      • Study centroids, and spreads
      • Rules from a decision tree.
      • For text documents, one can read some documents in clusters.


Cluster evaluation: ground truth

  • We use some labeled data (originally meant for classification).
  • Assumption: each class corresponds to a cluster.
  • After clustering, a confusion matrix is constructed. From the matrix, we compute various measures: entropy, purity, precision, recall, and F-score.
    • Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divides D into k disjoint subsets, D1, D2, …, Dk.


Evaluation measures: Entropy
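A standard definition of cluster entropy, consistent with the notation of the previous slide (Pr_i(c_j) denotes the proportion of class c_j data points in cluster D_i):

$$\mathrm{entropy}(D_i) = -\sum_{j=1}^{k} \Pr\nolimits_i(c_j)\,\log_2 \Pr\nolimits_i(c_j)$$

$$\mathrm{entropy}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{entropy}(D_i)$$

A smaller entropy means the cluster is purer, i.e., dominated by a single class.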


Evaluation measures: purity
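A standard definition of purity, using the same notation:

$$\mathrm{purity}(D_i) = \max_{j} \Pr\nolimits_i(c_j), \qquad \mathrm{purity}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{purity}(D_i)$$

A purity of 1 means every point in the cluster belongs to the same class.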


An example

  • Assume we have a text collection D of 900 documents from three topics (or three classes): Science, Sports, and Politics. Each class has 300 documents, and each document in D is labeled with one of the topics (classes). We use this collection to perform clustering to find three clusters. Class/topic labels are not used in clustering. After clustering, we want to measure the effectiveness of the clustering algorithm.
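A sketch of how purity and entropy would be computed for this scenario from the cluster-vs-class confusion matrix. The matrix below is hypothetical, made up purely to illustrate the computation (rows are clusters, columns are the classes Science, Sports, Politics):

```python
# Computing total purity and entropy from a hypothetical confusion matrix.
import numpy as np

conf = np.array([
    [250,  20,  10],   # cluster 1 (hypothetical counts)
    [ 30, 260,  20],   # cluster 2
    [ 20,  20, 270],   # cluster 3
])
n = conf.sum()                                  # 900 documents in total

cluster_sizes = conf.sum(axis=1)
probs = conf / cluster_sizes[:, None]           # Pr_i(c_j) for each cluster

purity_per_cluster = probs.max(axis=1)
entropy_per_cluster = -(np.where(probs > 0, probs * np.log2(probs), 0)).sum(axis=1)

total_purity = (cluster_sizes / n) @ purity_per_cluster
total_entropy = (cluster_sizes / n) @ entropy_per_cluster
print(total_purity, total_entropy)
```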


A remark about ground truth evaluation

  • Commonly used to compare different clustering algorithms.
  • A real-life data set for clustering has no class labels.
    • Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand.
  • The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm.
  • This evaluation method is said to be based on external data or information.


Evaluation based on internal information

  • Intra-cluster cohesion (compactness):
    • Cohesion measures how near the data points in a cluster are to the cluster centroid.
    • Sum of squared error (SSE) is a commonly used measure.
  • Inter-cluster separation (isolation):
    • Separation means that different cluster centroids should be far away from one another.
  • In most applications, expert judgments are still the key.
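These two criteria can be computed directly; a sketch assuming scikit-learn, using the SSE (inertia) for cohesion and the silhouette score, which combines cohesion and separation, on some illustrative random data:

```python
# Internal evaluation measures (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
km = KMeans(n_clusters=3, n_init=10).fit(X)

print("SSE (inertia):", km.inertia_)                    # lower = more compact clusters
print("Silhouette:", silhouette_score(X, km.labels_))   # closer to 1 = better separated
```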


Indirect evaluation

  • In some applications, clustering is not the primary task, but used to help perform another task.
  • We can use the performance on the primary task to compare clustering methods.
  • For instance, in an application, the primary task is to provide recommendations on book purchasing to online shoppers.
    • If we can cluster books according to their features, we might be able to provide better recommendations.
    • We can evaluate different clustering algorithms based on how well they help with the recommendation task.
    • Here, we assume that the recommendation can be reliably evaluated.



Summary

  • Clustering has a long history and is still an active research area.
    • There are a huge number of clustering algorithms.
    • More are still coming every year.
  • We only introduced several main algorithms. There are many others, e.g.,
    • density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
  • Clustering is hard to evaluate, but very useful in practice. This partially explains why a large number of clustering algorithms are still being devised every year.
  • Clustering is highly application dependent and to some extent subjective.
