
Unsupervised Learning

Dr. Dinesh Kumar Vishwakarma

Professor,

Department of Information Technology, Delhi Technological University, Delhi-110042


Outline

  • Basic concepts
  • K-means algorithm
  • Representation of clusters
  • Hierarchical clustering
  • Which clustering algorithm to use?
  • Cluster evaluation
  • Summary


Supervised vs. unsupervised learning

  • Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
    • These patterns are then utilized to predict the values of the target attribute in future data instances.
  • Unsupervised learning: The data have no target attribute.
    • We want to explore the data to find some intrinsic structures in them.


Clustering

  • Clustering is a technique for finding similarity groups in data, called clusters, i.e.,
    • it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
  • Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
  • In fact, association rule mining is also unsupervised.


What is a Cluster?

  • The organization of unlabeled data into similarity groups called ‘clusters’.
  • A cluster is a collection of data items that are “similar” to one another and “dissimilar” to data items in other clusters.
  • The data set has three natural groups of data points, i.e., 3 natural clusters.


What is clustering for?

  • Example 1: group people of similar sizes together to make “small”, “medium”, and “large” T-shirts.
    • Tailor-made for each person: too expensive
    • One-size-fits-all: does not fit all.
  • Example 2: In marketing, segment customers according to their similarities
    • To do targeted marketing.
    • e.g., segmenting customers by residential area.


What is clustering for? (cont…)

  • Example 3: Given a collection of text documents, we want to organize them according to their content similarities,
    • To produce a topic hierarchy
  • In fact, clustering is one of the most widely used data mining techniques.
    • It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
    • In recent years, due to the rapid growth of online documents, text clustering has become important.


What is clustering for? (cont…)

  • Example 4: Computer vision, e.g., grouping similar pixels or regions for image segmentation.


What do we need for Clustering?

  • A proximity measure: a notion of distance (d) or similarity (s) between data points or groups of points.


(Figure: the grouping with large d and small s is labelled ‘Good’; the grouping with small d and large s is labelled ‘Bad’.)


Distance Measurement



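A brief sketch of proximity measures commonly used in clustering (the specific measures and function names here are my own choice for illustration, not taken from the slide):

```python
# Common proximity measures used in clustering (illustrative sketch only).
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # cosine of the angle between the vectors (a similarity, not a distance)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([185.0, 72.0]), np.array([170.0, 56.0])
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```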


Clustering Example

  • Example: nine data points X1, …, X9 are grouped into three clusters:
    • C1 = {X7, X8}
    • C2 = {X1, X3, X5, X6}
    • C3 = {X2, X4, X9}


Aspects of Clustering

  • A clustering algorithm
    • Partitional clustering
    • Hierarchical clustering
  • A distance (similarity, or dissimilarity) function
  • Clustering quality
    • Inter-cluster distance ⇒ maximized
    • Intra-cluster distance ⇒ minimized
  • The quality of a clustering result depends on the algorithm, the distance function, and the application.


Hierarchical vs Partitional Clustering

  • A distinction among different types of clustering is whether the set of clusters is ‘nested’ or ‘unnested’.
  • A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  • A hierarchical clustering is a set of nested clusters that are organized as a tree.


K-means clustering

  • K-means is a partitional clustering algorithm.
  • Let the set of data points (or instances) D be {x1, x2, …, xn},
    • where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ ℝ^r, and r is the number of attributes (dimensions) in the data.
  • The k-means algorithm partitions the given data into k clusters.
    • Each cluster has a cluster center, called centroid.
    • k is specified by the user.


K-means algorithm

  • Given k, the k-means algorithm works as follows (a code sketch follows the steps):
    1. Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
    2. Assign each data point to the closest centroid.
    3. Re-compute the centroids using the current cluster memberships.
    4. If a convergence criterion is not met, go to step 2.
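A minimal NumPy sketch of these four steps (the function name, parameter defaults, and the use of Euclidean distance are assumptions made for illustration, not taken from the slides):

```python
# Minimal k-means sketch following the four steps above (illustrative only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points (seeds) as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each data point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer change (a convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```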



Stopping/convergence criterion

  1. No (or minimum) re-assignments of data points to different clusters,
  2. No (or minimum) change of centroids, or
  3. Minimum decrease in the sum of squared error (SSE),

    • Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.


$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2 \qquad (1)$$


An example


(Figure: one run of k-means. (A) Random selection of k centres; Iteration 1: (B) cluster assignment, (C) re-compute centroids; Iteration 2: (D) cluster assignment, (E) re-compute centroids; Iteration 3: (F) cluster assignment.)


Example

  • Step 1: Randomly initialize the cluster centres.


  • Step 2: Determine the cluster membership of each input point.


Example…

  • Step 3: Re-estimate the cluster centres.


  • Result of first iteration


Example…

  • Second iteration


  • Result of Second iteration


Example of K-Means Clustering

  • Consider the following data table:


Height (X)    Weight (Y)
185           72
170           56
168           60
179           68
182           72
188           77
180           71
180           70
183           84
180           88
180           67
177           76

  • Assume two clusters (K1, K2) with initial centroids K1 = {185, 72} and K2 = {170, 56}.
  • Euclidean distance for data point 3, i.e., (168, 60):
    • d(3, K1) = √((168 − 185)² + (60 − 72)²) = √(289 + 144) = √433 ≈ 20.81
    • d(3, K2) = √((168 − 170)² + (60 − 56)²) = √(4 + 16) = √20 ≈ 4.47
  • Hence, data point 3 belongs to K2.
  • New centroid for K2, i.e., the mean of the points assigned to it so far (points 2 and 3):
    • ((170 + 168)/2, (56 + 60)/2) = (169, 58)
  • A code sketch of this example follows.
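The worked example can also be run end to end in code. A sketch assuming scikit-learn is available (note that standard k-means recomputes centroids only after all points are assigned, so intermediate centroids may differ from the point-by-point hand computation above):

```python
# Running k-means on the height/weight table with the slide's initial centroids.
import numpy as np
from sklearn.cluster import KMeans

data = np.array([
    [185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
    [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76],
])
init_centroids = np.array([[185, 72], [170, 56]])  # K1 and K2 from the slide

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(data)
print(km.labels_)           # cluster index assigned to each data point
print(km.cluster_centers_)  # final centroids after convergence
```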


Strengths of k-means

  • Strengths:
    • Simple: easy to understand and to implement
    • Efficient: Time complexity: O(tkn),

where n is the number of data points,

k is the number of clusters, and

t is the number of iterations.

    • Since both k and t are usually small, k-means is considered a linear-time algorithm.
  • K-means is the most popular clustering algorithm.
  • Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to the complexity of the problem.


Weaknesses of k-means

  • The algorithm is only applicable if the mean is defined.
    • For categorical data, use k-modes, where the centroid is represented by the most frequent values.
  • The user needs to specify k.
  • The algorithm is sensitive to outliers
    • Outliers are data points that are very far away from other data points.
    • Outliers could be errors in the data recording or some special data points with very different values.


Weaknesses of k-means…

  • Problems with outliers


(Figure: (A) an undesirable clustering caused by an outlier; (B) the ideal clustering.)


Weaknesses of k-means...

  • To deal with outliers
  • One method is to remove some data points in the clustering process that are much further away from the centroids than other data points.
    • To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
  • Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
    • Assign the rest of the data points to the clusters by distance or similarity comparison, or classification


Weaknesses of k-means (cont …)

  • The algorithm is sensitive to initial seeds.
  • If we use different (better) seeds, we may get good results; the outcome depends heavily on the initialization.


Weaknesses of k-means (cont …)

  • Special cluster shapes: the k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).


(Figure: two natural, non-spherical clusters vs. the clusters found by k-means.)


K-means summary

  • Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency, and because
    • other clustering algorithms have their own lists of weaknesses.
  • No clear evidence that any other clustering algorithm performs better in general
    • although they may be more suitable for some specific types of data or applications.
  • Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!



Common ways to represent clusters

  • Use the centroid of each cluster to represent the cluster.
    • Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
    • The centroid representation alone works well if the clusters are of hyper-spherical shape.
    • If clusters are elongated or of other shapes, centroids are not sufficient.


Using a classification model

  • All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID.
    • Run a supervised learning algorithm on the data to find a classification model; a sketch follows.
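A sketch of this idea, assuming scikit-learn (the data, parameters, and the choice of a decision tree are illustrative only): cluster the data, fit a decision tree that predicts the cluster ID, and read off its rules as a description of each cluster.

```python
# Representing clusters with a classification model (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(200, 2)                               # some unlabeled data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # cluster IDs used as class labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree))    # human-readable rules describing each cluster
```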


Use frequent values to represent cluster

  • This method is mainly for clustering of categorical data (e.g., k-modes clustering).
  • It is also the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.


Clusters of arbitrary shapes

  • Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads.
  • Irregular shape clusters are hard to represent. They may not be useful in some applications.
    • Using centroids is not suitable in general (upper figure).
    • K-means clusters may be more useful (lower figure), e.g., for making 2 size T-shirts.


Hierarchical Clustering

  • Previously, a ‘flat’ clustering was considered.

  • For some data, hierarchical clustering is more appropriate than ‘flat’ clustering.


Why Hierarchical Clustering?



Types of hierarchical clustering

  • Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and
    • merges the most similar (or nearest) pair of clusters
    • stops when all the data points are merged into a single cluster (i.e., the root cluster).
  • Divisive (top down) clustering: It starts with all data points in one cluster, the root.
    • Splits the root into a set of child clusters. Each child cluster is recursively divided further
    • stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point


Agglomerative approach


Bottom-up.

Initialization: each object is a cluster.

Iteration: merge the two clusters that are most similar to each other, until all objects are merged into a single cluster.

(Figure: objects a, b, c, d, e are merged step by step into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}.)

Divisive Approaches

Top-down.

Initialization: all objects stay in one cluster.

Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object.

(Figure: the cluster {a, b, c, d, e} is split step by step into {a, b} and {c, d, e}, then {c} and {d, e}, down to singleton clusters.)

Dendrogram

  • A binary tree that shows how clusters are merged/split hierarchically
  • Each node on the tree is a cluster; each leaf node is a singleton cluster

(Figure: a dendrogram; cutting it at different heights yields 9, 6, 4, 3, or 2 clusters.)


How to Merge Clusters?

  • How to measure the distance between clusters?


  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

Hint: the distance between clusters is usually defined on the basis of the distance between individual objects.


How to Define Inter-Cluster Distance

  • Single-link: the distance between two clusters is the distance between the closest pair of data objects belonging to different clusters.
  • Complete-link: the distance between two clusters is the distance between the farthest pair of data objects belonging to different clusters.
  • Average-link: the distance between two clusters is the average distance over all pairs of data objects belonging to different clusters.
  • Centroid distance: the distance between two clusters is the distance between the means (centroids) m_i and m_j of clusters C_i and C_j.

A code sketch of these four distances follows.
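An illustrative NumPy sketch of the four inter-cluster distances just defined (the function names are mine; Euclidean distance between individual objects is assumed):

```python
# Inter-cluster distances between two clusters A and B (one point per row).
import numpy as np

def pairwise(A, B):
    # matrix of Euclidean distances between every point of A and every point of B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):        # distance of the closest pair
    return pairwise(A, B).min()

def complete_link(A, B):      # distance of the farthest pair
    return pairwise(A, B).max()

def average_link(A, B):       # average distance over all pairs
    return pairwise(A, B).mean()

def centroid_distance(A, B):  # distance between the cluster means
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```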


Example: Agglomerative Hierarchical Clustering

  • For the following data set, we will get different clustering results with the single-link and complete-link algorithms.

(Figure: six data points, labelled 1–6; the single-link and complete-link algorithms group them into different clusters.)

Hierarchical Clustering: Comparison

(Figure: the clusterings of the same six points produced by the single-link, complete-link, average-link, and centroid-distance criteria.)

Comparison of Dendrograms

(Figure: the dendrograms produced by the single-link, complete-link, average-link, and centroid-distance criteria for the same six points; the merge orders differ.)

Example: Clustering using Single Link

(Figure: a step-by-step single-link agglomerative clustering of a small data set, shown over several merge steps.)

Agglomerative clustering

It is more popular than divisive methods.

  • At the beginning, each data point forms a cluster (also called a node).
  • Merge the nodes/clusters that have the least distance between them.
  • Continue merging.
  • Eventually, all nodes belong to one cluster.


Agglomerative clustering algorithm
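A sketch of what such an algorithm looks like in practice, using SciPy's hierarchy utilities (SciPy and Matplotlib are assumed to be available; this is not the slide's own listing):

```python
# Agglomerative clustering and dendrogram with SciPy (illustrative sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)                         # 20 random 2-D points
Z = linkage(X, method='single')                   # try 'complete', 'average', 'centroid'

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                     # visualize the merge hierarchy
plt.show()
```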


The complexity

  • All the algorithms are at least O(n²), where n is the number of data points.
  • Single link can be done in O(n²).
  • Complete and average links can be done in O(n² log n).
  • Due to this complexity, hierarchical methods are hard to use for large data sets; remedies include:
    • Sampling
    • Scale-up methods (e.g., BIRCH).



How to choose a clustering algorithm

  • Clustering research has a long history. A vast collection of algorithms are available.
    • We only introduced a few key algorithms.
  • Choosing the “best” algorithm is a challenge.
    • Every algorithm has limitations and works well with certain data distributions.
    • It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any “ideal” structure or distribution required by the algorithms.
    • One also needs to decide how to standardize the data, to choose a suitable distance function and to select other parameter values.


Choose a clustering algorithm (cont …)

  • Due to these complexities, the common practice is to
    • run several algorithms using different distance functions and parameter settings, and
    • then carefully analyze and compare the results.
  • The interpretation of the results must be based on insight into the meaning of the original data together with knowledge of the algorithms used.
  • Clustering is highly application dependent and, to a certain extent, subjective (personal preferences).



Cluster Evaluation: hard problem

  • The quality of a clustering is very hard to evaluate because
    • We do not know the correct clusters
  • Some methods are used:
    • User inspection
      • Study centroids, and spreads
      • Rules from a decision tree.
      • For text documents, one can read some documents in clusters.


Cluster evaluation: ground truth

  • We use some labeled data (originally meant for classification).
  • Assumption: each class corresponds to a cluster.
  • After clustering, a confusion matrix is constructed. From the matrix, we compute various measures: entropy, purity, precision, recall, and F-score.
    • Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divides D into k disjoint subsets, D1, D2, …, Dk.


Evaluation measures: Entropy
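A standard definition of cluster entropy, consistent with the notation of the previous slide (Pr_i(c_j) denotes the proportion of class c_j data points in cluster D_i):

$$\mathrm{entropy}(D_i) = -\sum_{j=1}^{k} \Pr\nolimits_i(c_j)\,\log_2 \Pr\nolimits_i(c_j)$$

$$\mathrm{entropy}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{entropy}(D_i)$$

A smaller entropy means the cluster is purer, i.e., dominated by a single class.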


Evaluation measures: purity
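A standard definition of purity, using the same notation:

$$\mathrm{purity}(D_i) = \max_{j} \Pr\nolimits_i(c_j), \qquad \mathrm{purity}_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{purity}(D_i)$$

A purity of 1 means every point in the cluster belongs to the same class.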


An example

  • Assume we have a text collection D of 900 documents from three topics (or three classes): Science, Sports, and Politics. Each class has 300 documents, and each document in D is labeled with one of the topics (classes). We use this collection to perform clustering to find three clusters. Class/topic labels are not used in clustering. After clustering, we want to measure the effectiveness of the clustering algorithm.
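A sketch of how purity and entropy would be computed for this scenario from the cluster-vs-class confusion matrix. The matrix below is hypothetical, made up purely to illustrate the computation (rows are clusters, columns are the classes Science, Sports, Politics):

```python
# Computing total purity and entropy from a hypothetical confusion matrix.
import numpy as np

conf = np.array([
    [250,  20,  10],   # cluster 1 (hypothetical counts)
    [ 30, 260,  20],   # cluster 2
    [ 20,  20, 270],   # cluster 3
])
n = conf.sum()                                  # 900 documents in total

cluster_sizes = conf.sum(axis=1)
probs = conf / cluster_sizes[:, None]           # Pr_i(c_j) for each cluster

purity_per_cluster = probs.max(axis=1)
entropy_per_cluster = -(np.where(probs > 0, probs * np.log2(probs), 0)).sum(axis=1)

total_purity = (cluster_sizes / n) @ purity_per_cluster
total_entropy = (cluster_sizes / n) @ entropy_per_cluster
print(total_purity, total_entropy)
```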


A remark about ground truth evaluation

  • Commonly used to compare different clustering algorithms.
  • A real-life data set for clustering has no class labels.
    • Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand.
  • The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm.
  • This evaluation method is said to be based on external data or information.


Evaluation based on internal information

  • Intra-cluster cohesion (compactness):
    • Cohesion measures how near the data points in a cluster are to the cluster centroid.
    • Sum of squared error (SSE) is a commonly used measure.
  • Inter-cluster separation (isolation):
    • Separation means that different cluster centroids should be far away from one another.
  • In most applications, expert judgments are still the key.
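These two criteria can be computed directly; a sketch assuming scikit-learn, using the SSE (inertia) for cohesion and the silhouette score, which combines cohesion and separation, on some illustrative random data:

```python
# Internal evaluation measures (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)
km = KMeans(n_clusters=3, n_init=10).fit(X)

print("SSE (inertia):", km.inertia_)                    # lower = more compact clusters
print("Silhouette:", silhouette_score(X, km.labels_))   # closer to 1 = better separated
```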


Indirect evaluation

  • In some applications, clustering is not the primary task, but used to help perform another task.
  • We can use the performance on the primary task to compare clustering methods.
  • For instance, in an application, the primary task is to provide recommendations on book purchasing to online shoppers.
    • If we can cluster books according to their features, we might be able to provide better recommendations.
    • We can evaluate different clustering algorithms based on how well they help with the recommendation task.
    • Here, we assume that the recommendation can be reliably evaluated.



Summary

  • Clustering has a long history and is still an active research area.
    • There are a huge number of clustering algorithms.
    • More are still coming every year.
  • We only introduced several main algorithms. There are many others, e.g.,
    • density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
  • Clustering is hard to evaluate, but very useful in practice. This partially explains why a large number of clustering algorithms are still being devised every year.
  • Clustering is highly application dependent and to some extent subjective.
