1 of 70

Unsupervised Learning

Dr. Dinesh Kumar Vishwakarma

Professor,

Department of Information Technological University, Delhi-110042

2 of 70

Outline

  • Basic concepts
  • K-means algorithm
  • Representation of clusters
  • Hierarchical clustering
  • Which clustering algorithm to use?
  • Cluster evaluation
  • Summary

Dinesh K. Vishwakarma

2

3 of 70

Supervised vs. unsupervised learning

  • Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
    • These patterns are then utilized to predict the values of the target attribute in future data instances.
  • Unsupervised learning: The data have no target attribute.
    • We want to explore the data to find some intrinsic structures in them.

Dinesh K. Vishwakarma

3

4 of 70

Clustering

  • Clustering is a technique for finding similarity groups in data, called clusters. i.e.,
    • it groups data instances that are similar to (near) each other in one cluster and data instances that are very different (far away) from each other into different clusters.
  • Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
  • In fact, association rule mining is also unsupervised.

Dinesh K. Vishwakarma

4

5 of 70

What is Clusters?

  • The organization of unlabeled data into similarity groups called ‘clusters’.
  • A cluster is a collection of data items which are “similar” between them, & “dissimilar” to data items in other clusters.
  • The data set has three natural groups of data points, i.e., 3 natural clusters.

Dinesh K. Vishwakarma

5

6 of 70

What is clustering for?

  • Example 1: groups people of similar sizes together to make “small”, “medium” and “large” T-Shirts.
    • Tailor-made for each person: too expensive
    • One-size-fits-all: does not fit all.
  • Example 2: In marketing, segment customers according to their similarities
    • To do targeted marketing.
    • Residential Area

Dinesh K. Vishwakarma

6

7 of 70

What is clustering for? (cont…)

  • Example 3: Given a collection of text documents, we want to organize them according to their content similarities,
    • To produce a topic hierarchy
  • In fact, clustering is one of the most utilized data mining techniques.
    • It has a long history, and used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
    • In recent years, due to the rapid increase of online documents, text clustering becomes important.

Dinesh K. Vishwakarma

7

8 of 70

What is clustering for? (cont…)

  • Computer Vision

Dinesh K. Vishwakarma

8

9 of 70

What do we need for Clustering?

  •  

Dinesh K. Vishwakarma

9

Large d, & small s

small d, & large s

Good

Bad

10 of 70

Distance Measurement

  •  

Dinesh K. Vishwakarma

10

  •  
  •  

11 of 70

Clustering E.g.

  •  

Dinesh K. Vishwakarma

11

C3 {X2, X4, X9}

C1 {X7,X8}

C2 { X1, X3, X5, X6}

  • Clustering

12 of 70

Fundamentals of Clustering

Dinesh K. Vishwakarma

12

13 of 70

Hierarchical vs Partitional Clustering

Dinesh K. Vishwakarma

13

14 of 70

K-means clustering

Dinesh K. Vishwakarma

14

15 of 70

K-means clustering…

  • K-means is a ‘Partitional Clustering algorithm.
  • Let the set of data points (or instances) D be {x1, x2, …, xn},
    • where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X Rr, and r is the number of attributes (dimensions) in the data.
  • The k-means algorithm partitions the given data into k clusters.
    • Each cluster has a cluster center, called centroid.
    • k is specified by the user.

Dinesh K. Vishwakarma

15

16 of 70

K-means algorithm

  • Given k, the k-means algorithm works as follows:
    1. Randomly choose k data points (seeds) to be the initial centroids, cluster centers
    2. Assign each data point to the closest centroid.
    3. Re-compute the centroids using the current cluster memberships.
    4. If a convergence criterion is not met, go to 2).

Dinesh K. Vishwakarma

16

17 of 70

K-means algorithm – (cont …)

  •  

Dinesh K. Vishwakarma

17

18 of 70

Stopping/convergence criterion

  1. No (or minimum) re-assignments of data points to different clusters,
  2. No (or minimum) change of centroids, or
  3. Minimum decrease in the sum of squared error (SSE),

    • Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.

Dinesh K. Vishwakarma

18

(1)

19 of 70

An example

Dinesh K. Vishwakarma

19

(A) Random selection of k centres

Iteration 1 (B) Cluster assignment

(C) Re-compute centroids

Iteration 2 (D) Cluster assignment

(E) Re-compute centroids

Iteration 3 (F) Cluster assignment

20 of 70

Example

  • Step 1: Randomly initialize cluster centre

Dinesh K. Vishwakarma

20

  • Step 2: determine cluster membership to each input

21 of 70

Example…

  • Step 3: Re-estimate cluster centre

Dinesh K. Vishwakarma

21

  • Result of first iteration

22 of 70

Example…

  • Second iteration

Dinesh K. Vishwakarma

22

  • Result of Second iteration

23 of 70

Example of K-Mean Clustering

  • Consider a data table

Dinesh K. Vishwakarma

23

Height (X)

Weight (Y)

185

72

170

56

168

60

179

68

182

72

188

77

180

71

180

70

183

84

180

88

180

67

177

76

  • Consider there are two clusters (K1, K2) and their centroids are (185, 72) & (170, 56)
  • Euclidian Distance for data point ‘3’

K1={185, 72}

K2={170, 56}

  •  
  •  

Hence, data point 3 belongs to K2

  • New Centroid for K2
  •  
  •  
  •  

24 of 70

Strengths of k-means

Dinesh K. Vishwakarma

24

25 of 70

Strengths of k-means…

  • Strengths:
    • Simple: easy to understand and to implement
    • Efficient: Time complexity: O(tkn),

where n is the number of data points,

k is the number of clusters, and

t is the number of iterations.

    • Since both k and t are small. k-means is considered a linear algorithm.
  • Note that: it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.

Dinesh K. Vishwakarma

25

26 of 70

Weaknesses of k-means

  • The algorithm is only applicable if the mean is defined.
    • For categorical data, k-mode - the centroid is represented by most frequent values.
  • The user needs to specify k.
  • The algorithm is sensitive to outliers
    • Outliers are data points that are very far away from other data points.
    • Outliers could be errors in the data recording or some special data points with very different values.

Dinesh K. Vishwakarma

26

27 of 70

Weaknesses of k-means…

  • Problems with outliers

Dinesh K. Vishwakarma

27

(A) Undesirable cluster

(B) Ideal cluster

Outlier

Outlier

28 of 70

Weaknesses of k-means...

  • To deal with outliers
  • One method is to remove some data points in the clustering process that are much further away from the centroids than other data points.
    • To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
  • Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
    • Assign the rest of the data points to the clusters by distance or similarity comparison, or classification

Dinesh K. Vishwakarma

28

29 of 70

Weaknesses of k-means (cont …)

  • The algorithm is sensitive to initial seeds.
  • If we use different seeds: good results

Dinesh K. Vishwakarma

29

30 of 70

Weaknesses of k-means (cont …)

  • Special Data Structures: The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

Dinesh K. Vishwakarma

30

Two natural clusters

K-mean clusters

31 of 70

K-means summary

  • Despite weaknesses, k-means is still the most popular algorithm due to its simplicity, efficiency and
    • other clustering algorithms have their own lists of weaknesses.
  • No clear evidence that any other clustering algorithm performs better in general
    • although they may be more suitable for some specific types of data or applications.
  • Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!

Dinesh K. Vishwakarma

31

32 of 70

Outlines

  • Basic concepts
  • K-means algorithm
  • Representation of clusters
  • Hierarchical clustering
  • Which clustering algorithm to use?
  • Cluster evaluation
  • Summary

Dinesh K. Vishwakarma

32

33 of 70

Hierarchical Clustering

Dinesh K. Vishwakarma

33

34 of 70

Why Hierarchical Clustering?

  •  

Dinesh K. Vishwakarma

34

35 of 70

Why Hierarchical Clustering? E.g.

Dinesh K. Vishwakarma

35

36 of 70

Types of hierarchical clustering

  • Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and
    • merges the most similar (or nearest) pair of clusters
    • stops when all the data points are merged into a single cluster (i.e., the root cluster).
  • Divisive (top down) clustering: It starts with all data points in one cluster, the root.
    • Splits the root into a set of child clusters. Each child cluster is recursively divided further
    • stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point

Dinesh K. Vishwakarma

36

37 of 70

Agglomerative approach

Dinesh K. Vishwakarma

37

b

d

c

e

a

a b

d e

c d e

a b c d e

Step 0

Step 1

Step 2

Step 3

Step 4

bottom-up

Initialization:

Each object is a cluster

Iteration:

Merge two clusters which are

most similar to each other;

Until all objects are merged

into a single cluster

38 of 70

Divisive Approaches

Dinesh K. Vishwakarma

38

38

b

d

c

e

a

a b

d e

c d e

a b c d e

Step 4

Step 3

Step 2

Step 1

Step 0

Top-down

Initialization:

All objects stay in one cluster

Iteration:

Select a cluster and split it into

two sub clusters

Until each leaf cluster contains

only one object

39 of 70

Dendrogram

  • A binary tree that shows how clusters are merged/split hierarchically
  • Each node on the tree is a cluster; each leaf node is a singleton cluster

Dinesh K. Vishwakarma

39

9

6

4

3

2

No. of clusters

40 of 70

Agglomerative Algorithms

Dinesh K. Vishwakarma

40

41 of 70

How to Merge Clusters?

  • How to measure the distance between clusters?

Dinesh K. Vishwakarma

41

  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

Distance?

Hint: Distance between clusters is usually defined on the basis of distance between objects.

42 of 70

How to Define Inter-Cluster Distance

Dinesh K. Vishwakarma

42

  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.

43 of 70

How to Define Inter-Cluster Distance

Dinesh K. Vishwakarma

43

  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.

44 of 70

How to Define Inter-Cluster Distance

Dinesh K. Vishwakarma

44

  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.

45 of 70

How to Define Inter-Cluster Distance

Dinesh K. Vishwakarma

45

  • Single-link
  • Complete-link
  • Average-link
  • Centroid distance

×

×

mi,mj are the means of Ci, Cj,

The distance between two clusters is represented by the distance between the means of the cluters.

46 of 70

E.g. Agglomerative Hierarchical Clustering

  • For the following data set, we will get different clustering results with the single-link and complete-link algorithms.

Dinesh K. Vishwakarma

46

1

2

3

4

5

6

1

3

4

5

2

6

1

2

3

4

5

6

1

3

2

4

5

6

Result of the Single-Link algorithm

Result of the complete-Link algorithm

47 of 70

Hierarchical Clustering: Comparison

Dinesh K. Vishwakarma

Average-link

Centroid distance

1

2

3

4

5

6

1

2

5

3

4

Single-link

Complete-link

1

2

3

4

5

6

1

2

5

3

4

1

2

3

4

5

6

1

2

5

3

4

1

2

3

4

5

6

1

2

3

4

5

48 of 70

Comparison of Dendrograms

Dinesh K. Vishwakarma

48

1 2 5 3 6 4

1 2 5 3 6 4

1 2 5 3 6 4

2 5 3 6 4 1

Average-link

Centroid distance

Single-link

Complete-link

49 of 70

E.G. of Clustering

Dinesh K. Vishwakarma

49

50 of 70

E.G. of Clustering using Single Link

Dinesh K. Vishwakarma

50

51 of 70

E.G. of Clustering using Single Link

Dinesh K. Vishwakarma

51

52 of 70

E.G. of Clustering using Single Link

Dinesh K. Vishwakarma

52

53 of 70

E.G. of Clustering using Single Link

Dinesh K. Vishwakarma

53

54 of 70

Agglomerative clustering

It is more popular then divisive methods.

  • At the beginning, each data point forms a cluster (also called a node).
  • Merge nodes/clusters that have the least distance.
  • Go on merging
  • Eventually all nodes belong to one cluster

Dinesh K. Vishwakarma

54

55 of 70

Agglomerative clustering algorithm

Dinesh K. Vishwakarma

55

56 of 70

The complexity

  • All the algorithms are at least O(n2). n is the number of data points.
  • Single link can be done in O(n2).
  • Complete and average links can be done in O(n2logn).
  • Due the complexity, hard to use for large data sets.
    • Sampling
    • Scale-up methods (e.g., BIRCH).

Dinesh K. Vishwakarma

56

57 of 70

Outlines

  • Basic concepts
  • K-means algorithm
  • Representation of clusters
  • Hierarchical clustering
  • Which clustering algorithm to use?
  • Cluster evaluation
  • Summary

Dinesh K. Vishwakarma

57

58 of 70

How to choose a clustering algorithm

  • Clustering research has a long history. A vast collection of algorithms are available.
    • We only introduced few key algorithms.
  • Choosing the “best” algorithm is a challenge.
    • Every algorithm has limitations and works well with certain data distributions.
    • It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any “ideal” structure or distribution required by the algorithms.
    • One also needs to decide how to standardize the data, to choose a suitable distance function and to select other parameter values.

Dinesh K. Vishwakarma

58

59 of 70

Choose a clustering algorithm (cont …)

  • Due to these complexities, the common practice is to
    • run several algorithms using different distance functions and parameter settings, and
    • then carefully analyze and compare the results.
  • The interpretation of the results must be based on insight into the meaning of the original data together with knowledge of the algorithms used.
  • Clustering is highly application dependent and to certain extent subjective (personal preferences).

Dinesh K. Vishwakarma

59

60 of 70

Outlines

  • Basic concepts
  • K-means algorithm
  • Representation of clusters
  • Hierarchical clustering
  • Which clustering algorithm to use?
  • Cluster evaluation
  • Summary

Dinesh K. Vishwakarma

60

61 of 70

Cluster Evaluation: hard problem

  • The quality of a clustering is very hard to evaluate because
    • We do not know the correct clusters
  • Some methods are used:
    • User inspection
      • Study centroids, and spreads
      • Rules from a decision tree.
      • For text documents, one can read some documents in clusters.

Dinesh K. Vishwakarma

61

62 of 70

Cluster evaluation: ground truth

  • We use some labeled data (for classification)
  • Assumption: Each class is a cluster.
  • After clustering, a confusion matrix is constructed. From the matrix, we compute various measurements, entropy, purity, precision, recall and F-score.
    • Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divides D into k disjoint subsets, D1, D2, …, Dk.

Dinesh K. Vishwakarma

62

63 of 70

Evaluation measures: Entropy

Dinesh K. Vishwakarma

63

64 of 70

Evaluation measures: purity

Dinesh K. Vishwakarma

64

65 of 70

An example

  • Assume we have a text collection Dof 900 documents from three topics (or three classes), Science, Sports, and Politics. Each class has 300 documents, and each document in D is labeled with one of the topics (classes). We use this collection to perform clustering to find three clusters. Class/topic labels are not used in clustering. After clustering, we want to measure the effectiveness of the clustering algorithm.

Dinesh K. Vishwakarma

65

66 of 70

A remark about ground truth evaluation

  • Commonly used to compare different clustering algorithms.
  • A real-life data set for clustering has no class labels.
    • Thus although an algorithm may perform very well on some labeled data sets, no guarantee that it will perform well on the actual application data at hand.
  • The fact that it performs well on some label data sets does give us some confidence of the quality of the algorithm.
  • This evaluation method is said to be based on external data or information.

Dinesh K. Vishwakarma

66

67 of 70

Evaluation based on internal information

  • Intra-cluster cohesion (compactness):
    • Cohesion measures how near the data points in a cluster are to the cluster centroid.
    • Sum of squared error (SSE) is a commonly used measure.
  • Inter-cluster separation (isolation):
    • Separation means that different cluster centroids should be far away from one another.
  • In most applications, expert judgments are still the key.

Dinesh K. Vishwakarma

67

68 of 70

Indirect evaluation

  • In some applications, clustering is not the primary task, but used to help perform another task.
  • We can use the performance on the primary task to compare clustering methods.
  • For instance, in an application, the primary task is to provide recommendations on book purchasing to online shoppers.
    • If we can cluster books according to their features, we might be able to provide better recommendations.
    • We can evaluate different clustering algorithms based on how well they help with the recommendation task.
    • Here, we assume that the recommendation can be reliably evaluated.

Dinesh K. Vishwakarma

68

69 of 70

An example

Dinesh K. Vishwakarma

69

70 of 70

Summary

  • Clustering is has along history and still active
    • There are a huge number of clustering algorithms
    • More are still coming every year.
  • We only introduced several main algorithms. There are many others, e.g.,
    • density based algorithm, sub-space clustering, scale-up methods, neural networks based methods, fuzzy clustering, co-clustering, etc.
  • Clustering is hard to evaluate, but very useful in practice. This partially explains why there are still a large number of clustering algorithms being devised every year.
  • Clustering is highly application dependent and to some extent subjective.

Dinesh K. Vishwakarma

70