1 of 37

Clustering

2 of 37

Outline

  • Introduction
  • K-means clustering
  • Hierarchical clustering

2

3 of 37

Classification vs. Clustering

3

Classification: Supervised learning:

Learns a method for predicting the instance class from pre-labeled (classified) instances

4 of 37

Clustering

4

Unsupervised learning:

Finds “natural” grouping of instances given un-labeled data

5 of 37

Clustering Methods

  • Many different method and algorithms:
    • For numeric and/or symbolic data
    • Deterministic vs. probabilistic
    • Exclusive vs. overlapping
    • Hierarchical vs. flat
    • Top-down vs. bottom-up

5

6 of 37

Clusters: �exclusive vs. overlapping

6

Simple 2-D representation

Non-overlapping

Venn diagram

Overlapping

a

k

j

i

h

g

f

e

d

c

b

7 of 37

Clustering Evaluation

  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures
    • distance measures
    • high similarity within a cluster, low across clusters

7

8 of 37

The distance function

  • Simplest case: one numeric attribute A
    • Distance(X,Y) = A(X) – A(Y)
  • Several numeric attributes:
    • Distance(X,Y) = Euclidean distance between X,Y
  • Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
  • Are all attributes equally important?
    • Weighting the attributes might be necessary

8

9 of 37

Simple Clustering: K-means

Works with numeric data only

  1. Pick a number (K) of cluster centers (at random)
  2. Assign every item to its nearest cluster center (e.g. using Euclidean distance)
  3. Move each cluster center to the mean of its assigned items
  4. Repeat steps 2,3 until convergence (change in cluster assignments less than a threshold)

9

10 of 37

K-means example, step 1

10

k1

k2

k3

X

Y

Pick 3

initial

cluster

centers

(randomly)

11 of 37

K-means example, step 2

11

k1

k2

k3

X

Y

Assign

each point

to the closest

cluster

center

12 of 37

K-means example, step 3

12

X

Y

Move

each cluster center

to the mean

of each cluster

k1

k2

k2

k1

k3

k3

13 of 37

K-means example, step 4

13

X

Y

Reassign

points

closest to a different new cluster center

Q: Which points are reassigned?

k1

k2

k3

14 of 37

K-means example, step 4 …

14

X

Y

A: three points with animation

k1

k3

k2

15 of 37

K-means example, step 4b

15

X

Y

re-compute cluster means

k1

k3

k2

16 of 37

K-means example, step 5

16

X

Y

move cluster centers to cluster means

k2

k1

k3

17 of 37

Discussion, 1

What can be the problems with

K-means clustering?

17

18 of 37

Discussion, 2

  • Result can vary significantly depending on initial choice of seeds (number and position)
  • Can get trapped in local minimum
    • Example:

  • Q: What can be done?

18

instances

initial cluster centers

19 of 37

Discussion, 3

A: To increase chance of finding global optimum: restart with different random seeds.

19

20 of 37

K-means clustering summary

Advantages

  • Simple, understandable
  • items automatically assigned to clusters

Disadvantages

  • Must pick number of clusters before hand
  • All items forced into a cluster
  • Too sensitive to outliers

20

21 of 37

K-means clustering - outliers ?

What can be done about outliers?

21

22 of 37

K-means variations

  • K-medoids – instead of mean, use medians of each cluster
    • Mean of 1, 3, 5, 7, 9 is
    • Mean of 1, 3, 5, 7, 1009 is
    • Median of 1, 3, 5, 7, 1009 is
    • Median advantage: not affected by extreme values
  • For large databases, use sampling

22

5

205

5

23 of 37

*Hierarchical clustering

  • Bottom up
    • Start with single-instance clusters
    • At each step, join the two closest clusters
    • Design decision: distance between clusters
      • E.g. two closest instances in clusters� vs. distance between means
  • Top down
    • Start with one universal cluster
    • Find two clusters
    • Proceed recursively on each subset
    • Can be very fast
  • Both methods produce a�dendrogram

23

24 of 37

Agglomerative Hierarchical Clustering

24

25 of 37

*Incremental clustering

  • Heuristic approach (COBWEB/CLASSIT)
  • Form a hierarchy of clusters incrementally
  • Start:
    • tree consists of empty root node
  • Then:
    • add instances one by one
    • update tree appropriately at each stage
    • to update, find the right leaf for an instance
    • May involve restructuring the tree
  • Base update decisions on category utility

25

26 of 37

*Clustering weather data

1

26

ID

Outlook

Temp.

Humidity

Windy

A

Sunny

Hot

High

False

B

Sunny

Hot

High

True

C

Overcast

Hot

High

False

D

Rainy

Mild

High

False

E

Rainy

Cool

Normal

False

F

Rainy

Cool

Normal

True

G

Overcast

Cool

Normal

True

H

Sunny

Mild

High

False

I

Sunny

Cool

Normal

False

J

Rainy

Mild

Normal

False

K

Sunny

Mild

Normal

True

L

Overcast

Mild

High

True

M

Overcast

Hot

Normal

False

N

Rainy

Mild

High

True

2

3

27 of 37

*Clustering weather data

4

27

ID

Outlook

Temp.

Humidity

Windy

A

Sunny

Hot

High

False

B

Sunny

Hot

High

True

C

Overcast

Hot

High

False

D

Rainy

Mild

High

False

E

Rainy

Cool

Normal

False

F

Rainy

Cool

Normal

True

G

Overcast

Cool

Normal

True

H

Sunny

Mild

High

False

I

Sunny

Cool

Normal

False

J

Rainy

Mild

Normal

False

K

Sunny

Mild

Normal

True

L

Overcast

Mild

High

True

M

Overcast

Hot

Normal

False

N

Rainy

Mild

High

True

3

Merge best host and runner-up

5

Consider splitting the best host if merging doesn’t help

28 of 37

*Final hierarchy

28

ID

Outlook

Temp.

Humidity

Windy

A

Sunny

Hot

High

False

B

Sunny

Hot

High

True

C

Overcast

Hot

High

False

D

Rainy

Mild

High

False

Oops! a and b are actually very similar

29 of 37

*Example: the iris data (subset)

29

30 of 37

*Clustering with cutoff

30

31 of 37

*Category utility

  • Category utility: quadratic loss function�defined on conditional probabilities:

  • Every instance in different category �numerator becomes��

31

maximum

number of attributes

32 of 37

*Overfitting-avoidance heuristic

  • If every instance gets put into a different category the numerator becomes (maximal):

Where n is number of all possible attribute values.

  • So without k in the denominator of the CU-formula, every cluster would consist of one instance!

32

Maximum value of CU

33 of 37

Other Clustering Approaches

  • EM – probability based clustering
  • Bayesian clustering
  • SOM – self-organizing maps

33

34 of 37

Discussion

  • Can interpret clusters by using supervised learning
    • learn a classifier based on clusters
  • Decrease dependence between attributes?
    • pre-processing step
    • E.g. use principal component analysis
  • Can be used to fill in missing values
  • Key advantage of probabilistic clustering:
    • Can estimate likelihood of data
    • Use it to compare different models objectively

34

35 of 37

Examples of Clustering Applications

  • Marketing: discover customer groups and use them for targeted marketing and re-organization
  • Astronomy: find groups of similar stars and galaxies
  • Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults
  • Genomics: finding groups of gene with similar expressions

35

36 of 37

Clustering Summary

  • Unsupervised
  • Many approaches
    • K-means – simple, sometimes useful
      • K-medoids is less sensitive to outliers
    • Hierarchical clustering – works for symbolic attributes
  • Evaluation is a problem

36

37 of 37

Principal Component Analysis (PCA)

37