1 of 27

JSC 270 - LAb 9

Clustering

2 of 27

data

40 points

2 features

2 clusters

3 of 27

data

Original data

What we see

A subsample of 11 points

4 of 27

K-means - Algorithm

Decide on the number of clusters K

Pick K centroids (ie. cluster centers) at random

Repeat the following until convergence:

  1. Compute the distance from each point to each centroid and assign points to the cluster with the closest centroid
  2. Re-compute the centroid for the current cluster using all assigned points

5 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

6 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

7 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

8 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

x

x

9 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

x

x

10 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

x

x

x

x

11 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

x

x

x

x

12 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

x

x

x

x

x

x

13 of 27

K-Means

Depends on initial conditions

14 of 27

How to make K-Means more robust?

15 of 27

How to make K-Means more robust?

  • Run many times (scikit version default is 10)
  • Have a better than random initialization
    • Pick data points as initializations
    • k-means++

16 of 27

How to make K-Means more robust? k-means++

  1. Choose one center uniformly at random among the data points.
  2. For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x).
  4. Repeat Steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

17 of 27

k-MEANS

Advantages

Fast

Makes relatively few assumptions

Stable when robust initialization is used

Disadvantages

Depends on the initialization

Can get stuck in local minimum

18 of 27

A bit about selecting the number of clusters

General approach

  • Apply clustering for k = 1,...,K (e.g. K = 10)
  • For each k, compute a statistic
  • Plot statistic as a function of k

19 of 27

VARIOUS STATISTICS for selecting the number of clusters

  • Silhouette
  • Gap
  • Elbow
  • Sanky Plot
  • Information Criterion
    • AIC
    • BIC
  • ...

20 of 27

VARIOUS STATISTICS for selecting the number of clusters

  • Silhouette
  • Gap
  • Elbow
  • Sanky Plot
  • Information Criterion
    • AIC
    • BIC
  • ...

21 of 27

Elbow

Elbow method- heuristic

Elbow in the plot of the sum of squared distances to the nearest centre

22 of 27

Silhouette Score

Kauffman and Rousseeuw (1990)

For each data point i in cluster Ci, let

  • average distance for point i to other points within its cluster
  • min (across clusters) of the averages of intra-cluster distances

Silhouette: , Silhouette Coefficient:

average over all points

23 of 27

Silhouette Score

Kauffman and Rousseeuw (1990)

k=9

24 of 27

GAP statistic

Di=𝝨h,k∈Ci dhk- sum of pairwise distances in cluster i

Wk = 𝝨i=1..k1/(2ni)Di - pooled within cluster sum of distances

Main idea: compare log(Wk) to expectation under the null distribution

Tibshirani et al (2000)

25 of 27

GAP statistic

k=5

k=9

k=5 k=9

26 of 27

Alluvial (SANKY) PLOT

k=3

k=9

27 of 27

SUMMARY

  • There are MANY clustering methods
    • We saw agglomerative hierarchical clustering and K-means
  • There is no one perfect method that works for every dataset
  • There are many ways to select the number of clusters
  • Examine multiple methods for each dataset
  • Visualizations help to analyze clustering