1 of 27

JSC 270 - LAb 9

Clustering

2 of 27

data

40 points

2 features

2 clusters

3 of 27

data

Original data

What we see

A subsample of 11 points

4 of 27

K-means - Algorithm

Decide on the number of clusters K

Pick K centroids (ie. cluster centers) at random

Repeat the following until convergence:

Compute the distance from each point to each centroid and assign points to the cluster with the closest centroid
Re-compute the centroid for the current cluster using all assigned points

5 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

6 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

7 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

8 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

9 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

10 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

11 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

12 of 27

K-Means

A

B

C

D

E

F

G

H

I

J

K

x

13 of 27

K-Means

Depends on initial conditions

14 of 27

How to make K-Means more robust?

15 of 27

How to make K-Means more robust?

Run many times (scikit version default is 10)
Have a better than random initialization

Pick data points as initializations
k-means++

16 of 27

How to make K-Means more robust? k-means++

Choose one center uniformly at random among the data points.
For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x).
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.

17 of 27

k-MEANS

Advantages

Fast

Makes relatively few assumptions

Stable when robust initialization is used

Disadvantages

Depends on the initialization

Can get stuck in local minimum

18 of 27

A bit about selecting the number of clusters

General approach

Apply clustering for k = 1,...,K (e.g. K = 10)
For each k, compute a statistic
Plot statistic as a function of k

19 of 27

VARIOUS STATISTICS for selecting the number of clusters

Silhouette
Gap
Elbow
Sanky Plot
Information Criterion

AIC
BIC

...

20 of 27

VARIOUS STATISTICS for selecting the number of clusters

Silhouette
Gap
Elbow
Sanky Plot
Information Criterion

AIC
BIC

...

21 of 27

Elbow

Elbow method- heuristic

Elbow in the plot of the sum of squared distances to the nearest centre

22 of 27

Silhouette Score

Kauffman and Rousseeuw (1990)

For each data point i in cluster C_i, let

average distance for point i to other points within its cluster

min (across clusters) of the averages of intra-cluster distances

Silhouette: , Silhouette Coefficient:

average over all points

23 of 27

Silhouette Score

Kauffman and Rousseeuw (1990)

k=9

24 of 27

GAP statistic

D_i=𝝨_h,k∈Ci d_hk- sum of pairwise distances in cluster i

W_k = 𝝨_i=1..k1/(2n_i)D_i - pooled within cluster sum of distances

Main idea: compare log(W_k) to expectation under the null distribution

Tibshirani et al (2000)

25 of 27

GAP statistic

k=5

k=9

k=5 k=9

26 of 27

Alluvial (SANKY) PLOT

k=3

k=9

27 of 27

SUMMARY

There are MANY clustering methods

We saw agglomerative hierarchical clustering and K-means

There is no one perfect method that works for every dataset
There are many ways to select the number of clusters
Examine multiple methods for each dataset
Visualizations help to analyze clustering