1 of 48

Clustering Algorithms:

K-means and DBSCAN

2 of 48

Preface

The content of these slides is the same as the IPython notebook provided on GitHub. The only difference is that I’ve added some easy-to-read comments and explanations along with the code. Some tedious and unimportant code is not included on the slides, so I would suggest that while reading this tutorial, you play with the code in the IPython notebook.

Let’s get started!

3 of 48

K-means

4 of 48

5 of 48

First, import the necessary packages.
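
A minimal sketch of what those imports might look like for this tutorial (the notebook’s exact import list may differ):

  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.datasets import make_blobs
  from sklearn.cluster import KMeans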

6 of 48

Generate three clusters with two-dimensional data.

← The rows and columns of the NumPy array correspond to the number of data points and the number of dimensions. In this case, we generate 5000 data points, each of which is two-dimensional, so the training data X is a (5000 x 2) matrix.

← Each data point corresponds to a cluster label y (ground truth), and the number of clusters is three (labels = {0, 1, 2}), so the label y is a 5000-dimensional vector.
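
The generation step typically looks something like the following sketch using scikit-learn’s make_blobs (the random_state value is an arbitrary illustrative choice):

  # Generate 5000 two-dimensional points grouped into 3 clusters.
  X, y = make_blobs(n_samples=5000, n_features=2, centers=3, random_state=42)
  print(X.shape)  # (5000, 2)
  print(y.shape)  # (5000,)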

7 of 48

Let’s visualize our training data and ground truth.
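
One way to do this, as a sketch rather than the notebook’s exact plotting code:

  # Color each point by its ground-truth label.
  plt.scatter(X[:, 0], X[:, 1], c=y, s=5)
  plt.title("Training data colored by ground truth")
  plt.show()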

8 of 48

Use KMeans from scikit-learn to train our clustering model (a sketch of the full call follows the parameter list below):

Parameters:

  • n_clusters: How many clusters you want (i.e., choosing “k”).

  • n_init: How many times you want k-means to run. In this case, we set n_init=3, which means k-means runs three times and the run with the lowest “inertia” (the objective/cost function: the sum of squared distances) is kept. The lowest inertia means the positions of the centroids in the vector space are the best for clustering, since the average distance from the data points to their assigned centroids is the smallest.

  • init: How k-means initializes the centroids. We can choose “random” or “k-means++” (we’ll explain this later).

  • tol: The tolerance that decides when k-means is considered converged. The higher the value, the earlier k-means converges.

  • random_state: Just for reproducibility. The same value produces the same result.

  • verbose: Set to True to print logs of the k-means training process, or False to keep it silent.
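
Putting the parameters above together, the call might look like the sketch below (apart from n_clusters=3 and n_init=3, which the slides mention, the specific values are illustrative):

  kmeans = KMeans(n_clusters=3, n_init=3, init='random',
                  tol=1e-4, random_state=42, verbose=True)
  kmeans.fit(X)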

9 of 48

← Use “kmeans.labels_” to retrieve our predictions (5000-D vector).

← Use “kmeans.cluster_centers_” to retrieve trained centroids.

There are three clusters and the data is 2-D, thus it is a (3 x 2) matrix.

← Visualize predictions and centroids.
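
Assuming the kmeans object fitted above, these steps might look like this sketch:

  y_pred = kmeans.labels_              # predicted cluster for each point (5000-D vector)
  centroids = kmeans.cluster_centers_  # (3 x 2) matrix of trained centroids

  # Plot predictions and mark the centroids.
  plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=5)
  plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=100)
  plt.show()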

10 of 48

← After training, use “kmeans.predict()” to predict new data X_new without training again.

← Use “kmeans.transform()” to retrieve the distances between every data point and every centroid. If you have M data points and N clusters, you’ll get an (M x N) matrix. Every row contains the distances from one data point to the N centroids.
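
For example (X_new here is a hypothetical pair of new points, not data from the notebook):

  X_new = np.array([[0.0, 0.0], [5.0, 5.0]])
  print(kmeans.predict(X_new))    # cluster label for each new point
  print(kmeans.transform(X_new))  # (2 x 3) matrix of distances to the 3 centroids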

11 of 48

K-means++

12 of 48

13 of 48

If you want to use k-means++, just set init='k-means++'. Simple enough, huh? Besides, have you noticed that it converges faster than just picking centroids randomly?

Actually, scikit-learn uses k-means++ as the default option, so you don’t need to set it manually (we set it manually here purely for educational purposes).
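
A sketch of the same model with k-means++ initialization (values other than init are carried over from the earlier example):

  kmeans_pp = KMeans(n_clusters=3, n_init=3, init='k-means++',
                     random_state=42, verbose=True)
  kmeans_pp.fit(X)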

14 of 48

The predictions from k-means++.

15 of 48

Please complete “clustering-lab1.ipynb”

16 of 48

Drawbacks of K-means and How to Address Them

17 of 48

Drawback 1:

Need to choose the right “k”

18 of 48

Let’s begin by generating synthesized training data consisting of 3 clusters.

Here’s a question: What if we choose a wrong number of clusters and directly apply k-means on the training data?
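
A sketch of that experiment (the sample size and random_state are illustrative; the notebook’s values may differ):

  # 3-cluster data, but k-means is run with a deliberately wrong k=2.
  X, y = make_blobs(n_samples=1500, centers=3, random_state=42)
  y_pred_k2 = KMeans(n_clusters=2, random_state=42).fit_predict(X)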

19 of 48

The answer is: if we choose the wrong “k”, the k-means algorithm still runs without any error.

In the case of k=2, the middle cluster is assigned the same label as the right one (blue), which means that, overall, its points are closer to the right (blue) centroid than to the left (red) one.

However, this result is not good enough. How do we choose “k” properly?

20 of 48

Use supervised evaluation methods only when you have ground truth for your training data.

21 of 48

Use unsupervised evaluation methods when you don’t have ground truth for your training data.

22 of 48

First, we generate 4-cluster training data and visualize it. Note that we deliberately generate the data so that 3 of the clusters (red, blue, and yellow) overlap with each other.

Let’s use the cluster quality measurements described above and run different values of “k” to observe which “k” is preferable.
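
A sketch of how such data could be generated (the centers and cluster_std below are hypothetical choices that make three of the blobs overlap; the notebook may use different values):

  centers = [[0, 0], [1.5, 1.5], [1.5, -1.5], [8, 0]]
  X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=42)
  plt.scatter(X[:, 0], X[:, 1], c=y, s=5)
  plt.show()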

23 of 48

Code explanation:

  • First, we set the range of “k”. In this case, we run 5 cases, k=2, 3, 4, 5, 6, simply by using a for loop (sketched after this list). Each case uses 3 metrics, homogeneity, completeness, and silhouette coefficient, to evaluate the goodness of the clusters.

  • Use “homogeneity_score(y, y_pred)” and “completeness_score(y, y_pred)” to compute the corresponding evaluations.

  • Note that we have 2 options to compute the silhouette coefficient in scikit-learn:
    • Use “silhouette_samples(X, y_pred)” to compute the silhouette coefficient for each data point.
    • Use “silhouette_score(X, y_pred)” to compute the average silhouette coefficient over all data points.
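
The loop might look like this sketch (n_init=3 and random_state=42 are illustrative):

  from sklearn.metrics import homogeneity_score, completeness_score, silhouette_score

  for k in range(2, 7):
      y_pred = KMeans(n_clusters=k, n_init=3, random_state=42).fit_predict(X)
      print(k,
            homogeneity_score(y, y_pred),   # supervised: needs ground truth y
            completeness_score(y, y_pred),  # supervised: needs ground truth y
            silhouette_score(X, y_pred))    # unsupervised: only needs X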

24 of 48

Since the visualization code is tedious, please refer to the IPython notebook for it.

Let’s talk about how to interpret this diagram.

The left one is very simple, just the predictions of the clusters.

The right one:

  • X-axis: silhouette coefficient.

  • Y-axis: data points.

  • Data points that are in the same cluster are plotted in the same color.

  • The silhouette coefficients within each cluster are sorted in descending order.

  • The red dashed line represents the mean silhouette coefficient computed by “silhouette_score()”.

25 of 48

← When k=4, the silhouette coefficients of all clusters generally extend beyond the red dashed line, meaning that the quality of these clusters is good.

← When k=5, we can notice that 2 clusters (yellow and tiffany blue) fall behind the red dashed line, meaning their overall silhouette coefficients are low, since these 2 clusters overlap with each other. The overlap results in a low mean nearest-cluster distance b(x), so k=5 is not a good choice.

26 of 48

27 of 48

Drawback 2:

Cannot Handle Noise Data and Outliers

28 of 48

Extending the same example above, we cluster the data with k=4 and the data seems to be well clustered.

We can notice, however, that some data points which are relatively far away from their clusters are still assigned to a cluster by k-means, even though they are obviously “outliers”.

So … how do we detect these outliers?

29 of 48

You can simply compute the average distance within each cluster and use it as a distance threshold for judging whether a data point is an outlier: if the distance from a data point to its cluster center exceeds the threshold, the point is treated as an outlier (sometimes we control the scale of the threshold by multiplying it by a ratio).

30 of 48

← 1. Retrieve the data points and the centroid.

← 2. Compute the distances between every data point and its centroid. Average them to get our distance threshold.

← 3. Outlier detection.
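
A sketch of these three steps for a single cluster (assuming a kmeans model fitted with k=4 on data X, with y_pred = kmeans.labels_; the 1.5 ratio is an arbitrary illustrative choice):

  # 1. Retrieve the points assigned to one cluster and that cluster's centroid.
  cluster_id = 0
  members = X[y_pred == cluster_id]
  centroid = kmeans.cluster_centers_[cluster_id]

  # 2. Average distance to the centroid, scaled by a ratio, as the threshold.
  distances = np.linalg.norm(members - centroid, axis=1)
  threshold = distances.mean() * 1.5

  # 3. Flag points farther from the centroid than the threshold as outliers.
  outliers = members[distances > threshold]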

31 of 48

← By doing the simple outlier detection described above, we successfully detect the outliers (red).

However, if there are too many outliers in your dataset, this method may not work well, since k-means computes the mean from all the data points, including those outliers.

32 of 48

Drawback 3:

Cannot Handle Non-spherical Data

33 of 48

According to the definition of k-means, the algorithm partitions the data by assigning each point to the closest cluster mean. This is based on the assumption that your data is roughly spherical; otherwise, the performance of k-means won’t be desirable.

Let’s see what the clustering result will be when we use k-means with k=2 to cluster the non-spherical data on the left-hand side.

34 of 48

It turns out that the concentric-circle data is split down the middle into 2 clusters, because the means of the inner circle and the outer circle are very close to each other.
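
A sketch of this experiment using scikit-learn’s make_circles (the factor, noise, and random_state values are illustrative):

  from sklearn.datasets import make_circles

  # Concentric circles: k-means with k=2 splits them down the middle.
  X_circ, y_circ = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=42)
  y_circ_pred = KMeans(n_clusters=2, random_state=42).fit_predict(X_circ)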

35 of 48

← Helper function for transforming Cartesian coordinates to polar coordinates.

← We only need the radius feature; the second feature is filled with zeros, and then we finally apply k-means.
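
A sketch of that transformation, reusing the X_circ data from the previous example (the helper name cartesian_to_polar is hypothetical):

  def cartesian_to_polar(points):
      # Radius and angle for each 2-D point.
      r = np.linalg.norm(points, axis=1)
      theta = np.arctan2(points[:, 1], points[:, 0])
      return np.column_stack([r, theta])

  # Keep only the radius; pad the second feature with zeros, then cluster.
  X_polar = cartesian_to_polar(X_circ)
  X_radius = np.column_stack([X_polar[:, 0], np.zeros(len(X_polar))])
  y_radius_pred = KMeans(n_clusters=2, random_state=42).fit_predict(X_radius)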

36 of 48

37 of 48

38 of 48

DBSCAN:

Density-Based Spatial Clustering of Applications with Noise

39 of 48

40 of 48

41 of 48

42 of 48

← 1. Generate training data X (1000 x 2) and the corresponding ground truth y.

← 2. Standardize the features of the training data.

← 3. Apply DBSCAN; the parameter “min_samples” is the same as the “MinPts” described previously.

← 4. Print cluster qualities.
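
A sketch of these four steps (the number of centers, cluster_std, eps, and min_samples values here are illustrative assumptions, not necessarily the notebook’s settings):

  from sklearn.preprocessing import StandardScaler
  from sklearn.cluster import DBSCAN
  from sklearn.metrics import homogeneity_score, completeness_score, silhouette_score

  # 1. Generate training data X (1000 x 2) with ground truth y.
  X, y = make_blobs(n_samples=1000, centers=3, cluster_std=0.5, random_state=42)

  # 2. Standardize the features.
  X_std = StandardScaler().fit_transform(X)

  # 3. Apply DBSCAN; min_samples plays the role of MinPts, eps is the neighborhood radius.
  y_pred = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_std)  # label -1 marks noise

  # 4. Print cluster qualities.
  print(homogeneity_score(y, y_pred))
  print(completeness_score(y, y_pred))
  print(silhouette_score(X_std, y_pred))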

43 of 48

44 of 48

Let’s try to apply DBSCAN on non-spherical data.
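
A sketch of that, reusing make_circles from earlier and an illustrative eps (with a suitable eps the inner and outer rings come out as two separate clusters rather than being split down the middle):

  X_circ, _ = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=42)
  X_circ_std = StandardScaler().fit_transform(X_circ)
  y_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_circ_std)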

45 of 48

46 of 48

Please complete “clustering-lab2.ipynb”

47 of 48

Thanks for Listening.

Any Questions?

48 of 48

Reference