Clustering Algorithms:
K-means and DBSCAN
Yu-Chun Lo
Preface
The content of these slides is the same as the IPython notebook provided on GitHub. The only difference is that I’ve added some easy-to-read comments and explanations along with the code. Some tedious and unimportant code is not included on the slides, so I suggest playing with the code in the IPython notebook while reading this tutorial.
Let’s get started!
K-means
First, import the necessary packages.
Generate three clusters with two-dimensional data.
← The rows and columns of the numpy array correspond to the number of data points and the number of dimensions. In this case we generate 5000 data points, each of which is two-dimensional, so the training data X is a (5000 x 2) matrix.
← Each data point corresponds to a cluster label y (ground truth), and the number of clusters is three (label = {0, 1, 2}), so the label y is a 5000-dimensional vector.
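Below is a minimal sketch of this step, assuming the data is generated with scikit-learn's make_blobs (the exact parameters such as cluster_std and random_state are illustrative, not necessarily the ones used in the notebook):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 5000 two-dimensional points grouped into 3 clusters.
# X has shape (5000, 2); y is a 5000-dimensional vector of labels {0, 1, 2}.
X, y = make_blobs(n_samples=5000, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)
print(X.shape, y.shape)   # (5000, 2) (5000,)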
Let’s visualize our training data and ground truth.
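One possible way to plot it, continuing from the variables generated above (illustrative; the notebook's plotting code may differ):

# Color each point by its ground-truth cluster label.
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.title('Training data (colored by ground truth)')
plt.show()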
Use KMeans from scikit-learn to train our clustering model:
Parameters:
← Use “kmeans.labels_” to retrieve our predictions (5000-D vector).
← Use “kmeans.cluster_centers_” to retrieve trained centroids.
There are three clusters and the data is 2-D, thus it is a (3 x 2) matrix.
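A minimal sketch of the training step, using scikit-learn's KMeans on the data X generated above (n_clusters=3 matches the three generated clusters):

from sklearn.cluster import KMeans

# Fit k-means with k=3 on the training data.
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

print(kmeans.labels_.shape)           # (5000,)  predicted cluster label per point
print(kmeans.cluster_centers_.shape)  # (3, 2)   one 2-D centroid per cluster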
← Visualize predictions and centroids.
← After training, use “kmeans.predict()” to predict new data X_new without training again.
← Use “kmeans.transform()” to retrieve the distances between every data point and every centroid. If you have M data points and N clusters, you’ll get an (M x N) matrix; each row contains the distances from one data point to the N centroids.
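For example, continuing from the fitted model above (X_new here is hypothetical new data with the same two features):

# Predict cluster labels for previously unseen data (no re-training needed).
X_new = np.array([[0.0, 0.0], [5.0, 5.0]])   # hypothetical new points
print(kmeans.predict(X_new))                 # e.g. array([2, 0])

# Distance from every data point to every centroid:
# with M = 5000 points and N = 3 clusters, the result is a (5000, 3) matrix.
distances = kmeans.transform(X)
print(distances.shape)                       # (5000, 3)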
K-means++
If you want to use k-means++, just set init='k-means++'. Simple enough, huh? Besides, have you noticed that convergence is faster than with randomly picked centroids?
Actually, scikit-learn uses k-means++ as the default initialization, so you don’t need to set it manually (we set it explicitly here purely for educational purposes).
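For instance, on the same training data X:

# Explicitly request k-means++ initialization (already the default in scikit-learn).
kmeans_pp = KMeans(n_clusters=3, init='k-means++')
kmeans_pp.fit(X)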
The predictions from k-means++.
Please complete “clustering-lab1.ipynb”
Drawbacks of K-means and How to Address Them
Drawback 1:
Need to choose the right “k”
Let’s begin by generating synthesized training data consisting of 3 clusters.
Here’s a question: What if we choose a wrong number of clusters and directly apply k-means on the training data?
The answer is that even if we choose the wrong “k”, the k-means algorithm still runs without any error.
In the case of k=2, the middle cluster is assigned the same label as the right one (blue), because its points are on average closer to the right cluster (blue) than to the left one (red).
However, this result is not good enough. How do we choose “k” properly?
Use a supervised measure only when you have ground truth for your training data.
Use an unsupervised measure when you don’t have ground truth for your training data.
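As one concrete example of each kind (a hedged sketch, assuming X, y, and the fitted kmeans model from above; adjusted Rand index and silhouette are common choices, not necessarily the ones used in the notebook):

from sklearn.metrics import adjusted_rand_score, silhouette_score

# Supervised: compares the predicted labels against the ground truth y.
print(adjusted_rand_score(y, kmeans.labels_))

# Unsupervised: uses only the data X and the predicted labels.
print(silhouette_score(X, kmeans.labels_))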
First, we generate 4-cluster training data and visualize it. Note that we deliberately generate the data so that 3 of the clusters (red, blue and yellow) overlap with each other.
Let’s use the cluster quality measurements described above and run different values of “k” to observe which “k” is the most preferable.
Code explanation:
Since the visualization code is tedious, please refer to the IPython notebook.
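A minimal sketch of the loop that tries several values of “k” and scores each clustering with the silhouette coefficient (X here denotes the 4-cluster training data generated above; the per-cluster silhouette plot itself is in the notebook):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k).fit_predict(X)
    score = silhouette_score(X, labels)         # mean silhouette over all points
    per_point = silhouette_samples(X, labels)   # per-point values used in the silhouette plot
    print('k =', k, '-> mean silhouette =', round(score, 3))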
Let’s talk about how to interpret this diagram.
The left plot is simple: it just shows the cluster predictions.
The right plot:
← When k=4, the silhouette coefficients of all clusters generally extend beyond the red dashed line, meaning that the cluster qualities are good.
← When k=5, we can notice that 2 clusters (yellow and tiffany blue) fall behind the red dashed line, meaning that their overall silhouette coefficients are low, since these 2 clusters overlap with each other. The overlap results in a low mean nearest-cluster distance b(x), thus k=5 is not a good choice.
Drawback 2:
Cannot Handle Noise Data and Outliers
Extending the same example above, we cluster the data with k=4 and the data appears to be well-clustered.
However, we can notice that some data points which are relatively far away from the clusters they belong to are still assigned to them by k-means, even though they are obviously “outliers”.
So … how do we detect these outliers?
A simple approach is to compute, for each cluster, the average distance from its data points to its centroid and use it as a distance threshold: a data point is judged an outlier if its distance to its cluster center exceeds the threshold (sometimes we control the scale of the threshold by multiplying it by a threshold ratio).
← 1. Retrieve the data points and their centroid.
← 2. Compute the distances between every data point and the centroid, and average them to get our distance threshold.
← 3. Outlier detection.
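A rough sketch of these three steps for a single cluster (assuming the fitted k=4 kmeans model from above; the threshold ratio is an illustrative value):

import numpy as np

threshold_ratio = 1.0   # illustrative scaling factor for the threshold

# 1. Retrieve the data points of one cluster (here cluster 0) and its centroid.
cluster_points = X[kmeans.labels_ == 0]
centroid = kmeans.cluster_centers_[0]

# 2. Compute the distance from every point to the centroid; average them
#    to obtain the distance threshold.
dists = np.linalg.norm(cluster_points - centroid, axis=1)
threshold = dists.mean() * threshold_ratio

# 3. Outlier detection: points farther from the centroid than the threshold.
outliers = cluster_points[dists > threshold]
print(len(outliers), 'outliers detected in cluster 0')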
← By doing the simple outlier detection described above, we successfully detect the outliers (red).
However, if there are too many outliers in your dataset, this method may not work well, since k-means computes the mean from all the data points, including those outliers.
Drawback 3:
Cannot Handle Non-spherical Data
According to the definition of k-means, the algorithm assigns each data point to the cluster with the closest mean. This relies on the assumption that your data is roughly spherical; otherwise, the performance of k-means won’t be desirable.
Let’s see what the clustering result will be when we use k-means with k=2 to cluster the non-spherical data on the left-hand side.
It turns out that the concentric-circle data is partitioned from the center into 2 clusters, because the mean values of the inner circle and the outer circle are very close to each other.
← Helper function for transforming Cartesian coordinates to polar coordinates.
← We only need the radius feature; the second feature is filled with zeros, and then we apply k-means.
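A sketch of this trick, assuming the concentric-circle data is generated with make_circles (the parameters are illustrative):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans

# Non-spherical (concentric circles) data.
X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=42)

# Plain k-means on Cartesian coordinates splits the circles down the middle.
labels_cartesian = KMeans(n_clusters=2).fit_predict(X)

# Helper: Cartesian -> polar. We only keep the radius; the second feature is zero.
def cartesian_to_polar(points):
    radius = np.sqrt(points[:, 0] ** 2 + points[:, 1] ** 2)
    return np.column_stack([radius, np.zeros_like(radius)])

X_polar = cartesian_to_polar(X)
labels_polar = KMeans(n_clusters=2).fit_predict(X_polar)   # now separates inner/outer circle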
DBSCAN:
Density-Based Spatial Clustering of Applications with Noise
← 1. Generate training data X (1000 x 2) and the corresponding ground truth y.
← 2. Standardize the features of the training data.
← 3. Apply DBSCAN; the parameter “min_samples” is the same as the “MinPts” described previously.
← 4. Print cluster qualities.
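A minimal sketch of these four steps (the dataset and the eps / min_samples values are illustrative, not the ones from the notebook):

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# 1. Generate training data X (1000 x 2) and the corresponding ground truth y.
X, y = make_blobs(n_samples=1000, n_features=2, centers=3, random_state=42)

# 2. Standardize the features of the training data.
X = StandardScaler().fit_transform(X)

# 3. Apply DBSCAN; min_samples plays the role of "MinPts", eps is the neighborhood radius.
db = DBSCAN(eps=0.3, min_samples=10)
labels = db.fit_predict(X)          # noise points get the label -1

# 4. Print cluster qualities.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:', n_clusters)
print('Silhouette coefficient:', silhouette_score(X, labels))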
Let’s try to apply DBSCAN on non-spherical data.
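For example, on the concentric-circle data from before (eps and min_samples are illustrative and may need tuning):

from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# Density-based clustering separates the inner and outer circle without any coordinate trick.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # typically {0, 1} (plus -1 for any noise points)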
Please complete “clustering-lab2.ipynb”
Thanks for Listening.
Any Questions?