Clustering is an unsupervised learning method that divides data points into groups such that points in the same group are more similar to each other than to points in other groups.
Contributed by: Pavan Kumar Raja
There are a variety of algorithms, and each defines a cluster differently. Some algorithms look for instances centred around a particular point, called a centroid. Some algorithms look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters. Unfortunately, it is hard to tell in advance which one suits a given dataset, because the performance of each algorithm depends on unknown properties of the probability distribution underlying the data.
At their core, most clustering methods follow the same approach: first compute similarities between data points, then use them to group the points. Here we will focus on density-based spatial clustering of applications with noise (DBSCAN), which works well in spatial clustering applications.
What is DBSCAN?
DBSCAN is a clustering algorithm that defines clusters as continuous regions of high density and works well if all the clusters are dense enough and well separated by low-density regions.
With DBSCAN, instead of guessing the number of clusters, we define two hyperparameters, epsilon and minPoints, to arrive at clusters.
- Epsilon (ε): A distance threshold (radius) that defines the neighbourhood of a point and is used to check the density around it.
- minPoints (n): The minimum number of points (a threshold) that must lie within that neighbourhood for the region to be considered dense.
In higher dimensions, epsilon can be viewed as the radius of a hypersphere around a point and minPoints as the minimum number of data points required inside that hypersphere.
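To make the two hyperparameters concrete, here is a minimal sketch of the density check they define. The function names `region_query` and `is_dense` are hypothetical, chosen for illustration, and the sketch assumes Euclidean distance and counts the point itself as part of its own neighbourhood.

```python
import math

def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (the point itself included)."""
    center = points[index]
    return [j for j, p in enumerate(points) if math.dist(center, p) <= eps]

def is_dense(points, index, eps, min_points):
    """True if the eps-neighbourhood of points[index] holds at least min_points points."""
    return len(region_query(points, index, eps)) >= min_points

# Three points close together and one far away (illustrative data).
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]
dense_flags = [is_dense(points, i, eps=1.0, min_points=3) for i in range(len(points))]
print(dense_flags)
```

With these values, the three nearby points each have three points in their neighbourhood and pass the check, while the isolated point does not.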
How does the DBSCAN Algorithm create Clusters?
The algorithm starts by picking a point x (one record) from the dataset at random and counting how many points lie within distance ε (epsilon) of x. If this count is greater than or equal to minPoints (n), x is a core point: the algorithm assigns it to cluster 1 and pulls all of its ε-neighbours into the same cluster. If the count is lower, x is provisionally marked as an outlier, although it may still join a cluster later as a border point. The algorithm then examines each member of cluster 1 and finds its ε-neighbours. If a member of cluster 1 has n or more ε-neighbours, the cluster is expanded to include those ε-neighbours. This continues until there are no more points to add to cluster 1.
Once cluster 1 can no longer grow, the algorithm picks another point from the dataset that does not belong to any cluster and, if it is a core point, starts cluster 2 from it. It continues like this until every point either belongs to some cluster or is marked as an outlier.
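The procedure described above can be sketched in a few lines of plain Python. This is an illustrative, unoptimised version under stated assumptions, not a reference implementation: it visits points in index order rather than at random, uses Euclidean distance, counts a point as its own neighbour, and the names `dbscan`, `region_query` and `NOISE` are made up for this example.

```python
import math
from collections import deque

NOISE = -1  # label for points that end up in no cluster (assumed convention)

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def dbscan(points, eps, min_points):
    labels = [None] * len(points)        # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE            # may be relabelled as a border point later
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = deque(neighbours)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point: reachable but not core
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is core too, so keep expanding
    return labels

# Two tight groups and one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)
```

Each `region_query` call scans the whole dataset, so this sketch is quadratic in the number of points; practical implementations typically use a spatial index to speed up the neighbour search.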
DBSCAN distinguishes three kinds of points:
- Core Point (x): A data point that has at least minPoints (n) points within epsilon (ε) distance.
- Border Point (y): A data point that has fewer than minPoints (n) points within epsilon (ε) distance but has at least one core point within that distance.
- Noise Point (z): A data point that is not a core point and has no core points within epsilon (ε) distance.
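The three definitions above translate directly into a small classification routine. The sketch below is illustrative only: `classify_points` is a hypothetical helper, and it assumes Euclidean distance and that a point counts towards its own neighbourhood.

```python
import math

def classify_points(points, eps, min_points):
    """Label every point as 'core', 'border' or 'noise' for the given eps and min_points."""
    # Neighbourhood of each point, the point itself included.
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_points for n in neighbours]
    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            kinds.append("border")   # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds

# Illustrative data: a small group, two points on its edge, one far away.
points = [(0, 0), (1, 0), (0, 1), (2, 0), (6, 6)]
kinds = classify_points(points, eps=1.1, min_points=3)
print(kinds)
```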
DBSCAN Parameter Selection
DBSCAN is very sensitive to the values of epsilon and minPoints. Therefore, it is important to understand how to select the values of epsilon and minPoints. A slight variation in these values can significantly change the results produced by the DBSCAN algorithm.
As a starting point, a minimum n can be derived from the number of dimensions D in the data set, as n ≥ D + 1. For data sets with noise, larger values are usually better and will yield more significant clusters. Hence, n = 2·D is a reasonable value to try, but even larger values may be needed for very large data sets.
If ε is chosen too small, a large part of the data will not be clustered; if it is too large, clusters will merge and the majority of objects will end up in the same cluster. The value for ε can be chosen using a k-distance graph, which plots the distance from each point to its k-th nearest neighbour (with k = minPoints − 1), ordered from the largest to the smallest value. Good values of ε are where this plot shows an “elbow”.
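The values behind such a k-distance graph can be computed along these lines. This is a brute-force sketch for small data sets; `k_distances` is a hypothetical helper, Euclidean distance is assumed, and minPoints is set with the n = 2·D rule of thumb mentioned above.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted largest first."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Illustrative data: four points in a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
dims = len(points[0])
min_points = 2 * dims      # rule-of-thumb starting value
k = min_points - 1
curve = k_distances(points, k)
print(curve)
```

Plotting `curve` against its index gives the k-distance graph. In this toy data the sharp drop after the first value separates the distant point from the dense group, so a candidate ε would lie just above the smaller distances.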
- Distance Function:
By default, DBSCAN uses Euclidean distance, although other methods can also be used (like great circle distance for geographical data). The choice of distance function is tightly linked to the choice of epsilon (ε) value and has a major impact on the outcomes. Hence, the distance function needs to be chosen appropriately based on the nature of the data set.
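As an illustration of swapping the distance function, the sketch below uses the haversine formula for great-circle distance, so that ε can be expressed in kilometres for latitude/longitude data. The helper names, the mean Earth radius of 6371 km and the approximate city coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def neighbours_within(points, index, eps, distance):
    """Indices of points whose distance to points[index] is at most eps."""
    return [j for j, p in enumerate(points) if distance(points[index], p) <= eps]

# Approximate coordinates: Paris, Versailles, London.
cities = [(48.8566, 2.3522), (48.8049, 2.1204), (51.5074, -0.1278)]
near_paris = neighbours_within(cities, 0, eps=50.0, distance=haversine_km)
print(near_paris)
```

Because ε is compared against whatever the distance function returns, changing the function also changes the meaning of ε: here it is a radius of 50 km, whereas the same number applied to raw coordinates with Euclidean distance would mean 50 degrees.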
DBSCAN Vs K-means Clustering
|K-means Clustering||DBSCAN|
|Distance-based clustering||Density-based clustering|
|Every observation eventually becomes part of some cluster||Clearly separates outliers and clusters observations in high-density areas|
|Builds clusters that have the shape of a hypersphere||Builds clusters that have an arbitrary shape, including clusters within clusters|
|Sensitive to outliers||Robust to outliers|
|Requires the number of clusters as input||Doesn’t require the number of clusters as input|
DBSCAN can also produce more reasonable results than k-means across a variety of data distributions, particularly when clusters are not spherical. The figure below illustrates this: