# Density Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN clustering can identify outliers: observations which don't belong to any cluster. Since DBSCAN also identifies the number of clusters on its own, it is very useful for unsupervised learning when we don't know how many clusters the data might contain.

**By Abhijit Annaldas, Microsoft.**

DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses on the proximity and density of observations to form clusters. This is very different from K-Means, where an observation becomes part of the cluster represented by the nearest centroid. DBSCAN can identify outliers: observations which don't belong to any cluster. Since DBSCAN also identifies the number of clusters on its own, it is very useful for unsupervised learning when we don't know how many clusters the data might contain.

K-Means clustering may cluster loosely related observations together. Every observation becomes a part of some cluster eventually, even if the observations are scattered far away in the vector space. Since clusters depend on the mean value of cluster elements, each data point plays a role in forming the clusters. A slight change in data points *might* affect the clustering outcome. This problem is greatly reduced in DBSCAN due to the way clusters are formed.
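A toy comparison illustrates this point. The snippet below is a minimal sketch assuming scikit-learn is installed; the data and parameter values are chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two tight three-point blobs plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [9.0, 9.0]])  # the outlier

# K-Means must assign every point to some cluster,
# so the outlier is forced into the nearer blob's cluster.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# DBSCAN instead labels the outlier -1 (noise).
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)
```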

In DBSCAN, clustering happens based on two important parameters:

- **neighbourhood (n)** - the cutoff distance of a point from a core point (discussed below) for it to be considered part of a cluster. Commonly referred to as *epsilon* (abbreviated as *eps*).
- **minimum points (m)** - the minimum number of points required to form a cluster. Commonly referred to as *minPts*.

There are three types of points after DBSCAN clustering is complete:

- **Core** - a point which has at least *m* points within distance *n* from itself.
- **Border** - a point which has at least one Core point within distance *n*, but fewer than *m* points within distance *n* from itself.
- **Noise** - a point which is neither a Core nor a Border point: it has fewer than *m* points within distance *n*, and no Core point within distance *n*.

DBSCAN clustering can be summarized in the following steps:

- For each point *P* in the dataset, identify the points *pts* within distance *n*.
    - if *pts* >= *m*, label *P* as a *Core* point
    - if *pts* < *m* and a core point is within distance *n*, label *P* a *Border* point
    - if *pts* < *m* and no core point is within distance *n*, label *P* a *Noise* point
- For the sake of explainability, let's refer to a *Core* point and all the points within distance *n* of it as a Core-Set. All the overlapping Core-Sets are grouped together into one cluster, like multiple individual graphs being connected to form a set of connected graphs.
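The labelling steps above can be sketched with NumPy. This is a minimal illustration, not a full DBSCAN implementation: it only labels points and does not connect overlapping Core-Sets into clusters. Here `eps` stands for *n* and `min_pts` for *m*:

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the steps above.

    eps corresponds to n, min_pts to m; counts include the point itself.
    """
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    counts = (d <= eps).sum(axis=1)        # neighbours within eps (incl. self)
    is_core = counts >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append('core')
        elif is_core[d[i] <= eps].any():   # a core point within eps
            labels.append('border')
        else:
            labels.append('noise')
    return labels

X = np.array([[0.0, 0.0], [-0.2, 0.0], [0.0, -0.2], [0.9, 0.0], [5.0, 5.0]])
print(label_points(X, eps=1.0, min_pts=4))
# -> ['core', 'border', 'core', 'border', 'noise']
```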

Since clustering entirely depends on the parameters *n* and *m* (above), choosing these values correctly is very important. While good domain knowledge of the subject helps in choosing good values for these parameters, there are also some approaches by which these parameters can be fairly approximated without deep expertise in the domain.
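One commonly used heuristic for approximating *eps* (sketched here as an assumption, it is not described in the text above) is the k-distance plot: compute each point's distance to its *m*-th nearest neighbour, sort those distances, and choose *eps* near the "elbow" where the curve rises sharply.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)            # column 0 is each point's self-distance (0.0)
    return np.sort(d[:, k])   # k-th neighbour distance, ascending

rng = np.random.default_rng(0)
# Two dense blobs plus a sprinkling of uniform noise.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(4.0, 0.3, (50, 2)),
               rng.uniform(-2.0, 6.0, (10, 2))])

kd = k_distances(X, k=4)
# Dense-blob points have small k-distances; the curve bends upward where
# the sparse noise points begin, and eps is read off around that elbow.
print(kd[0], kd[-1])
```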

See the DBSCAN demo in the sklearn examples and try it with sklearn.cluster.DBSCAN in sklearn.
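A minimal usage sketch with `sklearn.cluster.DBSCAN`, where `eps` plays the role of *n* and `min_samples` the role of *m* (the dataset and parameter values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles, a shape centroid-based methods handle poorly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# db.labels_ holds one cluster index per point; -1 marks noise.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", list(db.labels_).count(-1))
```

Note that the number of clusters is not passed in; it falls out of `eps` and `min_samples`.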

**Bio: Abhijit Annaldas** is a Software Engineer and a voracious learner who has acquired Machine Learning knowledge and expertise to a fair extent. He is improving expertise day by day by learning new stuff and relentless practice, and has extensive experience building enterprise scale applications in different Microsoft and Open Source technologies as a Software Engineer at Microsoft, India since June 2012.

Original. Reposted with permission.
