Clustering Key Terms, Explained

Getting started with Data Science or need a refresher? Clustering is among the most used tools of Data Scientists. Check out these 10 Clustering-related terms and their concise definitions.

Clustering is a method of data analysis which groups data points together in order to  “maximizing the intraclass similarity and minimizing the interclass similarity,” ( by Han, Kamber & Pei) without using predefined labels of points (i.e., an unsupervised learning technique). This post introduces key words for common techniques in cluster analysis.

Different clusters in different color points (From Matthew Mayo‘s Machine Learning Key Terms, Explained)

1. Feature Selection

Feature selection is a data preprocessing step in which redundant and/or irrelevant features are pruned to enhance the quality of clustering. Feature selection can also be integrated directly into the clustering algorithm to gain better locality specific insights.

2. Expectation Maximization (EM)

This is an algorithm is used to estimate the parameters of a specific form assumed of the generative model of data (e.g., mixture of Gaussians).

3. Distance-based Methods

a) k-means is a method of clustering using distance. It is perhaps the most well-known example of a clustering algorithm.  It is the most widely used methods in practical implementations because of its simplicity. The Euclidian distance is used  to compute distances. Then, the partitioning representatives correspond to the mean of each cluster.

b) k-medians is quite similar to  the k-means method but it uses the median along each dimension, instead of the mean. This approach is more stable to noise and outliers, because the median is usually less sensitive to extreme values in the data.

4. Density- and Grid-Based Methods

These methods try to explore the data space at high levels of granularity. Thus, they can be used to reconstruct the entire shape of the data distribution. DBSCAN [1] and STING [2] are two classical examples.

a) The density-based method at any particular point in the data space is defined either in terms of the number of data points in a pre-specified volume of its locality or in terms of a smoother kernel density estimate [3]. This method is naturally defined in a continuous space, therefore, arbitrary data types, e.g. time-series, are not quite as easy to use with density-based methods without specialized transformations.

b) Grid-based methods are a specific class of density-based methods in which the individual regions of the data space which are explored are formed into a grid-like structure.

DBSCAN can find non-linearly separable clusters and outperforms k-means or Gaussian Mixture EM clustering for this example [1].

5. Matrix Factorization

Matrix Factorization is for data which is represented as sparse nonnegative matrices, is also referred to as co-clustering, which clusters the rows and columns of a matrix simultaneously.

6. Spectral Methods

Spectral Methods use the similarity (or distance) matrix of the underlying data, instead of working with the original points and dimensions. They can perform the dual task of embedding these objects into a Euclidian space, while performing the dimensionality reduction. Thus, this type is popular for clustering on arbitrary objects such as node sets in a graph.

7. Graph-based Techniques

Spectral methods can be considered a graph-based technique for clustering of any kinds of data, by converting the similarity matrix into a network structure. Many variants exist in terms of the different choices for constructing the similarity matrix W. Some simpler variants use the mutual k-nearest neighbor graph, or simply the binary graph in which the distances are less than a given threshold.

8. Streaming scenario

Streaming scenario is the continuous accumulation of data over time. This leads to numerous challenges when real-time analysis and scalability issues.

Clustering is one of the top method used in data mining, e.g., applications to customer segmentation, target marketing, and data summarization. In the literature, numerous groups of methods have been proposed. Probabilistic methods, distance-based methods, density-based methods, grid-based methods, factorization techniques, and spectral methods are typical grouping. Integration of feature selection/dimensionality reduction methods with clustering are often found in clustering methods.

Bio: Thuy T. Pham is a student at University of Sydney, Australia.  She is interested in Biomedical wearable devices, Applied machine learning, Big data, Internet of things, and data science.


[1]: M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. ACM KDD Conference, pages 226–231, 1996.

[2]: W. Wang, J. Yang, and R. Muntz. Sting: A statistical information grid approach to spatial data mining. VLDB Conference, 1997.

[3]: B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986