Clustering Key Terms, Explained
Getting started with Data Science or need a refresher? Clustering is among the most used tools of Data Scientists. Check out these 10 Clusteringrelated terms and their concise definitions.
Clustering is a method of data analysis which groups data points together in order to “maximizing the intraclass similarity and minimizing the interclass similarity,” ( by Han, Kamber & Pei) without using predefined labels of points (i.e., an unsupervised learning technique). This post introduces key words for common techniques in cluster analysis.
Different clusters in different color points (From Matthew Mayo‘s Machine Learning Key Terms, Explained)
1. Feature Selection
Feature selection is a data preprocessing step in which redundant and/or irrelevant features are pruned to enhance the quality of clustering. Feature selection can also be integrated directly into the clustering algorithm to gain better locality specific insights.
2. Expectation Maximization (EM)
This is an algorithm is used to estimate the parameters of a specific form assumed of the generative model of data (e.g., mixture of Gaussians).
3. Distancebased Methods
a) kmeans is a method of clustering using distance. It is perhaps the most wellknown example of a clustering algorithm. It is the most widely used methods in practical implementations because of its simplicity. The Euclidian distance is used to compute distances. Then, the partitioning representatives correspond to the mean of each cluster.
b) kmedians is quite similar to the kmeans method but it uses the median along each dimension, instead of the mean. This approach is more stable to noise and outliers, because the median is usually less sensitive to extreme values in the data.
4. Density and GridBased Methods
These methods try to explore the data space at high levels of granularity. Thus, they can be used to reconstruct the entire shape of the data distribution. DBSCAN [1] and STING [2] are two classical examples.
a) The densitybased method at any particular point in the data space is defined either in terms of the number of data points in a prespecified volume of its locality or in terms of a smoother kernel density estimate [3]. This method is naturally defined in a continuous space, therefore, arbitrary data types, e.g. timeseries, are not quite as easy to use with densitybased methods without specialized transformations.
b) Gridbased methods are a specific class of densitybased methods in which the individual regions of the data space which are explored are formed into a gridlike structure.
DBSCAN can find nonlinearly separable clusters and outperforms kmeans or Gaussian Mixture EM clustering for this example [1].
5. Matrix Factorization
Matrix Factorization is for data which is represented as sparse nonnegative matrices, is also referred to as coclustering, which clusters the rows and columns of a matrix simultaneously.
6. Spectral Methods
Spectral Methods use the similarity (or distance) matrix of the underlying data, instead of working with the original points and dimensions. They can perform the dual task of embedding these objects into a Euclidian space, while performing the dimensionality reduction. Thus, this type is popular for clustering on arbitrary objects such as node sets in a graph.
7. Graphbased Techniques
Spectral methods can be considered a graphbased technique for clustering of any kinds of data, by converting the similarity matrix into a network structure. Many variants exist in terms of the different choices for constructing the similarity matrix W. Some simpler variants use the mutual knearest neighbor graph, or simply the binary graph in which the distances are less than a given threshold.
8. Streaming scenario
Streaming scenario is the continuous accumulation of data over time. This leads to numerous challenges when realtime analysis and scalability issues.
Clustering is one of the top method used in data mining, e.g., applications to customer segmentation, target marketing, and data summarization. In the literature, numerous groups of methods have been proposed. Probabilistic methods, distancebased methods, densitybased methods, gridbased methods, factorization techniques, and spectral methods are typical grouping. Integration of feature selection/dimensionality reduction methods with clustering are often found in clustering methods.
Bio: Thuy T. Pham is a student at University of Sydney, Australia. She is interested in Biomedical wearable devices, Applied machine learning, Big data, Internet of things, and data science.
References:
[1]: M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A densitybased algorithm for discovering clusters in large spatial databases with noise. ACM KDD Conference, pages 226–231, 1996.
[2]: W. Wang, J. Yang, and R. Muntz. Sting: A statistical information grid approach to spatial data mining. VLDB Conference, 1997.
[3]: B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986
Related:
Top Stories Past 30 Days

