Comparing Clustering Techniques: A Concise Technical Overview
A wide array of clustering techniques is in use today. Given how widely clustering is used in everyday data mining, this post provides a concise technical overview of two such exemplar techniques.
Clustering is used to analyze data which does not include prelabeled classes. Data instances are grouped by maximizing intraclass similarity and minimizing interclass similarity. In other words, the clustering algorithm identifies and groups instances which are very similar to one another, as opposed to instances from different groups, which are much less similar. As clustering does not require prelabeled classes, it is a form of unsupervised learning.
k-means clustering is perhaps the most well-known example of a clustering algorithm. However, it is not the only one. Completely different schemes exist, such as hierarchical clustering, fuzzy clustering, and density-based clustering, as do different takes on centroid-style clustering, such as using the median (as opposed to the mean), or requiring that the centroid be a cluster member (read on for some context).
Below is a brief technical overview of the k-means clustering algorithm, as well as of Expectation-Maximization (EM), a Gaussian-distribution-based clustering method. Both approaches attempt to maximize intraclass similarity and minimize interclass similarity, yet take different routes in doing so.
k-means Clustering
k-means is a simple, yet often effective, approach to clustering. k points are randomly chosen as cluster centers, or centroids, and all training instances are plotted and added to the closest cluster. After all instances have been added to clusters, the centroids, representing the mean of the instances of each cluster, are recalculated, and these recalculated centroids become the new centers of their respective clusters.
At this point, all cluster membership is reset, and all instances of the training set are replotted and added to their closest, possibly recentered, cluster. This iterative process continues until there is no change to the centroids or their membership, and the clusters are considered settled.
Fig. 1: k-means clustering algorithm.
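The loop described above can be sketched in a few lines of NumPy. The function name and the random-instance initialization below are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of instances."""
    rng = np.random.default_rng(seed)
    # Choose k instances at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each instance to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its current members
        # (keeping the old centroid if a cluster is momentarily empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged once the recalculated centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On well-separated data, the assignments settle after only a few iterations; real-world use would also add a tolerance margin and multiple restarts.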
Convergence is achieved once the recalculated centroids match the previous iteration's centroids, or are within some preset margin. The measure of distance in k-means is generally Euclidean, which, given two points a = (x₁, y₁) and b = (x₂, y₂), can be represented as:

d(a, b) = √((x₁ − x₂)² + (y₁ − y₂)²)
Of technical note, especially in the era of parallel computing, the iterations of k-means are serial in nature; however, the distance calculations within an iteration need not be. Therefore, for data sets of significant size, the distance calculations are a worthy target for parallelization in the k-means clustering algorithm.
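As a sketch of that idea, the per-iteration distance calculations can be expressed as a single NumPy broadcast, delegating the work to vectorized array operations rather than a Python loop over instances; the function name here is an assumption for illustration:

```python
import numpy as np

def assignment_step(X, centroids):
    """Assign every instance in X (n, d) to its closest centroid (k, d)."""
    # (n, 1, d) - (1, k, d) broadcasts to an (n, k, d) difference array;
    # the norm over the last axis yields an (n, k) distance matrix.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the closest centroid for each instance.
    return dists.argmin(axis=1)
```

The same shape of computation maps naturally onto multicore BLAS or GPU array libraries, which is the parallelization opportunity noted above.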
Expectation-Maximization
Probabilistic clustering aims to determine the most likely set of clusters, given a set of data. EM is a probabilistic clustering algorithm, and, as such, involves determining the probabilities that instances belong to particular clusters. EM "approaches maximum likelihood or maximum a posteriori estimates of parameters in statistical models" (Han, Kamber & Pei). The EM process begins with an initial set of parameters and iterates until the clustering is maximized with respect to k clusters.
As its name suggests, EM consists of two distinct steps: expectation and maximization. The expectation step (E-step) assigns particular objects to clusters based on the current parameters; it can also be referred to as the cluster probability calculation step, the cluster probabilities being the "expected" class values. The maximization step (M-step) recalculates the distribution parameters, maximizing the expected likelihood.
The parameter estimation equations express the fact that only the probabilities of cluster membership are known, not the cluster assignments themselves. The mean of a cluster C is calculated as

μ_C = (w₁x₁ + w₂x₂ + … + wₙxₙ) / (w₁ + w₂ + … + wₙ)

and the standard deviation of a cluster is determined by

σ_C = √( (w₁(x₁ − μ_C)² + … + wₙ(xₙ − μ_C)²) / (w₁ + w₂ + … + wₙ) )

where wᵢ is the probability that instance i is a member of cluster C, and x₁, …, xₙ are all of the dataset's instances.
When given a new instance to cluster, its membership probability is calculated for each cluster, and the instance becomes a member of the cluster with the highest membership probability. These steps are repeated until the inter-iteration delta falls below a predefined threshold; in practice, iteration should continue until the increase in log-likelihood is negligible. The log-likelihood typically increases dramatically during the first few iterations and converges to this negligible point quite quickly.
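The E-step/M-step loop described above can be sketched for a one-dimensional mixture of k Gaussians. The function name, the linearly spaced initialization, and the fixed iteration count (in place of a log-likelihood threshold) are simplifying assumptions for illustration:

```python
import numpy as np

def em_gaussian(x, k=2, n_iter=50):
    """EM sketch for a 1-D Gaussian mixture: x is a 1-D array of instances."""
    # Initialize means spread across the data range, unit variances,
    # and uniform mixing weights (priors).
    mu = np.linspace(x.min(), x.max(), k)
    sigma = np.ones(k)
    weight = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: w[i, j] is the probability that instance i belongs
        # to cluster j, given the current parameters.
        dens = weight * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
            / (sigma * np.sqrt(2 * np.pi))
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted parameter estimates, matching the
        # mean/standard-deviation equations given earlier.
        nk = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        weight = nk / len(x)
    return mu, sigma, weight
```

A production version would track the log-likelihood each iteration and stop once its increase drops below a threshold, as described above, and would guard against collapsing variances.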
It should be noted that EM can also be used for fuzzy clustering, as opposed to probabilistic clustering.
While k-means and Expectation-Maximization clearly take different approaches, they are both clustering techniques. This difference should be a good hint at the variance which exists among the wide array of clustering techniques available and in use today.