Iterative Initial Centroid Search via Sampling for k-Means Clustering
Searching for a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process. This post outlines just such an approach.
In this post, we will take an iterative approach to searching for a better set of initial centroids for k-means clustering, performing this search on a sample of our full dataset.
What do we mean by "better"? Since k-means clustering aims to converge on an optimal set of cluster centers (centroids) and cluster membership based on distance from these centroids via successive iterations, it is intuitive that the more optimal the positioning of these initial centroids, the fewer iterations of the k-means clustering algorithm will be required for convergence. Therefore, thinking about ways to find a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process.
What we will do differently, specifically, is draw a sample of data from our full dataset and perform short runs of the k-means clustering algorithm on it (not to convergence), short runs which will necessarily include the centroid initialization process. We will repeat these short runs with a number of randomly initialized sets of centroids, tracking the within-cluster sum-of-squares (inertia), one of the valid metrics for measuring goodness of cluster membership. The final centroids from the initialization attempt which produces the lowest inertia are the centroids we will carry forward to our full dataset clustering process.
The hope is that this up-front work will lead to a better set of initial centroids for our full clustering process, and hence fewer k-means clustering iterations and, ultimately, less time required to fully cluster a dataset.
This is obviously not the only method of optimizing centroid initialization. In the past we have discussed the naive sharding centroid initialization method, a deterministic method for optimal centroid initialization. Other modifications to the k-means clustering algorithm take different approaches to this problem as well (see k-means++ for comparison).
This post will approach our task as follows:
- prepare the data
- prepare our sample
- perform centroid initialization search iterations to determine our "best" collection of initial centroids
- use results to perform clustering on full dataset
A future post will perform and report comparisons of results between various approaches to centroid initialization, for a more comprehensive understanding of the practicalities of implementation. For now, however, let's introduce and explore this particular approach to centroid initialization.
Preparing the data
For this overview, we will use the 3D road network dataset.
Since this particular dataset has no missing values and no class labels, our data preparation will primarily consist of normalization, along with dropping a column which identifies the geographical location the remaining 3 columns of measurements come from, and which is not useful for our task. See the dataset description for additional details.
Let's check out a sampling of our data:
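A minimal sketch of this preparation, using pandas and Scikit-learn's MinMaxScaler. The column names follow the dataset description (an OSM ID plus three coordinate columns); the randomly generated frame below is a stand-in for the downloaded file, for which the commented `read_csv` line would be used instead:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

cols = ['OSM_ID', 'LONGITUDE', 'LATITUDE', 'ALTITUDE']

# Stand-in for the real file; with the dataset downloaded you would use:
# data = pd.read_csv('3D_spatial_network.txt', names=cols)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((1000, 4)), columns=cols)

# Drop the geographic identifier column, which is not useful for clustering
data = data.drop(columns=['OSM_ID'])

# Normalize all remaining features to [0, 1]
data[data.columns] = MinMaxScaler().fit_transform(data[data.columns])

print(data.sample(5, random_state=0))
```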
Preparing the sample
Next, we will pull our sample that will be used to find our "best" initial centroids. Let's be clear about exactly what we are doing:
- we are pulling a single set of samples from our dataset
- we will then perform successive rounds of k-means clustering on this sample data, each iteration of which will:
- randomly initialize k centroids and perform n iterations of the k-means clustering algorithm
- the initial inertia (within-cluster sum-of-squares) of each attempt will be noted, as will its final inertia, and the initial centroids whose short run produces the lowest final inertia after n iterations will be chosen as our initial centroids for the full dataset clustering
- we will then perform full k-means clustering on the full dataset, using the initial centroids found in the previous step
Two important points:
- Why not use the greatest decrease in inertia instead (the hope being that the initial momentum in this area would continue)? Doing so would also be a valid choice to explore (and changing a single line of code would allow for it). This was an arbitrary initial experimentation choice, and one which could use more investigation. However, repeated execution and comparison on a number of samples initially showed that the lowest final inertia and the greatest decrease in inertia coincide a great majority of the time, so the decision may in fact be arbitrary, but also inconsequential in practice.
- To be especially clear, we are not sampling multiple times from our dataset (e.g. once for each iteration of centroid initialization). We are sampling once for all iterations of a single centroid initialization search: one sample, from which we will randomly derive initial centroids many times. Contrast this with the idea of repeated sampling, once per centroid initialization iteration.
Below, we set:
- sample size as a ratio of our full dataset
- random state for reproducibility
- number of clusters (k) for our dataset
- number of iterations (n) for our k-means algorithm
- number of attempts at finding our best chance initial centroids while clustering on our sample dataset
We then set our sample data.
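A sketch of these settings and the single sampling step, using pandas' `DataFrame.sample`; the randomly generated `data` frame stands in for the prepared, normalized dataset from the previous step, and the specific parameter values match those discussed in this post:

```python
import numpy as np
import pandas as pd

# Stand-in for the prepared, normalized dataset
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((1000, 3)),
                    columns=['LONGITUDE', 'LATITUDE', 'ALTITUDE'])

SAMPLE_SIZE = 0.1    # sample size as a ratio of the full dataset
RANDOM_STATE = 42    # random state for reproducibility (assumed value)
NUM_CLUSTERS = 10    # number of clusters (k)
NUM_ITER = 3         # number of iterations (n) per short run
NUM_ATTEMPTS = 5     # number of random initialization attempts (m)

# Draw the single sample used for every initialization attempt
data_sample = data.sample(frac=SAMPLE_SIZE, random_state=RANDOM_STATE,
                          replace=False)
print(data_sample.shape)
```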
Now that we have our data sample (data_sample), we are ready to perform iterations of centroid initialization for comparison and selection.
Clustering the sample data
Since Scikit-learn's k-means clustering implementation does not allow for easily obtaining centroids between clustering iterations, we have to hack the workflow a bit. While the verbose option does output some useful information in this regard directly to screen, and redirecting this output and then post-parsing it would be one approach to getting what we need, what we will do instead is write our own outer iteration loop to control our n variable ourselves.
This means that we need to count iterations and capture what we need between these iterations, after each clustering step has run. We will then wrap that clustering iteration loop in a centroid initialization loop, which will initialize k centroids from our sample data m times. This is the hyperparameter specific to our particular instantiation of the k-means centroid initialization process, beyond "regular" k-means.
Given our above parameters, we will be clustering our dataset into 10 clusters (NUM_CLUSTERS, or k), we will run our centroid search for 3 iterations (NUM_ITER, or n), and we will attempt this with 5 random initial centroids (NUM_ATTEMPTS, or m), after which we will determine our "best" set of centroids to initialize with for full clustering (in our case, the metric is the lowest within-cluster sum-of-squares, or inertia).
Prior to any clustering, however, let's see what a single random initialization of our k-means centroids looks like, before any clustering iterations have run.
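One way to sketch such a random initialization: k observations drawn without replacement from the sample serve as the initial centroids (this mirrors Scikit-learn's init='random' behavior). The generated `data_sample` frame below is a stand-in for the sample from the previous step:

```python
import numpy as np
import pandas as pd

# Stand-in for data_sample from the previous step
rng = np.random.default_rng(42)
data_sample = pd.DataFrame(rng.random((100, 3)),
                           columns=['LONGITUDE', 'LATITUDE', 'ALTITUDE'])
NUM_CLUSTERS = 10

# One random initialization: k points drawn from the sample become centroids
idx = rng.choice(len(data_sample), size=NUM_CLUSTERS, replace=False)
init_centroids = data_sample.to_numpy()[idx]
print(init_centroids)
```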
In the code below, note that we have to manually track our centroids at the start of each iteration and at the end of each iteration, given that we are managing these successive iterations ourselves. We then feed these end centroids into our next loop iteration as the initial centroids, and run for one iteration. A bit tedious, and aggravating that we can't get this out of Scikit-learn's implementation directly, but not difficult.
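The nested search loop can be sketched as below: the hack is fitting Scikit-learn's KMeans with max_iter=1 and n_init=1, capturing cluster_centers_ after each single-step fit, and feeding those centers back in as init for the next step. The randomly generated `X` stands in for the sample data, and variable names like `final_cents` are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the sample data (data_sample in the post)
rng = np.random.default_rng(42)
X = rng.random((200, 3))

NUM_CLUSTERS, NUM_ITER, NUM_ATTEMPTS = 10, 3, 5

final_cents, final_inert = [], []

for attempt in range(NUM_ATTEMPTS):
    # Randomly initialize k centroids by drawing k points from the sample
    idx = rng.choice(len(X), size=NUM_CLUSTERS, replace=False)
    centroids = X[idx]
    for i in range(NUM_ITER):
        # One k-means step at a time, seeded with the previous step's centroids
        km = KMeans(n_clusters=NUM_CLUSTERS, init=centroids,
                    n_init=1, max_iter=1).fit(X)
        # Carry the end centroids forward as the next step's initial centroids
        centroids = km.cluster_centers_
    # Track the end-of-run centroids and inertia for this attempt
    final_cents.append(centroids)
    final_inert.append(km.inertia_)

print(final_inert)
```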
After this is done, let's see how we did in our centroid search. First we check a list of our final inertias (or within-cluster sum-of-squares), looking for the lowest value. We then set the associated centroids as the initial centroids for our next step.
And here's what those centroids look like:
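A sketch of this selection step, assuming the `final_inert` and `final_cents` lists built during the search loop (the small hypothetical values below stand in for real results):

```python
import numpy as np

# Stand-in results from the search loop (hypothetical values)
final_inert = [14.2, 12.7, 13.1, 12.9, 13.8]
final_cents = [np.full((10, 3), float(i)) for i in range(5)]

# The attempt with the lowest final inertia wins; its centroids
# become the initial centroids for the full clustering run
best = int(np.argmin(final_inert))
best_cents = final_cents[best]

print(final_inert)
print(best_cents)
```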
Running full k-means clustering
And now, with our best initial centroids in hand, we can run k-means clustering on our full dataset. As Scikit-learn allows us to pass in a set of initial centroids, we can exploit this with the comparatively straightforward lines below.
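A sketch of that full run, passing the searched centroids via KMeans' init parameter (n_init=1 since the initialization is explicit); the generated `data` array and centroids are stand-ins for the real dataset and the best_cents found above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for the full dataset and the searched initial centroids
rng = np.random.default_rng(0)
data = rng.random((1000, 3))
best_cents = data[rng.choice(1000, size=10, replace=False)]

# Full clustering seeded with our searched centroids
km_full = KMeans(n_clusters=10, init=best_cents, n_init=1,
                 max_iter=300).fit(data)

print(km_full.n_iter_)   # iterations needed to converge
print(km_full.inertia_)  # final within-cluster sum-of-squares
```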
This particular run of k-means converged in 13 iterations:
For comparison, here's a full k-means clustering run using only randomly-initialized centroids ("regular" k-means):
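The comparison run can be sketched as a single k-means fit with init='random' (one randomly chosen set of initial centroids, no search); the generated `data` array again stands in for the full dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the full dataset
rng = np.random.default_rng(0)
data = rng.random((1000, 3))

# "Regular" k-means: one run from randomly chosen initial centroids
km_rand = KMeans(n_clusters=10, init='random', n_init=1,
                 max_iter=300, random_state=42).fit(data)

print(km_rand.n_iter_)
print(km_rand.inertia_)
```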
This run took 39 iterations, with a nearly-identical inertia:
I leave finding the difference of execution times between the 13 iterations and 39 iterations (or similar) to the reader. Needless to say, eating up a few cycles ahead of time on a sample of data (in our case, 10% of the full dataset) saved considerable cycles in the long run, without sacrifice to our overall clustering metric.
Of course, additional testing before drawing any generalizations is warranted, and in a future post I will run some experiments on a number of centroid initialization methods on a variety of datasets and compare with some additional metrics, hopefully to get a clearer picture of ways to go about optimizing unsupervised learning workflows.
- Toward Increased k-means Clustering Efficiency with the Naive Sharding Centroid Initialization Method
- Comparing Distance Measurements with Python and SciPy
- Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering