Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering

The second post in this series of tutorials for implementing machine learning workflows in Python from scratch covers implementing the k-means clustering algorithm.

In the first part of this series, we started off rather slowly but deliberately. The previous post laid out our goals, and started off with some basic building blocks for our machine learning workflows and pipelines we will eventually get to. If you have not yet read the first installment in this series, I suggest that you do so before moving on.

This time around we pick up steam, and will be doing so with an implementation of the k-means clustering algorithm. We will discuss specific aspects of k-means as they come up while coding, but if you are interested in a superficial overview of what the algorithm is about, as well as how it relates to other clustering methods, you could check this out.

ML workflows header
The k-means clustering algorithm in Python. From scratch.

The only real prerequisites moving forward are the module we created in the first post, along with the original iris.csv file, so make sure you have both of those handy.


The k-means Clustering Algorithm

k-means is a simple, yet often effective, approach to clustering. k points are randomly chosen as cluster centers, or centroids, and all training instances are plotted and added to the closest cluster. After all instances have been added to clusters, the centroids, representing the mean of the instances of each cluster are re-calculated, with these re-calculated centroids becoming the new centers of their respective clusters.

At this point, all cluster membership is reset, and all instances of the training set are re-plotted and -added to their closest, possibly re-centered, cluster. This iterative process continues until there is no change to the centroids or their membership, and the clusters are considered settled.

As far as approaches to clustering go, k-means is about as simple as they come -- its conceptual simplicity is elegant, almost poetic. It's also a proven workhorse, lending to its staying power, often producing useful results.

What were going to do, in a nutshell, is code up a simple k-means implementation that will group similar things together and keep dissimilar things apart, at least theoretically. Simple enough. Keep in mind that "similar" here is reduced to meaning "relatively closely co-located in Euclidean space," or something very non-philosophical like that.

We will need a few functions here. Thinking about the steps involved in the algorithm can help us solidify what those might be:

  • Set initial centroids
  • Measure distances between data instances and centroids
  • Add data instances as members of closest centroid
  • Re-calculate centroids
  • If necessary, re-measure, re-cluster, re-calculate

That's the algorithm in a nutshell. But before we get there, a (temporary) step backward.


Data Preparation... Again

While writing this post, it eventually occurred to me that an important part of our data preparation workflow was missing. Before we convert our pandas DataFrame to a numpy ndarray (matrix), we will want to make sure our numeric values are actually numeric values, and not strings masquerading as numeric values. Since we read our data from a CSV file last time, even our numeric values were being stored as strings (apparent at the bottom of the previous post by the fact that the numbers are surrounded by single quotes -- e.g. '5.7').

So, the best way to deal with this is to create another function and add it to our file, which will convert strings to their numeric value representations (we already have a function for converting class names to numeric values, and keeping track of these changes). The function went through 3 specific iterations as I played with it, namely those which accepted: 1) a dataset and the name of a single attribute, the corresponding column of which all values should be converted from strings to floats; 2) a dataset and a list of attribute names...; and 3) a dataset and either a single attribute as a string, or a list of attributes as a *gasp* list.

The final iteration (the third, more flexible, option) is the one shown below. Let's add it to the module from last time.

OK, with that out of the way, we will be able to load a dataset, clean it, and create a (fully numeric) matrix representation, which can then be fed into our k-means clustering algorithm, once we have one. Speaking of which...


Initializing Centroids

The first thing our algorithm needs to do is to create a set of k initial centroids. There are a variety of ways to approach this, but we will start with the most basic: random initialization. We need a function that accepts a dataset and an integer k, and which will return an ndarray of that number of randomly-generated centroids.

You may have already imagined how random initialization could go awry. As a quick example, think about 5 distinct, tightly clustered classes in 2 dimensional space, with a poorly initialized set of centroids, as shown below.

Poor initialization
Obviously non-optimal centroid initialization.

Even without the mathematics to support the intuition, it's clear that these centroids are not optimally placed. However, one of k-means' strong attributes is its ability to recover from such initialization, with successive rounds of clustering moving toward optimal placement by minimizing the mean distance between cluster instance members and cluster centroids.

While this can happen remarkably quickly, poor initialization with sufficient amounts of high-dimensionality data can lead to a greater number of clustering iterations. The initial data space survey for random placement can also, itself, become lengthy. And so alternative methods for centroid initialization are available, some of which we may look at in the future.

A quick note about testing our code: doing so as we go with heavily contrived scenarios seems more trouble than it's worth, so we will hold out until the end to see how we did in total.


Measuring Distances

With a dataset in hand and a collection of initialized centroids, we will eventually have to perform a lot of measurement calculations. In fact, for each clustering iteration we will have to measure from each data point to each centroid, in order to know which cluster an instance belongs to (perhaps temporarily). So let's write a measurement function for Euclidean space. We will use Scipy here for the heavy lifting; while coding a distance measurement is not very difficult, Scipy includes a function optimized for such calculations on vectors, which, conveniently, is exactly what we will be doing.

Let's wrap this functionality in our own function, in case we want to later change or experiment with how we calculate distance.

With the ability to initialize centroids and to make our measurements, we can now write a function to control the logic of our clustering algorithm, and perform the few additional required steps.


Clustering Instances

Before we look at the code, here is a simple overview of the process our algorithm will be following, taking into account the few functions from above, as well as below.

k-means diagram
Our k-means clustering algorithm process.

The code below is well-commented, but let's walk through a few of the main points.

First, our function accepts a dataset as a numpy ndarray, as well as the number of clusters we want to use in the clustering process. It returns several things:

  • the resulting centroids after clustering is finished
  • the cluster assignments after clustering
  • the number of iterations which were required by clustering algorithm
  • the original centroids

An ndarray to store instance cluster assignments and their errors are created, and then centroids are initialized, with a copy held to return later.

A while loop then continues until there has been no change to cluster assignments, meaning that convergence has been reached -- note that no additional mechanism to limit the number of iterations exists at this point. Instances to centroid distances are calculated, with the minimum distance tracked, which is used to determine cluster assignment. The closest centroid is then recorded, along with the actual distance between the 2 entities (the error), and a check is performed to see if the cluster assignment for the particular instances has changed.

After this has been performed for each instance, the centroid locations are updated, simply by using the mean values of the member instances as centroid coordinates. The number of iterations are also recorded. A check as to whether convergence has been reached kicks control out of the while loop once appropriate, and the items outlined above are returned.

I want to point out that some of this code, as well as additional inspiration, comes from the book "Machine Learning in Action" (MLIA), by Peter Harrington. I bought this book when it was first released, and it has proven invaluable for reasons related to the criticism the book often receives, most of which focuses on either not enough theory and/or issues with code. In my humble opinion, however, these are both assets. If you are like me, in that you came to the book with theoretical understanding, and are willing and able to fight through code which may need tuning, or which is simply enough that you can provide your own changes, MLIA can be a useful resource to anyone looking to make their first foray into coding machine learning algorithms.


Testing Our k-means Clustering Algorithm

There are a few typical tasks which we are going to forego for this post and will revisit later, but I wanted to point them out here.

First, when clustering, especially with attributes of varying scales, it is generally a good idea to at least consider the idea of scaling or otherwise normalizing the data, to ensure that a single attribute (or collection of attributes) of a vastly greater scale then the others does not end up accounting for more than it should. If we have 3 attributes, with the first 2 in the range [0, 1] and the third in the range [0, 100], it's easy to see that the third variable will dominate the measurements and subsequent cluster membership allocations.

Second, when clustering (as with much of machine learning), we can split a dataset into training and testing sets, allowing us to train a model on one subset of data, and then test the model on the other (separate) set of data. While this is not always part of the goal of a given clustering task, as we may simply want to build a clustering model and not be concerned with using it to classify subsequent instances, it often can be. We will proceed with our test code below without taking either of these into consideration at this point, but may double back in a subsequent post.

Before moving ahead, make sure you have added the dataset-related function outlined further above to the existing module, and have created a module to house all of the relevant functions.

Let's try our code:

Number of iterations: 5

Final centroids:
[[ 6.62244898  2.98367347  5.57346939  2.03265306  2.        ]
 [ 5.006       3.418       1.464       0.244       0.        ]
 [ 5.91568627  2.76470588  4.26470588  1.33333333  1.01960784]]

Cluster membership and error of first 10 instances:
[[ 1.        0.021592]
 [ 1.        0.191992]
 [ 1.        0.169992]
 [ 1.        0.269192]
 [ 1.        0.039192]
 [ 1.        0.467592]
 [ 1.        0.172392]
 [ 1.        0.003592]
 [ 1.        0.641592]
 [ 1.        0.134392]]

Original centroids:
[[ 4.38924071  3.94546253  5.49200482  0.40216215  1.95277771]
 [ 5.43873792  3.58653594  2.73064731  0.79820023  0.97661014]
 [ 4.62570586  2.46497863  3.14311939  2.4121321   0.43495676]]

Looks good! This particular test of our code resulted in the above, but you will find that subsequent iterations return different results -- at least, different numbers of iterations and sets of original centroids.


Looking Ahead

Though we have not yet evaluated our clustering results, we will stop here for now... but, on that note, I bet you can guess what is in store for the next post. Next time we will focus on a few more clustering-related activities. We have an algorithm which we can use to build models, but there needs to be some mechanisms for evaluating and visualizing their results. This is what we will get to next.

Thinking further ahead, I then plan to turn our attention to classification using the k-nearest neighbors algorithm, as well a number of classification-related tasks. Once again, I hope you have found this helpful enough to check the next installment.