The Beautiful Duality of Topological Data Analysis
Topological Data analysis is special, because its methods are both general and precise. Teams that use TDA in their work see the “art of the possible” more broadly and can attack problems that might otherwise be “too hard” using traditional techniques.
Other numbers display overlap, making them difficult to discern. Notice that in the picture the ‘4’ and ‘9’ groups overlap a great deal.
This led me to expect that the samples had lots of 4’s with the loops almost closed and 9’s with them barely closed, which I thought was confirmed by this average of everything on the left side of the 4/9 blob.
However, after I split the samples by classes I suspected this was not so, and looking at a number of individual points let me to conclude that these pictures are more accurate representations of what is happening in that region:
The individual points are perfectly distinguishable – the problem is the Euclidean metric is too simple-minded.
There are, however, TDA methods for generating features, which indicate the presence of loops in noisy data and adding those features would disambiguate almost all of the points in the 4/9 blob that I examined.
This particular example illustrates an important point: when constructing consensus examples it is necessary to be careful – averaging isn’t necessarily the best approach even when it is feasible. For instance, if some clustering mechanism selected a circular or spherical group in a high-dimensional space, an averaging procedure on such a set could compute, as a consensus, a point which was ‘in the hole’ (and which might well be a point excluded by some external constraint).
Whether or not this is okay would depend on the application, but it is worth noting. It was precisely such a consideration which led us to build our consensus recommendations in Ayasdi Care by starting with a most typical sequence to which we added and removed events.
However, in the case of these images, averaging seems a perfectly good method, provided we choose good points to average. Consider the average of all the ‘0’ images again:
This is fairly fuzzy, and it’s easy to see why if we look at just the ‘0’ images:
There is fairly well-defined core, but there are quite a few outliers, such as this one:
What we would like to do is to eliminate the atypical points from the set.
The question becomes what is a TDA way to do so? Or, conversely, what is a TDA way to find the typical points in a set?
Here is a very simple one: define ecc(x,S) to be the sum over all s in S of distance(x,s). Intuitively, the closer x is to ‘the middle of S’, the more distances will be smaller. If we take s0 to be a point in S which minimizes ecc(x,S) for x in S, then we could think of that as a most typical point. In simple cases this corresponds to our intuition. For example, if S is a set of real numbers and distance(x,y) = |x-y|, then a median point of S will minimize ecc(x,S). (Note that this is far from the only possible such function.)
For this use we could simply take the image of a ‘most typical’ as the consensus,
but this example is a bit ‘rough’, and certainly in the more complicated case of Ayasdi Care, taking a consensus from one sample leaves out far too much information. What we can do here is take the 20% ‘most typical’ points, which we can call the ‘20% core’, and average those. Doing so gives us this image:
It is recognizably like our most typical point, but smoother, and we can have more faith in it since it was created from 101 samples. And as a further sanity check, we can look at the set of points we averaged in the picture of the space:
Any two-dimensional representation of a 784-dimensional space is bound to be imperfect, but it is reassuring to see that the core, which was computed using only the metric and without reference to the picture, lies in the middle of the biggest blob of points.
What we have done is describe a process for consensus construction for images of handwritten digits. But note that, except for the method for merging a core set into a consensus, the method applies to any finite metric space.
This is another example of the malleability of TDA – we have come up with an intuitively appealing, principled method for consensus construction that is quite general. And this is just what we used in Ayasdi Care, with a suitable method for finding a consensus in a core set.
One thing that we have not dealt with is how we might recover consistent variations – that is, how to detect when there is more than one consensus. It can also be done in a natural way, but we’ll leave that to a subsequent blog post.