The Beautiful Duality of Topological Data Analysis
Topological Data analysis is special, because its methods are both general and precise. Teams that use TDA in their work see the “art of the possible” more broadly and can attack problems that might otherwise be “too hard” using traditional techniques.
By Harlan Sexton, Ayasdi.
One of my favorite things about Topological Data Analysis (TDA) is how malleable it is, because its methods are both general and precise.
These properties might sound incompatible, but perhaps I can explain it better describing how I used to approach data analysis problems before I started working on TDA.
Naturally the first step (then and now) was to try to understand the data and what is wanted from it.
Next I would reflect on the methods I had previously used and hope one of them would fit well enough to try. If that didn’t work, I would try to make a parameterized mathematical model of the data that seemed plausible and see how to reliably estimate the parameters from the data. I would then check the data against the model predictions and if they were okay, study the model until I could see how to formulate and solve the problem, which would give me the answer I wanted.
In a lot fewer words, if the solutions I used before wouldn’t work, I would start over from scratch.
It is certainly a general process, but it is very time-consuming.
By contrast, the general TDA method for a new problem begins with understanding the data enough to come up with a metric on it. As a reminder, when we say metric what we mean is a notion of similarity or distance between rows of data. There are multiple ways to measure (or define) such similarity – Hamming is one, Euclidean distance is another.
Determining a good metric is a more constrained task than making a mathematical model of the data. Further, with Ayasdi’s suite of existing statistical tools and visualization support, it is also fairly straightforward to get some idea if the metric is a good one.
I contend choosing a metric for a data set is a universally applicable process: for any data set and any concrete problem on that data, if there is an operational solution to the problem there is, at least implicitly, a metric embodied in that solution. That is what I mean which I say a metric is ‘general’.
At the same time a metric is quite specific: it is a function on pairs of data points satisfying certain properties, and if that function is implemented according to an API used in our software, you can add that metric and all of the code in our system written to apply to ‘metric spaces’ (which is almost all of the code) will just work. That is what I mean when I say a metric is ‘precise’.
If, instead, you used a panel of experts to decide on the similarity of pairs of data points, that would not be something you could use with our code – it would be ‘imprecise’ in the sense I am using the word it here, though you might argue that it would be a process which could be generally applied.
There is another sense in which a metric is precise and that is mathematical precision. This type of precision means the idea can be reasoned about, and sufficiently to create books full of theorems on metric spaces (which are themselves generalizations of a wide range of concepts from many branches of mathematics). And I contend that this sort of precision means we can also reason about the abstraction of ‘metric’ as easily as we might reason about Euclidean distance, and come up with solutions to problems in a specific metric space which apply to ALL metric spaces.
This an illustration of what I mean when I say TDA combines generality and precision.
Once you have defined a metric, the rest of TDA applies instantly. This has broad implications, for, as we noted a moment ago, there are many ways to measure similarity and each of those becomes a candidate to be consumed by TDA. Which approach you choose (or the software chooses for you, if you enable that) does not otherwise alter the process.
These characteristics of TDA are as much philosophical as they are about our specific software implementation. The fact that any metric will work with the methodology can be difficult to get one’s head around.
While it may be challenging at first, thinking about data sets as finite metric spaces is a very powerful way of framing problems. Once it becomes familiar, it can guide the search for solutions to problems that might otherwise be intractable.
This is key.
Teams that use TDA in their work see the “art of the possible” more broadly and can attack problems that might otherwise be “too hard” using traditional techniques.
To see how this works, I will describe the process we used in developing the underlying machinery for Ayasdi Care while illustrating some of the points using a much simpler, open source data set. Ayasdi Care has won awards and accolades for its ability to rapidly solve a problem considered to be very complex.
The problem was how to help medical professionals generate the ‘best practices’ among their colleagues for a range of surgical procedures.
Even a relatively routine surgical procedure, such as a total knee replacement, involves a complex series of interactions of many highly trained people with the patient over days or weeks.
Even abstracting away many of the details (e.g., when going in to administer an antibiotic the nurse also tells the patient a joke – this is recorded as ‘administer antibiotic X, dose Y’ ), the records for a procedure can list hundreds of ‘events’ which could occur in a staggering number of possible sequences.
After consultation with the data experts and experimentation, we devised a metric to describe “event sequences” that led to confirmatory findings as well as new insights.
The process of constructing a metric on these sequences is outside the scope of this blog post, and the confidentiality of the patient records involved would make it difficult or impossible to illustrate the points I wish to make here, so instead I will described those using data from the MNIST data set (see https://en.wikipedia.org/wiki/MNIST_database).
The particular version of the data I use here was 5,000 samples chosen at random from the full data set, and the images were encoded as 28 by 28 pixel gray-scale images. It is possible to get decent resolution on the data set simply by using the Euclidean metric on the 784 integer coordinates of each image. Here is a picture of the space (a scatter-plot of the values of the two neighborhood lenses) where individual points are colored by the numeral from 0 to 9 the point represents:
The association between colors and classes is arbitrary, but the labels in the middle of the concentrations for each class should make it clear which is which. Taking the average of the images in each class we get recognizable numerals: