Interview: Anthony Bak, Ayasdi on Managing Data Complexity through Topology
We discuss the definition of Topology and its relevance to Big Data, and compare Topological Data Analysis (TDA) with other approaches.
His PhD is from the University of Pennsylvania on the connections between algebraic geometry and string theory. Along the way he co-founded a data analytics company working on political campaigns, worked on quantum circuitry research, and studied chaotic phenomena in sand boxes. His friends say that his best idea was to found a college-funded cooking club in order to eat food he couldn't afford otherwise.
Here is my interview with him:
Anmol Rajpurohit: Q1. How do you define Topology? How is it relevant for Big Data?
AB: Topology adds the ability to understand and describe the shape of your data without imposing additional model information, which can be biased and misleading. This leads to a number of concrete benefits, such as predictive model improvement and a better understanding of your data.
This seems like a trivial point, but it can be key to solving complex problems with a high degree of accuracy. A simple example of this comes from hospital predictive models. Hospitals want to measure how sick people are, so they collect a variety of clinical information (blood pressure, heart rate, temperature, breathing rate, oxygen levels, etc.) or genetic information (gene expression levels). Typically, they fit a linear regression model that predicts how "sick" patients are. The underlying assumption is that there is a near-linear relationship between symptoms and "sickness".
One of Ayasdi's academic collaborators took gene expression data for people at different stages of malaria.
Most real-world data sets that I look at are larger and more complicated than this example, and we find a variety of structures (clusters, flares, loops and higher-dimensional structures) all appearing in a single data set. It is nearly impossible to guess or hypothesize the right structures ahead of time, and TDA is a tool to understand your data in an unbiased way, revealing its true complexity.
AB: Complexity means that it's too hard to hypothesize what the relationships and structures in your data are. Topological Summaries provide a way to understand and then exploit structure without having to first guess.
AR: Q3. What are the unique benefits of Topological Data Analysis (TDA) over other approaches?
AB: There are lots of ways to answer this question but I'll just focus on one unique benefit: coordinate invariance. The things that we care about do not depend on the coordinate system chosen to describe the problem. For example, the boiling point of water does not depend on whether I describe it in Celsius or Fahrenheit. In a similar way, your location can be described by an address, or with lat/long coordinates. The essentials of where you are and your relationship to the rest of the planet do not depend on these coordinate choices.
And yet, for most statistical techniques the details of the coordinate systems matter. Even a rigid rotation of your data in some high dimensional Euclidean space can confound a statistical model.
TDA is a more robust way to handle these kinds of choices because of its foundation in Topology. Concretely, this means that there are more ways to get to the same answer, and your feature selection/engineering is less important than when using other methods.
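To make the point about rigid rotations concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not Ayasdi's implementation: the toy data set, the random rotation and the helper `pairwise_distances` are all assumptions made up for this example. It shows that rotating the data leaves every pairwise distance unchanged, while a coordinate-wise summary (the variance of each column) changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 5 dimensions, stretched along the first axis.
X = rng.normal(size=(200, 5)) * np.array([5.0, 1.0, 1.0, 1.0, 1.0])

# Build a random rigid rotation: QR gives an orthogonal matrix,
# and flipping one column if needed forces determinant +1.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]
X_rot = X @ Q


def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Quantities that depend only on distances do not see the rotation ...
distances_match = np.allclose(pairwise_distances(X), pairwise_distances(X_rot))

# ... but per-coordinate statistics, which many models consume directly, do.
variances_match = np.allclose(X.var(axis=0), X_rot.var(axis=0))

print("pairwise distances unchanged:", distances_match)
print("per-coordinate variances unchanged:", variances_match)
```

Any method whose input is only the distances between points gives the same answer before and after the rotation, whereas a method that treats each column as a separate feature is looking at different numbers.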