Interview: Anthony Bak, Ayasdi on Managing Data Complexity through Topology
Tags: Anthony Bak, Ayasdi, Data Analysis, Data Management, Predictive Modeling, Statistical Analysis, Topological Data Analysis, Topology
We discuss the definition of Topology and its relevance to Big Data, and compare Topological Data Analysis (TDA) with other approaches.
Anthony Bak is a principal data scientist at Ayasdi, where he designs machine learning and analytic solutions to solve problems for Ayasdi customers. Prior to Ayasdi, he was a postdoc with Ayasdi cofounder Gunnar Carlsson in the Stanford University Mathematics Department. He has held academic positions at the Max Planck Institute for Mathematics, Mount Holyoke College and the American Institute of Mathematics.
His PhD, from the University of Pennsylvania, is on the connections between algebraic geometry and string theory. Along the way he cofounded a data analytics company working on political campaigns, worked on quantum circuitry research, and studied chaotic phenomena in sandboxes. His friends say that his best idea was to found a college-funded cooking club in order to eat food he couldn't afford otherwise.
Here is my interview with him:
Anmol Rajpurohit: Q1. How do you define Topology? How is it relevant for Big Data?
Anthony Bak: Topology is the study and description of shape. In Big Data problems, shape arises because you have a notion of similarity or distance between data points. This can be something like Euclidean distance, correlations, a weighted graph distance or even something more esoteric. Shape is exploited in machine learning by bringing in some additional information such as "my data has well-defined clusters or classes"; "this outcome is linear"; or "my signal is periodic". Then you would use specialized tools to apply models based on this information.
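To make the "notion of similarity or distance" concrete, here is a minimal sketch in Python. It is not taken from the interview or from Ayasdi's software; the function names and the three toy vectors are assumptions chosen purely for illustration. It shows that two of the distances mentioned above can disagree about which points are close, which is why the choice of distance determines the shape you see.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 minus the Pearson correlation: 0 for identical trends, 2 for opposite ones."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

# Hypothetical toy data points (not from any real data set).
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]  # same trend as a, very different scale
c = [4.0, 3.0, 2.0, 1.0]      # same scale as a, opposite trend

# Euclidean distance says c is the nearer neighbour of a...
print(euclidean_distance(a, b), euclidean_distance(a, c))
# ...while correlation distance says b is.
print(correlation_distance(a, b), correlation_distance(a, c))
```

Under Euclidean distance `c` is the closer neighbour of `a`; under correlation distance `b` is. Neither is wrong; they simply encode different ideas of similarity.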
Topology adds the ability to understand and describe the shape without imposing additional model information, which can be biased and misleading. This leads to a number of concrete benefits, such as predictive model improvement and a better understanding of your data.
This seems like a trivial point, but it can be key to solving complex problems with a high degree of accuracy. A simple example of this comes from hospital predictive models. Hospitals want to measure how sick people are and collect a variety of clinical information (blood pressure, heart rate, temperature, breathing rate, oxygen levels, etc.) or genetic information (gene expression levels). Typically, they fit a linear regression model that predicts how "sick" patients are. The underlying assumption is that there is a near-linear relationship between symptoms and "sickness".
One of Ayasdi's academic collaborators took gene expression data for people at different stages of malaria. When he examined it using TDA, he found the patients all lying on a circle sitting inside a high-dimensional space (~1000 features). In retrospect the circle is obvious: your path from being healthy to being sick and back to healthy does not track up and down through the same set of symptoms. And yet nobody had thought to look for the circle.
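The following toy sketch illustrates why a single linear "sickness" scale can miss a loop like this. It uses synthetic points, not the malaria data, and everything in it (the `patient` helper, the four phases, the choice of distance from the healthy state as a score) is an assumption made for illustration. Points are placed on a circle embedded in a 1000-dimensional space, and a one-number score is compared with the actual distance between two states.

```python
import math
import random

DIM = 1000  # matches the "~1000 features" order of magnitude; the data is synthetic

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two random orthonormal directions (Gram-Schmidt) span the plane of the circle.
random.seed(0)
u = unit([random.gauss(0, 1) for _ in range(DIM)])
w = [random.gauss(0, 1) for _ in range(DIM)]
overlap = dot(w, u)
v = unit([wi - overlap * ui for wi, ui in zip(w, u)])

def patient(phase):
    """Hypothetical patient state at a given phase of the illness cycle."""
    return [math.cos(phase) * ui + math.sin(phase) * vi for ui, vi in zip(u, v)]

healthy = patient(0.0)
worsening = patient(math.pi / 2)       # a quarter of the way round the loop
recovering = patient(3 * math.pi / 2)  # three quarters of the way round

# A single-number score: how far the patient is from the healthy state.
score_worsening = dist(healthy, worsening)
score_recovering = dist(healthy, recovering)

# The two states get the same score, yet they sit on opposite sides of the circle.
print(score_worsening, score_recovering, dist(worsening, recovering))
```

In this toy, the worsening and recovering patients receive an identical score even though they are as far apart as two points on the circle can be. A one-dimensional scale collapses that difference; the loop keeps it.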
Most real-world data sets that I look at are larger and more complicated than this example, and we find a variety of structures (clusters, flares, loops and higher-dimensional structures), all appearing in a single data set. It is nearly impossible to guess or hypothesize the right structures ahead of time, and TDA is a tool to understand your data in an unbiased way, revealing its true complexity.
AR: Q2. How do Topological Summaries help in dealing with the increasing complexity of data?
AB: Complexity means that it's too hard to hypothesize what the relationships and structures in your data are. Topological Summaries provide a way to understand and then exploit structure without having to first guess.
AR: Q3. What are the unique benefits of Topological Data Analysis (TDA) over other approaches?
AB: There are lots of ways to answer this question but I'll just focus on one unique benefit: coordinate invariance. The things that we care about do not depend on the coordinate system chosen to describe the problem. For example, the boiling point of water does not depend on whether I describe it in Celsius or Fahrenheit. In a similar way, your location can be described by an address, or with lat/long coordinates. The essentials of where you are and your relationship to the rest of the planet do not depend on these coordinate choices.
And yet, for most statistical techniques the details of the coordinate system matter. Even a rigid rotation of your data in some high-dimensional Euclidean space can confound a statistical model.
TDA is a more robust way to handle these kinds of choices because of its foundation in Topology. Concretely, this means that there are more ways to get to the same answer, and your feature selection/engineering is less important than when using other methods.
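The point about rigid rotations can be sketched with a small, assumed example. The four 2-D points, the labels, and the threshold rule `x > 0` below are all hypothetical and stand in for a coordinate-dependent model; they are not a description of any specific statistical technique or of Ayasdi's methods. The sketch checks that pairwise distances survive a rotation while the coordinate-based rule does not.

```python
import math

def rotate(point, angle):
    """Rigidly rotate a 2-D point counter-clockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def pairwise_distances(points):
    """All distances between distinct pairs, in a fixed order."""
    return [
        math.hypot(p[0] - q[0], p[1] - q[1])
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]

def accuracy(points, labels):
    """Accuracy of a hypothetical coordinate-based rule: predict 'A' when x > 0."""
    predictions = ["A" if x > 0 else "B" for x, _ in points]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Toy data: two classes separated along the first coordinate.
points = [(2.0, 1.0), (2.0, -1.0), (-2.0, 1.0), (-2.0, -1.0)]
labels = ["A", "A", "B", "B"]

rotated = [rotate(p, math.pi / 2) for p in points]

print(accuracy(points, labels))   # rule fits the original coordinates
print(accuracy(rotated, labels))  # same rule, rotated data
print(pairwise_distances(points))
print(pairwise_distances(rotated))
```

The rule that separates the classes perfectly in the original coordinates drops to chance after a quarter turn, while every pairwise distance is unchanged up to floating-point error. Anything built only on those distances sees the same data before and after the rotation.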
The second part of interview