Interview: Anthony Bak, Ayasdi on Novel Insights using Topological Summaries

We discuss examples of Topological Data Analysis (TDA) revealing new insights, the recommended approach for creating Topological Summaries, manual versus automated approaches, and trends.

Anthony Bak is a principal data scientist at Ayasdi, where he designs machine learning and analytic solutions to solve problems for Ayasdi customers. Prior to Ayasdi he was a postdoc with Ayasdi co-founder Gunnar Carlsson in the Stanford University Mathematics Department. He's held academic positions at the Max-Planck Institute for Mathematics, Mount Holyoke College and the American Institute of Mathematics.

His PhD is from the University of Pennsylvania on the connections between algebraic geometry and string theory. Along the way he co-founded a data analytics company working on political campaigns, worked on quantum circuitry research, and studied chaotic phenomena in sand boxes. His friends say that his best idea was to found a college-funded cooking club in order to eat food he couldn't afford otherwise.

First part of interview

Here is the second part of my interview with him:

Anmol Rajpurohit: Q4. Can you share examples of TDA revealing some insights on the underlying structure of data that would have been otherwise missed?

Anthony Bak: One of the first examples of the power of TDA was the well-known NKI data set. This data set was built over a number of years and represents gene expression profiles and clinical traits for 272 breast cancer patients collected by the Netherlands Cancer Institute (NKI). The data set had been used by countless researchers to develop gene expression signatures to predict good vs. poor prognosis for patients with stage I or stage II breast cancer. This was believed to be one of the most well understood data sets in cancer.

Using TDA, researchers discovered a previously unidentified group of cancer survivors in a sub-population that had very high overall mortality rates. More importantly, TDA was able to identify "why" these patients were surviving, with broad implications for understanding the disease and for improving mortality rates. While this breakthrough discovery was notable in itself, the fact that TDA was able to make this discovery within hours, while researchers spent over a decade with more traditional tools, speaks to the ability of TDA to discover hidden patterns and subtle signals.

We see similar results with our customers all the time. For many of our data science problems we work with teams of data scientists and machine learning PhDs. They work on relatively few problems for an extended period of time (think, for example, of credit card transaction fraud at a major credit card issuer). In almost all cases we are able to give them a new insight into the structure of their data, as well as concrete model improvement. Sometimes they can reproduce our results using traditional methods, sometimes not. The key point is that TDA is the tool that told them where to go look.

AR: Q5. What approach do you recommend for generating Topological Summaries? How should one select which "lenses" to use? How does one know that no important "lens" has been missed out?

AB: Lenses are the tool we use to produce topological summaries of shapes. As a metaphor, consider a Shakespeare play. There are many different summaries one could produce: a plot summary, a summary of characters, locations in the play, language used, timeline, etc. While none completely captures the full story, by looking at multiple summaries together you gain a rich picture of the play. If I want to summarize another play there are certain summaries of plot, character, etc. that are universally useful, while others, e.g. names of fishing boats, are only useful for specific plays.

In a similar manner, certain lenses are universally informative. The summaries produced by a density estimator, or by a measure of centrality of a data point, are informative almost across the board, while others, such as the third moment of feature values for a data point, have more specialized application. That said, at Ayasdi we have a variety of ways to help end users choose good lenses, either through previous experiences with similar problems or through automation. In other words, we can tell you which lenses are producing interesting and useful results, leaving it up to the data scientist to interpret and "understand" what is being said.

The goal in producing summaries is to allow quick and efficient understanding of complex data. You don't worry about producing all possible summaries. From a certain perspective this is an absurd notion; instead you want to produce the minimal number to answer a question or solve a problem.
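The lenses named in this answer (a density estimator, a measure of centrality, the third moment of feature values) can each be read as a function that assigns one number to every data point. The sketch below is only an illustration of that idea in plain Python. The function names, the Gaussian kernel, the bandwidth, and the toy points are all assumptions made for the example, not Ayasdi's implementation.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def density_lens(points, bandwidth=1.0):
    """Unnormalized Gaussian kernel density at each point (assumed kernel and bandwidth)."""
    return [
        sum(math.exp(-euclidean(p, q) ** 2 / (2 * bandwidth ** 2)) for q in points)
        for p in points
    ]

def centrality_lens(points):
    """Mean distance from each point to all points; smaller values mean more central."""
    n = len(points)
    return [sum(euclidean(p, q) for q in points) / n for p in points]

def third_moment_lens(points):
    """Third central moment of the feature values within each point."""
    values = []
    for p in points:
        mean = sum(p) / len(p)
        values.append(sum((x - mean) ** 3 for x in p) / len(p))
    return values

# Hypothetical toy data: three nearby points and one far-away point.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]

density = density_lens(points)
centrality = centrality_lens(points)
third_moment = third_moment_lens(points)
```

On this toy data the far-away point receives a lower density value and a larger mean distance than the three clustered points, so each lens gives a different one-number summary of where a point sits in the data.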

AR: Q6. What part of TDA can be automated? What parts of the process need inputs from the data scientist or others?

AB: TDA is well suited for automation for a variety of reasons. The first is that beyond a notion of distance or similarity, no other assumptions are needed in order to use TDA, so there's a simple and uniform way to specify problems for the system. On the other end of the pipeline, regardless of which feature generation/selection method, metric, or lenses are used to create the summaries, the output is always the same data type: a network or simplicial complex. This means that you need a single tool set to examine and score the output from a wide variety of data analytics and machine learning methods.

For these reasons I describe TDA not as a particular method like a neural network, or a regression model, but as a system for doing data analytics, combining statistics, machine learning, geometry and domain knowledge into a single coherent analytic framework.
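To make the "distance plus lens in, network out" shape of the pipeline concrete, here is a minimal sketch of a Mapper-style construction in plain Python. It is a simplified illustration under stated assumptions, not Ayasdi's method: the function names, the overlapping-interval cover, the single-linkage clustering rule, and every parameter value are hypothetical choices made for the example.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Plain Euclidean distance; any other distance function could be passed instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster(indices, points, dist, eps):
    """Single-linkage clusters: connected components of 'closer than eps'."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if dist(points[i], points[j]) <= eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters

def topological_summary(points, lens, dist=euclidean,
                        n_intervals=4, overlap=0.5, eps=1.5):
    """Return (nodes, edges): nodes are clusters, edges join clusters sharing points."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0  # fall back to 1.0 if the lens is constant
    nodes = []
    for k in range(n_intervals):
        # Widen each interval on both sides so that neighbouring intervals overlap.
        a = lo + k * width - overlap * width
        b = lo + (k + 1) * width + overlap * width
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(cluster(members, points, dist, eps))
    edges = {(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]}
    return nodes, edges

# Hypothetical toy data: two separated clumps along the x-axis.
points = [(float(x), 0.0) for x in (0, 1, 2, 3, 10, 11, 12, 13)]
nodes, edges = topological_summary(points, lens=lambda p: p[0])
```

Whatever lens or distance is supplied, the return value has the same type: a list of nodes (sets of point indices) and a set of edges between them. For this toy input the summary has four nodes and two edges, forming two disconnected pieces, one per clump.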

AR: Q7. What key trends will drive the growth of the Big Data industry for the next 2-3 years and what factors will play a critical role in the success of Big Data projects?

AB: I think we're going to start seeing payoff for the storage and processing pipeline investments that companies have made. This means that advanced analytics and machine learning at scale will become ubiquitous across the economy and become "table stakes" in order to survive. The focus will shift from gathering, storing and processing data to using and understanding it in ever more complex ways throughout business.

The ability to handle complex data through automation and smart tools will be critical for the success of these efforts. The data sets are too large, there are too many problems and there are too few machine learning experts to make headway without both automation and advanced analytics.

Third part of the interview