Topological Data Analysis – Open Source Implementations

Topological Data Analysis (TDA) is making waves in the analytics community lately, but are there open source options available?



By Matthew Mayo.

Topological Data Analysis (TDA) is an area of applied mathematics currently garnering all sorts of attention in the world of analytics. It employs modern mathematical concepts such as functoriality, and possesses such desirable properties as coordinate-freeness and robustness to noise. TDA is able to make some strong claims as to its practical uses; it is, however, one of the most mathematically rigorous areas of statistical analysis. This post does a nice job of introducing TDA to a machine learning audience.



Before turning our attention to available open source TDA tools, let's have a look at TDA's current driving force in industry. Ayasdi, founded in 2008 by Gunnar Carlsson, Gurjeet Singh, and Harlan Sexton after a dozen years of research at Stanford, is the main commercial player in TDA. But while Ayasdi is arguably the main actor in TDA today, it is hardly the only one; a number of active open source projects exist.

Interested in learning more about TDA? This is a video of Ayasdi co-founder and long-time TDA researcher Gunnar Carlsson very succinctly explaining TDA. This GitHub repo contains a curated list of resources for learning about TDA, including both gentle non-mathematical introductions and more rigorous mathematical treatments. Ohio State University offers the course Computational Topology and Data Analysis, and many of the course's notes and resources are available on its site.

Attesting to the growing awareness of TDA, a series of interviews with Ayasdi engineer Anthony Bak was featured on KDnuggets earlier this year (part 1, part 2, part 3), along with the more recent 6 crazy things Deep Learning and Topological Data Analysis can do with your data. One of the links further above is also a recent featured post. Given this trend, it seems highly probable that the future will see more KDnuggets posts on Topological Data Analysis.

Open Source TDA Tools

We now turn our attention to open source TDA projects. While TDA is outside of the current mainstream and still in its early stages of adoption, it is important to note that it is far from a proprietary technology. Ayasdi may be the most noticeable player in the field, but a number of open source implementations of core TDA components exist as well. Ayasdi and its engineers have even contributed to some of these projects.

The following is a list of several open source TDA projects, with brief descriptions from the projects' sources.

Python Mapper

The Mapper algorithm is a method for topological data analysis invented by Gurjeet Singh, Facundo Mémoli and Gunnar Carlsson. See the Reference [R1] for the publication. While the Mapper algorithm alone does not constitute a complete data analysis tool itself, it is the key part of a processing chain with (minimally) filter functions, the Mapper algorithm itself and visualization of the results.
Python Mapper is a realization of this toolchain, written by Daniel Müllner and Aravindakshan Babu. It is open source software and is released under the GNU GPLv3 license.

Proof of Concept Mapper by @mlwave for Digit Recognition (Python)

Description: 1) MinMaxScaler on the train set. 2) t-SNE on first 5k images from train set to 2 components. 3) Create overlapping intervals on first 2 dimensions and cluster points inside this overlap. 4) The clusters then become nodes in a graph. 5) When different clusters have one or more non-unique members we draw an edge. 6) Size the nodes by the number of points in that cluster. 7) Color the nodes by the distance to min of first dimension. 8) Show the images for every cluster member inside a tooltip.
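
To make the Mapper steps above more concrete, here is a minimal sketch in plain Python. It is not @mlwave's code and not the Python Mapper API: the t-SNE projection is swapped for a simple one-dimensional filter (the x coordinate), the clustering is a basic distance-threshold single linkage, and the names make_cover, single_linkage, and mapper, along with all parameter values, are assumptions made for illustration.

```python
import math
from itertools import combinations


def make_cover(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal intervals, each widened so that
    neighbours overlap by `overlap` times the base interval length."""
    length = (hi - lo) / n_intervals
    pad = length * overlap / 2.0
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]


def single_linkage(points, indices, eps):
    """Cluster the given point indices: points closer than eps are linked,
    and each connected group of linked points becomes one cluster."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in combinations(indices, 2):
        if math.dist(points[a], points[b]) <= eps:
            parent[find(a)] = find(b)

    clusters = {}
    for i in indices:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def mapper(points, filter_values, n_intervals=4, overlap=0.5, eps=1.0):
    """Return (nodes, edges): nodes are sets of point indices, and an edge
    joins two nodes whenever they share at least one point."""
    lo, hi = min(filter_values), max(filter_values)
    nodes = []
    for a, b in make_cover(lo, hi, n_intervals, overlap):
        # Points whose filter value falls inside this (overlapping) interval.
        members = [i for i, f in enumerate(filter_values) if a <= f <= b]
        nodes.extend(single_linkage(points, members, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Toy data: 40 points evenly spaced on the unit circle.
points = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
          for k in range(40)]
filter_values = [p[0] for p in points]  # filter = x coordinate (an assumption)

nodes, edges = mapper(points, filter_values, n_intervals=4, overlap=0.5, eps=0.3)
sizes = [len(members) for members in nodes]  # node size = number of points
print(len(nodes), "nodes,", len(edges), "edges, sizes:", sizes)
```

With these settings the circle is summarized by 6 nodes joined by 6 edges in a single loop, which is the kind of shape information Mapper is meant to surface. Coloring the nodes and attaching tooltips, as in the description above, would be layered on top of the nodes and edges returned here.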

Dionysus (C++, with Python bindings)

Dionysus is a C++ library for computing persistent homology. It provides implementations of the following algorithms:
▪  Persistent homology computation
▪  Vineyards
▪  Persistent cohomology computation
▪  Zigzag persistent homology
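
For readers new to persistent homology, the simplest case can be sketched in a few lines of plain Python. The snippet below does not use Dionysus and does not reflect its API; it only tracks connected components (dimension 0) of a Vietoris-Rips filtration with a union-find structure, and the function name h0_persistence is an assumption made for illustration. Libraries such as Dionysus handle higher dimensions and far larger inputs.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for the connected components of a
    Vietoris-Rips filtration. Every point is born at scale 0; a component
    dies at the edge length that merges it into another component."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted((math.dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(n), 2))

    pairs = []
    for dist, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb            # two components merge: one dies here
            pairs.append((0.0, dist))
    pairs.append((0.0, math.inf))      # one component survives forever
    return pairs


# Two well-separated groups of points (toy data).
cloud = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0)]
diagram = h0_persistence(cloud)
print(diagram)
```

In the resulting diagram, three components die at scale 1 (points merging inside each group), one dies at scale 9 (the two groups merging), and one never dies. The long-lived bar at 9 is the "two clusters" signal; the short bars are the kind of small-scale detail that persistence lets you discount as noise.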

TDA: Statistical Tools for Topological Data Analysis (R)

Tools for the statistical analysis of persistent homology and for density clustering. For that, this package provides an R interface for the efficient algorithms of the C++ libraries GUDHI, Dionysus, and PHAT (see vignette).

TDAmapper: Topological Data Analysis using Mapper (R)

An R package for using discrete Morse theory to analyze a data set using the Mapper algorithm described in G. Singh, F. Memoli, G. Carlsson (2007).

JavaPlex

The JavaPlex library implements persistent homology and related techniques from computational and applied topology, in a library designed for ease of use, ease of access from Matlab and Java-based systems, and ease of extensions for further research projects and approaches. JavaPlex is mainly developed by the Computational Topology workgroup at Stanford University, and is based on previous similar packages from the same group.

CTL (C++)

This C++11 library provides a set of generic tools for:
▪  Generating point sets (coming soon)
▪  Building Neighborhood Graphs
▪  Building Cellular Complexes
▪  Computing [persistent] homology over finite fields
▪  Parallel algorithm(s) for homology

Kohonen (Python)

This module contains some basic implementations of Kohonen-style vector quantizers: Self-Organizing Map (SOM), Neural Gas, and Growing Neural Gas. Kohonen-style vector quantizers use some sort of explicitly specified topology to encourage good separation among prototype "neurons".
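
The role of the "explicitly specified topology" is easiest to see in a small example. The following is a minimal Self-Organizing Map sketch in plain Python, not the kohonen module's API: the prototypes sit on a one-dimensional chain, and the function name train_som, the linear decay schedules, and all default values are assumptions chosen for illustration.

```python
import math
import random


def train_som(data, n_units=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D chain of prototypes (a minimal Self-Organizing Map)."""
    rng = random.Random(seed)
    # Start each prototype at a randomly chosen data point.
    units = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood shrinks, floor 0.5
        for x in rng.sample(data, len(data)):    # visit the data in random order
            # Best-matching unit: the prototype closest to x in data space.
            bmu = min(range(n_units), key=lambda j: math.dist(units[j], x))
            for j in range(n_units):
                # Gaussian neighbourhood measured on the chain index,
                # so chain neighbours of the winner are dragged along too.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                units[j] = [u + lr * h * (xi - u) for u, xi in zip(units[j], x)]
    return units


# Toy data: an 11 x 11 grid of points covering the unit square.
data = [(x / 10, y / 10) for x in range(11) for y in range(11)]
prototypes = train_som(data)
for p in prototypes:
    print([round(c, 3) for c in p])
```

The key design choice is that the neighbourhood weight h depends on distance along the chain rather than distance in the data space. That is what ties the prototypes to the specified topology: units adjacent on the chain are pulled toward similar regions of the data.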

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.
