Diffusion Map for Manifold Learning, Theory and Implementation
This article aims to introduce one of the manifold learning techniques called Diffusion Map. This technique enables us to understand the underlying geometric structure of high dimensional data as well as to reduce the dimensions, if required, by neatly capturing the nonlinear relationships between the original dimensions.
By Rahul Raj, Indian Institute of Technology Kanpur
Introduction
‘Curse of dimensionality’ is a wellknown problem in Data Science, which often causes poor performance, inaccurate results, and, most importantly, a similarity measure breakdown. The primary cause of this is because high dimensional datasets are typically sparse, and often a lowerdimensional structure or ‘Manifold’ would embed this data. So there is a nonlinear relationship among the variables (or features or dimensions), which we need to learn to compute better similarity.
Manifold learning is an approach to nonlinear dimensionality reduction. The basis for algorithms in manifold learning is that the dimensionality of many data sets is only artificially high ^{1}. In this blog, we learn one of the many techniques in manifold learning called Diffusion Maps. The key idea is that Euclidean Distance, which is the most common measure of similarity, is meaningful only ‘locally.’ Therefore, assuming there is a lowerdimensional structure or manifold to the data, it would be appropriate to measure similarity over this structure rather than in the Euclidean space itself.
Let us begin exploring with the following example of 2D datapoints neatly arranged in S shape.
There is a definite shape to this dataset. Now, if we have to measure the similarity of two points in this set, we measure the Euclidean Distance. If this distance is small, then we say points are similar or if this is large, otherwise. The following figure represents this scenario.
However, knowing the geometric structure, we know that this similarity measure is inaccurate. Since there is a nonlinear relationship between ‘x’ and ‘y’ coordinates, it would be correct if we measure the similarity (or distance) over the very geometric structure itself, as shown in the Figure below.
With Diffusion Map, we can do a nonlinear dimensionality reduction as well as learn the underlying geometry of the high dimensional data. Let’s get straight to the theory and implementation, hand in hand
What is the takeaway?
This blog aims to introduce one of the manifold learning techniques called Diffusion Map^{2}. This technique enables us to understand the underlying geometric structure of high dimensional dataset as well as to reduce the dimensions, if required, by neatly capturing the nonlinear relationships between the original dimensions.
Theory behind (very briefly)
The core idea is a timedependent diffusion process, which is nothing but a random walk on the dataset where each hop has a probability associated with it. When the diffusion process runs for a time t, we get different probabilities of various paths it can take to calculate the distance over the underlying geometric structure. Mathematically, we call this the steadystate probability of the Markov Chain.
The connectivity between two data points, x, and y, is defined as the probability of jumping from x to y in one step of the random walk and is
However, it is useful to express the connectivity as a rownormalized likelihood function, K using Gaussian Kernel. This will be called diffusion kernel
Now we define a rownormalized diffusion matrix, P. Mathematically, this is equivalent to the transition matrix in the Markov chain. While P denotes the probability (or connectivity in this case) of single hopping from point x to point y, P² denotes the probability of reaching y from x in two hops and so on. As we increase the number of hops or Pᵗ for increasing values of t we observe that the diffusion process runs forward. Or in other words, the probability of following the geometric structure increases.
If you would like to dig deeper into the theory behind, the paper has explained it pretty neatly. For a quick overview, this Wikipedia article also explains the theory well.
Implementation
To demonstrate the algorithm, we begin with a dataset having a definite geometric structure. Let us begin by creating a 2D figure, as shown earlier. Our main aim is to find out whether the diffusion map unravels the underlying geometric structure of data or not.
Now, let us add a synthetic 3rd dimension (data drawn from a uniform distribution) to this 2D dataset. Thus our new 3D dataset is as follows
As you might have noticed, the 3D dataset preserves the ‘S’ shape in one angle, and that is correct. Now our goal is to flatten the dimensions to 2 while preserving this shape. After applying the Diffusion Map, hopefully, we should see the ‘S’ like structure in 2D.
Output
Result
As it is evident from the above output figures, with a diffusion map applied to reduce dimension from 3 to 2, we get to understand somewhat the original geometric structure, which is the ‘S’ shape. With varying values for alpha, we get slightly different variations in the final structure. (alpha is the parameter in diffusion kernel described above).
Source Code
Source code is available here and is opensourced under MIT License. You can try out the code in Google Colaboratory with the link provided on top of the gist.
References:
 Manifold Learning methods in ScikitLearn (https://scikitlearn.org/stable/modules/manifold.html)
 Porte, Herbst, Hereman, Walt, An introduction to Diffusion Maps
Bio: Rahul Raj (@rahulrajpl) is currently pursuing Masters in Computer Science and Engineering at Indian Institute of Technology Kanpur, India. His interests are in machine learning, data science and its applications in cybersecurity. More about the Author at: https://randomwalk.in
Original. Reposted with permission.
Related:
 4 Tips for Advanced Feature Engineering and Preprocessing
 7 Steps to Mastering Intermediate Machine Learning with Python — 2019 Edition
 What Is Dimension Reduction In Data Science?
Top Stories Past 30 Days

