5 Lesser-Known Data Transformation Techniques for Better Analysis

Utilize these transformation techniques in your data workflow.

By Cornellius Yudha Wijaya, KDnuggets Technical Content Specialist on October 22, 2024 in Data Science

Image by Author | Ideogram

Data transformation is the process of converting data values into new values through a calculation or method so that they better represent the underlying information. It’s often used to meet the assumptions of a statistical test or to make a data visualization clearer. There are many formulas for data transformation, but they are not interchangeable, and not every one will satisfy your requirements.

Some popular data transformations, such as the Normal and Logarithmic transformations, dominate because they are easy to interpret and achieve their purpose without sacrificing much information. However, there are many lesser-known transformations that you should know.

This article will explore five different data transformations that should improve your analysis. What are they? Let’s get into it.

1. Box-Cox Transformation

The Box-Cox transformation is a power transformation designed to make the data follow the normal distribution more closely, with its shape controlled by the λ parameter. Because we can tune that parameter, it’s much more flexible than a simple log transformation.

The Box-Cox transformation is often used when our data must follow a normal distribution closely or when we want to stabilize the data variance. Changing the λ parameter changes the form of the transformation: λ equal to 1 leaves the shape of the data unchanged (the values are only shifted by 1), λ equal to 0 is the log transformation, and any other λ value applies a power transformation to the data.

In Python, we can implement the transformation with the code below.

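import numpy as np
from scipy.stats import boxcox

# Right-skewed, strictly positive sample (Box-Cox requires positive values)
data = np.random.exponential(scale=2, size=1000)

# Apply Box-Cox with a fixed λ of 0.5
transformed_data = boxcox(data, lmbda=0.5)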

Try out various λ values to see which one suits your analysis.
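Rather than picking λ by hand, you can also let SciPy estimate it. The sketch below assumes a fixed random seed (chosen only for reproducibility) and relies on the fact that `boxcox` returns both the transformed data and the fitted λ when `lmbda` is omitted. It also checks the λ = 0 case against a plain log transformation.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(42)  # seed is an arbitrary choice for reproducibility
data = rng.exponential(scale=2, size=1000)

# With lmbda omitted, boxcox returns the transformed data together with
# the λ that maximizes the log-likelihood
auto_transformed, fitted_lambda = boxcox(data)

# λ = 0 reproduces the natural log transformation
log_transformed = boxcox(data, lmbda=0)

print(f"fitted λ: {fitted_lambda:.3f}")
print(f"skewness before: {skew(data):.3f}, after: {skew(auto_transformed):.3f}")
print("λ = 0 matches np.log:", np.allclose(log_transformed, np.log(data)))
```

Comparing the skewness before and after is a quick way to judge whether the fitted λ moved the data closer to a symmetric shape.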

2. Yeo-Johnson Transformation

The Box-Cox transformation is a great data transformation technique because we can control the amount of transformation, but it has one weakness: it’s only applicable to positive values. The Yeo-Johnson transformation was developed from the Box-Cox transformation to handle zero and negative values as well.

Like the Box-Cox transformation, Yeo-Johnson is controlled by the λ parameter, which you can adjust to your requirements. It’s also useful for improving data normality and homoscedasticity when you need to meet linear model assumptions.

You can apply the transformation with the following code.

import numpy as np
from scipy.stats import yeojohnson

# Sample centered on zero, so it contains negative values
data = np.random.normal(loc=0, scale=2, size=1000)
# Apply Yeo-Johnson with a fixed λ of 0.5
transformed_data = yeojohnson(data, lmbda=0.5)
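To make the difference from Box-Cox concrete, here is a minimal sketch on skewed data that includes negative values. The shifted exponential sample and the seed are assumptions made for illustration. As with `boxcox`, calling `yeojohnson` without `lmbda` returns the transformed data and the fitted λ.

```python
import numpy as np
from scipy.stats import boxcox, yeojohnson, skew

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility
# Right-skewed sample shifted down by 1 so that it includes negative values
data = rng.exponential(scale=2, size=1000) - 1.0

# Box-Cox rejects data that is not strictly positive
try:
    boxcox(data)
    boxcox_failed = False
except ValueError as err:
    boxcox_failed = True
    print("Box-Cox failed:", err)

# Yeo-Johnson accepts the same data and estimates λ by maximum likelihood
transformed, fitted_lambda = yeojohnson(data)

print(f"fitted λ: {fitted_lambda:.3f}")
print(f"skewness before: {skew(data):.3f}, after: {skew(transformed):.3f}")
```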

3. Rank Transformation

Rank transformation is a non-parametric method that replaces each data point with its rank in the sorted data. For example, the smallest data point is transformed into 1, the next smallest into 2, and so on. It’s usually used when the value is less important than its order (rank).

The Rank transformation is useful when our data has many outliers or when the data scale can be ignored. It reduces the influence of outlier values, whereas a popular transformation such as the Normal transformation is still affected by them. Rank-transformed data is also often fed into a parametric statistical test; Spearman’s correlation, for example, is Pearson’s correlation computed on ranks.

We can perform the Rank transformation in Python with the following code:

from scipy.stats import rankdata
import numpy as np

data = np.random.normal(loc=0, scale=2, size=1000)

# Ranks start at 1; tied values receive the average of their ranks by default
ranked_data = rankdata(data)
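The following sketch illustrates two points from this section on small, made-up data: how `rankdata` handles ties, and how ranking limits the effect of a single extreme outlier on a correlation. The sample values, the injected outlier, and the seed are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr, spearmanr

# The outlier 1000 simply becomes the highest rank
values = np.array([10, 20, 20, 1000])
average_ranks = rankdata(values)            # ranks 1, 2.5, 2.5, 4 (ties averaged)
min_ranks = rankdata(values, method="min")  # ranks 1, 2, 2, 4 (ties share the lowest rank)

# Two related variables, with one extreme outlier injected into x
rng = np.random.default_rng(1)  # seed is an arbitrary choice for reproducibility
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)
x[0] = 1000.0

pearson_raw = pearsonr(x, y)[0]
pearson_on_ranks = pearsonr(rankdata(x), rankdata(y))[0]
spearman = spearmanr(x, y)[0]

print(f"Pearson on raw values: {pearson_raw:.3f}")
print(f"Pearson on ranks:      {pearson_on_ranks:.3f}")
print(f"Spearman:              {spearman:.3f}")
```

The last two numbers agree because Spearman's correlation is defined as Pearson's correlation on the ranks.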

4. Reciprocal Transformation

Reciprocal transformation is a data transformation technique that replaces each data value with its reciprocal (1/x), where x is the original data value. It’s useful when you're dealing with a skewed data distribution and most of your data values are large. The reciprocal transformation minimizes the impact of the large values so that the dataset is better suited to follow-up methods.

The transformation is also good for situations where the data contain decreasing relationships, as it could help represent the data more linearly. However, we must remember that reciprocal transformation is not suitable if the data contains zeros, whose reciprocal is undefined, or negative values, as the transformation would not represent the information correctly.

For the Python code implementation, you can use the following code:

import numpy as np

# Shift by 1 so every value is at least 1 and safely away from zero
data = np.random.exponential(scale=2, size=1000) + 1
reciprocal_transformed_data = np.reciprocal(data)
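Two practical details are worth a quick sketch. `np.reciprocal` is not designed for integer arrays, where it truncates the results, and zeros need to be handled before transforming. The `safe_reciprocal` helper below is a hypothetical function written for this illustration, not part of NumPy.

```python
import numpy as np

def safe_reciprocal(values):
    """Hypothetical helper: return 1/x as floats and reject zeros."""
    arr = np.asarray(values, dtype=float)
    if np.any(arr == 0):
        raise ValueError("reciprocal is undefined for zero values")
    return 1.0 / arr

data = np.array([1, 2, 4, 100])  # integer dtype

# On integer input, np.reciprocal truncates: 1, 0, 0, 0
truncated = np.reciprocal(data)

# Casting to float first gives the intended values: 1.0, 0.5, 0.25, 0.01
transformed = safe_reciprocal(data)

print(truncated)
print(transformed)
```

Note that the order of positive values is reversed: the largest original value (100) becomes the smallest transformed value (0.01).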

5. Binning Transformation (Discretization)

Binning Transformation, or Discretization, is a data transformation technique that divides continuous data values into intervals (bins) and replaces each value with its bin label. It turns the data into ordered (ordinal) categories, which simplifies the data and reduces noise.

The transformation is especially useful for techniques that could benefit from categorical input, such as a decision tree. It’s also useful for handling data outliers and minimizing their impact. However, you must choose the binning interval carefully, as it affects the result of the transformation. There are many rules of thumb you can try to follow; for example, Sturges’ Rule says that the number of bins is equal to log2(N) + 1, rounded up, where N is the number of data points.

For the Binning Transformation using Sturges’ Rule, you can use the following code in Python:

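import numpy as np
import pandas as pd

data = np.random.normal(loc=0, scale=1, size=1000)

# Sturges' Rule: number of bins = ceil(log2(N) + 1)
num_bins = int(np.ceil(np.log2(len(data)) + 1))
# Equal-width bins; labels=False returns the integer bin index of each value
binned_data = pd.cut(data, bins=num_bins, labels=False)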
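Because the choice of interval affects the result, it can help to compare two binning strategies side by side. This sketch, which assumes a fixed seed for reproducibility, contrasts equal-width bins from `pd.cut` with equal-frequency bins from `pd.qcut`, both using the Sturges bin count.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # seed is an arbitrary choice for reproducibility
data = rng.normal(loc=0, scale=1, size=1000)

# Sturges' Rule: ceil(log2(1000) + 1) = 11 bins
num_bins = int(np.ceil(np.log2(len(data)) + 1))

# Equal-width bins: every interval spans the same range of values
equal_width, width_edges = pd.cut(data, bins=num_bins, labels=False, retbins=True)

# Equal-frequency bins: every interval holds roughly the same number of points
equal_freq, freq_edges = pd.qcut(data, q=num_bins, labels=False, retbins=True)

width_counts = pd.Series(equal_width).value_counts().sort_index()
freq_counts = pd.Series(equal_freq).value_counts().sort_index()

print("equal-width counts:    ", width_counts.tolist())
print("equal-frequency counts:", freq_counts.tolist())
```

For bell-shaped data, equal-width bins crowd most points into the central bins, while equal-frequency bins spread them evenly at the cost of uneven interval widths.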

Conclusion

Data transformation is a data preprocessing technique that converts the original data into new values through certain calculations. It’s useful in many situations, whether to make the data follow a particular distribution or to gain insight more intuitively. Many useful data transformations remain lesser-known, so this article explored five transformation techniques that you should know.

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.