The Challenges of Creating Features for Machine Learning
What are the challenges of creating features for machine learning, and how can we mitigate them?
Image by Monsterkoi from Pixabay
When I decided to leave academia and re-train as a data scientist, I quickly found out that I had to learn R or Python, or well… both. That’s probably the first time I heard about Python. I never imagined that 3 years later I would be maintaining an increasingly popular open source Python library for feature engineering: Feature-engine.
In this article, I want to discuss the challenges of feature engineering and selection both from the technical and operational side, and then lay out how Feature-engine, an open source Python library, can help us mitigate those challenges. I will also highlight the advantages and shortcomings of Feature-engine in the context of other Python libraries. Let’s dive in.
What is Feature-engine?
Feature-engine is an open-source Python library for feature engineering and feature selection. It works like Scikit-learn, with methods fit() and transform() that learn parameters from the data and then use those parameters to transform the data.
With Feature-engine, we can carry out plenty of feature transformations, like imputing missing data, encoding categorical variables, discretizing numerical variables, and transforming variables with math functions like logarithm, square root, and exponential. We can also combine existing features to create new ones, remove or censor outliers, and extract features from date and time.
Feature-engine has an entire module dedicated to feature selection, featuring feature selection techniques that have been developed in data science competitions or within organizations, and that are not available in Scikit-learn or any other Python library. I discussed some of these techniques in a previous post.
What makes Feature-engine unique?
There are a number of characteristics that make Feature-engine easy to use and a great option for feature engineering. First, Feature-engine contains the most exhaustive battery of feature engineering transformations. Second, Feature-engine can transform a specific group of variables in the dataframe, and we only need to indicate the variables to transform within the Feature-engine transformer, without any extra lines of code or helper classes. Third, Feature-engine can automatically recognize numerical, categorical, and datetime variables, allowing us to build feature engineering pipelines more easily. Fourth, Feature-engine takes in a dataframe and returns a dataframe, making it a suitable library both for data exploration and for deploying models to production.
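To give an idea of what recognizing variable types involves, here is a minimal sketch in plain pandas. It is not Feature-engine's implementation: the function name find_variables_by_type and the toy dataframe are made up for illustration, and anything that is neither numerical nor datetime is simply treated as categorical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Toy dataframe with one variable of each kind (all values are made up).
df = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "city": ["London", "Leeds", "York"],
        "signup": pd.to_datetime(["2021-01-05", "2021-03-17", "2021-07-30"]),
    }
)


def find_variables_by_type(data):
    """Hypothetical helper: group column names by broad variable type."""
    groups = {"numerical": [], "categorical": [], "datetime": []}
    for name in data.columns:
        if is_datetime64_any_dtype(data[name]):
            groups["datetime"].append(name)
        elif is_numeric_dtype(data[name]):
            groups["numerical"].append(name)
        else:
            # Assumption: everything else is handled as categorical.
            groups["categorical"].append(name)
    return groups


variable_types = find_variables_by_type(df)
print(variable_types)
```

A transformer that receives no explicit list of variables could fall back on a lookup like this one to decide which columns to modify.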
Feature-engine has other features that make it a great library, like comprehensive documentation with plenty of examples, easy-to-learn functionality, and compatibility with the Scikit-learn Pipeline, the Grid and the Random hyperparameter search classes.
Feature-engine is designed to help users carry out meaningful data transformations. Feature-engine alerts users when a transformation is not possible, e.g., when applying the logarithm to variables with negative values or dividing by 0, when NaN values are introduced by the transformation, or when the variables entered by the user are not of a suitable type. Thus, you don’t need to wait until the end of the process to find nasty surprises. You will know right away!
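The idea of failing early can be sketched without any dependencies. The class below is a hypothetical, simplified transformer written for this article, not Feature-engine code: it checks at fit() time that the logarithm can be applied to the selected variables and raises an error otherwise. Rows are represented as plain dictionaries to keep the example self-contained.

```python
import math


class CheckedLogTransformer:
    """Hypothetical transformer: validates at fit time, then applies the log."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, rows):
        # Fail early if a selected variable has values the log cannot handle.
        for var in self.variables:
            if any(row[var] <= 0 for row in rows):
                raise ValueError(
                    f"Variable '{var}' contains zero or negative values; "
                    "the logarithm is not defined for them."
                )
        return self

    def transform(self, rows):
        # Build new rows so that the input data is left untouched.
        return [
            {k: math.log(v) if k in self.variables else v for k, v in row.items()}
            for row in rows
        ]


data = [
    {"income": 1000.0, "balance": 50.0},
    {"income": 2500.0, "balance": -20.0},
]

# 'income' is strictly positive, so the transformation goes through.
transformed = CheckedLogTransformer(variables=["income"]).fit(data).transform(data)

# 'balance' has a negative value, so fit() refuses to continue.
try:
    CheckedLogTransformer(variables=["balance"]).fit(data)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The point of the design is that the error appears when the transformer is fitted, not later as silent NaN values in the training data.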
Why is feature engineering challenging?
First, let’s define feature engineering. Feature engineering is the process of using domain knowledge to create or transform variables that are suitable to train machine learning models. It involves everything from filling in or removing missing values, to encoding categorical variables, transforming numerical variables, extracting features from dates, time, GPS coordinates, text, and images, and combining them in meaningful ways. With that said, what are the challenges of this process?
Data preprocessing and feature engineering are time consuming. First, we need to gather a dataset that is representative of the population we want to assess. Next, we carry out data exploration to understand the variables we have and their relationship to a target, if we have one. The worse the quality of the data, or the less we know about the data we are working with, the more time we need to spend exploring and preparing it. Finally, we need to work through the data to leave it in a state in which it can be consumed by a machine learning model.
Data preprocessing and feature engineering are repetitive. We perform the same transformations on various variables in the same dataset and across projects, i.e., datasets. And very likely, our colleagues are carrying out the same transformations on their datasets. In a nutshell, the entire team might be carrying out the same transformations, variable after variable, dataset after dataset. This leads to the next challenge: reproducibility. How do we ensure that different code, from different colleagues, returns the same results? How can we build on the code that we have already created to accelerate future projects?
Data preprocessing and feature engineering need to be reproducible. The code that we use, project after project, team member after team member, needs to attain the same outcome given the same data. So we need reproducibility across projects and across teams. But more importantly, we need reproducibility between the research and the production environments. What does that mean?
Once we create a machine learning model that consumes some preprocessed data, we need to put the model into production. Very often, that process involves re-creating the model and the entire set of feature engineering transformations in the production environment. We need to be able to obtain the same results, given the same data, with the model in the production environment and the model we have evaluated in the research environment. This process presents its own challenges. First, is the code in the research environment suitable for production? Or will we need to refactor the code significantly? Feature engineering methods, like machine learning models, can learn parameters from the data. How will we store those parameters for later use?
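To make the question of storing learned parameters concrete, here is a minimal, dependency-free sketch. The function names fit_medians and impute are made up for illustration, rows are plain dictionaries, and JSON stands in for whatever artifact store a team actually uses. In practice, it is also common to serialize the whole fitted transformer or pipeline instead of its individual parameters.

```python
import json
import statistics


def fit_medians(rows, variables):
    """Learn the median of each variable, ignoring missing values (None)."""
    return {
        var: statistics.median(row[var] for row in rows if row[var] is not None)
        for var in variables
    }


def impute(rows, medians):
    """Replace missing values with the stored medians."""
    return [
        {k: medians[k] if v is None and k in medians else v for k, v in row.items()}
        for row in rows
    ]


# Research environment: learn the parameters from the training data.
train = [
    {"age": 20, "income": 100},
    {"age": None, "income": 300},
    {"age": 40, "income": 200},
]
learned = fit_medians(train, variables=["age", "income"])
serialized = json.dumps(learned)  # stand-in for writing to a file or store

# Production environment: reload the parameters and apply them to new data.
restored = json.loads(serialized)
live = [{"age": None, "income": None}]
scored = impute(live, restored)
print(scored)
```

Because production only reloads what was learned in research, both environments impute with exactly the same values.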
Feature engineering also poses non-technical challenges, like interpretability. Can we understand the features after the transformation? Very often, the users of the models need to understand why the model made a certain decision. For example, fraud investigators often want to know why a fraud model thinks that a certain application or customer might be fraudulent. This means that we need to be able to interpret the predictions at an observation level, that is, for every customer or every application. To interpret a prediction, we need to use models that we can interpret, and also features that we can interpret. This will impact the feature transformations that we select. For example, we could combine features among themselves in some sort of polynomial order transformation. But these transformations generate features that are not understandable to us. What does "age squared multiplied by income cubed" mean? Similarly, there are ways of encoding categorical variables that return features that can’t be interpreted by people. So if we need to interpret our models, we need to feed them interpretable features.
Finally, machine learning models need to be fair. And for a model to be fair, it needs to make decisions based on fair, non-discriminatory features. An example of a discriminatory feature would be gender, and in some cases, age. We also need to train models with representative datasets. If our models will be used in Britain, we should probably train our models based on data from the British population.
How does Feature-engine help tackle these challenges?
Feature-engine was designed to reduce time spent on data transformation by avoiding writing repetitive code, while ensuring reproducibility and returning interpretable features. With Feature-engine, data scientists do not need to code every transformation from scratch. They can implement each transformation on several variables at a time, in 1 to 3 lines of code.
Feature-engine is version-controlled, which ensures reproducibility between the research and production environment. Feature-engine’s transformers are thoroughly tested, ensuring they return the expected transformation. By using Feature-engine’s transformers across environments, projects and team members, we ensure that everybody obtains the same results from the same transformations and data, thus, maximising reproducibility.
Feature-engine has been focused on creating variables that are understandable by people. Thus, whenever a model makes a decision, we can understand how that variable contributed to that decision.
Are there not enough feature engineering libraries already?
Certainly, Feature-engine is not the first in the game. There are other great Python libraries for feature transformation, like Pandas, Scikit-learn, Category encoders, Featuretools, tsfresh and more. The Python ecosystem is growing year on year.
Open source Python libraries for feature engineering and selection.
Category encoders, Featuretools and tsfresh tackle very specific problems. The first one is designed to encode categorical variables, while Featuretools and tsfresh work with time series mostly for classification problems.
Pandas is a great library to combine data transformation with exploratory data analysis. And this is probably why it is the tool of choice. But Pandas does not allow us to store parameters learned from data out-of-the-box. Feature-engine works on top of Pandas to add this functionality.
Scikit-learn, on the other hand, allows us to apply a wide array of data transformations and it is able to store parameters in some cases. With Scikit-learn for example, we can apply commonly used data imputation and categorical encoding methods to learn and store parameters. And with the aid of the FunctionTransformer(), we can also, in principle, apply any transformation we want to the variables. Originally, Scikit-learn transformers were designed to transform the entire dataset. This means that if we want to transform only a subset of data, we need to split the data manually or do some sort of workaround. Lately, they have been adding more and more functionality to make this possible. Scikit-learn transformers will also return a Numpy array, optimal for training machine learning models but not great for continuing with data exploration.
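As an illustration of the Scikit-learn route, the sketch below uses ColumnTransformer to impute only two variables and pass the rest through. The dataframe and its values are made up for the example. Note that, with default settings, the result is a NumPy array, so the column names are gone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data: variable_a and variable_b have missing values, variable_c does not.
X = pd.DataFrame(
    {
        "variable_a": [1.0, np.nan, 3.0, 5.0],
        "variable_b": [10.0, 20.0, np.nan, 40.0],
        "variable_c": [0.1, 0.2, 0.3, 0.4],
    }
)

# Impute only variable_a and variable_b; pass variable_c through unchanged.
imputer = ColumnTransformer(
    transformers=[
        ("median", SimpleImputer(strategy="median"), ["variable_a", "variable_b"]),
    ],
    remainder="passthrough",
)

X_t = imputer.fit_transform(X)

# With default settings the output is a NumPy array without column names.
print(type(X_t))
print(X_t)
```

Recent Scikit-learn versions also offer a set_output API to request dataframes back from transformers, which is one example of the functionality that has been added over time.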
Finally, with Pandas and Scikit-learn, we can implement the most commonly used methods to pre-process the data, but additional methods that were developed within organizations or data science competitions are not available.
Back to Feature-engine: how does it work?
Feature-engine transformers work exactly like Scikit-learn transformers. We need to instantiate the class with some parameters, then apply the fit() method so that the transformer learns the required parameters, and finally the transform() method to transform the data.
For example, we can instantiate a missing data imputer by indicating the imputation method to use and which variables to modify as follows:
from feature_engine.imputation import MeanMedianImputer

median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['variable_a', 'variable_b'],
)
We then apply the method fit() over a training set, so that the imputer learns the median values of each variable:
median_imputer.fit(X_train)
And finally, we apply the transform() method over any dataset we want. With that, missing data in the indicated variables will be replaced by the values learned in the previous step:
train_t = median_imputer.transform(X_train)
test_t = median_imputer.transform(X_test)
And that’s it. There is no more missing data in the 2 variables that we wanted.
What are the risks of incorrectly applied data transformations?
Feature engineering is an opportunity to extract more value from our features. In this sense, applying "incorrect" data transformations, if such a thing exists, will have a knock-on effect on the performance and interpretability of the machine learning model.
Some models make assumptions about the data. For example, linear models assume a linear relationship between features and the target. If our data transformations do not preserve or otherwise create that linear relationship, then the performance of the model will not be as good as it could be. Similarly, some models are negatively impacted by outliers, so not removing or censoring them may affect their performance.
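Censoring outliers follows the same learn-then-apply logic as imputation. The sketch below is a hypothetical, dependency-free capper written for this article: it learns the limits from the training data with the inter-quartile range proximity rule, and then caps any value that falls outside them. The class name IqrCapper, the fold of 1.5 and the data are all assumptions made for illustration.

```python
import statistics


class IqrCapper:
    """Hypothetical capper: learns limits with the IQR rule, then censors."""

    def __init__(self, fold=1.5):
        self.fold = fold

    def fit(self, values):
        # Quartiles are learned from the training data only.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        self.lower_ = q1 - self.fold * iqr
        self.upper_ = q3 + self.fold * iqr
        return self

    def transform(self, values):
        # Values beyond the learned limits are replaced by the limits.
        return [min(max(v, self.lower_), self.upper_) for v in values]


train_values = [10, 12, 11, 13, 12, 14, 11, 13, 100]  # 100 is an outlier
capper = IqrCapper().fit(train_values)

new_values = [9, 50, 2]
capped = capper.transform(new_values)
print(capper.lower_, capper.upper_, capped)
```

Because the limits are stored in the fitted object, new data is censored with the values learned from the training set and not with limits recomputed on the new data.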
How bad is a decrease in model performance? Well, it depends on what we are using the model for. If we are trying to prevent fraud, a decrease in model performance may cause thousands or millions of dollars lost to fraud, with the knock-on effect that this has on other consumers of the products, be it insurance or loans. If we are using a model to optimise advertising, I would argue that it is not a great loss, because well… it is just advertising.
The problem takes on another dimension if we use models to predict disease, decide who will get a visa, or a place at university. In the first case, patients’ health is at stake. In the latter two, we are talking about people’s futures and careers, not to mention lack of fairness and discrimination.
The last point brings me back again to interpretability. Are my features fair and neutral? Or are they discriminatory? Do I understand my features and how the data, even after the transformation, affects the predictions the model makes? If I do not understand my variables, I may as well not use them to train the models. Otherwise, I won’t know why decisions are made.
If you want to know more about the consequences of biased models, the book “Weapons of Math Destruction” offers a great collection of examples of models that, by being unfair, have negatively impacted the people about whom they made predictions. The movie “Coded Bias”, on the other hand, tells us about the consequences of using a model on a population that was not represented in the training dataset.
How does Feature-engine help mitigate those risks?
We mentioned that some models assume linear relationships between features and targets. Feature-engine includes a battery of transformations that return monotonic relationships between the transformed variable and the target. So, even if the linear, or well, monotonic, relationship did not exist in the raw data, it can be attained after a transformation. And this transformation is not done at the expense of interpretability. With the stored parameters we can, almost always, go back from the transformed variable to the raw data.
I would argue, however, that the main advantage is that Feature-engine "forces" users to use domain knowledge to decide which transformation to apply. It does so by not centralizing all transformations into one class, but instead grouping related transformations in a transformer. For example, with Scikit-learn’s SimpleImputer(), we can apply all the commonly used imputation techniques. Feature-engine has 3 transformers to cover the entire SimpleImputer() functionality. This is done intentionally, to avoid applying transformations that are appropriate for categorical variables to numerical variables, or transformations that will significantly distort the variable distribution to variables with few missing data points.
The decentralized design is spread throughout the package, so that we can, in each transformer, add functionality and information that helps users understand the advantages and limitations of that transformation, as well as errors that are raised when, for example, the variable is not suitable.
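To make the earlier point about monotonic relationships and stored parameters concrete, here is a minimal sketch of an ordered ordinal encoding: each category is replaced by an integer ranked by the mean of the target for that category. The function name fit_ordered_encoding and the toy data are made up for illustration, and this is a simplified version of the idea, not Feature-engine's implementation.

```python
from collections import defaultdict


def fit_ordered_encoding(categories, target):
    """Map each category to an integer ranked by its mean target value."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, target):
        totals[cat] += y
        counts[cat] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    ranked = sorted(means, key=means.get)  # lowest target mean first
    return {cat: rank for rank, cat in enumerate(ranked)}


colour = ["red", "blue", "green", "red", "blue", "green"]
defaulted = [1, 0, 1, 1, 0, 0]  # toy binary target

# Learn and store the mapping: a higher code means a higher target mean.
mapping = fit_ordered_encoding(colour, defaulted)
encoded = [mapping[c] for c in colour]

# The stored mapping lets us go back from the encoded values to the raw data.
inverse = {code: cat for cat, code in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping, encoded)
```

The encoded variable increases with the mean of the target by construction, and the stored mapping keeps the transformation reversible and easy to explain.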
When is Feature-engine not the best option?
Feature-engine has been designed to work with pandas dataframes, and to date, most of its functionality is geared toward tabular or cross-sectional data. These are the optimal conditions to use Feature-engine. If our data cannot be stored in a dataframe or if it is not tabular, for example, if we have time series, then Feature-engine is not the right choice at the moment.
What is next for Feature-engine?
In our latest release in January 2022, we made a massive improvement to the documentation, including more explanations and examples on how to use Feature-engine’s transformers. We also released a new transformer that automatically extracts features from datetime variables, and a new selection algorithm based on the feature’s population stability index, widely used in finance.
Next, we want to expand Feature-engine to create features from time series for forecasting. We want Feature-engine to be able to create lag features and window features using the fit() and transform() functionality that we love. And we also want to expand the functionality of datetime features, for example, by creating features from combinations of datetime variables. An example would be determining age from the difference between the date of birth and the time of application.
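For readers who have not met these features before, the sketch below shows what lag features, window features and a feature derived from two datetime variables look like. It uses plain pandas, not Feature-engine functionality, and all column names and values are made up for illustration.

```python
import pandas as pd

# Toy daily time series.
sales = pd.DataFrame(
    {"sales": [10.0, 12.0, 11.0, 15.0, 14.0]},
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

# Lag feature: the value observed one step earlier.
sales["sales_lag_1"] = sales["sales"].shift(1)

# Window feature: mean of the 3 previous values. Shifting first keeps the
# current value out of its own window, so the feature only uses the past.
sales["sales_mean_3"] = sales["sales"].shift(1).rolling(window=3).mean()

# Feature from a combination of datetime variables: approximate age in years
# at the time of application.
applications = pd.DataFrame(
    {
        "date_of_birth": pd.to_datetime(["1990-06-15", "1985-01-01"]),
        "application_date": pd.to_datetime(["2022-01-20", "2022-01-20"]),
    }
)
age_days = (applications["application_date"] - applications["date_of_birth"]).dt.days
applications["age_years"] = (age_days / 365.25).round(1)

print(sales)
print(applications)
```

The first rows of the lag and window features are missing because there is no past to look at, which is something a forecasting pipeline needs to handle explicitly.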
How can we support Feature-engine?
If you find that Feature-engine is a great package, there are plenty of ways in which you can support its further development and maintenance. You can contribute code to improve its functionality, donate to support the maintainer (aka, me), suggest new features to be included, write a blog, speak about Feature-engine at a meetup, or with colleagues and students, and any other way you can think of to spread the word.
Feature-engine is an inclusive project. We welcome everyone. If you are looking to make your first contribution to open source, if you are an experienced developer, if you want to learn Python or machine learning, if you want to have fun coding, if you have too much time and don’t know what to do with it, if you want to teach us a couple of things, or if you need specific functionality, jump on board. We are happy to have you.
If you made it this far, well done and thanks for reading!
To contribute to Feature-engine, check our contributing guidelines. As tedious as guidelines sound, they will save you a lot of time in setting up the environment and troubleshooting git.
To learn more about Feature-engine visit its documentation. To learn more about feature engineering and feature selection in general, check out my courses Feature Engineering for Machine Learning and Feature Selection for Machine Learning.
For an amazing discussion on how machine learning can negatively affect people’s lives when not used correctly, check out the book “Weapons of Math Destruction”.
That’s all, I hope you enjoyed the article.
Soledad Galli, PhD is the Lead Data Scientist and machine learning instructor at Train in Data. Sole teaches intermediate and advanced courses in data science and machine learning. She worked in finance and insurance, received a Data Science Leaders Award in 2018 and was selected as "LinkedIn’s voice" in data science and analytics in 2019. She is also the creator and maintainer of the Python open source library Feature-engine. Sole is passionate about sharing knowledge and helping others succeed in data science.