
KDnuggets Home » News » 2021 » Apr » Opinions » Why Automated Feature Selection Has Its Risks ( 21:n14 )

Why Automated Feature Selection Has Its Risks

Theoretical relevance of features must not be ignored.

By Michael Grogan, Data Science Consultant

Source: Photo by geralt from Pixabay


Coming from an economics background, my first introduction to data modelling was through econometrics — which relies heavily on linear regression to establish the relationships between variables.

When configuring a regression model, it was always emphasised that while the process of selecting features that explain the outcome variable could be automated to a degree — theory should always trump statistics, i.e. correlation does not imply causation.

For instance, it has been observed that there is a strong statistical correlation between the number of ice creams consumed and the homicide rate. However, it is hard to argue that consumption of ice cream directly leads to higher homicide rates. It could be the case that hot weather contributes to increased aggression in people which in turn leads to higher homicide rates — increased ice-cream consumption simply being a by-product of this phenomenon.


Issues with Automating Feature Selection

In this data-driven world that we are living in, one often has to filter through thousands of different features to determine the ones that impact the outcome variable.

It is not feasible to manually observe every feature in isolation and determine whether it adequately explains the outcome variable. In this regard, we must rely on automation to a certain degree.

However, the problem arises when these feature selection tools are merely used to select the features that should be included in a model — without double-checking the selected features manually to ensure they make theoretical sense.

Using the ExtraTreesClassifier, let’s take a look at some examples of how feature selection can go wrong without a trained eye.


ExtraTreesClassifier: Hotel Cancellations

For this purpose, let’s consider the ExtraTreesClassifier as the feature selection tool.

ExtraTreesClassifier is an ensemble learning method that fits a large number of extremely randomized decision trees; the impurity-based importance scores it produces can be used to rank features by how strongly they explain variation in the outcome variable. Because split thresholds are drawn at random rather than optimised for each candidate feature, the method also helps guard against overfitting the data.
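As a sketch, ranking features by importance with scikit-learn's ExtraTreesClassifier looks something like the following. The data here is synthetic (via make_classification); in the article the features come from the hotel cancellations dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the hotel dataset: 8 features, 3 informative.
X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances: one score per feature, summing to 1.
ranking = np.argsort(model.feature_importances_)[::-1]
for idx in ranking:
    print(f"feature {idx}: {model.feature_importances_[idx]:.3f}")
```

The features at the top of this ranking are the candidates for inclusion in the model, subject to the manual sanity checks discussed below.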

Now, imagine this scenario. You are presented with a hotel cancellations dataset, whose outcome variable indicates whether or not the customer cancelled their booking.

The original data in this example from Antonio, Almeida and Nunes (2019) is available here.

Various features are included in the dataset, including the customer’s country of origin, their lead time (time between booking the room and their stay), their reserved room type, among others.

Source: Jupyter Notebook Output


You decide to use an ExtraTreesClassifier to rank the features in order of importance (higher values indicate greater importance):

Source: Jupyter Notebook Output


The three highest ranked features are feature 21 (deposit type), feature 12 (country of origin) and feature 27 (reservation status).

However, let’s take a closer look at feature 27. This is a categorical variable with the categories:

  • Cancelled
  • No-Show
  • Check-Out

Any customer who has cancelled their booking will be assigned to the “Cancelled” category. Consequently, the ReservationStatus variable shows near-perfect collinearity with the IsCanceled outcome variable. No wonder the ExtraTreesClassifier shows such a high value for this feature: it is effectively describing the exact same thing as the outcome variable!
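This kind of target leakage is easy to reproduce on synthetic data: append a copy of the outcome as an extra column and it dominates the importance ranking, just as ReservationStatus does here. (A hypothetical illustration, not the article's actual dataset.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Column 5 is a verbatim copy of the target, i.e. pure leakage.
X_leaky = np.column_stack([X, y])

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_leaky, y)

print(model.feature_importances_.argmax())  # the leaky column, index 5
```

The fix is the one the article arrives at: drop the offending column (with pandas, something like df.drop(columns=["ReservationStatus"])) before fitting the final model.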

However, including this feature in a model would be useless in a real-world scenario — a hotel would have no way of knowing whether a customer will cancel until after the fact.

In this regard, the correct decision is to drop this feature from the eventual model.

Had one relied fully on automated feature selection, this feature would have been kept in the model, vastly skewing the results while bearing no theoretical relevance to real-world scenarios.


Another Example: Average Daily Rates

Using the same dataset, let’s now take a look at a different variable: ADR (average daily rates). This variable describes the average spend by a customer at a hotel per day.

Source: Jupyter Notebook Output


The first inclination might be to use a feature selection tool to determine the features that best explain (at least statistically) the fluctuations in ADR.

However, it is only when we take a closer look at the data that we notice ADR values are included for both the customers who cancelled, and those that followed through with the booking.

In most cases, the ADR reflects the value that the customers who cancelled would have spent had they stayed at the hotel — but in reality the hotel will no longer make any money from these customers.
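One way to spot this is simply to split ADR by cancellation status. A sketch with hypothetical rows (the column names follow the Antonio et al. dataset, and the DataFrame literal stands in for reading the real CSV):

```python
import pandas as pd

# Stand-in for pd.read_csv on the hotel bookings file.
df = pd.DataFrame({
    "IsCanceled": [0, 1, 0, 1, 0],
    "ADR": [75.0, 110.0, 98.0, 120.0, 85.0],
})

# Cancelled bookings (IsCanceled == 1) still carry an ADR value,
# even though the hotel earns nothing from those customers.
print(df.groupby("IsCanceled")["ADR"].mean())  # 0 -> 86.0, 1 -> 115.0
```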

In this regard, implementing feature selection with ADR as the dependent variable would be erroneous.

Instead, a more reasonable approach would be to assume that the features which indicate whether a customer will cancel their booking or not will also be of relevance in determining that customer’s ADR. Customers that follow through with the booking have already demonstrated greater customer loyalty than those who cancelled. In this regard, even if a customer who cancelled originally had a high ADR value — this is now redundant since the booking will not go ahead.

Taking this into account, a regression-based neural network model was built to predict ADR using the following features:

  • IsCanceled
  • Country of origin
  • Market segment
  • Deposit type
  • Customer type
  • Required car parking spaces
  • Arrival Date: Week Number

The model ultimately demonstrated a mean absolute error of 28 relative to the mean ADR of 105 across the test set. While the model showed a higher RMSE of 43, MAE was judged to be a better gauge of overall performance as RMSE was overly inflated by the presence of a select few customers with a much higher ADR than that of the majority of customers.
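The gap between the two metrics is easy to reproduce: because RMSE squares the errors before averaging, a single extreme error inflates it far more than MAE.

```python
import numpy as np

# Mostly small errors plus one outlier, mimicking a few customers
# with much higher ADR than the majority.
errors = np.array([2.0, 3.0, 2.5, 3.5, 50.0])

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, rmse)  # MAE = 12.2, RMSE ~ 22.5
```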

There are many important factors that can influence a customer’s ADR such as their yearly income, currency fluctuations, prices of competing chains, among others, that have not been included in the dataset. In this regard, the included features themselves are limited in being able to explain all the variation in ADR values — but the model has performed quite well given this limitation.



Conclusion

The phrase “garbage in, garbage out” also applies to feature selection. If the features in the dataset are nonsensical, feature selection tools cannot interpret them in a meaningful way. Proper understanding of the data is paramount, and automated feature selection must be balanced with domain knowledge in order to judge whether a feature is appropriate for use in a model.

In this article, you have seen:

  • The use of ExtraTreesClassifier in feature selection
  • Shortcomings of automating feature selection
  • Importance of manually interpreting the features in question

Many thanks for your time, and any questions or feedback are greatly appreciated. You can find the GitHub repository for this example here.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.

Bio: Michael Grogan is a Data Science Consultant. He possesses expertise in time series analysis, statistics, Bayesian modeling, and machine learning with TensorFlow.

Original. Reposted with permission.

