Feature selection by random search in Python

Feature selection is one of the most important tasks in machine learning. Learn how to use a simple random search in Python to get good results in less time.



By Gianluca Malato, Data Scientist, fiction author, and software developer.

Feature selection has always been a key task in machine learning. In my experience, I can say that feature selection is often even more important than model selection itself.

 

Feature selection and collinearity

I have already written an article about feature selection. It described an unsupervised way to measure feature importance in a binary classification problem, using Pearson’s chi-square test and the correlation coefficient.

Generally speaking, an unsupervised approach is often enough for simple feature selection. However, each model has its own way of “thinking about” the features and treating their correlation with the target variable. Moreover, some models do not care much about collinearity (i.e., correlation between the features), while others (for example, linear models) suffer serious problems when it occurs.

Although it’s possible to rank the features by some relevance metric provided by the model (for example, the p-value of the t-test performed on the coefficients of a linear regression), taking only the most relevant variables may not be enough. Think about a feature that is equal to another one, just multiplied by two. The linear correlation between these two features is 1, and this simple multiplication doesn’t affect the correlation with the target variable. So, if we take only the most relevant variables, we’ll keep both the original feature and the multiplied one. This leads to collinearity, which can be quite dangerous for our model.
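To make this concrete, here is a minimal sketch (with purely illustrative synthetic data, not the article’s dataset) showing that a feature and its doubled copy are perfectly correlated with each other and share the same correlation with the target:

import numpy as np

# Illustrative synthetic data (assumed for this sketch only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)          # original feature
x2 = 2 * x                        # the same feature, multiplied by two
y = x + rng.normal(size=100)      # a target correlated with x

# The two features are perfectly collinear...
print(np.corrcoef(x, x2)[0, 1])   # 1.0
# ...and equally correlated with the target, so a relevance ranking keeps both
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x2, y)[0, 1])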

That’s why we must introduce some way to better select our features.

 

Random search

Random search is a really useful tool in a data scientist’s toolbox. It’s a very simple technique that is often used, for example, in cross-validation and hyperparameter optimization.

If you have a multi-dimensional grid and want to find the point on this grid that maximizes (or minimizes) some objective function, random search works as follows:

  1. Take a random point on the grid and measure the objective function value.
  2. If the value is better than the best one achieved so far, keep the point in memory.
  3. Repeat for a certain, pre-defined number of times.

That’s it: just generate random points and look for the best one.
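In code, a toy version of this procedure might look like the following sketch (the grid and the objective function are purely illustrative):

import numpy as np

# Illustrative grid of candidate points and an objective to maximize
grid = [(a, b) for a in range(10) for b in range(10)]
def objective(p):
    return -(p[0] - 3) ** 2 - (p[1] - 7) ** 2

np.random.seed(0)
best_point, best_value = None, -np.inf
for _ in range(20):                              # pre-defined number of iterations
    point = grid[np.random.randint(len(grid))]   # take a random point on the grid
    value = objective(point)
    if value > best_value:                       # keep the best point seen so far
        best_point, best_value = point, value

print(best_point, best_value)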

Is this a good way to find the global minimum (or maximum)? Of course, it’s not. The point we are looking for is just one point (if we are lucky) in a very large space, and we have only a limited number of iterations. The probability of hitting that single point with one random draw from an N-point grid is just 1/N.

So, why is random search used so much? Because we never really want to maximize our performance measure; we want a good, reasonably high value that is not the highest possible one, in order to avoid overfitting.

That’s why random search works and can be used in feature selection.

 

How to use random search for feature selection

Random search can be used for feature selection with quite good results. An example of a procedure similar to random search is the Random Forest model, which performs a random selection of the features for each tree.

The idea is pretty simple: choose the features randomly, measure the model performance by k-fold cross-validation, and repeat many times. The feature combination that gives the best performance is the one we are looking for.

More precisely, these are the steps to follow:

  1. Generate a random integer number N between 1 and the number of features.
  2. Generate a random sequence of N integers between 0 and the total number of features minus 1, without repetition. This sequence represents our array of feature indices. Remember that Python arrays start from 0.
  3. Train the model on these features and cross-validate it with k-fold cross-validation, saving the average value of some performance measure.
  4. Repeat from point 1 as many times as you want.
  5. Finally, get the feature array that gives the best performance according to the chosen performance measure.

 

A practical example in Python

For this example, I’ll use the breast cancer dataset included in the sklearn module. Our model will be a logistic regression, and we’ll perform a 5-fold cross-validation using accuracy as the performance measure.

First of all, we must import the necessary modules.

import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

 

Then we can import the breast cancer data and split it into input and target.

dataset = sklearn.datasets.load_breast_cancer()
data = dataset.data
target = dataset.target

 

We can now create a logistic regression object.

lr = LogisticRegression()

 

Then, we can measure the average accuracy in k-fold CV with all the features.

# Model accuracy using all the features
np.mean(cross_val_score(lr,data,target,cv=5,scoring="accuracy"))
# 0.9509041939207385

 

It’s 95%. Let’s keep this in mind.

Now, we can implement a random search with, for example, 300 iterations.

result = []

# Number of iterations
N_search = 300

# Random seed initialization
np.random.seed(1)

for i in range(N_search):
    # Generate a random number of features
    N_columns = list(np.random.choice(range(data.shape[1]), 1) + 1)
    
    # Given the number of features, generate features without replacement
    columns = list(np.random.choice(range(data.shape[1]), N_columns, replace=False))
    
    # Perform k-fold cross validation
    scores = cross_val_score(lr,data[:,columns], target, cv=5, scoring="accuracy")
    
    # Store the result
    result.append({'columns':columns,'performance':np.mean(scores)})

# Sort the result array in descending order for performance measure
result.sort(key=lambda x : -x['performance'])

 

At the end of the loop, after sorting, the first element of the result list is the feature combination we are looking for.

We can use its column indices to compute the performance measure again with this subset of the features.

np.mean(cross_val_score(lr, data[:,result[0]['columns']], target, cv=5, scoring="accuracy"))
# 0.9526741054251634

 

As you can see, accuracy has increased.
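If we also want to see which features were selected, we can map the stored column indices back to the dataset’s feature names (this step is not part of the original procedure; it only relies on the feature_names attribute that load_breast_cancer provides):

# Names of the features in the best random subset
selected = dataset.feature_names[result[0]['columns']]
print(len(selected), "features selected:")
print(selected)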

 

Conclusions

Random search can be a powerful tool for feature selection. It’s not meant to explain why some features are more useful than others (as opposed to other feature selection procedures like Recursive Feature Elimination), but it can be a useful tool for reaching good results in less time.
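For comparison, here is a minimal sketch of Recursive Feature Elimination on the same data with the same kind of estimator (the number of features to keep and the max_iter setting are arbitrary choices for illustration):

from sklearn.feature_selection import RFE

# Recursively drop the least important features until 10 remain (arbitrary choice)
rfe = RFE(estimator=LogisticRegression(max_iter=10000), n_features_to_select=10)
rfe.fit(data, target)
print(dataset.feature_names[rfe.support_])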

 

Original. Reposted with permission.
