Feature selection by random search in Python
Feature selection is one of the most important tasks in machine learning. Learn how to use a simple random search in Python to get good results in less time.
By Gianluca Malato, Data Scientist, fiction author, and software developer.
Feature selection has always been a key task in machine learning. In my experience, I can confidently say that feature selection is often more important than model selection itself.
Feature selection and collinearity
I have already written an article about feature selection. It described an unsupervised way to measure feature importance in a binary classification model, using Pearson's chi-square test and the correlation coefficient.
Generally speaking, an unsupervised approach is often enough for simple feature selection. However, each model has its own way of "thinking about" the features and of treating their correlation with the target variable. Moreover, some models do not care much about collinearity (i.e., correlation between the features), while others show serious problems when it occurs (for example, linear models).
Although it's possible to rank the features by some relevance metric provided by the model (for example, the p-value of the t-test performed on the coefficients of a linear regression), taking only the most relevant variables may not be enough. Think about a feature that is equal to another one, just multiplied by two. The linear correlation between these features is 1, and this simple multiplication doesn't affect the correlation with the target variable, so if we take only the most relevant variables, we'll keep both the original feature and the multiplied one. This leads to collinearity, which can be quite dangerous for our model.
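This is easy to verify numerically. The short sketch below (my own illustration; the variable names are arbitrary) shows that a feature and its doubled copy have a Pearson correlation of exactly 1:

```python
import numpy as np

# A feature and the same feature multiplied by two
rng = np.random.default_rng(0)
x = rng.normal(size=100)
x2 = 2 * x

# Pearson correlation between the two features is exactly 1
corr = np.corrcoef(x, x2)[0, 1]
print(round(corr, 6))
```

Any per-feature relevance ranking would score both columns identically, so a "take the top k" rule keeps both and introduces collinearity.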
That’s why we must introduce some way to better select our features.
Random search
Random search is a really useful tool in a data scientist's toolbox. It's a very simple technique, often used, for example, in cross-validation and hyperparameter optimization.
It's very simple. If you have a multidimensional grid and want to look for the point on this grid that maximizes (or minimizes) some objective function, random search works as follows:
 Take a random point on the grid and measure the objective function's value.
 If the value is better than the best one achieved so far, keep the point in memory.
 Repeat for a certain, predefined number of times.
That's it: just generate random points and look for the best one.
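The steps above can be sketched as a small generic function. This is only an illustration; `random_search`, the toy objective, and the grid below are my own names, not from the article:

```python
import numpy as np

def random_search(objective, grid, n_iter=100, seed=1):
    """Sample random points from the grid, keep the best value seen so far."""
    rng = np.random.default_rng(seed)
    best_point, best_value = None, -np.inf
    for _ in range(n_iter):
        # Take a random point on the grid and measure the objective
        point = grid[rng.integers(len(grid))]
        value = objective(point)
        # If it beats the best so far, keep it in memory
        if value > best_value:
            best_point, best_value = point, value
    return best_point, best_value

# Toy example: maximize -(x - 3)^2 over a 1-D grid of 1001 points
grid = np.linspace(-10, 10, 1001)
point, value = random_search(lambda x: -(x - 3) ** 2, grid, n_iter=300)
```

With 300 draws from a 1001-point grid, the result lands very close to the true maximum at x = 3, even though no single draw is guaranteed to hit it exactly.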
Is this a good way to find the global minimum (or maximum)? Of course not. The point we are looking for is just one point (if we are lucky) in a very large space, and we have only a limited number of iterations. The probability of hitting that single point on an N-point grid in one draw is 1/N.
So, why is random search so widely used? Because we never really want to maximize our performance measure; we want a good, reasonably high value that is not the highest possible, in order to avoid overfitting.
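A quick back-of-the-envelope calculation (my own illustration, not from the article) makes this concrete: hitting the single best point of a large grid is hopeless, but landing somewhere in the top 1% after a few hundred draws is very likely.

```python
# Probability that at least one of n uniform random draws lands in the
# top fraction of a grid: 1 - (1 - fraction)^n
def hit_probability(n_draws, top_fraction):
    return 1 - (1 - top_fraction) ** n_draws

# On a hypothetical grid of one million points:
p_best = hit_probability(300, 1 / 1_000_000)  # hitting the single best point
p_top1 = hit_probability(300, 0.01)           # landing in the top 1%
print(p_best, p_top1)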
That’s why random search works and can be used in feature selection.
How to use random search for feature selection
Random search can be used for feature selection with quite good results. An example of a procedure similar to random search is the Random Forest model, which selects a random subset of the features for each tree.
The idea is pretty simple: choose the features randomly, measure model performance by k-fold cross-validation, and repeat many times. The feature combination that gives the best performance is the one we are looking for.
More precisely, these are the steps to follow:
 Generate a random integer N between 1 and the number of features.
 Generate a random sequence of N integers between 0 and the number of features minus 1, without repetition. This sequence represents our feature array. Remember that Python arrays start from 0.
 Train the model on these features and cross-validate it with k-fold cross-validation, saving the average value of some performance measure.
 Repeat from point 1 as many times as you want.
 Finally, take the feature array that gives the best performance according to the chosen measure.
A practical example in Python
For this example, I'll use the breast cancer dataset included in the sklearn module. Our model will be a logistic regression, and we'll perform 5-fold cross-validation using accuracy as the performance measure.
First of all, we must import the necessary modules.
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
Then we can import the breast cancer data and split it into input and target.
dataset = sklearn.datasets.load_breast_cancer()
data = dataset.data
target = dataset.target
We can now create a logistic regression object.
lr = LogisticRegression()
Then, we can measure the average accuracy in k-fold CV using all the features.
# Model accuracy using all the features
np.mean(cross_val_score(lr, data, target, cv=5, scoring="accuracy"))
# 0.9509041939207385
It’s 95%. Let’s keep this in mind.
Now, we can implement a random search with, for example, 300 iterations.
result = []

# Number of iterations
N_search = 300

# Random seed initialization
np.random.seed(1)

for i in range(N_search):
    # Generate a random number of features
    N_columns = list(np.random.choice(range(data.shape[1]), 1) + 1)
    # Given the number of features, pick that many features without replacement
    columns = list(np.random.choice(range(data.shape[1]), N_columns, replace=False))
    # Perform k-fold cross-validation
    scores = cross_val_score(lr, data[:, columns], target, cv=5, scoring="accuracy")
    # Store the result
    result.append({'columns': columns, 'performance': np.mean(scores)})

# Sort the result list in descending order of the performance measure
result.sort(key=lambda x: x['performance'], reverse=True)
At the end of the loop, after sorting, the first element of the result list is the feature set we are looking for.
We can use it to calculate the performance measure on this subset of the features.
np.mean(cross_val_score(lr, data[:, result[0]['columns']], target, cv=5, scoring="accuracy"))
# 0.9526741054251634
As you can see, accuracy has increased.
Conclusions
Random search can be a powerful tool for feature selection. It's not meant to explain why some features are more useful than others (as opposed to other feature selection procedures, like Recursive Feature Elimination), but it can be a useful way to reach good results in less time.
Original. Reposted with permission.