# Feature selection by random search in Python

Feature selection is one of the most important tasks in machine learning. Learn how to use a simple random search in Python to get good results in less time.

**By Gianluca Malato, Data Scientist, fiction author, and software developer.**.

**Feature selection** has always been a great task in machine learning. According to my experience, I can surely say that feature selection **is much more important** than model selection itself.

### Feature selection and collinearity

I have already written an article about feature selection. It was an **unsupervised **way to measure feature importance in a **binary classification** model, using Pearson’s chi-square test and correlation coefficient.

Generally speaking, an unsupervised approach is often enough for a simple feature selection. However, each model has **its own way** of “thinking” the features and treat their correlation with the target variable. Moreover, there are models that do not care too much about **collinearity **(i.e., the correlation between the features) and other models that show **very big problems** when it occurs (for example, linear models).

Although it’s possible to **rank **the features by some kind of **relevance** metrics introduced by the model (for example, the p-value of the t-test performed on the coefficients of linear regression), taking only the most relevant variables couldn’t be enough. Think about a feature that is equal to another one, just multiplied by two. The linear correlation between these features if 1 and this simple multiplication doesn’t affect the correlation with the target variable, so if we take only the most relevant variables, we’ll take the original feature and the multiplied one. This leads to **collinearity**, which can be quite dangerous for our model.

That’s why we must introduce some way to better select our features.

### Random search

Random search is a really useful tool in a data scientist toolbox. It’s a very simple technique very often used, for example, in cross-validation and **hyperparameter optimization**.

It’s very simple. If you have a multi-dimensional grid and want to look for the point on this grid which maximizes (or minimizes) some **objective function**, random search works as follows:

- Take a random point on the grid and measure the objective function value
- If the value is better than the best one achieved so far, keep the point in memory.
- Repeat for a certain, pre-defined number of times

That’s it. Just generating random points and look for the best one.

Is this a **good way** to find the global minimum (or maximum)? Of course, it’s not. The point we are looking for is only one (if we are lucky) in a very large space and we have only a limited number of iterations. The probability of getting that single point in an *N-*point grid is *1/N*.

So, why is random search so much used? Because we **never really want** to maximize our performance measure; we want a good, **reasonably high value** that it’s not the highest possible, to avoid overfitting.

That’s why random search works and can be used in feature selection.

### How to use random search for feature selection

Random search can be used for feature selection with quite **good results**. An example of a procedure similar to a random search is the **Random Forest** model which performs a random selection of the features for each tree.

The idea is pretty simple: choose the features **randomly**, measure the model performances by **k-fold cross-validation**, and repeat many times. The feature combination that gives the best performances is the one we are looking for.

More precisely, these are the steps to follow:

- Generate a random integer number
*N*between 1 and the number of features. - Generate a random sequence of
*N*integer numbers between 0 and*N-1*without repetition. This sequence represents our feature array. Remember that Python arrays start from 0. - Train the model on these features and cross-validate it with k-fold cross-validation, saving the
**average value**of some performance measure. - Repeat from point 1 as many times as you want.
- Finally, get the feature array that gives the best performances according to the chosen performance measure.

### A practical example in Python

For this example, I’ll use the **breast cancer dataset** included in **sklearn** module. Our model will be a **logistic regression**, and we’ll perform a 5-fold cross-validation using **accuracy** as the performance measure.

First of all, we must import the necessary modules.

import sklearn.datasets from sklearn.linear_model import LogisticRegression from sklearn.model_selection import cross_val_score import numpy as np

Then we can import breast cancer data and break it in input and target.

dataset= sklearn.datasets.load_breast_cancer() data = dataset.data target = dataset.target

We can now create a logistic regression object.

lr = LogisticRegression()

Then, we can measure the average accuracy in k-fold CV with all the features.</p.

# Model accuracy using all the features np.mean(cross_val_score(lr,data,target,cv=5,scoring="accuracy")) # 0.9509041939207385

It’s 95%. Let’s keep this in mind.

Now, we can implement a random search with, for example, 300 iterations.

result = [] # Number of iterations N_search = 300 # Random seed initialization np.random.seed(1) for i in range(N_search): # Generate a random number of features N_columns = list(np.random.choice(range(data.shape[1]),1)+1) # Given the number of features, generate features without replacement columns = list(np.random.choice(range(data.shape[1]), N_columns, replace=False)) # Perform k-fold cross validation scores = cross_val_score(lr,data[:,columns], target, cv=5, scoring="accuracy") # Store the result result.append({'columns':columns,'performance':np.mean(scores)}) # Sort the result array in descending order for performance measure result.sort(key=lambda x : -x['performance'])

At the end of the loop and the sorting function, the first element of *result *list is the object we are looking for.

We can use this value to calculate the new performance measure with this subset of the features.

np.mean(cross_val_score(lr, data[:,result[0][‘columns’]], target, cv=5, scoring=”accuracy”)) # 0.9526741054251634

As you can see, accuracy has increased.

### Conclusions

Random search can be a powerful tool to perform feature selection. It’s not meant to give the reasons why some features are more useful than other ones (as opposed to other feature selection procedures like Recursive Feature Elimination), but it can be a useful tool to reach **good results** in less time.

Original. Reposted with permission.

**Related:**