# Feature Reduction using Genetic Algorithm with Python

This tutorial discusses how to use the genetic algorithm (GA) for reducing the feature vector extracted from the Fruits360 dataset in Python mainly using NumPy and Sklearn.

After preparing the features, class labels, and the GA parameters, we can go through the iterations of the GA according to the next code. At first, the fitness value for all solutions is calculated by calling the fitness function named **cal_pop_fitness()** defined in the GA file. This function accepts the current population, the extracted features, the class labels, the train indices, and the test indices. The function returns the fitness value for all solutions in a variable named **fitness**. Remember that the fitness value represents the classification accuracy. The best (i.e. highest) classification accuracy is saved into the **best_outputs** list.

Based on the calculated fitness values, the best solutions which has the highest classification accuracy are selected as parents in the mating pool using the **select_mating_pool()** function defined in the GA.py file. It accepts the current population, fitness values, and the number of parents to return. It returns the selected parents into the **parents** variable.

for generation in range(num_generations): print("Generation : ", generation) # Measuring the fitness of each chromosome in the population. fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices) best_outputs.append(numpy.max(fitness)) # The best result in the current iteration. print("Best result : ", best_outputs[-1]) # Selecting the best parents in the population for mating. parents = GA.select_mating_pool(new_population, fitness, num_parents_mating) # Generating next generation using crossover. offspring_crossover = GA.crossover(parents, offspring_size=(pop_shape[0]-parents.shape[0], num_feature_elements)) # Adding some variations to the offspring using mutation. offspring_mutation = GA.mutation(offspring_crossover, num_mutations=num_mutations) # Creating the new population based on the parents and offspring. new_population[0:parents.shape[0], :] = parents new_population[parents.shape[0]:, :] = offspring_mutation

Next is to apply the crossover operation over the selected parents to create the offspring. This is done inside the **crossover()** function defined in the GA.py file. It accepts the parents and the shape of the offspring array to be returned later into the **offspring_crossover** variable. The mutation operation is then applied over that array using the **mutation()** function which is also available within the GA.py file. In addition to the crossover results, this function accepts the number of mutations.

Because the new population consists of the selected parents in addition to the offspring, both the **parents** and the **offspring_mutation** arrays are saved into the **new_population** variable. After that, a new generation is applied over the new population.

After all generations complete, the next code is executed in order to return the best selected set of feature elements and the number of selected elements. After the 100 generations complete, the algorithm used 174 feature elements in order to reach an accuracy of 99.59%.

fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices) # Then return the index of that solution corresponding to the best fitness. best_match_idx = numpy.where(fitness == numpy.max(fitness))[0] best_match_idx = best_match_idx[0] best_solution = new_population[best_match_idx, :] best_solution_indices = numpy.where(best_solution == 1)[0] best_solution_num_elements = best_solution_indices.shape[0] best_solution_fitness = fitness[best_match_idx] print("best_match_idx : ", best_match_idx) print("best_solution : ", best_solution) print("Selected indices : ", best_solution_indices) print("Number of selected elements : ", best_solution_num_elements) print("Best solution fitness : ", best_solution_fitness) matplotlib.pyplot.plot(best_outputs) matplotlib.pyplot.xlabel("Iteration") matplotlib.pyplot.ylabel("Fitness") matplotlib.pyplot.show()

The above code also displays a figure showing the progress of the GA over all generations which is shown below.

Here is the complete code of the main file.

import numpy import GA import pickle import matplotlib.pyplot f = open("dataset_features.pkl", "rb") data_inputs = pickle.load(f) f.close() f = open("outputs.pkl", "rb") data_outputs = pickle.load(f) f.close() num_samples = data_inputs.shape[0] num_feature_elements = data_inputs.shape[1] train_indices = numpy.arange(1, num_samples, 4) test_indices = numpy.arange(0, num_samples, 4) print("Number of training samples: ", train_indices.shape[0]) print("Number of test samples: ", test_indices.shape[0]) """ Genetic algorithm parameters: Population size Mating pool size Number of mutations """ sol_per_pop = 8 # Population size. num_parents_mating = 4 # Number of parents inside the mating pool. num_mutations = 3 # Number of elements to mutate. # Defining the population shape. pop_shape = (sol_per_pop, num_feature_elements) # Creating the initial population. new_population = numpy.random.randint(low=0, high=2, size=pop_shape) print(new_population.shape) best_outputs = [] num_generations = 100 for generation in range(num_generations): print("Generation : ", generation) # Measuring the fitness of each chromosome in the population. fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices) best_outputs.append(numpy.max(fitness)) # The best result in the current iteration. print("Best result : ", best_outputs[-1]) # Selecting the best parents in the population for mating. parents = GA.select_mating_pool(new_population, fitness, num_parents_mating) # Generating next generation using crossover. offspring_crossover = GA.crossover(parents, offspring_size=(pop_shape[0]-parents.shape[0], num_feature_elements)) # Adding some variations to the offspring using mutation. offspring_mutation = GA.mutation(offspring_crossover, num_mutations=num_mutations) # Creating the new population based on the parents and offspring. new_population[0:parents.shape[0], :] = parents new_population[parents.shape[0]:, :] = offspring_mutation # Getting the best solution after iterating finishing all generations. # At first, the fitness is calculated for each solution in the final generation. fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices) # Then return the index of that solution corresponding to the best fitness. best_match_idx = numpy.where(fitness == numpy.max(fitness))[0] best_match_idx = best_match_idx[0] best_solution = new_population[best_match_idx, :] best_solution_indices = numpy.where(best_solution == 1)[0] best_solution_num_elements = best_solution_indices.shape[0] best_solution_fitness = fitness[best_match_idx] print("best_match_idx : ", best_match_idx) print("best_solution : ", best_solution) print("Selected indices : ", best_solution_indices) print("Number of selected elements : ", best_solution_num_elements) print("Best solution fitness : ", best_solution_fitness) matplotlib.pyplot.plot(best_outputs) matplotlib.pyplot.xlabel("Iteration") matplotlib.pyplot.ylabel("Fitness") matplotlib.pyplot.show()

### GA.py Implementation

The implementation of the GA.py file is listed below. Within the **cal_pop_fitness()** function, the SVC is trained according to the selected feature elements by each solution. Before being trained, the features are filtered according to the selected elements whose genes are given a value of 1. This is done inside the **reduce_features()** function. It accepts the current solution in addition to the complete features for all samples.

After being trained, its classification accuracy is calculated using the **classification_accuracy()** function. This function returns the accuracy which is stored into an array named accuracies within the **cal_pop_fitness()** function.

The implementation of the **crossover()** and **mutation()** functions are very similar to what is discussed in my previous tutorial titled "**Genetic Algorithm Implementation in Python**". One major difference is that the **mutation()** function changes the randomly selected genes by flipping their values because we are using binary representation.

import numpy import sklearn.svm def reduce_features(solution, features): selected_elements_indices = numpy.where(solution == 1)[0] reduced_features = features[:, selected_elements_indices] return reduced_features def classification_accuracy(labels, predictions): correct = numpy.where(labels == predictions)[0] accuracy = correct.shape[0]/labels.shape[0] return accuracy def cal_pop_fitness(pop, features, labels, train_indices, test_indices): accuracies = numpy.zeros(pop.shape[0]) idx = 0 for curr_solution in pop: reduced_features = reduce_features(curr_solution, features) train_data = reduced_features[train_indices, :] test_data = reduced_features[test_indices, :] train_labels = labels[train_indices] test_labels = labels[test_indices] SV_classifier = sklearn.svm.SVC(gamma='scale') SV_classifier.fit(X=train_data, y=train_labels) predictions = SV_classifier.predict(test_data) accuracies[idx] = classification_accuracy(test_labels, predictions) idx = idx + 1 return accuracies def select_mating_pool(pop, fitness, num_parents): # Selecting the best individuals in the current generation as parents for producing the offspring of the next generation. parents = numpy.empty((num_parents, pop.shape[1])) for parent_num in range(num_parents): max_fitness_idx = numpy.where(fitness == numpy.max(fitness)) max_fitness_idx = max_fitness_idx[0][0] parents[parent_num, :] = pop[max_fitness_idx, :] fitness[max_fitness_idx] = -99999999999 return parents def crossover(parents, offspring_size): offspring = numpy.empty(offspring_size) # The point at which crossover takes place between two parents. Usually, it is at the center. crossover_point = numpy.uint8(offspring_size[1]/2) for k in range(offspring_size[0]): # Index of the first parent to mate. parent1_idx = k%parents.shape[0] # Index of the second parent to mate. parent2_idx = (k+1)%parents.shape[0] # The new offspring will have its first half of its genes taken from the first parent. offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point] # The new offspring will have its second half of its genes taken from the second parent. offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:] return offspring def mutation(offspring_crossover, num_mutations=2): mutation_idx = numpy.random.randint(low=0, high=offspring_crossover.shape[1], size=num_mutations) # Mutation changes a single gene in each offspring randomly. for idx in range(offspring_crossover.shape[0]): # The random value to be added to the gene. offspring_crossover[idx, mutation_idx] = 1 - offspring_crossover[idx, mutation_idx] return offspring_crossover

### For Contacting the Author

- E-mail: [email protected]
- LinkedIn: https://linkedin.com/in/ahmedfgad/
- KDnuggets: https://kdnuggets.com/author/ahmed-gad
- YouTube: https://youtube.com/AhmedGadFCIT
- TowardsDataScience: https://towardsdatascience.com/@ahmedfgad
- GitHub: https://github.com/ahmedfgad

**Bio: Ahmed Gad** received his B.Sc. degree with excellent with honors in information technology from the Faculty of Computers and Information (FCI), Menoufia University, Egypt, in July 2015. For being ranked first in his faculty, he was recommended to work as a teaching assistant in one of the Egyptian institutes in 2015 and then in 2016 to work as a teaching assistant and a researcher in his faculty. His current research interests include deep learning, machine learning, artificial intelligence, digital signal processing, and computer vision.

Original. Reposted with permission.

**Related:**

- A Quick Guide to Feature Engineering
- Artificial Neural Network Implementation using NumPy and Image Classification
- Genetic Algorithm Implementation in Python