Feature Reduction using Genetic Algorithm with Python

This tutorial discusses how to use the genetic algorithm (GA) to reduce the feature vector extracted from the Fruits360 dataset in Python, mainly using NumPy and scikit-learn.
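This section picks up after the features and class labels have been extracted from the Fruits360 images and pickled to disk. For readers starting here, the following is a hypothetical sketch of how dataset_features.pkl and outputs.pkl might be produced; the directory layout and the 360-bin hue-histogram features are assumptions about the earlier part of the tutorial, not code taken from it.

import os
import pickle
import numpy
import skimage.io
import skimage.color

features = []
labels = []
dataset_dir = "fruits-360/Training"  # hypothetical path to the Fruits360 training images
for class_idx, class_dir in enumerate(sorted(os.listdir(dataset_dir))):
    class_path = os.path.join(dataset_dir, class_dir)
    for img_name in os.listdir(class_path):
        img = skimage.io.imread(os.path.join(class_path, img_name))
        # Describe each image by a 360-bin histogram of its hue channel.
        hsv = skimage.color.rgb2hsv(img[:, :, :3])
        hist, _ = numpy.histogram(hsv[:, :, 0], bins=360)
        features.append(hist)
        labels.append(class_idx)

with open("dataset_features.pkl", "wb") as f:
    pickle.dump(numpy.array(features), f)
with open("outputs.pkl", "wb") as f:
    pickle.dump(numpy.array(labels), f)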



After preparing the features, class labels, and GA parameters, we can go through the GA iterations according to the next code. At first, the fitness value of every solution is calculated by calling the fitness function cal_pop_fitness() defined in the GA.py file. This function accepts the current population, the extracted features, the class labels, the train indices, and the test indices. It returns the fitness values of all solutions in a variable named fitness. Remember that the fitness value represents the classification accuracy. The best (i.e. highest) classification accuracy is saved into the best_outputs list.
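Before walking through the loop, note the representation: each solution (chromosome) in the population is a binary vector with one gene per feature element, where a gene of 1 keeps the corresponding feature and a gene of 0 drops it. A toy illustration (hypothetical 8-element feature vector, not from the tutorial):

import numpy

solution = numpy.array([1, 0, 1, 1, 0, 0, 1, 0])
selected = numpy.where(solution == 1)[0]
print("Selected indices : ", selected)  # [0 2 3 6]
print("Number of selected elements : ", selected.shape[0])  # 4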

Based on the calculated fitness values, the solutions with the highest classification accuracies are selected as parents in the mating pool using the select_mating_pool() function defined in the GA.py file. It accepts the current population, the fitness values, and the number of parents to return, and it returns the selected parents in the parents variable.

for generation in range(num_generations):
    print("Generation : ", generation)
    # Measuring the fitness of each chromosome in the population.
    fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)

    best_outputs.append(numpy.max(fitness))
    # The best result in the current iteration.
    print("Best result : ", best_outputs[-1])

    # Selecting the best parents in the population for mating.
    parents = GA.select_mating_pool(new_population, fitness, num_parents_mating)

    # Generating next generation using crossover.
    offspring_crossover = GA.crossover(parents, offspring_size=(pop_shape[0]-parents.shape[0], num_feature_elements))

    # Adding some variations to the offspring using mutation.
    offspring_mutation = GA.mutation(offspring_crossover, num_mutations=num_mutations)

    # Creating the new population based on the parents and offspring.
    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation


The next step is to apply the crossover operation to the selected parents to create the offspring. This is done inside the crossover() function defined in the GA.py file. It accepts the parents and the shape of the offspring array, which is later returned in the offspring_crossover variable. The mutation operation is then applied to that array using the mutation() function, which is also available in the GA.py file. In addition to the crossover results, this function accepts the number of mutations.

Because the new population consists of the selected parents in addition to the offspring, both the parents and the offspring_mutation arrays are saved into the new_population variable. The loop then moves on to the next generation, operating on this new population.

After all generations complete, the next code is executed to return the best selected set of feature elements and the number of selected elements. In this run, after the 100 generations completed, the algorithm selected 174 feature elements and reached an accuracy of 99.59%.

fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)
# Then return the index of that solution corresponding to the best fitness.
best_match_idx = numpy.where(fitness == numpy.max(fitness))[0]
best_match_idx = best_match_idx[0]

best_solution = new_population[best_match_idx, :]
best_solution_indices = numpy.where(best_solution == 1)[0]
best_solution_num_elements = best_solution_indices.shape[0]
best_solution_fitness = fitness[best_match_idx]

print("best_match_idx : ", best_match_idx)
print("best_solution : ", best_solution)
print("Selected indices : ", best_solution_indices)
print("Number of selected elements : ", best_solution_num_elements)
print("Best solution fitness : ", best_solution_fitness)

matplotlib.pyplot.plot(best_outputs)
matplotlib.pyplot.xlabel("Iteration")
matplotlib.pyplot.ylabel("Fitness")
matplotlib.pyplot.show()


The above code also displays a figure, shown below, tracking the progress of the GA over all generations.

Figure: Progress of the genetic algorithm (best fitness per generation).
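The tutorial stops at printing the indices of the selected elements. As a natural follow-up (not part of the original code), a final classifier could be trained on just the selected feature columns, reusing the variables defined by the script:

import sklearn.svm

# Keep only the columns selected by the best solution.
reduced_inputs = data_inputs[:, best_solution_indices]

final_classifier = sklearn.svm.SVC(gamma='scale')
final_classifier.fit(X=reduced_inputs[train_indices, :], y=data_outputs[train_indices])

predictions = final_classifier.predict(reduced_inputs[test_indices, :])
print("Final accuracy : ", numpy.mean(predictions == data_outputs[test_indices]))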

Here is the complete code of the main file.

import numpy
import GA
import pickle
import matplotlib.pyplot

f = open("dataset_features.pkl", "rb")
data_inputs = pickle.load(f)
f.close()

f = open("outputs.pkl", "rb")
data_outputs = pickle.load(f)
f.close()

num_samples = data_inputs.shape[0]
num_feature_elements = data_inputs.shape[1]

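# Take every 4th sample: indices 1, 5, 9, ... for training and 0, 4, 8, ...
# for testing (note that half of the samples are left unused by this split).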
train_indices = numpy.arange(1, num_samples, 4)
test_indices = numpy.arange(0, num_samples, 4)
print("Number of training samples: ", train_indices.shape[0])
print("Number of test samples: ", test_indices.shape[0])

"""
Genetic algorithm parameters:
    Population size
    Mating pool size
    Number of mutations
"""
sol_per_pop = 8 # Population size.
num_parents_mating = 4 # Number of parents inside the mating pool.
num_mutations = 3 # Number of elements to mutate.

# Defining the population shape.
pop_shape = (sol_per_pop, num_feature_elements)

# Creating the initial population.
new_population = numpy.random.randint(low=0, high=2, size=pop_shape)
print(new_population.shape)

best_outputs = []
num_generations = 100
for generation in range(num_generations):
    print("Generation : ", generation)
    # Measuring the fitness of each chromosome in the population.
    fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)

    best_outputs.append(numpy.max(fitness))
    # The best result in the current iteration.
    print("Best result : ", best_outputs[-1])

    # Selecting the best parents in the population for mating.
    parents = GA.select_mating_pool(new_population, fitness, num_parents_mating)

    # Generating next generation using crossover.
    offspring_crossover = GA.crossover(parents, offspring_size=(pop_shape[0]-parents.shape[0], num_feature_elements))

    # Adding some variations to the offspring using mutation.
    offspring_mutation = GA.mutation(offspring_crossover, num_mutations=num_mutations)

    # Creating the new population based on the parents and offspring.
    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation

# Getting the best solution after finishing all generations.
# At first, the fitness is calculated for each solution in the final generation.
fitness = GA.cal_pop_fitness(new_population, data_inputs, data_outputs, train_indices, test_indices)
# Then return the index of that solution corresponding to the best fitness.
best_match_idx = numpy.where(fitness == numpy.max(fitness))[0]
best_match_idx = best_match_idx[0]

best_solution = new_population[best_match_idx, :]
best_solution_indices = numpy.where(best_solution == 1)[0]
best_solution_num_elements = best_solution_indices.shape[0]
best_solution_fitness = fitness[best_match_idx]

print("best_match_idx : ", best_match_idx)
print("best_solution : ", best_solution)
print("Selected indices : ", best_solution_indices)
print("Number of selected elements : ", best_solution_num_elements)
print("Best solution fitness : ", best_solution_fitness)

matplotlib.pyplot.plot(best_outputs)
matplotlib.pyplot.xlabel("Iteration")
matplotlib.pyplot.ylabel("Fitness")
matplotlib.pyplot.show()


 

GA.py Implementation

 
The implementation of the GA.py file is listed below. Within the cal_pop_fitness() function, the SVC is trained according to the feature elements selected by each solution. Before the classifier is trained, the features are filtered according to the selected elements, i.e. those whose genes are given a value of 1. This is done inside the reduce_features() function, which accepts the current solution in addition to the complete features of all samples.

After the classifier is trained, its classification accuracy is calculated using the classification_accuracy() function. This function returns the accuracy, which is stored in an array named accuracies within the cal_pop_fitness() function.
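As a side note, the hand-rolled classification_accuracy() computes the same value as scikit-learn's built-in helper, which could be used as a drop-in replacement inside cal_pop_fitness():

import numpy
import sklearn.metrics

test_labels = numpy.array([0, 1, 1, 2])
predictions = numpy.array([0, 1, 2, 2])
# Equivalent to classification_accuracy(test_labels, predictions).
print(sklearn.metrics.accuracy_score(test_labels, predictions))  # 0.75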

The implementations of the crossover() and mutation() functions are very similar to what is discussed in my previous tutorial, "Genetic Algorithm Implementation in Python". One major difference is that the mutation() function changes the randomly selected genes by flipping their values, because we are using a binary representation.
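To make these operators concrete before reading the code, here is a toy run (not from the tutorial) of single-point crossover followed by a flip mutation on two 6-gene binary parents:

import numpy

parent1 = numpy.array([1, 1, 1, 0, 0, 0])
parent2 = numpy.array([0, 1, 0, 1, 1, 1])

# Single-point crossover at the center: first half from parent1, second half from parent2.
point = parent1.shape[0] // 2
child = numpy.concatenate((parent1[:point], parent2[point:]))
print(child)  # [1 1 1 1 1 1]

# Flip mutation on gene 2: 1 - 1 = 0 (and 1 - 0 would give 1).
child[2] = 1 - child[2]
print(child)  # [1 1 0 1 1 1]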

import numpy
import sklearn.svm


def reduce_features(solution, features):
    # Keep only the columns (feature elements) whose genes are set to 1.
    selected_elements_indices = numpy.where(solution == 1)[0]
    reduced_features = features[:, selected_elements_indices]
    return reduced_features


def classification_accuracy(labels, predictions):
    # Fraction of samples whose predicted label matches the true label.
    correct = numpy.where(labels == predictions)[0]
    accuracy = correct.shape[0]/labels.shape[0]
    return accuracy


def cal_pop_fitness(pop, features, labels, train_indices, test_indices):
    accuracies = numpy.zeros(pop.shape[0])
    idx = 0

    for curr_solution in pop:
        reduced_features = reduce_features(curr_solution, features)
        train_data = reduced_features[train_indices, :]
        test_data = reduced_features[test_indices, :]

        train_labels = labels[train_indices]
        test_labels = labels[test_indices]

        # Train an SVM on the selected features and score it on the test split.
        SV_classifier = sklearn.svm.SVC(gamma='scale')
        SV_classifier.fit(X=train_data, y=train_labels)

        predictions = SV_classifier.predict(test_data)
        accuracies[idx] = classification_accuracy(test_labels, predictions)
        idx = idx + 1
    return accuracies


def select_mating_pool(pop, fitness, num_parents):
    # Selecting the best individuals in the current generation as parents for producing the offspring of the next generation.
    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = numpy.where(fitness == numpy.max(fitness))
        max_fitness_idx = max_fitness_idx[0][0]
        parents[parent_num, :] = pop[max_fitness_idx, :]
        # Exclude this solution from being selected again.
        fitness[max_fitness_idx] = -99999999999
    return parents


def crossover(parents, offspring_size):
    offspring = numpy.empty(offspring_size)
    # The point at which crossover takes place between two parents. Usually, it is at the center.
    crossover_point = offspring_size[1] // 2

    for k in range(offspring_size[0]):
        # Index of the first parent to mate.
        parent1_idx = k % parents.shape[0]
        # Index of the second parent to mate.
        parent2_idx = (k+1) % parents.shape[0]
        # The new offspring takes the first half of its genes from the first parent.
        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        # The new offspring takes the second half of its genes from the second parent.
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring


def mutation(offspring_crossover, num_mutations=2):
    # Select the gene indices to mutate; the same indices are flipped in every offspring.
    mutation_idx = numpy.random.randint(low=0, high=offspring_crossover.shape[1], size=num_mutations)
    for idx in range(offspring_crossover.shape[0]):
        # Because the representation is binary, mutation flips the selected genes (1 - gene).
        offspring_crossover[idx, mutation_idx] = 1 - offspring_crossover[idx, mutation_idx]
    return offspring_crossover


 

Bio: Ahmed Gad received his B.Sc. degree with honors in information technology from the Faculty of Computers and Information (FCI), Menoufia University, Egypt, in July 2015. Having been ranked first in his faculty, he was recommended to work as a teaching assistant at one of the Egyptian institutes in 2015, and then, in 2016, as a teaching assistant and researcher in his faculty. His current research interests include deep learning, machine learning, artificial intelligence, digital signal processing, and computer vision.

Original. Reposted with permission.
