Design by Evolution: How to evolve your neural network with AutoML

The gist ( tl;dr): Time to evolve! I’m gonna give a basic example (in PyTorch) of using evolutionary algorithms to tune the hyper-parameters of a DNN.



Implementation

 
The implementation uses PyTorch to create an agent that will explore DNNs for solving a simple classification task. MNIST is used for this experiment since its small and fast to train even on the CPU. We’ll create a population of DNN models and evolve it for N number of steps.

I assume that you already know the basics of what a DNN is and are familiar with PyTorch. If not, a good starting point for PyTorch can be found here on the official page.

The evolution theme is implementing the tournament selection method. The high level algorithm is this:

new_population = []
while size(new_population) < population_size:
  choose k(tournament) individuals from the population at random
  choose the best from pool/tournament with probability p1
  choose the second best individual with probability p2
  choose the third best individual with probability p3
  mutate and append selected to the new_population


Side Note: The crossover problem. Crossover can be complicated when it comes to blending structures. How do you merge the architecture of two parents? How is this affected by training from scratch and finetuning scenarios? One of the recent papers from Miikkulainen et. al, tackle this issue with a very elegant solution called CoDeepNEAT. Based on the idea of Evolino, an architecture is composed of modules, and each module is subject to evolution itself. The architecture is a blueprint that combines components. In such setting it makes sense to mix the components of the parents since each component is a complete micro-network. For simplicity I opted not to include crossover in this implementation, but a technique like NEAT / CoDeepNEAT can definatelly be advantageous. (I’m planning to post a follow up post on these techniques.)

Basic building blocks

The first thing we need to define is the solution space for each model. Each individual represents an architecture. For simplicity we stack n-layers; each layer has three parameters: a) number of hidden units, b) activation type and c) dropout rate. As for the global parameters we choose between different optimizers, learning rates, weight decay and the number of layers.

Now that we have defined what is the space which our models live we need to create 3 basic functions:

  • randomise a network:

Initially, we randomly sample the number of layers and parameters of each layer. Sampled values fall within a predefined margin. When initialising a parameter we also generate a random parameter id. Currently it is not utilised, but one can keep track of all the layers. When a new model is mutated the old layers can be fine-tuned while initialising only the mutated ones. This should provide significant speed-ups and stabilise the solutions.

Note: Depending on the problem we might require different limitations, for example the total number of parameters, or total number of layers or FLOPs per cycle.

  • mutate a network:

Each network element has been assigned a probability of mutation. Each mutation will alter the parameter by resampling the parameter space.

  • make the network:

The above class will instantiate the model “genome”.

So now we have the basic buildings - how to create a random network, mutate its architecture, and train it. The next step is to create the genetic algorithm that will perform the selection and mutation of the best performing individuals. Every model is trained in parallel and doesn’t require any information sharing with the other agents. This enables the optimisation process to scale linearly with the amount of processing nodes available.
Code for the GP optimiser:

Evolutionary algorithms look super simple, right? Well it is! It can be very successful - especially if you define find good mutation / crossover functions for the individuals.

The repository includes some extra utlity classes like the worker class and scheduler class enable the GP optimizer do the model training and evaluation in parallel.