Training with Keras-MXNet on Amazon SageMaker


By Julien Simon, AWS Technical Evangelist

As previously discussed, Apache MXNet is now available as a backend for Keras 2, aka Keras-MXNet.

In this post, you will learn how to train Keras-MXNet jobs on Amazon SageMaker. I’ll show you how to:

  • build custom Docker containers for CPU and GPU training,
  • configure multi-GPU training,
  • pass parameters to a Keras script,
  • save the trained models in Keras and MXNet formats.

As usual, you’ll find my code on GitHub :)

That’s how it feels when your custom container runs without error :)

Configuring Keras for MXNet

All it really takes is setting ‘backend’ to ‘mxnet’ in .keras/keras.json, but setting ‘image_data_format’ to ‘channels_first’ will also make MXNet training faster.

When working with image data, the input shape can either be ‘channels_first’, i.e. (number of channels, height, width), or ‘channels_last’, i.e. (height, width, number of channels). For MNIST, this would either be (1, 28, 28) or (28, 28, 1): one channel (black and white pictures), 28 pixels by 28 pixels. For ImageNet, it would be (3, 224, 224) or (224, 224, 3): three channels (red, green and blue), 224 pixels by 224 pixels.

Here’s the configuration file we’ll use for our container.
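It boils down to the two settings discussed above:

```json
{
  "backend": "mxnet",
  "image_data_format": "channels_first"
}
```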


Building custom containers

SageMaker provides a collection of built-in algorithms as well as environments for TensorFlow and MXNet… but not for Keras. Fortunately, developers have the option to build custom containers for training and prediction.

Naturally, a number of conventions must be followed for SageMaker to successfully invoke a custom container:

  • Name of the training and prediction scripts: by default, they should respectively be set to ‘train’ and ‘serve’, be executable and have no extension. SageMaker will start training by running ‘docker run your_container train’.
  • Location of hyperparameters in the container: /opt/ml/input/config/hyperparameters.json.
  • Location of input data parameters in the container: /opt/ml/input/data.
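Inside the container, the training script can pick up these conventions with a few lines of Python. Here’s a minimal sketch — the directory layout is SageMaker’s, but the helper names are mine:

```python
import json
import os

# SageMaker conventions inside the container:
# - hyperparameters land in /opt/ml/input/config/hyperparameters.json
# - input data channels land under /opt/ml/input/data/<channel_name>

def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Return the hyperparameters dict; SageMaker passes every value as a string."""
    path = os.path.join(config_dir, "hyperparameters.json")
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def channel_path(channel, data_dir="/opt/ml/input/data"):
    """Return the directory where SageMaker mounts a named input data channel."""
    return os.path.join(data_dir, channel)
```

One gotcha: every value in hyperparameters.json arrives as a string, so remember to cast numeric parameters (e.g. int(hp['epochs'])) before using them.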

This will require some changes in our Keras script, the well-known example of learning MNIST with a simple CNN. As you will see in a moment, they are quite minor and you won’t have any trouble adding them to your own code.

Building a CPU-based Docker container

Here’s the Dockerfile.
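A sketch of what it can look like — the exact native package set is an assumption on my part, and the script name (mnist_cnn.py) is just the example we’re using; see the repository for the actual file:

```dockerfile
FROM ubuntu:16.04

# Python 3 plus the native libraries MXNet relies on (assumed package set),
# followed by a cache cleanup to shrink the image.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        python3 python3-pip python3-setuptools \
        build-essential libopenblas-dev libopencv-dev && \
    rm -rf /var/lib/apt/lists/*

# Latest MXNet and Keras-MXNet packages.
RUN pip3 install --no-cache-dir --upgrade pip && \
    pip3 install --no-cache-dir mxnet keras-mxnet

# The Keras script, renamed 'train' and made executable.
COPY mnist_cnn.py /opt/program/train
RUN chmod +x /opt/program/train

# The Keras configuration file.
COPY keras.json /root/.keras/keras.json

# Use the script directory as work directory and add it to the path.
WORKDIR /opt/program
ENV PATH="/opt/program:${PATH}"
```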

We start from an Ubuntu 16.04 image and install:

  • Python 3 as well as native dependencies for MXNet.
  • the latest and greatest packages of MXNet and Keras-MXNet.

You don’t have to install pre-release packages. I just like to live dangerously and add extra spice to my oh-so-quiet everyday life :*)

Once this is done, we clean up various caches to shrink the container size a bit. Then, we copy:

  • the Keras script to /opt/program with the proper name (‘train’) and we make it executable.

For more flexibility, we could write a generic launcher that would fetch the actual training script from an S3 location passed as a hyperparameter. This is left as an exercise for the reader ;)

  • the Keras configuration file to /root/.keras/keras.json.

Finally, we set the directory of our script as the work directory and add it to the path.

It’s not a long file, but as usual with these things, every detail counts.

Building a GPU-based Docker container

Now let’s build its GPU counterpart. It differs in only two ways:

  • we start from the CUDA 9.0 image, which is also based on Ubuntu 16.04. This one has all the CUDA libraries that MXNet needs, unlike the smaller 9.0-base image, so don’t bother trying that one.
  • we install the CUDA 9.0-enabled MXNet package.

Everything else is the same as before.
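The GPU Dockerfile can be sketched the same way — again, the exact CUDA tag and native package set are assumptions; the repository has the actual file:

```dockerfile
FROM nvidia/cuda:9.0-runtime-ubuntu16.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        python3 python3-pip python3-setuptools \
        build-essential libopenblas-dev libopencv-dev && \
    rm -rf /var/lib/apt/lists/*

# The CUDA 9.0-enabled MXNet package, instead of the CPU one.
RUN pip3 install --no-cache-dir --upgrade pip && \
    pip3 install --no-cache-dir mxnet-cu90 keras-mxnet

COPY mnist_cnn.py /opt/program/train
RUN chmod +x /opt/program/train
COPY keras.json /root/.keras/keras.json

WORKDIR /opt/program
ENV PATH="/opt/program:${PATH}"
```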

Creating a Docker repository in Amazon ECR

SageMaker requires that the containers it fetches are hosted in Amazon ECR. Let’s create a repository and log in to it.
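Something like this — the repository name and region here are placeholders, not the ones from the notebook:

```shell
# Create the repository (hypothetical name; pick your own).
aws ecr create-repository --repository-name keras-mxnet

# Log Docker in to the registry, using the CLI's get-login helper.
$(aws ecr get-login --no-include-email --region us-east-1)
```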

Building and pushing our containers to ECR

OK, now it’s time to build both containers and push them to their repos. We’ll do this separately for the CPU and GPU versions. Strictly Docker stuff. Please refer to the notebook for details on variables.
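In outline, it’s the usual build / tag / push sequence, once per image — the image names below are assumptions, and $account and $region stand in for the variables defined in the notebook:

```shell
# CPU image
docker build -t keras-mxnet:cpu -f Dockerfile.cpu .
docker tag keras-mxnet:cpu ${account}.dkr.ecr.${region}.amazonaws.com/keras-mxnet:cpu
docker push ${account}.dkr.ecr.${region}.amazonaws.com/keras-mxnet:cpu

# GPU image
docker build -t keras-mxnet:gpu -f Dockerfile.gpu .
docker tag keras-mxnet:gpu ${account}.dkr.ecr.${region}.amazonaws.com/keras-mxnet:gpu
docker push ${account}.dkr.ecr.${region}.amazonaws.com/keras-mxnet:gpu
```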

Once we’re done, you should see your two containers in ECR.

The Docker part is over. Now let’s configure our training job in SageMaker.