Training with Keras-MXNet on Amazon SageMaker

In this post, you will learn how to train Keras-MXNet jobs on Amazon SageMaker. I’ll show you how to build custom Docker containers for CPU and GPU training, configure multi-GPU training, pass parameters to a Keras script, and save the trained models in Keras and MXNet formats.

Configuring the training job

This is actually quite underwhelming, which is great news: nothing really differs from training with a built-in algorithm!

First we need to upload the MNIST data set from our local machine to S3. We’ve done this many times before, nothing new here.

Then, we configure the training job by:

  • selecting one of the containers we just built and setting the usual parameters for SageMaker estimators,
  • passing hyperparameters to the Keras script,
  • passing input data to the Keras script.
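With the SageMaker SDK, those three steps fit in a few lines. Here's a minimal sketch, assuming the custom container has already been pushed to ECR; the image name, bucket, and S3 prefixes below are placeholders you'd replace with your own:

```python
import sagemaker
from sagemaker.estimator import Estimator

# Placeholder values: use your own ECR image, IAM role, bucket and prefixes.
estimator = Estimator(
    image_name='123456789012.dkr.ecr.us-east-1.amazonaws.com/keras-mxnet-gpu:latest',
    role=sagemaker.get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    output_path='s3://my-bucket/keras-mxnet-gpu/output',
    hyperparameters={'lr': 0.1, 'batch_size': 256, 'epochs': 10, 'gpu_count': 2})

# The channel names ('train', 'validation') become directories
# under /opt/ml/input/data inside the training container.
estimator.fit({'train': 's3://my-bucket/mnist/train',
               'validation': 's3://my-bucket/mnist/validation'})
```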

That’s it for training. The last part we’re missing is adapting our Keras script for SageMaker. Let’s get to it.


Adapting the Keras script for SageMaker

We need to take care of hyperparameters, input data, multi-GPU configuration, loading the data set and saving models.

Passing hyperparameters and input data configuration

As mentioned earlier, SageMaker copies hyperparameters to /opt/ml/input/config/hyperparameters.json. All we have to do is read this file, extract the parameters, and set default values if needed.
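A minimal sketch of that logic (the parameter names and default values here are just examples):

```python
import json

def load_hyperparameters(path='/opt/ml/input/config/hyperparameters.json'):
    # SageMaker passes every hyperparameter value as a string,
    # so cast explicitly and provide sensible defaults.
    with open(path) as f:
        hp = json.load(f)
    return {'lr': float(hp.get('lr', 0.1)),
            'batch_size': int(hp.get('batch_size', 128)),
            'epochs': int(hp.get('epochs', 10)),
            'gpu_count': int(hp.get('gpu_count', 0))}
```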

In a similar fashion, SageMaker copies the input data configuration to /opt/ml/input/config/inputdataconfig.json. We’ll handle things in exactly the same way.

In this example, I don’t need this configuration info, but this is how you’d read it if you did :)
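For completeness, here's a short sketch of reading it; the file contains one entry per channel declared when launching the job:

```python
import json

def load_input_data_config(path='/opt/ml/input/config/inputdataconfig.json'):
    # Returns a dictionary keyed by channel name,
    # e.g. {'train': {...}, 'validation': {...}}.
    with open(path) as f:
        return json.load(f)
```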

Loading the training and validation set

When training in file mode (which is the case here), SageMaker automatically copies the data set to /opt/ml/input/data/<channel_name>: here, we defined the train and validation channels, so we’ll simply read the MNIST files from the corresponding directories.
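Here's one way to do it, assuming the channels hold the original gzipped IDX files (the file names in the usage comment follow the standard MNIST naming):

```python
import gzip
import os
import numpy as np

def load_mnist(channel_dir, images_file, labels_file):
    # MNIST images and labels are stored in the gzipped IDX format.
    with gzip.open(os.path.join(channel_dir, images_file), 'rb') as f:
        # Skip the 16-byte header (magic number, count, rows, columns).
        images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28)
    with gzip.open(os.path.join(channel_dir, labels_file), 'rb') as f:
        # Skip the 8-byte header (magic number, count).
        labels = np.frombuffer(f.read(), np.uint8, offset=8)
    return images, labels

# Channel names match the ones passed to fit(), e.g.:
# x_train, y_train = load_mnist('/opt/ml/input/data/train',
#                               'train-images-idx3-ubyte.gz',
#                               'train-labels-idx1-ubyte.gz')
```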

Configuring multi-GPU training

As explained in a previous post, Keras-MXNet makes it very easy to set up multi-GPU training. Depending on the gpu_count hyperparameter, we just need to wrap our model with a bespoke Keras API before compiling it:
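Concretely, it's a one-liner around the model. A sketch, assuming model has been built earlier in the script, gpu_count comes from our hyperparameters, and the loss/optimizer settings are just examples:

```python
from keras.utils import multi_gpu_model

# Replicate the model across GPUs only when more than one is requested.
if gpu_count > 1:
    model = multi_gpu_model(model, gpus=gpu_count)

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
```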

Ain’t life grand?

Saving models

The very last thing we need to do once training is complete is to save the model in /opt/ml/model: SageMaker will grab all artefacts present in this directory, build a file called model.tar.gz and copy it to the S3 bucket used by the training job.

In fact, we’re going to save the trained model in two different formats: the Keras format (i.e. an HDF5 file) and the native MXNet format (i.e. a JSON file and a .params file). This will allow us to use it with both libraries!
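A sketch of that final step, assuming a model trained for 10 epochs; save_mxnet_model is the export helper provided by Keras-MXNet, and the prefix is chosen so the file names match the archive shown further down:

```python
import keras

# Keras format: a single HDF5 file.
model.save('/opt/ml/model/mnist-cnn-10.hd5')

# Native MXNet format: writes a -symbol.json file and a -0000.params file.
keras.models.save_mxnet_model(model=model,
                              prefix='/opt/ml/model/mnist-cnn-10',
                              epoch=0)
```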

That’s it. As you can see, it’s all about interfacing your script with SageMaker input and output. The bulk of your Keras code doesn’t require any modification.


Running the script

Alright, let’s run the GPU version! We’ll train on 2 GPUs hosted in a p3.8xlarge instance.

Let’s check the S3 bucket.

$ aws s3 ls $BUCKET/keras-mxnet-gpu/output/keras-mxnet-mnist-cnn-2018-05-30-17-39-50-724/output/
2018-05-30 17:43:34    8916913 model.tar.gz
$ aws s3 cp $BUCKET/keras-mxnet-gpu/output/keras-mxnet-mnist-cnn-2018-05-30-17-39-50-724/output/model.tar.gz .
$ tar tvfz model.tar.gz
-rw-r--r-- 0/0   4822688 2018-05-30 17:43 mnist-cnn-10.hd5
-rw-r--r-- 0/0   4800092 2018-05-30 17:43 mnist-cnn-10-0000.params
-rw-r--r-- 0/0      4817 2018-05-30 17:43 mnist-cnn-10-symbol.json

Wunderbar, as they say on the other side of the Rhine ;) We can now use these models anywhere we like.

That’s it for today. Another (hopefully) nice example of using SageMaker to train your custom jobs on fully-managed infrastructure!

Happy to answer questions here or on Twitter. For more content, please feel free to check out my YouTube channel.

Time to burn… some clock cycles :)

Bio: Julien Simon (@julsimon) is a hands-on technology executive. Expert in web architecture & infrastructure. Scalability addict. Agile warrior.

Original. Reposted with permission.