Training with Keras-MXNet on Amazon SageMaker
In this post, you will learn how to train Keras-MXNet jobs on Amazon SageMaker. I’ll show you how to build custom Docker containers for CPU and GPU training, configure multi-GPU training, pass parameters to a Keras script, and save the trained models in Keras and MXNet formats.
Configuring the training job
This is actually quite underwhelming, which is great news: nothing really differs from training with a built-in algorithm!
First we need to upload the MNIST data set from our local machine to S3. We’ve done this many times before, nothing new here.
Then, we configure the training job by:
- selecting one of the containers we just built and setting the usual parameters for SageMaker estimators,
- passing hyper parameters to the Keras script,
- passing input data to the Keras script.
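The three steps above can be sketched as follows. This is a minimal sketch, not the exact code from the original post: it assumes the parameter names of version 2 of the SageMaker Python SDK (`image_uri`, `instance_count`, `instance_type`; older releases used `image_name`, `train_instance_count` and `train_instance_type`). The S3 prefix, the `epochs` and `batch_size` values and the helper names are hypothetical.

```python
def build_job_config(image_uri, role, bucket, prefix="keras-mxnet-mnist", gpu_count=2):
    """Gather the estimator settings and the input channels in plain dictionaries."""
    estimator_args = {
        "image_uri": image_uri,            # one of the custom containers
        "role": role,                      # IAM role assumed by SageMaker
        "instance_count": 1,
        "instance_type": "ml.p3.8xlarge",
        "output_path": f"s3://{bucket}/{prefix}/output",
        # Hyper parameters are forwarded to the Keras script (values are illustrative).
        "hyperparameters": {"epochs": 10, "batch_size": 128, "gpu_count": gpu_count},
    }
    # One S3 location per channel; the prefix layout is an assumption.
    channels = {
        "train": f"s3://{bucket}/{prefix}/train",
        "validation": f"s3://{bucket}/{prefix}/validation",
    }
    return estimator_args, channels


def launch_training(image_uri, role, bucket):
    """Create the estimator and start the training job."""
    # Imported lazily so that the configuration helper works without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator_args, channels = build_job_config(image_uri, role, bucket)
    estimator = Estimator(**estimator_args)
    estimator.fit(channels)
    return estimator
```

Keeping the configuration in plain dictionaries makes it easy to inspect before launching a job that you will be billed for.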
That’s it for training. The last part we’re missing is adapting our Keras script for SageMaker. Let’s get to it.
Adapting the Keras script for SageMaker
We need to take care of hyper parameters, input data, multi-GPU configuration, loading the data set and saving models.
Passing hyper parameters and input data configuration
As mentioned earlier, SageMaker copies hyper parameters to /opt/ml/input/config/hyperparameters.json. All we have to do is read this file, extract parameters and set default values if needed.
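Here’s a minimal sketch of that logic. One thing to keep in mind: SageMaker writes every hyper parameter value as a string, so each one has to be cast back to its type. The helper names and the default values below are hypothetical.

```python
import json
import os

HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMS_PATH):
    """Return the hyper parameters as a dict, or an empty dict if the file is absent."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def get_param(params, name, default, cast):
    """Read one hyper parameter, casting it from its string form, or use the default."""
    return cast(params[name]) if name in params else default
```

In the training script, this would be used along these lines: `params = load_hyperparameters()`, then `gpu_count = get_param(params, "gpu_count", 0, int)`.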
In a similar fashion, SageMaker copies the input data configuration to /opt/ml/input/config/inputdataconfig.json. We’ll handle it in exactly the same way.
In this example, I don’t need this configuration info, but this is how you’d read it if you did :)
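A minimal sketch, assuming the file lives at /opt/ml/input/config/inputdataconfig.json (the location defined by the SageMaker container contract) and holds one entry per channel. The `TrainingInputMode` key is what I’d expect to find there, so it is read defensively with `.get()`. The helper names are hypothetical.

```python
import json
import os

INPUT_CONFIG_PATH = "/opt/ml/input/config/inputdataconfig.json"


def load_input_config(path=INPUT_CONFIG_PATH):
    """Return the per-channel input configuration, or an empty dict if the file is absent."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def input_modes(config):
    """Map each channel name to its input mode (None when the key is missing)."""
    return {name: settings.get("TrainingInputMode") for name, settings in config.items()}
```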
Loading the training and validation set
When training in file mode (which is the case here), SageMaker automatically copies the data set to /opt/ml/input/data/<channel_name>: here, we defined the train and validation channels, so we’ll simply read the MNIST files from the corresponding directories.
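The text above doesn’t say which file format is used, so the sketch below makes an assumption: the channels contain the raw MNIST files in IDX format (optionally gzipped), as distributed under names such as train-images-idx3-ubyte.gz. It relies only on the standard library; the helper names are hypothetical.

```python
import gzip
import math
import os
import struct

DATA_ROOT = "/opt/ml/input/data"


def channel_dir(channel, root=DATA_ROOT):
    """Return the local directory where SageMaker copied a given channel."""
    return os.path.join(root, channel)


def read_idx(path):
    """Parse an unsigned-byte IDX file and return (shape, raw_bytes)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        # Header: two zero bytes, one type code (0x08 = unsigned byte), one dimension count.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError(f"{path} is not an unsigned-byte IDX file")
        # Each dimension is a 4-byte big-endian unsigned integer.
        shape = struct.unpack(f">{ndim}I", f.read(4 * ndim))
        payload = f.read()
    if len(payload) != math.prod(shape):
        raise ValueError(f"{path}: payload size does not match shape {shape}")
    return shape, payload
```

In the training script, the payload can then be turned into an array with `np.frombuffer(payload, dtype=np.uint8).reshape(shape)` before normalizing it for Keras.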
Configuring multi-GPU training
As explained in a previous post, Keras-MXNet makes it very easy to set up multi-GPU training. Depending on the gpu_count hyper parameter, we just need to wrap our model with the multi_gpu_model() API before compiling it.
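A minimal sketch of that wrapping step, using the `multi_gpu_model()` function from `keras.utils`. The `maybe_parallelize` helper is hypothetical; Keras is imported lazily so that CPU-only runs don’t touch the multi-GPU code path.

```python
def maybe_parallelize(model, gpu_count):
    """Wrap the model for data-parallel training when more than one GPU is requested."""
    if gpu_count > 1:
        # Imported here so that single-device runs never need this code path.
        from keras.utils import multi_gpu_model

        return multi_gpu_model(model, gpus=gpu_count)
    # With zero or one GPU, the model is used as is.
    return model
```

The returned model is then compiled and trained as usual, for example `model = maybe_parallelize(model, gpu_count)` followed by `model.compile(...)`.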
Ain’t life grand?
Saving models
The very last thing we need to do once training is complete is to save the model in /opt/ml/model: SageMaker will grab all artefacts present in this directory, build a file called model.tar.gz and copy it to the S3 bucket used by the training job.
In fact, we’re going to save the trained model in two different formats: the Keras format (i.e. an HDF5 file) and the native MXNet format (i.e. a JSON file and a .params file). This will allow us to use it with both libraries!
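Saving in both formats could look like the sketch below. It assumes the `save_mxnet_model()` function that Keras-MXNet adds to `keras.models`, which writes `<prefix>-symbol.json` and `<prefix>-0000.params`. The `mnist-cnn` prefix and the helper names are hypothetical.

```python
import os

MODEL_DIR = "/opt/ml/model"


def model_paths(prefix, model_dir=MODEL_DIR):
    """Return the HDF5 file path and the prefix used for the MXNet files."""
    return {
        "keras": os.path.join(model_dir, prefix + ".h5"),
        "mxnet_prefix": os.path.join(model_dir, prefix),
    }


def save_models(model, prefix="mnist-cnn", model_dir=MODEL_DIR, export_mxnet=True):
    """Save a trained Keras model in Keras (HDF5) and, optionally, MXNet format."""
    os.makedirs(model_dir, exist_ok=True)
    paths = model_paths(prefix, model_dir)
    model.save(paths["keras"])  # Keras format: a single HDF5 file
    if export_mxnet:
        # Only available with the Keras-MXNet fork, hence the lazy import.
        from keras.models import save_mxnet_model

        save_mxnet_model(model=model, prefix=paths["mxnet_prefix"])
    return paths
```

Everything written under /opt/ml/model ends up in model.tar.gz, so both formats travel together.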
That’s it. As you can see, it’s all about interfacing your script with SageMaker input and output. The bulk of your Keras code doesn’t require any modification.
Running the script
Alright, let’s run the GPU version! We’ll train on 2 of the 4 GPUs available in a p3.8xlarge instance.
Let’s check the S3 bucket.
Wunderbar, as they say on the other side of the Rhine ;) We can now use these models anywhere we like.
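To use the models elsewhere, you’d download model.tar.gz and unpack it. Here’s a minimal sketch using the standard library only: it assumes the archive has already been copied locally (for example with the AWS CLI), and the helper names are hypothetical. SageMaker stores the artefact under `<output path>/<job name>/output/model.tar.gz`.

```python
import os
import tarfile


def artifact_key(output_prefix, job_name):
    """Return the S3 key of the model archive produced by a training job."""
    return f"{output_prefix}/{job_name}/output/model.tar.gz"


def extract_model(archive_path, target_dir):
    """Unpack a local copy of model.tar.gz and return the extracted file names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    return sorted(os.listdir(target_dir))
```

After extraction, the HDF5 file can be loaded with Keras, and the JSON and .params files with MXNet.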
That’s it for today. Another (hopefully) nice example of using SageMaker to train your custom jobs on fully-managed infrastructure!
Time to burn… some clock cycles :)
Original. Reposted with permission.