Data preprocessing for deep learning with nuts-ml

Nuts-ml is a new data pre-processing library in Python for GPU-based deep learning in vision. It provides common pre-processing functions as independent, reusable units. These so-called ‘nuts’ can be freely arranged to build data flows that are efficient and easy to read and modify.

By Stefan Maetschke, PhD.

Data preprocessing is a fundamental part of any machine learning project, and often more time is spent on data preparation than on the actual machine learning. While some preprocessing tasks are problem-specific, many others, such as partitioning data into training and test folds, stratifying samples or building mini-batches, are generic. The following Canonical Pipeline shows the processing steps common to deep learning in vision.

A Reader reads sample data stored in text files, Excel or Pandas tables. The Splitter then partitions the data into training, validation and test folds and performs stratification if needed. Usually not all image data can be loaded into memory, so a Loader loads images on demand. These images are often processed by a Transformer for resizing, cropping or other adjustments. Furthermore, to enlarge the training set, additional images are synthesized by randomly augmenting (flipping, rotating, …) images using an Augmenter. Efficient, GPU-based machine learning demands that image and label data are grouped into mini-batches by a Batcher before being passed on to the Network for training or inference. Finally, to keep track of the training progress, a Logger is usually employed to write training losses or accuracies to a log file.
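The stages described above can be sketched with plain Python generators. This is only an illustration of the idea, not nuts-ml code: the function names split, transform, augment and batch, and the tiny fake "images" (rows of four pixel values), are assumptions made for this example.

```python
import random

def split(samples, ratio=0.8, seed=0):
    """Splitter: partition samples into training and test folds."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def transform(samples, scale=1.0 / 255):
    """Transformer: rescale pixel values lazily, one sample at a time."""
    for image, label in samples:
        yield [p * scale for p in image], label

def augment(samples):
    """Augmenter: emit each sample plus a horizontally flipped copy."""
    for image, label in samples:
        yield image, label
        yield image[::-1], label

def batch(samples, size):
    """Batcher: group samples into mini-batches of at most `size` samples."""
    current = []
    for sample in samples:
        current.append(sample)
        if len(current) == size:
            yield current
            current = []
    if current:  # emit the last, possibly smaller batch
        yield current

# Ten fake "images" (one row of 4 pixels each) with binary labels.
data = [([i, i + 1, i + 2, i + 3], i % 2) for i in range(10)]
train, test = split(data)

# Nothing is computed until the batches are consumed by list().
batches = list(batch(augment(transform(train)), size=4))
print(len(train), len(test), len(batches))
```

Because every stage is a generator, samples are processed on demand, which mirrors the reason a Loader is needed when the images do not fit into memory.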

Some machine learning frameworks such as Keras provide (some of) these preprocessing components hidden behind an API that considerably simplifies network training, provided it fits the task at hand. See the following excerpt of a Keras example that trains a model with augmentation.

datagen = ImageDataGenerator(...)  # augment images; arguments elided in this excerpt

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
    validation_data=(x_test, y_test))

However, what if an image format, augmentation or other preprocessing capability is needed that is not provided by the API? Extending a library such as Keras is not trivial, and the common approach is to (re)implement the required functionality, often in a quick-and-dirty fashion. But implementing a robust data pipeline that loads, transforms, augments and processes images on demand is challenging and time-consuming.

nuts-ml is a Python library that provides common preprocessing functions as so-called nuts, which can be freely arranged and easily extended to construct efficient data processing pipelines. The following excerpt from a nuts-ml example shows a pipeline for network training, where the >> operator defines the flow of data.

t_loss = (train_samples >> augment >> rerange >> Shuffle(100) >>
          build_batch >> network.train() >> Mean())
print("training loss  :", t_loss)

In the example above, training images are augmented, pixel values are re-ranged, and the samples are shuffled before batches are built for network training. Finally, the mean over the batch-wise training losses is computed and printed. The nuts this data flow is composed of can be defined as follows:

rerange = TransformImage(0).by('rerange', 0, 255, 0, 1, 'float32')

augment = (AugmentImage(0)
           .by('identical', 1.0)
           .by('brightness', 0.1, [0.7, 1.3])
           .by('fliplr', 0.1))

build_batch = (BuildBatch(BATCH_SIZE)
               .by(0, 'image', 'float32')
               .by(1, 'one_hot', 'uint8', NUM_CLASSES))

network = KerasNetwork(model)

where rerange is an image transformation that converts pixel values from the range [0, 255] to the range [0, 1]; augment generates additional training images by randomly flipping images horizontally and changing their brightness; build_batch constructs batches composed of images and one-hot encoded class labels; and network wraps an existing Keras model in a nut that can be plugged into the pipeline. The complete code for this example can be found here.
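To make the re-ranging and one-hot encoding steps concrete, here is a minimal sketch in plain Python. It does not use nuts-ml; the helper names rerange_pixels, one_hot and make_batch are hypothetical and chosen only for this illustration, and images are represented as flat lists of pixel values rather than NumPy arrays.

```python
def rerange_pixels(pixels, old_min, old_max, new_min, new_max):
    """Linearly map values from [old_min, old_max] to [new_min, new_max]."""
    return [
        (p - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for p in pixels
    ]

def one_hot(label, num_classes):
    """Encode an integer class label as a vector with a single 1."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector

def make_batch(samples, num_classes):
    """Turn (image, label) samples into parallel lists of inputs and targets."""
    images = [rerange_pixels(image, 0, 255, 0, 1) for image, _ in samples]
    labels = [one_hot(label, num_classes) for _, label in samples]
    return images, labels

# Two fake samples: three pixel values each, with class labels 2 and 0.
samples = [([0, 51, 255], 2), ([255, 0, 0], 0)]
images, labels = make_batch(samples, num_classes=4)
print(images)
print(labels)
```

In a real pipeline the two lists would be stacked into arrays of the element type requested for the batch, such as 'float32' for images and 'uint8' for labels in the nuts-ml excerpt.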

nuts-ml helps to build data preprocessing pipelines for deep learning more quickly. The resulting code is more readable and can readily be modified to experiment with different preprocessing schemes. Task-specific functions can easily be implemented as nuts and added to the data flow. For instance, here is a simple nut that adjusts the brightness of an image:

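from nutsflow import nut_function

@nut_function  # turns the plain function into a nut usable with >>
def AdjustBrightness(image, c):
  return image * c

... images >> AdjustBrightness(1.1) >> ...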
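How a plain function can end up on the right-hand side of >> may be easier to see in a small, self-contained sketch. The code below is not the nuts-ml or nuts-flow implementation; Nut, element_wise and Collect are hypothetical names for this illustration, which relies only on Python's reflected right-shift method __rrshift__.

```python
class Nut:
    """Hypothetical pipeline element: `iterable >> nut` calls nut(iterable)."""

    def __rrshift__(self, iterable):
        # Python falls back to this method because lists and generators
        # do not define >> themselves.
        return self(iterable)

def element_wise(func):
    """Hypothetical decorator: turn func(x, *args) into a nut factory."""

    class Wrapper(Nut):
        def __init__(self, *args, **kwargs):
            self.args, self.kwargs = args, kwargs

        def __call__(self, iterable):
            # Lazily apply func to every element flowing through.
            return (func(x, *self.args, **self.kwargs) for x in iterable)

    Wrapper.__name__ = func.__name__
    return Wrapper

@element_wise
def AdjustBrightness(pixel, c):
    return pixel * c

class Collect(Nut):
    """Sink that gathers all elements of the flow into a list."""

    def __call__(self, iterable):
        return list(iterable)

pixels = [10, 20, 30]
brighter = pixels >> AdjustBrightness(2) >> Collect()
print(brighter)  # [20, 40, 60]
```

The design keeps each processing step an independent object, so steps can be reordered or swapped by editing a single line of the pipeline expression.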

nuts-ml does not perform network training itself but uses existing libraries such as Keras or Theano for this purpose. Any machine learning library that accepts mini-batches of NumPy arrays for training or inference is compatible. For more information about nuts-ml, see the Introduction and have a look at the Tutorial.

Bio: Stefan Maetschke (PhD) is a research scientist at IBM Research Australia where he develops machine learning infrastructure and models for medical image analysis.