Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet

Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.



Spark + Caffe

Using SparkNet

To use SparkNet, you first need a running Spark cluster; SparkNet jobs are then submitted to the cluster via spark-submit.

Building the software from its repo requires the Scala Build Tool (SBT) and CUDA Version 7, along with the aforementioned Apache Spark cluster. You get Caffe for free during the SparkNet build. The reasonably straightforward build instructions are located in the GitHub repo's README.
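
As a minimal sketch of the build steps, assuming the repo's sbt-assembly setup (the assembly jar name used later suggests as much); consult the README for the authoritative sequence:

$ git clone https://github.com/amplab/SparkNet.git
$ cd SparkNet
$ sbt assembly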

Models are defined in Caffe NetParameter objects, while solvers are defined in Caffe SolverParameter objects. Via SparkNet's Scala interface, a NetParameter object describing a particular deep neural network specification can be defined as in the following code (from the SparkNet repo):

val netParam = NetParam("LeNet",
  RDDLayer("data", shape=List(batchsize, 1, 28, 28), None),
  RDDLayer("label", shape=List(batchsize, 1), None),
  ConvolutionLayer("conv1", List("data"), kernel=(5,5), 
      numOutput=20),
  PoolingLayer("pool1", List("conv1"), pooling=Pooling.Max, 
      kernel=(2,2), stride=(2,2)),
  ConvolutionLayer("conv2", List("pool1"), kernel=(5,5), 
      numOutput=50),
  PoolingLayer("pool2", List("conv2"), pooling=Pooling.Max, 
      kernel=(2,2), stride=(2,2)),
  InnerProductLayer("ip1", List("pool2"), numOutput=500),
  ReLULayer("relu1", List("ip1")),
  InnerProductLayer("ip2", List("relu1"), numOutput=10),
  SoftmaxWithLoss("loss", List("ip2", "label"))
)

More info on each of these layers and their definitions can be found in the Caffe Layer Catalogue. For folks with Caffe experience, or even just a working understanding of neural networks, the above should not prove difficult to follow.
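
Solver definitions follow the same pattern. As a rough sketch only (this assumes caffe.proto compiled for the JVM with protoc's builder-style accessors; SparkNet may expose its own wrapper instead), a SolverParameter carries the optimization settings that accompany the network definition above:

// Sketch only: assumes caffe.proto compiled for the JVM with protoc,
// giving protobuf builder-style accessors.
import caffe.Caffe.SolverParameter

val solverParam = SolverParameter.newBuilder()
  .setBaseLr(0.01f)    // initial learning rate
  .setMomentum(0.9f)   // SGD momentum
  .setLrPolicy("inv")  // learning rate decay policy
  .setMaxIter(10000)   // number of solver iterations
  .build()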

The following Scala code sketches a distributed training run (adapted from the SparkNet paper):

var trainData = loadData(. . .)
trainData = preprocess(trainData).cache()
// build one Caffe net per partition; mapPartitions (rather than
// foreachPartition) so the resulting RDD of nets can be reused below
var nets = trainData.mapPartitions(data => {
  val net = Net(netParams)
  net.setTrainingData(data)
  Iterator(net)
}).cache()
var weights = initialWeights(. . .)
for (i <- 1 to 1000) {
  val broadcastWeights = broadcast(weights)
  // push the current global weights to every worker
  nets.foreach(net => net.setWeights(broadcastWeights.value))
  weights = nets.map(net => {
    net.train(50)     // 50 iterations of SGD on each worker's shard
    net.getWeights()  // a WeightCollection
  }).mean()           // an average of WeightCollection objects
}
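
The mean() at the end of each loop iteration is the heart of SparkNet's parameter-averaging scheme: each worker runs 50 SGD iterations on its own partition, and the master averages the resulting parameters element-wise before re-broadcasting them. As a purely illustrative sketch (the names here are hypothetical, not SparkNet's actual WeightCollection API), the averaging step amounts to:

// Hypothetical illustration of element-wise parameter averaging; the
// real WeightCollection type and mean() live inside SparkNet.
def meanWeights(workers: Seq[Map[String, Array[Float]]]): Map[String, Array[Float]] = {
  val n = workers.length.toFloat
  workers.head.keys.map { layer =>
    // sum the corresponding arrays across workers, then divide by n
    val summed = workers.map(_(layer))
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    layer -> summed.map(_ / n)
  }.toMap
}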

Like any good deep learning project, SparkNet includes the de facto machine learning equivalent of Hello World: the CIFAR app. Getting it up and running is painless. First, get the CIFAR-10 data:

$ $SPARKNET_HOME/caffe/data/cifar10/get_cifar10.sh

Then submit the job with spark-submit:

$ $SPARK_HOME/bin/spark-submit --class apps.CifarApp \
   SparkNetPreview/target/scala-2.10/sparknetpreview-assembly-0.1-SNAPSHOT.jar 5

Discussion

SparkNet is one of a small but growing number of options for clustered deep learning, and you are likely in good hands if you decide to train your neural networks with it, given that it is developed by the lab responsible for its main constituent components (Spark and Caffe).

If you decide to try it out, remember that SparkNet is not competing with Spark (or any other distributed batch processing system) or with Caffe (or any other deep learning framework); instead, its goal is to provide an alternate paradigm for fitting neural network training into larger data processing pipelines. If this is an itch you are looking to scratch, SparkNet may be worth checking out.

It should be pointed out that SparkNet is not the only existing tool for deep network training on Spark. HeteroSpark, demonstrated at Spark Summit 2015 this past March, is "a heterogeneous CPU/GPU Spark platform for deep learning algorithms"; slides and video of the demonstration are available online.

Likely better known, gaining traction, and also playing in the Spark-plus-deep-learning space is H2O's Sparkling Water. The self-described "killer app for Spark," Sparkling Water allows for scalable machine learning, including deep learning, inside a Spark cluster. A write-up on using Sparkling Water to fight crime can be found on KDnuggets, and the source is on H2O's GitHub.

SparkNet's paper outlines additional related and previous work. While there are a number of projects to consider, it bears repeating that SparkNet is developed by the same lab that successfully developed, and maintains, its two core constituent components, Spark and Caffe. While I have no stake in AMPLab, and no evidence one way or the other, it seems logical that the group that crafted these specific deep learning and distributed processing tools has as good a chance as any other party of successfully integrating them. On the strength of this point alone, I plan to test SparkNet further.
