Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet
Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.
Using SparkNet
To run SparkNet, you first need a Spark cluster; SparkNet jobs are then submitted to that cluster via spark-submit.
Building the software from its repo requires the Scala Build Tool (SBT) and CUDA version 7, along with the aforementioned Apache Spark cluster. Caffe comes for free as part of the SparkNet build. The reasonably straightforward build instructions are located in the GitHub repo (here and here).
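As a rough sketch, and assuming the repo's standard sbt-assembly setup (the README is the authoritative source for the exact steps and prerequisites), the build boils down to something like this:
git clone https://github.com/amplab/SparkNet.git   # SparkNet repo on GitHub
export SPARKNET_HOME=$(pwd)/SparkNet               # path referenced by the CIFAR script below
cd $SPARKNET_HOME
sbt assembly                                       # assumed sbt-assembly target; see the repo's README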
Models are defined as Caffe NetParameter objects, while solvers are defined as Caffe SolverParameter objects. Via SparkNet's Scala interface, a NetParameter object describing a particular deep neural network architecture can be defined as in the following code (from the SparkNet repo):
val netParam = NetParam("LeNet",
  RDDLayer("data", shape=List(batchsize, 1, 28, 28), None),
  RDDLayer("label", shape=List(batchsize, 1), None),
  ConvolutionLayer("conv1", List("data"), kernel=(5,5), numOutput=20),
  PoolingLayer("pool1", List("conv1"), pooling=Pooling.Max, kernel=(2,2), stride=(2,2)),
  ConvolutionLayer("conv2", List("pool1"), kernel=(5,5), numOutput=50),
  PoolingLayer("pool2", List("conv2"), pooling=Pooling.Max, kernel=(2,2), stride=(2,2)),
  InnerProductLayer("ip1", List("pool2"), numOutput=500),
  ReLULayer("relu1", List("ip1")),
  InnerProductLayer("ip2", List("relu1"), numOutput=10),
  SoftmaxWithLoss("loss", List("ip2", "label"))
)
More info on each of these layers and their definitions can be found in the Caffe Layer Catalogue. For anyone with Caffe experience, or even just a general understanding of neural networks, the above should not prove difficult to follow.
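One small point worth flagging: batchsize in the snippet above appears to be an ordinary Scala value that the calling application defines before building the NetParam, along the lines of the following (the number is purely illustrative):
val batchsize = 64  // illustrative minibatch size defined by the application, not by SparkNet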
The following Scala code, adapted from the SparkNet paper's pseudocode, sketches a distributed training run:
var trainData = loadData(...)
trainData = preprocess(trainData).cache()
// loadData, preprocess, Net, initialWeights, broadcast and mean() are
// pseudocode helpers from the paper, not literal SparkNet API calls
val nets = trainData.mapPartitions(data => {
  // one Caffe net per partition, each holding that partition's training data
  val net = Net(netParams)
  net.setTrainingData(data)
  Iterator(net)
}).cache()
var weights = initialWeights(...)
for (i <- 1 to 1000) {
  // ship the current weights to every worker
  val broadcastWeights = broadcast(weights)
  nets.foreach(net => net.setWeights(broadcastWeights.value))
  // each worker runs 50 SGD iterations on its own partition, then the
  // driver averages the resulting WeightCollection objects
  weights = nets.map(net => {
    net.train(50)
    net.getWeights()
  }).mean()
}
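Note that the final mean() call is shorthand from the paper: stock Spark only provides mean() for RDDs of numeric values, so in practice the per-worker weights are pulled back to the driver and averaged element-wise there. A minimal sketch of what that averaging step could look like, assuming weights can be viewed as a Map[String, Array[Float]] keyed by layer name (the helper name and types here are illustrative, not SparkNet's actual WeightCollection API):
// Hypothetical helper: element-wise average of per-worker weight maps.
// WeightCollection internals are assumed; SparkNet's real API may differ.
def averageWeights(all: Seq[Map[String, Array[Float]]]): Map[String, Array[Float]] = {
  val n = all.size.toFloat
  all.head.keys.map { layer =>
    val summed = new Array[Float](all.head(layer).length)
    for (w <- all; i <- summed.indices) summed(i) += w(layer)(i)
    layer -> summed.map(_ / n)
  }.toMap
}
// usage: pull each worker's weights back to the driver and average them
// val weights = averageWeights(nets.map(_.getWeights()).collect().toSeq)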
Like any good deep learning project, SparkNet includes its own take on the machine learning equivalent of Hello World in the form of the CIFAR app, and getting it up and running is painless. First, fetch the CIFAR-10 data:
$SPARKNET_HOME/caffe/data/cifar10/get_cifar10.sh
Then submit the job with spark-submit:
$SPARK_HOME/bin/spark-submit --class apps.CifarApp \
  SparkNetPreview/target/scala-2.10/sparknetpreview-assembly-0.1-SNAPSHOT.jar 5
Discussion
SparkNet is currently one of a small but growing number of options for clustered deep learning, and if you decide to train your neural networks with it, you are likely in good hands, given that it is developed by the lab responsible for its main constituent components (Spark and Caffe).
Should you try it out, remember that SparkNet is not competing with Spark (or any other distributed batch processing system) or with Caffe (or any other deep learning framework); its goal is instead to provide a paradigm for fitting neural network training into larger data processing pipelines. If that is an itch you are looking to scratch, SparkNet may be worth checking out.
It should be pointed out that SparkNet is not the only existing tool for deep network training on Spark. HeteroSpark, demonstrated at Spark Summit 2015 this past March, is "a heterogeneous CPU/GPU Spark platform for deep learning algorithms." The slides from the demonstration are available here, while a video can be viewed here.
SparkNet's paper outlines additional related and prior work. While there are a number of alternatives to consider, it bears repeating that SparkNet is developed by the same lab that has successfully developed and maintained its two core constituent components, Spark and Caffe. While I have no personal stake in AMPLab, nor hard evidence either way, it seems logical that the group that crafted these particular deep learning and distributed processing tools has at least as good a chance of successfully integrating them as any other party. On the strength of this point alone, I plan to test SparkNet further.