Yahoo! CaffeOnSpark: Distributed Deep Learning on Big Data Clusters

Get an overview of Yahoo!'s CaffeOnSpark, the latest entrant into the world of distributed deep learning, directly from the developers.

By Andy Feng(@afeng76), Jun Shi and Mridul Jain (@mridul_jain), Yahoo Big ML Team.


Deep learning (DL) is a critical capability required by Yahoo product teams (ex. Flickr, Image Search) to gain intelligence from massive amounts of online data. Many existing DL frameworks require a separated cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline (see Figure 1). The separated clusters require large datasets to be transferred among them, and introduce unwanted system complexity and latency for end-to-end learning.

ML Pipeline with multiple programs on separated clusters

Figure 1: ML Pipeline with multiple programs on separated clusters

As discussed in our earlier Tumblr post, we believe that deep learning should be conducted in the same cluster along with existing data processing pipelines to support feature engineering and traditional (non-deep) machine learning. We created CaffeOnSpark to allow deep learning training and testing to be embedded into Spark applications (see Figure 2).

ML Pipeline with single program on one cluster

Figure 2: ML Pipeline with single program on one cluster

CaffeOnSpark: API & Configuration and CLI

CaffeOnSpark is designed to be a Spark deep learning package. Spark MLlib supported a variety of non-deep learning algorithms for classification, regression, clustering, recommendation, and so on. Deep learning is a key capacity that Spark MLlib lacks currently, and CaffeOnSpark is designed to fill that gap. CaffeOnSpark API supports dataframes so that you can easily interface with a training dataset that was prepared using a Spark application, and extract the predictions from the model or features from intermediate layers for results and data analysis using MLLib or SQL.

CaffeOnSpark as a Spark Deep Learning package

Figure 3: CaffeOnSpark as a Spark Deep Learning package

def main(args: Array[String]): Unit = {

   val ctx = new SparkContext(new SparkConf())
   val cos = new CaffeOnSpark(ctx)
   val conf = new Config(ctx, args).init()
   val dl_train_source = DataSource.getSource(conf, true)


   val lr_raw_source = DataSource.getSource(conf, false)
   val extracted_df = cos.features(lr_raw_source)
   val lr_input_df = extracted_df.withColumn(“Label”, cos.floatarray2doubleUDF(extracted_df(conf.label)))
      .withColumn(“Feature”, cos.floatarray2doublevectorUDF(extracted_df(conf.features(0))))
   val lr = new LogisticRegression().setLabelCol(“Label”).setFeaturesCol(“Feature”)
   val lr_model =



Figure 4: Scala application using CaffeOnSpark both MLlib

Scala program in Figure 4 illustrates how CaffeOnSpark and MLlib work together:

  • L1-L4 … You initialize a Spark context, and use it to create CaffeOnSpark and configuration object.
  • L5-L6 … You use CaffeOnSpark to conduct DNN training with a training dataset on HDFS.
  • L7-L8 …. The learned DL model is applied to extract features from a feature dataset on HDFS.
  • L9-L12 … MLlib uses the extracted features to perform non-deep learning (more specifically logistic regression for classification).
  • L13 … You could save the classification model onto HDFS.

As illustrated in Figure 4, CaffeOnSpark enables deep learning steps to be seamlessly embedded in Spark applications. It eliminates unwanted data movement in traditional solutions (as illustrated in Figure 1), and enables deep learning to be conducted on big-data clusters directly. Direct access to big-data and massive computation power are critical for DL to find meaningful insights in a timely manner.

CaffeOnSpark uses the configuration files for solvers and neural network as in standard Caffe uses. As illustrated in our example, the neural network will have a MemoryData layer with 2 extra parameters:

  1. source_class specifying a data source class
  2. source specifying dataset location

The initial CaffeOnSpark release has several built-in data source classes (including for LMDB databases and for Hadoop sequence files). Users could easily introduce customized data source classes to interact with the existing data formats.

CaffeOnSpark applications will be launched by standard Spark commands, such as spark-submit. Here are 2 examples of spark-submit commands. The first command uses CaffeOnSpark to train a DNN model saved onto HDFS. The second command is a custom Spark application that embedded CaffeOnSpark along with MLlib.

First command:

spark-submit \
   –files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \
   –num-executors 2  \
   –class  \
      caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
      -train -persistent \
      -conf caffenet_train_solver.prototxt \
      -model hdfs:///sample_images.model \
      -devices 2

Second command:

spark-submit \
   –files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \
   –num-executors 2  \
   –class \ 

       caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \

       -features fc8 \
       -label label \
       -conf caffenet_train_solver.prototxt \
       -model hdfs:///sample_images.model  \
       -output hdfs:///image_classifier_model \
       -devices 2

System Architecture

System Architecture

Figure 5: System Architecture

Figure 5 describes the system architecture of CaffeOnSpark. We launch Caffe engines on GPU devices or CPU devices within the Spark executor, via invoking a JNI layer with fine-grain memory management. Unlike traditional Spark applications, CaffeOnSpark executors communicate to each other via MPI allreduce style interface via TCP/Ethernet or RDMA/Infiniband. This Spark+MPI architecture enables CaffeOnSpark to achieve similar performance as dedicated deep learning clusters.

Many deep learning jobs are long running, and it is important to handle potential system failures. CaffeOnSpark enables training state being snapshotted periodically, and thus we could resume from previous state after a failure of a CaffeOnSpark job.

Open Source

In the last several quarters, Yahoo has applied CaffeOnSpark on several projects, and we have received much positive feedback from our internal users. Flickr teams, for example, made significant improvements on image recognition accuracy with CaffeOnSpark by training with millions of photos from the Yahoo Webscope Flickr Creative Commons 100M dataset on Hadoop clusters.

CaffeOnSpark is beneficial to deep learning community and the Spark community. In order to advance the fields of deep learning and artificial intelligence, Yahoo is happy to release CaffeOnSpark at under Apache 2.0 license.

CaffeOnSpark can be tested on an AWS EC2 cloud or on your own Spark clusters. Please find the detailed instructions at Yahoo github repository, and share your feedback at Our goal is to make CaffeOnSpark widely available to deep learning scientists and researchers, and we welcome contributions from the community to make that happen.

Original. Reposted with permission.